A two-stage octave convolution screen content image compression method based on multi-scale residual and window attention
By using a two-stage octave convolutional network with multi-scale residuals and window attention, the problem of insufficient performance in screen content image encoding is solved, achieving high-quality image compression with lower bitrate.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHANDONG UNIV
- Filing Date
- 2023-11-22
- Publication Date
- 2026-06-23
AI Technical Summary
Most existing image compression algorithms are designed for natural scene images and fail to fully consider the unique characteristics of screen content images, resulting in insufficient encoding performance.
A two-stage octave convolutional network based on multi-scale residuals and window attention is used to perform frequency decomposition, extract high-frequency and low-frequency information of screen content images, and combine cascaded multi-scale residual blocks for cross-scale learning to capture high-contrast information.
At the same decoded image quality, a lower encoding bitrate indicates better encoding performance, enabling the achievement of better image quality at a lower bitrate.
Figure CN117544783B_ABST
Description
Technical Field
[0001] This invention relates to a screen content image compression method based on two-stage octave convolution with multi-scale residuals and window attention, belonging to the field of image processing technology.
Background Technology
[0002] Currently, the amount of image and video content on the internet is increasing at an astonishing rate year by year, and the rapid development of cloud computing and remote technologies has led to a year-on-year increase in the proportion of screen content images. Especially in recent years, online meetings, remote control and collaboration, live streaming, and cloud gaming have gradually become important means of learning and entertainment in people's daily lives. How to encode and transmit massive amounts of screen content (SC) has become an urgent problem to be solved. Traditional image compression algorithms have undergone decades of development, resulting in many classic coding standards, such as H.264 / AVC, H.265 / HEVC, and H.266 / VVC. In recent years, learning-based image coding algorithms have shown excellent potential, surpassing the latest coding standard VVC in rate-distortion performance. However, most current learning-based image compression algorithms focus on encoding natural images, without considering the characteristics of screen content images. Traditional image coding schemes pay more attention to encoding natural scene content while ignoring screen content, which has significantly different characteristics from the former, including noise-free, high-contrast, and sharp edges.
[0003] Unlike natural scene (NS) images captured by traditional camera equipment, screen content is computer-generated and includes information such as text, tables, graphics, and animations. Screen content images therefore have different signal characteristics from natural scene images. Screen content often features extremely high or low frequency content, such as large smooth areas and sharp text or edges. Furthermore, screen content is noise-free and contains more repeating patterns and pixels. As a result, not all image coding techniques designed for camera-captured content are fully applicable to screen content. The latest Versatile Video Coding standard (VVC) and the third-generation video coding standard AVS3 have explored screen content coding (SCC) and developed many coding tools, including Intra Block Copy (IBC), Palette Mode (PLT), Transform Skip Mode (TSM), Intra String Copy, and Deblocking Modifications. These coding tools can effectively improve the coding performance of screen content and further expand the application scope of coding standards. Compared to traditional coding schemes, end-to-end image compression schemes can jointly optimize the modules of the network to improve coding performance, demonstrating superior performance for screen content coding. Currently, end-to-end schemes for screen content coding have not been fully researched.
Summary of the Invention
[0004] To address the shortcomings of existing technologies, this invention provides a screen content image compression method based on two-stage octave convolution with multi-scale residuals and window attention.
[0005] Considering the unique characteristics of screen content images compared to natural scene images, this invention provides an end-to-end compression method for screen content images. It utilizes a two-stage octave convolutional network for frequency decomposition to extract high-frequency and low-frequency information from features. Simultaneously, it employs cascaded multi-scale residual blocks for cross-scale learning and combines a window-based attention module to capture high-contrast information. Experimental results demonstrate the effectiveness of the proposed scheme.
[0006] Terminology Explanation:
[0007] 1. LIC (Learned image compression): An image encoding algorithm based on deep learning.
[0008] 2. NS (Natural Scene): Natural scene content, generated by camera capture.
[0009] 3. SC (Screen Content): Screen content generated by the computer.
[0010] 4. VAE (Variational Autoencoders): A variational autoencoder is a generative model used to learn the latent representation of input data and generate new data samples.
[0011] 5. GDN (Generalized divisive normalization): Generalized divisive normalization operation.
[0012] 6. GoCB (Generalized Octave convolution block): Generalized octave convolution block.
[0013] 7. ToRB (Two-stage Octave Residual block): A two-stage octave convolution residual block.
[0014] 8. CMSRB (Cascaded Multi-scale Residual Blocks): Cascaded multi-scale residual blocks.
[0015] 9. WAM (Window-based Attention Module): A window-based attention mechanism module.
[0016] 10. RB (Residual Block): Residual network block.
[0017] 11. Q (Quantization): The quantization unit maps continuous signal values to discrete values; it is an important part of the image coding process.
[0018] 12. AE / AD (Arithmetic Encoding and Arithmetic Decoding): Arithmetic encoding / arithmetic decoding uses entropy encoding to write quantized features into the bitstream at the encoding end and decode them at the decoding end.
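To illustrate term 11, the following is a minimal sketch of quantization and dequantization. It assumes that quantization rounds each value to the nearest integer, which is a common choice in learned codecs; the text does not state the rounding rule, and the function names `quantize` and `dequantize` are hypothetical.

```python
def quantize(values):
    """Map continuous values to integers by rounding half away from zero."""
    return [int(v + 0.5) if v >= 0 else -int(-v + 0.5) for v in values]


def dequantize(symbols, step=1.0):
    """Map integer symbols back to real values (identity for step 1.0)."""
    return [s * step for s in symbols]


latent = [0.2, 1.7, -0.6, 3.49]   # hypothetical continuous feature values
symbols = quantize(latent)        # discrete symbols passed to the arithmetic encoder
restored = dequantize(symbols)    # values recovered at the decoder
print(symbols)
print(restored)
```

The difference between `latent` and `restored` is the quantization error, which is the part of the distortion that the codec cannot undo.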
[0019] The technical solution of this invention is as follows:
[0020] A screen content image compression method based on two-stage octave convolution with multi-scale residuals and window attention includes:
[0021] The preprocessed screen content image is input into the trained end-to-end image compression module to encode and decode the screen content image, and output the reconstructed image.
[0022] Specifically, a two-stage octave convolutional residual block (ToRB) is used for frequency decomposition to extract high-frequency and low-frequency information of features; at the same time, a cascaded multi-scale residual block (CMSRB) is used for cross-scale learning, and a window-based attention (WAM) module is combined to capture high-contrast information.
[0023] According to a preferred embodiment of the present invention, the end-to-end image compression module includes a basic encoder $g_a$, a basic decoder $g_s$, a hyper-prior encoder $h_a$, a hyper-prior decoder $h_s$, a quantization unit, an arithmetic encoder (AE), an arithmetic decoder (AD), an entropy parameter model, and a context model;
[0024] In the encoding of screen content images: the input image x passes through the basic encoder $g_a$ to obtain the latent feature y of the image, which comprises high-frequency features and low-frequency features. These are quantized separately to obtain the quantized high-frequency features $y_H$ and the quantized low-frequency features $y_L$. An arithmetic encoder then codes $y_H$ and $y_L$ to obtain the encoded bitstream of the latent feature y;
[0025] y passes through the hyper-prior encoder $h_a$ to obtain the hyper-prior latent feature z of the image, which comprises hyper-prior high-frequency features and hyper-prior low-frequency features. These are quantized separately to obtain the quantized hyper-prior high-frequency features $z_H$ and the quantized hyper-prior low-frequency features $z_L$. An arithmetic encoder then codes $z_H$ and $z_L$ to obtain the encoded bitstream of the hyper-prior latent feature z;
[0026] In the decoding of the screen content image: first, arithmetic decoding is performed on the bitstream of the hyper-prior latent feature z output by the hyper-prior encoder to obtain the quantized hyper-prior high-frequency features $z_H$ and the quantized hyper-prior low-frequency features $z_L$. Dequantizing $z_H$ and $z_L$ yields the dequantized hyper-prior high-frequency and low-frequency features, which together constitute the decoded hyper-prior latent feature;
[0027] The decoded hyper-prior latent feature passes through the hyper-prior decoder $h_s$ to obtain the prior information of the latent feature y, which is fed into the entropy parameter model. The distribution parameters of the entropy model of y are learned with the help of the context model. The entropy parameter model consists of three convolutional layers with 1×1 kernels, and the context model includes a masked convolution with a 5×5 kernel. The output of the context model is combined with the output of the hyper-prior decoder $h_s$ and fed into the entropy parameter model to obtain the mean and variance parameters of the entropy model distribution. Based on the learned distribution parameters, arithmetic decoding is performed on the bitstream of the latent feature y to obtain the quantized high-frequency features $y_H$ and the quantized low-frequency features $y_L$. Dequantizing $y_H$ and $y_L$ yields the dequantized high-frequency and low-frequency features, which together constitute the decoded latent feature;
[0028] The decoded latent feature passes through the basic decoder $g_s$ to obtain the reconstructed decoded image.
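The role of the mean and variance parameters in arithmetic coding can be illustrated with a small sketch. It assumes that each quantized latent symbol is modeled by a Gaussian with the predicted mean and variance, integrated over a unit-width bin, and that an ideal arithmetic coder spends about $-\log_2 p$ bits on a symbol of probability p. The function names and the numeric values are hypothetical.

```python
import math


def gaussian_cdf(x, mean, std):
    """Cumulative distribution function of a Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def symbol_bits(symbol, mean, variance):
    """Estimated bits for one integer symbol: -log2 of its bin probability."""
    std = math.sqrt(variance)
    p = gaussian_cdf(symbol + 0.5, mean, std) - gaussian_cdf(symbol - 0.5, mean, std)
    return -math.log2(max(p, 1e-9))  # clamp to avoid log(0)


# A symbol close to the predicted mean is cheap; a distant one is expensive.
near = symbol_bits(2, mean=2.1, variance=0.25)
far = symbol_bits(5, mean=2.1, variance=0.25)
print(round(near, 3), round(far, 3))
```

The better the entropy parameter model predicts the mean and the tighter the variance, the fewer bits the arithmetic encoder needs, which is why the context model and the hyper-prior both feed into it.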
[0029] According to a preferred embodiment of the present invention, the basic encoder $g_a$ includes a generalized octave convolution block, four two-stage octave convolutional residual blocks, four cascaded multi-scale residual blocks, and two window-based attention modules. The cascaded multi-scale residual blocks use convolutional kernels of different sizes to extract multi-scale information, and the window-based attention modules include window attention blocks (WB) and residual blocks (RB).
[0030] According to a preferred embodiment of the present invention, the hyper-prior encoder $h_a$ includes three two-stage octave convolutional residual blocks, of which the latter two use the LReLU function as the activation function.
[0031] According to a preferred embodiment of the present invention, the hyper-prior decoder $h_s$ includes three two-stage octave convolutional residual blocks and is symmetric to the hyper-prior encoder $h_a$, with convolution replaced by transposed convolution.
[0032] According to a preferred embodiment of the present invention, the basic decoder $g_s$ includes a generalized octave convolution block, four two-stage octave convolutional residual blocks, four cascaded multi-scale residual blocks, and two window-based attention modules; it is symmetric to the basic encoder $g_a$, with convolution replaced by transposed convolution.
[0033] According to a preferred embodiment of the present invention, frequency decomposition is performed using two-stage octave convolutional residual blocks to extract high-frequency and low-frequency information of features; including:
[0034] For the input original image x, a generalized octave convolution block (GoCB) first produces a high-frequency feature with the same resolution as the original image and a low-frequency feature with half that resolution. W, H, and $C_H$ denote the width, height, and number of channels of the high-frequency feature map, where $C_H = (1-\alpha)C$ and $C_L = \alpha C$, and α is the channel allocation ratio of the input features; the specific implementation process is shown in equations (1), (2), (3), and (4):
[0035]
[0036]
[0037]
[0038]
[0039] In equations (1), (2), (3), and (4), $Y_{H \to H}$ denotes the features updated from high frequency to high frequency, $Y_{L \to L}$ the features updated from low frequency to low frequency, $Y_{L \to H}$ the features converted from low frequency to high frequency, and $Y_{H \to L}$ the features converted from high frequency to low frequency. The first stage outputs intermediate high-frequency and low-frequency features, and $Y_H$ and $Y_L$ are the high-frequency and low-frequency features output by the second stage. The function f(·; W) denotes a convolution operation with parameters W, while ↑ and ↓ denote upsampling and downsampling convolutions with a stride of S, respectively. st(·) denotes a skip connection with a convolution stride of 2; $W_H$ denotes the convolution parameters from the first-stage high-frequency output to $Y_H$, and $W_L$ denotes the convolution parameters from the first-stage low-frequency output to $Y_L$; S2 denotes a convolution stride of 2.
[0040] $Y_H$ and $Y_L$ are the high-frequency and low-frequency information of the extracted features.
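The channel allocation and the resolution split produced by the GoCB can be checked with a short sketch. The function names are hypothetical, and the example values (a 256×256 input, C = 192, α = 0.5) are used only for illustration.

```python
def octave_channels(total_channels, alpha):
    """Split C channels into (C_H, C_L) = ((1 - alpha) * C, alpha * C)."""
    low = int(round(alpha * total_channels))
    high = total_channels - low
    return high, low


def gocb_shapes(width, height, total_channels, alpha):
    """Shapes (W, H, C) of the high- and low-frequency GoCB outputs."""
    c_high, c_low = octave_channels(total_channels, alpha)
    # High frequency keeps the input resolution; low frequency is halved.
    return (width, height, c_high), (width // 2, height // 2, c_low)


high, low = gocb_shapes(256, 256, 192, alpha=0.5)
print(high, low)
```

With α = 0.5 the channels are shared equally, and the low-frequency branch holds a quarter as many spatial positions as the high-frequency branch.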
[0041] According to a preferred embodiment of the present invention, cross-scale learning using cascaded multi-scale residual blocks includes:
[0042] The cascaded multiscale residual block consists of two skip-connected multiscale residual blocks (MSRBs);
[0043] In each multi-scale residual block, features are extracted using two branches with different convolutional kernel sizes, and then concatenated for feature fusion. Specifically, the first branch uses a 3×3 convolution to extract features, and the second branch uses a 5×5 convolution to extract features. The features from the two branches are then interacted and fed into a 3×3 convolution and a 5×5 convolution, respectively. The resulting features are concatenated and fed into a 1×1 convolution. The output is then concatenated with a shortcut from the original input features to obtain the output of a multi-scale residual block.
[0044] The final feature output is obtained by using a cascaded multiscale residual block, which consists of two multiscale residual blocks cascaded together.
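The data flow of one multi-scale residual block can be traced with a little bookkeeping. This sketch assumes "same" padding of (k-1)/2 so that the 3×3 and 5×5 branches produce equal spatial sizes and can be concatenated, assumes that each second-layer branch keeps the concatenated width, and reads the final shortcut as an element-wise addition. These widths follow common multi-scale residual designs and are not stated in the text; the function names are hypothetical.

```python
def conv_out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def msrb_trace(size, channels):
    """Trace (spatial size, channels) through one multi-scale residual block."""
    # First layer: parallel 3x3 and 5x5 branches, padded to keep the size.
    s3 = conv_out_size(size, 3, padding=1)
    s5 = conv_out_size(size, 5, padding=2)
    assert s3 == s5 == size, "branches must match to be concatenated"
    mixed = 2 * channels            # concatenation of the two branches
    # Second layer: each branch sees the concatenated features.
    fused = 2 * mixed               # assumed: each branch keeps 2C channels
    out = channels                  # 1x1 convolution restores C channels
    return {"after_branches": (size, mixed),
            "after_second_layer": (size, fused),
            "output": (size, out)}  # shortcut from the input is added here


trace = msrb_trace(size=64, channels=192)
print(trace)
```

Because the output shape equals the input shape, two such blocks can be cascaded directly, which is what the cascaded multi-scale residual block does.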
[0045] According to a preferred embodiment of the present invention, capturing high-contrast information by incorporating a window-based attention module includes:
[0046] More bits are allocated to complex, high-contrast regions using a window-based attention module, while bits are reserved in simple, low-contrast regions, as shown in equations (5), (6), and (7):
[0047]
[0048]
[0049]
[0050] First, the cross-channel convolution transformations θ, φ, and g are applied to the input window features. The outputs of θ and φ are passed to f(·), and the output of g gives g(·). The outputs of f(·) and g(·) are then multiplied together and normalized by $C(X_k)$. The result is transformed by the cross-channel convolution $W_z$ and combined with the input through a shortcut connection to obtain the final output vector.
[0051] Here, the two inputs of f(·) are the i-th and j-th elements in the k-th window of the input features, $C(X_k)$ is the normalization factor, and $X_k$ is the k-th window of the input features. The output at position i is the weighted average obtained from the transformed features at positions i and j. θ and φ are cross-channel transformations with 1×1 convolution kernels, f(·) is the embedded Gaussian function, g(·) denotes a convolution operation, and $W_z$ is a linear 1×1 convolution over all channels; the final output vector is obtained after the shortcut connection.
[0052] According to a preferred embodiment of the present invention, in the end-to-end image compression module, the loss function is expressed as shown in equation (8):
[0053]
[0054] where λ is the Lagrange multiplier; the bit rate R comprises the entropy of the high-frequency features $y_H$, the entropy of the low-frequency features $y_L$, the entropy of the hyper-prior high-frequency features $z_H$, and the entropy of the hyper-prior low-frequency features $z_L$; and the distortion D is the reconstruction error between the input image x and the reconstructed image.
[0055] A computer device includes a memory and a processor, the memory storing a computer program, the processor executing the computer program to implement the steps of a screen content image compression method based on a two-stage octave convolution with multi-scale residuals and window attention.
[0056] A computer-readable storage medium having a computer program stored thereon, the computer program being executed by a processor to implement the steps of a screen content image compression method based on a two-stage octave convolution with multi-scale residuals and window attention.
[0057] The beneficial effects of this invention are as follows:
[0058] Compared with other methods, at the same objective quality (PSNR) of the decoded images, the coding bitrate of the present invention is lower, indicating that the proposed method has better coding performance; that is, better image quality can be obtained at a lower bitrate.
Attached Figure Description
[0059] Figure 1 is the overall flowchart of the screen content image compression method based on two-stage octave convolution with multi-scale residuals and window attention, as proposed in this invention.
[0060] Figure 2 is a network structure diagram of the two-stage octave convolutional residual block used in this invention.
[0061] Figure 3 is a network structure diagram of the cascaded multi-scale residual blocks used in this invention.
[0062] Figure 4 is a network structure diagram of the window-based attention module used in this invention.
[0063] Figure 5 is a comparison of the rate-distortion performance of the present invention with different coding schemes on the SCID dataset.
[0064] Figure 6 is a comparison of the rate-distortion performance of the present invention with different coding schemes on the SIQAD dataset.
Detailed Implementation
[0065] The present invention will be further defined below with reference to the accompanying drawings and embodiments, but is not limited thereto.
[0066] Example 1
[0067] A screen content image compression method based on two-stage octave convolution with multi-scale residuals and window attention includes:
[0068] The screen content images are preprocessed and then input into the trained end-to-end image compression module, which encodes and decodes them and outputs the reconstructed image. In preprocessing, 2000 high-resolution screen content images are collected for the training set and randomly divided into 256×256 image blocks, yielding 33460 blocks. The end-to-end image compression network consists of an encoder and a decoder: the encoder encodes the image into a bitstream, and the decoder decodes the bitstream into the reconstructed image.
[0069] Specifically, a two-stage octave convolutional residual block (ToRB) is used for frequency decomposition to extract high-frequency and low-frequency information of features; at the same time, a cascaded multi-scale residual block (CMSRB) is used for cross-scale learning, and a window-based attention (WAM) module is combined to capture high-contrast information.
[0070] This invention leverages the advantages of two-stage octave convolutional residual blocks, cascaded multi-scale residual blocks, and window-based attention modules to further optimize the compression performance of screen content images.
[0071] Example 2
[0072] The screen content image compression method based on two-stage octave convolution with multi-scale residuals and window attention, as described in Example 1, differs in that:
[0073] As shown in Figure 1, the end-to-end image compression module includes a basic encoder $g_a$, a basic decoder $g_s$, a hyper-prior encoder $h_a$, a hyper-prior decoder $h_s$, a quantization unit, an arithmetic encoder (AE) (entropy encoding), an arithmetic decoder (AD) (entropy decoding), an entropy parameter model, and a context model;
[0074] In the encoding of screen content images: the input image x passes through the basic encoder $g_a$ to obtain the latent feature y of the image, which comprises high-frequency features and low-frequency features. These are quantized separately to obtain the quantized high-frequency features $y_H$ and the quantized low-frequency features $y_L$. An arithmetic encoder then codes $y_H$ and $y_L$ to obtain the encoded bitstream of the latent feature y;
[0075] y passes through the hyper-prior encoder $h_a$ to obtain the hyper-prior latent feature z of the image, which comprises hyper-prior high-frequency features and hyper-prior low-frequency features. These are quantized separately to obtain the quantized hyper-prior high-frequency features $z_H$ and the quantized hyper-prior low-frequency features $z_L$. An arithmetic encoder then codes $z_H$ and $z_L$ to obtain the encoded bitstream of the hyper-prior latent feature z;
[0076] In the decoding of the screen content image: first, arithmetic decoding is performed on the bitstream of the hyper-prior latent feature z output by the hyper-prior encoder to obtain the quantized hyper-prior high-frequency features $z_H$ and the quantized hyper-prior low-frequency features $z_L$. Dequantizing $z_H$ and $z_L$ yields the dequantized hyper-prior high-frequency and low-frequency features, which together constitute the decoded hyper-prior latent feature;
[0077] The decoded hyper-prior latent feature passes through the hyper-prior decoder $h_s$ to obtain the prior information of the latent feature y, which is fed into the entropy parameter model. The distribution parameters of the entropy model of y are learned with the help of the context model. The entropy parameter model consists of three convolutional layers with 1×1 kernels, and the context model includes a masked convolution with a 5×5 kernel. The output of the context model is combined with the output of the hyper-prior decoder $h_s$ and fed into the entropy parameter model to obtain the mean and variance parameters of the entropy model distribution. Based on the learned distribution parameters, arithmetic decoding is performed on the bitstream of the latent feature y to obtain the quantized high-frequency features $y_H$ and the quantized low-frequency features $y_L$. Dequantizing $y_H$ and $y_L$ yields the dequantized high-frequency and low-frequency features, which together constitute the decoded latent feature;
[0078] The decoded latent feature passes through the basic decoder $g_s$ to obtain the reconstructed decoded image.
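The 5×5 masked convolution in the context model restricts each position to neighbours that the decoder has already reconstructed. The following sketch builds such a mask under two assumptions that the text does not state: symbols are decoded in raster-scan order, and the centre position itself is excluded. The function name `causal_mask` is hypothetical.

```python
def causal_mask(kernel_size=5):
    """Binary mask keeping only positions decoded before the centre."""
    centre = kernel_size // 2
    mask = []
    for row in range(kernel_size):
        line = []
        for col in range(kernel_size):
            # Rows above the centre, or positions to its left in the same row.
            before = row < centre or (row == centre and col < centre)
            line.append(1 if before else 0)
        mask.append(line)
    return mask


mask = causal_mask(5)
for line in mask:
    print(line)
```

Multiplying the convolution kernel by this mask element-wise keeps the context model causal, so the decoder can evaluate it with only the symbols it has already decoded.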
[0079] The basic encoder $g_a$ includes a generalized octave convolution block, four two-stage octave convolutional residual blocks, four cascaded multi-scale residual blocks, and two window-based attention modules. The generalized octave convolution block has a kernel size of 5×5 and a stride of 1. Each two-stage octave convolutional residual block has a kernel size of 5×5 and a stride of 2. The cascaded multi-scale residual blocks use convolutional kernels of different sizes to extract multi-scale information, with kernel sizes of 3×3 and 5×5 and a stride of 1. The window-based attention modules include window attention blocks (WB) and residual blocks (RB). The window attention blocks have a kernel size of 1×1 and a stride of 1, while the residual blocks have a kernel size of 3×3 and a stride of 1.
[0080] The hyper-prior encoder $h_a$ includes three two-stage octave convolutional residual blocks with a kernel size of 5×5. The first block has a stride of 1, and the latter two have a stride of 2. The latter two blocks use the LReLU function as the activation function.
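The strides listed above determine the spatial size of the latent features. The sketch below traces them for a square input, assuming that every stride-2 block halves both the high-frequency and the low-frequency feature maps and that the stride-1 modules (CMSRB, WAM) leave the resolution unchanged. The function names are hypothetical.

```python
def trace_resolution(size, strides):
    """Apply a sequence of stride factors to a spatial size."""
    for stride in strides:
        size //= stride
    return size


def latent_sizes(image_size):
    """Spatial sizes of y_H, y_L, z_H, z_L for a square input image."""
    # Basic encoder: GoCB (stride 1) followed by four ToRBs (stride 2 each).
    high_in, low_in = image_size, image_size // 2   # GoCB outputs
    encoder_strides = [2, 2, 2, 2]
    y_high = trace_resolution(high_in, encoder_strides)
    y_low = trace_resolution(low_in, encoder_strides)
    # Hyper-prior encoder: three ToRBs with strides 1, 2, 2.
    hyper_strides = [1, 2, 2]
    z_high = trace_resolution(y_high, hyper_strides)
    z_low = trace_resolution(y_low, hyper_strides)
    return {"y_H": y_high, "y_L": y_low, "z_H": z_high, "z_L": z_low}


sizes = latent_sizes(256)
print(sizes)
```

For a 256×256 training block this gives a 16×16 high-frequency latent and an 8×8 low-frequency latent, with the hyper-prior latents a further factor of four smaller per side.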
[0081] Like the hyper-prior encoder $h_a$, the hyper-prior decoder $h_s$ includes three two-stage octave convolutional residual blocks; it is symmetric to the hyper-prior encoder $h_a$, with convolution replaced by transposed convolution.
[0082] Like the basic encoder $g_a$, the basic decoder $g_s$ includes a generalized octave convolution block, four two-stage octave convolutional residual blocks, four cascaded multi-scale residual blocks, and two window-based attention modules; it is symmetric to the basic encoder $g_a$, with convolution replaced by transposed convolution.
[0083] Frequency decomposition is performed using two-stage octave convolutional residual blocks to extract high-frequency and low-frequency information from features; including:
[0084] As shown in Figure 2, light gray arrows represent information transmission from high frequency to low frequency, dark gray arrows represent information transmission from low frequency to high frequency, and dashed arrows represent skip connections whose convolution stride is used to match dimensions. Just as image information can be divided into high-frequency and low-frequency components, the learned features can also be divided into high-frequency and low-frequency features according to their frequency, and an octave convolutional network can learn the two separately. For the input original image x, a generalized octave convolution block (GoCB) first produces a high-frequency feature with the same resolution as the original image and a low-frequency feature with half that resolution. W, H, and $C_H$ denote the width, height, and number of channels of the high-frequency feature map, where $C_H = (1-\alpha)C$ and $C_L = \alpha C$, and α is the channel allocation ratio of the input features. For a two-stage octave convolutional residual block (ToRB), as shown in Figure 2, the inputs are $X_H$ and $X_L$, and the output features are produced in two stages: an octave convolutional network first yields the first-stage high-frequency and low-frequency outputs, which then pass through another convolutional network to yield the second-stage outputs $Y_H$ and $Y_L$. When the stride s is set to 2, the output features $Y_H$ and $Y_L$ have half the spatial size of the input features $X_H$ and $X_L$, respectively. The specific implementation process is shown in equations (1), (2), (3), and (4):
[0085]
[0086]
[0087]
[0088]
[0089] In equations (1), (2), (3), and (4), $Y_{H \to H}$ denotes the features updated from high frequency to high frequency, $Y_{L \to L}$ the features updated from low frequency to low frequency, $Y_{L \to H}$ the features converted from low frequency to high frequency, and $Y_{H \to L}$ the features converted from high frequency to low frequency. The first stage outputs intermediate high-frequency and low-frequency features, and $Y_H$ and $Y_L$ are the high-frequency and low-frequency features output by the second stage. The function f(·; W) denotes a convolution operation with parameters W, while ↑ and ↓ denote upsampling and downsampling convolutions with a stride of S, respectively. st(·) denotes a skip connection with a convolution stride of 2; $W_H$ denotes the convolution parameters from the first-stage high-frequency output to $Y_H$, and $W_L$ denotes the convolution parameters from the first-stage low-frequency output to $Y_L$; S2 denotes a convolution stride of 2. $Y_H$ and $Y_L$ are the high-frequency and low-frequency information of the extracted features.
[0090] Cross-scale learning is performed using cascaded multi-scale residual blocks, including:
[0091] Generalized divisive normalization (GDN) and inverse generalized divisive normalization (IGDN) are commonly used in learned image coding algorithms. Research on alternative nonlinear transformations to GDN / IGDN has shown that stacked residual networks can further enhance nonlinearity and thereby improve rate-distortion performance. This invention also employs stacked residual networks to implement nonlinear transformations. Unlike existing techniques, to effectively handle the different frequency features of screen content, this invention proposes a cascaded multi-scale residual block. Using convolutional kernels of different sizes enhances the ability to learn features at different scales, and cascading two multi-scale residual blocks further enhances the nonlinear representation capability. As shown in Figure 3, the cascaded multi-scale residual block includes two skip-connected multi-scale residual blocks (MSRBs);
[0092] In each multi-scale residual block, features are extracted using two branches with different convolutional kernel sizes, and then concatenated for feature fusion. Specifically, the first branch uses a 3×3 convolution to extract features, and the second branch uses a 5×5 convolution to extract features. The features from the two branches are then interacted and fed into a 3×3 convolution and a 5×5 convolution, respectively. The resulting features are concatenated and fed into a 1×1 convolution. The output is then concatenated with a shortcut from the original input features to obtain the output of a multi-scale residual block.
[0093] The final feature output is obtained by using a cascaded multi-scale residual block, which consists of two multi-scale residual blocks cascaded together. As shown in Figure 1, in the basic encoder and basic decoder networks, the CMSRB module replaces the GDN / IGDN layer and is added at the output of each two-stage octave convolutional residual block (ToRB).
[0094] Combining window-based attention modules to capture high-contrast information includes:
[0095] Attention mechanisms have been widely applied in computer vision in recent years, achieving significant results in tasks such as object detection and recognition, and now show broad application prospects in image compression. Compared with natural scene images, screen content images have higher contrast. For high-contrast regions, this invention uses a window-based attention block, which allows more bits to be allocated to complex high-contrast regions while saving bits in simple low-contrast regions. The structure of the window-based attention block is shown in Figure 4.
[0096] The left side of Figure 4 shows a schematic diagram of the network architecture of the window-based attention module, in which the input feature is denoted $f_{in}$ and the output feature is denoted $f_{out}$. Specifically, $f_{in}$ first passes through two branches: one consists of three residual blocks (RBs), and the other consists of a window block (WB), three residual blocks (RBs), a 1×1 convolution, and a sigmoid layer. The outputs of the two branches are multiplied together and then summed with the input $f_{in}$ through a shortcut connection to obtain the final output $f_{out}$. The window block (WB) divides the feature map into M×M non-overlapping sub-window blocks.
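The window partition performed by the window block can be sketched as follows. The sketch reads M×M as the size of each window, which is an assumption, and works on a single-channel map stored as a list of rows; the function name `window_partition` is hypothetical.

```python
def window_partition(feature_map, m):
    """Split a 2-D map (list of rows) into non-overlapping m x m windows."""
    height, width = len(feature_map), len(feature_map[0])
    if height % m or width % m:
        raise ValueError("height and width must be multiples of m")
    windows = []
    for top in range(0, height, m):
        for left in range(0, width, m):
            windows.append([row[left:left + m]
                            for row in feature_map[top:top + m]])
    return windows


# 4x4 map with values 0..15, split into four 2x2 windows.
fmap = [[r * 4 + c for c in range(4)] for r in range(4)]
windows = window_partition(fmap, 2)
print(len(windows), windows[0], windows[3])
```

Attention is then computed inside each window independently, so its cost grows with the window size rather than with the full feature map.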
[0097] The window-based attention module allocates more bits to complex, high-contrast regions while saving bits in simple, low-contrast regions, as shown in equations (5), (6), and (7):
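[0098] $$Y_i^k = \frac{1}{C(X^k)} \sum_{\forall j} f(X_i^k, X_j^k)\, g(X_j^k) \tag{5}$$
[0099] $$f(X_i^k, X_j^k) = e^{\theta(X_i^k)^{T} \phi(X_j^k)} \tag{6}$$
[0100] $$Z_i^k = W_z Y_i^k + X_i^k \tag{7}$$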
[0101] First, the input window features undergo the cross-channel convolution transformations θ, φ, and g, respectively. The outputs of θ and φ are combined by f(·), and the output of g is computed by g(·). The outputs of f(·) and g(·) are then multiplied and normalized by C(X^k) to yield Y_i^k. After the cross-channel convolution W_z, Y_i^k is combined with X_i^k through a shortcut connection to obtain the final output vector Z_i^k.
[0102] Here, X_i^k and X_j^k denote the i-th and j-th elements in the k-th window of the input features, C(X^k) is the normalization factor, and X^k denotes the k-th window of the input features. Y_i^k, the output at position i, is a weighted average obtained by transforming the features at positions i and j. θ and φ are cross-channel transformations implemented as 1×1 convolutions, f(·) is the embedded Gaussian function, g(·) denotes a convolution operation, and W_z is a linear 1×1 convolution applied across all channels; the final output vector Z_i^k is obtained after the shortcut connection.
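The attention computation inside one window can be sketched as below. This is an illustration under assumptions, not the patented implementation: the 1×1 convolutions θ, φ, g, and W_z are represented by small square matrices applied to each position's feature vector, and the normalization factor is assumed to be the sum of the embedded Gaussian weights over all positions j of the window (which makes the weights a softmax). The names `matvec` and `window_attention` are hypothetical.

```python
import math


def matvec(w, v):
    """Apply a matrix (list of rows) to a vector; stands in for a 1x1 convolution."""
    return [sum(a * b for a, b in zip(row, v)) for row in w]


def window_attention(window, theta, phi, g, w_z):
    """Non-local attention over the positions of one window.

    `window` is a list of N feature vectors (one per position); `theta`, `phi`,
    `g`, and `w_z` are C x C matrices standing in for 1x1 convolutions.
    """
    q = [matvec(theta, x) for x in window]
    k = [matvec(phi, x) for x in window]
    v = [matvec(g, x) for x in window]
    out = []
    for i, x_i in enumerate(window):
        scores = [sum(a * b for a, b in zip(q[i], k_j)) for k_j in k]
        peak = max(scores)  # shift for numerical stability; cancels after normalization
        f = [math.exp(s - peak) for s in scores]  # embedded Gaussian weights
        c = sum(f)  # normalization factor (assumed: sum over all j in the window)
        y_i = [sum(f_j * v_j[d] for f_j, v_j in zip(f, v)) / c
               for d in range(len(v[0]))]
        z_i = [a + b for a, b in zip(matvec(w_z, y_i), x_i)]  # W_z, then shortcut
        out.append(z_i)
    return out


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
window = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(window_attention(window, identity, identity, identity, identity))
print(window_attention(window, identity, identity, identity, zeros) == window)  # True
```

Setting W_z to zero leaves only the shortcut, so the block returns its input unchanged; this is a convenient sanity check for the residual form.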
[0103] The loss function needs to consider the joint optimization of bit rate R and distortion D. Here, it is regarded as a rate-distortion optimization problem based on Lagrange multipliers. In the end-to-end image compression module, the loss function is expressed as shown in Equation (8):
[0104] $$L = R + \lambda D = R_{y^H} + R_{y^L} + R_{z^H} + R_{z^L} + \lambda\, D(x, \hat{x}) \tag{8}$$
[0105] Here, λ is the Lagrange multiplier. The bit rate R comprises the entropy of the high-frequency features y^H, the entropy of the low-frequency features y^L, the entropy of the hyper-prior high-frequency features z^H, and the entropy of the hyper-prior low-frequency features z^L. The distortion D is the reconstruction error between the input image x and the reconstructed image x̂.
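A minimal numerical sketch of such a rate-distortion objective follows. It assumes the common form L = R + λ·D, with the rate estimated as the sum of −log2 of the symbol likelihoods divided by the number of pixels, and MSE as the distortion; any additional scaling of the distortion term is not stated in the text and is not applied here. The likelihood lists and the function names are hypothetical, and `x` is treated as a flat list of grayscale pixel values.

```python
import math


def rate_bpp(likelihoods, num_pixels):
    """Estimated rate in bits per pixel: sum of -log2(p) over symbols / pixels."""
    return sum(-math.log2(p) for p in likelihoods) / num_pixels


def mse(x, x_hat):
    """Mean squared error between two equally long pixel lists."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def rd_loss(lik_y_h, lik_y_l, lik_z_h, lik_z_l, x, x_hat, lam):
    """L = R + lam * D, with R summed over the four latent groups (assumed form)."""
    n = len(x)
    rate = sum(rate_bpp(lik, n) for lik in (lik_y_h, lik_y_l, lik_z_h, lik_z_l))
    return rate + lam * mse(x, x_hat)


loss = rd_loss(
    lik_y_h=[0.5, 0.5],  # 2 bits
    lik_y_l=[0.25],      # 2 bits
    lik_z_h=[0.5],       # 1 bit
    lik_z_l=[1.0],       # 0 bits
    x=[0.0, 0.0, 0.0, 0.0],
    x_hat=[1.0, 1.0, 1.0, 1.0],
    lam=0.013,
)
print(loss)  # about 1.263: 5 bits / 4 pixels + 0.013 * 1.0
```

A larger λ penalizes distortion more heavily, which pushes the trained model toward higher bitrates and higher quality.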
[0106] This embodiment integrates several existing screen content datasets and additionally collects web page images, mobile device images, and game images online to construct a screen content image dataset of more than 2,000 images covering various content types, including text, charts, animations, web content, mobile content, and game videos. During training, the images in the training set are randomly cropped into 256×256 image blocks, the batch size is set to 8, the Adam optimizer is used, and training runs for 400 epochs. The initial learning rate is 1e-4 and is reduced to 1e-5 after the 300th epoch. The mean squared error (MSE) is used as the distortion metric when optimizing the model, the rate-distortion trade-off parameter λ takes the values {0.0018, 0.0035, 0.0067, 0.013, 0.025, 0.0483}, and the channel allocation ratio α of the octave convolution is set to 0.5. Regarding channel settings, the codec uses 192 convolutional channels at low bitrates and 320 channels at high bitrates. The model's performance is evaluated on the SCID and SIQAD datasets.
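The training settings listed above can be collected in a short configuration sketch. It assumes that epochs are counted from 1 (so epochs 1 to 300 use 1e-4 and epochs 301 to 400 use 1e-5) and that the random 256×256 blocks are obtained by uniform random cropping. The λ value at which the channel count switches from 192 to 320 is not given in the text and is therefore left out. All names are hypothetical.

```python
import random

LAMBDAS = (0.0018, 0.0035, 0.0067, 0.013, 0.025, 0.0483)
BATCH_SIZE = 8
EPOCHS = 400
PATCH = 256
ALPHA = 0.5  # channel allocation ratio of the octave convolution


def learning_rate(epoch):
    """Step schedule: 1e-4 up to and including epoch 300, then 1e-5."""
    return 1e-4 if epoch <= 300 else 1e-5


def random_crop_origin(height, width, patch=PATCH, rng=random):
    """Pick the top-left corner of a random patch x patch crop."""
    if height < patch or width < patch:
        raise ValueError("image is smaller than the crop size")
    top = rng.randrange(height - patch + 1)
    left = rng.randrange(width - patch + 1)
    return top, left


rng = random.Random(0)
top, left = random_crop_origin(1080, 1920, rng=rng)
print(top, left, learning_rate(1), learning_rate(301))
```

One model would be trained per λ value, giving six operating points on the rate-distortion curve.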
[0107] The experiment compares the rate-distortion (RD) performance of the method of this invention with that of other deep learning-based image coding methods. Figures 5 and 6 compare the coding performance of the different methods on the SCID and SIQAD datasets. At the same PSNR of the decoded images (the objective evaluation metric), the method of this invention requires a lower coding bitrate than the other methods, which indicates better coding performance; equivalently, better image quality can be obtained at a lower bitrate.
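For reference, PSNR is derived from the MSE between the original and the decoded image. The sketch below assumes 8-bit images with a peak value of 255, which the text does not state explicitly; the function name and the flat-list image representation are hypothetical.

```python
import math


def psnr(x, x_hat, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally long pixel lists."""
    err = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    if err == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / err)


print(psnr([0.0] * 4, [1.0] * 4))  # about 48.13 dB for an MSE of 1
```

Comparing methods "at the same PSNR" therefore amounts to comparing their bitrates at the same MSE.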
[0108] Example 3
[0109] A computer device includes a memory and a processor. The memory stores a computer program, and the processor executes the computer program to implement the steps of the screen content image compression method based on two-stage octave convolution with multi-scale residuals and window attention as described in Example 1 or 2.
[0110] Example 4
[0111] A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the screen content image compression method based on two-stage octave convolution with multi-scale residuals and window attention as described in Example 1 or 2.
Claims
1. A screen content image compression method based on two-stage octave convolution with multi-scale residuals and window attention, characterized in that it comprises: inputting the preprocessed screen content image into a trained end-to-end image compression module, which encodes and decodes the screen content image and outputs the reconstructed image; wherein frequency decomposition is performed using two-stage octave convolutional residual blocks to extract high-frequency and low-frequency information of the features, and, at the same time, cross-scale learning is performed using cascaded multi-scale residual blocks, combined with a window-based attention module to capture high-contrast information; performing frequency decomposition using two-stage octave convolutional residual blocks to extract high-frequency and low-frequency information of the features comprises: for the input original image x, first obtaining, through a generalized octave convolutional neural network, a high-frequency feature with the same resolution as the original image and a low-frequency feature with half that resolution, the spatial dimensions of these features being expressed in terms of the width, height, and number of channels of the feature map, and α denoting the ratio in which the channels of the input features are allocated; the specific implementation process is shown in equations (1), (2), (3), and (4); in equations (1), (2), (3), and (4), the intermediate terms denote, respectively, the feature whose information is updated from high frequency to high frequency, the feature whose information is updated from low frequency to low frequency, the feature whose information is converted from low frequency to high frequency, and the feature whose information is converted from high frequency to low frequency; the first-stage output comprises a high-frequency feature and a low-frequency feature, and the second-stage output likewise comprises a high-frequency feature and a low-frequency feature; the convolution function denotes a convolution operation with the given parameters, and the upsampling and downsampling functions correspond to upsampling and downsampling convolutions with a stride of [size missing]; the skip-connection term denotes a skip connection with a convolution stride of 2; the two parameter sets each denote the parameters of the convolutional network mapping from one feature to another; a further term denotes a convolution with a stride of 2; and the final outputs are the high-frequency and low-frequency information of the obtained features; performing cross-scale learning using cascaded multi-scale residual blocks comprises: a cascaded multi-scale residual block consists of two skip-connected multi-scale residual blocks; in each multi-scale residual block, features are extracted by two branches with different convolution kernel sizes and then fused; specifically, the first branch uses a 3×3 convolution to extract features, and the second branch uses a 5×5 convolution; the features of the two branches then interact across the branches and are fed into a further 3×3 convolution and a further 5×5 convolution, respectively; the resulting features are concatenated and fed into a 1×1 convolution; the result is then combined with the original input features through a shortcut connection to obtain the output of the multi-scale residual block; the final feature output is obtained by the cascaded multi-scale residual block, which consists of two multi-scale residual blocks cascaded together.
2. The screen content image compression method based on two-stage octave convolution with multi-scale residuals and window attention as described in claim 1, characterized in that the end-to-end image compression module includes a basic encoder, a basic decoder, a hyper-prior encoder, a hyper-prior decoder, a quantization unit, an arithmetic encoder (AE), an arithmetic decoder (AD), an entropy parameter model, and a context model; in encoding the screen content image: the input image x passes through the basic encoder to obtain the latent features y of the image, which include the high-frequency features y^H and the low-frequency features y^L; the high-frequency features and the low-frequency features are quantized separately to obtain the quantized high-frequency features and the quantized low-frequency features, which are then arithmetically encoded by the arithmetic encoder to obtain the encoded bitstream of the latent features; the latent features pass through the hyper-prior encoder to obtain the hyper-prior latent features z of the image, which include the hyper-prior high-frequency features z^H and the hyper-prior low-frequency features z^L; the hyper-prior high-frequency and low-frequency features are quantized separately to obtain the quantized hyper-prior high-frequency features and the quantized hyper-prior low-frequency features, which are then arithmetically encoded by the arithmetic encoder to obtain the encoded bitstream of the hyper-prior latent features; in decoding the screen content image: first, the bitstream of the hyper-prior latent features output by the hyper-prior encoder is arithmetically decoded to obtain the quantized hyper-prior high-frequency features and the quantized hyper-prior low-frequency features, which after dequantization yield the dequantized hyper-prior high-frequency features and the dequantized hyper-prior low-frequency features, which together constitute the decoded hyper-prior latent features; based on the learned parameter distribution, the bitstream of the latent features is arithmetically decoded to obtain the quantized high-frequency features and the quantized low-frequency features, which after dequantization yield the dequantized high-frequency features and the dequantized low-frequency features, which together constitute the decoded latent features; and the decoded latent features pass through the basic decoder to obtain the reconstructed decoded image x̂.
3. The screen content image compression method based on two-stage octave convolution with multi-scale residuals and window attention as described in claim 2, characterized in that the basic encoder includes a generalized octave convolution block, four two-stage octave convolutional residual blocks, four cascaded multi-scale residual blocks, and two window-based attention modules; the cascaded multi-scale residual blocks use convolution kernels of different sizes to extract multi-scale information, and the window-based attention module includes window attention blocks and residual blocks; the hyper-prior encoder includes three two-stage octave convolutional residual blocks, of which the latter two use the LReLU function as the activation function.
4. The screen content image compression method based on two-stage octave convolution with multi-scale residuals and window attention as described in claim 2, characterized in that the hyper-prior decoder includes three two-stage octave convolutional residual blocks and is symmetric to the hyper-prior encoder, with the convolutions replaced by transposed convolutions; the basic decoder includes a generalized octave convolution block, four two-stage octave convolutional residual blocks, four cascaded multi-scale residual blocks, and two window-based attention modules, and is symmetric to the basic encoder, with the convolutions replaced by transposed convolutions.
5. The screen content image compression method based on two-stage octave convolution with multi-scale residuals and window attention as described in claim 1, characterized in that combining a window-based attention module to capture high-contrast information comprises: using the window-based attention module to allocate more bits to complex, high-contrast regions while saving bits in simple, low-contrast regions, as shown in equations (5), (6), and (7); first, the input window features undergo the cross-channel convolution transformations θ, φ, and g, respectively; the outputs of θ and φ are combined by f(·), and the output of g is computed by g(·); the outputs of f(·) and g(·) are then multiplied and normalized by C(X^k) to yield Y_i^k; after the cross-channel convolution W_z, Y_i^k is combined with X_i^k through a shortcut connection to obtain the final output vector Z_i^k; wherein X_i^k and X_j^k denote the i-th and j-th elements in the k-th window of the input features, C(X^k) is the normalization factor, X^k denotes the k-th window of the input features, Y_i^k, the output at position i, is a weighted average obtained by transforming the features at positions i and j, θ and φ are cross-channel transformations implemented as 1×1 convolutions, f(·) is the embedded Gaussian function, g(·) denotes a convolution operation, and W_z is a linear 1×1 convolution applied across all channels, the final output vector Z_i^k being obtained after the shortcut connection.
6. The screen content image compression method based on two-stage octave convolution with multi-scale residuals and window attention as described in any one of claims 2-5, characterized in that, in the end-to-end image compression module, the loss function is expressed as shown in equation (8); wherein λ denotes the Lagrange multiplier; the bit rate R comprises the entropy of the high-frequency features y^H, the entropy of the low-frequency features y^L, the entropy of the hyper-prior high-frequency features z^H, and the entropy of the hyper-prior low-frequency features z^L; and the distortion D denotes the reconstruction error between the input image x and the reconstructed image x̂.
7. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, it implements the steps of the screen content image compression method based on two-stage octave convolution according to any one of claims 1-6.
8. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the screen content image compression method based on two-stage octave convolution according to any one of claims 1-6.