A lightweight Transformer-based super-resolution method and system for garbage images
A lightweight Transformer-based super-resolution method for garbage images uses hierarchical feature extraction, a residual fusion architecture, and prior knowledge to address low image resolution and high computational complexity, achieving efficient image quality restoration.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- YANGTZE RIVER DELTA RES INST OF NPU TAICANG
- Filing Date
- 2022-12-28
- Publication Date
- 2026-06-12
AI Technical Summary
Existing image super-resolution methods suffer from high computational complexity, slow training convergence, and applicability only to single-size input images, making it difficult to effectively improve the resolution and quality of garbage images.
A lightweight Transformer-based super-resolution method for garbage images is proposed. By combining prior knowledge with a hierarchical feature extraction module and a residual fusion architecture, it achieves hierarchical fusion and efficient restoration of image detail features.
While reducing computational complexity, it significantly improves the performance of image super-resolution, making it suitable for the identification and classification of garbage images and possessing significant application value.
Figure CN115936992B_ABST
Description
Technical Field
[0001] This invention belongs to the field of image processing and computer vision, and specifically relates to a lightweight Transformer-based super-resolution method and system for garbage images.
Background Technology
[0002] Image super-resolution is an important branch of computer vision, referring to the reconstruction of high-resolution images with rich details and clear textures from low-resolution images as input. Image super-resolution has wide applications in fields such as medical diagnosis, security monitoring, and video restoration. For example, in the medical field, high-quality images can help doctors accurately detect diseases; in the monitoring and security field, it can improve image quality and obtain more valuable image information. Therefore, image super-resolution is of great significance in both academic research and industrial applications.
[0003] To address the single-image super-resolution problem, researchers have developed various methods based on degradation models for low-level vision tasks. Super-resolution methods that do not use neural networks are referred to as traditional methods. Generally, traditional single-image super-resolution falls into three categories: methods based on the image's intrinsic information, methods based on prior knowledge, and machine learning methods. However, traditional methods have drawbacks such as overly complex models, high computational cost, and the need for manual parameter setting during super-resolution. Deep learning methods have emerged as a significant solution to these problems. In 2016, researchers proposed the Super-Resolution Convolutional Neural Network (SRCNN), marking the first application of deep learning to image super-resolution reconstruction. Compared with traditional methods, this model not only learns its parameters automatically but also uses a simpler optimization algorithm and a more lightweight model structure. Despite containing only one preprocessing layer and three convolutional layers, the network learns better than some popular machine learning methods for image super-resolution, demonstrating the strength of Convolutional Neural Networks (CNNs) on this task. The SRCNN network structure is shown in Figure 1.
[0004] While SRCNN successfully introduced deep learning techniques to the single-image super-resolution problem, it still has three limitations:
[0005] First, the model relies on contextual information from small image regions;
[0006] Second, the model training converges too slowly;
[0007] Third, the model is only applicable to input images of a single size.
[0008] To address these issues, researchers proposed the Very Deep Super-Resolution Convolutional Network (VDSR). For the first problem of SRCNN, VDSR exploits contextual information from large image regions by cascading many small filters within a deep network structure. For the second problem, VDSR uses a residual structure together with adjustable gradient clipping, which permits a higher learning rate and accelerates convergence. For the third problem, the model achieves multi-scale modeling by setting a scaling factor, so that model parameters are shared across all predefined scaling factors. The VDSR network structure is shown in Figure 2. Overall, VDSR introduces residual learning into image super-resolution, using residual learning and an extremely high learning rate to optimize deep networks rapidly, while gradient clipping keeps training stable.
Summary of the Invention
[0009] The technical problem to be solved by this invention is to provide a lightweight Transformer-based super-resolution method and system for garbage images that addresses the shortcomings of the prior art. The method solves the technical problems of low resolution and poor image quality in garbage images, and effectively improves the performance of image super-resolution networks while reducing the computational complexity of the model and accelerating the training process.
[0010] The present invention adopts the following technical solution:
[0011] A lightweight Transformer-based super-resolution method for garbage images includes the following steps:
[0012] By performing layered extraction on low-resolution images and adding residual information, image detail feature information at four different levels is obtained;
[0013] Image detail features at four different levels are processed through convolution and lightweight operations. Prior knowledge is then hierarchically fused into the image detail features at the four different levels. The features are then merged and output using a residual fusion architecture.
[0014] The output image detail features are reconstructed pixel by pixel to obtain the predicted high-resolution image.
[0015] Specifically, obtaining image detail feature information at four different levels involves:
[0016] The hierarchical feature extraction module outputs image detail feature information at different levels. The feature fusion block combines four types of image features and fuses prior information with image features to obtain four different levels of image detail feature information.
[0017] Furthermore, the hierarchical feature extraction module comprises four levels, as detailed below:
[0018] The first level consists of 8 Transformer L1 blocks. According to the data processing order, the first 4 Transformer L1 blocks perform preliminary extraction and are then processed by the PCA module before being input into the second level. The last 4 Transformer L1 blocks are used to extract the image features again after adding residual information and then output them to the feature fusion block.
[0019] The second level includes 6 Transformer L2 blocks. The first 3 Transformer L2 blocks are used to extract features from the upper level and continue to output them downwards. The last 3 Transformer L2 blocks are used to further process the features after merging the Transformer L1 and Transformer L3 residuals and output them to the feature fusion block.
[0020] The third layer includes four Transformer L3 blocks. The first two Transformer L3 blocks are used to extract features from the upper level and then input them into the fourth layer through a PCA module and a 5×5 convolution. The last two Transformer L3 blocks are output to the feature fusion block after processing the fusion residual information.
[0021] The fourth level includes two Transformer L4 blocks, which are used to extract the feature information after fusing the residuals of the first and third levels and output it to the feature fusion block.
[0022] Furthermore, across the first, second, third, and fourth levels, the number of Transformer blocks gradually increases from bottom to top, that is, from 2 blocks in the fourth level to 8 blocks in the first level.
[0023] Furthermore, in the fourth layer, residual connections are used to obtain fused image feature information. After layer normalization, this information is input into a multi-head attention mechanism, and the output feature map is computed as follows:
[0024]
[0025] where X is the input feature map; the Query, Key, and Value matrices are obtained by reshaping tensors of the original size; $W_p$ is a 1×1 convolution; and α is a learnable scale parameter;
[0026] The output image features $F_l$ are:
[0027] $F_l \in \mathbb{R}^{H/8 \times W/8 \times 8C}$
[0028] where $\mathbb{R}$ denotes the real-valued feature space, C is the number of channels in the feature map, H is the height of the feature map, and W is the width of the feature map.
[0029] Specifically, the feature merging output using the residual fusion architecture is as follows:
[0030] The input features are initially extracted through different convolutions while keeping the feature map dimensions consistent. The initially extracted feature information is then input into the prior knowledge transformation block PITM to fuse prior information and obtain a prior information set. Adaptive weight allocation is performed on the prior information set through 1×1 convolution and softmax layers, and the prior information set is further fused with image features according to the corresponding weights.
[0031] Furthermore, the prior information ψ is modeled as a pair of affine transformation parameters (γ, β) through the mapping function M: ψ → (γ, β), which is calculated as follows:
[0032]
[0033] (γ,β)=M(ψ)
[0034] where (γ, β) denotes the prior information obtained through convolution, x denotes the processed prior information pair, G denotes the set of prior information after processing, and θ denotes the two different convolution operations applied to (γ, β).
[0035] Furthermore, the prior information set is further fused with the image features according to the corresponding weights, as follows:
[0036] $(F \mid \gamma, \beta) = w_1 \gamma \odot F + w_2 \beta$
[0037] where F represents the input image feature information, ⊙ represents the Hadamard product, $w_1$ represents the weight of γ, and $w_2$ represents the weight of β.
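The fusion rule in [0036] can be illustrated with a minimal NumPy sketch. The tensor shapes, the toy values, and the use of two raw logits to stand in for the output of the 1×1 convolution are assumptions made for illustration; they are not taken from the patent.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def fuse_prior(F, gamma, beta, logits):
    """Return w1 * gamma ⊙ F + w2 * beta, with (w1, w2) = softmax(logits).

    `logits` is a hypothetical stand-in for the 1x1-convolution output that
    feeds the softmax layer in the text.
    """
    w1, w2 = softmax(np.asarray(logits, dtype=float))
    return w1 * gamma * F + w2 * beta   # elementwise (Hadamard) product

F = np.ones((2, 2, 3))            # toy image features, H x W x C
gamma = np.full((2, 2, 3), 2.0)   # toy scale prior
beta = np.full((2, 2, 3), 4.0)    # toy shift prior
out = fuse_prior(F, gamma, beta, logits=[0.0, 0.0])  # equal weights 0.5, 0.5
```

With equal weights, every element of `out` is 0.5·2·1 + 0.5·4 = 3, which makes the roles of the scale term γ and the shift term β easy to check by hand.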
[0038] Specifically, low-resolution images are obtained by performing bicubic interpolation downsampling on the original image.
[0039] In a second aspect, embodiments of the present invention provide a lightweight Transformer-based super-resolution system for garbage images, comprising:
[0040] The extraction module performs layered extraction on low-resolution images and adds residual information to obtain image detail feature information at four different levels.
[0041] The merging module processes image detail features at four different levels through convolution and lightweight operations, then integrates prior knowledge into the image detail features at four different levels, and finally merges the features using a residual fusion architecture.
[0042] The reconstruction module performs pixel reconstruction on the output image detail features to obtain the predicted high-resolution image.
[0043] Compared with the prior art, the present invention has at least the following beneficial effects:
[0044] A lightweight Transformer-based super-resolution method for garbage images improves the super-resolution performance of garbage images while reducing computational complexity, achieving an efficient lightweight super-resolution network suitable for applications such as garbage image recognition and classification, and showing significant improvements in various application fields.
[0045] Furthermore, the hierarchical feature extraction module outputs image detail feature information at different levels. The feature fusion block combines four types of image features to effectively integrate feature information under different magnification conditions. Prior information is fused with image features, and the prior information is used to improve the recovery of detail information by image features, resulting in four different levels of image detail feature information.
[0046] Furthermore, a layered Transformer module is used to achieve feature depth extraction. At the same time, the combined residuals of different layers enable efficient utilization of shallow features and alleviate the problems of gradient vanishing and gradient descent.
[0047] Furthermore, in the first, second, third, and fourth levels, the number of Transformer blocks gradually increases from bottom to top. For shallow features, the feature map size is larger, and a larger number of Transformer blocks can achieve efficient extraction; for deep, detailed features, the feature map size is smaller, and a smaller number of Transformer blocks can achieve good feature extraction results.
[0048] Furthermore, in the fourth layer, residual connections are used to obtain fused image feature information, which is layer-normalized and then input into a multi-head attention mechanism to produce the output feature map. Because the multi-head attention mechanism can be computed in a highly parallel manner, both the image feature extraction effect and the computational efficiency are improved.
[0049] Furthermore, the initially extracted feature information is input into the prior knowledge transformation block PITM to fuse prior information and obtain a prior information set. Adaptive weight allocation is performed on the prior information set through 1×1 convolution and softmax layers, and the prior information set is further fused with image features according to the corresponding weights. The prior knowledge transformation block fuses image features with prior information, which effectively improves image feature recovery. At the same time, the residual fusion architecture improves image super-resolution quality by combining the shallow original input and alleviates the gradient explosion problem.
[0050] Furthermore, by performing affine transformations on the prior information and using operations such as translation, rotation, scaling, and shearing to obtain (γ,β), and then combining it with the feature map through convolution operations, the utilization efficiency of the prior information and the fusion result with image features are effectively improved.
[0051] Furthermore, by performing Hadamard product and summation operations on (γ,β) after affine transformation and the feature map, compared with traditional prior information fusion operations, prior information can be effectively combined with the feature map, which significantly improves the restoration and enhancement of image feature details and textures.
[0052] It is understandable that the beneficial effects of the second aspect mentioned above can be found in the relevant descriptions in the first aspect mentioned above, and will not be repeated here.
[0053] In summary, this invention effectively improves the performance of image super-resolution networks while reducing model computational complexity and accelerating the training process, making it suitable for applications in aerospace exploration, medical diagnosis, and disaster relief.
[0054] The technical solution of the present invention is described in further detail below with reference to the accompanying drawings and embodiments.
Attached Figure Description
[0055] Figure 1 is a schematic diagram of the SRCNN network structure;
[0056] Figure 2 is a schematic diagram of the VDSR network structure;
[0057] Figure 3 is a schematic diagram of the Transformer structure;
[0058] Figure 4 is an overall flowchart of the present invention;
[0059] Figure 5 is a network structure diagram of the present invention;
[0060] Figure 6 is a diagram of the lightweight Transformer L1 architecture;
[0061] Figure 7 is a diagram of the lightweight Transformer L2 architecture;
[0062] Figure 8 is a diagram of the lightweight Transformer L3 architecture;
[0063] Figure 9 is a diagram of the lightweight Transformer L4 architecture;
[0064] Figure 10 is a diagram of the PITM network structure;
[0065] Figure 11 shows images before and after downsampling, where (a) is the original high-resolution garbage image and (b) is the downsampled low-resolution garbage image;
[0066] Figure 12 compares images of different sizes output by HFE, where (a) is a 256×256 image, (b) is a 128×128 image, and (c) is a 64×64 image;
[0067] Figure 13 shows high-resolution images predicted through HFE alone;
[0068] Figure 14 shows high-resolution images predicted by HFE+EFFM.
Detailed Implementation
[0069] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0070] In the description of this invention, it should be understood that the terms "comprising" and "including" indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or collections thereof.
[0071] It should also be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in this specification and the appended claims, the singular forms “a,” “an,” and “the” are intended to include the plural forms unless the context clearly indicates otherwise.
[0072] It should be further understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations. For example, A and/or B can represent three cases: A alone, both A and B, and B alone. Additionally, the character "/" in this document generally indicates an "or" relationship between the preceding and following objects.
[0073] It should be understood that although terms such as first, second, third, etc., may be used in the embodiments of the present invention to describe the preset range, these preset ranges should not be limited to these terms. These terms are only used to distinguish the preset ranges from one another. For example, without departing from the scope of the embodiments of the present invention, the first preset range may also be referred to as the second preset range, and similarly, the second preset range may also be referred to as the first preset range.
[0074] Depending on the context, the word "if" as used here can be interpreted as "when," "once," "in response to determining," or "in response to detecting." Similarly, depending on the context, the phrase "if it is determined" or "if (the stated condition or event) is detected" can be interpreted as "when it is determined," "in response to determining," "when (the stated condition or event) is detected," or "in response to detecting (the stated condition or event)."
[0075] The accompanying drawings illustrate various structural schematic diagrams according to embodiments disclosed in this invention. These drawings are not to scale, and some details have been enlarged for clarity, and some details may have been omitted. The shapes of the various regions and layers shown in the drawings, as well as their relative sizes and positional relationships, are merely exemplary and may deviate from reality due to manufacturing tolerances or technical limitations. Furthermore, those skilled in the art can design regions / layers with different shapes, sizes, and relative positions as needed.
[0076] Referring to Figure 3, the Vision Transformer (ViT) introduced the Transformer into the field of image processing. Experiments demonstrate that ViT achieves better image processing results and that, with the same number of parameters, the Transformer is more computationally efficient than convolutional modules. The ViT model introduced the concept of an image patch, which consists of P×P pixels. Each patch is flattened and transformed into a fixed-length feature vector by a projection layer, then input into the Transformer's Encoder structure, much as a word vector represents a token (word) in NLP. Unlike CNNs, which capture local information within the convolutional window, the Transformer uses attention to capture correlations across the global context. First, the input patch is normalized, and a multi-head attention mechanism then extracts image features further, increasing the network's stability and robustness. Next, a residual connection merges in the original input, and the result passes through a normalization layer into a multilayer perceptron. After a second residual merge, the image features are output.
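As an illustration of the patch step described above, the following NumPy sketch splits an image into non-overlapping P×P patches, flattens each one, and applies a linear projection. The image size, patch size P = 4, embedding width D = 16, and random projection weights are hypothetical values chosen only to keep the example small.

```python
import numpy as np

def patch_embed(image, patch, weight):
    """Split an H x W x C image into non-overlapping patch x patch tiles,
    flatten each tile, and project it to a fixed-length vector."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, c) -> (gh, gw, patch, patch, c)
    tiles = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    flat = tiles.reshape(gh * gw, patch * patch * c)   # one row per patch
    return flat @ weight                               # (num_patches, D)

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8, 3))
P, D = 4, 16                                  # hypothetical patch size and width
W_proj = rng.standard_normal((P * P * 3, D))  # hypothetical projection layer
tokens = patch_embed(image, P, W_proj)        # 4 patches, each a 16-vector
```

Each row of `tokens` plays the role that a word vector plays in NLP: it is the fixed-length representation that the encoder consumes.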
[0077] This invention combines an improved Transformer with residual connections for image super-resolution. To reduce the computational complexity of the Transformer, it alternates between axial attention modules and multi-head attention modules, reducing the original quadratic computational complexity while maintaining feature extraction quality. To capture both shallow features and image detail information, a four-level hierarchical feature extraction structure balances the two. Finally, to further improve the super-resolution result, prior image information is incorporated during feature extraction. Together, the improved Transformer, residual connections, and prior image information enhance network performance and improve the efficiency of image super-resolution.
[0078] This invention provides a lightweight Transformer-based super-resolution method for garbage images, comprising a lightweight Transformer block, a prior knowledge transformation block, and a feature fusion block. The lightweight Transformer block implements a multi-scale hierarchical network to acquire image detail features at different levels, improving image quality. It also integrates hierarchical features using residual operations to strengthen the contribution of shallow features and obtain more detailed information. The prior knowledge transformation block adds prior information to the structural information obtained at different scales and layers, improving the robustness of image super-resolution. Based on the concept of binocular complementarity in biotechnology, the feature fusion block fuses the parallel prior knowledge transformation blocks through residual learning operations to extract complementary features and improve the pixel count of the predicted image. Finally, a high-quality image is reconstructed through a convolution. Specifically:
[0079] The first part, the lightweight Transformer block, improves on the standard Transformer block. This invention connects the axial attention module, multi-head attention module, and feedforward network in different ways to obtain four different Transformer blocks, reducing the original quadratic computational complexity while preserving feature extraction quality.
[0080] The second part is the Prior Information Transformation Module (PITM). This module uses dynamic convolution to extract input features efficiently and, at the same time, uses a combination of convolution and pooling to generate two different pieces of prior knowledge, γ and β. By adaptively assigning weights to the two pieces of prior information, it achieves efficient fusion of image features and prior information.
[0081] The third part is the Efficient Feature Fusion Module (EFFM). EFFM fuses four levels of feature information with prior knowledge in a hierarchical, weighted manner through operations such as 3D convolution, 5×5 convolution, and pooling. It also incorporates Principal Component Analysis (PCA) for dimensionality reduction and Dropout to improve training speed, further optimizing network performance. Finally, the features are merged to form the EFFM output. This invention improves the super-resolution performance of garbage images while reducing computational complexity, achieving a highly efficient and lightweight super-resolution network.
[0082] Therefore, this invention can be widely applied to applications such as garbage image recognition and classification. Through super-resolution processing of garbage images, it has significant improvements in various application fields. Since image super-resolution is constantly influencing and promoting the progress of related fields such as aerospace exploration, medical diagnosis and disaster relief, this invention has important research and practical significance.
[0083] Referring to Figure 4, this invention provides a lightweight Transformer-based super-resolution method for garbage images, comprising the following steps:
[0084] S1. Perform bicubic downsampling on the input image to obtain a low-resolution image;
[0085] The original image is downsampled using bicubic interpolation to obtain a low-resolution image, which is then input into the super-resolution model proposed in this invention to obtain the predicted high-resolution image.
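The downsampling in step S1 can be sketched as follows. This is a simplified bicubic resampler written for illustration: the Keys kernel coefficient a = -0.5, the border replication, and the absence of an anti-aliasing prefilter are assumptions, and image libraries may differ in these details.

```python
import numpy as np

def cubic_kernel(x, a=-0.5):
    """Keys cubic convolution kernel (a = -0.5 is a common choice)."""
    x = np.abs(x)
    near = (a + 2) * x**3 - (a + 3) * x**2 + 1
    far = a * x**3 - 5 * a * x**2 + 8 * a * x - 4 * a
    return np.where(x <= 1, near, np.where(x < 2, far, 0.0))

def resize_axis(img, out_len, axis):
    """Resample one axis with cubic interpolation at output pixel centers."""
    in_len = img.shape[axis]
    scale = in_len / out_len
    centers = (np.arange(out_len) + 0.5) * scale - 0.5   # source coordinates
    base = np.floor(centers).astype(int)
    taps = base[:, None] + np.arange(-1, 3)[None, :]     # 4 neighbours each
    weights = cubic_kernel(centers[:, None] - taps)
    weights /= weights.sum(axis=1, keepdims=True)
    taps = np.clip(taps, 0, in_len - 1)                  # replicate borders
    moved = np.moveaxis(img, axis, 0).astype(float)
    gathered = moved[taps]                               # (out_len, 4, ...)
    w = weights.reshape(weights.shape + (1,) * (moved.ndim - 1))
    return np.moveaxis((gathered * w).sum(axis=1), 0, axis)

def bicubic_downsample(img, factor):
    """Downsample an H x W (x C) array by an integer factor."""
    h, w = img.shape[:2]
    out = resize_axis(img, h // factor, axis=0)
    return resize_axis(out, w // factor, axis=1)

hr = np.tile(np.arange(8.0), (8, 1))   # toy 8 x 8 horizontal ramp
lr = bicubic_downsample(hr, 2)         # 4 x 4 low-resolution version
```

Away from the borders, the ramp is reproduced exactly at the new pixel centers, which is a quick sanity check that the kernel weights are correct.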
[0086] The high- and low-resolution images selected for this model are both 512×512, and the dataset was downloaded and collected independently.
[0087] S2. Input the low-resolution image obtained in step S1 into the Hierarchical Feature Extraction (HFE) module. Through hierarchical extraction and addition of residual information, obtain image detail feature information at four different levels.
[0088] While the Transformer has achieved better performance than traditional convolutional modules in image processing, its model complexity and computational cost are relatively high. Therefore, this invention connects the axial attention module, multi-head attention module, and feedforward network in different ways to obtain four different Transformer blocks. This reduces the original quadratic computational complexity while maintaining feature extraction quality, making the model lightweight. Furthermore, to improve super-resolution performance, this invention integrates prior information into the image features.
[0089] The network designed in this invention consists of two stages: a hierarchical feature extraction module (HFE) and a feature fusion block (EFFM).
[0090] HFE is responsible for outputting image detail feature information at different levels, while EFFM is responsible for combining four types of image features and fusing prior information with image features to further predict high-resolution images.
[0091] Referring to Figure 5, the first-stage hierarchical feature extraction module HFE consists of 20 improved lightweight Transformer blocks, 3 PCA modules, and multiple convolutions.
[0092] The first level of HFE consists of 8 Transformer L1 blocks. According to the order of data processing, the first 4 blocks perform preliminary extraction and are processed by the PCA module before being input into the next level. The last 4 blocks extract the image features again after adding residual information and output them to EFFM.
[0093] The second level consists of 6 Transformer L2 blocks. The first 3 blocks extract features from the upper level and continue to output them downwards. The last 3 blocks further process the features after merging the original L1 and L3 residuals and output them to EFFM.
[0094] The third layer consists of four Transformer L3 blocks. The first two blocks extract features from the upper layer and then input them into the fourth layer via a PCA module and a 5×5 convolution. The last two blocks process the residual information and output them to the EFFM.
[0095] The last layer consists of only two Transformer L4 blocks. This layer extracts the feature information after fusing the residuals of the first and third layers and outputs it to EFFM.
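The block counts given for the four HFE levels can be written down as a small configuration sketch. The dictionary name, the function name, and the representation of each level as a pair of pass sizes are illustrative choices, not part of the patent.

```python
# Block counts per HFE level as stated in the text (levels one to four).
HFE_BLOCKS = {"L1": 8, "L2": 6, "L3": 4, "L4": 2}

def split_passes(blocks):
    """Split each level's blocks into a first pass (extraction on the way
    down) and a second pass (refinement after residuals are merged).
    The fourth level is treated as a single pass."""
    plan = {}
    for level, n in blocks.items():
        if level == "L4":
            plan[level] = (n, 0)
        else:
            plan[level] = (n // 2, n - n // 2)
    return plan

plan = split_passes(HFE_BLOCKS)       # e.g. plan["L1"] is (4, 4)
total = sum(HFE_BLOCKS.values())      # 20 blocks in total
```

The total of 20 agrees with the count of improved lightweight Transformer blocks stated for the HFE module.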
[0096] The degraded image input to the model, $I \in \mathbb{R}^{H \times W \times 3}$, first passes through a single convolution to obtain low-level image features:
[0097] $F_o \in \mathbb{R}^{H \times W \times C}$
[0098] where C is the number of channels, H is the height of the features, and W is the width of the features.
[0099] These shallow features $F_o$ are transformed into deep features by the four-level improved Transformer block structure:
[0100] $F_d \in \mathbb{R}^{H \times W \times 2C}$
[0101] Each layer contains multiple Transformer blocks. To ensure efficient feature extraction, the number of Transformer blocks in each layer gradually increases from bottom to top. Starting with a low-resolution input, the hierarchical structure reduces the spatial size of image features layer by layer while expanding the number of channels to ensure that more image details can be captured. The final image features output by Transformer L4 layer are:
[0102] $F_l \in \mathbb{R}^{H/8 \times W/8 \times 8C}$
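The shape of the fourth-level output is consistent with each step down the hierarchy halving the spatial size and doubling the channel count. The sketch below tabulates the shapes under that assumption; the per-level rule and the base width C = 32 are hypothetical, since the text states only the final shape.

```python
def level_shapes(h, w, c, levels=4):
    """Feature-map shape at each level, assuming every step down halves the
    spatial size and doubles the channel count."""
    return [(h // 2**k, w // 2**k, c * 2**k) for k in range(levels)]

shapes = level_shapes(512, 512, 32)   # C = 32 is a hypothetical width
# shapes[0] is the first level (H, W, C); shapes[-1] is (H/8, W/8, 8C)
```

For a 512×512 input the lower three levels come out at 256×256, 128×128, and 64×64, which match the image sizes listed for Figure 12.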
[0103] Furthermore, residual connections are added to the corresponding Transformer blocks in the same layer. When the image features pass through the hierarchical model from bottom to top for the second time, a convolutional layer is applied to the refined features to generate a residual image:
[0104] $R \in \mathbb{R}^{H \times W \times 3}$
[0105] The degraded image is then added to this residual image, which allows shallow feature information to be used effectively.
[0106] In the improved Transformer block, the input image features are $X \in \mathbb{R}^{h \times w \times c}$. Let $X_{r_i}, X_{r_j} \in \mathbb{R}^{c}$ and $X_{c_i}, X_{c_j} \in \mathbb{R}^{c}$ denote the average feature vectors of the i-th and j-th rows and of the i-th and j-th columns of X, respectively. The axial attention score based on rows and columns is calculated as follows:
[0107]
[0108]
[0109] Here, $W_{rq}$, $W_{rk}$, $W_{cq}$, and $W_{ck}$ are the trainable Query and Key parameters for rows and columns, respectively; the two positional terms are the relative positional encoding values between rows and between columns, respectively; λ1, μ1, λ2, and μ2 are the coefficients controlling the magnitude of the row and column calculations, respectively; and $b_{r1}$, $b_{r2}$, $b_{c1}$, and $b_{c2}$ are the offset parameters for rows and columns, respectively.
[0110] The standard self-attention mechanism has quadratic complexity, $O(n^2)$. In comparison, the complexity of axial attention is only $O(2n^{3/2})$, which significantly reduces the computational cost of the original Transformer block.
[0111] Subsequently, residual connections are used to obtain the fused image feature information, which is layer-normalized and then input into a multi-head attention mechanism. The calculation is as follows:
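The two complexity figures can be checked with simple counting. The sketch assumes that n is the number of spatial positions of an h×w feature map and that cost is measured as the number of pairwise attention scores; the 64×64 size is only an example.

```python
def full_attention_cost(h, w):
    """Pairwise scores for self-attention over all n = h * w positions."""
    n = h * w
    return n * n

def axial_attention_cost(h, w):
    """Row attention compares positions within each row (h * w * w scores);
    column attention compares positions within each column (w * h * h)."""
    return h * w * w + w * h * h

h = w = 64
full = full_attention_cost(h, w)      # n^2 with n = 4096
axial = axial_attention_cost(h, w)    # 2 * n^(3/2) for a square map
ratio = full // axial
```

For this 64×64 map the full attention needs 16,777,216 scores and the axial variant 524,288, a factor of 32 fewer.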
[0112]
[0113]
[0114] Here, X is the input feature map and the result of the calculation is the output feature map. The Query, Key, and Value matrices are obtained by reshaping tensors of the original size; $W_p$ is a 1×1 convolution; and α is a learnable scale parameter that controls the magnitude of the dot product between K and Q before the softmax function is applied.
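Because the equations themselves are not reproduced in this text, the following NumPy sketch shows only one plausible reading of the symbol definitions: softmax attention whose dot product is divided by α, followed by a 1×1 convolution $W_p$ and a residual connection. The exact form, the single head, and the projection matrices `Wq`, `Wk`, and `Wv` are assumptions for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along one axis."""
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_attention(X, Wq, Wk, Wv, Wp, alpha):
    """X: (H, W, C). Returns X + Wp(Attention(Q, K, V)) under the assumed form."""
    H, W, C = X.shape
    tokens = X.reshape(H * W, C)            # reshape from the original size
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = softmax(Q @ K.T / alpha)       # alpha scales the Q-K dot product
    out = (scores @ V) @ Wp                 # 1x1 conv == per-pixel linear map
    return X + out.reshape(H, W, C)         # residual connection

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4, 8))
Wq, Wk, Wv, Wp = (rng.standard_normal((8, 8)) * 0.1 for _ in range(4))
Y = scaled_attention(X, Wq, Wk, Wv, Wp, alpha=2.0)
```

A larger α flattens the softmax distribution and a smaller α sharpens it, which is the sense in which α controls the magnitude of the dot product before the softmax.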
[0115] The four lightweight Transformer module structures are shown in Figures 6 to 9.
[0116] Referring to Figure 6, Transformer L1 adopts a parallel two-branch structure. The first branch connects row attention and column attention in series, each followed by dropout. The result is then processed by normalization and a feedforward network before output.
[0117] Referring to Figure 7, Transformer L2 first connects row attention and column attention in parallel. After pooling and normalization, the result is input into the multi-head attention mechanism and, after a further normalization and feedforward network, is output. Each module uses residual connections.
[0118] Referring to Figure 8, Transformer L3 first connects row attention and multi-head attention in parallel, then connects column attention and multi-head attention in parallel. Finally, the output is processed through normalization and a feedforward network.
[0119] Referring to Figure 9, Transformer L4 sequentially connects layer normalization, row attention, layer normalization, column attention, layer normalization, and multi-head attention, and finally outputs the result after layer normalization and a feedforward network.
[0120] Please see Figure 10 The image features and prior information are simultaneously input into the PITM module. The prior information is obtained by passing through a convolutional layer and affine transformation to obtain (γ, β), and then the respective weights are obtained by using softmax. At the same time, the image features are processed by dynamic convolution and then the ⊙ and + operations are performed with γ and β in sequence to obtain the output that fuses the prior information.
[0121] S3. Input the four types of image detail feature information obtained in step S2 into EFFM, process the image detail feature information through convolution and lightweight operations, then integrate the prior knowledge into the four types of image detail feature information in a hierarchical manner, and use the residual fusion architecture to merge and output the features.
[0122] To further obtain high-quality super-resolution images, this invention designs EFFM to process the image features output by HFE, as follows:
[0123] First, the input features are initially extracted using different convolutions, while maintaining a consistent feature map dimension. Then, the feature information is input into a PITM to fuse prior information; the prior information ψ is modeled by a pair of affine transformation parameters (γ, β) through a mapping function M: ψ → (γ, β); the function is calculated as follows:
[0124]
[0125] (γ,β)=M(ψ)
[0126] Here, (γ, β) is the prior information obtained through convolution, x is the processed prior information pair, G is the set of prior information after processing, and θ denotes the two different convolution operations performed on (γ, β).
[0127] After the set of prior information is obtained, adaptive weights are assigned to the prior information pairs through a 1×1 convolution and a softmax layer. The prior information pairs are then fused with the image features according to the corresponding weights, as follows:
[0128] (F|γ,β)=w1γ⊙F+w2β
[0129] Where F represents the input image feature information, ⊙ represents the Hadamard product, w1 represents the γ weight, and w2 represents the β weight.
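The weighted fusion w1γ⊙F + w2β can be sketched on flat lists as follows. In this minimal sketch the two scalar scores passed to the softmax are hypothetical stand-ins for the output of the 1×1 convolution, and the function names and toy values are likewise assumptions made for illustration.

```python
import math
from typing import List, Tuple


def softmax_pair(a: float, b: float) -> Tuple[float, float]:
    """Softmax over two scores, returning weights that sum to 1."""
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb), eb / (ea + eb)


def fuse_prior(F: List[float], gamma: List[float], beta: List[float],
               score_gamma: float, score_beta: float) -> List[float]:
    """Compute w1 * gamma ⊙ F + w2 * beta element by element."""
    w1, w2 = softmax_pair(score_gamma, score_beta)
    return [w1 * g * f + w2 * b for f, g, b in zip(F, gamma, beta)]


F = [1.0, 2.0, 3.0]      # input image features (toy values)
gamma = [2.0, 2.0, 2.0]  # scaling part of the prior pair
beta = [1.0, 1.0, 1.0]   # shifting part of the prior pair
# Equal scores give w1 = w2 = 0.5.
fused = fuse_prior(F, gamma, beta, score_gamma=0.0, score_beta=0.0)
```

Because the weights come from a softmax they always sum to 1, so the scaling term γ⊙F and the shifting term β trade off against each other rather than growing independently.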
[0130] S4. Reconstruct the pixels of the image detail feature information output in step S3 to obtain the predicted high-resolution image.
[0131] In another embodiment of the present invention, a lightweight Transformer-based super-resolution system for junk images is provided. This system can be used to implement the aforementioned lightweight Transformer-based super-resolution method for junk images. Specifically, the lightweight Transformer-based super-resolution system for junk images includes an extraction module, a merging module, and a reconstruction module.
[0132] The extraction module performs layered extraction on the low-resolution image and adds residual information to obtain image detail feature information at four different levels.
[0133] The merging module processes image detail features at four different levels through convolution and lightweight operations, then integrates prior knowledge into the image detail features at four different levels, and finally merges the features using a residual fusion architecture.
[0134] The reconstruction module performs pixel reconstruction on the output image detail features to obtain the predicted high-resolution image.
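The division of the system into three modules can be sketched as a simple composition. The class name `SuperResolutionSystem` and the three toy functions are hypothetical placeholders used only to show the data flow; they do not implement the Transformer blocks, the prior fusion, or the pixel reconstruction of the invention.

```python
from dataclasses import dataclass
from typing import Callable, List

Image = List[List[float]]
Features = List[Image]


@dataclass
class SuperResolutionSystem:
    extract: Callable[[Image], Features]   # extraction module: features at four levels
    merge: Callable[[Features], Image]     # merging module: prior fusion and residual merge
    reconstruct: Callable[[Image], Image]  # reconstruction module: pixel reconstruction

    def __call__(self, low_res: Image) -> Image:
        return self.reconstruct(self.merge(self.extract(low_res)))


def toy_extract(img: Image) -> Features:
    """Placeholder: return four copies instead of four levels of real features."""
    return [img for _ in range(4)]


def toy_merge(features: Features) -> Image:
    """Placeholder: average the four maps element by element."""
    h, w = len(features[0]), len(features[0][0])
    return [[sum(f[i][j] for f in features) / len(features) for j in range(w)]
            for i in range(h)]


def toy_reconstruct(img: Image) -> Image:
    """Placeholder: 2x nearest-neighbour upsampling."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(2)]
        out.extend([wide, list(wide)])
    return out


system = SuperResolutionSystem(toy_extract, toy_merge, toy_reconstruct)
prediction = system([[1.0, 2.0]])
```

Keeping the three stages as separate callables mirrors the module split described above and lets each stage be replaced or tested on its own.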
[0135] In another embodiment of the present invention, a terminal device is provided, comprising a processor and a memory. The memory stores a computer program, which includes program instructions. The processor executes the program instructions stored in the computer storage medium. The processor may be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. It is the computing and control core of the terminal, suitable for implementing one or more instructions, specifically suitable for loading and executing one or more instructions to implement a corresponding method flow or corresponding function. The processor described in this embodiment of the present invention can be used for the operation of a lightweight Transformer-based super-resolution method for junk images, including:
[0136] Low-resolution images are extracted hierarchically and residual information is added to obtain image detail features at four different levels. These features are then processed through convolution and lightweight operations. Prior knowledge is then hierarchically fused into these features, and the features are merged and output using a residual fusion architecture. Finally, the output image detail features are reconstructed pixel by pixel to obtain the predicted high-resolution image.
[0137] In another embodiment of the present invention, a storage medium is also provided, specifically a computer-readable storage medium (memory). This computer-readable storage medium is a memory device in a terminal device used to store programs and data. It is understood that the computer-readable storage medium here can include both the built-in storage medium in the terminal device and extended storage media supported by the terminal device. The computer-readable storage medium provides storage space that stores the terminal's operating system. Furthermore, this storage space also stores one or more instructions suitable for loading and execution by a processor. These instructions can be one or more computer programs (including program code). It should be noted that the computer-readable storage medium here can be high-speed RAM or non-volatile memory, such as at least one disk storage device.
[0138] One or more instructions stored in a computer-readable storage medium can be loaded and executed by a processor to implement the corresponding steps of the lightweight Transformer-based garbage image super-resolution method in the above embodiments; one or more instructions in the computer-readable storage medium are loaded by the processor and executed as follows:
[0139] Low-resolution images are extracted hierarchically and residual information is added to obtain image detail features at four different levels. These features are then processed through convolution and lightweight operations. Prior knowledge is then hierarchically fused into these features, and the features are merged and output using a residual fusion architecture. Finally, the output image detail features are reconstructed pixel by pixel to obtain the predicted high-resolution image.
[0140] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. The components of the embodiments of the present invention described and shown in the accompanying drawings can generally be arranged and designed in various different configurations. Therefore, the following detailed description of the embodiments of the present invention provided in the accompanying drawings is not intended to limit the scope of the claimed invention, but merely to illustrate selected embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort are within the scope of protection of the present invention.
[0141] Example
[0142] The original input image size is 512×512. In order to extract more image detail information, this invention uses an efficient hierarchical feature extraction module to halve the H and W of the image layer by layer. Therefore, the sizes of the subsequent three layers of images are 256×256, 128×128, and 64×64, respectively.
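The layer-by-layer halving described in this example can be reproduced with a few lines. The function name `level_sizes` is illustrative, and the sketch assumes that both dimensions are divisible by 2 at every level.

```python
from typing import List, Tuple


def level_sizes(height: int, width: int, levels: int) -> List[Tuple[int, int]]:
    """Spatial size at each level when H and W are halved from one level to the next."""
    sizes = []
    for _ in range(levels):
        sizes.append((height, width))
        height, width = height // 2, width // 2
    return sizes


# 512 x 512 input followed by three halved levels.
sizes = level_sizes(512, 512, levels=4)
```

Each halving cuts the number of spatial positions by a factor of four, so the deepest level works on 64 times fewer positions than the input.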
[0143] Figures 11 to 14 use example images to illustrate the processing stages, as detailed below:
[0144] The original high-resolution garbage image in Figure 11(a) is used as the input for predicting the high-resolution image with the HFE network. First, a low-resolution image is obtained by downsampling, as shown in Figure 11(b).
[0145] Referring to Figure 12, HFE uses a four-layer lightweight Transformer module to process the low-resolution image of Figure 11(b), obtaining image detail features of sizes 256×256, 128×128, and 64×64, which facilitates the extraction of image detail information and the further recovery of the high-resolution image.
[0146] Referring to Figure 13, to further verify the super-resolution performance of EFFM after fusing prior information, pixel reconstruction was first performed only on the image features output by HFE to obtain a predicted image. Compared with the original low-resolution image, this predicted image recovered some detail, but the overall result was not ideal. The EFFM network was therefore added back into the overall model, with the four-level extraction results of HFE input into EFFM. Prior information from the garbage image was weighted and fused, and the results were merged step by step. Finally, the image features processed by EFFM were merged and input into the pixel reconstruction module, and upsampling was performed to obtain the predicted high-quality image, as shown in Figure 14.
[0147] In summary, this invention presents a lightweight Transformer-based super-resolution method and system for garbage images that uses deep neural networks to handle the image super-resolution task. Building on the strong performance of the Transformer in image processing, the invention improves the Transformer's internal structure, maintaining image processing quality while reducing computational complexity and thereby meeting the lightweight requirements of mobile devices. The four-level hierarchical feature extraction structure not only captures more image details but also uses residual structures to incorporate image features from the shallower levels, further enhancing network performance. In addition, to improve super-resolution quality, the invention embeds prior information into EFFM: different convolutional layers process the prior information to generate γ and β, which are then combined to perform a fusion operation on the input features. EFFM also employs a multi-layer feature fusion structure, fusing the weighted prior information into the four outputs of HFE (the hierarchical feature extraction module) and then merging them into a single output. The invention improves the standard Transformer block and combines it with PCA and convolution to form HFE, while EFFM integrates prior information into the image feature information layer by layer. This not only improves the super-resolution quality of garbage images but also reduces the computational complexity of the original Transformer-based image processing model, making the super-resolution model lightweight. Therefore, this invention has good application value and research significance.
[0148] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is merely an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit. Furthermore, the specific names of the functional units and modules are only for easy differentiation and are not intended to limit the scope of protection of this application. The specific working process of the units and modules in the above system can be referred to the corresponding process in the foregoing method embodiments, and will not be repeated here.
[0149] In the above embodiments, the descriptions of each embodiment have different focuses. For parts that are not described in detail or recorded in a certain embodiment, please refer to the relevant descriptions of other embodiments.
[0150] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this invention.
[0151] In the embodiments provided by this invention, it should be understood that the disclosed devices / terminals and methods can be implemented in other ways. For example, the device / terminal embodiments described above are merely illustrative. For instance, the division of modules or units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between devices or units may be electrical, mechanical, or other forms.
[0152] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0153] Furthermore, the functional units in the various embodiments of the present invention can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0154] If the integrated module / unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments can also be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above. The computer program includes computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms. The computer-readable medium can include: any entity or device capable of carrying the computer program code, recording media, USB flash drives, portable hard drives, magnetic disks, optical disks, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunication signals, and software distribution media, etc. It should be noted that the content included in the computer-readable medium can be appropriately added or removed according to the requirements of legislation and patent practice in the jurisdiction. For example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunication signals.
[0155] This application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0156] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0157] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0158] The above content is only for illustrating the technical concept of the present invention and should not be construed as limiting the scope of protection of the present invention. Any modifications made to the technical solution based on the technical concept proposed in this invention shall fall within the scope of protection of the claims of this invention.
Claims
1. A lightweight Transformer-based super-resolution method for garbage images, characterized in that, Includes the following steps: Low-resolution images are extracted in layers and residual information is added to obtain image detail feature information at four different levels, specifically: The hierarchical feature extraction module outputs image detail feature information at different levels. The feature fusion block combines four types of image features and fuses prior information with image features to obtain four different levels of image detail feature information. The hierarchical feature extraction module consists of four levels, as follows: The first level consists of 8 Transformer L1 blocks. According to the data processing order, the first 4 Transformer L1 blocks perform preliminary extraction and are then processed by the PCA module before being input into the second level. The last 4 Transformer L1 blocks are used to extract the image features again after adding residual information and then output them to the feature fusion block. The second level includes 6 Transformer L2 blocks. The first 3 Transformer L2 blocks are used to extract features from the upper level and continue to output them downwards. The last 3 Transformer L2 blocks are used to further process the features after merging the Transformer L1 and Transformer L3 residuals and output them to the feature fusion block. The third layer includes four Transformer L3 blocks. The first two Transformer L3 blocks are used to extract features from the upper level and then input them into the fourth layer through a PCA module and a 5×5 convolution. The last two Transformer L3 blocks are output to the feature fusion block after processing the fusion residual information. 
The fourth level includes two Transformer L4 blocks, which are used to extract the feature information after fusing the residuals of the first and third levels and output it to the feature fusion block; Image detail features at four different levels are processed through convolution and lightweight operations. Prior knowledge is then hierarchically fused into the image detail features at the four different levels. The features are then merged and output using a residual fusion architecture. The output image detail features are reconstructed pixel by pixel to obtain the predicted high-resolution image.
2. The lightweight Transformer super-resolution method for garbage images according to claim 1, characterized in that, In the first, second, third, and fourth levels, the number of Transformer blocks gradually increases from bottom to top.
3. The lightweight Transformer super-resolution method for garbage images according to claim 1, characterized in that, in the fourth layer, residual connections are used to obtain fused image feature information; after layer normalization, the information is input into a multi-head attention mechanism, and the output feature map is given by: … where $X$ is the input feature map; the Query, Key, and Value matrices are obtained by reshaping tensors of the original size; $W_p$ is a 1×1 convolution; and $\alpha$ is a learnable scale parameter; the output image features are given by: … where the remaining symbols denote, in order, the feature map, the number of feature map channels, the height of the feature map, and the width of the feature map.
4. The lightweight Transformer super-resolution method for garbage images according to claim 1, characterized in that, the specific steps for merging features using a residual fusion architecture are as follows: The input features are initially extracted through different convolutions while keeping the feature map dimensions consistent. The initially extracted feature information is then input into the prior knowledge transformation block PITM to fuse prior information and obtain a prior information set. Adaptive weight allocation is performed on the prior information set through 1×1 convolution and softmax layers, and the prior information set is further fused with image features according to the corresponding weights.
5. The lightweight Transformer super-resolution method for garbage images according to claim 4, characterized in that, the prior information ψ is modeled by a pair of affine transformation parameters (γ, β) through a mapping function M: ψ → (γ, β), and the function is calculated as follows: (γ, β) = M(ψ), where (γ, β) are obtained from the prior information through convolution operations, x is the processed prior information pair, G is the set of prior information after processing, and θ denotes the two different convolution operations performed on (γ, β).
6. The lightweight Transformer super-resolution method for garbage images according to claim 4, characterized in that, the image features are further fused according to the corresponding weights, as follows: (F|γ,β)=w1γ⊙F+w2β, where F is the input image feature information, ⊙ is the Hadamard product, w1 is the weight of γ, and w2 is the weight of β.
7. The lightweight Transformer super-resolution method for garbage images according to claim 1, characterized in that, Low-resolution images are obtained by downsampling the original image using bicubic interpolation.
8. A lightweight Transformer-based super-resolution system for garbage images, characterized in that, include: The extraction module performs layered extraction on the low-resolution image and adds residual information to obtain image detail feature information at four different levels, specifically: The hierarchical feature extraction module outputs image detail feature information at different levels. The feature fusion block combines four types of image features and fuses prior information with image features to obtain four different levels of image detail feature information. The hierarchical feature extraction module consists of four levels, as follows: The first level consists of 8 Transformer L1 blocks. According to the data processing order, the first 4 Transformer L1 blocks perform preliminary extraction and are then processed by the PCA module before being input into the second level. The last 4 Transformer L1 blocks are used to extract the image features again after adding residual information and then output them to the feature fusion block. The second level includes 6 Transformer L2 blocks. The first 3 Transformer L2 blocks are used to extract features from the upper level and continue to output them downwards. The last 3 Transformer L2 blocks are used to further process the features after merging the Transformer L1 and Transformer L3 residuals and output them to the feature fusion block. The third layer includes four Transformer L3 blocks. The first two Transformer L3 blocks are used to extract features from the upper level and then input them into the fourth layer through a PCA module and a 5×5 convolution. The last two Transformer L3 blocks are output to the feature fusion block after processing the fusion residual information. 
The fourth level includes two Transformer L4 blocks, which are used to extract the feature information after fusing the residuals of the first and third levels and output it to the feature fusion block; The merging module processes image detail features at four different levels through convolution and lightweight operations, then integrates prior knowledge into the image detail features at four different levels, and finally merges the features using a residual fusion architecture. The reconstruction module performs pixel reconstruction on the output image detail features to obtain the predicted high-resolution image.