An image super-resolution reconstruction method based on a multi-scale content-aware mixer
The image super-resolution reconstruction method using a multi-scale content-aware mixer solves the problems of high computational complexity and rigid resource allocation in traditional models, achieving efficient image reconstruction on hardware devices and improving the accuracy and computational efficiency of image reconstruction.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- XIDIAN UNIV
- Filing Date
- 2026-03-08
- Publication Date
- 2026-06-23
AI Technical Summary
Existing deep learning-based image super-resolution models suffer from high computational complexity, large memory consumption, and difficulty in handling large images. Furthermore, traditional acceleration frameworks suffer from path rigidity, decision rigidity, and lack of discriminability, making them difficult to deploy on actual hardware devices.
An image super-resolution reconstruction method based on a multi-scale content-aware mixer is adopted. By introducing a feature enhancement mechanism, a multi-scale content-aware mechanism, and a dynamic multi-scale large kernel attention mechanism, it achieves accurate classification of image regions and on-demand allocation of computing resources, including shallow feature extraction, feature pyramid enhancement, multi-scale content-aware prediction, and adaptive dynamic path selection.
While ensuring reconstruction quality, it significantly reduces computational complexity and memory usage, achieving more refined allocation of computing resources and more efficient inference acceleration, thereby improving the accuracy and computational efficiency of image reconstruction.
Figure CN122265033A_ABST
Description
Technical Field
[0001] The embodiments of this application relate to the field of image processing technology, and in particular to an image super-resolution reconstruction method based on a multi-scale content-aware mixer.
Background Technology
[0002] Mobile devices, drones, and microsatellites are limited by size, weight, and power consumption, making it difficult to incorporate large-size, high-precision optical lenses and sensors. Super-resolution technology offers a low-cost software solution that can significantly improve image quality without upgrading hardware. In fields such as medical imaging, public safety monitoring, and satellite remote sensing, it is often necessary to recover lesions, facial features, or ground details from blurry or low-quality original images. In these scenarios, re-acquiring images is often prohibitively costly, making super-resolution reconstruction a necessary means to obtain high-value information. Traditional super-resolution reconstruction methods heavily rely on interpolation or a large amount of prior knowledge, resulting in limited reconstruction effects and difficulty in recovering complex high-frequency texture details. With the rise of deep learning, deep learning-based super-resolution methods, with their powerful feature learning capabilities, no longer require excessive explicit prior knowledge, and the visual quality and objective metrics of the reconstructed images are significantly better than traditional methods. However, in practical applications, limitations in computational resources and algorithm deployment remain challenges.
[0003] The goal of single-image super-resolution is to reconstruct a high-resolution image from a degraded low-resolution image. This technique is widely used in fields such as medical imaging, security monitoring, and satellite imagery. However, current deep learning-based super-resolution models, while pursuing high performance, face challenges such as high computational complexity, large memory consumption, and difficulty in handling large images (such as 4K and 8K), making them difficult to deploy on practical hardware devices.
[0004] Although some research teams have proposed "lightweight networks" and "acceleration frameworks" to address the above challenges, the proposed solutions still have the following core problems.
[0005] First, the path is rigid. Traditional acceleration frameworks usually make routing decisions in units of fixed-size grids (such as 16×16 or 32×32), assuming uniform complexity within the block, and the decision mechanism is often based on simple hard thresholds, which seriously lacks flexibility.
[0006] Second, decision-making is rigid. Traditional acceleration frameworks often rely on simple, low-capacity models or hard thresholds to make either/or binary decisions, which leads to inflexible and inaccurate classification.
[0007] Third, lack of discriminability. Current lightweight networks typically execute the same computation graph for all image regions, regardless of whether a region is flat or has complex textures. The network consumes almost the same FLOPs (floating-point operations) and memory for every region and cannot dynamically accelerate simple content, which leads to unavoidable computational waste.
Summary of the Invention
[0008] To address the aforementioned technical issues, embodiments of this application propose an image super-resolution reconstruction method based on a multi-scale content-aware mixer. By introducing a feature enhancement mechanism, a multi-scale content-aware mechanism, and a dynamic multi-scale large kernel attention mechanism, it achieves accurate classification of image regions and on-demand allocation of computing resources, significantly reducing computational complexity and memory usage while ensuring reconstruction quality.
[0009] To achieve the above objectives, embodiments of this application propose an image super-resolution reconstruction method based on a multi-scale content-aware mixer, implemented using an adaptive processing mechanism. The method includes the following steps: shallow feature extraction of the low-resolution image to be reconstructed, mapping the low-resolution image from pixel space to feature space to obtain an initial shallow feature map; feature enhancement based on feature pyramids and attention mechanisms on the shallow feature map to obtain a deep feature map containing rich multi-scale spatial information; multi-scale content-aware prediction based on the deep feature map, generating guiding information including a window classification binary mask and window size to guide computational allocation; wherein the window classification binary mask classifies different image regions in the deep feature map into easy or hard classes; based on the guiding information, different image regions are assigned to different computational paths for processing; the feature maps output by each computational path are recombined and fused, and then the resolution of the recombined and fused feature map is enlarged to the target resolution to finally obtain a high-resolution image.
[0010] To achieve the above objectives, embodiments of this application also propose an image super-resolution reconstruction system based on a multi-scale content-aware mixer, used to implement the image super-resolution reconstruction method based on a multi-scale content-aware mixer as described above. The system includes: a shallow feature extraction module, used to extract shallow features from the low-resolution image to be reconstructed, mapping the low-resolution image from pixel space to feature space to obtain an initial shallow feature map; a feature enhancement module, used to enhance the shallow feature map based on feature pyramids and attention mechanisms to obtain a deep feature map containing rich multi-scale spatial information; a multi-scale content-aware prediction module, used to perform multi-scale content-aware prediction based on the deep feature map, generating guiding information including a window classification binary mask and window size to guide computation allocation, the window classification binary mask classifying different image regions in the deep feature map into easy or hard classes; an adaptive dynamic path selection module, used to allocate different image regions to different computation paths for processing based on the guiding information; and an image reconstruction module, used to reassemble and fuse the feature maps output by each computation path, and then enlarge the resolution of the reassembled and fused feature map to the target resolution to finally obtain a high-resolution image.
[0011] To achieve the above objectives, embodiments of this application also propose an electronic device, including: a processor and a memory, wherein the memory stores instructions executable by the processor, and the processor is configured to execute the instructions such that the electronic device can implement an image super-resolution reconstruction method based on a multi-scale content-aware mixer as described above.
[0012] To achieve the above objectives, embodiments of this application also propose a computer-readable storage medium storing a computer program that, when executed by a processor, implements the image super-resolution reconstruction method based on a multi-scale content-aware mixer as described above.
[0013] Optionally, shallow feature extraction is performed on the low-resolution image to be reconstructed, mapping the low-resolution image from pixel space to feature space to obtain an initial shallow feature map, including: receiving the low-resolution image to be reconstructed, performing shallow feature extraction on the low-resolution image through a convolutional layer, mapping the low-resolution image from pixel space to feature space to obtain an initial shallow feature map.
[0014] Optionally, a feature enhancement encoder based on a feature pyramid structure is used to perform feature enhancement based on a feature pyramid and attention mechanism on the shallow feature map, obtaining a deep feature map containing rich multi-scale spatial information. This includes: dividing the shallow feature map by channel, capturing features of different scales through max pooling layers with different downsampling factors, and then using nearest neighbor interpolation to restore the features of different scales to their original size to generate an adaptive spatial attention map; and performing multi-receptive field fusion on the spatial attention map in the channel dimension, performing depthwise convolution operations through three branches with different receptive fields, and then concatenating the outputs of these branches to obtain a deep feature map containing rich multi-scale spatial information.
[0015] Optionally, the feature enhancement encoder consists of a spatially adaptive feature modulation unit and a convolutional channel mixer, with a layer normalization operation preceding both the spatially adaptive feature modulation unit and the convolutional channel mixer.
[0016] Optionally, multi-scale content-aware prediction is performed based on the deep feature map to generate guiding information for computational allocation, including a window classification binary mask and window size. This includes: learning the texture information of the deep feature map; generating a window classification binary mask based on the complexity of the texture information, the mask classifying different image regions in the deep feature map into easy or hard classes; performing differentiable sampling with the Gumbel Softmax strategy at each prediction window position, and predicting the window size with the highest probability from a set of preset window sizes; and generating spatial attention weights for the easy path and channel attention weights for channel fusion.
[0017] Optionally, based on the guidance information, different image regions are assigned to different computational paths for processing. This includes: assigning image regions of the easy class to the easy path, which uses basic convolution operations combined with the spatial attention weights and channel attention weights for lightweight feature correction; and assigning image regions of the hard class to the hard path, which uses a multi-scale large kernel attention mechanism with dynamic selection. Image regions marked as hard by the window classification binary mask are input into large kernel convolutions of different scales, each of which is decomposed into a depthwise convolution, a depthwise convolution with a dilation factor, and a pointwise convolution. This reduces the computational load while simulating a large receptive field. A Sigmoid activation function generates the dynamic selection weights, with which the outputs of the large kernel convolutions of different scales are weighted and fused to flexibly capture long-range dependencies.
[0018] Optionally, the feature maps output by each computation path are recombined and fused, and the resolution of the recombined and fused feature map is then enlarged to the target resolution to obtain a high-resolution image. This includes: recombining and fusing the feature maps output by the easy path and the hard path, using the PixelShuffle unit to enlarge the resolution of the recombined and fused feature map to the target resolution, and finally outputting, through a convolutional layer, a high-resolution image corresponding to the low-resolution image to be reconstructed.
[0019] The image super-resolution reconstruction method based on a multi-scale content-aware mixer proposed in this application has the following advantages compared with traditional image super-resolution reconstruction methods.
[0020] First, it breaks the limitations of rigid paths, enabling more refined allocation of computational resources. Unlike traditional techniques (such as ClassSR) that use fixed-size grids for routing decisions, this application introduces a scale prediction head in a multi-scale content-aware predictor. This mechanism can dynamically predict and generate non-uniform grids based on the complexity of local image textures. This non-uniform partitioning can more closely fit irregular object boundaries, avoiding the computational waste or loss of detail caused by the "one-size-fits-all" approach in traditional methods, thus exhibiting better generalization ability in complex scenes.
[0021] Second, this application improves the accuracy of routing decisions and solves the problem of rigid decision-making. Traditional acceleration frameworks (such as SMSR) often rely on shallow features or simple gradient thresholds for "easy / hard" binary classification, resulting in low classification accuracy. This application introduces a lightweight feature enhancement module before the predictor, capturing deep semantic features through feature pyramids and multi-scale convolutions, and using the enhanced deep features to guide the predictor in generating masks and weights. This application effectively improves reconstruction accuracy, demonstrating that deep feature-based decision-making is more accurate and robust than traditional shallow feature-based decision-making.
[0022] Third, while reducing computational complexity, the ability to capture long-range dependencies is maintained. Unlike methods such as SwinIR or ELAN that rely on computationally intensive Transformer layers, this application proposes a multi-scale large-kernel attention mechanism with dynamic selection. It combines depthwise convolution, dilated convolution, and pointwise convolution to simulate large convolution kernels, replacing the traditional self-attention mechanism. The adaptive allocation mechanism proposed in this application effectively reduces computational redundancy. Compared with the static computation mode of applying large-kernel convolution or Transformer layers to the entire image, this application automatically switches to small-window, shallow-layer computation when processing flat regions, which significantly reduces the overall computational load and achieves efficient inference acceleration while maintaining performance.
Brief Description of the Drawings
[0023] To illustrate the technical solutions in the embodiments or related technologies of this application more clearly, the accompanying drawings used in the description of the embodiments or related technologies are briefly introduced below. Obviously, the following drawings show only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort. The drawings described herein are only used to explain this application and are not intended to limit this application.
[0024] Figure 1 is a flowchart of an image super-resolution reconstruction method based on a multi-scale content-aware mixer provided in one embodiment of this application;
Figure 2 is a detailed schematic diagram of an image super-resolution reconstruction method based on a multi-scale content-aware mixer provided in one embodiment of this application;
Figure 3 is a structural diagram of a feature enhancement encoder provided in one embodiment of this application;
Figure 4 is a structural diagram of an image super-resolution reconstruction system based on a multi-scale content-aware mixer provided in another embodiment of this application;
Figure 5 is a structural diagram of an electronic device provided in another embodiment of this application.
Detailed Implementation
[0025] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the various embodiments of this application will be described in detail below with reference to the accompanying drawings. Those skilled in the art will understand that many technical details have been presented in the embodiments of this application to facilitate better understanding. However, the technical solutions claimed in this application can be implemented even without these technical details and various variations and modifications based on the following embodiments. The division of the following embodiments is for ease of description and should not constitute any limitation on the specific implementation of this application. The following embodiments can be combined with and referenced by each other without contradiction.
[0026] Currently, research teams have achieved image super-resolution reconstruction by applying different processing methods to images of varying complexity. The first method utilizes classification routing, the second utilizes sparse computation, and the third utilizes dynamic depth.
[0027] K. Yu et al. proposed a classification-based super-resolution acceleration framework. This method first crops the large input image into fixed-size image patches. Using a lightweight classifier, these patches are divided into three levels based on their reconstruction difficulty: easy, medium, and hard. The patches of each category are then fed into one of three network branches of different computational cost. The easy branch contains only a few convolutional layers, while the hard branch uses a deep, large network, thus reducing the overall average computational cost. While this approach considers the varying reconstruction difficulty of different regions in the image, it suffers from a serious "path rigidity" problem. Partitioning the image into fixed-size patches results in overly coarse classification granularity. If only a small portion of a patch is textured and the rest is flat, the whole patch may still be classified as hard, leading to wasted computation.
[0028] Bohong Chen et al. proposed a super-resolution acceleration strategy utilizing spatial pruning. This method dynamically learns a binary mask during network feature extraction. This mask marks which regions in the feature map are texture-rich, important locations ("1") and which are flat, redundant locations ("0"). The network only performs convolution operations on regions marked with "1" in the mask, while regions marked with "0" are skipped or simply interpolated. This method concentrates computational resources on high-frequency detail regions rather than the entire image, significantly reducing the number of floating-point operations. However, its decisions are binary, i.e., either/or, and it also suffers from "path rigidity." Furthermore, this method primarily decides whether to perform computation at all and does not adjust how the computation is performed.
[0029] Y. Zhang et al. designed a super-resolution acceleration algorithm based on a depth-adaptive strategy. This algorithm inserts policy modules or gating mechanisms between residual blocks in the backbone network. For each image region, the network evaluates its current feature recovery quality in real time. If a region is already well recovered in a shallow layer, the computation for that region terminates early, and the result is output directly; only complex texture regions that are extremely difficult to recover are sent to the deepest layer of the network to complete the full computation path. Although this method achieves dynamic depth adjustment, it typically uses standard 3×3 convolutions within each layer, resulting in a fixed receptive field. For large objects or repetitive textures requiring long-range dependencies, simply increasing the depth is less effective than increasing the size of the convolutional kernel.
[0030] The goal of single-image super-resolution is to reconstruct high-resolution images from degraded low-resolution images. However, when processing large-sized images in real-world scenarios, computational complexity and memory consumption become major bottlenecks. Current content-aware routing acceleration frameworks suffer from path rigidity. Specifically, this rigidity is spatially manifested in the use of fixed, regular grids for routing decisions, crudely assuming uniform complexity for all pixels within an image patch and ignoring scale differences in local image features. To address this issue, this application proposes a multi-scale content-aware mixer with a scale prediction head. The scale prediction head in the predictor dynamically predicts the optimal window size (8×8 or 16×16) for each location, generating a non-uniform grid. This allows for dynamic partitioning of the feature map based on the predicted scale, breaking the limitations of traditional fixed grids and achieving finer adaptive control over inference computation.
[0031] To achieve efficient image super-resolution, we need to accurately distinguish between simple and complex regions in an image in order to allocate computational resources as needed. However, traditional decision-making mechanisms typically rely on simple, low-capacity models or hard thresholds to judge shallow features. This binary decision-making based on shallow features lacks an understanding of deep semantics, leading to inaccurate classification results and rigid decision-making. To address this issue, this application introduces an extremely lightweight feature enhancement module before the predictor. This module enhances shallow features using feature pyramids and multi-scale depthwise convolutions, capturing deep feature information at different scales. The enhanced deep feature maps are then used to guide the predictor in generating more accurate masks, offsets, and attention weights, thereby significantly improving the accuracy of classifying simple and complex regions in the input image.
[0032] Current lightweight super-resolution networks often suffer from "non-discriminatory" computation, meaning that regardless of whether the input image content is flat or textured, the network executes the exact same computation graph, resulting in significant computational waste when processing simple content. Although some acceleration methods attempt to address this issue, most still directly employ computationally intensive Transformer layers or standard attention mechanisms when processing regions deemed difficult, leading to persistently high overall FLOPs and making hardware deployment impractical. To address this problem, this application proposes a multi-scale large-kernel attention mechanism with dynamic selection to replace traditional Transformer layers. Leveraging the extreme low-level optimization of convolution operations in modern hardware, multi-scale large-kernel convolution maintains a large receptive field to capture long-range dependencies while flexibly selecting convolution kernels through dynamic gating, reducing the computational complexity of processing complex regions and achieving the optimal balance between large-size image restoration quality and inference speed.
[0033] This application combines a lightweight feature enhancement module with a dynamic multi-scale large-kernel attention mechanism to perform adaptive super-resolution reconstruction of single images based on a multi-scale content-aware mixer architecture. This application aims to overcome the path rigidity and computational non-discrimination bottlenecks of traditional acceleration frameworks by achieving on-demand allocation of computational resources through dynamic window partitioning. This helps to significantly reduce model computation while ensuring image reconstruction quality, and promotes the widespread application of super-resolution technology on resource-constrained hardware.
[0034] One embodiment of this application proposes an image super-resolution reconstruction method based on a multi-scale content-aware mixer. The implementation details of the image super-resolution reconstruction method based on a multi-scale content-aware mixer proposed in this embodiment are described in detail below. The following implementation details are provided for ease of understanding and are not necessary for implementing this solution.
[0035] This embodiment leverages the characteristic that different feature regions require models of varying complexity. Through a multi-scale content-aware predictor, it accurately captures local feature differences and global complexity distribution of the input image, generating guiding information including window classification masks and window sizes. Based on the predictor's classification of window size and texture information, lightweight convolutional operations are allocated to simple scenes, while multi-scale large-kernel convolutions are allocated to complex scenes with sparse textures. This achieves rational allocation and utilization of resources, effectively reducing the model's FLOPs.
[0036] The specific process of the image super-resolution reconstruction method based on a multi-scale content-aware mixer proposed in this embodiment is shown in Figure 1, and its details are shown in Figure 2. The method includes: Step 11: Perform shallow feature extraction on the low-resolution image to be reconstructed, mapping the low-resolution image from pixel space to feature space to obtain the initial shallow feature map.
[0037] In the specific implementation, after obtaining the low-resolution image to be reconstructed, shallow feature extraction can be performed on the low-resolution image to be reconstructed, thereby mapping the low-resolution image from the pixel space to the feature space and obtaining the initial shallow feature map.
[0038] As shown in Figure 2, shallow feature extraction is implemented by a shallow feature extraction module. The shallow feature extraction module receives the low-resolution image to be reconstructed and performs shallow feature extraction on the low-resolution image through a convolutional layer. The low-resolution image is mapped from the pixel space to the feature space to obtain the initial shallow feature map, thus preparing for subsequent feature enhancement and deep feature interaction.
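The mapping from pixel space to feature space can be illustrated with a minimal sketch of a single convolutional layer. The 3×3 kernel size, the zero padding, and the function name `conv3x3` are assumptions made for illustration; the application only states that a convolutional layer is used.

```python
def conv3x3(image, weights, bias):
    """Zero-padded 3x3 convolution (stride 1), written out for clarity.

    image:   nested lists shaped [C_in][H][W]
    weights: nested lists shaped [C_out][C_in][3][3]  (hypothetical values)
    bias:    list of length C_out
    returns: nested lists shaped [C_out][H][W]
    """
    c_in, h, w = len(image), len(image[0]), len(image[0][0])
    out = []
    for o, kernel in enumerate(weights):
        # Start every output plane from its bias term.
        plane = [[bias[o]] * w for _ in range(h)]
        for c in range(c_in):
            for y in range(h):
                for x in range(w):
                    acc = 0.0
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            yy, xx = y + dy, x + dx
                            # Pixels outside the image count as zero.
                            if 0 <= yy < h and 0 <= xx < w:
                                acc += kernel[c][dy + 1][dx + 1] * image[c][yy][xx]
                    plane[y][x] += acc
        out.append(plane)
    return out
```

The output keeps the spatial size of the input and changes only the channel count, which is what lets later stages operate on a feature map aligned with the low-resolution image.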
[0039] Step 12: Perform feature enhancement on the shallow feature map based on feature pyramid and attention mechanism to obtain a deep feature map containing rich multi-scale spatial information.
[0040] In practice, after obtaining the shallow feature map, feature enhancement based on feature pyramids and attention mechanisms can be performed on the shallow feature map to obtain a deep feature map containing rich multi-scale spatial information.
[0041] As shown in Figure 2, feature enhancement is specifically implemented by a feature enhancement encoder based on a feature pyramid structure. The feature enhancement encoder first divides the shallow feature map by channel, then captures features at different scales using max-pooling layers with different downsampling factors. Next, it uses nearest-neighbor interpolation to restore the features at different scales to their original size, generating an adaptive spatial attention map. Then, it performs multi-receptive field fusion on the spatial attention map along the channel dimension, performing depthwise convolution operations through three branches of different receptive fields. Finally, it concatenates the outputs of the branches of different receptive fields to obtain a deep feature map containing rich multi-scale spatial information.
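The pyramid part of this encoder (channel split, max pooling at several downsampling factors, nearest-neighbor restoration) can be sketched as follows. The downsampling factors `(1, 2, 4)` and all function names are assumptions for illustration; the depthwise convolution branches and the modulation step are deliberately left out.

```python
def max_pool(plane, factor):
    """Non-overlapping max pooling of an [H][W] plane; H and W must be divisible by factor."""
    h, w = len(plane), len(plane[0])
    return [
        [
            max(plane[y + dy][x + dx] for dy in range(factor) for dx in range(factor))
            for x in range(0, w, factor)
        ]
        for y in range(0, h, factor)
    ]


def upsample_nearest(plane, factor):
    """Nearest-neighbor upsampling: every value is repeated factor x factor times."""
    return [
        [plane[y // factor][x // factor] for x in range(len(plane[0]) * factor)]
        for y in range(len(plane) * factor)
    ]


def multi_scale_pyramid(features, factors=(1, 2, 4)):
    """Split [C][H][W] features into len(factors) channel groups, pool each group
    at its own scale, then restore every group to the original H x W.

    Assumes C is divisible by len(factors) (any remainder channels are dropped).
    """
    group = len(features) // len(factors)
    out = []
    for i, factor in enumerate(factors):
        for plane in features[i * group:(i + 1) * group]:
            out.append(upsample_nearest(max_pool(plane, factor), factor))
    return out
```

Because every group is restored to the original size, the groups can be concatenated again along the channel dimension, as the paragraph above describes.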
[0042] In one example, the specific structure of the feature enhancement encoder is shown in Figure 3. The feature enhancement encoder consists of a spatially adaptive feature modulation unit (SAFM) and a convolutional channel mixer (CCM), with layer normalization operations preceding both the SAFM and CCM.
[0043] Step 13: Perform multi-scale content-aware prediction based on deep feature maps to generate guiding information for computation allocation, including window classification binary masks and window sizes. The window classification binary mask classifies different image regions in the deep feature maps into easy or hard classes.
[0044] In the specific implementation, after obtaining the deep feature map, multi-scale content-aware prediction can be performed based on the deep feature map, generating guiding information for computation allocation, including a window classification binary mask and window size. Here, the window classification binary mask classifies different image regions in the deep feature map into easy or hard classes. The easy class corresponds to flat regions (such as the sky and smooth background), and the hard class corresponds to complex regions.
[0045] As shown in Figure 2, multi-scale content-aware prediction is implemented by a multi-scale content-aware predictor. The multi-scale content-aware predictor first learns the texture information of the deep feature map, and generates a window classification binary mask based on the complexity of the texture information. This mask classifies different image regions in the deep feature map into easy or hard classes. Then, for each prediction window position, it uses the Gumbel Softmax strategy for differentiable sampling, predicting the window size with the highest probability from a pre-defined set of window sizes. This allows the network to use small windows in flat areas and large windows in complex areas, achieving non-uniform grid partitioning. Simultaneously, the multi-scale content-aware predictor also generates spatial attention weights for the easy path and channel attention weights for channel fusion.
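The Gumbel Softmax sampling step can be sketched as below. The candidate sizes `(8, 16)` follow the 8×8 and 16×16 windows mentioned earlier in the description; the temperature `tau`, the logit values, and the function names are assumptions for illustration, and the gradient handling used during training is not shown.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        # Clamp the uniform sample so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def pick_window_size(logits, sizes=(8, 16), tau=1.0, rng=random):
    """Return the candidate size with the highest sampled probability, plus the probabilities."""
    probs = gumbel_softmax(logits, tau, rng)
    best = max(range(len(sizes)), key=probs.__getitem__)
    return sizes[best], probs
```

A lower `tau` pushes the probabilities closer to a one-hot vector, while a higher `tau` keeps them smoother; which schedule the application uses is not stated.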
[0046] The core principle of the multi-scale content-aware prediction proposed in this embodiment lies in abandoning the rigid mode of "fixed partitioning" and "static computation" in traditional super-resolution networks. Instead, it adopts an adaptive processing mechanism of "feature enhancement perception, dynamic scale prediction, and on-demand computation allocation," thereby achieving the goal of allocating computational resources on demand and dynamically adjusting processing paths. The scale prediction head in the multi-scale content-aware predictor presets multiple candidate prediction window sizes. For each location in the image, the scale prediction head calculates a probability for each of the M candidate sizes and selects the size with the highest probability as the target size. Based on the prediction results, the system dynamically divides the feature map into a non-uniform grid. Flat areas may be divided into small windows, while complex areas requiring long-range dependencies are divided into large windows, thus breaking the spatial path rigidity and achieving on-demand resource allocation.
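One way to picture the resulting non-uniform grid is the sketch below. It assumes, purely for illustration, that the scale prediction head outputs one size per 16×16 base cell and that every predicted size divides the base cell; the application does not specify how per-location predictions are turned into windows, and `partition_windows` is a hypothetical helper.

```python
def partition_windows(size_map, base=16):
    """Turn per-cell size predictions into a list of non-overlapping windows.

    size_map[r][c] is the predicted window size for the base x base cell at
    row r, column c. Returns (top, left, size) tuples that tile the feature map.
    """
    windows = []
    for r, row in enumerate(size_map):
        for c, size in enumerate(row):
            if base % size != 0:
                raise ValueError(f"window size {size} does not divide base cell {base}")
            # A cell predicted as `size` is tiled by (base // size)^2 windows.
            for dy in range(0, base, size):
                for dx in range(0, base, size):
                    windows.append((r * base + dy, c * base + dx, size))
    return windows
```

For a 16×32 feature map whose left cell is predicted as 16 and whose right cell is predicted as 8, this yields one 16×16 window followed by four 8×8 windows, which together cover every position exactly once.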
[0047] Step 14: Based on the guidance information, different image regions are assigned to different computation paths for processing.
[0048] In the specific implementation, after obtaining the guidance information, different image regions can be assigned (routed) to different computing paths for processing based on the guidance information.
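The routing step can be sketched as a simple dispatch on the window classification binary mask. The encoding (1 for hard, 0 for easy), the function name `route_windows`, and the two placeholder path functions are assumptions for illustration; they stand in for the actual computation paths.

```python
def route_windows(windows, hard_mask, simple_path, hard_path):
    """Process each window on the path selected by the binary mask.

    The output keeps the input order, so the per-window results can later
    be put back in place when the feature maps are recombined.
    """
    if len(windows) != len(hard_mask):
        raise ValueError("mask must contain one entry per window")
    return [
        hard_path(window) if is_hard else simple_path(window)
        for window, is_hard in zip(windows, hard_mask)
    ]

# Placeholder windows and paths, for illustration only
windows = ["sky", "brick wall", "grass", "window grid"]
hard_mask = [0, 1, 0, 1]
routed = route_windows(
    windows,
    hard_mask,
    simple_path=lambda w: ("simple", w),
    hard_path=lambda w: ("hard", w),
)
print(routed)
```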
[0049] For image regions classified as easy, this embodiment assigns them to the simple path, which applies basic convolution operations combined with the spatial attention weights and channel attention weights for lightweight feature correction.
[0050] For image regions classified as hard, this embodiment assigns them to the hard path, which adopts a multi-scale large kernel attention mechanism with dynamic selection. The image regions marked as hard by the window classification binary mask are input into large kernel convolutions of different scales, each of which is decomposed into a depthwise convolution, a depthwise convolution with a dilation factor, and a pointwise convolution; this decomposition simulates a large receptive field while reducing the amount of computation. A Sigmoid activation function generates dynamic selection weights, and the outputs of the large kernel convolutions of different scales are weighted by these weights and fused to flexibly capture long-range dependencies.
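To make the saving from the decomposition concrete, the sketch below compares parameter counts and computes the receptive field of the stacked convolutions, then shows a Sigmoid-gated fusion of branch outputs. The kernel sizes (5 and 7), the dilation (3), the channel count (64), and all function names are assumptions chosen for illustration; the source does not state these values, and biases are ignored.

```python
import math

def dense_conv_params(channels, kernel):
    """Weights of a standard kernel x kernel convolution (channels in and out)."""
    return channels * channels * kernel * kernel

def decomposed_params(channels, dw_kernel, dilated_kernel):
    """Weights of depthwise + dilated depthwise + pointwise convolutions."""
    depthwise = channels * dw_kernel * dw_kernel
    dilated_depthwise = channels * dilated_kernel * dilated_kernel
    pointwise = channels * channels
    return depthwise + dilated_depthwise + pointwise

def receptive_field(dw_kernel, dilated_kernel, dilation):
    """Receptive field of the two stacked stride-1 depthwise convolutions."""
    return 1 + (dw_kernel - 1) + (dilated_kernel - 1) * dilation

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fuse_branches(branch_outputs, gate_logits):
    """Weight each branch output by sigmoid(gate) and sum element-wise."""
    weights = [sigmoid(g) for g in gate_logits]
    length = len(branch_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, branch_outputs))
        for i in range(length)
    ]

rf = receptive_field(dw_kernel=5, dilated_kernel=7, dilation=3)
print(rf)                              # 23
print(decomposed_params(64, 5, 7))     # 8832
print(dense_conv_params(64, rf))       # 2166784
print(fuse_branches([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0]))  # [2.0, 3.0]
```

Under these assumed sizes, the decomposed form covers a 23 x 23 receptive field with a small fraction of the weights of a dense kernel of the same extent. Because each gate is an independent Sigmoid rather than a softmax, the branch weights do not need to sum to one.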
[0051] Step 15: Reassemble and fuse the feature maps output by each computation path, and then enlarge the resolution of the reassembled and fused feature maps to the target resolution to finally obtain a high-resolution image.
[0052] In the specific implementation, after each computing path has been processed, the feature maps output by each computing path can be recombined and fused, and then the resolution of the recombined and fused feature maps can be enlarged to the target resolution to finally obtain a high-resolution image.
[0053] As shown in Figure 2, the recombination and fusion are implemented by the image reconstruction module. The image reconstruction module recombines and fuses the feature maps output by the simple path and the hard path, uses a PixelShuffle unit to upscale the recombined and fused feature map to the target resolution, and finally outputs, through a convolutional layer, the high-resolution image corresponding to the low-resolution image to be reconstructed.
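The rearrangement performed by a PixelShuffle unit can be shown on plain nested lists. This is a minimal sketch rather than the reconstruction module itself: the function name and the channel ordering (output pixel offset `(i, j)` taken from input channel `c * r * r + i * r + j`) are assumptions, chosen to match the convention commonly used by deep learning frameworks.

```python
def pixel_shuffle(x, r):
    """Rearrange a [C*r*r][H][W] nested list into [C][H*r][W*r]."""
    in_channels = len(x)
    if in_channels % (r * r) != 0:
        raise ValueError("channel count must be divisible by r * r")
    out_channels = in_channels // (r * r)
    height, width = len(x[0]), len(x[0][0])
    out = [
        [[0] * (width * r) for _ in range(height * r)]
        for _ in range(out_channels)
    ]
    for c in range(out_channels):
        for i in range(r):          # row offset inside each r x r output block
            for j in range(r):      # column offset inside each r x r output block
                src = x[c * r * r + i * r + j]
                for y in range(height):
                    for col in range(width):
                        out[c][y * r + i][col * r + j] = src[y][col]
    return out

# Four 1x1 channels become one 2x2 channel at upscaling factor r = 2
low_res_features = [[[0]], [[1]], [[2]], [[3]]]
print(pixel_shuffle(low_res_features, 2))  # [[[0, 1], [2, 3]]]
```

The operation only moves values from the channel dimension into space, so the preceding layers must produce `r * r` times as many channels as the output image needs.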
[0054] The image super-resolution reconstruction method based on a multi-scale content-aware mixer proposed in this embodiment has the following advantages compared with traditional image super-resolution reconstruction methods.
[0055] First, it breaks the limitations of rigid paths, enabling more refined allocation of computational resources. Unlike traditional techniques (such as ClassSR) that use fixed-size grids for routing decisions, this embodiment introduces a scale prediction head in a multi-scale content-aware predictor. This mechanism can dynamically predict and generate non-uniform grids based on the complexity of local image textures. This non-uniform partitioning can more closely fit irregular object boundaries, avoiding the computational waste or loss of detail caused by the "one-size-fits-all" approach in traditional methods, thus exhibiting superior generalization ability in complex scenes.
[0056] Second, it improves the accuracy of routing decisions and solves the problem of rigid decision-making. Traditional acceleration frameworks (such as SMSR) often rely on shallow features or simple gradient thresholds for "easy / hard" binary classification, resulting in low classification accuracy. This embodiment introduces a lightweight feature enhancement module before the predictor, capturing deep semantic features through feature pyramids and multi-scale convolutions, and using the enhanced deep features to guide the predictor in generating masks and weights. Experimental results show that, with the feature enhancement module added, this embodiment achieves a PSNR of 26.73 dB and an SSIM of 0.8045 on the Urban100 dataset (magnification ×4). Compared with the similar lightweight models SwinIR-light (PSNR 26.47 dB) and ELAN-light (PSNR 26.54 dB), this is an improvement in reconstruction accuracy of 0.26 dB and 0.19 dB, respectively. This shows that decision-making based on deep features is more accurate and robust than traditional decision-making based on shallow features.
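For reference, PSNR is computed from the mean squared error between the reconstruction and the ground truth, and the reported gains are plain differences in dB. The sketch below assumes 8-bit pixel values in flat lists; the evaluation protocol used for the reported figures (for example, color channel or border handling) is not stated in the source, so the function is illustrative only.

```python
import math

def psnr(reference, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(reconstructed):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstructed)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Gains quoted in the text, recomputed from the reported PSNR values
ours, swinir_light, elan_light = 26.73, 26.47, 26.54
print(round(ours - swinir_light, 2))  # 0.26
print(round(ours - elan_light, 2))    # 0.19
```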
[0057] Third, while reducing computational complexity, the ability to capture long-range dependencies is maintained. Unlike methods such as SwinIR or ELAN that rely on computationally intensive Transformer layers, this embodiment proposes a multi-scale large kernel attention mechanism with dynamic selection. It uses a combination of "depthwise convolution, dilated convolution, and pointwise convolution" to simulate large convolution kernels, replacing the traditional self-attention mechanism. The adaptive allocation mechanism proposed in this embodiment also effectively reduces computational redundancy. Compared with the static computation mode of applying large kernel convolution or Transformer layers to the entire image, this embodiment automatically switches to small-window, shallow-layer computation when processing flat regions, which significantly reduces the overall computational load (by approximately 15G), achieving efficient inference acceleration while maintaining performance.
[0058] The steps described above are merely for clarity in describing the technical solution. In actual implementation, they can be combined into one step, or certain steps can be broken down into multiple steps, as long as they involve the same logical relationship, they are all within the scope of protection of this application. Any insignificant modifications or designs added to the algorithm or process, as long as they do not change the core of the algorithm or process, are also within the scope of protection of this application.
[0059] Another embodiment of this application proposes an image super-resolution reconstruction system based on a multi-scale content-aware mixer, used to implement the image super-resolution reconstruction method based on a multi-scale content-aware mixer as described in the above method embodiment. The details of the image super-resolution reconstruction system based on a multi-scale content-aware mixer proposed in this embodiment are described in detail below. The following content is only for the convenience of understanding and is not necessary for implementing this solution.
[0060] The specific structure of the image super-resolution reconstruction system based on a multi-scale content-aware mixer proposed in this embodiment can be as shown in Figure 4, and includes: a shallow feature extraction module 21, a feature enhancement module 22, a multi-scale content-aware prediction module 23, an adaptive dynamic path selection module 24, and an image reconstruction module 25.
[0061] The shallow feature extraction module 21 is used to perform shallow feature extraction on the low-resolution image to be reconstructed, mapping the low-resolution image from pixel space to feature space to obtain an initial shallow feature map.
[0062] The feature enhancement module 22 is used to perform feature enhancement on the shallow feature map based on feature pyramid and attention mechanism to obtain a deep feature map containing rich multi-scale spatial information.
[0063] The multi-scale content-aware prediction module 23 is used to perform multi-scale content-aware prediction based on the deep feature map and to generate guidance information, including a window classification binary mask and a window size, for guiding computation allocation. The window classification binary mask classifies different image regions in the deep feature map into easy or hard classes.
[0064] The adaptive dynamic path selection module 24 is used to assign different image regions to different computational paths for processing based on guidance information.
[0065] The image reconstruction module 25 is used to recombine and fuse the feature maps output by each computation path, and then enlarge the resolution of the recombined and fused feature map to the target resolution to finally obtain a high-resolution image.
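The data flow between the five modules can be summarized in a structural sketch. The class name, field names, and call signatures are hypothetical, and the stub functions in the usage example only record the order of the calls; they do not perform any image processing.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class SuperResolutionSystem:
    shallow_feature_extraction: Callable[[Any], Any]   # module 21
    feature_enhancement: Callable[[Any], Any]          # module 22
    content_aware_prediction: Callable[[Any], Any]     # module 23
    path_selection: Callable[[Any, Any], Any]          # module 24
    image_reconstruction: Callable[[Any], Any]         # module 25

    def __call__(self, low_res_image: Any) -> Any:
        shallow = self.shallow_feature_extraction(low_res_image)
        deep = self.feature_enhancement(shallow)
        guidance = self.content_aware_prediction(deep)   # mask and window sizes
        path_outputs = self.path_selection(deep, guidance)
        return self.image_reconstruction(path_outputs)

# Stub modules that only append their name to a trace list
system = SuperResolutionSystem(
    shallow_feature_extraction=lambda image: image + ["shallow"],
    feature_enhancement=lambda feats: feats + ["enhance"],
    content_aware_prediction=lambda feats: {"mask": [0, 1], "window_size": [4, 16]},
    path_selection=lambda feats, guidance: feats + ["route"],
    image_reconstruction=lambda feats: feats + ["reconstruct"],
)
trace = system([])
print(trace)  # ['shallow', 'enhance', 'route', 'reconstruct']
```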
[0066] It is worth noting that all modules involved in this embodiment are logical modules. In practical applications, a logical module can be a physical module, a part of a physical module, or an organic combination of multiple physical modules. Furthermore, to highlight the innovative aspects of this application, this embodiment does not introduce modules that are not closely related to solving the technical problems proposed in this application. However, this does not mean that other modules are absent from this embodiment.
[0067] It is not difficult to see that this embodiment is a system embodiment corresponding to the above method embodiments, and this embodiment can be implemented in conjunction with the above method embodiments. The relevant technical details and technical effects mentioned in the above method embodiments are still valid in this embodiment, and will not be repeated here to reduce repetition. Accordingly, the relevant technical details mentioned in this embodiment can also be applied to the above method embodiments.
[0068] Another embodiment of this application provides an electronic device which, as shown in Figure 5, includes a processor 31 and a memory 32. The memory 32 stores instructions executable by the processor 31. When the processor 31 executes the instructions, the electronic device implements the image super-resolution reconstruction method based on a multi-scale content-aware mixer as described in the above method embodiment.
[0069] The memory and processor are connected via a bus, which may include any number of interconnecting buses and bridges. The bus can connect various circuits of one or more processors and memories, as well as other circuits such as peripherals, voltage regulators, and power management circuits, all of which are well known in the art and therefore not described further herein. The bus interface provides an interface between the bus and the transceiver. The transceiver can be a single component or multiple components, such as multiple receivers and transmitters, providing a unit for communicating with various other devices over a transmission medium. Data processed by the processor is transmitted over a wireless medium via an antenna, which also receives data and passes it to the processor.
[0070] The processor manages the bus and handles general processing, providing various functions, including but not limited to timing, peripheral interfaces, voltage regulation, power management, and other control functions. Memory, on the other hand, is used to store data used by the processor during operation.
[0071] Another embodiment of this application proposes a computer-readable storage medium storing a computer program that, when executed by a processor, implements the image super-resolution reconstruction method based on a multi-scale content-aware mixer as described in the above method embodiments.
[0072] That is, those skilled in the art will understand that all or part of the steps in the above method embodiments can be implemented by a program instructing related hardware. The program is stored in a storage medium and includes several instructions to cause a device (such as a microcontroller, chip, etc.) or processor to execute all or part of the steps of the method described in the method embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory, random access memory, magnetic disks, or optical disks.
[0073] It will be understood by those skilled in the art that the above embodiments are specific implementations of this application, and various changes in form and detail can be made in practical applications without departing from the spirit and scope of this application. For those skilled in the art, several improvements and modifications can be made without departing from the principles of this application, and these improvements and modifications are also considered to be within the scope of protection of this application.
Claims
1. An image super-resolution reconstruction method based on a multi-scale content-aware mixer, implemented using an adaptive processing mechanism, characterized in that the method includes: performing shallow feature extraction on the low-resolution image to be reconstructed, mapping the low-resolution image from pixel space to feature space to obtain an initial shallow feature map; performing feature enhancement based on a feature pyramid and an attention mechanism on the shallow feature map to obtain a deep feature map containing rich multi-scale spatial information; performing multi-scale content-aware prediction based on the deep feature map to generate guidance information, including a window classification binary mask and a window size, for guiding computation allocation, wherein the window classification binary mask classifies different image regions in the deep feature map into easy or hard classes; assigning, based on the guidance information, different image regions to different computation paths for processing; and recombining and fusing the feature maps output by each computation path, and then enlarging the resolution of the recombined and fused feature map to the target resolution to finally obtain a high-resolution image.
2. The image super-resolution reconstruction method based on a multi-scale content-aware mixer according to claim 1, characterized in that performing shallow feature extraction on the low-resolution image to be reconstructed, mapping the low-resolution image from pixel space to feature space to obtain an initial shallow feature map, includes: receiving the low-resolution image to be reconstructed, and performing shallow feature extraction on the low-resolution image through a convolutional layer, mapping the low-resolution image from pixel space to feature space to obtain the initial shallow feature map.
3. The image super-resolution reconstruction method based on a multi-scale content-aware mixer according to claim 2, characterized in that a feature enhancement encoder based on a feature pyramid structure is used to enhance the shallow feature map, and performing feature enhancement based on a feature pyramid and an attention mechanism on the shallow feature map to obtain a deep feature map containing rich multi-scale spatial information includes: dividing the shallow feature map by channel, and capturing features of different scales through max pooling layers with different downsampling factors; restoring the features of different scales to their original size by nearest neighbor interpolation to generate an adaptive spatial attention map; fusing the spatial attention map with multiple receptive fields in the channel dimension; and performing depthwise convolution on the branches of three different receptive fields, and then concatenating the outputs of the branches of different receptive fields to obtain the deep feature map containing rich multi-scale spatial information.
4. The image super-resolution reconstruction method based on a multi-scale content-aware mixer according to claim 3, characterized in that, The feature enhancement encoder consists of a spatial adaptive feature modulation unit and a convolutional channel mixer, with layer normalization operations preceding both the spatial adaptive feature modulation unit and the convolutional channel mixer.
5. The image super-resolution reconstruction method based on a multi-scale content-aware mixer according to claim 3, characterized in that performing multi-scale content-aware prediction based on the deep feature map to generate guidance information, including a window classification binary mask and a window size, for guiding computation allocation includes: learning the texture information of the deep feature map, generating the window classification binary mask based on the complexity of the texture information, and classifying different image regions in the deep feature map into easy or hard classes; for each prediction window position, using the Gumbel-Softmax strategy to perform differentiable sampling, predicting the window size with the highest probability from the preset set of window sizes, thereby achieving non-uniform grid partitioning; and generating spatial attention weights for the simple path and channel attention weights for channel fusion.
6. The image super-resolution reconstruction method based on a multi-scale content-aware mixer according to claim 5, characterized in that assigning, based on the guidance information, different image regions to different computation paths for processing includes: for image regions classified as easy, assigning them to the simple path for processing, using basic convolution operations combined with the spatial attention weights and channel attention weights for lightweight feature correction; and for image regions classified as hard, assigning them to the hard path for processing, using a multi-scale large kernel attention mechanism with dynamic selection, wherein the image regions marked as hard by the window classification binary mask are input into large kernel convolutions of different scales, each decomposed into a depthwise convolution, a depthwise convolution with a dilation factor, and a pointwise convolution, so as to simulate a large receptive field while reducing the amount of computation, and wherein a Sigmoid activation function generates dynamic selection weights, and the outputs of the large kernel convolutions of different scales are weighted and fused to flexibly capture long-range dependencies.
7. The image super-resolution reconstruction method based on a multi-scale content-aware mixer according to claim 6, characterized in that recombining and fusing the feature maps output by each computation path, and then enlarging the resolution of the recombined and fused feature map to the target resolution to finally obtain a high-resolution image, includes: recombining and fusing the feature maps output by the simple path and the hard path; enlarging the resolution of the recombined and fused feature map to the target resolution using a PixelShuffle unit; and finally outputting, through a convolutional layer, a high-resolution image corresponding to the low-resolution image to be reconstructed.
8. An image super-resolution reconstruction system based on a multi-scale content-aware mixer, used to implement the image super-resolution reconstruction method based on a multi-scale content-aware mixer as described in any one of claims 1 to 7, characterized in that the system includes: a shallow feature extraction module, used to perform shallow feature extraction on the low-resolution image to be reconstructed, mapping the low-resolution image from pixel space to feature space to obtain an initial shallow feature map; a feature enhancement module, used to perform feature enhancement on the shallow feature map based on a feature pyramid and an attention mechanism to obtain a deep feature map containing rich multi-scale spatial information; a multi-scale content-aware prediction module, used to perform multi-scale content-aware prediction based on the deep feature map and to generate guidance information, including a window classification binary mask and a window size, for guiding computation allocation, wherein the window classification binary mask classifies different image regions in the deep feature map into easy or hard classes; an adaptive dynamic path selection module, used to assign different image regions to different computation paths for processing based on the guidance information; and an image reconstruction module, used to recombine and fuse the feature maps output by each computation path, and then enlarge the resolution of the recombined and fused feature map to the target resolution, finally obtaining a high-resolution image.
9. An electronic device, characterized in that it includes: a processor and a memory, wherein the memory stores instructions executable by the processor, and the processor is configured to, when executing the instructions, cause the electronic device to implement the image super-resolution reconstruction method based on a multi-scale content-aware mixer as described in any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the image super-resolution reconstruction method based on a multi-scale content-aware mixer as described in any one of claims 1 to 7.