Remote sensing image super-resolution reconstruction method, device and equipment based on feature modulation

By employing a feature modulation method, combined with a multispectral self-modulation feature aggregation module and a denoising sensing network, the problems of resolution limitation and noise interference in remote sensing image reconstruction are solved, achieving efficient and accurate image reconstruction suitable for low-power devices.

CN122048671B · Active · Publication Date: 2026-06-26 · XIAMEN UNIV OF TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: XIAMEN UNIV OF TECH
Filing Date: 2026-04-17
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Existing methods for super-resolution reconstruction of remote sensing images suffer from limitations in resolution, blurred boundaries of ground features, and difficulty in maintaining spectral consistency. Furthermore, models based on self-attention mechanisms have high computational complexity and are difficult to deploy on low-power devices.

Method used

A feature modulation approach is adopted: a cascaded multispectral self-modulation feature aggregation module with global and local feature capture branches is combined with a denoising-aware partial convolutional feedforward network to perform feature modulation and denoising, thereby reducing computational complexity and improving image reconstruction accuracy.

Benefits of technology

It significantly improves the accuracy and detail restoration of remote sensing image reconstruction, effectively suppresses complex noise interference, optimizes the allocation of computing resources, enhances the ability to model multispectral band correlation, and is suitable for deployment in low-power devices.

✦ Generated by Eureka AI based on patent content.

Abstract

The application provides a feature modulation-based remote sensing image super-resolution reconstruction method, device and equipment, and relates to the technical field of image processing and deep learning. The application obtains a low-resolution remote sensing image and extracts a shallow feature map; a plurality of enhanced feature mixing modules (EFMB) are used for step-by-step extraction and feature modulation of deep features, wherein the EFMB realizes decoupling enhancement of spatial and spectral features through a spectral attention mechanism and parallel global and local feature capturing branches; the global branch simulates global modeling by using downsampling and variance statistics, and the local branch strengthens texture through an edge enhancement operator; a denoising-aware partial convolutional feedforward network (DNFN) performs denoising processing on part of the channels; and finally, a high-resolution image is output through image reconstruction. The application can effectively solve the problems of blurred boundaries and poor spectral consistency of remote sensing images, significantly reduces the computational overhead of the deep network, improves the reconstruction accuracy, and is beneficial to deployment on low-power devices.

Description

Technical Field

[0001] This invention relates to the fields of image processing and deep learning technology, and more specifically, to a method, apparatus, and device for super-resolution reconstruction of remote sensing images based on feature modulation.

Background Technology

[0002] Remote sensing imagery has significant applications in land monitoring, agricultural yield estimation, ecological environment assessment, and military reconnaissance. However, limited by objective factors such as imaging sensor accuracy, orbital altitude, lighting conditions, and transmission bandwidth, acquired remote sensing images often have limited resolution, making it difficult to meet the needs of downstream refined analysis. Super-resolution reconstruction technology aims to recover high-resolution images from low-resolution images, thereby improving the clarity of ground feature boundaries and texture details, which is of great significance for the refined interpretation and target detection of remote sensing images.

[0003] In the history of technological development, traditional methods such as bicubic interpolation only calculate pixel values based on a fixed spatial neighborhood, failing to effectively recover high-frequency details. With the rise of deep learning, image super-resolution methods based on convolutional neural networks have significantly improved reconstruction accuracy. In recent years, Transformer-based super-resolution methods have achieved remarkable success due to their self-attention mechanism's ability to explore non-local information. However, the crucial dot-product self-attention mechanism consumes significant computational resources, limiting its application in low-power devices. Furthermore, the low-pass characteristic of the self-attention mechanism restricts its ability to capture local details, often resulting in relatively smooth reconstruction results.

[0004] While deep learning provides powerful feature representation capabilities for super-resolution of remote sensing images, there are significant differences between remote sensing images and natural images. Remote sensing images typically contain richer spectral dimensions and more complex noise patterns. For example, although multispectral or hyperspectral images show strong correlations across different bands, the noise levels and imaging conditions vary across each band. This often results in unsatisfactory performance when traditional natural image super-resolution models are directly applied. Furthermore, edge and texture information plays a crucial role in remote sensing image interpretation; clear edges help distinguish ground feature boundaries and improve visual quality. However, deep networks are prone to over-smoothing during processing, leading to blurred edges.

[0005] In view of the above, this application is hereby submitted.

Summary of the Invention

[0006] The present invention aims to provide a method, apparatus and device for super-resolution reconstruction of remote sensing images based on feature modulation, in order to solve the technical problems of limited resolution, blurred ground object boundaries and difficulty in maintaining spectral consistency in the imaging, transmission and storage of remote sensing images in existing methods, as well as the high computational complexity and difficulty in deploying existing deep learning models, especially transformer models based on self-attention mechanisms, on low-power edge devices.

[0007] To solve the above-mentioned technical problems, the present invention is achieved through the following technical solution:

[0008] A feature-modulated remote sensing image super-resolution reconstruction method includes:

[0009] S1, acquire low-resolution remote sensing images and extract shallow feature maps;

[0010] S2, feed the shallow feature map into a series of cascaded enhanced feature mixing modules (EFMB) for stepwise extraction and modulation of deep features; each EFMB includes a multispectral self-modulation feature aggregation module (MSFA) and a denoising-aware partial convolutional feedforward network (DNFN), with a residual connection introduced between the MSFA and the DNFN;

[0011] S3, inside the MSFA, adjust the number of channels of the shallow feature map with a convolutional layer, then split the result along the channel dimension into the input of the global feature capture branch and the input of the local feature capture branch to obtain the corresponding branch outputs; fuse the branch outputs and apply a dimension transformation to obtain the MSFA output features;

[0012] S4, input the MSFA output features into the DNFN, where they are split along the channel dimension into a first sub-feature block and a second sub-feature block for partial-channel denoising; the first sub-feature block is denoised and then concatenated with the second sub-feature block to output the modulated deep features;

[0013] S5, convert the channel dimension of the deep features to a preset size, upscale them by pixel shuffling, and add the result to the shallow features carried by the global residual connection to obtain a high-resolution remote sensing image.

[0014] Preferably, in the process of extracting the shallow feature map, a convolutional layer with a preset kernel size, preset stride and preset padding parameters is used for initial mapping to convert the number of channels of the input low-resolution remote sensing image into a preset number of channels, and a nonlinear activation function is used to enhance the feature expression capability to obtain a shallow feature map containing basic spatial information.

[0015] Preferably, within the MSFA, the global feature capture branch and the local feature capture branch are configured in parallel;

[0016] The specific process for generating the MSFA output features is as follows:

[0017] The shallow feature map is subjected to L2 norm normalization, its number of channels is adjusted using a 1×1 convolutional layer, and the result is split along the channel dimension into the input $X_g$ of the global feature capture branch and the input $X_l$ of the local feature capture branch. The expression is:

[0018] $[X_g, X_l] = \mathrm{Split}(\mathrm{Conv}_{1\times1}(\mathrm{Norm}(F_0)))$;

[0019] where $\mathrm{Split}(\cdot)$ denotes the channel splitting operation; $\mathrm{Conv}_{1\times1}(\cdot)$ denotes a 1×1 convolution; $F_0$ is the shallow feature map; and $\mathrm{Norm}(\cdot)$ denotes the L2 norm;

[0020] In the global feature capture branch, the variance statistic of the input $X_g$ over the spatial dimensions is calculated first and is used to perceive the complexity of ground features. It is expressed as follows:

[0021] $V = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$;

[0022] where $V$ is the spatial-dimension variance feature of $X_g$; $N$ is the total number of pixels; $x_i$ is the value of the $i$-th pixel in $X_g$; and $\mu$ is the average pixel value;

[0023] The attention weight of each channel of the input $X_g$ is calculated and fused with $X_g$ by element-wise multiplication; the result is then processed by downsampling and depthwise separable convolution to generate the non-local structural information $S$;

[0024] The variance statistic and the non-local structural information are then fused to generate a spatial modulation signal, which is aggregated to obtain the global features;

[0025] The input $X_l$ is fed into the main feature branch and the edge enhancement branch of the local feature capture branch to extract local main features and edge features respectively, and the local main features and edge features are weighted and fused to output the local features;

[0026] The global features and the local features are added element-wise, and a 1×1 convolutional layer is used to perform dimensionality transformation to output the MSFA output features.

[0027] Preferably, the attention weight of each channel of the input $X_g$ is calculated by a spectral attention module.

[0028] The spectral attention module consists of a global average pooling layer, a first convolutional layer, and a second convolutional layer;

[0029] The global average pooling layer compresses each channel of $X_g$ into a scalar to aggregate spatial information;

[0030] The compressed $X_g$ passes through the first convolutional layer and the ReLU activation function, which introduce non-linearity and select the important features;

[0031] The selected key features are then restored to the original number of channels through a second convolutional layer;

[0032] Finally, the normalized attention weights for each channel are generated using the Sigmoid activation function.
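
By way of illustration only, the spectral attention steps above (global average pooling, a first convolution with ReLU, a second convolution, then Sigmoid) can be sketched in NumPy. The hidden width of 16, the omission of biases, and the function names are assumptions made for this sketch; the text does not specify them.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def spectral_attention(x: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Per-channel attention weights for a feature map x of shape (C, H, W).

    w1 has shape (C_hidden, C) and w2 has shape (C, C_hidden). On a 1x1
    spatial map a convolution reduces to a matrix product, so plain matrix
    multiplication stands in for both convolutional layers (biases omitted).
    """
    pooled = x.mean(axis=(1, 2))           # global average pooling: one scalar per channel
    hidden = np.maximum(w1 @ pooled, 0.0)  # first convolution followed by ReLU
    return sigmoid(w2 @ hidden)            # second convolution restores C, then Sigmoid


rng = np.random.default_rng(0)
x_g = rng.standard_normal((64, 16, 16))    # hypothetical global-branch input
w1 = rng.standard_normal((16, 64)) * 0.1   # assumed hidden width of 16
w2 = rng.standard_normal((64, 16)) * 0.1
weights = spectral_attention(x_g, w1, w2)
modulated = x_g * weights[:, None, None]   # element-wise fusion with the input
print(weights.shape, modulated.shape)      # (64,) (64, 16, 16)
```

Because the weights depend only on pooled statistics, the cost of this module is independent of the spatial size once pooling is done.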

[0033] Preferably, learnable scaling factors are introduced when the spatial modulation signal is generated by fusion, and its expression is:

[0034] $M = \mathrm{Conv}_{1\times1}(\alpha \odot S + \beta \odot V)$;

[0035] where $M$ is the spatial modulation signal; $\mathrm{Conv}_{1\times1}(\cdot)$ is a 1×1 convolution; $S$ is the non-local structural information; $\odot$ is the element-wise multiplication operation; and $\alpha$, $\beta$ are learnable scaling factors.
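
As a hedged illustration, one plausible reading of this fusion is a 1×1 convolution applied to a scaled sum of the non-local structural information and the variance statistic. The exact form, the per-channel shape of the two scaling factors, and their initial values are assumptions of this sketch, as are all names in it.

```python
import numpy as np


def spatial_modulation(s, v, alpha, beta, weight):
    """One reading of the fusion: M = Conv1x1(alpha * S + beta * V).

    s:      non-local structural information, shape (C, h, w)
    v:      per-channel variance statistic, shape (C, 1, 1), broadcast over space
    alpha:  learnable scale applied to s, shape (C, 1, 1) (assumed shape)
    beta:   learnable scale applied to v, shape (C, 1, 1) (assumed shape)
    weight: 1x1 convolution weights, shape (C_out, C)
    """
    fused = alpha * s + beta * v
    return np.einsum("oc,chw->ohw", weight, fused)


c, h, w = 8, 4, 4
rng = np.random.default_rng(0)
s = rng.standard_normal((c, h, w))
v = rng.random((c, 1, 1))
alpha = np.ones((c, 1, 1))    # assumed initial values, not given in the text
beta = np.zeros((c, 1, 1))
m = spatial_modulation(s, v, alpha, beta, np.eye(c))
print(m.shape)  # (8, 4, 4)
```

With these initial values the signal reduces to the structural information alone; training would move the two factors to balance structure against texture complexity.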

[0036] Preferably, the main feature branch is a convolutional multilayer perceptron architecture;

[0037] In the main feature branch, information is first aggregated in the local spatial neighborhood of the input feature through depthwise separable convolution. Then, the channel dimension is expanded to the preset hidden layer dimension using 1×1 convolution and nonlinear transformation is performed in combination with the GELU activation function. Finally, the number of channels is restored by 1×1 convolution to obtain the local main feature.

[0038] The edge enhancement branch is used to extract image edges or local high-frequency information of the input features. It captures the spatial gradient changes in each channel of the input features through depthwise separable convolution, then performs cross-channel information integration using 1×1 convolution, and finally uses the Tanh hyperbolic tangent activation function for mapping to obtain edge features.
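
The two local branches described above can be sketched as follows, purely as an illustration. The Laplacian initialisation of the edge kernels, the hidden width, the scalar fusion weight of 0.5, and the tanh approximation of GELU are assumptions of this sketch; the 1×1 expansion doubles as the pointwise half of the depthwise separable convolution.

```python
import math
import numpy as np


def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))


def depthwise_conv3x3(x, kernels):
    """Per-channel 3x3 filtering with zero padding. x: (C, H, W), kernels: (C, 3, 3)."""
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += kernels[:, dy, dx, None, None] * padded[:, dy:dy + h, dx:dx + w]
    return out


def conv1x1(x, weight):
    """1x1 convolution as a per-pixel linear map. weight: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)


def main_branch(x, dw, w_expand, w_restore):
    y = depthwise_conv3x3(x, dw)        # aggregate the local spatial neighbourhood
    y = gelu(conv1x1(y, w_expand))      # expand to the hidden dimension, then GELU
    return conv1x1(y, w_restore)        # restore the channel count


def edge_branch(x, dw_edge, w_mix):
    y = depthwise_conv3x3(x, dw_edge)   # per-channel spatial gradient response
    return np.tanh(conv1x1(y, w_mix))   # cross-channel mixing, squashed to (-1, 1)


c, hidden = 8, 16
rng = np.random.default_rng(0)
x_l = rng.standard_normal((c, 6, 6))    # hypothetical local-branch input
laplacian = np.array([[0.0, 1.0, 0.0], [1.0, -4.0, 1.0], [0.0, 1.0, 0.0]])
dw_edge = np.tile(laplacian, (c, 1, 1))  # assumed initialisation of the edge kernels
main = main_branch(x_l, rng.standard_normal((c, 3, 3)) * 0.1,
                   rng.standard_normal((hidden, c)) * 0.1,
                   rng.standard_normal((c, hidden)) * 0.1)
edge = edge_branch(x_l, dw_edge, np.eye(c))
local = main + 0.5 * edge               # 0.5 is an assumed fusion weight
print(local.shape)  # (8, 6, 6)
```

The Tanh mapping keeps the edge response bounded, so a strong gradient cannot dominate the fused local features.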

[0039] Preferably, in the denoising-aware partial convolutional feedforward network (DNFN), the MSFA output features are expanded in the channel dimension by a 1×1 convolution combined with the GELU activation function to generate high-dimensional hidden-layer features.

[0040] The hidden-layer features are split along the channel dimension into a first sub-feature block $X_1$ and a second sub-feature block $X_2$, where the number of channels of $X_1$ is set to one-quarter of the total number of channels of the hidden-layer features, so that only part of the channels are denoised and noise suppression is balanced against detail preservation.

[0041] A denoising convolutional block combined with the GELU activation function denoises $X_1$, wherein the denoising convolutional block consists of cascaded denoising convolutional layers, a batch normalization layer, and an activation function;

[0042] First, a 3×3 denoising convolutional layer extracts preliminary local structure and noise patterns from $X_1$. Then, each channel is normalized and nonlinearly transformed through the batch normalization layer and the GELU activation function. Finally, the features are refined through a second 3×3 denoising convolutional layer to obtain the denoised features.

[0043] The denoised features are concatenated with $X_2$ along the channel dimension, the original input dimension is restored through a 1×1 convolutional layer, and the modulated deep features are output.
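
The partial-channel mechanism can be illustrated with a short sketch. A 3×3 mean filter stands in for the denoising convolutional block (convolution, batch normalization, GELU, convolution), which is an assumption made only to keep the example short; the one-quarter split follows the text.

```python
import numpy as np


def box_filter(x):
    """Stand-in denoiser: 3x3 mean filter with edge padding. x: (C, H, W)."""
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)), mode="edge")
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += padded[:, dy:dy + h, dx:dx + w]
    return out / 9.0


def partial_channel_denoise(hidden, denoise=box_filter):
    """Denoise only the first quarter of the channels and pass the rest through."""
    c1 = hidden.shape[0] // 4
    x1, x2 = hidden[:c1], hidden[c1:]
    return np.concatenate([denoise(x1), x2], axis=0)


def dense_conv_macs_per_pixel(channels, kernel=3):
    """Multiply-accumulates per output pixel for a dense C -> C convolution."""
    return channels * channels * kernel * kernel


rng = np.random.default_rng(0)
hidden = rng.standard_normal((128, 8, 8))   # hypothetical hidden-layer features
out = partial_channel_denoise(hidden)
ratio = dense_conv_macs_per_pixel(128 // 4) / dense_conv_macs_per_pixel(128)
print(out.shape, ratio)  # (128, 8, 8) 0.0625
```

Restricting a dense 3×3 convolution to a quarter of the channels cuts its multiply-accumulate count to one sixteenth, while the remaining three quarters of the channels keep their detail untouched.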

[0044] Preferably, during the pixel shuffling process, the high-dimensional channel data of the deep features are rearranged into spatial neighborhood blocks according to a preset arrangement rule.
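
A minimal sketch of pixel shuffling is given below. The channel-to-space ordering follows the convention commonly used by deep learning frameworks; the patent only says "a preset arrangement rule", so this particular ordering is an assumption.

```python
import numpy as np


def pixel_shuffle(x: np.ndarray, r: int) -> np.ndarray:
    """Rearrange (C * r * r, H, W) into (C, H * r, W * r).

    Each group of r * r channels fills one r x r spatial block, so that
    out[c, h * r + i, w * r + j] == x[c * r * r + i * r + j, h, w].
    """
    c_rr, h, w = x.shape
    c = c_rr // (r * r)
    x = x.reshape(c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)  # reorder to (c, h, i, w, j)
    return x.reshape(c, h * r, w * r)


# Four channels at a single pixel become one 2x2 block.
tiny = np.arange(4).reshape(4, 1, 1)
print(pixel_shuffle(tiny, 2)[0].tolist())  # [[0, 1], [2, 3]]

# A 4-band output at scale 4 needs 4 * 4 * 4 = 64 channels before shuffling.
deep = np.zeros((64, 32, 32))
print(pixel_shuffle(deep, 4).shape)  # (4, 128, 128)
```

The operation moves data without arithmetic, which is why the preceding convolution must first set the channel count to the number of output bands times the square of the scale factor.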

[0045] The present invention also provides a remote sensing image super-resolution reconstruction device based on feature modulation, comprising:

[0046] The shallow feature extraction unit is used to acquire low-resolution remote sensing images and extract shallow feature maps.

[0047] The feature modulation unit is used to feed the shallow feature map into a series of cascaded enhanced feature mixing modules (EFMB) for stepwise extraction and feature modulation of deep features. The enhanced feature mixing module (EFMB) includes a multispectral self-modulation feature aggregation module (MSFA) and a denoising-aware partial convolutional feedforward network (DNFN), and a residual connection is introduced between MSFA and DNFN.

[0048] The MSFA dual-branch capture unit is used, within the MSFA, to adjust the number of channels of the shallow feature map using a convolutional layer and then split the result along the channel dimension into the input of the global feature capture branch and the input of the local feature capture branch, thereby obtaining the corresponding branch outputs. The branch outputs are then fused and dimensionally transformed to obtain the MSFA output features.

[0049] The DNFN partial denoising unit is used to input the MSFA output features into the DNFN and split them along the channel dimension into a first sub-feature block and a second sub-feature block for partial-channel denoising. After the first sub-feature block is denoised, it is concatenated with the second sub-feature block to output the modulated deep features.

[0050] The image reconstruction unit is used to convert the channel dimension of the deep features into a preset size, upscale them through pixel shuffling, and then add the result to the shallow features carried by the global residual connection to obtain a high-resolution remote sensing image.

[0051] The present invention also provides a feature-modulation-based remote sensing image super-resolution reconstruction device, including a processor and a memory, wherein the memory stores a computer program that can be executed by the processor to implement the feature-modulation-based remote sensing image super-resolution reconstruction method described above.

[0052] The present invention also provides a computer-readable storage medium storing computer-readable instructions, which, when executed by a processor of the device on which the computer-readable storage medium is located, implement the feature modulation-based remote sensing image super-resolution reconstruction method described above.

[0053] In summary, compared with the prior art, the present invention has the following beneficial effects:

[0054] First, this invention significantly improves the accuracy and detail restoration capabilities of remote sensing image reconstruction. By constructing a multispectral self-modulation feature aggregation module and employing parallel global and local feature capture branches, it achieves simultaneous capture of non-local structural information and local high-frequency details. The global branch utilizes spatial dimension variance statistics to perceive the complexity of ground features, while the local edge enhancement branch strengthens the sharpness of ground feature boundaries through specific gradient extraction operators, effectively solving the over-smoothing problem that traditional deep learning models often encounter in remote sensing image processing. Under a preset computational budget, the peak signal-to-noise ratio and structural similarity of the reconstructed images are significantly improved.

[0055] Second, it effectively suppresses complex noise interference unique to remote sensing images. Addressing the prevalent non-uniform noise in remote sensing images, this invention proposes a denoising-aware partial convolutional feedforward network. By splitting the features asymmetrically along the channel dimension and applying refined denoising convolutions to only a portion of them, noise suppression and feature preservation are decoupled along the channel dimension. This avoids the excessive smoothing caused by full-channel denoising and significantly reduces the computational cost of convolution. This design not only effectively filters out sensor noise during multispectral imaging but also prevents the denoising process from inadvertently damaging normal ground textures, ensuring the fidelity of the reconstruction results. Consequently, the reconstructed image exhibits higher accuracy in subsequent target detection and ground feature interpretation tasks.

[0056] Third, this invention optimizes the allocation of computing resources and improves deployability on low-power devices. By introducing depthwise separable convolution, a partial convolution strategy, and dimension control through convolutions of preset size, this invention significantly reduces the number of model parameters and the computational load. Unlike traditional Transformer models based on dot-product self-attention, whose cost grows quadratically with the number of pixels, the computational complexity of this invention increases linearly with image spatial resolution. This enables the solution to run efficiently on low-power embedded devices or satellite on-board processing terminals with limited computing resources, meeting the engineering requirements for real-time super-resolution processing of remote sensing images.
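
The scaling argument can be checked with rough arithmetic. The cost formulas below are standard order-of-magnitude estimates rather than figures from the patent, and the channel count of 64 and the tile sizes are assumptions chosen for illustration.

```python
def attention_macs(h: int, w: int, c: int) -> int:
    """Rough cost of global dot-product self-attention.

    Forming the N x N attention map and applying it to the values each
    take about N * N * C multiply-accumulates, with N = h * w.
    """
    n = h * w
    return 2 * n * n * c


def depthwise_separable_macs(h: int, w: int, c: int, k: int = 3) -> int:
    """Cost of a k x k depthwise convolution followed by a 1x1 pointwise one."""
    return h * w * c * (k * k + c)


for side in (64, 128, 256):
    print(side, attention_macs(side, side, 64), depthwise_separable_macs(side, side, 64))
```

Doubling the side length multiplies the pixel count by 4; the convolutional cost then grows by a factor of 4, while the attention cost grows by a factor of 16.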

[0057] Fourth, it enhances the modeling capability for multispectral band correlations. The spectral attention module of this invention can automatically learn the importance weights between different bands in remote sensing images. Through adaptive feature modulation of the spectral dimension, it fully exploits the strong correlations between multispectral data. This feature modulation-based mechanism enables the model to recover spectral characteristics more accurately than traditional natural image super-resolution models when processing remote sensing images with rich spectral information, reducing spectral distortion and providing high-quality basic data support for subsequent refined interpretation tasks.

Attached Figure Description

[0058] To more clearly illustrate the technical solutions of the embodiments of the present invention, the accompanying drawings used in the embodiments will be briefly introduced below. It should be understood that the following drawings only show some embodiments of the present invention and should not be regarded as a limitation of the scope. For those skilled in the art, other related drawings can be obtained from these drawings without creative effort.

[0059] Figure 1 is a schematic diagram of the overall process of the feature-modulated remote sensing image super-resolution reconstruction method provided in Embodiment 1.

[0060] Figure 2 is a detailed flowchart of the feature-modulated remote sensing image super-resolution reconstruction method provided in Embodiment 1.

[0061] Figure 3 is a schematic diagram of the overall structural framework of the feature modulation network provided in Embodiment 1.

[0062] Figure 4 is a schematic diagram of the internal structure of the Multispectral Self-Modulation Feature Aggregation Module (MSFA) provided in Embodiment 1.

[0063] Figure 5 is a schematic diagram of the internal structure of the Denoising-Aware Partial Convolutional Feedforward Network (DNFN) provided in Embodiment 1.

[0064] Figure 6 is a comparison of the local visual effects of super-resolution reconstruction of typical ground features in remote sensing images provided in Embodiment 1.

[0065] Figure 7 is a schematic diagram of the remote sensing image super-resolution reconstruction device based on feature modulation provided in Embodiment 2.

[0066] The present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.

Detailed Implementation

[0067] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort are within the scope of protection of the present invention. Therefore, the following detailed description of the embodiments of the present invention provided in the accompanying drawings is not intended to limit the scope of the claimed invention, but merely represents selected embodiments of the invention.

[0068] Embodiment 1

[0069] Embodiment 1 of the present invention provides a method for super-resolution reconstruction of remote sensing images based on feature modulation, which can be implemented by a remote sensing image super-resolution reconstruction device based on feature modulation (hereinafter referred to as reconstruction device), and in particular, executed by one or more processors within the reconstruction device.

[0070] In this embodiment, the reconstruction device may be any electronic device equipped with a processor that stores and can execute a computer program for the feature modulation-based remote sensing image super-resolution reconstruction method, such as a computer, smartphone, tablet, or workstation, which is not limited here.

[0071] The core of this invention lies in achieving accurate reconstruction of complex ground features in remote sensing images with extremely low computational overhead through a meticulously designed feature modulation architecture. In practical applications, such as land resource monitoring, disaster assessment, or urban planning, remote sensing images often suffer from insufficient resolution due to long imaging distances and significant environmental interference. This invention effectively utilizes the spectral correlation and spatial structure priors of remote sensing images through shallow feature extraction, feature modulation, and image reconstruction steps.

[0072] As shown in Figures 1-2, the method for super-resolution reconstruction of remote sensing images based on feature modulation includes steps S1 to S5.

[0073] S1: Acquire low-resolution remote sensing images and extract shallow feature maps.

[0074] In this step, raw low-resolution remote sensing images are first received from an external storage device or a real-time satellite transmission link. The data format of the low-resolution remote sensing images includes, but is not limited to, TIFF, GeoTIFF, or HDF5. Because the images may come from multiple sources, this step also involves preprocessing of the multi-source remote sensing data, which includes multispectral or hyperspectral imagery.

[0075] During preprocessing, the system performs radiometric calibration, atmospheric correction, and geometric fine correction on the input raw data stream to eliminate the influence of sensor noise and atmospheric scattering on the reflectivity of ground objects. The remote sensing image data includes high-resolution optical images, synthetic aperture radar images, and hyperspectral images, with a temporal resolution of no less than 15 days and a spatial resolution better than 2 meters. Multi-temporal image alignment is achieved through control point-based geometric correction and radiometric normalization processing.

[0076] In the process of extracting the shallow feature map, a convolutional layer with a preset kernel size, preset stride and preset padding parameters is used for initial mapping. For example, a convolutional layer with a kernel size of 3×3, a stride of 1 and padding of 1 is used to perform the operation, converting the number of channels of the input low-resolution remote sensing image (such as 4 channels: red, green, blue and near-infrared) into a preset number of channels, such as 64 channels or 128 channels. Then, a non-linear activation function is used to enhance the feature expression capability and obtain a shallow feature map containing basic spatial information.
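
As a quick check of the example configuration above, the sketch below computes the output size and parameter count of a 3×3 convolution with stride 1 and padding 1 that maps 4 input bands to 64 channels. The helper names and the 128×128 tile size are illustrative assumptions.

```python
def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a standard convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_param_count(c_in: int, c_out: int, kernel: int, bias: bool = True) -> int:
    """Number of weights (plus optional biases) in a dense 2-D convolution."""
    return c_in * c_out * kernel * kernel + (c_out if bias else 0)


# Hypothetical 128x128 low-resolution tile with 4 bands (R, G, B, NIR).
h_out = conv_output_size(128, kernel=3, stride=1, padding=1)
w_out = conv_output_size(128, kernel=3, stride=1, padding=1)
params = conv_param_count(c_in=4, c_out=64, kernel=3)
print(h_out, w_out, params)  # 128 128 2368
```

With these settings the spatial size is preserved, so the shallow feature map stays aligned pixel for pixel with the input image.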

[0077] Compared to the traditional ReLU activation function, the GELU activation function has a non-zero gradient in the negative region, and its smoothness helps to retain more ground feature edge information and weak spectral differences in the early stages of feature extraction.
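
The gradient claim can be verified numerically. The sketch below uses the exact GELU definition based on the standard normal distribution; the sample points are arbitrary.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_grad(x: float) -> float:
    """d/dx GELU(x) = Phi(x) + x * phi(x), with phi the standard normal PDF."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf


def relu_grad(x: float) -> float:
    """Derivative of ReLU away from zero."""
    return 1.0 if x > 0 else 0.0


for x in (-2.0, -1.0, -0.5):
    print(f"x={x:+.1f}  relu'={relu_grad(x):.3f}  gelu'={gelu_grad(x):+.4f}")
```

For negative inputs the ReLU derivative is exactly zero, whereas the GELU derivative is small but non-zero, so weak negative responses still receive a training signal.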

[0078] S2: The shallow feature map is fed into a series of cascaded enhanced feature mixing modules (EFMB) for step-by-step extraction and modulation of deep features; the enhanced feature mixing module (EFMB) includes a multispectral self-modulation feature aggregation module (MSFA) and a denoising-aware partial convolutional feedforward network (DNFN), and a residual connection is introduced between MSFA and DNFN.

[0079] The Enhanced Feature Mixing Block (EFMB), as the core computing component of the entire network, is responsible for the step-by-step extraction and modulation of deep features. The number of cascaded EFMBs is adaptively adjusted according to the reconstruction ratio and computing resources, preferably between 16 and 32 cascaded modules. In embedded processing terminals with limited hardware resources, the number of cascaded modules can be set to 16; while in high-performance ground processing centers, it can be set to 32 or more to pursue the ultimate reconstruction accuracy.

[0080] Each Enhanced Feature Mixing Module (EFMB) consists of two sub-modules: a Multi-Spectral Self-Modulation Feature Aggregation Module (MSFA) and a Denoising-Aware Partial Convolutional Feedforward Network (DNFN). These two sub-modules are arranged in series to jointly perform deep feature processing. To ensure the stability of the network during training at extremely deep structures, each EFMB includes residual connection structures. Specifically, the input features of the module are element-wise added to the output features processed by MSFA and DNFN.

[0081] The feature modulation part can be expressed as:

[0082] $F_{MSFA} = H_{MSFA}(F_0) + F_0$;

[0083] $F_D = H_{DNFN}(F_{MSFA}) + F_{MSFA}$;

[0084] where $F_{MSFA}$ denotes the MSFA output features; $F_0$ is the shallow feature map; $F_D$ denotes the modulated deep features; and $H_{MSFA}(\cdot)$, $H_{DNFN}(\cdot)$ represent the processing of the Multispectral Self-Modulation Feature Aggregation Module (MSFA) and the Denoising-Aware Partial Convolutional Feedforward Network (DNFN), respectively.

[0085] This identity mapping path allows gradients to directly penetrate modules during backpropagation, effectively alleviating the gradient vanishing problem during deep network training and ensuring that shallow spatial information is effectively transmitted in deep networks, preventing the loss of image structural information.
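
The effect of the identity path can be sketched as follows, assuming one residual connection around each sub-module. The sub-modules are passed in as placeholder functions, and all names here are hypothetical.

```python
import numpy as np


def efmb(f0, msfa_fn, dnfn_fn):
    """One enhanced feature mixing module with an identity path around each sub-module."""
    f_msfa = msfa_fn(f0) + f0
    return dnfn_fn(f_msfa) + f_msfa


def cascade(f0, blocks):
    """Apply a list of (msfa_fn, dnfn_fn) pairs in sequence."""
    x = f0
    for msfa_fn, dnfn_fn in blocks:
        x = efmb(x, msfa_fn, dnfn_fn)
    return x


def zero_branch(x):
    """Stand-in sub-module that contributes nothing, leaving only the identity path."""
    return np.zeros_like(x)


f0 = np.random.default_rng(0).standard_normal((64, 8, 8))
out = cascade(f0, [(zero_branch, zero_branch)] * 16)
print(np.array_equal(out, f0))  # True
```

Even when every sub-module outputs zero, the shallow features pass through 16 cascaded modules unchanged, which is the property that keeps gradients and structural information flowing in a deep stack.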

[0086] Referring to Figure 3, the overall structure of the network follows the design principles of residual learning, setting up a global residual connection between the input and output ends. This allows the network to primarily learn the residual information between high-resolution and low-resolution images during training, i.e., the high-frequency details of the images, thereby reducing the learning difficulty and accelerating convergence.

[0087] S3: Inside the MSFA, the number of channels of the shallow feature map is adjusted using a convolutional layer, and the result is then split along the channel dimension into the input of the global feature capture branch and the input of the local feature capture branch to obtain the corresponding branch outputs. The branch outputs are then fused and dimensionally transformed to obtain the MSFA output features.

[0088] Referring to Figure 4, within the MSFA, the Global Feature Extraction Branch (GFEB) and the Local Feature Extraction Branch (LFEB) are set up in parallel. The intention of this parallel architecture is to simulate the human visual system's ability to perceive the macroscopic structure and microscopic details of an image in a coordinated manner.

[0089] The specific process for generating the MSFA output features is as follows:

[0090] First, L2 norm normalization is performed on the shallow feature map. This process computes the energy distribution of the feature map along the channel dimension and standardizes it, so that the feature vectors have a uniform modulus, thereby enhancing the robustness of the model to different lighting conditions and radiation intensities.

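[0091] Then a 1×1 convolutional layer ($\mathrm{Conv}_{1\times1}$) adjusts the number of channels, and the result is split along the channel dimension into the input term of the global feature capture branch, $X_g$, and the input term of the local feature capture branch, $X_l$. The expression is:

[0092] $$\{X_g, X_l\} = \mathrm{Split}\left(\mathrm{Conv}_{1\times1}\left(\|X\|_2\right)\right);$$

[0093] where $X \in \mathbb{R}^{H\times W\times C}$ is the shallow feature map; $\mathbb{R}$ denotes the real numbers; $H$, $W$, and $C$ denote the image's height, width, and number of channels, respectively; $\|\cdot\|_2$ denotes L2 norm normalization; $\mathrm{Conv}_{1\times1}$ denotes a 1×1 convolutional layer; and $\mathrm{Split}$ denotes a channel splitting operation.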

[0094] The splitting operation typically divides the channels equally; that is, if the total number of channels is 128, the global-branch input and the local-branch input each have 64 channels. This even distribution ensures that global and local features contribute equally to the subsequent fusion.
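
The normalization and equal channel split described above can be illustrated with a minimal pure-Python sketch. It assumes the L2 normalization is applied to the channel vector at each pixel and omits the 1×1 convolution for brevity; the function names and the toy 4-channel pixel are hypothetical and are not taken from the patent.

```python
import math

def l2_normalize(pixel, eps=1e-12):
    """Scale one pixel's channel vector to unit L2 norm (assumed per-pixel)."""
    norm = math.sqrt(sum(v * v for v in pixel))
    return [v / max(norm, eps) for v in pixel]

def split_channels(pixel):
    """Split a channel vector into two equal halves: (global input, local input)."""
    half = len(pixel) // 2
    return pixel[:half], pixel[half:]

# Hypothetical 4-channel feature vector at a single pixel.
pixel = [3.0, 4.0, 0.0, 0.0]
normalized = l2_normalize(pixel)
x_global, x_local = split_channels(normalized)
print(normalized)          # [0.6, 0.8, 0.0, 0.0]
print(x_global, x_local)   # [0.6, 0.8] [0.0, 0.0]

# With 128 channels, each branch receives 64 channels.
g, l = split_channels([0.0] * 128)
print(len(g), len(l))      # 64 64
```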

[0095] The global feature capture branch in this embodiment includes low-frequency feature extraction and global feature modulation, simulating global non-local information interaction without using explicit self-attention. While directly using the standard self-attention mechanism can model global relationships, its computational complexity is high, which is too costly for high-resolution remote sensing images. The global feature capture branch in this embodiment, through downsampling, variance statistics, and learnable modulation, can simulate a global perception capability similar to attention with negligible computational cost.

[0096] In the global feature capture branch, the variance statistic of the input term $X_g$ over the spatial dimensions is calculated first. It is used to perceive the complexity of ground features and is expressed as follows:

[0097] $$\sigma^2(X_g) = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2;$$

[0098] where $\sigma^2(X_g)$ is the spatial variance feature of $X_g$; $N$ is the total number of pixels; $x_i$ is the value of the $i$-th pixel in $X_g$; and $\mu$ is the average pixel value.

[0099] Variance statistics reflect the texture richness of an image region. For regions with clear feature boundaries and complex textures (such as urban building complexes or fragmented farmland), the variance value is larger; while for regions with smooth textures (such as large areas of water or deserts), the variance value is smaller.
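
As a small illustration of how this statistic separates smooth from textured regions, the sketch below computes the variance of a single channel over all of its pixels. It assumes the population form (division by the total pixel count); the two 2×2 patches are made-up values.

```python
def spatial_variance(channel):
    """Variance of one channel over all of its pixels (population form)."""
    pixels = [v for row in channel for v in row]
    n = len(pixels)
    mean = sum(pixels) / n
    return sum((v - mean) ** 2 for v in pixels) / n

smooth = [[5.0, 5.0], [5.0, 5.0]]       # e.g. water or desert
textured = [[0.0, 10.0], [10.0, 0.0]]   # e.g. buildings or fragmented farmland

print(spatial_variance(smooth))    # 0.0
print(spatial_variance(textured))  # 25.0
```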

[0100] The attention weights of each channel of the global-branch input term are calculated and fused with the input term by element-wise multiplication. The result is then downsampled and processed by a depthwise separable convolution (e.g., a 3×3 depthwise separable convolutional layer) to generate nonlocal structural information.

[0101] The attention weights of each channel of the input term are calculated by a spectral attention module, which consists of a global average pooling layer, a first convolutional layer, and a second convolutional layer.

[0102] The global average pooling layer compresses each channel of the input term into a scalar to achieve spatial information aggregation;

[0103] The compressed features pass through the first convolutional layer (such as a 1×1 convolution) and the ReLU activation function, which introduce nonlinearity and select important features (such as important spectral feature bands).

[0104] The selected key features are then restored to the original number of channels through a second convolutional layer (such as a 1×1 convolution).

[0105] Finally, the sigmoid activation function is used to generate normalized attention weights for each channel (with values ranging from 0 to 1).

[0106] The expression for the channel attention weights can be:

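[0107] $$A_c = \mathrm{Sigmoid}\left(\mathrm{Conv}_2\left(\mathrm{ReLU}\left(\mathrm{Conv}_1\left(\mathrm{GAP}(X_g)\right)\right)\right)\right);$$

[0108] where $A_c$ denotes the channel attention weights; $X_g$ is the input term of the global feature capture branch; $\mathrm{Conv}_1$ and $\mathrm{Conv}_2$ are the first and second convolutional layers; $\mathrm{Sigmoid}$ is the Sigmoid activation function; $\mathrm{ReLU}$ is the ReLU activation function; and $\mathrm{GAP}$ is the global average pooling operation.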

[0109] These weights are multiplied element-wise with the original input term to enhance the features of important spectral bands and suppress redundant noise information.
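
The pooling, two-layer bottleneck, and sigmoid gating described above can be sketched in pure Python as follows. The weight matrices `w1` and `w2` and the two-channel feature map are hypothetical values chosen for illustration; biases are omitted, and a 1×1 convolution on a pooled 1×1 map is written as a matrix-vector product.

```python
import math

def global_average_pool(feature):
    """feature: list of channels, each a 2-D list. Returns one scalar per channel."""
    return [sum(v for row in ch for v in row) / (len(ch) * len(ch[0]))
            for ch in feature]

def pointwise(weights, vector):
    """A 1x1 convolution applied to a 1x1 map is a matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

def channel_attention(feature, w1, w2):
    squeezed = global_average_pool(feature)
    hidden = [max(0.0, h) for h in pointwise(w1, squeezed)]   # ReLU
    logits = pointwise(w2, hidden)
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]       # Sigmoid, in (0, 1)

def reweight(feature, weights):
    """Element-wise multiplication of each channel by its attention weight."""
    return [[[v * a for v in row] for row in ch]
            for ch, a in zip(feature, weights)]

# Hypothetical 2-channel, 2x2 feature map and illustrative weights.
feature = [[[2.0, 2.0], [2.0, 2.0]],
           [[0.0, 0.0], [0.0, 0.0]]]
w1 = [[1.0, 1.0]]        # first layer: 2 channels -> 1 hidden unit
w2 = [[1.0], [-1.0]]     # second layer: 1 hidden unit -> 2 channels

weights = channel_attention(feature, w1, w2)
weighted = reweight(feature, weights)
print([round(a, 4) for a in weights])
```

In this toy case the first channel receives the larger weight, so it is emphasized relative to the second.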

[0110] The features after channel attention weighting are downsampled, for example by a factor of 8, to reduce the spatial scale of the feature map and expand the receptive field, resulting in low-frequency features of size $H/8 \times W/8 \times C$. Subsequently, depthwise separable convolution is applied, including channel-wise and pointwise convolution, which effectively captures large-scale spatial dependencies while significantly reducing the number of parameters and computational cost. The processed features are then upsampled (e.g., by bilinear interpolation) to restore them to their original size, yielding the nonlocal structural information.
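
To make the parameter saving concrete, the following sketch compares the weight count of a standard 3×3 convolution with that of a depthwise separable one, and shows the spatial size after 8× downsampling. The channel count of 64 and the 256×256 input are illustrative assumptions; biases are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k convolution followed by a 1x1 pointwise one."""
    depthwise = c_in * k * k    # one k x k filter per input channel
    pointwise = c_in * c_out    # 1x1 convolution that mixes channels
    return depthwise + pointwise

c, k = 64, 3                                # assumed channel count and kernel size
std = standard_conv_params(c, c, k)
dws = depthwise_separable_params(c, c, k)
print(std, dws, round(std / dws, 2))        # 36864 4672 7.89

h, w, factor = 256, 256, 8                  # assumed input size, 8x downsampling
low_res = (h // factor, w // factor)
print(low_res)                              # (32, 32)
```

Under these assumptions the separable form uses roughly one eighth of the weights, and the downsampled map has 64 times fewer pixels to process.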

[0111] Then the variance statistic and the nonlocal structural information are fused to generate a spatial modulation signal, which is aggregated with the global-branch input term to obtain the global features.

[0112] To achieve an adaptive balance between structural and texture information under different terrain scenarios, a learnable scaling factor is introduced into the spatial modulation signal during the fusion generation process, the expression of which is:

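[0113] $$M = \mathrm{Conv}_{1\times1}\left(\alpha \odot S + \beta \odot \sigma^2(X_g)\right);$$

[0114] where $M$ is the spatial modulation signal; $\mathrm{Conv}_{1\times1}$ is a 1×1 convolution; $S$ is the nonlocal structural information; $\sigma^2(X_g)$ is the spatial variance statistic of the global-branch input term $X_g$; $\odot$ is the element-wise multiplication operation; and $\alpha$, $\beta$ are learnable scaling factors, i.e., scaling parameters that are automatically updated via backpropagation during network training.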

[0115] The global feature capture branch uses the variance statistic as a spatial modulation guidance signal to enhance the nonlocal structural information, as expressed in:

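[0116] $$F_g = X_g \odot \mathrm{Up}\left(\mathrm{GELU}(M)\right);$$

[0117] where $F_g$ denotes the global features; $X_g$ is the input term of the global feature capture branch; $M$ is the spatial modulation signal; $\odot$ is the element-wise multiplication operation; $\mathrm{Up}$ denotes the nearest-neighbor upsampling operation; and $\mathrm{GELU}$ denotes the GELU activation function.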

[0118] In the local feature capture branch, the input term is fed into the main feature branch and the edge enhancement branch respectively to extract local main features and edge features, which are then weighted and fused to output the local features.

[0119] The main feature branch adopts a convolutional multilayer perceptron architecture. In this branch, information from the local spatial neighborhood of the input features is first aggregated using depthwise separable convolutions (e.g., 3×3 depthwise separable convolutions). Then, a 1×1 convolution expands the channel dimension to a preset hidden layer dimension (e.g., twice the original number of channels). A nonlinear transformation is then applied using the GELU activation function, and finally a 1×1 convolution restores the channel count to obtain the local backbone features.

[0120] The expression for the above process is:

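[0121] $$F_m = \mathrm{Conv}_{1\times1}\left(\mathrm{GELU}\left(\mathrm{Conv}_{1\times1}\left(\mathrm{DWConv}_{3\times3}(X_l)\right)\right)\right);$$

[0122] where $F_m$ is the local backbone feature; $X_l$ is the input term of the local feature capture branch; $\mathrm{GELU}$ is the GELU activation function; $\mathrm{Conv}_{1\times1}$ is a 1×1 convolution; and $\mathrm{DWConv}_{3\times3}$ is a 3×3 depthwise separable convolution.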

[0123] The edge enhancement branch is used to extract image edges or local high-frequency information from the input features. It captures the spatial gradient changes within each channel of the input features through depthwise separable convolution, then performs cross-channel information integration using a 1×1 convolution, and finally applies the Tanh hyperbolic tangent activation function to obtain the edge features.

[0124] The expression for the above process is:

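[0125] $$F_e = \mathrm{Tanh}\left(\mathrm{Conv}_{1\times1}\left(\mathrm{DWConv}(X_l)\right)\right);$$

[0126] where $F_e$ denotes the edge features; $X_l$ is the input term of the local feature capture branch; $\mathrm{DWConv}$ is a depthwise separable convolution; $\mathrm{Conv}_{1\times1}$ is a 1×1 convolution; and $\mathrm{Tanh}$ is the hyperbolic tangent activation function.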

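[0127] The local backbone features $F_m$ and the edge features $F_e$ are then weighted and fused: $$F_l = F_m + \lambda \odot F_e;$$

[0128] where $F_l$ denotes the local features; and $\lambda$ is a learnable parameter used to control the edge enhancement intensity.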

[0129] The global features and local features are added element-wise, and a 1×1 convolutional layer is used for dimensionality transformation to output the MSFA output features, expressed as:

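[0130] $$F_{\mathrm{MSFA}} = \mathrm{Conv}_{1\times1}\left(F_g + F_l\right);$$

[0131] where $F_{\mathrm{MSFA}}$ denotes the MSFA output features; $\mathrm{Conv}_{1\times1}$ is a 1×1 convolution; $F_g$ denotes the global features; and $F_l$ denotes the local features.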

[0132] S4, the MSFA output features are input to the DNFN and divided along the channel dimension into a first sub-feature block and a second sub-feature block to perform partial channel denoising; the first sub-feature block is denoised and then concatenated with the second sub-feature block, and the modulated depth features are output.

[0133] In the process of remote sensing image formation, sensor thermal noise, quantum noise, and dislocation noise during transmission often interfere with the representation of ground features. Traditional fully convolutional feedforward networks often transform all features indiscriminately, which not only incurs high computational costs but also easily obscures the true texture.

[0134] Referring to Figure 5, in the denoising sensing part convolutional feedforward network (DNFN), the MSFA output features are processed by a 1×1 convolution ($\mathrm{Conv}_{1\times1}$) combined with the GELU activation function to expand the channel dimension and generate high-dimensional hidden layer features (e.g., 4 times the input dimension).

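[0135] The hidden layer features are divided along the channel dimension into a first sub-feature block $Y_1$ and a second sub-feature block $Y_2$, where the number of channels of $Y_1$ is set to one-quarter of the total number of channels of the hidden layer features, so as to perform partial channel denoising and balance noise suppression and detail preservation.

[0136] Its expression is:

[0137] $$\{Y_1, Y_2\} = \mathrm{Split}\left(\mathrm{GELU}\left(\mathrm{Conv}_{1\times1}\left(\|F_{\mathrm{MSFA}}\|_2\right)\right)\right);$$

[0138] where $\mathrm{Split}$ denotes splitting by channel; $\mathrm{GELU}$ denotes the GELU activation function; $\mathrm{Conv}_{1\times1}$ is a 1×1 convolution; $F_{\mathrm{MSFA}}$ denotes the MSFA output features; and $\|\cdot\|_2$ denotes L2 norm normalization.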

[0139] A denoising convolutional block combined with the GELU activation function is used to denoise the first sub-feature block; the denoising convolutional block consists of cascaded denoising convolutional layers, a batch normalization layer, and an activation function;

[0140] First, the first sub-feature block passes through a 3×3 denoising convolutional layer, which initially extracts the local structure and noise patterns. Then, each channel is normalized and nonlinearly transformed by a batch normalization layer and the GELU activation function. Finally, the features are refined by a second 3×3 denoising convolutional layer to obtain the denoised features.

[0141] This partial convolution strategy ensures that the network performs complex denoising operations on only a portion of the features while leaving the remaining features in their original state, thereby reducing overall computational overhead while maintaining feature diversity.

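[0142] The denoised features are concatenated with the second sub-feature block $Y_2$ along the channel dimension, and a 1×1 convolutional layer restores the original input dimension, outputting the modulated depth features. Its expression is:

[0143] $$F_{\mathrm{DNFN}} = \mathrm{Conv}_{1\times1}\left(\mathrm{Concat}\left(\mathrm{GELU}\left(\mathrm{DConv}(Y_1)\right),\, Y_2\right)\right);$$

[0144] where $F_{\mathrm{DNFN}}$ denotes the modulated depth features; $Y_1$ and $Y_2$ are the first and second sub-feature blocks; $\mathrm{Concat}$ denotes the channel merging (concatenation) operation; $\mathrm{GELU}$ denotes the GELU activation function; $\mathrm{DConv}$ denotes the denoising convolution; and $\mathrm{Conv}_{1\times1}$ is a 1×1 convolution.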

[0145] In the Denoising Perception Convolutional Feedforward Network (DNFN), the core design principle is to "only allow 1 / 4 of the channels to participate in denoising". The aim is to balance noise suppression and detail preservation, and to avoid excessive denoising that leads to overly smoothed images and loss of high-frequency textures.

[0146] Traditional denoising networks perform denoising processing on all feature channels uniformly, which has two key problems:

[0147] (1) Some channels themselves do not contain noise (or are key detail features, such as the edges and texture features of land features in remote sensing images). Forced noise reduction will erase effective information and cause the image to become blurry.

[0148] (2) Using all channels for denoising will significantly increase the computational load, and the mixed processing of noise and clean features will reduce the denoising accuracy.

[0149] Therefore, DNFN adopts a "channel-specific differentiated processing" strategy, allowing only some channels to undertake the noise reduction task, while the remaining channels retain the original features.

[0150] Denoising is performed only on the noisy channels (1 / 4) while the clean channels (3 / 4) are retained. This effectively suppresses noise while preserving key information such as ground feature details and textures in the remote sensing image. The complex denoising operation is performed on only 1 / 4 of the channels, which significantly reduces the computational load compared to full-channel denoising and improves the model's inference speed. It also avoids interference from a single noise pattern on all features and improves the network's adaptability to different noise intensities and types.

[0151] Of course, other proportions of noisy channels can be selected for noise reduction depending on the actual situation, and no restrictions are imposed here.
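
The partial channel strategy and its effect on cost can be sketched as follows. This is an illustration only: `denoise_block` is a hypothetical stand-in for the denoising convolutional block, scalars stand in for whole channels, and the cost estimate assumes a standard (dense) 3×3 convolution on a 256-channel hidden layer of size 64×64.

```python
def partial_denoise(channels, denoise_block, ratio=0.25):
    """Denoise only the first `ratio` of the channels; pass the rest through."""
    n = int(len(channels) * ratio)
    first, second = channels[:n], channels[n:]
    return denoise_block(first) + second     # concatenate along channels

def conv3x3_macs(c_in, c_out, h, w):
    """Multiply-accumulate count of a dense 3x3 convolution (bias ignored)."""
    return c_in * c_out * 9 * h * w

# Each scalar stands in for one channel; the stand-in "denoiser" zeroes its input.
channels = [float(i) for i in range(1, 9)]
out = partial_denoise(channels, lambda block: [0.0 for _ in block])
print(out)   # [0.0, 0.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

hidden, h, w = 256, 64, 64                   # assumed hidden width and map size
full_macs = conv3x3_macs(hidden, hidden, h, w)
partial_macs = conv3x3_macs(hidden // 4, hidden // 4, h, w)
print(full_macs // partial_macs)             # 16
```

Because both the input and output channel counts of the dense convolution shrink to one quarter, its cost falls to one sixteenth in this estimate. Changing `ratio` shows the effect of other proportions.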

[0152] S5, the channel dimension of the depth feature is converted to a preset size and amplified by pixel shuffling, and then superimposed with the shallow features transmitted by the global residual connection to obtain a high-resolution remote sensing image.

[0153] During the pixel shuffling process, the high-dimensional channel data of the depth features are filled into the spatial neighborhood block according to a preset arrangement rule.

[0154] The image reconstruction unit performs this process, which includes channel expansion, pixel shuffling, and global residual fusion. The depth features first pass through a 3×3 convolutional layer that increases the number of channels to the original count multiplied by the square of the preset scale factor. For example, for a 4x super-resolution task with 64 original channels, the number of channels is increased to 64×4×4=1024. Subsequently, a pixel shuffling layer rearranges the elements of the channel dimension into the spatial dimension: at each spatial position, every group of 16 of the 1024 channels fills one 4×4 local pixel block, thereby enlarging the image size by a factor of 4 in both the height and width directions.
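
The channel-to-space rearrangement can be sketched in pure Python. The index mapping below follows the ordering commonly used by deep learning frameworks (consecutive groups of r×r channels fill one r×r block in row-major order); the patent only states that a preset arrangement rule is used, so this ordering is an assumption.

```python
def pixel_shuffle(feature, r):
    """Rearrange nested lists of shape (C*r*r, H, W) into (C, H*r, W*r)."""
    c_in, h, w = len(feature), len(feature[0]), len(feature[0][0])
    assert c_in % (r * r) == 0, "channel count must be divisible by r*r"
    c_out = c_in // (r * r)
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for y in range(h):
            for x in range(w):
                for i in range(r):
                    for j in range(r):
                        src = c * r * r + i * r + j   # channel that fills offset (i, j)
                        out[c][y * r + i][x * r + j] = feature[src][y][x]
    return out

# Four 1x1 channels with scale 2 become one 2x2 channel.
feature = [[[float(k)]] for k in range(4)]
shuffled = pixel_shuffle(feature, 2)
print(shuffled)   # [[[0.0, 1.0], [2.0, 3.0]]]

# 1024 channels with scale 4 become 64 channels, each 4x larger per side.
big = pixel_shuffle([[[0.0]] for _ in range(1024)], 4)
print(len(big), len(big[0]), len(big[0][0]))   # 64 4 4
```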

[0155] The magnified feature map is not directly used as output, but is superimposed on the shallow features passed through global residual connections. Specifically, the shallow feature map is processed by bicubic interpolation or transposed convolution to reach the same size as the reconstructed feature map, and then the two are added element-wise. The design of the global residual connections ensures that the reconstruction network only needs to learn the residual information between the low-resolution and high-resolution images (i.e., the lost high-frequency details), which greatly reduces the learning difficulty and ensures that the output image is consistent with the original low-resolution image in terms of color and low-frequency structure. Finally, the system outputs a high-resolution remote sensing image that restores high-frequency details, has sharp edges, and extremely low noise levels.

[0156] The method proposed in this invention demonstrates significant technical advantages in practical operation. Referring to the comparison results of experimental metrics for the seven algorithms in Table 1, the model of this invention achieves higher peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) on standard remote sensing datasets such as RS-T1 and RS-T2, with a significantly lower parameter count than existing models such as IDN and FeNet. This means that this invention greatly reduces the demand for computational resources while ensuring reconstruction accuracy.

[0157] Table 1. Performance comparison results of the present invention with six other existing algorithms.

[0158]

[0159] In Table 1, the number of model parameters, in thousands (K), is used to measure the size / complexity of the model. The smaller the value, the lighter the model.

[0160] Multiply-accumulate operation complexity, unit: billion (G), measures computational complexity / inference speed; the smaller the value, the lower the computational cost.

[0161] RS-T1 and RS-T2 are two remote sensing image test datasets. Each cell is formatted as PSNR / SSIM: PSNR (Peak Signal-to-Noise Ratio): in dB. The higher the value, the better the image reconstruction quality and the less distortion. SSIM (Structural Similarity): ranges from 0 to 1. The higher the value, the more complete the image structure and texture are preserved, and the better the visual effect.
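
For reference, PSNR can be computed from the mean squared error with the standard definition, as in the sketch below. The 2×2 images are made-up values and the peak value of 255 assumes 8-bit data; this is not the evaluation code behind Table 1.

```python
import math

def psnr(reference, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    ref = [v for row in reference for v in row]
    rec = [v for row in reconstructed for v in row]
    mse = sum((a - b) ** 2 for a, b in zip(ref, rec)) / len(ref)
    if mse == 0:
        return math.inf          # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

hr = [[100, 120], [140, 160]]    # hypothetical ground-truth patch
sr = [[101, 119], [141, 159]]    # hypothetical reconstruction, off by 1 everywhere

print(round(psnr(hr, sr), 2))    # 48.13
```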

[0162] SRCNN (Super-Resolution Convolutional Neural Network) is a pioneering work in deep learning super-resolution. It is the first end-to-end CNN super-resolution model that uses only 3 convolutional layers to achieve "feature extraction-nonlinear mapping-reconstruction", making its structure extremely simple. With only 57K parameters, it is the lightest baseline model among all the comparison methods. However, it has a small receptive field, weak nonlinear ability, and the worst performance. It is only used as a transitional baseline between traditional methods and deep learning methods.

[0163] VDSR (Very Deep Super-Resolution) is a classic improvement model of early deep super-resolution, and it was the first to introduce residual learning into the super-resolution task. It stacks 20 convolutional layers and uses residual learning to solve the problem of training deep networks, significantly improving reconstruction accuracy; however, its computational cost (612.6G) is far higher than that of the other methods and its parameter count (666K) is the second largest among them, resulting in extremely low cost-effectiveness. The model is bulky and has slow inference speed, making it unsuitable for lightweight deployments.

[0164] IDN (Information Distillation Network) is a representative of information distillation architecture and an early exploration of lightweight super-resolution. It splits features by "information distillation units" to process high-frequency details and low-frequency information separately, compressing the number of parameters while ensuring performance. With 553K parameters and 32.3G of computation, it belongs to the medium complexity model.

[0165] FeNet (Feature Enhancement Network) is a lightweight super-resolution model that enhances the extraction of key features through multi-scale feature fusion and channel attention mechanisms. With 366K parameters and 20.4G of computation, it achieves good performance among lightweight models, balancing model size and reconstruction effect, and is one of the mainstream competitors in lightweight super-resolution.

[0166] LBNet (Local-Binary Network / Local-Global Feature Network) is a high-performance super-resolution model that fuses local and global features. It extracts both local texture and global contextual information, and enhances feature representation through a dual-branch structure. With 742K parameters (the largest among the compared methods) and 38.9G of computation, it achieves near-state-of-the-art performance at the cost of high model complexity. It offers high reconstruction accuracy, especially excelling in preserving structural details. However, the model is bulky and has slow inference speed, making it unsuitable for real-time deployment.

[0167] SMFANet (Spectral-Spatial Multi-Head Feature Attention Network) is a lightweight multispectral attention super-resolution model designed for multispectral remote sensing images. Through a dual spectral-spatial attention mechanism, it achieves near-state-of-the-art performance with extremely low parameter count (197K) and computational cost (11G). Its extreme lightweight design and fast inference speed make it a benchmark model for lightweight remote sensing super-resolution.

[0168] Ours (the method of this invention, a remote sensing image super-resolution network based on feature modulation) combines a multispectral self-modulation feature aggregation module (MSFA) with a 1 / 4-channel denoising perception part convolutional feedforward network (DNFN) to simulate global attention at extremely low computational cost, while achieving accurate denoising and detail preservation. With 251K parameters and 15G of computation, it achieves an optimal balance between performance and efficiency on the two remote sensing datasets. It achieves the highest SSIM score and the strongest structure / texture preservation capability on the RS-T2 dataset, while remaining lightweight and efficient at inference.

[0169] As shown in the visual comparison in Figure 6, the red boxes mark the detailed areas at the corresponding locations for magnified comparison. The features reconstructed by this invention, such as houses, roads, and vegetation, have sharper edges and more natural textures, effectively solving the blurring and loss of detail common in traditional methods.

[0170] HR (High Resolution): As a benchmark for effect, it has the clearest details and the sharpest edges. It can be seen that the parking lines are straight, the edges are clear, the lines are of uniform thickness, and there is no blurring or sticking.

[0171] Bicubic interpolation / SRCNN (Super-Resolution Convolutional Neural Network) / IDN (Information Distillation Network): Parking space lines are severely blurred, the lines are faint, and there are even instances of lines sticking together and loss of details, making it impossible to clearly distinguish the boundaries of parking spaces.

[0172] LBNet (Local-Global Feature Network) / FeNet (Feature Enhancement Network) / SMFANet (Spectral-Spatial Multi-Head Feature Attention Network): The clarity of parking lines is improved, but there are still issues such as blurred edges, lines that are not straight enough, and some lines that are distorted.

[0173] The method of this invention produces parking space lines with sharp edges and straight lines, restoring the details of the ground-truth (HR) image, with the strongest anti-blurring ability and a visual effect far exceeding that of the other comparison algorithms.

[0174] This invention reduces computational complexity from the quadratic level of the image size to the linear level by replacing the complex self-attention mechanism with a lightweight scheme based on downsampling modulation and variance statistics. Simultaneously, the extensive application of depthwise separable convolutions allows the model to be easily integrated into UAV-based processing platforms or on-orbit processing units of remote sensing satellites. When processing large-scale remote sensing imagery, this efficient architecture can significantly shorten processing time and enhance the real-time application value of remote sensing data.
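
The difference between quadratic and linear growth can be illustrated with a rough count. This sketch is a simplification: it counts pairwise pixel interactions for full self-attention and one modulation operation per pixel for the lightweight scheme, ignoring channel dimensions and constant factors; the function names are hypothetical.

```python
def attention_pairs(h, w):
    """Pairwise pixel interactions in full self-attention: (H*W)^2."""
    n = h * w
    return n * n

def modulation_ops(h, w):
    """One modulation operation per pixel: H*W."""
    return h * w

small, large = (64, 64), (128, 128)    # doubling each side of the image
print(attention_pairs(*large) // attention_pairs(*small))   # 16
print(modulation_ops(*large) // modulation_ops(*small))     # 4
```

Doubling the image side quadruples the pixel count, so the quadratic term grows 16-fold while the linear term grows 4-fold.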

[0175] In summary, compared with the prior art, the present invention has the following beneficial effects:

[0176] This invention constructs a complete and efficient super-resolution reconstruction system for remote sensing images through the synergistic effects of shallow feature extraction, multi-dimensional feature modulation, and perceptual denoising. This system not only accurately restores the spatial details of ground features but also rigorously maintains spectral consistency, providing high-quality foundational data for subsequent quantitative remote sensing tasks such as target detection and ground feature classification. Its lightweight design makes this technology promising for broad engineering applications in edge computing and mobile remote sensing monitoring.

[0177] Example 2

[0178] As shown in Figure 7, the second embodiment of the present invention also provides a remote sensing image super-resolution reconstruction device based on feature modulation, comprising:

[0179] The shallow feature extraction unit is used to acquire low-resolution remote sensing images and extract shallow feature maps.

[0180] The feature modulation unit is used to feed the shallow feature map into a series of cascaded enhanced feature mixing modules (EFMB) for stepwise extraction and feature modulation of deep features. The enhanced feature mixing module (EFMB) includes a multispectral self-modulated feature aggregation module (MSFA) and a denoising sensing part convolutional feedforward network (DNFN), and a residual connection is introduced between MSFA and DNFN.

[0181] The MSFA dual-branch capture unit is used to divide the shallow feature map into input items for the global feature capture branch and input items for the local feature capture branch according to the channel dimension after adjusting the number of channels using a convolutional layer within the MSFA, thereby obtaining the corresponding branch outputs. The branch outputs are then fused and dimensionally transformed to obtain the MSFA output features.

[0182] The DNFN partial denoising unit is used to input the MSFA output feature into the DNFN, and divides it into a first sub-feature block and a second sub-feature block according to the channel to perform a partial channel denoising mechanism. After denoising the first sub-feature block, it is spliced with the second sub-feature block to output the modulated depth feature.

[0183] The image reconstruction unit is used to convert the channel dimension of the depth features into a preset size and amplify it through pixel shuffling, and then superimpose it with the shallow features transmitted by the global residual connection to obtain a high-resolution remote sensing image.

[0184] Example 3

[0185] The third embodiment of the present invention also provides a remote sensing image super-resolution reconstruction device based on feature modulation, which includes a memory and a processor. The memory stores a computer program, which can be executed by the processor to realize the remote sensing image super-resolution reconstruction method based on feature modulation as described above.

[0186] Example 4

[0187] The fourth embodiment of the present invention also provides a computer-readable storage medium storing computer-readable instructions, which, when executed by a processor of the device on which the computer-readable storage medium is located, implement the feature-modulation-based remote sensing image super-resolution reconstruction method described above.

[0188] The above description is merely a preferred embodiment of the present invention and is not intended to limit the invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A method for super-resolution reconstruction of remote sensing images based on feature modulation, characterized in that it comprises: S1, acquiring low-resolution remote sensing images and extracting shallow feature maps; S2, feeding the shallow feature map into a series of cascaded enhanced feature mixing modules (EFMB) for stepwise extraction and modulation of deep features, wherein the enhanced feature mixing module (EFMB) includes a multispectral self-modulation feature aggregation module (MSFA) and a denoising sensing part convolutional feedforward network (DNFN), and a residual connection is introduced between the MSFA and the DNFN; S3, within the MSFA, adjusting the number of channels of the shallow feature map using a convolutional layer, then splitting it along the channel dimension into an input term of the global feature capture branch and an input term of the local feature capture branch to obtain the corresponding branch outputs, and fusing and dimensionally transforming the branch outputs to obtain the MSFA output features; S4, inputting the MSFA output features to the DNFN and dividing them along the channel dimension into a first sub-feature block and a second sub-feature block to perform partial channel denoising, wherein the first sub-feature block is denoised and then concatenated with the second sub-feature block to output the modulated depth features; S5, converting the channel dimension of the depth features to a preset size, amplifying them by pixel shuffling, and superimposing them with the shallow features transmitted by the global residual connection to obtain a high-resolution remote sensing image.

2. The method for super-resolution reconstruction of remote sensing images based on feature modulation according to claim 1, characterized in that, in the process of extracting the shallow feature map, a convolutional layer with a preset kernel size, preset stride and preset padding parameters is used for initial mapping to convert the number of channels of the input low-resolution remote sensing image into a preset number of channels, and a nonlinear activation function is used to enhance the feature expression capability to obtain a shallow feature map containing basic spatial information.

3. The method for super-resolution reconstruction of remote sensing images based on feature modulation according to claim 1, characterized in that, within the MSFA, the global feature capture branch and the local feature capture branch are set up in parallel; the specific process for generating the MSFA output features is as follows: the shallow feature map is subjected to L2 norm normalization, the number of channels is adjusted using a 1×1 convolutional layer, and the channel dimension is split into the input term $X_g$ of the global feature capture branch and the input term $X_l$ of the local feature capture branch, expressed as: $\{X_g, X_l\} = \mathrm{Split}\left(\mathrm{Conv}_{1\times1}\left(\|X\|_2\right)\right)$; where $\mathrm{Split}$ denotes a channel splitting operation; $\mathrm{Conv}_{1\times1}$ denotes a 1×1 convolution; $X$ is the shallow feature map; $\|\cdot\|_2$ denotes the L2 norm; in the global feature capture branch, the variance statistic of the input term $X_g$ in the spatial dimension is calculated first to perceive the complexity of ground features, expressed as: $\sigma^2(X_g) = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2$; where $\sigma^2(X_g)$ is the spatial variance feature of $X_g$; $N$ is the total number of pixels; $x_i$ is the value of the $i$-th pixel in $X_g$; $\mu$ is the average pixel value; the attention weights of each channel of $X_g$ are calculated and fused with $X_g$ by element-wise multiplication, and the result is processed by downsampling and depthwise separable convolution to generate nonlocal structural information; the variance statistic and the nonlocal structural information are then fused to generate a spatial modulation signal, which is aggregated with $X_g$ to obtain the global features; the input term $X_l$ is fed into the main feature branch and the edge enhancement branch of the local feature capture branch respectively to extract local main features and edge features, which are weighted and fused to output the local features; the global features and the local features are added element-wise, and a 1×1 convolutional layer performs dimensionality transformation to output the MSFA output features.

4. The method for super-resolution reconstruction of remote sensing images based on feature modulation according to claim 3, characterized in that the attention weights of each channel of the input term of the global feature capture branch are calculated by a spectral attention module; the spectral attention module consists of a global average pooling layer, a first convolutional layer, and a second convolutional layer; the global average pooling layer compresses each channel of the input term into a scalar to achieve spatial information aggregation; the compressed features pass through the first convolutional layer and the ReLU activation function, which introduce nonlinearity and select important features; the selected key features are then restored to the original number of channels through the second convolutional layer; finally, the normalized attention weights for each channel are generated using the Sigmoid activation function.

5. The method for super-resolution reconstruction of remote sensing images based on feature modulation according to claim 3, characterized in that the spatial modulation signal incorporates learnable scaling factors during the fusion generation process, the expression of which is: $M = \mathrm{Conv}_{1\times1}\left(\alpha \odot S + \beta \odot \sigma^2(X_g)\right)$; where $M$ is the spatial modulation signal; $\mathrm{Conv}_{1\times1}$ is a 1×1 convolution; $S$ is the nonlocal structural information; $\sigma^2(X_g)$ is the spatial variance statistic of the input term $X_g$ of the global feature capture branch; $\odot$ is the element-wise multiplication operation; and $\alpha$, $\beta$ are learnable scaling factors.

6. The method for super-resolution reconstruction of remote sensing images based on feature modulation according to claim 3, characterized in that the main feature branch is a convolutional multilayer perceptron architecture; in the main feature branch, information is first aggregated in the local spatial neighborhood of the input feature through depthwise separable convolution; then, the channel dimension is expanded to the preset hidden layer dimension using a 1×1 convolution and a nonlinear transformation is performed in combination with the GELU activation function; finally, the number of channels is restored by a 1×1 convolution to obtain the local main features; the edge enhancement branch is used to extract image edges or local high-frequency information of the input features: it captures the spatial gradient changes in each channel of the input features through depthwise separable convolution, then performs cross-channel information integration using a 1×1 convolution, and finally uses the Tanh hyperbolic tangent activation function for mapping to obtain the edge features.

7. The method for super-resolution reconstruction of remote sensing images based on feature modulation according to claim 1, characterized in that: in the denoising sensing part convolutional feedforward network (DNFN), the MSFA output features are expanded in the channel dimension by a 1×1 convolution combined with the GELU activation function to generate high-dimensional hidden layer features. The hidden layer features are divided along the channel dimension into a first sub-feature block and a second sub-feature block, wherein the number of channels of the first sub-feature block is set to one-quarter of the total number of channels of the hidden layer features, so as to perform partial channel denoising and balance noise suppression against detail preservation. A denoising convolutional block combined with the GELU activation function denoises the first sub-feature block, wherein the denoising convolutional block consists of cascaded denoising convolutional layers, a batch normalization layer, and an activation function: first, local structure and noise patterns are initially extracted through a 3×3 denoising convolutional layer; then, each channel is normalized and nonlinearly transformed through the batch normalization layer and the GELU activation function; finally, the features are refined through a 3×3 denoising convolutional layer to obtain the denoised features. The denoised features and the second sub-feature block are concatenated along the channel dimension, the original input dimension is restored through a 1×1 convolutional layer, and the modulated depth features are output.
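
The partial channel mechanism of claim 7 (split, denoise one quarter, concatenate) can be illustrated with a minimal sketch of the channel bookkeeping only. Several things here are assumptions rather than content of the claim: the first sub-feature block is taken to be the leading quarter of the channel indices, and a zero-padded 3×3 box blur stands in for the claimed convolution, batch normalization, and GELU block, which would need learned weights. The names `box_blur` and `partial_denoise` are hypothetical, and the 1×1 expansion and projection layers are omitted.

```python
def box_blur(channels):
    """Stand-in denoiser (NOT the claimed conv-BN-GELU-conv block):
    zero-padded 3x3 mean filter applied to each channel."""
    out = []
    for ch in channels:
        h, w = len(ch), len(ch[0])
        out.append([[sum(ch[ii][jj]
                         for ii in range(max(0, i - 1), min(h, i + 2))
                         for jj in range(max(0, j - 1), min(w, j + 2))) / 9.0
                     for j in range(w)] for i in range(h)])
    return out

def partial_denoise(hidden, denoise_fn):
    """Denoise only one quarter of the channels, pass the rest through."""
    n_first = len(hidden) // 4        # one-quarter of the hidden channels
    first, second = hidden[:n_first], hidden[n_first:]
    # Concatenate along the channel dimension: denoised block, then the rest.
    return denoise_fn(first) + second

# Eight hidden channels of size 3x3; every channel has a spike of 9 at its center.
hidden = [[[0.0, 0.0, 0.0], [0.0, 9.0, 0.0], [0.0, 0.0, 0.0]] for _ in range(8)]
out = partial_denoise(hidden, box_blur)
```

Only the first two of the eight channels pass through the denoiser, so the cost of the denoising block scales with a quarter of the hidden width while the remaining channels keep their detail unchanged.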

8. The method for super-resolution reconstruction of remote sensing images based on feature modulation according to claim 1, characterized in that: during pixel shuffling, the high-dimensional channel data of the depth features are filled into spatial neighborhood blocks according to a preset arrangement rule.
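
The rearrangement described in claim 8 can be illustrated with a short sketch. The claim does not specify the preset arrangement rule, so the sketch assumes the common convention in which, for an upscaling factor r, each group of r×r consecutive channels fills one r×r spatial block in row-major order. The function name `pixel_shuffle` and the parameter `r` are illustrative.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r) x H x W tensor into C x (H*r) x (W*r).
    Assumed rule: out[c][h*r + a][w*r + b] = x[c*r*r + a*r + b][h][w]."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = c_in // (r * r)
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(h * r):
            for j in range(w * r):
                # Position inside the r x r block selects the source channel.
                src = c * r * r + (i % r) * r + (j % r)
                out[c][i][j] = x[src][i // r][j // r]
    return out

# Four 1x1 channels become one 2x2 channel when r = 2.
x = [[[0.0]], [[1.0]], [[2.0]], [[3.0]]]
shuffled = pixel_shuffle(x, 2)   # [[[0.0, 1.0], [2.0, 3.0]]]
```

No values are computed or interpolated in this step; the spatial resolution grows by a factor of r in each direction purely by moving channel data into spatial positions.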

9. A remote sensing image super-resolution reconstruction apparatus based on feature modulation, used to implement the remote sensing image super-resolution reconstruction method based on feature modulation as described in any one of claims 1-8, characterized in that it comprises: a shallow feature extraction unit, used to acquire low-resolution remote sensing images and extract shallow feature maps; a feature modulation unit, used to feed the shallow feature map into a series of cascaded enhanced feature mixing modules (EFMB) for stepwise extraction and feature modulation of deep features, wherein each enhanced feature mixing module (EFMB) includes a multispectral self-modulated feature aggregation module (MSFA) and a denoising sensing part convolutional feedforward network (DNFN), and a residual connection is introduced between the MSFA and the DNFN; an MSFA dual-branch capture unit, used to adjust the number of channels of the shallow feature map using a convolutional layer within the MSFA, divide the result along the channel dimension into an input item for the global feature capture branch and an input item for the local feature capture branch to obtain the corresponding branch outputs, and then fuse and dimensionally transform the branch outputs to obtain the MSFA output features; a DNFN partial denoising unit, used to input the MSFA output features into the DNFN, divide them along the channel dimension into a first sub-feature block and a second sub-feature block to perform partial channel denoising, and, after denoising the first sub-feature block, concatenate it with the second sub-feature block to output the modulated depth features; and an image reconstruction unit, used to convert the channel dimension of the depth features to a preset size, upscale the features through pixel shuffling, and superimpose the result on the shallow features transmitted by the global residual connection to obtain a high-resolution remote sensing image.

10. A remote sensing image super-resolution reconstruction device based on feature modulation, characterized in that the device includes a processor and a memory, wherein the memory stores a computer program that can be executed by the processor to implement the feature modulation-based remote sensing image super-resolution reconstruction method as described in any one of claims 1-8.