A remote sensing super-resolution reconstruction method based on state space evolution and residual diffusion

By employing state-space evolution and residual diffusion techniques, the problems of long-distance time dependence and environmental interference in remote sensing image super-resolution are solved, enabling high-precision and high-detail remote sensing image reconstruction, which is applicable to fields such as land resource surveys, environmental monitoring, and urban planning.

CN122243746A · Pending · Publication Date: 2026-06-19 · CHINA UNIV OF MINING & TECH +1


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: CHINA UNIV OF MINING & TECH
Filing Date: 2026-05-20
Publication Date: 2026-06-19

Patent Timeline

20 May 2026: Application

19 Jun 2026: Publication (CN122243746A)

IPC: G06T3/4053; G06T3/4046; G06T5/70; G06T5/60; G06N3/0464; G06V20/13; G06V10/774; G06V10/30; G06V10/82; G06N3/0442

Application Domain: Image enhancement; Geometric image transformation

Abstract

This invention discloses a remote sensing super-resolution reconstruction method based on state-space evolution and residual diffusion, belonging to the fields of image processing and deep learning technology. The method acquires multi-temporal low-resolution remote sensing sequences from satellite payloads and performs radiometric and geometric preprocessing; it constructs a long-range feature extraction and alignment module, utilizing deformable convolution to achieve spatial topological alignment of cross-temporal features; it constructs a spatiotemporal collaborative fusion module, using a bidirectional state-space evolution mechanism to capture the temporal dependence of long sequences, and dynamically adjusts the step size through a content-aware gating mechanism to suppress interference from anomalous observations; it constructs a two-stage reconstruction network: the first stage performs structural reconstruction to generate a coarse high-resolution map, and the second stage uses a conditional residual diffusion model to recover high-frequency textures, which are then superimposed to obtain the final result. This invention solves the problems of low efficiency in long sequence modeling, poor robustness to anomalous frames, and texture smoothing, significantly improving the quality of remote sensing image reconstruction.


Description

Technical Field

[0001] This invention belongs to the field of image processing and deep learning technology, and specifically relates to a super-resolution reconstruction method for long-term remote sensing images.

Background Technology

[0002] With the rapid development of remote sensing technology, high-resolution remote sensing images play a crucial role in resource surveys, environmental monitoring, disaster assessment, and urban planning. However, due to the physical imaging mechanism of satellite remote sensing platforms, there is an inherent trade-off between spatial resolution and revisit frequency (temporal resolution) for sensors. To obtain observational data that combines high spatial detail and high temporal frequency, multi-image super-resolution (MISR) technology, which synthesizes high-resolution images from multiple frames of low-resolution images over time using algorithms, has become a current research hotspot.

[0003] Existing multi-image super-resolution technologies for remote sensing images typically face three core challenges. First, remote sensing image sequences often have long time spans, and traditional convolutional neural networks (CNNs) struggle to capture long-distance temporal dependencies due to their limited receptive fields. While Transformer-based architectures possess global modeling capabilities, the computational complexity of their self-attention mechanism increases quadratically with sequence length, resulting in significant computational overhead and memory pressure when processing long-sequence remote sensing data.

[0004] Secondly, remote sensing imaging is highly susceptible to interference from complex environmental factors. Long-term observation sequences often contain low-quality observation frames caused by cloud cover, sudden changes in illumination, or sensor noise. Most existing fusion methods employ simple cascading or weighted averaging, lacking the ability to autonomously identify and filter the quality of individual frames. This leads to the erroneous propagation of noise features from abnormal observations along the time axis, resulting in structural drift, brightness flicker, or artifact residue in the reconstructed sequence, severely impacting the robustness of the results.

[0005] Finally, the ground features captured by remote sensing are extremely complex, containing numerous fine building edges, vegetation textures, and road topologies. Traditional regression-based reconstruction methods (such as those trained with an L1 or L2 loss) tend toward the statistical average during optimization, which leads to over-smoothing under complex degradations and fails to recover true high-frequency details. Generative Adversarial Networks (GANs) improve visual quality to some extent, but their training is highly unstable and prone to producing artifacts that do not conform to physical characteristics. Achieving high-fidelity, high-detail remote sensing image reconstruction while ensuring temporal consistency therefore remains a critical open problem in this field.

Summary of the Invention

[0006] The purpose of this invention is to provide a long-term remote sensing image super-resolution method based on state space and residual diffusion. This method can effectively solve the problems of insufficient multi-scale feature fusion, poor dynamic adaptability, and weak detail recovery ability in traditional remote sensing image super-resolution technology.

[0007] To achieve the above objectives, this invention provides a remote sensing super-resolution reconstruction method based on state space evolution and residual diffusion, comprising the following steps:

[0008] S1. Obtain multi-temporal low-resolution image sequences, and construct a standardized time-series dataset after radiometric calibration, atmospheric correction and spatial alignment;

[0009] S2. Construct a long-range extraction module to extract deep features in the feature manifold space and achieve cross-temporal topology alignment through offset field learning;

[0010] S3. Use a bidirectional state-space mechanism to capture long-range temporal dependencies, and suppress cloud and fog noise through perception gating to generate multi-frame aggregated features;

[0011] S4. Apply two-stage network processing: first reconstruct the basic structure from the aggregated features, then recover the high-frequency texture through the residual diffusion model;

[0012] S5. Train on remote sensing datasets, introduce a temporal consistency loss constraint, and iteratively optimize the network weights through backpropagation.

[0013] Furthermore, the specific steps in S1 for acquiring multi-temporal low-resolution image sequences, performing radiometric calibration, atmospheric correction, and spatial alignment to construct a standardized time-series dataset include:

[0014] S1.1 Acquire multiple frames of observation images within the same orbital period or across periods, and convert the original observation values into absolute radiance or apparent reflectance using sensor calibration parameters;

[0015] S1.2. Use a high-precision digital elevation model and ground control points to perform geometric fine correction on the image. Use resampling technology to unify images from different time phases to the same geographic projection coordinate system to achieve pixel-level coarse alignment.

[0016] Furthermore, in S2, a long-range extraction module is constructed to extract deep features in the feature manifold space and achieve cross-temporal topological alignment through offset field learning, specifically including the following:

[0017] S2.1 Parallel encoding of multi-temporal features: a shallow feature extraction operator with shared weights is applied to every frame of the input sequence, yielding the feature sequence:

[0018] $F_t = E(I_t)$;

[0019] where $I_t$ denotes the $t$-th input frame, $E$ is the encoder, and $F_t$ is the extracted feature map with $C$ channels;

[0020] S2.2 Spatiotemporal offset field learning: a cascaded offset prediction network fuses the current-frame features $F_t$ with the reference-frame features $F_{ref}$. The Concat operator stacks the two feature matrices along the channel dimension, and the network computes the nonlinear deformation displacement field from the result:

[0021] $\Delta p_t = P(\mathrm{Concat}(F_t, F_{ref}))$;

[0022] where $P$ is the offset prediction network and $\Delta p_t$ contains the horizontal and vertical displacements of the convolution sampling points;

[0023] S2.3 Dynamic deformation feature alignment: using the predicted offsets $\Delta p_t$, the deformable convolution DConv resamples the features of neighboring frames by adding an offset to each standard convolution sampling position, so that the convolution kernel adjusts its sampling shape to the ground deformation:

[0024] $\tilde{F}_t = \mathrm{DConv}(F_t, \Delta p_t)$;

[0025] where $\tilde{F}_t$ is the aligned feature sequence.

[0026] Furthermore, S3 employs a bidirectional state-space mechanism to capture long-range temporal dependencies, suppresses cloud and fog noise through perceptual gating, and generates multi-frame aggregated features, specifically including the following:

[0027] S3.1 Content-aware gating weight generation: the global average pooling operator GAP computes the mean of each channel of the aligned feature map $\tilde{F}_t$, compressing it into a vector that represents global information. A convolutional layer followed by an activation function then produces the frame quality coefficient:

[0028] $g_t = \sigma(W_g \cdot \mathrm{GAP}(\tilde{F}_t) + b_g)$;

[0029] where $W_g$ is the weight matrix, $b_g$ is the bias term, and $\sigma$ is the activation function;

[0030] S3.2 Adaptive discretization step size calculation: the discretization step size is modulated by the gating weight:

[0031] $\Delta_t = g_t \cdot \tau(\Delta_0)$;

[0032] where $\Delta_0$ is the base step size parameter and $\tau$ is a smooth activation function that keeps the output step size strictly greater than zero and maintains the stability of the numerical evolution;

[0033] S3.3 Bidirectional evolution trajectory simulation: the forward scan state update and the backward scan state update are defined as follows:

[0034] $h_t^{f} = \bar{A}_f h_{t-1}^{f} + \bar{B}_f x_t$;

[0035] $h_t^{b} = \bar{A}_b h_{t+1}^{b} + \bar{B}_b x_t$;

[0036] where $h_t^{f}$ and $h_t^{b}$ are the forward and backward hidden states at time $t$, $h_{t-1}^{f}$ and $h_{t+1}^{b}$ are the evolution states at the adjacent time points, $\bar{A}_f$ and $\bar{A}_b$ are the state transition matrices, $\bar{B}_f$ and $\bar{B}_b$ are the input projection matrices, and $x_t$ is the input feature at the current moment. The two scans simulate the forward evolution and backward tracing of ground features along the time axis;

[0037] S3.4 Feature linear mapping: the hidden states obtained from the bidirectional scan are concatenated along the channel dimension and passed through a linear projection layer for dimensionality compression and information fusion, which outputs the aggregated features:

[0038] $F_{agg} = W_o \cdot \mathrm{Concat}(h_t^{f}, h_t^{b})$;

[0039] where $F_{agg}$ is the fused aggregated feature, $W_o$ is the weight of the linear projection layer used for dimensionality compression and information integration, and $\mathrm{Concat}(h_t^{f}, h_t^{b})$ denotes the concatenation of the forward and backward hidden states along the channel dimension.

[0040] Furthermore, the S4 process involves a two-stage network: first, the basic structure is reconstructed using aggregated features, and then high-frequency textures are recovered using a residual diffusion model. Specifically, this includes the following:

[0041] S4.1 First-stage structural reconstruction: the aggregated features are fed into the reconstruction branch, which consists of a residual channel attention network and sub-pixel convolutional layers and generates a coarse high-resolution image $I_{coarse}$ by enlarging the spatial dimensions:

[0042] $I_{coarse} = R(F_{agg})$;

[0043] where $I_{coarse}$ is the coarse high-resolution structural image generated by the reconstruction branch and $R$ is the mapping function of the reconstruction branch;

[0044] S4.2 Second-stage residual diffusion refinement: the high-frequency residual is defined as:

[0045] $r_0 = I_{HR} - I_{coarse}$;

[0046] where $I_{HR}$ is the true high-resolution image. A diffusion model is constructed to predict the noise during the reverse denoising process:

[0047] $\hat{\epsilon} = \epsilon_\theta(r_t, t, I_{coarse})$;

[0048] where $r_0$ is the true high-frequency detail residual, $I_{HR}$ is the ground-truth image, $\epsilon_\theta$ is the parameterized network that predicts the noise in the diffusion model, $t$ is the diffusion step, and $r_t$ is the noisy residual state at the $t$-th iteration step;

[0049] S4.3 High-frequency detail recovery: multi-step recursive sampling gradually removes the noise to obtain the predicted detail residual $\hat{r}_0$, which is superimposed on the coarse image to give the final result:

[0050] $I_{SR} = I_{coarse} + \hat{r}_0$;

[0051] where $I_{SR}$ is the final super-resolution reconstructed image and $\hat{r}_0$ is the predicted detail residual recovered by recursive sampling with the diffusion model.

[0052] Furthermore, in step S5, remote sensing datasets are used for training, a temporal consistency loss constraint is introduced, and the network weights are iteratively optimized through backpropagation. Specifically, this includes the following:

[0053] S5.1 To keep the reconstructed image accurate in spatial structure while it evolves smoothly in the temporal dimension, the method constructs a comprehensive loss function, calculated as follows:

[0054] $L_{total} = \lambda_1 L_{rec} + \lambda_2 L_{temp} + \lambda_3 L_{diff}$;

[0055] where $L_{total}$ is the total loss function during network training, and $\lambda_1$, $\lambda_2$ and $\lambda_3$ are the weight coefficients of the structural reconstruction loss, the temporal consistency loss, and the diffusion model denoising loss, respectively. Adjusting these hyperparameters balances static image fidelity against dynamic continuity;

[0056] S5.2 Pixel-level structural reconstruction loss: the pixel deviation between the predicted high-resolution image and the ground truth (GT) is calculated, typically with the $L_1$ norm, which yields sharper edges:

[0057] $L_{rec} = \frac{1}{N} \sum_{i=1}^{N} \left\| I_{HR}^{(i)} - I_{SR}^{(i)} \right\|_1$;

[0058] where $L_{rec}$ is the pixel-level reconstruction loss, $N$ is the total number of samples in a single training batch, $I_{HR}^{(i)}$ is the true high-resolution reference image corresponding to the $i$-th sample, and $I_{SR}^{(i)}$ is the predicted high-resolution image generated by the model;

[0059] S5.3 Temporal consistency loss constraint: to address brightness discontinuities, structural instability, and uneven temporal evolution between adjacent reconstructed frames in multi-temporal remote sensing image sequences, a temporal consistency loss is introduced. It constrains the consistency between adjacent frames through motion compensation:

[0060] $L_{temp} = \frac{1}{T-1} \sum_{t=2}^{T} \left\| I_{SR}^{t} - \mathcal{W}(I_{SR}^{t-1}, O_{t-1 \to t}) \right\|_1$;

[0061] where $L_{temp}$ is the temporal consistency loss, $T$ is the total number of time frames involved in the calculation, $I_{SR}^{t}$ and $I_{SR}^{t-1}$ are the reconstructed images at the current and previous moments, $\mathcal{W}$ is a spatial transformation operator based on pixel relocation, and $O_{t-1 \to t}$ is the estimated optical flow field or motion vector from frame $t-1$ to frame $t$. The operator maps pixels of the previous frame to the current frame along their motion trajectories, forcing adjacent frames to be physically consistent in their motion. The diffusion model residual denoising loss targets the recovery of high-frequency details; during diffusion training, the accuracy of the noise prediction is measured with a simple mean squared error:

[0062] $L_{diff} = \mathbb{E}_{r_0, \epsilon, t} \left[ \left\| \epsilon - \epsilon_\theta(r_t, t, I_{coarse}) \right\|_2^2 \right]$;

[0063] where $L_{diff}$ is the denoising objective of the diffusion model, $\mathbb{E}$ denotes the expectation, $\epsilon$ is the injected standard Gaussian random noise, $\epsilon_\theta(\cdot)$ is the noise predicted by the network, $r_t$ is the noisy state at diffusion step $t$, and $I_{coarse}$ is the conditional input that guides the model to generate detail residuals conforming to the original ground structure. Once the total loss $L_{total}$ is obtained, its gradient with respect to the parameters of each layer is computed by the chain rule through backpropagation, and the parameters are updated with the AdamW optimizer;

[0064] S5.4 Gradient calculation and parameter update: the network weights are updated according to the following iterative formula:

[0065] $\theta_{k+1} = \theta_k - \eta \nabla_\theta L_{total}(\theta_k)$;

[0066] where $\theta_k$ and $\theta_{k+1}$ are the network weight parameter matrices at the $k$-th and $(k+1)$-th iterations, $\eta$ is the preset learning rate, and $\nabla_\theta L_{total}(\theta_k)$ is the gradient vector of the total loss function with respect to the current weights, which guides the model in the direction of decreasing error. When $L_{total}$ stabilizes and the evaluation metrics no longer show significant improvement, the network is judged to have converged. The optimal weight parameters are then saved, completing the optimization process from state-space evolution to residual diffusion refinement, so that the network can perform stable long-term super-resolution reconstruction on remote sensing sequences of any time phase.

[0067] Beneficial Effects: By combining a bidirectional state-space evolution mechanism, cross-temporal topological alignment, and residual diffusion refinement, this invention improves the super-resolution reconstruction quality of long-term remote sensing image sequences. Offset field learning and deformable convolution achieve high-precision topological alignment in the feature manifold space, reducing the feature misalignment and detail loss caused by nonlinear deformation between multi-temporal observations. The perception-gated bidirectional state-space evolution mechanism uses an adaptive discretization step size to dynamically suppress abnormal observations such as cloud and fog occlusion and sensor noise, which keeps long-range temporal feature capture accurate and the system robust. The two-stage reconstruction strategy uses the first-stage structural reconstruction branch to generate a robust basic structure map and the recursive denoising sampling of the second-stage residual diffusion model to overcome the over-smoothing of traditional regression methods in texture restoration, enhancing high-frequency detail and visual realism. The temporal consistency loss, built on a pixel-relocation spatial transformation operator, strengthens the consistency of inter-frame features from a motion compensation perspective and suppresses visual flicker and abrupt feature changes in the reconstructed sequence. Through joint optimization of multi-level feature fusion and the diffusion scheduling strategy, the method improves the resolution, sharpness, and topological fidelity of remote sensing images, with particular advantages under complex, variable atmospheric conditions and in long-term dynamic monitoring. It can be applied to remote sensing image processing fields such as land resource surveys, dynamic environmental monitoring, disaster assessment, and urban planning, providing technical support for high-precision, high-efficiency remote sensing information extraction.

Attached Figure Description

[0068] Figure 1 is a schematic diagram of the overall process of the present invention;

[0069] Figure 2 is a flowchart of the cross-temporal feature space topology alignment process of the present invention;

[0070] Figure 3 is a diagram of the bidirectional state-space evolution guided by perception gating used in the present invention;

[0071] Figure 4 is a diagram of the two-stage residual reconstruction and loss optimization loop proposed in the present invention.

Detailed Implementation

[0072] The invention will now be further described with reference to the accompanying drawings.

[0073] Example

[0074] As shown in Figure 1, a remote sensing super-resolution reconstruction method based on state-space evolution and residual diffusion includes the following steps:

[0075] S1. Obtain multi-temporal low-resolution image sequences, and construct a standardized time-series dataset after radiometric calibration, atmospheric correction and spatial alignment;

[0076] S2. Construct a long-range extraction module to extract deep features in the feature manifold space and achieve cross-temporal topology alignment through offset field learning;

[0077] S3. Use a bidirectional state-space mechanism to capture long-range temporal dependencies, and suppress cloud and fog noise through perception gating to generate multi-frame aggregated features;

[0078] S4. Apply two-stage network processing: first reconstruct the basic structure from the aggregated features, then recover the high-frequency texture through the residual diffusion model;

[0079] S5. Train on remote sensing datasets, introduce a temporal consistency loss constraint, and iteratively optimize the network weights through backpropagation.

[0080] Furthermore, the specific steps in S1 for acquiring multi-temporal low-resolution image sequences, performing radiometric calibration, atmospheric correction, and spatial alignment to construct a standardized time-series dataset include:

[0081] S1.1 Acquire multiple frames of observation images within the same orbital period or across periods, and convert the original observation values into absolute radiance or apparent reflectance using sensor calibration parameters;

[0082] S1.2. Use a high-precision digital elevation model and ground control points to perform geometric fine correction on the image. Use resampling technology to unify images from different time phases to the same geographic projection coordinate system to achieve pixel-level coarse alignment.
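The radiometric conversion in S1.1 can be illustrated with a minimal Python sketch. It assumes the common linear calibration model (radiance = gain × DN + offset) and the standard top-of-atmosphere reflectance formula; the function names and every numeric value (gain, offset, solar irradiance `esun`, sun elevation) are hypothetical placeholders, not parameters given in this document.

```python
import math

def dn_to_radiance(dn, gain, offset):
    """Linear sensor calibration: radiance L = gain * DN + offset."""
    return gain * dn + offset

def radiance_to_toa_reflectance(radiance, esun, sun_elevation_deg, earth_sun_dist_au=1.0):
    """Apparent reflectance: rho = pi * L * d^2 / (ESUN * cos(solar zenith))."""
    solar_zenith = math.radians(90.0 - sun_elevation_deg)
    return math.pi * radiance * earth_sun_dist_au ** 2 / (esun * math.cos(solar_zenith))

# Hypothetical calibration values, for illustration only.
radiance = dn_to_radiance(dn=100, gain=0.5, offset=1.0)
reflectance = radiance_to_toa_reflectance(radiance, esun=1500.0, sun_elevation_deg=30.0)
```

In practice the gain, offset and solar irradiance would come from the sensor's metadata, and atmospheric correction would follow as a separate step.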

[0083] As shown in Figure 2, the long-range extraction module constructed in S2 extracts deep features in the feature manifold space and achieves cross-temporal topological alignment through offset field learning, specifically including the following:

[0084] S2.1 Parallel encoding of multi-temporal features: a shallow feature extraction operator with shared weights is applied to every frame of the input sequence, yielding the feature sequence:

[0085] $F_t = E(I_t)$;

[0086] where $I_t$ denotes the $t$-th input frame, $E$ is the encoder, and $F_t$ is the extracted feature map with $C$ channels;

[0087] S2.2 Spatiotemporal offset field learning: a cascaded offset prediction network fuses the current-frame features $F_t$ with the reference-frame features $F_{ref}$. The Concat operator stacks the two feature matrices along the channel dimension, and the network computes the nonlinear deformation displacement field from the result:

[0088] $\Delta p_t = P(\mathrm{Concat}(F_t, F_{ref}))$;

[0089] where $P$ is the offset prediction network and $\Delta p_t$ contains the horizontal and vertical displacements of the convolution sampling points;

[0090] S2.3 Dynamic deformation feature alignment: using the predicted offsets $\Delta p_t$, the deformable convolution DConv resamples the features of neighboring frames by adding an offset to each standard convolution sampling position, so that the convolution kernel adjusts its sampling shape to the ground deformation:

[0091] $\tilde{F}_t = \mathrm{DConv}(F_t, \Delta p_t)$;

[0092] where $\tilde{F}_t$ is the aligned feature sequence.
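The resampling idea behind S2.3 can be sketched in a few lines of plain Python. This is an illustrative single-channel, single-location version of a 3×3 deformable convolution with bilinear interpolation; the offsets are written by hand here, whereas in the method they would be predicted by the offset network. The function names are hypothetical.

```python
import math

def bilinear_sample(feat, y, x):
    """Sample a 2-D feature map at fractional (y, x); positions outside the map count as zero."""
    h, w = len(feat), len(feat[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy, wy in ((y0, 1.0 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1.0 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yy < h and 0 <= xx < w:
                value += wy * wx * feat[yy][xx]
    return value

def deformable_conv_at(feat, kernel, offsets, cy, cx):
    """3x3 deformable convolution response at (cy, cx).

    kernel:  3x3 weights.
    offsets: 3x3 grid of (dy, dx) pairs added to the regular sampling grid.
    """
    out = 0.0
    for i in range(3):
        for j in range(3):
            dy, dx = offsets[i][j]
            out += kernel[i][j] * bilinear_sample(feat, cy + i - 1 + dy, cx + j - 1 + dx)
    return out

feat = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
center_only = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
zero_offsets = [[(0.0, 0.0)] * 3 for _ in range(3)]
shifted = [row[:] for row in zero_offsets]
shifted[1][1] = (0.0, 0.5)  # move the centre tap half a pixel to the right

regular = deformable_conv_at(feat, center_only, zero_offsets, 1, 1)
deformed = deformable_conv_at(feat, center_only, shifted, 1, 1)
```

With zero offsets the operation reduces to an ordinary convolution; a non-zero offset makes the same kernel tap read from a displaced, interpolated position.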

[0093] As shown in Figure 3, S3 employs a bidirectional state-space mechanism to capture long-range temporal dependencies, suppresses cloud and fog noise through perception gating, and generates multi-frame aggregated features, specifically including the following:

[0094] S3.1 Content-aware gating weight generation: the global average pooling operator GAP computes the mean of each channel of the aligned feature map $\tilde{F}_t$, compressing it into a vector that represents global information. A convolutional layer followed by an activation function then produces the frame quality coefficient:

[0095] $g_t = \sigma(W_g \cdot \mathrm{GAP}(\tilde{F}_t) + b_g)$;

[0096] where $W_g$ is the weight matrix, $b_g$ is the bias term, and $\sigma$ is the activation function;

[0097] S3.2 Adaptive discretization step size calculation: the discretization step size is modulated by the gating weight:

[0098] $\Delta_t = g_t \cdot \tau(\Delta_0)$;

[0099] where $\Delta_0$ is the base step size parameter and $\tau$ is a smooth activation function that keeps the output step size strictly greater than zero and maintains the stability of the numerical evolution;
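A minimal Python sketch of the gating path in S3.1 and S3.2 follows. It makes several assumptions that the text does not fix: the activation producing the quality coefficient is taken to be a sigmoid, the convolution on the pooled vector is modelled as a plain weighted sum (equivalent to a 1×1 convolution), and the step size is taken as the gate multiplied by a softplus-positive base step. All names and values are hypothetical.

```python
import math

def global_average_pool(feature_map):
    """feature_map: list of channels, each a 2-D list. Returns one mean per channel."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]

def frame_quality(feature_map, weights, bias):
    """Quality coefficient in (0, 1): sigmoid of a weighted sum of the pooled channels."""
    pooled = global_average_pool(feature_map)
    logit = sum(w * p for w, p in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

def adaptive_step(quality, base_step_param):
    """Assumed modulation: step = quality * softplus(base), which is always positive."""
    return quality * math.log1p(math.exp(base_step_param))

# Two-channel toy feature map; the weights and bias are illustrative, not trained.
fmap = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 2.0], [2.0, 0.0]]]
g = frame_quality(fmap, weights=[1.0, -1.0], bias=0.0)
step = adaptive_step(g, base_step_param=0.0)
```

Under this assumed form, a frame judged to be of low quality receives a small step, so it barely changes the hidden state in the subsequent state-space update.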

[0100] S3.3 Bidirectional evolution trajectory simulation: the forward scan state update and the backward scan state update are defined as follows:

[0101] $h_t^{f} = \bar{A}_f h_{t-1}^{f} + \bar{B}_f x_t$;

[0102] $h_t^{b} = \bar{A}_b h_{t+1}^{b} + \bar{B}_b x_t$;

[0103] where $h_t^{f}$ and $h_t^{b}$ are the forward and backward hidden states at time $t$, $h_{t-1}^{f}$ and $h_{t+1}^{b}$ are the evolution states at the adjacent time points, $\bar{A}_f$ and $\bar{A}_b$ are the state transition matrices, $\bar{B}_f$ and $\bar{B}_b$ are the input projection matrices, and $x_t$ is the input feature at the current moment. The two scans simulate the forward evolution and backward tracing of ground features along the time axis;

[0104] S3.4 Feature linear mapping: the hidden states obtained from the bidirectional scan are concatenated along the channel dimension and passed through a linear projection layer for dimensionality compression and information fusion, which outputs the aggregated features:

[0105] $F_{agg} = W_o \cdot \mathrm{Concat}(h_t^{f}, h_t^{b})$;

[0106] where $F_{agg}$ is the fused aggregated feature, $W_o$ is the weight of the linear projection layer used for dimensionality compression and information integration, and $\mathrm{Concat}(h_t^{f}, h_t^{b})$ denotes the concatenation of the forward and backward hidden states along the channel dimension.
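The bidirectional scan of S3.3 and S3.4 can be sketched for a single scalar channel as follows. The sketch assumes a zero-order-hold discretization of a scalar continuous-time system (a common choice in state-space sequence models, not something this text specifies) and takes the per-frame step sizes as a given input. The parameters `a`, `b`, `w_fwd` and `w_bwd` are hypothetical.

```python
import math

def directional_scan(xs, steps, a=-1.0, b=1.0):
    """Scan h_t = a_bar * h_prev + b_bar * x_t with a per-frame discretization step."""
    h, states = 0.0, []
    for x, step in zip(xs, steps):
        a_bar = math.exp(step * a)     # discretized state transition
        b_bar = (a_bar - 1.0) / a * b  # discretized input projection
        h = a_bar * h + b_bar * x
        states.append(h)
    return states

def bidirectional_scan(xs, steps, w_fwd=0.5, w_bwd=0.5):
    """Run forward and backward scans, then fuse each (forward, backward) pair linearly."""
    fwd = directional_scan(xs, steps)
    bwd = directional_scan(xs[::-1], steps[::-1])[::-1]
    return [w_fwd * f + w_bwd * bk for f, bk in zip(fwd, bwd)]

# Frame 2 is an outlier (for example, a cloud-covered observation).
xs = [1.0, 1.0, 100.0, 1.0]
ungated_steps = [0.7, 0.7, 0.7, 0.7]
gated_steps = [0.7, 0.7, 1e-9, 0.7]  # a near-zero step on the outlier frame

ungated = directional_scan(xs, ungated_steps)
gated = directional_scan(xs, gated_steps)
fused = bidirectional_scan(xs, gated_steps)
```

With a near-zero step the transition factor approaches one and the input factor approaches zero, so the hidden state passes through the outlier frame almost unchanged. This is the mechanism by which a small step size suppresses an anomalous observation.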

[0107] As shown in Figure 4, S4 applies two-stage network processing: first, the basic structure is reconstructed from the aggregated features, and then high-frequency textures are recovered with a residual diffusion model. Specifically, this includes the following:

[0108] S4.1 First-stage structural reconstruction: the aggregated features are fed into the reconstruction branch, which consists of a residual channel attention network and sub-pixel convolutional layers and generates a coarse high-resolution image $I_{coarse}$ by enlarging the spatial dimensions:

[0109] $I_{coarse} = R(F_{agg})$;

[0110] where $I_{coarse}$ is the coarse high-resolution structural image generated by the reconstruction branch and $R$ is the mapping function of the reconstruction branch;

[0111] S4.2 Second-stage residual diffusion refinement: the high-frequency residual is defined as:

[0112] $r_0 = I_{HR} - I_{coarse}$;

[0113] where $I_{HR}$ is the true high-resolution image. A diffusion model is constructed to predict the noise during the reverse denoising process:

[0114] $\hat{\epsilon} = \epsilon_\theta(r_t, t, I_{coarse})$;

[0115] where $r_0$ is the true high-frequency detail residual, $I_{HR}$ is the ground-truth image, $\epsilon_\theta$ is the parameterized network that predicts the noise in the diffusion model, $t$ is the diffusion step, and $r_t$ is the noisy residual state at the $t$-th iteration step;

[0116] S4.3 High-frequency detail recovery: multi-step recursive sampling gradually removes the noise to obtain the predicted detail residual $\hat{r}_0$, which is superimposed on the coarse image to give the final result:

[0117] $I_{SR} = I_{coarse} + \hat{r}_0$;

[0118] where $I_{SR}$ is the final super-resolution reconstructed image and $\hat{r}_0$ is the predicted detail residual recovered by recursive sampling with the diffusion model.
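The mechanics of S4.2 and S4.3 can be illustrated with a toy Python sketch on a four-pixel signal. It assumes a linear noise schedule and the standard DDPM ancestral sampling rule, neither of which is specified in this text. The trained noise network is replaced by an oracle that already knows the true residual, so the exact recovery shown here demonstrates only the sampling loop and the final superposition, not the quality of any real model. All names and values are hypothetical.

```python
import math
import random

def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule and the cumulative products alpha_bar_t (assumed schedule)."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1) for i in range(num_steps)]
    alpha_bars, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return betas, alpha_bars

def sample_residual(eps_model, cond, length, betas, alpha_bars, rng):
    """Reverse process: start from Gaussian noise and denoise step by step."""
    r = [rng.gauss(0.0, 1.0) for _ in range(length)]
    for t in reversed(range(len(betas))):
        eps = eps_model(r, t, cond)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(ri - coef * ei) / math.sqrt(1.0 - betas[t]) for ri, ei in zip(r, eps)]
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0  # no noise is added at the last step
        r = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
    return r

def reconstruct(coarse, residual):
    """Final image = coarse structure + predicted high-frequency residual."""
    return [c + d for c, d in zip(coarse, residual)]

hr = [0.2, 0.9, 0.4, 0.7]        # toy ground-truth signal
coarse = [0.3, 0.7, 0.5, 0.6]    # toy first-stage output
true_residual = [h - c for h, c in zip(hr, coarse)]
betas, alpha_bars = make_schedule(50)

def oracle_eps(r_t, t, cond):
    """Stand-in for the trained network: solves the forward relation for the noise."""
    s, n = math.sqrt(alpha_bars[t]), math.sqrt(1.0 - alpha_bars[t])
    return [(ri - s * r0) / n for ri, r0 in zip(r_t, true_residual)]

predicted = sample_residual(oracle_eps, coarse, len(coarse), betas, alpha_bars, random.Random(0))
sr = reconstruct(coarse, predicted)
```

Modelling only the residual keeps the diffusion target small and centred near zero, while the coarse image carries the large-scale structure and acts as the condition passed to the noise predictor.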

[0119] As shown in Figure 4, in step S5 remote sensing datasets are used for training, a temporal consistency loss constraint is introduced, and the network weights are iteratively optimized through backpropagation. Specifically, this includes the following:

[0120] S5.1 To keep the reconstructed image accurate in spatial structure while it evolves smoothly in the temporal dimension, the method constructs a comprehensive loss function, calculated as follows:

[0121] $L_{total} = \lambda_1 L_{rec} + \lambda_2 L_{temp} + \lambda_3 L_{diff}$;

[0122] where $L_{total}$ is the total loss function during network training, and $\lambda_1$, $\lambda_2$ and $\lambda_3$ are the weight coefficients of the structural reconstruction loss, the temporal consistency loss, and the diffusion model denoising loss, respectively. Adjusting these hyperparameters balances static image fidelity against dynamic continuity;

[0123] S5.2 Pixel-level structural reconstruction loss: the pixel deviation between the predicted high-resolution image and the ground truth (GT) is calculated, typically with the $L_1$ norm, which yields sharper edges:

[0124] $L_{rec} = \frac{1}{N} \sum_{i=1}^{N} \left\| I_{HR}^{(i)} - I_{SR}^{(i)} \right\|_1$;

[0125] where $L_{rec}$ is the pixel-level reconstruction loss, $N$ is the total number of samples in a single training batch, $I_{HR}^{(i)}$ is the true high-resolution reference image corresponding to the $i$-th sample, and $I_{SR}^{(i)}$ is the predicted high-resolution image generated by the model;

[0126] S5.3 Temporal consistency loss constraint: to address brightness discontinuities, structural instability, and uneven temporal evolution between adjacent reconstructed frames in multi-temporal remote sensing image sequences, a temporal consistency loss is introduced. It constrains the consistency between adjacent frames through motion compensation:

[0127] $L_{temp} = \frac{1}{T-1} \sum_{t=2}^{T} \left\| I_{SR}^{t} - \mathcal{W}(I_{SR}^{t-1}, O_{t-1 \to t}) \right\|_1$;

[0128] where $L_{temp}$ is the temporal consistency loss, $T$ is the total number of time frames involved in the calculation, $I_{SR}^{t}$ and $I_{SR}^{t-1}$ are the reconstructed images at the current and previous moments, $\mathcal{W}$ is a spatial transformation operator based on pixel relocation, and $O_{t-1 \to t}$ is the estimated optical flow field or motion vector from frame $t-1$ to frame $t$. The operator maps pixels of the previous frame to the current frame along their motion trajectories, forcing adjacent frames to be physically consistent in their motion. The diffusion model residual denoising loss targets the recovery of high-frequency details; during diffusion training, the accuracy of the noise prediction is measured with a simple mean squared error:

[0129] $L_{diff} = \mathbb{E}_{r_0, \epsilon, t} \left[ \left\| \epsilon - \epsilon_\theta(r_t, t, I_{coarse}) \right\|_2^2 \right]$;

[0130] where $L_{diff}$ is the denoising objective of the diffusion model, $\mathbb{E}$ denotes the expectation, $\epsilon$ is the injected standard Gaussian random noise, $\epsilon_\theta(\cdot)$ is the noise predicted by the network, $r_t$ is the noisy state at diffusion step $t$, and $I_{coarse}$ is the conditional input that guides the model to generate detail residuals conforming to the original ground structure. Once the total loss $L_{total}$ is obtained, its gradient with respect to the parameters of each layer is computed by the chain rule through backpropagation, and the parameters are updated with the AdamW optimizer;
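The three loss terms of S5.1 to S5.3 can be sketched on one-dimensional toy signals. The sketch assumes an L1 penalty for the temporal term, an integer-valued flow with border clamping for the pixel-relocation operator, and illustrative weight coefficients; none of these specifics are fixed by the text, and all function names are hypothetical.

```python
def l1_loss(pred, target):
    """Mean absolute pixel deviation."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def mse_loss(pred, target):
    """Mean squared error, used here for the noise-prediction term."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def warp_1d(frame, flow):
    """Pixel relocation: output[i] reads frame[i - flow[i]], clamped at the borders."""
    n = len(frame)
    return [frame[min(max(i - flow[i], 0), n - 1)] for i in range(n)]

def temporal_loss(frames, flows):
    """Mean L1 gap between each frame and the motion-compensated previous frame."""
    terms = [l1_loss(frames[t], warp_1d(frames[t - 1], flows[t - 1]))
             for t in range(1, len(frames))]
    return sum(terms) / len(terms)

def total_loss(l_rec, l_temp, l_diff, lambdas=(1.0, 0.1, 1.0)):
    """Weighted sum of the three terms; the lambda values are illustrative."""
    return lambdas[0] * l_rec + lambdas[1] * l_temp + lambdas[2] * l_diff

# The second frame is the first frame shifted right by one pixel.
frames = [[0.0, 1.0, 2.0, 3.0], [0.0, 0.0, 1.0, 2.0]]
with_flow = temporal_loss(frames, flows=[[1, 1, 1, 1]])
without_flow = temporal_loss(frames, flows=[[0, 0, 0, 0]])
```

When the flow matches the true shift, the warped previous frame coincides with the current frame and the temporal term vanishes; without motion compensation the same pair of frames is penalized even though the scene only moved.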

[0131] S5.4 Gradient Calculation and Parameter Update: The network weights are updated according to the following iterative formula:

[0132] $W_{k+1} = W_k - \eta \nabla_{W} \mathcal{L}_{total}(W_k)$;

[0133] In the formula, $W_{k+1}$ and $W_k$ denote the network weight parameter matrices at the $(k+1)$-th and $k$-th iterations, respectively, $\eta$ is the preset learning rate, and $\nabla_{W} \mathcal{L}_{total}(W_k)$ is the gradient vector of the total loss function with respect to the current weights, which guides the model to evolve in the direction of error reduction. When $\mathcal{L}_{total}$ stabilizes and the evaluation indicators no longer improve significantly, the network is deemed to have converged. At this point the optimal weight parameters are saved, completing the entire optimization process from state-space evolution to residual diffusion refinement and enabling long-term stable super-resolution reconstruction of any temporal remote sensing sequence.
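The iterative update and the convergence test can be illustrated with a toy optimisation loop. This is a sketch under assumptions: the loss is a simple quadratic rather than the network's total loss, convergence is declared when the loss has not improved by more than a tolerance for a fixed number of iterations, and the step is the plain gradient step of the iterative formula. The moment estimates and decoupled weight decay of AdamW are not implemented here. The names `train`, `tol` and `patience` are hypothetical.

```python
def train(weights, loss_and_grad, learning_rate=0.1, tol=1e-10, patience=3, max_iters=10_000):
    """Gradient descent that keeps the best weights and stops once the loss stalls."""
    best_loss = float("inf")
    best_weights = list(weights)
    stalled = 0
    for _ in range(max_iters):
        loss, grad = loss_and_grad(weights)
        if best_loss - loss > tol:
            # Significant improvement: remember these weights.
            best_loss, best_weights, stalled = loss, list(weights), 0
        else:
            stalled += 1
            if stalled >= patience:
                break  # the loss has stabilised, treat as converged
        # Plain gradient step: new weights = old weights - learning rate * gradient.
        weights = [w - learning_rate * g for w, g in zip(weights, grad)]
    return best_weights, best_loss


target = [1.0, -2.0]


def quadratic(weights):
    """Toy loss: squared distance to a fixed target, with its analytic gradient."""
    loss = sum((w - t) ** 2 for w, t in zip(weights, target))
    grad = [2.0 * (w - t) for w, t in zip(weights, target)]
    return loss, grad


best_weights, best_loss = train([0.0, 0.0], quadratic)
```

Keeping a copy of the best weights seen so far mirrors the step of saving the optimal parameters once the indicators stop improving.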

[0134] The above embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention. The scope of protection of the present invention should be determined by the scope of protection of the appended claims.

Claims

1. A remote sensing super-resolution reconstruction method based on state-space evolution and residual diffusion, characterized in that it includes the following steps: S1. Obtain multi-temporal low-resolution image sequences, and construct a standardized time-series dataset after radiometric calibration, atmospheric correction and spatial alignment; S2. Construct a long-range feature extraction module to extract deep features in the feature manifold space, and achieve cross-temporal topological alignment through offset field learning; S3. Use a bidirectional state-space mechanism to capture long-range temporal dependencies, and suppress cloud and fog noise through content-aware gating to generate multi-frame aggregated features; S4. Perform two-stage network processing: first reconstruct the basic structure from the aggregated features, then recover the high-frequency texture through the residual diffusion model; S5. Train on remote sensing datasets, introduce a temporal consistency loss constraint, and iteratively optimize the network weights through backpropagation.

2. The remote sensing super-resolution reconstruction method based on state-space evolution and residual diffusion according to claim 1, characterized in that S1 includes the following steps: S1.1 Acquire multiple frames of observation images within the same orbital period or across periods, and convert the original observation values into absolute radiance or apparent reflectance using sensor calibration parameters; S1.2 Use a high-precision digital elevation model and ground control points to perform fine geometric correction on the images, and use resampling to unify images from different time phases into the same geographic projection coordinate system, achieving pixel-level coarse alignment.

3. The remote sensing super-resolution reconstruction method based on state-space evolution and residual diffusion according to claim 1, characterized in that S2 includes the following steps: S2.1 Parallel encoding of multi-temporal features: a shallow feature extraction operator with shared weights is applied to the input sequence to obtain the feature sequence $F_t = E(I_t)$, where $I_t$ denotes the $t$-th input frame, $E$ is the encoder, and $F_t$ is the extracted feature map with $C$ channels; S2.2 Spatiotemporal offset field learning: a cascaded offset prediction network fuses the current-frame features $F_t$ with the reference-frame features $F_{ref}$ through the Concat operator, which concatenates and stacks the two feature matrices along the channel dimension, and computes the nonlinear deformation displacement field $\Delta p_t = \mathcal{P}(\mathrm{Concat}(F_t, F_{ref}))$, where $\mathcal{P}$ is the prediction network and $\Delta p_t$ contains the horizontal and vertical displacements of the convolution sampling points; S2.3 Dynamic deformation feature alignment: using the predicted offsets $\Delta p_t$, the deformable convolution DConv resamples the features of neighboring frames by adding the offsets to the standard convolution sampling positions, so that the convolution kernel adapts its sampling shape to ground deformation: $\hat{F}_t = \mathrm{DConv}(F_t, \Delta p_t)$, where $\hat{F}_t$ is the aligned feature sequence.

4. The remote sensing super-resolution reconstruction method based on state-space evolution and residual diffusion according to claim 1, characterized in that S3 includes the following steps: S3.1 Content-aware gating weight generation: the global average pooling operator GAP computes the mean of each channel of the feature map $\hat{F}_t$, compressing it into a vector that represents global information, and the frame quality coefficient is computed with a convolutional layer and an activation function $\sigma$: $g_t = \sigma(W_g \cdot \mathrm{GAP}(\hat{F}_t) + b_g)$, where $W_g$ is the weight matrix and $b_g$ is the bias term; S3.2 Adaptive discretization step size calculation: the discretization step size $\Delta_t$ is modulated by the gating weight $g_t$ through a smooth activation function, which ensures that the output step size is always greater than zero and keeps the numerical evolution stable; S3.3 Bidirectional evolution trajectory simulation: the forward scan state update and the backward scan state update are defined as $h_t^{f} = \bar{A}^{f} h_{t-1}^{f} + \bar{B}^{f} x_t$ and $h_t^{b} = \bar{A}^{b} h_{t+1}^{b} + \bar{B}^{b} x_t$, where $h_t^{f}$ and $h_t^{b}$ are the forward and backward hidden states at time $t$, $h_{t-1}^{f}$ and $h_{t+1}^{b}$ are the evolution states at the adjacent time points, $\bar{A}^{f}$ and $\bar{A}^{b}$ are the state transition matrices, $\bar{B}^{f}$ and $\bar{B}^{b}$ are the input projection matrices, and $x_t$ is the input feature at the current moment, thereby simulating the forward evolution and backward tracing of ground features along the time axis; S3.4 Feature linear mapping: the hidden states obtained from the bidirectional scan are concatenated along the channel dimension and passed through a linear projection layer for dimensionality compression and information fusion, outputting the aggregated features $F_{agg} = W_{out} \cdot \mathrm{Concat}(h_t^{f}, h_t^{b})$, where $F_{agg}$ denotes the aggregated features after fusion, $W_{out}$ denotes the weights of the linear projection layer used for dimensionality compression and information integration, and $\mathrm{Concat}(h_t^{f}, h_t^{b})$ denotes the concatenation of the forward and backward hidden states along the channel dimension.

5. The remote sensing super-resolution reconstruction method based on state-space evolution and residual diffusion according to claim 1, characterized in that S4 includes the following steps: S4.1 First-stage structural reconstruction: the aggregated features $F_{agg}$ are input into the reconstruction branch, which consists of a residual channel attention network and sub-pixel convolutional layers, and a coarse high-resolution image is generated by enlarging the spatial dimensions: $I_{coarse} = \mathcal{R}(F_{agg})$, where $I_{coarse}$ is the coarse high-resolution structural image generated by the reconstruction branch and $\mathcal{R}$ is the mapping function of the reconstruction branch; S4.2 Second-stage residual diffusion refinement: the high-frequency residual is defined as $r_0 = I_{HR} - I_{coarse}$, where $I_{HR}$ is the true high-resolution image, and a diffusion model is constructed in which a parameterized network $\epsilon_\theta$ predicts the noise during the reverse denoising process, where $r_t$ is the noisy residual state at the $t$-th iteration step, $r_0$ is the true high-frequency detail residual, $I_{HR}$ is the ground-truth image, and $t$ is the diffusion step index; S4.3 High-frequency detail recovery: noise is removed gradually through multi-step recursive sampling to obtain the predicted detail residual $\hat{r}_0$, and the final result is obtained by superposition: $I_{SR} = I_{coarse} + \hat{r}_0$, where $I_{SR}$ is the final super-resolution reconstructed image and $\hat{r}_0$ is the predicted detail residual recovered through recursive sampling with the diffusion model.

6. The remote sensing super-resolution reconstruction method based on state-space evolution and residual diffusion according to claim 1, characterized in that S5 includes the following steps: S5.1 To ensure that the reconstructed image is highly accurate in spatial structure while evolving smoothly in the temporal dimension, a comprehensive loss function is constructed as follows: $\mathcal{L}_{total} = \lambda_1 \mathcal{L}_{rec} + \lambda_2 \mathcal{L}_{temp} + \lambda_3 \mathcal{L}_{diff}$, where $\mathcal{L}_{total}$ is the total loss function during network training, and $\lambda_1$, $\lambda_2$ and $\lambda_3$ are the weight coefficients of the structural reconstruction loss, the temporal consistency loss and the diffusion model denoising loss, respectively; adjusting these hyperparameters balances static image fidelity against dynamic continuity; S5.2 Pixel-level structural reconstruction loss: the pixel deviation between the predicted high-resolution image and the ground truth (GT) is calculated, typically using the $L_1$ norm to obtain sharper edges: $\mathcal{L}_{rec} = \frac{1}{N} \sum_{i=1}^{N} \left\| I_{HR}^{(i)} - I_{SR}^{(i)} \right\|_1$, where $\mathcal{L}_{rec}$ is the pixel-level reconstruction loss, $N$ is the total number of samples in a single training batch, $I_{HR}^{(i)}$ is the true high-resolution reference image corresponding to the $i$-th sample, and $I_{SR}^{(i)}$ is the predicted high-resolution image generated by the model; S5.3 Temporal consistency loss constraint: to address brightness discontinuities, structural instability, or uneven temporal evolution between adjacent reconstructed frames in multi-temporal remote sensing image sequences, a temporal consistency loss is introduced, which constrains the consistency between adjacent frames through motion compensation: $\mathcal{L}_{temp} = \frac{1}{T-1} \sum_{t=2}^{T} \left\| I_{SR}^{t} - \mathcal{W}\left(I_{SR}^{t-1}, F_{t-1 \to t}\right) \right\|_1$, where $\mathcal{L}_{temp}$ is the temporal consistency loss function, $T$ is the total number of temporal frames involved in the calculation, $I_{SR}^{t}$ and $I_{SR}^{t-1}$ are the reconstructed images at the current and previous moments, respectively, $\mathcal{W}(\cdot)$ is a spatial transformation operator based on pixel relocation, and $F_{t-1 \to t}$ is the estimated optical flow field or motion vector pointing from frame $t-1$ to frame $t$; this operator maps pixels from the previous frame to the current frame along their motion trajectories, thereby forcing adjacent frames to remain physically consistent along those trajectories; the residual denoising loss of the diffusion model targets the recovery of high-frequency details, and during the diffusion training phase the accuracy of noise prediction is measured with a simple mean squared error: $\mathcal{L}_{diff} = \mathbb{E}_{r_0, \epsilon, t}\left[ \left\| \epsilon - \epsilon_\theta\left(r_t, t, c\right) \right\|_2^2 \right]$, where $\mathcal{L}_{diff}$ is the denoising objective function of the diffusion model, $\mathbb{E}$ denotes the expectation operation, $\epsilon$ is the injected random standard Gaussian noise, $\epsilon_\theta$ is the noise predicted by the network, $r_t$ is the noisy state at diffusion step $t$, and $c$ is the conditional input, which guides the model to generate detail residuals that conform to the original ground structure; the network weights are optimized iteratively through backpropagation: once the total loss $\mathcal{L}_{total}$ is obtained, its gradient with respect to the parameters of each layer in the network is computed with the chain rule, and the parameters are updated using the AdamW optimizer; S5.4 Gradient calculation and parameter update: the network weights are updated according to the following iterative formula: $W_{k+1} = W_k - \eta \nabla_{W} \mathcal{L}_{total}(W_k)$, where $W_{k+1}$ and $W_k$ denote the network weight parameter matrices at the $(k+1)$-th and $k$-th iterations, respectively, $\eta$ is the preset learning rate, and $\nabla_{W} \mathcal{L}_{total}(W_k)$ is the gradient vector of the total loss function with respect to the current weights, which guides the model to evolve in the direction of error reduction; when $\mathcal{L}_{total}$ stabilizes and the evaluation indicators no longer improve significantly, the network is deemed to have converged; at this point the optimal weight parameters are saved, completing the entire optimization process from state-space evolution to residual diffusion refinement and enabling long-term stable super-resolution reconstruction of any temporal remote sensing sequence.