Intelligent super-resolution method of DEM based on high spatial resolution remote sensing data
By constructing a dual-branch collaborative deep learning framework and a terrain-aware cross-attention module, high-resolution remote sensing images are used to guide the generation of high-precision DEMs from low-resolution DEMs, solving the problems of high cost and long time consumption in traditional methods, and achieving efficient and accurate DEM reconstruction.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CENTRAL SOUTH UNIVERSITY OF FORESTRY AND TECHNOLOGY
- Filing Date
- 2025-07-24
- Publication Date
- 2026-06-26
AI Technical Summary
Existing technologies suffer from high costs, long processing times, limited coverage, and long data update cycles when generating high-resolution digital elevation models (DEMs). Furthermore, the limitations of local receptive fields and parameter sharing in traditional convolutional neural networks make it difficult for the models to capture complex terrain details, resulting in insufficient multimodal data fusion and inadequate accuracy in terrain feature reconstruction.
A DEM intelligent super-resolution method based on high spatial resolution remote sensing data is adopted to obtain high-precision, high-resolution DEMs. A dual-branch collaborative deep learning framework is constructed in which high-resolution remote sensing images serve as guiding information. Multimodal features are fused through a terrain-aware cross-attention module to generate a high-precision, high-resolution DEM.
It achieves cost-effective generation of high-precision, high-resolution DEMs, and can adaptively fuse features from different sources, improving the accuracy and efficiency of terrain feature reconstruction and solving the problems of high cost and long time consumption in traditional methods.
Figure CN120932067B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of digital elevation model processing technology, and specifically to a DEM intelligent super-resolution method based on high spatial resolution remote sensing data.
Background Technology
[0002] Digital elevation models (DEMs), as a digital representation of landforms, have crucial application value in many fields such as hydrological analysis, geomorphological research, disaster assessment, urban planning, infrastructure construction, and military defense. High-resolution DEMs can provide more detailed topographic information, thereby supporting more accurate analysis and decision-making. However, traditional methods for acquiring high-resolution DEMs, such as LiDAR scanning and aerial photogrammetry, while highly accurate, typically face problems such as high cost, long processing time, limited coverage, and long data update cycles, making it difficult to meet the needs of rapid, large-scale, and high-frequency applications.
[0003] In recent years, deep learning-based image super-resolution techniques have made significant progress, providing a new technical approach for cost-effectively generating high-resolution (HR) DEMs from low-resolution (LR) DEM data. These methods attempt to learn the mapping relationship from LR to HR to reconstruct lost high-frequency terrain details. Meanwhile, high-resolution remote sensing imagery (such as multispectral satellite imagery) is relatively easy to acquire and has a strong correlation with surface elevation; for example, features such as texture, shadows, and feature distribution in the imagery can often indirectly reflect changes in terrain undulation. Therefore, utilizing high-resolution remote sensing imagery as auxiliary information to guide DEM super-resolution reconstruction has become a promising research direction.
[0004] Despite the potential of deep learning-based DEM super-resolution methods, existing techniques still have some inherent limitations and shortcomings:
[0005] Limitations of traditional Convolutional Neural Networks (CNNs): Local receptive field limitation: Traditional CNNs mainly expand their receptive field by stacking convolutional layers, but their inherent local operation characteristics limit the model's ability to model global contextual information and long-range spatial dependencies. This is particularly disadvantageous for DEM reconstruction, because terrain features (such as mountains and valleys) often have large-scale continuity and correlation.
[0006] Insufficient parameter sharing and feature diversity capture: Convolutional kernels in CNNs share parameters across the entire feature map. While this improves parameter efficiency, it may also make it difficult for the model to effectively capture the local diversity and subtle changes in the input data (especially in complex terrain regions), thus affecting the model's generalization performance and ability to recover fine terrain details.
[0007] The scarcity of high-resolution DEM training data: Training deep learning models typically requires a large amount of labeled data. However, compared to readily available high-resolution satellite imagery, high-quality, large-scale high-resolution DEM data remains relatively scarce and costly to acquire, which significantly limits the application and performance improvement of fully supervised deep learning methods in DEM super-resolution tasks.
[0008] Challenges of multimodal data fusion: Although there is a strong correlation between high-resolution remote sensing imagery and DEM topographic features, how to effectively fuse these two types of multimodal data from different sources and with different characteristics (e.g., DEM is single-channel elevation data, while remote sensing imagery is usually multi-channel spectral data) remains a key challenge.
[0009] Existing fusion strategies are often relatively simple, such as early feature concatenation or element-wise addition, and make it difficult to fully utilize the rich information in the guide image. In particular, when the dimensions (e.g., number of channels) of the guide image features and the DEM features are inconsistent, direct fusion may lead to information loss or require complex preprocessing. This makes it difficult to preserve the fine texture and structural information of the high-resolution guide image and use it to guide DEM reconstruction effectively.
[0010] Insufficient accuracy in preserving and reconstructing terrain features: While many existing super-resolution methods improve the spatial resolution of DEMs, they may struggle to accurately recover key terrain features, such as the sharpness of ridgelines, the depth of valleys, and topographic parameters like slope and aspect. The reconstructed DEM may exhibit over-smoothing, blurred details, or artifacts, affecting subsequent terrain analysis and applications.
Summary of the Invention
[0011] To address the shortcomings of existing technologies, the purpose of this invention is to provide an intelligent super-resolution method for DEMs based on high spatial resolution remote sensing data. This method utilizes readily available high-resolution remote sensing images as guiding information to direct the super-resolution reconstruction process of low-resolution DEMs, thereby enabling the generation of high-resolution DEM products with high accuracy and rich terrain details at a lower cost and higher efficiency.
[0012] Other features and advantages of this application will become apparent from the following detailed description, or may be learned in part from practice of this application.
[0013] According to a first aspect of this application, a DEM intelligent super-resolution method based on high spatial resolution remote sensing data is provided, comprising:
[0014] Input a high-resolution remote sensing image into the guided selection branch module to extract a first shallow feature, and input a low-resolution DEM into the terrain reconstruction branch module to output a second shallow feature. The first and second shallow features are processed interactively by the terrain-aware cross-attention module. The resulting first deep feature is fed back to the guided selection branch module to obtain guidance information, and the resulting second deep feature is fed back to the terrain reconstruction branch module to obtain DEM features.
[0015] The guidance information and the DEM features undergo multi-source feature fusion to obtain fused features;
[0016] The guiding information, the DEM features, and the fused features are trained using a total loss function until the total loss function converges, and the target high-resolution DEM is output.
[0017] In some embodiments of this application, based on the foregoing scheme, the guided selection branch module includes a guided feature optimization stage comprising a plurality of sequentially arranged first Swin Transformer units. The first shallow feature is input into the first of these first Swin Transformer units, and the sequentially arranged first Swin Transformer units extract the first deep feature in turn.
[0018] The terrain reconstruction branch module includes a DEM feature optimization stage comprising several sequentially arranged second Swin Transformer units. The second shallow feature is input into the first of these second Swin Transformer units, and the sequentially arranged second Swin Transformer units extract the second deep feature in turn.
[0019] Each pair of the first Swin Transformer unit and the second Swin Transformer unit is provided with a terrain-aware cross-attention module. The terrain-aware cross-attention module is used to alternately perform terrain cross-attention transformation on the first deep feature and the second deep feature, output the transformed first deep feature to the next first Swin Transformer unit and the next terrain-aware cross-attention module, and output the transformed second deep feature to the next second Swin Transformer unit and the next terrain-aware cross-attention module.
[0020] In some embodiments of this application, based on the foregoing scheme, the terrain-aware cross-attention module includes a terrain-aware cross-attention unit and a feedforward network unit, and the calculation formula for the terrain-aware cross-attention unit is as follows:
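[0021] $$\mathrm{TAA}(Q,K,V)=\mathrm{Softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$$
[0022] where $\mathrm{Softmax}(\cdot)$ denotes the normalized exponential function. $Q$, $K$, and $V$ denote the query vector, key vector, and value vector, respectively; the query is derived from $F_1$ and the key and value from $F_2$. $d$ is the channel dimension of the query or key. $F_1$ and $F_2$ denote the first deep feature and the second deep feature, respectively.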
[0023] The feedforward network unit consists of two multi-layer perceptron layers and a terrain convolutional layer.
[0024] In some embodiments of this application, based on the foregoing scheme, the guided selection branch module further includes a guided feature encoding stage and a DEM estimation stage;
[0025] The guided feature encoding stage includes a first shallow feature extraction unit and a pair of third Swin Transformer units. The first shallow feature extraction unit includes multiple layers of first convolutional networks whose kernel sizes decrease layer by layer. The first convolutional networks extract the first shallow features, and the pair of third Swin Transformer units performs pixel rearrangement downsampling on the first shallow features.
[0026] The DEM estimation stage includes a first bottleneck layer unit, a first enhanced convolutional unit, and a pixel rearrangement unit. The first bottleneck layer unit reduces the number of channels of the first deep feature. The first enhanced convolutional unit enhances the first deep feature. The pixel rearrangement unit performs pixel rearrangement upsampling on the first deep feature to obtain the guiding information.
[0027] In some embodiments of this application, based on the foregoing scheme, the terrain reconstruction branch module further includes a DEM feature encoding stage and a DEM reconstruction stage;
[0028] The DEM feature encoding stage includes a second shallow feature extraction unit, which includes multiple layers of second convolutional networks. The kernel size of the multiple layers of second convolutional networks decreases sequentially. The second convolutional networks are used to extract second shallow features.
[0029] The DEM reconstruction stage includes a second bottleneck layer unit and an upsampling unit. The second bottleneck layer unit reduces the number of channels of the second deep feature, and the upsampling unit enhances the second deep feature to obtain the DEM feature.
[0030] In some embodiments of this application, based on the foregoing scheme, the step of fusing the guidance information with the DEM features from multiple sources to obtain fused features includes:
[0031] An attention fusion module adaptively learns channel-wise importance weights for the guidance information and for the DEM features, respectively. The two are then fused by weighting them with these importance weights to obtain the fused features.
[0032] In some embodiments of this application, based on the foregoing scheme, the step of training the guiding information, the DEM features, and the fused features with a total loss function until the total loss function converges, and outputting the target high-resolution DEM, includes:
[0033] The guiding information, the DEM features, and the fused features are all trained end-to-end using a collaborative loss function. This collaborative loss function includes a root mean square error loss function for elevation and a root mean square error loss function for terrain slope, calculated using the following formula:
[0034] $$L_{elev}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i-y_i\right)^2}$$
[0035] $$L_{slope}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\nabla_S\hat{y}_i-\nabla_S y_i\right)^2}$$
[0036] $$L_{co}=L_{elev}+\lambda L_{slope}$$
[0037] where $\hat{y}_i$ and $y_i$ are the elevation values of pixel $i$ in the predicted target high-resolution DEM and the actual target high-resolution DEM, respectively. $N$ is the total number of pixels in the target high-resolution DEM. $\nabla_S\hat{y}$ and $\nabla_S y$ are the terrain gradient maps of the predicted and actual DEMs, computed with the Sobel operator. $L_{co}$ is the collaborative loss function. $L_{elev}$ is the root mean square error loss function for elevation, and $L_{slope}$ is the root mean square error loss function for terrain slope. $\lambda$ is a preset hyperparameter that balances the contributions of $L_{elev}$ and $L_{slope}$.
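The collaborative loss can be sketched in Python with NumPy as follows. This is an illustrative sketch, not the patent's code. It assumes that the terrain slope term compares Sobel gradient magnitudes, with no conversion to degrees and no cell-size scaling. It also assumes that borders are handled by edge padding and that the two terms are combined additively with a hypothetical weight `lam = 0.5`.

```python
import numpy as np

# 3x3 Sobel kernels for the x (column) and y (row) directions.
SOBEL_X = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])
SOBEL_Y = SOBEL_X.T


def sobel_gradient(dem):
    """Gradient magnitude of a 2-D DEM (cross-correlation with Sobel kernels, edge padding)."""
    h, w = dem.shape
    padded = np.pad(dem, 1, mode="edge")
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for di in range(3):
        for dj in range(3):
            window = padded[di:di + h, dj:dj + w]
            gx += SOBEL_X[di, dj] * window
            gy += SOBEL_Y[di, dj] * window
    return np.sqrt(gx ** 2 + gy ** 2)


def rmse(a, b):
    """Root mean square error over all N pixels."""
    return float(np.sqrt(np.mean((a - b) ** 2)))


def collaborative_loss(pred, target, lam=0.5):
    """Elevation RMSE plus lam times the RMSE between Sobel gradient maps (lam is hypothetical)."""
    l_elev = rmse(pred, target)
    l_slope = rmse(sobel_gradient(pred), sobel_gradient(target))
    return l_elev + lam * l_slope, l_elev, l_slope


# Usage: a constant 1 m offset changes elevations but not the terrain gradient.
target = np.tile(np.arange(5.0), (5, 1))   # ramp rising along x
total, l_elev, l_slope = collaborative_loss(target + 1.0, target)
print(total, l_elev, l_slope)
```

The Sobel kernels sum to zero, so a constant elevation offset is penalized only by the elevation term. The slope term penalizes flattened ridges and valleys.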
[0038] The total loss function is minimized until it converges. The total loss function is calculated as follows:
[0039] .
[0040] According to a second aspect of this application, a DEM intelligent super-resolution device based on high spatial resolution remote sensing data is provided, the device comprising:
[0041] The first acquisition module is used to input high-resolution remote sensing images into the guidance selection branch module, extract the first shallow features, and input low-resolution DEM into the terrain reconstruction branch module, output the second shallow features, and process the first shallow features and the second shallow features interactively through the terrain-aware cross-attention module, feed back the first deep features to the guidance selection branch module to obtain guidance information, and feed back the second deep features to the terrain reconstruction branch module to obtain DEM features.
[0042] The second acquisition module is used to perform multi-source feature fusion of the guidance information and the DEM features to obtain fused features;
[0043] The third acquisition module is used to train the guidance information, the DEM features, and the fused features with a total loss function until the total loss function converges, and to output the target high-resolution DEM.
[0044] According to a third aspect of this application, a computer-readable storage medium is provided that stores a computer program thereon, the computer program including executable instructions that, when executed by a processor, implement the method described above.
[0045] According to a fourth aspect of this application, an electronic device is provided, comprising:
[0046] One or more processors;
[0047] A memory for storing executable instructions of the processor, which, when executed by the one or more processors, cause the one or more processors to implement the method described above.
[0048] The beneficial effects of this application are as follows:
[0049] This invention provides an intelligent super-resolution method for DEMs based on high spatial resolution remote sensing data. This method constructs a dual-branch collaborative deep learning framework comprising a guided selection branch module and a terrain reconstruction branch module, and utilizes a novel terrain-aware cross-attention module for multimodal feature fusion. The aim is to generate high-precision target high-resolution DEMs (HR DEMs) from low-resolution DEMs (LR DEMs) and high-resolution remote sensing imagery (HR_IMG). This method addresses the problems of high cost and long acquisition time associated with traditional high-resolution DEMs by using relatively readily available high-resolution remote sensing imagery as auxiliary information, achieving cost-effective and efficient high-precision DEM production. Its core "intelligence" lies in its ability to adaptively fuse features from different sources and optimize based on learning objectives tailored to terrain characteristics. This differs from general image super-resolution techniques and is more suitable for the specific needs of terrain reconstruction.
[0050] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and do not limit this application.
Attached Figure Description
[0051] The accompanying drawings, which are included to provide a further understanding of the invention and form part of this application, illustrate exemplary embodiments of the invention and are intended to explain the invention, but do not constitute an undue limitation thereof. In the drawings:
[0052] Figure 1 is a flowchart of the DEM intelligent super-resolution method based on high spatial resolution remote sensing data according to the present invention;
[0053] Figure 2 shows the model network framework of the DEM intelligent super-resolution method based on high spatial resolution remote sensing data provided by this invention;
[0054] Figure 3 is a schematic diagram of pixel rearrangement downsampling and pixel rearrangement upsampling in this invention;
[0055] Figure 4 is a schematic diagram of the Swin Transformer block of the present invention;
[0056] Figure 5 is a schematic diagram of the terrain-aware cross-attention module of the present invention;
[0057] Figure 6(a) is a low-resolution DEM of a specific embodiment of the present invention;
[0058] Figure 6(b) shows a high-resolution remote sensing image of a specific embodiment of the present invention;
[0059] Figure 6(c) shows a Sentinel-2 true-color image according to a specific embodiment of the present invention;
[0060] Figure 6(d) shows a target high-resolution DEM according to a specific embodiment of the present invention;
[0061] Figure 7 is a schematic diagram of a DEM intelligent super-resolution device based on high spatial resolution remote sensing data according to the present invention;
[0062] Figure 8 is a schematic diagram of an electronic device according to the present invention.
Detailed Implementation
[0063] Specific embodiments of the invention will now be described in detail with reference to the accompanying drawings, which illustrate examples of the invention. Although the invention will be described in conjunction with specific embodiments, it will be understood that it is not intended to limit the invention to the embodiments described herein. Rather, it is intended to cover variations, modifications, and equivalents included within the spirit and scope of the invention as defined by the appended claims. It should be noted that the method steps described herein can be implemented by any functional block or functional arrangement, and any functional block or functional arrangement can be implemented as a physical entity or a logical entity, or a combination of both.
[0064] To enable those skilled in the art to better understand the present invention, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.
[0065] Note: The examples described below are merely specific examples and are not intended to limit the embodiments of the present invention to the specific steps, values, conditions, data, order, etc. Those skilled in the art can utilize the concept of the present invention to construct more embodiments not mentioned herein by reading this specification.
[0066] Figure 1 shows a flowchart of the DEM intelligent super-resolution method based on high spatial resolution remote sensing data according to the present invention. Figure 2 shows the model network framework of this method.
[0067] According to a first aspect of this application, a DEM intelligent super-resolution method based on high spatial resolution remote sensing data is provided, comprising:
[0068] Step S101: Input a high-resolution remote sensing image into the guided selection branch module to extract a first shallow feature, and input a low-resolution DEM into the terrain reconstruction branch module to output a second shallow feature. The first and second shallow features are processed interactively by the terrain-aware cross-attention module. The resulting first deep feature is fed back to the guided selection branch module to obtain guidance information, and the resulting second deep feature is fed back to the terrain reconstruction branch module to obtain DEM features.
[0069] In some implementations of this embodiment, a parallel dual-branch network structure is constructed. It comprises a guided selection branch module, a terrain reconstruction branch module, and a terrain-aware cross-attention module.
[0070] In some implementations of this embodiment, preprocessing of the input data is required. Specifically, the input low-resolution DEM (LR_DEM) and the high-resolution remote sensing image (HR_IMG) used as guiding information are normalized.
[0071] Since the low-resolution DEM has a large numerical range, in order to facilitate model learning and optimization, the low-resolution DEM is scaled to the range [-1, 1] using the following formula (1):
[0072] $$\hat{h}=2\cdot\frac{h-h_{\min}}{h_{\max}-h_{\min}}-1 \qquad (1)$$
[0073] where $\hat{h}$ is the normalized DEM value and $h$ is a specific elevation value in the original low-resolution DEM. $h_{\min}$ and $h_{\max}$ are the minimum and maximum elevation values of all low-resolution DEM samples in the current batch or the entire dataset, respectively.
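Formula (1) is a min-max mapping onto [-1, 1]. A minimal NumPy sketch is shown below, assuming the standard linear form. It includes the inverse mapping, which would be needed to convert network predictions back to elevations. The sample elevations are made up.

```python
import numpy as np


def normalize_dem(dem, z_min, z_max):
    """Linearly map elevations from [z_min, z_max] to [-1, 1]."""
    return 2.0 * (dem - z_min) / (z_max - z_min) - 1.0


def denormalize_dem(dem_norm, z_min, z_max):
    """Inverse mapping, used to turn normalized outputs back into elevations."""
    return (dem_norm + 1.0) * (z_max - z_min) / 2.0 + z_min


# Usage: z_min / z_max would be the batch-level or dataset-level extremes.
lr_dem = np.array([[120.0, 480.0], [860.0, 1200.0]])   # hypothetical elevations (m)
z_min, z_max = float(lr_dem.min()), float(lr_dem.max())
norm = normalize_dem(lr_dem, z_min, z_max)
restored = denormalize_dem(norm, z_min, z_max)
```

Whether `z_min` and `z_max` come from the batch or from the whole dataset should be decided before training. With batch-level extremes, the same elevation can map to different normalized values in different batches.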
[0074] For high-resolution remote sensing images (HR_IMG), their pixel values are normalized to the range using the following formula (2):
[0075] (2)
[0076] Here, the output is the normalized remote sensing image and the input is the pixel values of the original remote sensing image.
[0077] In some embodiments of this example, the guided selection branch module includes a guided feature encoding stage, a guided feature optimization stage, and a DEM estimation stage arranged sequentially.
[0078] In some embodiments of this example, the guided feature encoding stage includes a first shallow feature extraction unit and a pair of third Swin Transformer units. The first shallow feature extraction unit includes multiple layers of first convolutional networks whose kernel sizes decrease layer by layer. The first convolutional networks extract the first shallow features, and the pair of third Swin Transformer units performs pixel rearrangement downsampling on the first shallow features.
[0079] In some implementations of this embodiment, Figure 4 is a schematic diagram of the Swin Transformer unit used as the first, second, and third Swin Transformer units.
[0080] Specifically, the multi-layer first convolutional network includes a first, a second, and a third convolutional network layer with kernel sizes of 5×5, 3×3, and 1×1, respectively. These layers progressively generate the first shallow features, which are divided into pixel tokens along the spatial dimension. To represent multi-scale features, current methods typically fuse these features through concatenation while keeping the guiding features and the terrain features at the same dimensionality. In some cases, however, this causes high-resolution (HR) guiding information to be lost. To overcome this problem, a downsampling unit composed of a pair of third Swin Transformer units is proposed to capture a larger receptive field. The downsampling unit performs 3×3 pixel rearrangement downsampling on the first shallow features. During this process, layer normalization is used to enhance robustness, and a bias-free 1×1 convolution adjusts the number of output channels to three times that of the input features.
[0081] Figure 3 illustrates the relationship between pixel rearrangement downsampling and pixel rearrangement upsampling. Unlike traditional pooling, pixel rearrangement downsampling preserves the guiding information of the HR domain and thus avoids losing high-resolution features. This embodiment therefore uses different numbers of channels for the guiding features and the terrain features, and fuses the two effectively in the terrain cross-attention mechanism. In this stage, the output features of the guiding feature encoding stage (...) are fixed at 216 channels.
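The lossless 3×3 pixel rearrangement downsampling and the channel adjustment described above can be illustrated with NumPy. The sketch makes several assumptions:

- the first shallow feature has 72 channels, so that three times the input gives the 216 channels mentioned above;
- the layer normalization sits between the rearrangement and the 1×1 convolution;
- random weights stand in for learned ones;
- the pair of third Swin Transformer units is omitted.

```python
import numpy as np


def pixel_unshuffle(x, r):
    """Lossless downsampling: (C, H, W) -> (C*r*r, H/r, W/r)."""
    c, h, w = x.shape
    assert h % r == 0 and w % r == 0, "spatial size must be divisible by r"
    x = x.reshape(c, h // r, r, w // r, r)     # split each spatial axis into blocks of r
    x = x.transpose(0, 2, 4, 1, 3)             # move the r x r offsets next to the channel axis
    return x.reshape(c * r * r, h // r, w // r)


def pixel_shuffle(x, r):
    """Inverse rearrangement (upsampling): (C*r*r, H, W) -> (C, H*r, W*r)."""
    crr, h, w = x.shape
    c = crr // (r * r)
    x = x.reshape(c, r, r, h, w).transpose(0, 3, 1, 4, 2)
    return x.reshape(c, h * r, w * r)


def channel_layer_norm(x, eps=1e-5):
    """Normalize each pixel across channels (no learnable scale or shift)."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def conv1x1_no_bias(x, weight):
    """Bias-free 1x1 convolution; weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)


rng = np.random.default_rng(0)
feat = rng.standard_normal((72, 48, 48))                 # hypothetical 72-channel first shallow feature
down = pixel_unshuffle(feat, 3)                          # (648, 16, 16): every pixel is kept
weight = rng.standard_normal((3 * 72, 9 * 72)) * 0.01    # hypothetical learned weights
guide_tokens = conv1x1_no_bias(channel_layer_norm(down), weight)   # (216, 16, 16)
```

Rearranging 48×48 to 16×16 multiplies the channel count by 9 (72 → 648) without discarding any pixel, and `pixel_shuffle` undoes it exactly. The bias-free 1×1 convolution then maps the 648 channels to 216.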
[0082] In some embodiments of this example, the guided feature optimization stage includes several sequentially arranged first Swin Transformer units. The first shallow feature is input into the first of these units, so that the sequentially arranged first Swin Transformer units extract the first deep feature in turn.
[0083] This embodiment introduces first Swin Transformer units to further process the features generated in the guided feature encoding stage. Specifically, the first of the first Swin Transformer units receives the input features from ..., while each subsequent first Swin Transformer unit processes the output of the terrain-aware attention (TAA) module. The output features of each first Swin Transformer unit are defined as follows: .
[0084] In some implementations of this embodiment, Figure 3 illustrates pixel rearrangement downsampling and pixel rearrangement upsampling. The DEM estimation stage includes a first enhanced convolutional unit and a pixel rearrangement unit. The first enhanced convolutional unit enhances the first deep feature, and the pixel rearrangement unit performs pixel rearrangement upsampling on the first deep feature to obtain the guiding information.
[0085] Specifically, the first bottleneck layer unit reduces the number of channels of the refined features from 216 to 72. Because pixel rearrangement upsampling carries no risk of information loss, a lightweight design is introduced for the upsampling operation to increase feature resolution. This lightweight design consists of the first enhanced convolutional unit, a 3×3 convolutional layer, followed by the pixel rearrangement unit, which upsamples the features by the upsampling factor. Three 3×3 convolutional layers are then applied to the upsampled features, and the result is added to the interpolation result to obtain the final output.
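The DEM estimation stage described in [0085] can be sketched as follows. The sketch makes these assumptions:

- the refined features and the LR DEM share one grid;
- the 3×3 enhancement convolution outputs 72·r² channels, so that pixel rearrangement yields 72 channels;
- the last of the three 3×3 layers outputs a single channel;
- no activation functions are inserted;
- nearest-neighbour upsampling stands in for the unspecified interpolation.

All layer names and weights are hypothetical.

```python
import numpy as np


def conv2d_same(x, weight):
    """Naive zero-padded 'same' convolution (cross-correlation), no bias.
    x: (C_in, H, W); weight: (C_out, C_in, k, k) with odd k."""
    c_out, _, k, _ = weight.shape
    p = k // 2
    h, w = x.shape[1:]
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros((c_out, h, w))
    for di in range(k):
        for dj in range(k):
            out += np.einsum("oc,chw->ohw", weight[:, :, di, dj], xp[:, di:di + h, dj:dj + w])
    return out


def pixel_shuffle(x, r):
    """(C*r*r, H, W) -> (C, H*r, W*r)."""
    crr, h, w = x.shape
    c = crr // (r * r)
    return x.reshape(c, r, r, h, w).transpose(0, 3, 1, 4, 2).reshape(c, h * r, w * r)


def upsample_nearest(img, r):
    """Stand-in for the interpolation of the LR DEM (nearest neighbour here)."""
    return img.repeat(r, axis=0).repeat(r, axis=1)


def dem_estimation(feat, lr_dem, r, w_neck, w_enh, w_tail):
    x = np.einsum("oc,chw->ohw", w_neck, feat)     # bottleneck: 216 -> 72 channels
    x = conv2d_same(x, w_enh)                      # 3x3 enhancement: 72 -> 72*r*r channels
    x = pixel_shuffle(x, r)                        # (72, H*r, W*r)
    for wt in w_tail:                              # three 3x3 layers, the last one -> 1 channel
        x = conv2d_same(x, wt)
    return x[0] + upsample_nearest(lr_dem, r)      # residual on the interpolated LR DEM


rng = np.random.default_rng(0)
r, size = 2, 4
feat = rng.standard_normal((216, size, size))      # refined first deep feature (hypothetical)
lr_dem = rng.standard_normal((size, size))
w_neck = rng.standard_normal((72, 216)) * 0.01
w_enh = rng.standard_normal((72 * r * r, 72, 3, 3)) * 0.01
w_tail = [rng.standard_normal((72, 72, 3, 3)) * 0.01,
          rng.standard_normal((72, 72, 3, 3)) * 0.01,
          rng.standard_normal((1, 72, 3, 3)) * 0.01]
guide = dem_estimation(feat, lr_dem, r, w_neck, w_enh, w_tail)   # (8, 8)
```

Because the convolutional path is added to an interpolated LR DEM, the network only has to predict the missing high-frequency detail. If the convolutional path produces zeros, the output is just the interpolated DEM.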
[0086] In some embodiments of this example, the terrain reconstruction branch module includes a DEM feature encoding stage, a DEM feature optimization stage, and a DEM reconstruction stage arranged sequentially.
[0087] In some embodiments of this example, given a low-resolution DEM, the DEM feature encoding stage includes a multi-layer second convolutional network, where the kernel size of the multi-layer second convolutional network decreases sequentially, and the second convolutional network is used to extract second shallow features.
[0088] Specifically, the multi-layer second convolutional network includes a first, a second, and a third layer with kernel sizes of 5×5, 3×3, and 1×1, respectively, which extract 72-channel shallow features. These features are the second shallow features of the low-resolution DEM.
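The 5×5 → 3×3 → 1×1 shallow feature extractor can be sketched with a naive NumPy convolution. The sketch assumes zero padding, 72 channels after every layer, no activation functions, and random stand-in weights. None of these details are fixed by the text above.

```python
import numpy as np


def conv2d_same(x, weight):
    """Naive zero-padded 'same' convolution (cross-correlation), no bias.
    x: (C_in, H, W); weight: (C_out, C_in, k, k) with odd k."""
    c_out, _, k, _ = weight.shape
    p = k // 2
    h, w = x.shape[1:]
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros((c_out, h, w))
    for di in range(k):
        for dj in range(k):
            out += np.einsum("oc,chw->ohw", weight[:, :, di, dj], xp[:, di:di + h, dj:dj + w])
    return out


def shallow_features(dem, weights):
    """Apply the 5x5 -> 3x3 -> 1x1 stack to a single-channel DEM of shape (H, W)."""
    x = dem[None]                     # add a channel axis: (1, H, W)
    for wt in weights:
        x = conv2d_same(x, wt)
    return x


rng = np.random.default_rng(0)
lr_dem = rng.standard_normal((16, 16))                   # normalized LR DEM patch (hypothetical)
weights = [rng.standard_normal((72, 1, 5, 5)) * 0.1,     # 5x5, 1 -> 72 channels
           rng.standard_normal((72, 72, 3, 3)) * 0.01,   # 3x3, 72 -> 72 channels
           rng.standard_normal((72, 72, 1, 1)) * 0.01]   # 1x1, 72 -> 72 channels
feats = shallow_features(lr_dem, weights)                # (72, 16, 16)
```

Stacking the three kernels gives each output pixel a 7×7 receptive field on the LR DEM (5 + 3 − 1 = 7). The 1×1 layer only mixes channels and does not enlarge the receptive field.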
[0089] In some embodiments of this example, the DEM feature optimization stage includes several sequentially arranged second Swin Transformer units. The second shallow feature is input into the first of these units, so that the sequentially arranged second Swin Transformer units extract the second deep feature in turn.
[0090] Specifically, in the DEM feature optimization stage, the pixel tokens of the second shallow features are input into the sequentially arranged second Swin Transformer units for terrain refinement. In this stage, the features output by each second Swin Transformer unit are ... .
[0091] In some embodiments of this example, the DEM reconstruction stage includes a second bottleneck layer unit and an upsampling unit. The second bottleneck layer unit reduces the number of channels of the second deep feature, and the upsampling unit enhances the second deep feature to obtain the DEM feature.
[0092] Specifically, the second deep feature tokens output by each second Swin Transformer unit exchange information bidirectionally with the first deep feature tokens through a terrain-aware cross-attention (TAA) module. This effectively fuses the guiding features and the terrain features and extracts further detail from the terrain map. To better reuse multi-level terrain features, the outputs of all paired Swin Transformer modules are concatenated into 72×T-channel features, which a bottleneck layer then reduces to 72 channels. Dense connections between the two branches might be expected to aid feature fusion, but they provide no additional gain and only increase experimental complexity, so they require careful consideration in model design. An upsampling block then progressively increases the feature resolution and outputs features at the target resolution. Finally, an additional 3×3 convolutional layer is applied, and its output is added to the interpolation result to obtain the final result.
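The multi-level aggregation in [0092] concatenates T outputs of 72 channels each and then reduces them to 72 channels with a bottleneck. It can be sketched as follows, assuming the bottleneck is a bias-free 1×1 convolution and using a hypothetical T = 6 with random stand-in weights.

```python
import numpy as np


def aggregate_levels(level_outputs, w_neck):
    """Concatenate T maps of shape (72, H, W) -> (72*T, H, W), then apply a
    bias-free 1x1 bottleneck back to 72 channels."""
    stacked = np.concatenate(level_outputs, axis=0)
    return np.einsum("oc,chw->ohw", w_neck, stacked)


rng = np.random.default_rng(0)
T = 6                                                  # hypothetical number of paired units
levels = [rng.standard_normal((72, 8, 8)) for _ in range(T)]
w_neck = rng.standard_normal((72, 72 * T)) * 0.01     # hypothetical learned weights
fused = aggregate_levels(levels, w_neck)               # (72, 8, 8)

# A bottleneck that selects only the last level reproduces that level unchanged.
select_last = np.zeros((72, 72 * T))
select_last[:, -72:] = np.eye(72)
```

The bottleneck is linear, so it can learn to weight or ignore individual levels. Selecting only the last level reduces the design to a plain chain of units.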
[0093] In some embodiments of this example, a terrain-aware cross-attention module is provided between each pair of first and second Swin Transformer units. The module alternately performs the terrain cross-attention transformation on the first deep feature and the second deep feature. It outputs the transformed first deep feature to the next first Swin Transformer unit and the next terrain-aware cross-attention module, and outputs the transformed second deep feature to the next second Swin Transformer unit and the next terrain-aware cross-attention module.
[0094] Specifically, the terrain-aware cross-attention (TAA) unit is shown in Figure 5. The guided selection branch module performs guided DEM estimation and the terrain reconstruction branch module performs guided DEM super-resolution, so the two modules should be fused during the iteration process. As mentioned earlier, the guided features and the terrain features have 216 and 72 channels, respectively. Let the normalized feature pair used for cross-attention be ...; the query comes from the first element, while the key and value come from the second element. In this setup, the query, key, and value of each head are embedded into the multi-head self-attention (MHSA) mechanism, as shown in formula (3):
[0095] (3)
[0096] where Softmax denotes the normalized exponential function; Q, K and V denote the query vector, key vector and value vector, respectively; d denotes the channel dimension of the query or key; and the two input features of the module are the first deep feature and the second deep feature, respectively.
[0097] Because the attention map is much smaller than the H×W map of the standard global attention mechanism, windowing is unnecessary, so attention can be computed globally, further enlarging the receptive field. The TAA module also includes a feedforward network unit consisting of two multilayer perceptron (MLP) layers and a 3×3 terrain convolutional layer, which applies a further nonlinear transformation to the fused features. The feature fusion is bidirectional: the terrain cross-attention transformation is applied alternately to the two elements of each feature pair, and the output features are fed back to the corresponding branches, where they are used in the next cross-attention computation if needed.
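Formula (3) is not reproduced in this text, so the following is only a minimal sketch under stated assumptions. It reads the remark that the attention map is much smaller than H×W as attention computed across channels (a Cq × Ck map, as in transposed or channel-wise attention) rather than across pixels. It also assumes identity query/key/value projections, a single head, a hypothetical 1/√N scaling (N = number of pixels, which differs from the channel-dimension scaling named above), and residual connections in `taa_exchange`. Because the map is Cq × Ck, the two inputs may have different channel counts, such as the 216 and 72 mentioned above.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def channel_cross_attention(query_feat, kv_feat):
    """Cross-attention with a channel-by-channel attention map (assumed form).

    query_feat: Cq channels, each a flattened list of N pixel values.
    kv_feat:    Ck channels, each a flattened list of N pixel values.
    Returns Cq channels of N values.
    """
    n = len(query_feat[0])
    scale = 1.0 / math.sqrt(n)  # hypothetical scaling choice
    out = []
    for q in query_feat:
        # One row of the Cq x Ck attention map: this query channel vs. every key channel.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in kv_feat]
        weights = softmax(scores)
        # Weighted sum of value channels (keys and values share one source here).
        out.append([sum(w * v[p] for w, v in zip(weights, kv_feat)) for p in range(n)])
    return out


def taa_exchange(first, second):
    """Alternating bidirectional exchange with residual connections (hypothetical)."""
    upd = channel_cross_attention(first, second)
    first = [[a + b for a, b in zip(f, u)] for f, u in zip(first, upd)]
    # The second feature attends to the already-updated first feature.
    upd = channel_cross_attention(second, first)
    second = [[a + b for a, b in zip(s, u)] for s, u in zip(second, upd)]
    return first, second
```

The feedforward unit (two MLP layers and a 3×3 terrain convolution) is omitted; `taa_exchange` only illustrates one way of realizing the alternating, bidirectional order of updates.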
[0098] Step S102: Perform multi-source feature fusion on the guidance information and the DEM features to obtain fused features.
[0099] In some implementations of this embodiment, an attention fusion module adaptively learns channel-wise importance indices for the guidance information and for the DEM features, and the two inputs are weighted by these importance indices and fused to obtain the fused features.
[0100] In some embodiments of this example, the guided selection branch module outputs the guidance information and the terrain reconstruction branch module outputs the DEM features. To integrate these two high-frequency detail maps into a single, more accurate estimate of the final DEM's high-frequency detail, an attention fusion module is employed. This module is preferably implemented with a channel attention (CA) mechanism: it takes the guidance information and the DEM features as input, adaptively learns the relative importance of the feature channels of each, and performs a weighted fusion accordingly, yielding the fused features.
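The text states only that the fusion module is preferably based on channel attention. One common way to turn channel statistics into branch weights, assumed here rather than taken from the patent, is to pool each channel and apply a softmax across the two branches so that the per-channel weights sum to 1 (as in selective-kernel attention). The sketch uses plain Python; `attention_fuse` and the scalar `gain` (a stand-in for the learned layers of a real CA block) are hypothetical.

```python
import math


def channel_mean(channel):
    """Global average pooling of one H x W channel."""
    return sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))


def attention_fuse(guidance, dem_feat, gain=1.0):
    """Fuse two same-shaped feature maps with per-channel branch weights.

    For each channel, the pooled descriptors of the two inputs are turned into
    two weights that sum to 1 (softmax over the branches).
    """
    fused = []
    for g, d in zip(guidance, dem_feat):
        sg, sd = gain * channel_mean(g), gain * channel_mean(d)
        m = max(sg, sd)  # subtract the max for numerical stability
        eg, ed = math.exp(sg - m), math.exp(sd - m)
        wg, wd = eg / (eg + ed), ed / (eg + ed)
        # Weighted sum of the two branches for this channel.
        fused.append([[wg * a + wd * b for a, b in zip(rg, rd)] for rg, rd in zip(g, d)])
    return fused
```

With `gain=0.0` the module reduces to a plain average of the two inputs; larger values let the branch with the stronger pooled response dominate each channel.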
[0101] Step S103: Train the network on the guidance information, the DEM features and the fused features with the total loss function until the total loss function converges, and output the target high-resolution DEM.
[0102] In some embodiments of this example, the guidance information, the DEM features and the fused features are all trained end-to-end with a collaborative loss function. The collaborative loss function comprises an elevation root-mean-square-error (RMSE) loss and a terrain-slope RMSE loss, calculated as follows:
[0103] $$L_{\mathrm{elev}} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{h}_i - h_i\right)^2} \tag{4}$$
[0104] $$L_{\mathrm{slope}} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\nabla\hat{h}_i - \nabla h_i\right)^2} \tag{5}$$
[0105] $$L_{\mathrm{col}} = L_{\mathrm{elev}} + \lambda\, L_{\mathrm{slope}} \tag{6}$$
[0106] where $\hat{h}_i$ and $h_i$ are the elevation values of pixel $i$ in the predicted target high-resolution DEM and the actual target high-resolution DEM, respectively; $N$ is the total number of pixels in the target high-resolution DEM; $\nabla\hat{h}_i$ and $\nabla h_i$ are the values of the topographic gradient maps of the predicted and actual DEMs, computed with the Sobel operator; $L_{\mathrm{col}}$ is the collaborative loss function; $L_{\mathrm{elev}}$ is the elevation RMSE loss; $L_{\mathrm{slope}}$ is the terrain-slope RMSE loss; and $\lambda$ is a preset hyperparameter that balances the contributions of the elevation and terrain-slope losses.
[0107] The collaborative loss function is computed separately for the guidance information, the fused features and the DEM features.
[0108] The total loss function is minimized until it converges. The total loss function is calculated as follows:
[0109] (7)
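The collaborative loss of formulas (4) to (6) can be sketched in plain Python as follows. Assumptions: the Sobel "topographic gradient map" is taken as the gradient magnitude; it is evaluated only on interior pixels, so the slope term averages over fewer pixels than the elevation term; the value of the balancing hyperparameter (`lam` in the code) is hypothetical; and `total_loss` assumes formula (7) is an unweighted sum of the three collaborative losses, which this text does not confirm.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_gradient(dem):
    """Gradient magnitude from 3x3 Sobel kernels on interior pixels (no padding).

    Requires a DEM of at least 3 x 3 pixels.
    """
    h, w = len(dem), len(dem[0])
    grad = []
    for i in range(1, h - 1):
        row = []
        for j in range(1, w - 1):
            gx = sum(SOBEL_X[a][b] * dem[i + a - 1][j + b - 1] for a in range(3) for b in range(3))
            gy = sum(SOBEL_Y[a][b] * dem[i + a - 1][j + b - 1] for a in range(3) for b in range(3))
            row.append(math.hypot(gx, gy))
        grad.append(row)
    return grad


def rmse(pred, target):
    """Root-mean-square error over all pixels of two same-shaped grids."""
    diffs = [(p - t) ** 2 for rp, rt in zip(pred, target) for p, t in zip(rp, rt)]
    return math.sqrt(sum(diffs) / len(diffs))


def collaborative_loss(pred, target, lam=1.0):
    """Elevation RMSE plus lam times slope (Sobel-gradient) RMSE."""
    return rmse(pred, target) + lam * rmse(sobel_gradient(pred), sobel_gradient(target))


def total_loss(pairs, lam=1.0):
    """Assumed unweighted sum over (prediction, target) pairs for the
    guidance information, fused features and DEM features."""
    return sum(collaborative_loss(p, t, lam) for p, t in pairs)
```

A constant elevation offset illustrates the split between the two terms: it contributes fully to the elevation RMSE but nothing to the slope RMSE, because the Sobel kernels sum to zero.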
[0110] All deep learning methods were implemented in the PyTorch framework, with the number of Swin Transformer blocks T fixed at 4, and training was performed on an NVIDIA RTX 3080 GPU with 10 GB of memory. To ensure fairness, all models used the same hyperparameter settings: a batch size of 16, low-resolution input DEM patches of 32×32 pixels, and Sentinel-2 image patches of 96×96 pixels. Training used the Adam optimizer for 100 epochs with an initial learning rate of 0.0001, decayed by a factor of 0.1 at epoch 90. Root mean square error (RMSE) was chosen as the primary evaluation metric; a lower RMSE indicates better model performance.
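The training settings listed above can be collected into a small configuration sketch. The dictionary keys and the `learning_rate` and `schedule` helpers are hypothetical names, and the schedule assumes the decay is a single ×0.1 step at epoch 90, which is one reading of the text. In PyTorch the same schedule could be expressed with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=90, gamma=0.1)`.

```python
# Hypothetical key names; values are the settings reported in the text.
TRAIN_CONFIG = {
    "framework": "PyTorch",
    "num_swin_blocks": 4,      # T
    "batch_size": 16,
    "lr_dem_size": (32, 32),   # low-resolution DEM input patch, pixels
    "image_size": (96, 96),    # Sentinel-2 guidance image patch, pixels
    "optimizer": "Adam",
    "epochs": 100,
    "base_lr": 1e-4,
    "decay_epoch": 90,         # assumed single decay step
    "gamma": 0.1,
}


def learning_rate(epoch, cfg=TRAIN_CONFIG):
    """Learning rate for a 0-based epoch under the assumed step schedule."""
    if epoch >= cfg["decay_epoch"]:
        return cfg["base_lr"] * cfg["gamma"]
    return cfg["base_lr"]


def schedule(cfg=TRAIN_CONFIG):
    """Per-epoch learning rates for the whole training run."""
    return [learning_rate(e, cfg) for e in range(cfg["epochs"])]
```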
[0111] All trained models were applied to the test dataset, and the comparison results are listed in Table 1. The statistics show that the proposed method outperforms all other methods on all three terrain evaluation metrics. Deep learning methods also generally perform better and more stably than the traditional interpolation method (BiCubic), highlighting the advantages of deep learning for the DEM super-resolution (DEMSR) task. As a convolution-based SR model, TFASR achieves better results than SRCNN on the DEMSR task. TTSR, which incorporates terrain feature awareness, is competitive, second only to the model proposed in this embodiment. By using the high-resolution image (HRIMG) to guide DEM super-resolution and promoting the fusion of multiple terrain and elevation features, this invention provides a more accurate and reliable solution for the DEMSR task.
[0112] Table 1. RMSE evaluation results of the different methods.
[0113]
[0114] The invention will be further described below with reference to specific experiments. Figure 6(a) shows a low-resolution DEM according to a specific embodiment of the invention; Figure 6(b) shows a high-resolution remote sensing image according to a specific embodiment of the invention; Figure 6(c) shows a Sentinel-2 true-color image according to a specific embodiment of the invention; and Figure 6(d) shows a target high-resolution DEM according to a specific embodiment of the invention. Visual comparisons show that the intelligent super-resolution method for DEMs based on high spatial resolution remote sensing data provided in this embodiment can more accurately preserve terrain details.
[0115] According to a second aspect of this application, a DEM intelligent super-resolution device based on high spatial resolution remote sensing data is provided. As shown in Figure 7, the device includes:
[0116] The first acquisition module 201 is used to input the high-resolution remote sensing image into the guided selection branch module to extract the first shallow features, input the low-resolution DEM into the terrain reconstruction branch module to output the second shallow features, process the first and second shallow features interactively through the terrain-aware cross-attention module, feed the first deep features back to the guided selection branch module to obtain guidance information, and feed the second deep features back to the terrain reconstruction branch module to obtain DEM features;
[0117] The second acquisition module 202 is used to perform multi-source feature fusion of the guidance information and the DEM features to obtain fused features;
[0118] The third acquisition module 203 is used to train the guidance information, the DEM features and the fused features using the total loss function until the total loss function converges, and to output the target high-resolution DEM.
[0119] Specifically, this embodiment corresponds one-to-one with the above method embodiments. The functions of each module have been described in detail in the corresponding method embodiments, so they will not be repeated here.
[0120] According to a third aspect of this application, a computer-readable storage medium is provided that stores a computer program thereon, the computer program including executable instructions that, when executed by a processor 301, implement the method described above.
[0121] The present invention can implement all or part of the processes in the above methods, or it can be accomplished by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium. When the computer program is executed by the processor 301, it can implement the steps of the various method embodiments described above. The computer program includes computer program code, which can be in the form of source code, object code, executable file, or some intermediate form. The computer-readable medium can include: any entity or device capable of carrying computer program code, recording media, USB flash drive, portable hard drive, magnetic disk, optical disk, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunication signals, and software distribution media, etc. It should be noted that the content included in the computer-readable medium can be appropriately added or removed according to the requirements of legislation and patent practice in the jurisdiction. For example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.
[0122] According to a fourth aspect of this application, an electronic device is provided. As shown in Figure 8, it includes:
[0123] One or more processors 301;
[0124] The memory 302 is used to store the executable instructions of the processor 301, which, when executed by one or more processors 301, cause one or more processors 301 to implement the above-described method.
[0125] The electronic device is manifested in the form of a general-purpose computing device. The components of the electronic device may include, but are not limited to: at least one processor 301, at least one memory 302, and a bus 303 connecting different device components (including memory 302 and processor 301).
[0126] The processor 301 may be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor may be a microprocessor or any conventional processor. The processor 301 is the control center of the computer device, connecting various parts of the entire computer device through various interfaces and lines.
[0127] The memory 302 can be used to store computer programs and / or modules. The processor 301 implements various functions of the computer device by running or executing the computer programs and / or modules stored in the memory 302 and calling the data stored in the memory 302. The memory 302 may mainly include a program storage area and a data storage area. The program storage area may store application programs required for operating the device and at least one function (such as sound playback function, image playback function, etc.). The data storage area may store data created according to the use of the mobile phone (such as audio data, video data, etc.). In addition, the memory 302 may include high-speed random access memory, and may also include non-volatile memory, such as hard disk, RAM, plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, at least one disk storage device, flash memory device, or other volatile solid-state storage device.
[0128] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, apparatus, servers, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage and optical storage) containing computer-usable program code.
[0129] This invention is described with reference to flowcharts and/or block diagrams of methods, apparatus (devices), servers and computer program products according to embodiments of the invention. It will be understood that each process and/or block of the flowcharts and/or block diagrams, and combinations of processes and/or blocks therein, can be implemented by computer program instructions. These computer program instructions can be provided to a processor 301 of a general-purpose computer, special-purpose computer, embedded processor or other programmable data processing apparatus to produce a machine, such that the instructions executed by the processor 301 of the computer or other programmable data processing apparatus produce means for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram of Figure 1.
[0130] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram of Figure 1.
[0131] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment so that a series of operational steps is performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram of Figure 1.
[0132] Obviously, those skilled in the art can make various modifications and variations to this invention without departing from its spirit and scope. Therefore, if these modifications and variations fall within the scope of the claims of this invention and their equivalents, this invention also intends to include these modifications and variations.
Claims
1. A smart super-resolution method for DEM based on high spatial resolution remote sensing data, characterized by comprising: inputting a high-resolution remote sensing image into a guided selection branch module to extract a first shallow feature, and inputting a low-resolution DEM into a terrain reconstruction branch module to output a second shallow feature; processing the first shallow feature and the second shallow feature interactively through a terrain-aware cross-attention module, feeding the first deep feature back to the guided selection branch module to obtain guidance information, and feeding the second deep feature back to the terrain reconstruction branch module to obtain DEM features. The guided selection branch module includes a guided feature optimization stage comprising several sequentially arranged first Swin Transformer units; the first shallow feature is input into the first of these first Swin Transformer units, and the sequentially arranged first Swin Transformer units extract the first deep feature in sequence. The terrain reconstruction branch module includes a DEM feature optimization stage comprising several sequentially arranged second Swin Transformer units; the second shallow feature is input into the first of these second Swin Transformer units, and the sequentially arranged second Swin Transformer units extract the second deep feature in sequence. A terrain-aware cross-attention module is provided between each pair of a first Swin Transformer unit and a second Swin Transformer unit, and is used to alternately perform the terrain cross-attention transformation on the first deep feature and the second deep feature, output the transformed first deep feature to the next first Swin Transformer unit and the next terrain-aware cross-attention module, and output the transformed second deep feature to the next second Swin Transformer unit and the next terrain-aware cross-attention module.
The terrain-aware cross-attention module includes a terrain-aware cross-attention unit and a feedforward network unit. The calculation formula for the terrain-aware cross-attention unit is as follows: where Softmax denotes the normalized exponential function; Q, K and V denote the query vector, key vector and value vector, respectively; d denotes the channel dimension of the query or key; and the two input features are the first deep feature and the second deep feature, respectively. The feedforward network unit consists of two multilayer perceptron layers and a terrain convolutional layer. The guidance information and the DEM features undergo multi-source feature fusion to obtain fused features. The guidance information, the DEM features and the fused features are trained using a total loss function until the total loss function converges, and the target high-resolution DEM is output.
2. The method according to claim 1, characterized in that the guided selection branch module further includes a guided feature encoding stage and a DEM estimation stage. The guided feature encoding stage includes a first shallow feature extraction unit and a pair of third Swin Transformer units. The first shallow feature extraction unit includes multiple layers of first convolutional networks whose kernel sizes decrease sequentially; the first convolutional networks are used to extract the first shallow feature, and the pair of third Swin Transformer units is used to perform pixel-rearrangement downsampling on the first shallow feature. The DEM estimation stage includes a first bottleneck layer unit, a first enhanced convolutional unit and a pixel rearrangement unit. The first bottleneck layer unit reduces the number of channels of the first deep feature, the first enhanced convolutional unit enhances the first deep feature, and the pixel rearrangement unit performs pixel-rearrangement upsampling on the first deep feature to obtain the guidance information.
3. The method according to claim 1, characterized in that: The terrain reconstruction branch module also includes a DEM feature encoding stage and a DEM reconstruction stage; The DEM feature encoding stage includes a second shallow feature extraction unit, which includes multiple layers of second convolutional networks. The kernel size of the multiple layers of second convolutional networks decreases sequentially. The second convolutional networks are used to extract second shallow features. The DEM reconstruction stage includes a second bottleneck layer unit and an upsampling unit. The second bottleneck layer unit reduces the number of channels of the second deep feature, and the upsampling unit enhances the second deep feature to obtain the DEM feature.
4. The method according to claim 1, characterized in that the step of performing multi-source feature fusion on the guidance information and the DEM features to obtain fused features includes: using an attention fusion module to adaptively learn channel-wise importance indices for the guidance information and for the DEM features, respectively, and fusing the two inputs weighted by these importance indices to obtain the fused features.
5. The method according to claim 1, characterized in that training the guidance information, the DEM features and the fused features using a total loss function until the total loss function converges, and then outputting the target high-resolution DEM, includes: training the guidance information, the DEM features and the fused features end-to-end with a collaborative loss function, the collaborative loss function comprising an elevation root-mean-square-error loss and a terrain-slope root-mean-square-error loss and being calculated using the following formula, where the predicted target high-resolution DEM and the actual target high-resolution DEM supply the elevation value of each pixel; the total number of pixels in the target high-resolution DEM is used for averaging; the topographic gradient maps of the predicted and actual DEMs are computed with the Sobel operator; and a preset hyperparameter balances the contribution weights of the elevation root-mean-square-error loss and the terrain-slope root-mean-square-error loss. The total loss function is minimized until it converges, and is calculated as follows:
6. A DEM intelligent super-resolution device based on high spatial resolution remote sensing data, applied to the method according to any one of claims 1-5, characterized in that the device includes: a first acquisition module, used to input the high-resolution remote sensing image into the guided selection branch module to extract the first shallow features, input the low-resolution DEM into the terrain reconstruction branch module to output the second shallow features, process the first shallow features and the second shallow features interactively through the terrain-aware cross-attention module, feed the first deep features back to the guided selection branch module to obtain guidance information, and feed the second deep features back to the terrain reconstruction branch module to obtain DEM features; a second acquisition module, used to perform multi-source feature fusion of the guidance information and the DEM features to obtain fused features; and a third acquisition module, used to train the guidance information, the DEM features and the fused features using a total loss function until the total loss function converges, and to output the target high-resolution DEM.
7. A computer-readable storage medium having a computer program stored thereon, characterized in that, The computer program includes executable instructions that, when executed by a processor, implement the method of any one of claims 1-5.
8. An electronic device, characterized by comprising: one or more processors; and a memory for storing executable instructions of the processor, which, when executed by the one or more processors, cause the one or more processors to perform the method according to any one of claims 1-5.