A method for retrieving urban building and vegetation height from fused multi-source remote sensing data

By constructing a height prediction combination model based on multi-source remote sensing data, the problem of obtaining building and vegetation height information over a large scale is solved, achieving high-precision and low-cost urban height inversion, supporting cross-regional applications, and providing reliable prediction error assessment.

CN122241301A · Pending · Publication Date: 2026-06-19 · CHINA UNIV OF MINING & TECH (BEIJING) +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
CHINA UNIV OF MINING & TECH (BEIJING)
Filing Date
2026-02-12
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies struggle to efficiently and accurately acquire building and vegetation height information on a large scale, and are costly and time-consuming. When retrieving heights from a single remote sensing data source in complex urban and mixed vegetation scenarios, there is high uncertainty, making it difficult to fully characterize the height features of different land cover types.

Method used

A combined height prediction model integrating multi-source remote sensing data, including nDSM height data, satellite remote sensing data, and land cover type data, is constructed. Convolutional neural networks and Transformer modules are used for feature extraction and fusion. A weakly supervised training framework is combined to identify and segment building and vegetation types and predict their heights.

Benefits of technology

It achieves high-precision, large-scale, and low-cost inversion of urban building and vegetation heights, supports cross-regional and cross-city height product generation globally, improves the method's scalability and universality, and provides reliable prediction error assessment.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122241301A_ABST
Patent Text Reader

Abstract

This invention discloses a method for inverting urban building and vegetation height by fusing multi-source remote sensing data. The method includes: constructing a height inversion sample dataset for the study area, comprising nDSM height data, high-precision height data, satellite remote sensing data, and land cover type data; constructing a height prediction combined model including a convolutional neural network module and a Transformer module; inputting the height inversion sample dataset into the height prediction combined model for feature extraction and fusion processing according to building and vegetation types to obtain the fused features of each pixel in the study area; the Transformer module capturing the spatial dependency information of each pixel in the study area through a multi-head self-attention mechanism; and inputting the height inversion dataset obtained from the study area into the height prediction combined model to obtain rasterized type and height data of the study area. This invention achieves high-precision, large-scale, and low-cost height inversion of urban buildings and vegetation in large-scale study areas, providing technical support for urban planning, ecological monitoring, and other scenarios.

Description

Technical Field

[0001] This invention relates to the fields of satellite remote sensing height inversion and deep learning, and in particular to a method for inverting urban building and vegetation height by fusing multi-source remote sensing data.

Background Technology

[0002] The heights of urban buildings and vegetation are crucial parameters characterizing the three-dimensional spatial morphology and ecological structure of cities, with significant application value in urban planning and management, urban renewal assessment, disaster risk analysis, ecological environment monitoring, and carbon cycle research. Efficiently and accurately acquiring building and vegetation height information at large scale has long been a key research focus and technical challenge in remote sensing and geographic information science. Existing height inversion methods mainly rely on field measurements, airborne or vehicle-mounted lidar, aerial photogrammetry, and 3D reconstruction from stereo remote sensing images. While these methods can obtain high-precision height data in local areas, they generally suffer from high costs, long operation cycles, and strong dependence on equipment and operating conditions, making it difficult to acquire continuous and periodic height information at the scale of an urban agglomeration or a region. These cost and operating constraints significantly restrict their widespread application.

[0003] With the development of Earth observation technology, spaceborne remote sensing data has gradually become an important data source for acquiring three-dimensional information about the Earth's surface. Optical remote sensing imagery provides rich spectral and spatial structure information, while synthetic aperture radar (SAR) data offers all-weather, all-day observation capabilities and is highly sensitive to surface structural features. Spaceborne lidar data (such as ICESat-2 and GEDI) can directly acquire high-precision surface height information. However, optical or SAR data used alone is significantly affected by imaging mechanisms and environmental conditions, resulting in high uncertainty when retrieving heights in complex urban and mixed vegetation scenarios. Furthermore, existing spaceborne lidar data is typically distributed as discrete points with discontinuous spatial coverage, making it difficult to directly generate continuous height products. In recent years, deep learning methods have been widely applied to remote sensing information extraction and have shown potential in height inversion. Buildings and vegetation differ significantly in structural morphology, scattering mechanisms, and height distribution characteristics. Existing height inversion techniques often use a single data source, which makes it difficult to fully characterize the height features of different land cover types and further increases the uncertainty of the inversion results. Therefore, how to fully utilize the complementary advantages of multi-source remote sensing data under limited high-precision sample conditions, reduce dependence on high-density labeled samples, and achieve high-precision, continuous extraction of building and vegetation heights as well as large-scale height inversion is a key technical problem that urgently needs to be solved.

Summary of the Invention

[0004] The purpose of this invention is to provide a method for inverting the height of urban buildings and vegetation by integrating multi-source remote sensing data. This method constructs multi-source height inversion data including nDSM height data, high-precision height data, satellite remote sensing data, and land cover type data. A height prediction combination model utilizes the height inversion sample dataset for training on building and vegetation type identification and segmentation, as well as height prediction training under a weakly supervised training framework. This achieves high-precision, large-scale, and low-cost height inversion of urban buildings and vegetation, effectively solving core technical challenges such as the difficulty in balancing large-scale application and accuracy. It provides reliable technical support for urban planning, ecological monitoring, and other scenarios.

[0005] The objective of this invention is achieved through the following technical solution: a method for inverting urban building and vegetation height by fusing multi-source remote sensing data, the method comprising:
S1. Construct a height inversion sample dataset for the study area, which includes nDSM height data, high-precision height data, satellite remote sensing data, and land cover type data.
S2. Construct a height prediction combined model that integrates a convolutional neural network (CNN) module and a Transformer module. Input the height inversion sample dataset into the height prediction combined model, perform feature extraction and fusion processing according to building and vegetation types, and obtain the fused features of each pixel in the study area. The Transformer module captures the spatial dependency information of each pixel in the study area through a multi-head self-attention mechanism module. The height prediction combined model uses the height inversion sample dataset for training in building and vegetation type recognition and segmentation, and in height prediction.
S3. Obtain the height inversion dataset of the study area, which includes nDSM height data, high-precision height data, satellite remote sensing data, and land cover type data. Input the height inversion dataset into the height prediction combined model to obtain the rasterized type and height data of the study area. The type and height data include the building or vegetation type of each pixel and its predicted height value.

[0006] To better implement this invention, the nDSM height data of the height inversion sample dataset and the height inversion dataset is obtained as follows: Digital Surface Model (DSM) data is acquired; the DSM data is resampled and subjected to minimum value smoothing filtering to obtain bare ground elevation (DEM) data; the difference between the DSM data and the bare ground elevation DEM data is calculated to obtain the nDSM height data. The satellite remote sensing data includes Sentinel-1 satellite remote sensing data and Sentinel-2 satellite remote sensing data.
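The nDSM derivation above can be sketched in a few lines of plain Python. This is a minimal illustration, not the patented implementation: the window radius, the function names `min_filter` and `ndsm_from_dsm`, and the clipping of negative differences to zero are assumptions, and the resampling step is omitted.

```python
def min_filter(grid, radius=1):
    """Moving-window minimum over a 2D list (window is truncated at the edges)."""
    rows, cols = len(grid), len(grid[0])
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            window = [
                grid[rr][cc]
                for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                for cc in range(max(0, c - radius), min(cols, c + radius + 1))
            ]
            row.append(min(window))
        out.append(row)
    return out


def ndsm_from_dsm(dsm, radius=1):
    """nDSM = DSM - estimated bare-ground DEM; negative values clipped to 0 (assumption)."""
    dem = min_filter(dsm, radius)  # stand-in for "minimum value smoothing filtering"
    return [
        [max(0.0, s - g) for s, g in zip(dsm_row, dem_row)]
        for dsm_row, dem_row in zip(dsm, dem)
    ]


# Toy 4x4 surface: flat ground at 50 m with a 12 m tall block in the middle.
dsm = [
    [50.0, 50.0, 50.0, 50.0],
    [50.0, 62.0, 62.0, 50.0],
    [50.0, 62.0, 62.0, 50.0],
    [50.0, 50.0, 50.0, 50.0],
]
ndsm = ndsm_from_dsm(dsm)
```

A minimum filter only recovers the ground where the window is wider than the object, so on real data the window size would need to exceed the footprint of the largest building.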

[0007] Preferably, the high-precision height data sources of the height inversion sample dataset and the height inversion dataset include physically measured height data and/or height data obtained by high-density lidar; alternatively, the height of building types in the high-precision height data is obtained jointly from the ICESat-2 ATL03 product and GEDI data, and the height of vegetation types in the high-precision height data is obtained jointly from the ICESat-2 ATL08 product and GEDI data.

[0008] Preferably, in step S2, the convolutional neural network (CNN) module includes a satellite remote sensing data feature extraction branch, an nDSM height data channel branch, and a confidence channel branch. The satellite remote sensing data feature extraction branch extracts remote sensing features from the satellite remote sensing data, and the nDSM height data channel branch extracts height prior features from the nDSM height data. The confidence channel branch performs confidence-gated weak supervision processing, using the nDSM height data as prior data and the high-precision height data as supervision data, to obtain confidence features. The remote sensing features, height prior features, and confidence features are fused to obtain the fused features.

[0009] Preferably, the Transformer module includes a positional encoding module, a multi-head self-attention mechanism module, and a multilayer perceptron (MLP) module. The positional encoding module encodes the pixel positions in the study area, the multi-head self-attention mechanism module captures long-range global spatial dependencies, and the MLP module enhances feature representation capabilities through nonlinear transformation.

[0010] Preferably, in step S3, the height prediction combined model is constructed using a deep learning network, which obtains the predicted height value of a pixel as follows: n random seeds are set, and the same features obtained by the height prediction combined model are processed under each seed to obtain n initial predicted height values. The average of the n initial predicted height values is the predicted height value, and their standard deviation is taken as the prediction error.

[0011] Preferably, the height prediction combined model has a building height prediction unit and a vegetation height prediction unit. The building height prediction unit identifies and segments a building sample dataset from the height inversion sample dataset, and the vegetation height prediction unit identifies and segments a vegetation sample dataset from the height inversion sample dataset. The height prediction combined model uses the building sample dataset and the vegetation sample dataset to perform more accurate building and/or vegetation type identification and segmentation on the pixels of the height inversion dataset of the study area. The building height prediction unit uses the building sample dataset to perform more accurate height prediction on pixels identified as building types, and the vegetation height prediction unit uses the vegetation sample dataset to perform more accurate height prediction on pixels identified as vegetation types.

[0012] Preferably, the satellite remote sensing data feature extraction branch includes a feature extraction module A and a convolution module. The feature extraction module A is used to extract the texture features and spectral index features of the satellite remote sensing data. The convolution module is a 3×3 convolution layer. The convolution module performs convolution processing on the features, and then sequentially processes them through a BatchNorm layer and a ReLU activation function to obtain remote sensing features. The nDSM height data channel branch processes the nDSM height data sequentially through 3×3 convolution, a BatchNorm layer, a ReLU activation function, and a 3×3 convolution to obtain height prior features. The confidence channel branch includes 3×3 convolution processing and Sigmoid activation function processing.

[0013] Preferably, the Transformer module further includes an adaptive position-sensitive information injection mechanism applied before processing, in which the fused features of the pixel grid in the study area are rearranged in dimension and converted into a pixel-level feature sequence.

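[0014] Preferably, the confidence channel branch is constructed with the following loss function: $L_{total} = \lambda_1 L_{sup} + \lambda_2 \cdot \frac{1}{N}\sum_{i=1}^{N} c_i \, L_{weak}^{(i)}$, where $L_{total}$ is the total loss, $\lambda_1$ and $\lambda_2$ are the weighting coefficients, $L_{sup}$ is the supervised loss, $N$ is the total number of sample pixels or cells, $c_i$ is the confidence of pixel or cell $i$, and $L_{weak}^{(i)}$ is the weak supervision loss of pixel or cell $i$.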

[0015] Compared with the prior art, the present invention has the following advantages and beneficial effects: (1) This invention constructs multi-source height inversion data including nDSM height data, high-precision height data, satellite remote sensing data, and land cover type data. The height prediction combination model uses the height inversion sample dataset to train the type identification and segmentation of buildings and vegetation, as well as the height prediction training under a weakly supervised training framework. This achieves high-precision, large-scale, and low-cost height inversion of urban buildings and vegetation, effectively solving the core technical problems of balancing large-scale application and accuracy, and providing reliable technical support for urban planning, ecological monitoring, and other scenarios.

[0016] (2) This invention can perform height inversion without region-specific aerial or ground measurement data, and supports the generation of building and vegetation height products on a global scale, across regions and cities, which significantly improves the scalability and universality of the method. This invention introduces generated nDSM height data and its confidence as weak supervision information, and constrains the model training process through a confidence-weighted prior injection mechanism. This effectively alleviates the insufficient supervision caused by the small coverage area and sparse samples of spaceborne lidar point clouds, so that stable height inversion accuracy is maintained even when samples are insufficient or unevenly distributed in space.

[0017] (3) The adaptive reliability assessment mechanism of the height prediction combined model of the present invention trains multiple random-seed models for buildings and for vegetation respectively, and outputs the ensemble mean as the final predicted height and the ensemble standard deviation as the prediction error, thereby quantitatively assessing the reliability of the prediction results. The prediction error reflects how well the model recognizes the input features and how reliable the inversion results are in a local area, providing a dual "height value + error" product for applications such as urban planning, ecological monitoring, and disaster assessment, and enhancing the interpretability and engineering usability of the results.

[0018] (4) This invention has significant advantages in height prediction for large-scale study areas, in usability, in robustness under weak supervision conditions, and in prediction credibility assessment. It can achieve high-precision and widely applicable urban building and vegetation height inversion under low-cost conditions, and has important scientific significance and engineering application value.

Attached Figure Description

[0019] Figure 1 is a flowchart of the method for inverting urban building and vegetation height according to the present invention.
Figure 2 is a schematic diagram of the structural principle of the convolutional neural network (CNN) module in the embodiment.
Figure 3 is a schematic diagram of the adaptive position-sensitive information injection mechanism and the processing principle of the Transformer module in the embodiment.
Figure 4 is a schematic diagram of the principle of setting n random seeds to calculate and output the predicted height in the height prediction combined model in the embodiment.

Detailed Implementation

[0020] The present invention is described in further detail below with reference to embodiments. Example: As shown in Figure 1, a method for inverting urban building and vegetation height by fusing multi-source remote sensing data includes: S1. Construct a height inversion sample dataset for the study area, comprising nDSM height data, high-precision height data, satellite remote sensing data, and land cover type data (the height inversion sample dataset is spatially aligned and matched). Based on their source and category, the samples are divided into two sets: a building sample set and a vegetation sample set. Each of the two sets is divided into a training set, a validation set, and a test set in a ratio of 7:2:1, so that both tasks can independently complete parameter learning and generalization evaluation.
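The 7:2:1 partition described above can be sketched as follows. The function name `split_721`, the fixed shuffle seed, and the use of integer sample IDs are illustrative assumptions; the source does not say how the samples are shuffled or whether the split is spatially stratified.

```python
import random


def split_721(samples, seed=0):
    """Shuffle once, then cut into 70% train, 20% validation, 10% test."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 7 // 10  # integer arithmetic avoids float rounding
    n_val = len(items) * 2 // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder, so no sample is dropped
    return train, val, test


# Building and vegetation samples are split independently, as in the text.
building_ids = list(range(100))
vegetation_ids = list(range(1000, 1050))
b_train, b_val, b_test = split_721(building_ids)
v_train, v_val, v_test = split_721(vegetation_ids)
```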

[0021] The nDSM height data of the height inversion sample dataset and the height inversion dataset of this invention is obtained as follows: Digital Surface Model (DSM) data is acquired. The DSM data is then resampled (e.g., resampled from 30 m resolution to 10 m or lower resolution) and subjected to minimum value smoothing filtering to obtain bare ground elevation (DEM) data. The difference between the DSM data and the bare ground elevation DEM data is calculated to obtain the nDSM height data. The high-precision height data consists of height data at discrete locations. The confidence at a location point k is computed from the height of location point k in the high-precision height data and the height of location point k in the nDSM height data. Range constraints are applied to the nDSM and to the confidence channel optimization: the nDSM is limited to be greater than 0; the confidence is limited to [0,1]; if the confidence is invalid, it is assigned a value of 0 so that the strength of the weak supervision constraint is automatically reduced. Following this method, the confidence of each discrete location point in the high-precision height data can be calculated. For other location points in the study area, the confidence can be obtained by Kriging interpolation. Satellite remote sensing data includes Sentinel-1 satellite remote sensing data (e.g., Sentinel-1 VV and VH polarization data) and Sentinel-2 satellite remote sensing data (e.g., the 13 raw Sentinel-2 bands).
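The exact confidence expression is not reproduced in this text, so the sketch below substitutes a hypothetical agreement score, `exp(-|h_ref - h_ndsm| / scale)`, purely to illustrate the stated range constraints: the nDSM is kept non-negative, the confidence is kept in [0, 1], and invalid inputs yield 0. The function name, the exponential form, and the `scale` parameter are all assumptions.

```python
import math


def ndsm_confidence(h_ref, h_ndsm, scale=5.0):
    """Hypothetical agreement score in [0, 1]; NOT the formula from the patent."""
    for value in (h_ref, h_ndsm):
        if value is None or not math.isfinite(value):
            return 0.0  # invalid confidence is assigned 0, as the text requires
    h_ndsm = max(0.0, h_ndsm)  # nDSM range constraint
    score = math.exp(-abs(h_ref - h_ndsm) / scale)  # assumed functional form
    return min(1.0, max(0.0, score))  # confidence range constraint


perfect = ndsm_confidence(20.0, 20.0)
partial = ndsm_confidence(20.0, 30.0)
invalid = ndsm_confidence(20.0, float("nan"))
```

Under any such score, a point where the nDSM agrees with the lidar-derived height contributes fully to weak supervision, and a point with missing data contributes nothing.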

[0022] In some embodiments, the high-precision height data sources for the height inversion sample dataset and the height inversion dataset include physically measured height data and/or height data obtained from high-density lidar. Alternatively, the height of building types in the high-precision height data is obtained jointly from the ICESat-2 ATL03 product and GEDI data, as follows: using the building footprint vector, the average height of the ATL03 photons falling within the building area is calculated; a 10 m buffer zone is generated outward from the vector extent, and the average height of the ATL03 photons falling within the buffer zone is subtracted to obtain the building height. The GEDI product provides its own height data; a 2 m buffer is generated around the GEDI points and spatially joined with the building vector to obtain the building height. The height of vegetation types in the high-precision height data is obtained jointly from ICESat-2 ATL08 and GEDI data (the height field is used to obtain the vegetation height). Land cover type data can be obtained from WorldCover V2 data with a 10-meter resolution.
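The ATL03 building-height rule above reduces to a difference of two means. The sketch below assumes the spatial selection (photons inside the footprint versus inside the 10 m buffer) has already been done upstream; the function name and the toy elevations are illustrative only.

```python
from statistics import mean


def building_height(photons_in_footprint, photons_in_buffer):
    """Mean photon elevation on the building minus mean elevation in the buffer ring."""
    if not photons_in_footprint or not photons_in_buffer:
        return None  # not enough photons to form a sample
    return mean(photons_in_footprint) - mean(photons_in_buffer)


roof_elevations = [71.8, 72.1, 72.4, 71.7]    # metres, photons over the roof
ground_elevations = [50.2, 49.9, 50.1, 49.8]  # metres, photons in the 10 m buffer
height = building_height(roof_elevations, ground_elevations)
```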

[0023] This embodiment uses typical areas of Beijing, Tianjin, and their surroundings as example study areas and builds a height inversion sample dataset, on which regional-scale building and vegetation height inversion is performed. The method is as follows: Sentinel-1 and Sentinel-2 data within the coverage area of Beijing and Tianjin are obtained through the Google Earth Engine (GEE) platform. For Sentinel-1, the VV and VH polarization images are selected, and texture features are extracted with the optimal 3×3 kernel using the gray-level co-occurrence matrix (GLCM): mean, dissimilarity, homogeneity, variance, entropy, correlation, contrast, and second moment, i.e., 8 textures for each of the 2 polarizations, for a total of 16 features. For Sentinel-2, the 13 original bands are acquired and 6 spectral indices (NDVI, GNDVI, SAVI, NDWI, MNDVI, and NDBI) are calculated, for a total of 19 features. Together with the Sentinel-1 texture features, they form a 35-dimensional input feature. Meanwhile, AW3D DSM data was obtained through GEE, and ESA WorldCover V2 2021 land cover type data was obtained for subsequent result fusion. Samples were then constructed separately for buildings and vegetation:
(1) Building samples: The ICESat-2 ATL03 product and GEDI data were downloaded. For ATL03 photon data, a 10 m buffer was constructed around the building footprint vector, and the average height of photons falling within the building extent and the average height of photons falling within the buffer were calculated; the difference between the two was used as the building height sample. For GEDI data, the product's built-in height attribute was used: a 2 m buffer was generated around the GEDI points and spatially joined with the building vector to obtain the building height sample.
(2) Vegetation samples: The ICESat-2 ATL08 product and GEDI data were downloaded. Vegetation-covered areas were screened using WorldCover V2 2021, and photons falling within the vegetation extent were used as vegetation samples. The vegetation height samples were generated by combining the height attributes of the ATL08 product and GEDI.
The AW3D DSM data downloaded via GEE was resampled from 30 m to 10 m resolution. Minimum smoothing filtering was applied to the resampled DSM to obtain the DEM, and the DEM was then subtracted from the DSM to obtain the nDSM. Subsequently, the consistency of the nDSM was evaluated against the generated building and vegetation sample data, producing an nDSM confidence map used for weighted constraints in subsequent model training. The registered nDSM was then cropped to the same extent and resolution as the other inputs.
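The feature bookkeeping above (16 + 19 = 35) and some of the spectral indices can be sketched as follows. The index formulas are the commonly used definitions, which the source does not spell out, so treat them as assumptions. MNDVI is left out because its definition is not given in the text, and the reflectance values are made up.

```python
def normalized_difference(a, b):
    """(a - b) / (a + b), returning 0.0 when the denominator is zero."""
    total = a + b
    return (a - b) / total if total != 0 else 0.0


def spectral_indices(green, red, nir, swir, soil_factor=0.5):
    """Common definitions (assumed); inputs are surface reflectances."""
    return {
        "NDVI": normalized_difference(nir, red),
        "GNDVI": normalized_difference(nir, green),
        "SAVI": (nir - red) * (1 + soil_factor) / (nir + red + soil_factor),
        "NDWI": normalized_difference(green, nir),
        "NDBI": normalized_difference(swir, nir),
    }


GLCM_STATS = ["mean", "dissimilarity", "homogeneity", "variance",
              "entropy", "correlation", "contrast", "second moment"]
POLARIZATIONS = ["VV", "VH"]

n_sentinel1 = len(GLCM_STATS) * len(POLARIZATIONS)  # 8 textures x 2 polarizations
n_sentinel2 = 13 + 6                                # raw bands + spectral indices
n_features = n_sentinel1 + n_sentinel2

indices = spectral_indices(green=0.08, red=0.10, nir=0.50, swir=0.20)
```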

[0024] S2. Construct a height prediction combined model that integrates a convolutional neural network (CNN) module and a Transformer module. Input the height inversion sample dataset into the height prediction combined model, perform feature extraction and fusion processing according to building and vegetation types, and obtain the fused features of each pixel in the study area. In some embodiments, the CNN module includes a satellite remote sensing data feature extraction branch, an nDSM height data channel branch, and a confidence channel branch. The satellite remote sensing data feature extraction branch extracts remote sensing features from the satellite remote sensing data. As shown in Figure 2, this branch includes a feature extraction module A and a convolution module. Feature extraction module A extracts the texture features and spectral index features of the satellite remote sensing data. The satellite remote sensing data includes Sentinel-1 satellite remote sensing data (e.g., Sentinel-1 VV and VH polarization data) and Sentinel-2 satellite remote sensing data (e.g., the 13 raw Sentinel-2 bands). Texture information is obtained from the Sentinel-1 data through the gray-level co-occurrence matrix (GLCM) and includes mean, dissimilarity, homogeneity, variance, entropy, correlation, contrast, and second moment. The Sentinel-2 data provides 13 raw bands, from which six spectral indices (NDVI, GNDVI, SAVI, NDWI, MNDVI, and NDBI) can be calculated. Together, the Sentinel-1 and Sentinel-2 data yield 35-dimensional feature data. The convolution module is a 3×3 convolutional layer: it performs convolution on the features, which are then processed sequentially through a BatchNorm layer and a ReLU activation function to obtain the remote sensing features.

[0025] The nDSM height data channel branch extracts height prior features from the nDSM height data. As shown in Figure 2, the nDSM height data channel branch processes the nDSM height data sequentially through a 3×3 convolution, a BatchNorm layer, a ReLU activation function, and a 3×3 convolution to obtain the height prior features. The confidence channel branch uses the nDSM height data as prior data and the high-precision height data as supervision data to perform confidence-gated weak supervision processing and obtain confidence features. As shown in Figure 2, the confidence channel branch includes 3×3 convolution processing and Sigmoid activation function processing. The CNN module can thus extract multimodal features from data from different sources, providing rich information for subsequent fusion and learning. The confidence channel branch is constructed with the following loss function: $L_{total} = \lambda_1 L_{sup} + \lambda_2 \cdot \frac{1}{N}\sum_{i=1}^{N} c_i \, L_{weak}^{(i)}$, where $L_{total}$ is the total loss, $\lambda_1$ and $\lambda_2$ are the weighting coefficients, $L_{sup}$ is the supervised loss, $N$ is the total number of sample pixels or cells, $c_i$ is the confidence of pixel or cell $i$, and $L_{weak}^{(i)}$ is the weak supervision loss of pixel or cell $i$.
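One way to read the confidence-gated loss described above is a supervised term plus a weak-supervision term in which each pixel's deviation from the nDSM prior is scaled by its confidence. The sketch below follows that reading. The squared-error form of both terms, the default weights, and the function name `total_loss` are assumptions rather than details from the source.

```python
def total_loss(pred, target, ndsm_prior, confidence, w_sup=1.0, w_weak=0.5):
    """Supervised MSE plus confidence-weighted MSE against the nDSM prior (assumed form)."""
    n = len(pred)
    supervised = sum((p - t) ** 2 for p, t in zip(pred, target)) / n
    weak = sum(
        c * (p - d) ** 2 for p, d, c in zip(pred, ndsm_prior, confidence)
    ) / n
    return w_sup * supervised + w_weak * weak


pred = [10.0, 20.0]
target = [12.0, 20.0]      # high-precision heights (supervision)
ndsm_prior = [10.0, 30.0]  # nDSM heights (weak prior)

# Pixel 2 disagrees with the nDSM by 10 m; its confidence decides how much that costs.
loss_low_conf = total_loss(pred, target, ndsm_prior, confidence=[1.0, 0.0])
loss_high_conf = total_loss(pred, target, ndsm_prior, confidence=[1.0, 1.0])
```

With zero confidence on the second pixel the nDSM disagreement is ignored and only the supervised error remains, which is the suppression behaviour the text attributes to low-confidence regions.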

[0026] The fused features are obtained by fusing the remote sensing features, height prior features, and confidence features. As shown in Figure 2, the CNN module joins the remote sensing features, height prior features, and confidence features by channel concatenation, compresses the channels to 64 dimensions using a 1×1 convolution, and applies channel attention. Finally, it performs weighted fusion in residual form to obtain the fused features. To strengthen the feature modeling capability of the CNN module, this embodiment designs a cross-channel collaborative attention enhancement mechanism that further improves the expressive power of the features and keeps key features prominent, i.e., the channel attention module in Figure 2. It consists of average pooling and max pooling operations that compute the global statistics of each channel; channel weights are then calculated through a fully connected network, and the attention weights are output by a sigmoid activation function. As a result, regions with higher confidence receive stronger nDSM prior injection, while in regions with lower confidence the weak supervision effect is automatically suppressed. This mechanism reduces the negative interference of low-quality nDSM on training and achieves adaptive weighted constraints for weak supervision. To improve the model's generalization ability across different cities and terrain features, this embodiment can employ a multi-city joint training mechanism that merges the training points from each city into a unified training coordinate set; the validation points are pooled in the same way. To enhance robustness and reduce sample distribution bias, this embodiment performs bootstrap resampling on the unified training set, with the bootstrap sampling ratio preferably 0.8. This strategy ensures that the model sees a diverse set of samples during training, providing a diverse source for subsequent ensemble learning.
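The multi-city pooling and bootstrap step can be sketched as below. Drawing with replacement is the usual meaning of bootstrap and is assumed here; the function name, the per-member seeds, and the toy point identifiers are illustrative.

```python
import random


def bootstrap_sample(points, ratio=0.8, seed=0):
    """Draw round(ratio * N) training points with replacement (assumed)."""
    rng = random.Random(seed)
    k = int(round(len(points) * ratio))
    return [rng.choice(points) for _ in range(k)]


# Pool training points from several cities into one unified set.
beijing = [("beijing", i) for i in range(50)]
tianjin = [("tianjin", i) for i in range(50)]
train_points = beijing + tianjin

# One resampled training set per ensemble member, each with its own seed.
member_sets = [bootstrap_sample(train_points, ratio=0.8, seed=s) for s in range(3)]
```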

[0027] In the case study of typical areas in Beijing, Tianjin, and their surrounding regions, the first 35 dimensions are multi-source remote sensing features, from which local features are extracted through convolution and channel attention. The 36th dimension, the nDSM, yields the height prior features through an independent convolution branch. The 37th dimension, the confidence, yields the confidence features through convolution and sigmoid. The remote sensing features, nDSM prior features, and confidence features are concatenated and fused, and the key feature expression is enhanced by a 1×1 convolution and channel attention. The result is added to the basic features as a residual to preserve the original remote sensing representation. The fused features are unfolded into a sequence, and a learnable positional encoding is added. This sequence is then input into the Transformer module for global dependency modeling, and the output representation vector at the center position is used for height regression. The predicted height is output through a fully connected regression head to achieve building or vegetation height inversion. The experimental parameters of this case study are shown in Table 1, and the experimental server performance is shown in Table 2.
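The channel attention and residual fusion described above can be sketched in plain Python on a tiny feature map. This follows the common pattern of a shared fully connected network applied to average-pooled and max-pooled channel descriptors; summing the two outputs before the sigmoid, the `x + w * x` residual form, and all weights and names are assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(channels, w1, w2):
    """channels: C flat lists of pixel values. w1: hidden x C, w2: C x hidden."""
    def shared_mlp(vec):
        hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]  # ReLU
        return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

    avg_desc = [sum(ch) / len(ch) for ch in channels]  # global average pooling
    max_desc = [max(ch) for ch in channels]            # global max pooling
    logits = [a + m for a, m in zip(shared_mlp(avg_desc), shared_mlp(max_desc))]
    return [sigmoid(z) for z in logits]


def fuse_with_residual(channels, weights):
    """Residual weighting: out = x + w * x, so the original features are preserved."""
    return [[v + w * v for v in ch] for ch, w in zip(channels, weights)]


features = [[1.0, 3.0], [2.0, 2.0]]  # 2 channels, 2 pixels each
w1 = [[0.0, 0.0]]                    # hypothetical untrained weights (all zero)
w2 = [[0.0], [0.0]]
weights = channel_attention(features, w1, w2)
fused = fuse_with_residual(features, weights)
```

With all-zero weights every channel gets an attention weight of 0.5, which makes the residual behaviour easy to check by hand; trained weights would differ per channel.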

[0028] The Transformer module captures the spatial dependency information of each pixel in the study area through a multi-head self-attention mechanism. The Transformer module includes a positional encoding module, a multi-head self-attention mechanism module, and a multilayer perceptron (MLP) module. The positional encoding module encodes the pixel positions in the study area, the multi-head self-attention mechanism module captures long-range global spatial dependencies, and the MLP module enhances feature representation capabilities through nonlinear transformations. As shown in Figure 3, the Transformer module also includes an adaptive position-sensitive information injection mechanism applied before processing, in which the fused features of the pixel grid in the study area are rearranged in dimension and converted into a pixel-level feature sequence.
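The rearrangement of a pixel grid into a token sequence, followed by self-attention, can be sketched as below. To stay short this uses a single attention head with identity query, key, and value projections and an all-zero positional table. A real model would use learned projections, several heads, and a learned positional encoding, so every name and value here is an assumption.

```python
import math


def grid_to_sequence(grid):
    """(H, W, C) nested lists -> (H*W, C) list of tokens in row-major order."""
    return [list(pixel) for row in grid for pixel in row]


def add_position(tokens, pos_table):
    """Add one positional vector per token (learnable in a real model)."""
    return [[t + p for t, p in zip(tok, pos)] for tok, pos in zip(tokens, pos_table)]


def self_attention(tokens):
    """Single-head scaled dot-product attention with identity Q/K/V projections."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        norm = sum(exps)
        attn = [e / norm for e in exps]
        out.append([sum(w * v[j] for w, v in zip(attn, tokens)) for j in range(d)])
    return out


# 3x3 patch with 2 channels per pixel: channel values are the (row, col) indices.
grid = [[[float(r), float(c)] for c in range(3)] for r in range(3)]
tokens = grid_to_sequence(grid)
pos_table = [[0.0, 0.0] for _ in tokens]
attended = self_attention(add_position(tokens, pos_table))
center = attended[len(attended) // 2]  # centre-pixel vector, used for regression
```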

[0029] The height prediction combined model is trained on the height inversion sample dataset for building and vegetation type recognition and segmentation, as well as for height prediction. The height inversion sample dataset is divided into training, validation, and test sets in a 7:2:1 ratio, used respectively for training, hyperparameter tuning, and performance evaluation of the height prediction combined model. During the height prediction training phase, for each input sample $i$ the height prediction combined model outputs the predicted height $\hat{h}_i$, and the corresponding actual height is $h_i$. The mean squared error loss function is used to optimize the height prediction combined model, as shown in the following expression:

$$L_{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{h}_i - h_i\right)^2$$

where $N$ is the total number of sample pixels or cells. The loss function is also used directly to evaluate the height prediction accuracy of the height prediction combined model.
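As a rough illustration of the 7:2:1 division and the mean squared error loss, the following NumPy sketch may help. The random shuffling, the fixed seed, and the helper names `split_indices` and `mse_loss` are assumptions; the patent states only the ratio and the loss type.

```python
import numpy as np

def split_indices(n_samples, ratios=(0.7, 0.2, 0.1), seed=0):
    """Shuffle sample indices and cut them into train / validation / test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(round(ratios[0] * n_samples))
    n_val = int(round(ratios[1] * n_samples))
    train = order[:n_train]
    val = order[n_train:n_train + n_val]
    test = order[n_train + n_val:]       # remainder goes to the test set
    return train, val, test

def mse_loss(pred, target):
    """Mean squared error between predicted and actual heights."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean((pred - target) ** 2))

train_idx, val_idx, test_idx = split_indices(1000)
loss = mse_loss([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])   # one error of 2 m over 3 samples
```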

[0030] S3. Obtain a height inversion dataset for the study area, including nDSM height data, high-precision height data, satellite remote sensing data, and land cover type data (the height inversion dataset is spatially aligned and matched). Input the height inversion dataset into the height prediction combination model to obtain rasterized type and height data for the study area. The type and height data include the building or vegetation type of the pixel and the predicted height value. Output the type and height data (i.e., including the type and height data of all pixels) across the entire study area, so that the final height product is consistent with the land cover semantics. The building area outputs the building height, the vegetation area outputs the vegetation height, and the corresponding error is output at the same time, realizing the integrated expression of height and credibility.

[0031] In some embodiments, the height prediction combined model is constructed using a deep learning network, which obtains the predicted height value of a pixel as follows. As shown in Figure 4, n random seeds are set, and under each seed the height prediction combined model predicts the height from the same features, yielding n initial predicted height values. The average of these n values is the predicted height value, and their standard deviation is used as the prediction error. When pixel segmentation identifies the type as a building, the expressions are as follows:

$$\bar{H}_{b} = \frac{1}{n}\sum_{k=1}^{n} H_{b,k}, \qquad \sigma_{b} = \sqrt{\frac{1}{n}\sum_{k=1}^{n}\left(H_{b,k} - \bar{H}_{b}\right)^{2}}$$

where $\bar{H}_{b}$ is the average predicted height of the building type pixel, $n$ is the number of random seeds, $H_{b,k}$ is the predicted height value for random seed $k$, and $\sigma_{b}$ is the standard deviation for the building type pixel.

[0032] When pixel segmentation identifies the type as vegetation, the expressions are as follows:

$$\bar{H}_{v} = \frac{1}{n}\sum_{k=1}^{n} H_{v,k}, \qquad \sigma_{v} = \sqrt{\frac{1}{n}\sum_{k=1}^{n}\left(H_{v,k} - \bar{H}_{v}\right)^{2}}$$

where $\bar{H}_{v}$ is the average predicted height of the vegetation type pixel, $n$ is the number of random seeds, $H_{v,k}$ is the predicted height value for random seed $k$, and $\sigma_{v}$ is the standard deviation for the vegetation type pixel. The height prediction combined model therefore provides not only height estimates but also prediction errors, which better reflect how well the model accepts the feature data and provide an assessment of the reliability of the prediction results.
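The per-pixel mean and standard deviation over n seed-specific predictions can be sketched in a few lines of NumPy. The function name `ensemble_height` is hypothetical, and the use of the population standard deviation (division by n rather than n−1) is an assumption, since the text does not state which form is intended.

```python
import numpy as np

def ensemble_height(seed_predictions):
    """Combine n per-seed height predictions into a height and an error.

    seed_predictions: array-like of shape (n, ...) holding one prediction
    per random seed for the same pixel(s).
    Returns (mean height, standard deviation) along the seed axis.
    """
    preds = np.asarray(seed_predictions, dtype=float)
    mean = preds.mean(axis=0)
    std = preds.std(axis=0)   # ddof=0: population form, an assumption here
    return mean, std

# Five hypothetical seed outputs (in metres) for a single pixel.
height, error = ensemble_height([10.0, 12.0, 14.0, 16.0, 18.0])
```

The same function applies unchanged to whole rasters if each seed's output is stacked along the first axis.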

[0033] Preferably, the present invention sets up two independent training processes, one for building height inversion and one for vegetation height inversion. For each task, n different random seeds are set to train n sub-models, forming a model set. For each model set, the average output of the sub-models is calculated, along with the validation loss and evaluation metrics. The value of n and the models are adaptively adjusted to ultimately output the optimal model set for the current region. During training, the network input patch size is set to 5×5; to avoid boundary overflow, sampling points must be at least 2 pixels away from the boundary. The training batch size is set to 64, and the validation batch size is preferably 1. The number of training epochs is set to 100, the optimizer is Adam, and the learning rate is set to... In each epoch, forward propagation produces the predicted height values, the loss is calculated, and backpropagation updates the parameters. To avoid training interruption caused by individual invalid samples, the invention applies a safe batch assembly strategy: invalid samples (such as those with abnormal channel counts or invalid regions) are filtered out, and when a batch is entirely invalid, a zero tensor of the same shape is constructed as a safe input so that training continues stably. After each epoch, the validation loss and evaluation metrics are calculated on the validation set, including mean squared error (MSE), root mean square error (RMSE), coefficient of determination ($R^2$), and mean relative error (MRE), where:

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{h}_i - h_i\right)^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{N}\left(\hat{h}_i - h_i\right)^2}{\sum_{i=1}^{N}\left(h_i - \bar{h}\right)^2}, \qquad \mathrm{MRE} = \frac{1}{N}\sum_{i=1}^{N}\frac{\left|\hat{h}_i - h_i\right|}{h_i}$$

Here $\hat{h}_i$ is the predicted height value, $h_i$ is the actual height value (which can be extracted from the high-precision height data), $\bar{h}$ is the mean of the actual height values, and $N$ is the number of validation samples.
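A minimal NumPy sketch of the safe batch assembly strategy and the validation metrics is given below. The validity test (37 channels on a 5×5 patch with finite values), the returned flag, and the names `safe_batch` and `regression_metrics` are assumptions for illustration; the patent does not define what counts as an invalid region or how the zero tensor is handled downstream.

```python
import numpy as np

CHANNELS, PATCH = 37, 5   # 35 remote sensing + nDSM + confidence, 5x5 patch

def is_valid(patch):
    """Assumed validity rule: expected shape and only finite values."""
    return (isinstance(patch, np.ndarray)
            and patch.shape == (CHANNELS, PATCH, PATCH)
            and bool(np.isfinite(patch).all()))

def safe_batch(patches, batch_size):
    """Drop invalid samples; if none remain, return a zero batch as a safe input."""
    valid = [p for p in patches if is_valid(p)]
    if not valid:
        return np.zeros((batch_size, CHANNELS, PATCH, PATCH)), False
    return np.stack(valid), True

def regression_metrics(pred, target):
    """MSE, RMSE, R^2 and MRE; targets are assumed to be non-zero heights."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    err = pred - target
    mse = float(np.mean(err ** 2))
    ss_tot = float(np.sum((target - target.mean()) ** 2))
    return {
        "mse": mse,
        "rmse": float(np.sqrt(mse)),
        "r2": 1.0 - float(np.sum(err ** 2)) / ss_tot,
        "mre": float(np.mean(np.abs(err) / np.abs(target))),
    }

good = np.ones((CHANNELS, PATCH, PATCH))
bad = np.ones((CHANNELS - 1, PATCH, PATCH))        # abnormal channel count
batch, has_data = safe_batch([good, bad], batch_size=2)
empty, empty_has_data = safe_batch([bad], batch_size=4)
metrics = regression_metrics([2.0, 4.0, 6.0], [1.0, 4.0, 7.0])
```

Returning a flag alongside the zero batch lets a training loop skip the loss for an all-invalid batch, which is one possible design; the source only says the zero tensor keeps training running.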

[0034] In the case study of typical areas in Beijing and Tianjin and their surrounding regions, the following integrated prediction methods were used: Building height prediction: For each pixel within the Beijing, Tianjin, and surrounding urban areas, an input feature patch was constructed. Five building models were used for prediction, and the mean of the five outputs was taken as the building height prediction result, while the standard deviation was taken as the building height prediction error. Vegetation height prediction: Five vegetation models were used for prediction, and the mean vegetation height and vegetation prediction error were output. If the pixel category was building (Built-up), the building height result and its error were taken; if the category was vegetation (Tree cover), the vegetation height result and its error were taken, thus generating a unified regional height product and error product. The final output is a building-vegetation fusion height map and corresponding prediction error map within the Beijing, Tianjin, and surrounding urban areas, which can be used for applications such as urban planning, ecological monitoring, and disaster assessment. Through a complete process of multi-source feature fusion, weakly supervised constraints, and ensemble learning evaluation, it can achieve regional-scale inversion and reliability output of building and vegetation heights in Beijing, Tianjin, and surrounding cities.
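The class-based merging of the building and vegetation results into one height product and one error product can be sketched as follows. The numeric class codes 50 (Built-up) and 10 (Tree cover) are an assumption that follows the ESA WorldCover legend; the patent names the classes but not the codes. The function name `merge_products` and the NaN fill for other classes are also illustrative choices.

```python
import numpy as np

BUILT_UP, TREE_COVER = 50, 10   # assumed codes (ESA WorldCover legend)

def merge_products(land_cover, b_mean, b_std, v_mean, v_std):
    """Pick building or vegetation height and error per pixel by land cover class.

    Pixels of any other class are left as NaN in both outputs.
    """
    height = np.full(land_cover.shape, np.nan, dtype=float)
    error = np.full(land_cover.shape, np.nan, dtype=float)
    is_building = land_cover == BUILT_UP
    is_tree = land_cover == TREE_COVER
    height[is_building] = b_mean[is_building]
    error[is_building] = b_std[is_building]
    height[is_tree] = v_mean[is_tree]
    error[is_tree] = v_std[is_tree]
    return height, error

land_cover = np.array([[50, 10], [30, 50]])
b_mean, b_std = np.full((2, 2), 20.0), np.full((2, 2), 2.0)   # building ensemble
v_mean, v_std = np.full((2, 2), 8.0), np.full((2, 2), 1.0)    # vegetation ensemble
height_map, error_map = merge_products(land_cover, b_mean, b_std, v_mean, v_std)
```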

[0035] In some embodiments, the height prediction combined model includes a building height prediction unit and a vegetation height prediction unit. The building height prediction unit identifies and segments a building sample dataset from the height inversion sample dataset, and the vegetation height prediction unit identifies and segments a vegetation sample dataset from the height inversion sample dataset. The height prediction combined model uses the building sample dataset and the vegetation sample dataset to perform more accurate building and/or vegetation type identification and segmentation on the pixels in the study area's height inversion dataset. The building height prediction unit uses the building sample dataset to perform more accurate height prediction on pixels identified as building types, and the vegetation height prediction unit uses the vegetation sample dataset to perform more accurate height prediction on pixels identified as vegetation types.

[0036] The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention should be included within the protection scope of the present invention.

Claims

1. A method for inverting urban building and vegetation height by fusing multi-source remote sensing data, characterized in that the method comprises: S1. Construct a height inversion sample dataset for the study area, which includes nDSM height data, high-precision height data, satellite remote sensing data, and land cover type data. S2. Construct a height prediction combined model that integrates a convolutional neural network (CNN) module and a Transformer module. Input the height inversion sample dataset into the height prediction combined model, perform feature extraction and fusion processing according to building and vegetation types, and obtain the fused features of each pixel in the study area. The Transformer module captures the spatial dependency information of each pixel in the study area through a multi-head self-attention mechanism module; the height prediction combined model is trained on the height inversion sample dataset for building and vegetation type recognition and segmentation and for height prediction. S3. Obtain the height inversion dataset of the study area, which includes nDSM height data, high-precision height data, satellite remote sensing data, and land cover type data. Input the height inversion dataset into the height prediction combined model to obtain the rasterized type and height data of the study area. The type and height data include the building or vegetation type of each pixel and its predicted height value.

2. The method for inverting urban building and vegetation height by fusing multi-source remote sensing data according to claim 1, characterized in that: the nDSM height data of the height inversion sample dataset and the height inversion dataset is obtained as follows: digital surface model (DSM) data is obtained; the DSM data is resampled and subjected to minimum-value smoothing filtering to obtain bare-ground elevation (DEM) data; the difference between the DSM data and the bare-ground elevation DEM data is calculated to obtain the nDSM height data; the satellite remote sensing data includes Sentinel-1 satellite remote sensing data and Sentinel-2 satellite remote sensing data.

3. The method for inverting urban building and vegetation height by fusing multi-source remote sensing data according to claim 1, characterized in that: the sources of the high-precision height data for the height inversion sample dataset and the height inversion dataset include physically measured height data and/or height data obtained from high-density lidar; or the height of building types in the high-precision height data is obtained jointly from the ICESat-2 ATL03 product and GEDI data, and the height of vegetation types in the high-precision height data is obtained jointly from the ICESat-2 ATL08 product and GEDI data.

4. The method for inverting urban building and vegetation height by fusing multi-source remote sensing data according to claim 1, characterized in that: in step S2, the convolutional neural network (CNN) module includes a satellite remote sensing data feature extraction branch, an nDSM height data channel branch, and a confidence channel branch. The satellite remote sensing data feature extraction branch extracts remote sensing features from the satellite remote sensing data, and the nDSM height data channel branch extracts the height prior features from the nDSM height data. The confidence channel branch uses the nDSM height data as prior data and the high-precision height data as supervision data to perform confidence-gated weak supervision processing and obtain confidence features. The remote sensing features, height prior features, and confidence features are fused to obtain the fused features.

5. The method for inverting urban building and vegetation height by fusing multi-source remote sensing data according to claim 1, characterized in that: the Transformer module includes a position encoding module, a multi-head self-attention mechanism module, and a multilayer perceptron (MLP) module. The position encoding module encodes the pixel locations in the study area, the multi-head self-attention mechanism module captures long-range global spatial dependencies, and the MLP module enhances feature representation capabilities through nonlinear transformations.

6. The method for inverting urban building and vegetation height by fusing multi-source remote sensing data according to claim 1, characterized in that: in step S3, the height prediction combined model is constructed using a deep learning network, which obtains the predicted height value of a pixel as follows: n random seeds are set, and under each seed the height prediction combined model processes the same features to obtain n initial predicted height values. The average of the n initial predicted height values is the predicted height value, and the standard deviation of the n initial predicted height values is taken as the prediction error.

7. The method for inverting urban building and vegetation height by fusing multi-source remote sensing data according to claim 1, characterized in that: the height prediction combined model includes a building height prediction unit and a vegetation height prediction unit. The building height prediction unit identifies and segments a building sample dataset from the height inversion sample dataset, and the vegetation height prediction unit identifies and segments a vegetation sample dataset from the height inversion sample dataset. The height prediction combined model uses the building sample dataset and the vegetation sample dataset to perform more accurate building and/or vegetation type identification and segmentation on the pixels of the height inversion dataset of the study area. The building height prediction unit uses the building sample dataset to perform more accurate height prediction on pixels identified as building types, and the vegetation height prediction unit uses the vegetation sample dataset to perform more accurate height prediction on pixels identified as vegetation types.

8. The method for inverting urban building and vegetation height by fusing multi-source remote sensing data according to claim 1, characterized in that: the satellite remote sensing data feature extraction branch includes a feature extraction module A and a convolution module. The feature extraction module A extracts the texture features and spectral index features of the satellite remote sensing data. The convolution module is a 3×3 convolutional layer that convolves these features, which are then processed sequentially by a BatchNorm layer and a ReLU activation function to obtain the remote sensing features. The nDSM height data channel branch processes the nDSM height data sequentially through a 3×3 convolution, a BatchNorm layer, a ReLU activation function, and a 3×3 convolution to obtain the height prior features. The confidence channel branch includes 3×3 convolution processing and Sigmoid activation function processing.

9. The method for inverting urban building and vegetation height by fusing multi-source remote sensing data according to claim 5, characterized in that: an adaptive position-sensitive information injection mechanism is also applied before the Transformer module processes the data, in which the fused features of the pixel grid in the study area are rearranged in dimension and converted into a pixel-level feature sequence.

10. The method for inverting urban building and vegetation height by fusing multi-source remote sensing data according to claim 4, characterized in that: the confidence channel branch is constructed with a loss function in which the total loss combines the supervised loss and the weak supervision loss through weighting coefficients, where N is the total number of sample pixels or cells and each pixel or cell i has an associated confidence level.