A building height and profile collaborative extraction method
By constructing a collaborative extraction network model and using a dual-branch coding and spatial frequency fusion module to fuse features from SAR and optical image data, the problem of insufficient extraction accuracy for building height and contours was solved, achieving efficient and accurate extraction.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CENT SOUTH UNIV
- Filing Date
- 2025-06-03
- Publication Date
- 2026-06-26
AI Technical Summary
Existing technologies for extracting building height and contours suffer from insufficient interaction of complementary data modal features and high computational overhead, resulting in insufficient extraction accuracy.
A collaborative extraction method for building height and contour is adopted. By constructing a collaborative extraction network model, a dual-branch coding module is used to extract features from SAR image data and optical image data, and a spatial frequency fusion module is used to fuse the dual-domain features. Finally, the height and contour extraction results are output through a dual-branch decoding module.
It improves the accuracy of building height and contour extraction, achieves full fusion of complementary features from optical image data and SAR image data, and reduces computational overhead.
Description
Technical Field
[0001] This invention relates to the field of urban planning and construction technology, and in particular to a method for the collaborative extraction of building height and outline.
Background Technology
[0002] Building height, as a crucial parameter of urban three-dimensional spatial structure, has seen its accurate extraction become a key technological breakthrough in modern urban science. With two-dimensional geographic information increasingly insufficient to meet the needs of smart city development, the acquisition of three-dimensional building data not only signifies a revolution in urban surveying technology but also profoundly impacts the transformation and upgrading of urban governance paradigms. Building height data supports quantitative research on urban morphology, providing new solutions to traditional challenges such as the urban heat island effect and visual corridor control. Simultaneously, research on the correlation between building height and energy consumption reveals the vertical development patterns of cities, providing decision-making support for sustainable urban development. At this juncture of urban digital transformation, the value of building height extraction technology has transcended the traditional scope of surveying and mapping, reshaping the dimensions of urban cognition. Building height data has evolved beyond a single geographic information attribute, becoming a vital parameter for the operation of complex urban systems, offering new methodologies for addressing urban problems and optimizing resource allocation.
[0003] As urbanization advances, urban built-up areas are growing rapidly and the proportion of high-rise buildings is continuously increasing. Traditional manual measurement can no longer meet the data collection needs of complex building clusters in megacities. The development of Earth observation technology has made automated extraction of building heights over large areas possible. Early methods were primarily based on shadows, stereo pairs, or Interferometric Synthetic Aperture Radar (InSAR) images. Shadow-based methods calculate building heights from the spatial geometric relationship between the sun, buildings, and shadows; stereo-pair-based methods estimate heights through photogrammetric analysis of optical stereo images; and InSAR-based methods use high-resolution TerraSAR-X / TanDEM-X satellite data, obtaining ground object heights through interferometric processing. Although some institutions have used these methods to generate a series of building height products, the methods have limitations when estimating building heights over large areas, such as inaccurate extraction caused by building shadows, the difficulty of acquiring ultra-high-resolution stereo pair data, and the need of InSAR methods for auxiliary DSM data. A building height extraction method that is both cost-effective and efficient is therefore needed. With the development of machine learning, methods based on open-source data have gradually replaced traditional height extraction methods. By data type, these methods can be divided mainly into those based on optical imagery and those based on synthetic aperture radar (SAR) data. Optical imagery, with its advantages of high-resolution texture, multispectral information, and historical archives, is suitable for building boundary identification and shadow analysis in good weather conditions. However, its height inversion relies on indirect methods and is limited by vegetation obstruction and weather conditions. SAR, relying on the all-weather operation and penetration characteristics of microwaves, can acquire high-resolution data regardless of weather and illumination. Its backscattering values show a strong correlation with building height, opening up new avenues for building height estimation. Some studies have therefore explored the potential of Sentinel-1 SAR data for building height extraction by constructing mapping indices and spatial regression models. However, SAR imagery lacks fine spatial detail, making it difficult to accurately depict the outline structure of buildings.
[0004] Optical imagery possesses the ability to delineate building outlines, while SAR imagery offers advantages in all-weather operation and structural awareness. The two complement each other in the problem of building height extraction. Current research has explored pathways for collaboratively extracting building outlines and heights by fusing Sentinel-1 SAR images and Sentinel-2 optical images using traditional machine learning models such as support vector machines and random forests, or deep learning models. These methods have effectively improved the results of large-scale building height extraction. However, SAR images and optical images exhibit significant heterogeneity and distinct feature information. Existing research often employs loosely coupled feature concatenation or data concatenation after feature extraction in the spatial domain, resulting in insufficient interaction of complementary features between the two data modalities and a tendency for information redundancy. On the other hand, early methods used convolutional neural networks to integrate multimodal information, but they typically exhibit local sensitivity and a lack of long-range dependencies, thus limiting their ability to integrate relevant features from both modalities to achieve building height extraction. In contrast, Transformer-based models, characterized by their large receptive field and global sensitivity, often outperform convolutional neural networks in capturing broad contextual information. However, these models suffer from significant computational overhead due to the quadratic increase in resources relative to sequence length.
Summary of the Invention
[0005] This invention provides a method for the collaborative extraction of building height and contour, the purpose of which is to improve the accuracy of building height and contour extraction.
[0006] To achieve the above objectives, the present invention provides a method for collaborative extraction of building height and contour, comprising:
[0007] Step 1: Obtain the building height and contour extraction dataset. The dataset includes vertical-vertical (VV) polarization band data and vertical-horizontal (VH) polarization band data from the SAR image data of the building, four bands of data from the optical image data of the building, and reference height data and reference contour data of the building;
[0008] Step 2: Train the constructed collaborative extraction network model using the building height and contour extraction dataset to obtain the trained collaborative extraction network model;
[0009] Step 3: Input the SAR image data and optical image data of the target building into the trained collaborative extraction network model to obtain the height extraction result and the contour extraction result of the target building.
[0010] The collaborative extraction network model includes: a dual-branch coding module for feature extraction from SAR image data and optical image data, a spatial frequency fusion module for fusing dual-domain features, and a dual-branch decoding module for outputting building height extraction results and building contour extraction results.
[0011] Furthermore, the dual-branch coding module includes:
[0012] First coding unit, second coding unit, third coding unit, fourth coding unit, fifth coding unit, sixth coding unit, seventh coding unit, eighth coding unit;
[0013] First downsampling unit, second downsampling unit, third downsampling unit, fourth downsampling unit, fifth downsampling unit, sixth downsampling unit;
[0014] First feature fusion unit, second feature fusion unit, third feature fusion unit;
[0015] The input terminals of the first coding unit and the fifth coding unit are both input terminals of the dual-branch coding module;
[0016] The first output of the first encoding unit and the first output of the fifth encoding unit are both connected to the input of the first feature fusion unit. The output of the first feature fusion unit is connected to the second input of the dual-branch decoding module. The second output of the first encoding unit is connected to the input of the first downsampling unit. The output of the first downsampling unit is connected to the input of the second encoding unit. The second output of the fifth encoding unit is connected to the input of the fourth downsampling unit. The output of the fourth downsampling unit is connected to the input of the sixth encoding unit.
[0017] The first output of the second encoding unit and the first output of the sixth encoding unit are both connected to the input of the second feature fusion unit. The output of the second feature fusion unit is connected to the third input of the dual-branch decoding module. The second output of the second encoding unit is connected to the input of the second downsampling unit. The output of the second downsampling unit is connected to the input of the third encoding unit. The second output of the sixth encoding unit is connected to the input of the fifth downsampling unit. The output of the fifth downsampling unit is connected to the input of the seventh encoding unit.
[0018] The first output of the third encoding unit and the first output of the seventh encoding unit are both connected to the input of the third feature fusion unit. The output of the third feature fusion unit is connected to the fourth input of the dual-branch decoding module. The second output of the third encoding unit is connected to the input of the third downsampling unit. The output of the third downsampling unit is connected to the input of the fourth encoding unit. The second output of the seventh encoding unit is connected to the input of the sixth downsampling unit. The output of the sixth downsampling unit is connected to the input of the eighth encoding unit.
[0019] The outputs of the fourth coding unit and the eighth coding unit are both connected to the input of the spatial frequency fusion module.
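The wiring of the dual-branch coding module described above can be traced with a small data-flow sketch. This is only an illustration under stated assumptions: the coding, downsampling, and fusion units are replaced by hypothetical stand-ins (`coding_unit`, `downsample`, `fuse`) operating on plain lists, and the names `branch_a` and `branch_b` are placeholders, since the text does not state which branch receives the SAR data.

```python
from typing import List, Tuple

Feature = List[float]  # stand-in for a feature map


def coding_unit(x: Feature) -> Feature:
    # Hypothetical stand-in for a coding unit: passes features through unchanged.
    return list(x)


def downsample(x: Feature) -> Feature:
    # Hypothetical stand-in for a downsampling unit: keeps every second element.
    return x[::2]


def fuse(a: Feature, b: Feature) -> Feature:
    # Hypothetical stand-in for a feature fusion unit: concatenates both branches.
    return a + b


def dual_branch_encode(
    branch_a: Feature, branch_b: Feature
) -> Tuple[List[Feature], Tuple[Feature, Feature]]:
    """Return the three fused skip features and the two deepest branch features."""
    skips: List[Feature] = []
    a, b = branch_a, branch_b
    for level in range(4):
        a = coding_unit(a)  # coding units 1 to 4
        b = coding_unit(b)  # coding units 5 to 8
        if level < 3:
            skips.append(fuse(a, b))  # feature fusion units 1 to 3
            a = downsample(a)         # downsampling units 1 to 3
            b = downsample(b)         # downsampling units 4 to 6
    # The deepest features go to the spatial frequency fusion module.
    return skips, (a, b)
```

The three entries of `skips` correspond to the outputs routed to the second, third, and fourth inputs of the dual-branch decoding module, while the returned pair feeds the spatial frequency fusion module.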
[0020] Furthermore, the first coding unit and the fifth coding unit have the same structure;
[0021] The first coding unit includes a first convolutional block, a first half-instance normalized residual block, a second half-instance normalized residual block, and a third half-instance normalized residual block connected in sequence.
[0022] The input of the first convolutional block is the input of the first coding unit, and the output of the third half-instance normalized residual block is the first output and the second output of the first coding unit.
[0023] Furthermore, the first feature fusion unit has the same structure as the second and third feature fusion units;
[0024] The first feature fusion unit includes a splicing block, a second convolutional block, a batch normalization block, and an activation function block connected in sequence.
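As a rough illustration of the splicing, convolution, batch normalization, and activation sequence, the following NumPy sketch may help. The kernel size and the activation function are not specified in the text, so a 1×1 convolution and ReLU are assumed here, and the batch normalization omits learnable scale and shift; the function name `feature_fusion` is hypothetical.

```python
import numpy as np


def feature_fusion(a, b, weight, eps=1e-5):
    """Fuse two feature maps of shape (N, C, H, W).

    weight has shape (C_out, 2 * C) and acts as a 1x1 convolution (assumed).
    """
    x = np.concatenate([a, b], axis=1)            # splicing block (channel concat)
    y = np.einsum("oc,nchw->nohw", weight, x)     # second convolutional block (1x1 assumed)
    mean = y.mean(axis=(0, 2, 3), keepdims=True)  # batch statistics per channel
    var = y.var(axis=(0, 2, 3), keepdims=True)
    y = (y - mean) / np.sqrt(var + eps)           # batch normalization block
    return np.maximum(y, 0.0)                     # activation function block (ReLU assumed)
```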
[0025] Furthermore, the second coding unit has the same structure as the third, fourth, sixth, seventh, and eighth coding units;
[0026] The second coding unit includes a patch embedding block, a first Mamba block for feature learning, a patch inverse embedding block, and a third convolutional block connected in sequence.
[0027] The input of the patch embedding block is the input of the second coding unit, and the output of the third convolutional block is the output of the second coding unit.
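The patch embedding and patch inverse embedding steps can be pictured as a reshaping round trip between a feature map and a token sequence. The sketch below is an assumption-laden illustration: it omits any learned projection, treats the Mamba block as an opaque callable (identity by default), and leaves out the final convolutional block; the names `patch_embed`, `patch_unembed`, and `coding_unit_sketch` are hypothetical.

```python
import numpy as np


def patch_embed(x, p):
    """Split a (C, H, W) feature map into a sequence of flattened p x p patches."""
    c, h, w = x.shape
    x = x.reshape(c, h // p, p, w // p, p)
    x = x.transpose(1, 3, 0, 2, 4)  # (H/p, W/p, C, p, p)
    return x.reshape((h // p) * (w // p), c * p * p)


def patch_unembed(tokens, c, h, w, p):
    """Inverse of patch_embed: rebuild the (C, H, W) feature map."""
    x = tokens.reshape(h // p, w // p, c, p, p)
    x = x.transpose(2, 0, 3, 1, 4)  # (C, H/p, p, W/p, p)
    return x.reshape(c, h, w)


def coding_unit_sketch(x, p, sequence_block=lambda t: t):
    # sequence_block stands in for the Mamba block (identity placeholder here).
    c, h, w = x.shape
    tokens = sequence_block(patch_embed(x, p))
    return patch_unembed(tokens, c, h, w, p)
```

With the identity placeholder, the round trip returns the input unchanged, which is a convenient check that the two reshaping steps are exact inverses.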
[0028] Furthermore, the spatial frequency fusion module includes:
[0029] Spatial domain fusion unit, frequency domain fusion unit, adaptive spatial frequency fusion unit;
[0030] The input of the spatial domain fusion unit is connected to the output of the fourth coding unit and the output of the eighth coding unit, respectively, and the output of the spatial domain fusion unit is connected to the input of the adaptive spatial frequency fusion unit.
[0031] The input of the frequency domain fusion unit is connected to the output of the fourth coding unit and the output of the eighth coding unit, respectively, and the output of the frequency domain fusion unit is connected to the input of the adaptive spatial frequency fusion unit.
[0032] The output of the adaptive spatial frequency fusion unit is connected to the first input of the dual-branch decoding module.
[0033] Furthermore, the spatial domain fusion unit includes:
[0034] The first cross-attention block consists of a first normalization layer, a second normalization layer, a first convolutional layer, a second convolutional layer, a third convolutional layer, a fourth convolutional layer, a fifth convolutional layer, a sixth convolutional layer, a seventh convolutional layer, a first reshaping layer, a second reshaping layer, a third reshaping layer, a fourth reshaping layer, a first matrix multiplier, a second matrix multiplier, and a first adder.
[0035] A spatial feedforward network consisting of a third normalization layer, an eighth convolutional layer, a ninth convolutional layer, a tenth convolutional layer, an eleventh convolutional layer, a twelfth convolutional layer, a first GELU activation layer, a second adder, and a third adder;
[0036] The input of the first normalization layer is connected to the output of the fourth coding unit, the output of the first normalization layer is connected to the input of the first convolutional layer, the output of the first convolutional layer is connected to the input of the fourth convolutional layer, the output of the fourth convolutional layer is connected to the input of the first reshaping layer, and the output of the first reshaping layer is connected to the first input of the first matrix multiplier.
[0037] The input of the second normalization layer is connected to the output of the eighth coding unit and the first input of the first adder, respectively. The output of the second normalization layer is connected to the input of the second convolutional layer and the input of the third convolutional layer, respectively. The output of the second convolutional layer is connected to the input of the fifth convolutional layer. The output of the fifth convolutional layer is connected to the input of the second reshaping layer. The output of the second reshaping layer is connected to the second input of the first matrix multiplier. The output of the first matrix multiplier is connected to the first input of the second matrix multiplier.
[0038] The output of the third convolutional layer is connected to the input of the sixth convolutional layer. The output of the sixth convolutional layer is connected to the input of the third reshaping layer. The output of the third reshaping layer is connected to the second input of the second matrix multiplier. The output of the second matrix multiplier is connected to the input of the fourth reshaping layer. The output of the fourth reshaping layer is connected to the input of the seventh convolutional layer. The output of the seventh convolutional layer is connected to the second input of the first adder. The first output of the first adder is connected to the first input of the third adder. The second output of the first adder is connected to the input of the third normalization layer.
[0039] The output of the third normalization layer is connected to the inputs of the eighth and ninth convolutional layers, respectively. The output of the eighth convolutional layer is connected to the input of the tenth convolutional layer, and the output of the tenth convolutional layer is connected to the first input of the second adder.
[0040] The output of the ninth convolutional layer is connected to the input of the eleventh convolutional layer. The output of the eleventh convolutional layer is connected to the input of the first GELU activation layer. The output of the first GELU activation layer is connected to the second input of the second adder. The output of the second adder is connected to the input of the twelfth convolutional layer. The output of the twelfth convolutional layer is connected to the second input of the third adder. The output of the third adder is connected to the input of the adaptive spatial frequency fusion unit.
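The first cross-attention block of the spatial domain fusion unit can be approximated by the following NumPy sketch. Several choices here are assumptions rather than details from the text: each pair of consecutive convolutional layers is collapsed into a single 1×1 convolution, attention is computed between spatial positions, and a scaled softmax is applied between the two matrix multiplications. The spatial feedforward network is omitted, and all function names are hypothetical.

```python
import numpy as np


def channel_norm(x, eps=1e-5):
    # Normalize each spatial position over the channel axis; x has shape (C, H, W).
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def conv1x1(x, weight):
    # 1x1 convolution as a per-position linear map over channels.
    return np.einsum("oc,chw->ohw", weight, x)


def spatial_cross_attention(f4, f8, wq, wk, wv, wo):
    """f4, f8: outputs of the fourth and eighth coding units, shape (C, H, W)."""
    c, h, w = f4.shape
    q = conv1x1(channel_norm(f4), wq).reshape(c, h * w).T  # queries, (HW, C)
    k = conv1x1(channel_norm(f8), wk).reshape(c, h * w).T  # keys, (HW, C)
    v = conv1x1(channel_norm(f8), wv).reshape(c, h * w).T  # values, (HW, C)
    scores = q @ k.T / np.sqrt(c)                  # first matrix multiplier (scaling assumed)
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)        # softmax (assumed)
    out = (attn @ v).T.reshape(c, h, w)            # second matrix multiplier + reshaping
    return f8 + conv1x1(out, wo)                   # first adder (residual connection)
```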
[0041] Furthermore, the frequency domain fusion unit includes:
[0042] The second cross-attention block consists of the fourth normalization layer, the fifth normalization layer, the sixth normalization layer, the seventh normalization layer, the thirteenth convolutional layer, the fourteenth convolutional layer, the fifteenth convolutional layer, the sixteenth convolutional layer, the seventeenth convolutional layer, the eighteenth convolutional layer, the nineteenth convolutional layer, the fifth reshaping layer, the sixth reshaping layer, the seventh reshaping layer, the eighth reshaping layer, the first Fourier transform layer, the second Fourier transform layer, the first inverse Fourier transform layer, the first element-wise multiplier, the second element-wise multiplier, and the fourth adder.
[0043] The feedforward network consists of the eighth normalization layer, the ninth reshaping layer, the tenth reshaping layer, the third Fourier transform layer, the second inverse Fourier transform layer, the twentieth convolutional layer, the twenty-first convolutional layer, the twenty-second convolutional layer, the twenty-third convolutional layer, the twenty-fourth convolutional layer, the second GELU activation layer, the third element-wise multiplier, the fourth element-wise multiplier, and the fifth adder.
[0044] The input of the fourth normalization layer is connected to the output of the eighth coding unit. The output of the fourth normalization layer is connected to the input of the thirteenth convolutional layer. The output of the thirteenth convolutional layer is connected to the input of the sixteenth convolutional layer. The output of the sixteenth convolutional layer is connected to the input of the fifth reshaping layer. The output of the fifth reshaping layer is connected to the input of the first Fourier transform layer. The output of the first Fourier transform layer is connected to the first input of the first element-wise multiplier.
[0045] The input of the fifth normalization layer is connected to the output of the fourth encoding unit and the first input of the fourth adder, respectively. The output of the fifth normalization layer is connected to the input of the fourteenth convolutional layer and the fifteenth convolutional layer, respectively. The output of the fourteenth convolutional layer is connected to the input of the seventeenth convolutional layer. The output of the seventeenth convolutional layer is connected to the input of the sixth reshaping layer. The output of the sixth reshaping layer is connected to the input of the second Fourier transform layer. The output of the second Fourier transform layer is connected to the second input of the first element-wise multiplier. The output of the first element-wise multiplier is connected to the input of the first inverse Fourier transform layer. The output of the first inverse Fourier transform layer is connected to the input of the eighth reshaping layer. The output of the eighth reshaping layer is connected to the input of the sixth normalization layer. The output of the sixth normalization layer is connected to the first input of the second element-wise multiplier.
[0046] The output of the fifteenth convolutional layer is connected to the input of the eighteenth convolutional layer. The output of the eighteenth convolutional layer is connected to the input of the seventh reshaping layer. The output of the seventh reshaping layer is connected to the second input of the second element-wise multiplier. The output of the second element-wise multiplier is connected to the input of the nineteenth convolutional layer. The output of the nineteenth convolutional layer is connected to the second input of the fourth adder.
[0047] The output of the fourth adder is connected to the input of the seventh normalization layer and the first input of the fifth adder. The output of the seventh normalization layer is connected to the input of the ninth reshaping layer. The output of the ninth reshaping layer is connected to the input of the third Fourier transform layer. The output of the third Fourier transform layer is connected to the first input of the third element-wise multiplier. The second input of the third element-wise multiplier is connected to the learnable parameter matrix. The output of the third element-wise multiplier is connected to the input of the second inverse Fourier transform layer. The output of the second inverse Fourier transform layer is connected to the input of the tenth reshaping layer. The output of the tenth reshaping layer is connected to the inputs of the twentieth and twenty-first convolutional layers. The output of the twentieth convolutional layer is connected to the input of the twenty-second convolutional layer. The output of the twenty-second convolutional layer is connected to the first input of the fourth element-wise multiplier.
[0048] The output of the twenty-first convolutional layer is connected to the input of the twenty-third convolutional layer. The output of the twenty-third convolutional layer is connected to the input of the second GELU activation layer. The output of the second GELU activation layer is connected to the second input of the fourth element-wise multiplier. The output of the fourth element-wise multiplier is connected to the input of the twenty-fourth convolutional layer. The output of the twenty-fourth convolutional layer is connected to the input of the adaptive spatial frequency fusion unit.
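The second cross-attention block of the frequency domain fusion unit replaces the matrix product of queries and keys with an element-wise product of their Fourier transforms. The NumPy sketch below illustrates that idea only. It assumes a 2D FFT over the spatial axes, collapses each pair of consecutive convolutional layers into one 1×1 convolution, omits the reshaping layers, and leaves out the frequency feedforward network with its learnable parameter matrix; all names are hypothetical.

```python
import numpy as np


def channel_norm(x, eps=1e-5):
    # Normalize each spatial position over the channel axis; x has shape (C, H, W).
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def conv1x1(x, weight):
    # 1x1 convolution as a per-position linear map over channels.
    return np.einsum("oc,chw->ohw", weight, x)


def frequency_cross_attention(f4, f8, wq, wk, wv, wo):
    """f4, f8: outputs of the fourth and eighth coding units, shape (C, H, W)."""
    q = conv1x1(channel_norm(f8), wq)  # query path starts at the eighth coding unit
    k = conv1x1(channel_norm(f4), wk)  # key path starts at the fourth coding unit
    v = conv1x1(channel_norm(f4), wv)  # value path shares the key normalization
    # First and second Fourier transform layers, first element-wise multiplier,
    # then the first inverse Fourier transform layer.
    spectrum = np.fft.fft2(q) * np.fft.fft2(k)
    corr = np.fft.ifft2(spectrum).real
    corr = channel_norm(corr)          # sixth normalization layer
    gated = corr * v                   # second element-wise multiplier
    return f4 + conv1x1(gated, wo)     # fourth adder (residual connection)
```

Multiplying the spectra costs O(HW log HW) per channel through the FFT, which is the usual motivation for computing such interactions in the frequency domain instead of forming an HW × HW attention map.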
[0049] Furthermore, the dual-branch decoding module includes:
[0050] First decoding unit, second decoding unit, third decoding unit, fourth decoding unit, fifth decoding unit, sixth decoding unit, seventh decoding unit, eighth decoding unit, height regression unit, contour extraction unit;
[0051] The first input terminal of the first decoding unit and the first input terminal of the second decoding unit are both connected to the output terminal of the adaptive spatial frequency fusion unit, and the second input terminal of the first decoding unit and the second input terminal of the second decoding unit are both connected to the output terminal of the third feature fusion unit.
[0052] The first input terminal of the third decoding unit is connected to the output terminal of the first decoding unit, the first input terminal of the fourth decoding unit is connected to the output terminal of the second decoding unit, and the second input terminals of the third decoding unit and the fourth decoding unit are both connected to the output terminal of the second feature fusion unit.
[0053] The first input terminal of the fifth decoding unit is connected to the output terminal of the third decoding unit, the first input terminal of the sixth decoding unit is connected to the output terminal of the fourth decoding unit, and the second input terminals of the fifth decoding unit and the sixth decoding unit are both connected to the output terminal of the first feature fusion unit.
[0054] The input of the seventh decoding unit is connected to the output of the fifth decoding unit, the input of the eighth decoding unit is connected to the output of the sixth decoding unit, the output of the seventh decoding unit is connected to the input of the height regression unit, the output of the height regression unit is the first output of the dual-branch decoding module, the output of the eighth decoding unit is connected to the input of the contour extraction unit, and the output of the contour extraction unit is the second output of the dual-branch decoding module.
[0055] Furthermore, the first decoding unit has the same structure as the second, third, fourth, fifth, sixth, seventh, and eighth decoding units;
[0056] The first decoding unit includes a bilinear interpolation block, a third convolutional block, and a second Mamba block connected in sequence.
[0057] The input of the bilinear interpolation block is the input of the first decoding unit, and the output of the second Mamba block is the output of the first decoding unit.
[0058] The above-described solution of the present invention has the following beneficial effects:
[0059] This invention trains a collaborative extraction network model using a dataset of building height and contour extraction, resulting in a trained collaborative extraction network model. SAR and optical image data of the target building are then input into the trained collaborative extraction network model for information extraction, yielding the target building's height and contour extraction results. The collaborative extraction network model includes: a dual-branch coding module for feature extraction from SAR and optical image data; a spatial-frequency fusion module for fusing dual-domain features; and a dual-branch decoding module for outputting the building height and contour extraction results. Compared to existing technologies, this invention introduces feature extraction in both the spatial and frequency domains, and the spatial-frequency fusion module promotes the full fusion of complementary features from optical and SAR image data, thereby improving the accuracy of collaborative extraction of building height and contour.
[0060] Other beneficial effects of the present invention will be described in detail in the following detailed description section.
Attached Figure Description
[0061] Figure 1 is a flowchart illustrating an embodiment of the present invention;
[0062] Figure 2 is a schematic diagram of the collaborative extraction network model in an embodiment of the present invention;
[0063] Figure 3 is a schematic diagram of the spatial domain fusion unit in an embodiment of the present invention;
[0064] Figure 4 is a schematic diagram of the frequency domain fusion unit in an embodiment of the present invention.
Detailed Implementation
[0065] To make the technical problems, solutions, and advantages of this invention clearer, a detailed description will be provided below with reference to the accompanying drawings and specific embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.
[0066] In the description of this invention, it should be noted that the terms "center," "upper," "lower," "left," "right," "vertical," "horizontal," "inner," and "outer," etc., indicate the orientation or positional relationship based on the orientation or positional relationship shown in the accompanying drawings. They are used only for the convenience of describing the invention and for simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation. Therefore, they should not be construed as limitations on the invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and should not be construed as indicating or implying relative importance.
[0067] In the description of this invention, it should be noted that, unless otherwise explicitly specified and limited, the terms "installation," "connection," and "linking" should be interpreted broadly. For example, they can refer to a locking connection, a detachable connection, or an integral connection; they can refer to a mechanical connection or an electrical connection; they can refer to a direct connection or an indirect connection through an intermediate medium; and they can refer to the internal connection of two components. Those skilled in the art can understand the specific meaning of the above terms in this invention based on the specific circumstances.
[0068] Furthermore, the technical features involved in the different embodiments of the present invention described below can be combined with each other as long as they do not conflict with each other.
[0069] This invention addresses existing problems by providing a method for the collaborative extraction of building height and outline.
[0070] As shown in Figure 1, an embodiment of the present invention provides a method for collaborative extraction of building height and contour, including:
[0071] Step 1: Obtain the building height and contour extraction dataset. The dataset includes vertical-vertical (VV) polarization band data and vertical-horizontal (VH) polarization band data from the SAR image data of the building, four bands of data from the optical image data of the building, and reference height data and reference contour data of the building;
[0072] Step 2: Train the constructed collaborative extraction network model using the building height and contour extraction dataset to obtain the trained collaborative extraction network model;
[0073] Step 3: Input the SAR image data and optical image data of the target building into the trained collaborative extraction network model to extract information and obtain the height extraction result and the contour extraction result of the target building.
[0074] Specifically, step 1 includes:
[0075] Obtain the building height and contour extraction dataset X from an existing dataset. The dataset includes Sentinel-1 SAR image data I_SAR, Sentinel-2 optical image data I_Opt, and the corresponding building reference height data and reference contour data. Both the reference height data and the reference contour data are raster data at 10 m resolution, segmented into tiles of 256×256 pixels. For the Sentinel-1 SAR image data, the vertical-vertical (VV) and vertical-horizontal (VH) polarization bands are used. For the Sentinel-2 optical image data, bands 2, 3, 4, and 8 are used.
[0076] The above-mentioned building height contour extraction dataset X is divided into training and validation sets in an 8:2 ratio.
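A minimal sketch of an 8:2 split follows. The text does not say whether samples are shuffled before splitting or which random seed is used, so the seeded shuffle here is an assumption, and `split_dataset` is a hypothetical helper name.

```python
import random


def split_dataset(samples, train_fraction=0.8, seed=0):
    """Split samples into training and validation lists (8:2 by default)."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # seeded shuffle (assumed, not from the text)
    cut = int(round(len(items) * train_fraction))
    return items[:cut], items[cut:]
```

For example, `split_dataset(range(10))` yields 8 training samples and 2 validation samples, with every sample appearing in exactly one of the two lists.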
[0077] Specifically, as shown in Figure 2, the collaborative extraction network model constructed in this embodiment of the invention includes: a dual-branch coding module for feature extraction of SAR image data and optical image data, a spatial frequency fusion module for fusing dual-domain features, and a dual-branch decoding module for outputting building height extraction results and building contour extraction results.
[0078] Specifically, step 2 includes:
[0079] The training set is input into the collaborative extraction network model, yielding the building height extraction result $F_h$ and the building contour extraction result $F_f$ for comparison with the i-th reference building height and the i-th reference building contour;
[0080] The loss value between the building height extraction result $F_h$ and the i-th reference building height is calculated using the mean squared error loss function;
[0081] The loss value between the building contour extraction result $F_f$ and the i-th reference building contour is calculated using the binary cross-entropy loss function and the Dice loss function;
[0082] The building height and contour collaborative extraction network model is trained using the AdamW optimizer, the StepLR fixed-step learning rate scheduler, and the above loss functions. The initial learning rate is set to 0.001, the batch size to 16, the number of epochs to 100, the step size of the fixed-step learning rate scheduler to 8, and gamma to 0.95, resulting in the optimized building height and contour collaborative extraction network model.
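The loss terms and the learning rate schedule of step 2 can be sketched in plain Python as follows. The equal weighting of the three loss terms, the Dice smoothing constant, the clamping constant, and all function names are assumptions; the embodiment does not state how the terms are combined. The schedule follows the usual fixed-step rule in which the learning rate is multiplied by gamma once every step_size epochs.

```python
import math

def mse_loss(pred, target):
    """Mean squared error between predicted and reference heights."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def bce_loss(prob, target, eps=1e-7):
    """Binary cross-entropy on per-pixel contour probabilities."""
    total = 0.0
    for p, t in zip(prob, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(prob)

def dice_loss(prob, target, smooth=1.0):
    """One minus the smoothed Dice coefficient (smooth is an assumed constant)."""
    inter = sum(p * t for p, t in zip(prob, target))
    denom = sum(prob) + sum(target)
    return 1.0 - (2.0 * inter + smooth) / (denom + smooth)

def total_loss(height_pred, height_ref, contour_prob, contour_ref):
    """Assumed combination: unweighted sum of the three terms."""
    return (mse_loss(height_pred, height_ref)
            + bce_loss(contour_prob, contour_ref)
            + dice_loss(contour_prob, contour_ref))

def step_lr(epoch, base_lr=0.001, step_size=8, gamma=0.95):
    """Learning rate at a given epoch under a fixed-step decay."""
    return base_lr * gamma ** (epoch // step_size)

schedule = [step_lr(e) for e in range(100)]
```

With the stated settings the learning rate stays at 0.001 for epochs 0 to 7 and has been decayed twelve times by the final epoch.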
[0083] Specifically, the dual-branch coding module includes:
[0084] First coding unit, second coding unit, third coding unit, fourth coding unit, fifth coding unit, sixth coding unit, seventh coding unit, eighth coding unit;
[0085] First downsampling unit, second downsampling unit, third downsampling unit, fourth downsampling unit, fifth downsampling unit, sixth downsampling unit;
[0086] First feature fusion unit, second feature fusion unit, third feature fusion unit;
[0087] The input terminals of the first coding unit and the fifth coding unit are both input terminals of the dual-branch coding module;
[0088] The first output of the first encoding unit and the first output of the fifth encoding unit are both connected to the input of the first feature fusion unit. The output of the first feature fusion unit is connected to the second input of the dual-branch decoding module. The second output of the first encoding unit is connected to the input of the first downsampling unit. The output of the first downsampling unit is connected to the input of the second encoding unit. The second output of the fifth encoding unit is connected to the input of the fourth downsampling unit. The output of the fourth downsampling unit is connected to the input of the sixth encoding unit.
[0089] The first output of the second encoding unit and the first output of the sixth encoding unit are both connected to the input of the second feature fusion unit. The output of the second feature fusion unit is connected to the third input of the dual-branch decoding module. The second output of the second encoding unit is connected to the input of the second downsampling unit. The output of the second downsampling unit is connected to the input of the third encoding unit. The second output of the sixth encoding unit is connected to the input of the fifth downsampling unit. The output of the fifth downsampling unit is connected to the input of the seventh encoding unit.
[0090] The first output of the third encoding unit and the first output of the seventh encoding unit are both connected to the input of the third feature fusion unit. The output of the third feature fusion unit is connected to the fourth input of the dual-branch decoding module. The second output of the third encoding unit is connected to the input of the third downsampling unit. The output of the third downsampling unit is connected to the input of the fourth encoding unit. The second output of the seventh encoding unit is connected to the input of the sixth downsampling unit. The output of the sixth downsampling unit is connected to the input of the eighth encoding unit.
[0091] The outputs of the fourth coding unit and the eighth coding unit are both connected to the input of the spatial frequency fusion module.
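The connections of the dual-branch coding module listed above can be restated as a small dependency graph. This is only a bookkeeping sketch: the short node names (`enc1`, `down1`, `fuse1`, `sf_fusion`, `input_a`, `input_b`) and the helper `upstream` are hypothetical labels introduced here, and no assumption is made about which branch carries which modality.

```python
# Each unit maps to the units that feed it, following the listed connections.
ENCODER_INPUTS = {
    "enc1": ["input_a"], "enc5": ["input_b"],
    "down1": ["enc1"], "enc2": ["down1"],
    "down2": ["enc2"], "enc3": ["down2"],
    "down3": ["enc3"], "enc4": ["down3"],
    "down4": ["enc5"], "enc6": ["down4"],
    "down5": ["enc6"], "enc7": ["down5"],
    "down6": ["enc7"], "enc8": ["down6"],
    "fuse1": ["enc1", "enc5"],      # feeds the second decoder input
    "fuse2": ["enc2", "enc6"],      # feeds the third decoder input
    "fuse3": ["enc3", "enc7"],      # feeds the fourth decoder input
    "sf_fusion": ["enc4", "enc8"],  # spatial frequency fusion module
}

def upstream(unit, graph=ENCODER_INPUTS):
    """Return every node that `unit` depends on, directly or indirectly."""
    seen = []
    stack = list(graph.get(unit, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.append(node)
            stack.extend(graph.get(node, []))
    return seen

# The deepest unit of one branch sits behind three downsampling units.
downs_before_enc4 = [n for n in upstream("enc4") if n.startswith("down")]
```

Tracing `upstream("sf_fusion")` reaches both module inputs, which reflects that the spatial frequency fusion module sees the deepest features of both branches, while each feature fusion unit only sees the two coding units at its own level and their predecessors.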
[0092] Specifically, the first coding unit and the fifth coding unit have the same structure;
[0093] The first coding unit includes a first convolutional block, a first half-instance normalized residual block, a second half-instance normalized residual block, and a third half-instance normalized residual block connected in sequence.
[0094] The input of the first convolutional block is the input of the first coding unit, and the output of the third half-instance normalized residual block is the first output and the second output of the first coding unit.
[0095] In this embodiment of the invention, the first convolutional block in the first coding unit and the fifth coding unit is composed of a convolutional layer with a kernel size of 3, a stride size of 2, and padding of 1; the first half-instance normalized residual block, the second half-instance normalized residual block, and the third half-instance normalized residual block in the first coding unit and the fifth coding unit are all composed of a first convolutional layer, a first LeakyReLU activation function, an instance normalization layer, a second convolutional layer, and a second LeakyReLU activation function connected in sequence, wherein the kernel size of the first convolutional layer and the second convolutional layer are both 3, the stride size is 1, and the padding size is 1;
[0096] The i-th optical image in the optical image data is input into the first convolutional block of the first coding unit for convolution processing to obtain the first optical feature map, and the corresponding SAR image in the SAR image data is input into the first convolutional block of the fifth coding unit for convolution processing to obtain the first SAR image feature map. The optical image has dimensions of 1×4×w×h and the SAR image has dimensions of 1×2×w×h, where w and h are both 256;
[0097] The first optical feature map is processed sequentially by the first, second, and third half-instance normalized residual blocks of the first coding unit to obtain the second optical feature map. The first SAR image feature map is processed sequentially by the first, second, and third half-instance normalized residual blocks of the fifth coding unit to obtain the second SAR image feature map. Both have dimensions of 1×128× [remaining dimensions missing]
[0098]
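The effect of the convolution parameters stated for the first coding unit on the spatial size can be checked with the standard output-size formula for a convolution without dilation. The helper name `conv_out` is hypothetical, and the 256-pixel input is taken from the stated tile size; the resulting sizes are arithmetic under these assumptions rather than values given by the embodiment.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

# First convolutional block: kernel 3, stride 2, padding 1.
after_first_block = conv_out(256, kernel=3, stride=2, padding=1)

# Convolutions inside the residual blocks: kernel 3, stride 1, padding 1.
after_residual_conv = conv_out(after_first_block, kernel=3, stride=1, padding=1)
```

Under this formula a stride of 2 halves the spatial size, while the stride-1 convolutions with padding 1 leave it unchanged.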
[0099] Specifically, the first feature fusion unit has the same structure as the second and third feature fusion units;
[0100] The first feature fusion unit comprises a concatenated block, a second convolutional block, a batch normalization block, and an activation function block connected in sequence. The data processing procedure is as follows:
[0101] The feature maps output by the first and fifth coding units are concatenated along the channel dimension, and the first fused feature F1 is then obtained through a convolutional block, with a size of [size missing].
[0102] Since the second and third feature fusion units have the same structure as the first feature fusion unit, their data processing is also the same. Therefore, to simplify the description, this embodiment directly provides the input and output data of the second and third feature fusion units. The input of the second feature fusion unit is the feature maps output by the second and sixth coding units, and its output is F2, with a size of [size missing]. The input of the third feature fusion unit is the feature maps output by the third and seventh coding units, and its output is F3, with a size of [size missing].
[0103] In this embodiment of the invention, the first downsampling unit includes a convolutional layer and a batch normalization layer. The convolutional layer has a kernel size of 1 and a stride size of 2. Its data processing procedure is as follows:
[0104] The feature map output by the first coding unit is input into the first downsampling unit to obtain a downsampled feature map, with dimensions of [size missing].
[0105] Since the second, third, fourth, fifth, and sixth downsampling units have the same structure as the first downsampling unit, their data processing is also the same. Therefore, to simplify the description, this embodiment of the invention directly provides the input and output data of these units. The input of the second downsampling unit is the feature map output by the second coding unit; the input of the third downsampling unit is the feature map output by the third coding unit; the input of the fourth downsampling unit is the feature map output by the fifth coding unit; the input of the fifth downsampling unit is the feature map output by the sixth coding unit; and the input of the sixth downsampling unit is the feature map output by the seventh coding unit. The output of each unit is the correspondingly downsampled feature map, with dimensions of [size missing].
[0106] Specifically, the second coding unit has the same structure as the third, fourth, sixth, seventh, and eighth coding units;
[0107] The second coding unit includes a patch embedding block, a first Mamba block for feature learning, a patch inverse embedding block, and a third convolutional block connected in sequence.
[0108] The input of the patch embedding block is the input of the second coding unit, and the output of the third convolutional block is the output of the second coding unit.
[0109] In this embodiment of the invention, the patch embedding block consists of a convolutional layer with a kernel size of 1 and a stride of 2, a tensor flattening layer, and a normalization layer connected in sequence; the first Mamba block consists of a Mamba layer, a normalization layer, and a ReLU activation function connected in sequence; the patch inverse embedding block consists of a transpose layer; and the third convolutional block consists of a convolutional layer with a kernel size of 1 and a stride of 1, a normalization layer, and a ReLU activation function connected in sequence.
[0110] In this embodiment of the invention, the principle of the second encoding unit is as follows:
[0111] The second SAR image feature map is input into the patch embedding block to obtain the third SAR image feature map, with dimensions of [size missing].
[0112] The third SAR image feature map is then input into the Mamba block to obtain the fourth SAR image feature map, with dimensions of [size missing].
[0113] Next, the fourth SAR image feature map is input into the patch inverse embedding block to obtain the fifth SAR image feature map, with dimensions of [size missing].
[0114] Finally, the fifth SAR image feature map is input into the convolutional block to obtain the sixth SAR image feature map, with dimensions of [size missing].
[0115] Similarly, the working principle of the sixth coding unit is:
[0116] The second optical feature map is input into the patch embedding block to obtain the third optical feature map, with dimensions of [size missing].
[0117] The third optical feature map is then input into the Mamba block to obtain the fourth optical feature map, with dimensions of [size missing].
[0118] Next, the fourth optical feature map is input into the patch inverse embedding block to obtain the fifth optical feature map, with dimensions of [size missing].
[0119] Finally, the fifth optical feature map is input into the convolutional block to obtain the sixth optical feature map, with dimensions of [size missing].
[0120] Similarly, since the second coding unit has the same structure as the third, fourth, sixth, seventh, and eighth coding units, their data processing is also the same. Therefore, to simplify the description, this embodiment of the invention directly provides the input and output data of the third, fourth, seventh, and eighth coding units. The input of the third coding unit is the output of the second downsampling unit; the input of the fourth coding unit is the output of the third downsampling unit; the input of the seventh coding unit is the output of the fifth downsampling unit; and the input of the eighth coding unit is the output of the sixth downsampling unit. The output of each unit is a feature map with a size of [size missing].
[0121] Specifically, the spatial frequency fusion module includes:
[0122] Spatial domain fusion unit, frequency domain fusion unit, adaptive spatial frequency fusion unit;
[0123] The input of the spatial domain fusion unit is connected to the output of the fourth coding unit and the output of the eighth coding unit, respectively, and the output of the spatial domain fusion unit is connected to the input of the adaptive spatial frequency fusion unit.
[0124] The input of the frequency domain fusion unit is connected to the output of the fourth coding unit and the output of the eighth coding unit, respectively, and the output of the frequency domain fusion unit is connected to the input of the adaptive spatial frequency fusion unit.
[0125] The output of the adaptive spatial frequency fusion unit is connected to the first input of the dual-branch decoding module.
[0126] Specifically, as shown in Figure 3, the spatial domain fusion unit includes:
[0127] The first cross-attention block consists of a first normalization layer, a second normalization layer, a first convolutional layer, a second convolutional layer, a third convolutional layer, a fourth convolutional layer, a fifth convolutional layer, a sixth convolutional layer, a seventh convolutional layer, a first reshaping layer, a second reshaping layer, a third reshaping layer, a fourth reshaping layer, a first matrix multiplier, a second matrix multiplier, and a first adder.
[0128] A spatial feedforward network consisting of a third normalization layer, an eighth convolutional layer, a ninth convolutional layer, a tenth convolutional layer, an eleventh convolutional layer, a twelfth convolutional layer, a first GELU activation layer, a second adder, and a third adder;
[0129] The input of the first normalization layer is connected to the output of the fourth coding unit, the output of the first normalization layer is connected to the input of the first convolutional layer, the output of the first convolutional layer is connected to the input of the fourth convolutional layer, the output of the fourth convolutional layer is connected to the input of the first reshaping layer, and the output of the first reshaping layer is connected to the first input of the first matrix multiplier.
[0130] The input of the second normalization layer is connected to the output of the eighth coding unit and the first input of the first adder, respectively. The output of the second normalization layer is connected to the input of the second convolutional layer and the input of the third convolutional layer, respectively. The output of the second convolutional layer is connected to the input of the fifth convolutional layer. The output of the fifth convolutional layer is connected to the input of the second reshaping layer. The output of the second reshaping layer is connected to the second input of the first matrix multiplier. The output of the first matrix multiplier is connected to the first input of the second matrix multiplier.
[0131] The output of the third convolutional layer is connected to the input of the sixth convolutional layer. The output of the sixth convolutional layer is connected to the input of the third reshaping layer. The output of the third reshaping layer is connected to the second input of the second matrix multiplier. The output of the second matrix multiplier is connected to the input of the fourth reshaping layer. The output of the fourth reshaping layer is connected to the input of the seventh convolutional layer. The output of the seventh convolutional layer is connected to the second input of the first adder. The first output of the first adder is connected to the first input of the third adder. The second output of the first adder is connected to the input of the third normalization layer.
[0132] The output of the third normalization layer is connected to the input of the eighth and ninth convolutional layers, respectively. The output of the eighth convolutional layer is connected to the input of the tenth convolutional layer, and the output of the tenth convolutional layer is connected to the first input of the second adder.
[0133] The output of the ninth convolutional layer is connected to the input of the eleventh convolutional layer. The output of the eleventh convolutional layer is connected to the input of the first GELU activation layer. The output of the first GELU activation layer is connected to the second input of the second adder. The output of the second adder is connected to the input of the twelfth convolutional layer. The output of the twelfth convolutional layer is connected to the second input of the third adder. The output of the third adder is connected to the input of the adaptive spatial frequency fusion unit.
[0134] In this embodiment of the invention, the feature map output by the fourth coding unit serves as the first input to the first cross-attention block and is used to generate the query sequence Q, while the feature map output by the eighth coding unit serves as the second input and is used to generate the key sequence K and the value sequence V. Specifically, the feature map output by the fourth coding unit is normalized and convolved by the first normalization layer, the first convolutional layer, and the fourth convolutional layer to obtain the query sequence Q. The feature map output by the eighth coding unit is normalized and convolved by the second normalization layer, the second convolutional layer, and the fifth convolutional layer to obtain the key sequence K, and by the second normalization layer, the third convolutional layer, and the sixth convolutional layer to obtain the value sequence V. The query sequence Q, the key sequence K, and the value sequence V are then input in parallel into the first, second, and third reshaping layers, respectively, which flatten them into non-overlapping M×M blocks; the total number of patches is [value missing]. The patch reshaping operation is then applied to obtain the reshaped result. The cross-attention block is calculated as follows:
[0135]
[0136] where $Att_{Spa}$ denotes the cross-attention result, Reshape denotes the patch reshaping operation, Softmax(·) denotes the activation function, $\tau \in \mathbb{R}^{M \times M \times 1 \times 1}$ denotes the learnable temperature matrix, and B denotes the relative position bias.
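Cross-attention between two modalities with a temperature and an additive position bias can be illustrated with the following minimal sketch. It assumes the common form Softmax(QKᵀ/τ + B)·V, uses a scalar temperature instead of a temperature matrix, and omits the patch reshaping; these choices, together with the function names, are assumptions and may differ from the formula of this embodiment.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(q, k, v, tau=1.0, bias=None):
    """q: n x d queries from one branch; k, v: m x d keys/values from the other."""
    out, weights = [], []
    for i, qi in enumerate(q):
        scores = []
        for j, kj in enumerate(k):
            s = sum(a * b for a, b in zip(qi, kj)) / tau  # scaled similarity
            if bias is not None:
                s += bias[i][j]                           # additive position bias
            scores.append(s)
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors of the other branch.
        out.append([sum(w[j] * v[j][c] for j in range(len(v)))
                    for c in range(len(v[0]))])
    return out, weights

q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[10.0, 0.0], [0.0, 10.0]]
out, w = cross_attention(q, k, v, tau=0.5)
```

Each query attends most strongly to the key it is most similar to, so the output for a query is dominated by the value vector paired with that key; a smaller temperature sharpens this preference.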
[0137] For the spatial feedforward network, the cross-attention result passes through the third normalization layer, its feature channels are doubled, and it is divided into two parallel branches. After the two branches are convolved, they are added together. The sum is then convolved by the twelfth convolutional layer and added to the cross-attention result to obtain the spatial domain fused feature map $F_{spa}$, with a size of [size missing].
[0138] Specifically, as shown in Figure 4, the frequency domain fusion unit includes:
[0139] The second cross-attention block consists of the fourth normalization layer, the fifth normalization layer, the sixth normalization layer, the seventh normalization layer, the thirteenth convolutional layer, the fourteenth convolutional layer, the fifteenth convolutional layer, the sixteenth convolutional layer, the seventeenth convolutional layer, the eighteenth convolutional layer, the nineteenth convolutional layer, the fifth reshaping layer, the sixth reshaping layer, the seventh reshaping layer, the eighth reshaping layer, the first Fourier transform layer, the second Fourier transform layer, the first inverse Fourier transform layer, the first element-wise multiplier, the second element-wise multiplier, and the fourth adder.
[0140] The feedforward network consists of the eighth normalization layer, the ninth reshaping layer, the tenth reshaping layer, the third Fourier transform layer, the second inverse Fourier transform layer, the twentieth convolutional layer, the twenty-first convolutional layer, the twenty-second convolutional layer, the twenty-third convolutional layer, the twenty-fourth convolutional layer, the second GELU activation layer, the third element-wise multiplier, the fourth element-wise multiplier, and the fifth adder.
[0141] The input of the fourth normalization layer is connected to the output of the eighth coding unit, the output of the fourth normalization layer is connected to the input of the thirteenth convolutional layer, the output of the thirteenth convolutional layer is connected to the input of the sixteenth convolutional layer, the output of the sixteenth convolutional layer is connected to the input of the fifth reshaping layer, the output of the fifth reshaping layer is connected to the input of the first Fourier transform layer, and the output of the first Fourier transform layer is connected to the first input of the first element-wise multiplier.
[0142] The input of the fifth normalization layer is connected to the output of the fourth encoding unit and the first input of the fourth adder, respectively. The output of the fifth normalization layer is connected to the input of the fourteenth convolutional layer and the fifteenth convolutional layer, respectively. The output of the fourteenth convolutional layer is connected to the input of the seventeenth convolutional layer. The output of the seventeenth convolutional layer is connected to the input of the sixth reshaping layer. The output of the sixth reshaping layer is connected to the input of the second Fourier transform layer. The output of the second Fourier transform layer is connected to the second input of the first element-wise multiplier. The output of the first element-wise multiplier is connected to the input of the first inverse Fourier transform layer. The output of the first inverse Fourier transform layer is connected to the input of the eighth reshaping layer. The output of the eighth reshaping layer is connected to the input of the sixth normalization layer. The output of the sixth normalization layer is connected to the first input of the second element-wise multiplier.
[0143] The output of the fifteenth convolutional layer is connected to the input of the eighteenth convolutional layer, the output of the eighteenth convolutional layer is connected to the input of the seventh reshaping layer, the output of the seventh reshaping layer is connected to the second input of the second element-wise multiplier, the output of the second element-wise multiplier is connected to the input of the nineteenth convolutional layer, and the output of the nineteenth convolutional layer is connected to the second input of the fourth adder.
[0144] The output of the fourth adder is connected to the input of the seventh normalization layer and the first input of the fifth adder. The output of the seventh normalization layer is connected to the input of the ninth reshaping layer. The output of the ninth reshaping layer is connected to the input of the third Fourier transform layer. The output of the third Fourier transform layer is connected to the first input of the third element-wise multiplier. The second input of the third element-wise multiplier is connected to the learnable parameter matrix. The output of the third element-wise multiplier is connected to the input of the second inverse Fourier transform layer. The output of the second inverse Fourier transform layer is connected to the input of the tenth reshaping layer. The output of the tenth reshaping layer is connected to the inputs of the twentieth and twenty-first convolutional layers. The output of the twentieth convolutional layer is connected to the input of the twenty-second convolutional layer. The output of the twenty-second convolutional layer is connected to the first input of the fourth element-wise multiplier.
[0145] The output of the twenty-first convolutional layer is connected to the input of the twenty-third convolutional layer. The output of the twenty-third convolutional layer is connected to the input of the second GELU activation layer. The output of the second GELU activation layer is connected to the second input of the fourth element-wise multiplier. The output of the fourth element-wise multiplier is connected to the input of the twenty-fourth convolutional layer. The output of the twenty-fourth convolutional layer is connected to the input of the adaptive spatial frequency fusion unit.
[0146] In this embodiment of the invention, the second cross-attention block likewise uses the feature map output by the fourth coding unit and the feature map output by the eighth coding unit to generate the query sequence Q, the key sequence K, and the value sequence V. To fully exploit the potential of frequency domain features in the attention interaction, Fourier transform layers convert the reshaped feature maps to the frequency domain, the result passes through two element-wise multipliers, and the first inverse Fourier transform layer converts it back to a spatial domain feature map. The calculation formula is as follows:
[0147]
[0148] where $Att_{Fre}$ denotes the cross-attention result, [symbol missing] denotes the Fourier transform operation, and [symbol missing] denotes the inverse Fourier transform operation;
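The path "Fourier transform, element-wise multiplication, inverse Fourier transform, normalization, element-wise multiplication with V" can be illustrated on one-dimensional toy sequences. The naive discrete Fourier transform, the zero-mean unit-variance normalization, and all function names are assumptions chosen to keep the sketch dependency-free; they are not stated by the embodiment. By the convolution theorem, the element-wise product in the frequency domain equals a circular convolution of the two sequences, which the sketch verifies.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of a sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]

def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[f] * cmath.exp(2j * cmath.pi * f * t / n) for f in range(n)) / n
            for t in range(n)]

def frequency_interaction(q, k):
    """Multiply the spectra of q and k element-wise and return to the signal domain."""
    prod = [a * b for a, b in zip(dft(q), dft(k))]
    return [val.real for val in idft(prod)]

def circular_convolution(q, k):
    """Direct circular convolution, used only to check the result above."""
    n = len(q)
    return [sum(q[j] * k[(t - j) % n] for j in range(n)) for t in range(n)]

def normalize(x, eps=1e-5):
    """Assumed normalization: zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((val - mean) ** 2 for val in x) / len(x)
    return [(val - mean) / (var + eps) ** 0.5 for val in x]

def frequency_cross_attention(q, k, v):
    """Normalize the frequency-domain interaction and gate V element-wise."""
    mixed = normalize(frequency_interaction(q, k))
    return [m * val for m, val in zip(mixed, v)]

q = [1.0, 2.0, 3.0, 4.0]
k = [0.0, 1.0, 0.5, 0.0]
v = [1.0, 1.0, 2.0, 2.0]
att = frequency_cross_attention(q, k, v)
```

Because the interaction is a global circular convolution, every output position depends on all positions of both inputs, without forming an explicit attention matrix.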
[0149] To avoid excessive noise in the frequency domain information extracted through cross-attention blocks, this embodiment of the invention also adds Fourier transform and inverse Fourier transform to the feedforward neural network, and introduces a learnable parameter matrix to update the weights to adapt to the frequency domain features. The calculation process is as follows:
[0150]
[0151] $FFN_{Fre}(f') = \mathrm{Gating}(f' W_{f'}) \cdot (f' W_{f'})$
[0152] where f' is the result of applying the fast Fourier transform to f and multiplying it by the learnable parameter matrix. After the inverse fast Fourier transform, the frequency domain information is added to the aforementioned cross-attention result to obtain the frequency domain fused feature map $F_{fre}$, with dimensions of 1×1024× [remaining dimensions missing]
[0153] Specifically, the adaptive spatial frequency fusion unit consists of a series of convolutional layers, a swapped fusion block, and another convolutional layer. It fuses the spatial domain fused feature map $F_{spa}$ with the frequency domain fused feature map $F_{fre}$ to obtain the spatial and frequency domain feature fusion result F4, with a size of [size missing].
[0154] Specifically, the dual-branch decoding module includes:
[0155] First decoding unit, second decoding unit, third decoding unit, fourth decoding unit, fifth decoding unit, sixth decoding unit, seventh decoding unit, eighth decoding unit, height regression unit, contour extraction unit;
[0156] The first input terminal of the first decoding unit and the first input terminal of the second decoding unit are both connected to the output terminal of the adaptive spatial frequency fusion unit, and the second input terminal of the first decoding unit and the second input terminal of the second decoding unit are both connected to the output terminal of the third feature fusion unit.
[0157] The first input terminal of the third decoding unit is connected to the output terminal of the first decoding unit, the first input terminal of the fourth decoding unit is connected to the output terminal of the second decoding unit, and the second input terminals of the third decoding unit and the fourth decoding unit are both connected to the output terminal of the second feature fusion unit.
[0158] The first input terminal of the fifth decoding unit is connected to the output terminal of the third decoding unit, the first input terminal of the sixth decoding unit is connected to the output terminal of the fourth decoding unit, and the second input terminals of the fifth decoding unit and the sixth decoding unit are both connected to the output terminal of the first feature fusion unit.
[0159] The input of the seventh decoding unit is connected to the output of the fifth decoding unit, the input of the eighth decoding unit is connected to the output of the sixth decoding unit, the output of the seventh decoding unit is connected to the input of the height regression unit, the output of the height regression unit is the first output of the dual-branch decoding module, the output of the eighth decoding unit is connected to the input of the contour extraction unit, and the output of the contour extraction unit is the second output of the dual-branch decoding module.
[0160] Specifically, the first decoding unit has the same structure as the second, third, fourth, fifth, sixth, seventh, and eighth decoding units;
[0161] The first decoding unit includes a bilinear interpolation block, a third convolutional block, and a second Mamba block connected in sequence.
[0162] The input of the bilinear interpolation block is the input of the first decoding unit, and the output of the second Mamba block is the output of the first decoding unit.
[0163] In this embodiment of the invention, the spatial frequency fusion feature map F4 is input into the first decoding unit and the second decoding unit, respectively. It first passes through a bilinear interpolation layer to obtain an upsampled feature map, which is then concatenated with the fused feature F3 to obtain feature map F'4, with a size of [missing information]. The concatenated feature map is input into the second Mamba block to obtain feature map $F'_{4-1}$, with a size of [size missing].
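The upsample-then-concatenate step of a decoding unit can be sketched as follows. The align-corners sampling convention, the helper names `bilinear_resize` and `decode_step`, and the choice to take the target size from the skip feature are assumptions; the convolutional block and the Mamba block that follow the concatenation are omitted.

```python
def bilinear_resize(img, out_h, out_w):
    """Bilinear interpolation of a 2-D list (corner pixels are kept aligned)."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for i in range(out_h):
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        dy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            dx = x - x0
            top = img[y0][x0] * (1 - dx) + img[y0][x1] * dx
            bottom = img[y1][x0] * (1 - dx) + img[y1][x1] * dx
            row.append(top * (1 - dy) + bottom * dy)
        out.append(row)
    return out

def decode_step(deep, skip):
    """Upsample each deep channel to the skip resolution, then concatenate channels."""
    h, w = len(skip[0]), len(skip[0][0])
    upsampled = [bilinear_resize(channel, h, w) for channel in deep]
    return upsampled + skip  # channel-wise concatenation

deep = [[[0.0, 1.0], [2.0, 3.0]]]                           # 1 channel, 2 x 2
skip = [[[1.0] * 3 for _ in range(3)] for _ in range(2)]    # 2 channels, 3 x 3
merged = decode_step(deep, skip)
```

After this step the merged tensor has the channels of both inputs at the resolution of the skip feature, which is what the subsequent blocks of the decoding unit would consume.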
[0164] Since the first decoding unit in this embodiment of the invention has the same structure as the second, third, fourth, fifth, sixth, seventh, and eighth decoding units, their data processing is also the same. Therefore, to shorten the specification, this embodiment of the invention directly provides the input and output data of the third, fourth, fifth, sixth, seventh, and eighth decoding units. The feature map $F'_{4-1}$ is input into the third and fourth decoding units, respectively; after bilinear interpolation, it is concatenated with feature map F2 and then input into the second Mamba block to obtain feature map $F'_{3-1}$, with a size of [size missing]. The feature map $F'_{3-1}$ is input into the fifth and sixth decoding units; after bilinear interpolation, it is concatenated with feature map F1 and then input into the second Mamba block to obtain feature map $F'_{2-1}$, with a size of [size missing]. The feature map $F'_{2-1}$ is input into the seventh and eighth decoding units; after bilinear interpolation, it is input into the second Mamba block to obtain feature map F'1, with a size of 1×16×w×h.
[0165] In this embodiment of the invention, the height regression unit and the contour extraction unit each use a convolutional layer with a kernel size of 1 to obtain the height extraction result and the contour extraction result of the target building, respectively.
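A convolution with a kernel size of 1 is a per-pixel linear combination of the input channels, which the following sketch makes explicit for the two output heads. The weights, the sigmoid, and the 0.5 threshold used to binarize the contour map are assumptions for illustration; the embodiment only states that each head uses a kernel-size-1 convolution.

```python
import math

def conv1x1(features, weights, bias=0.0):
    """features: C x H x W nested lists; weights: one value per channel."""
    h, w = len(features[0]), len(features[0][0])
    return [[sum(weights[c] * features[c][i][j] for c in range(len(weights))) + bias
             for j in range(w)]
            for i in range(h)]

def contour_mask(logits, threshold=0.5):
    """Assumed post-processing: sigmoid followed by thresholding."""
    return [[1 if 1.0 / (1.0 + math.exp(-v)) > threshold else 0 for v in row]
            for row in logits]

# Two hypothetical feature channels on a 2 x 2 grid.
features = [[[1.0, 2.0], [3.0, 4.0]],
            [[0.5, 0.5], [0.5, 0.5]]]
height_map = conv1x1(features, weights=[2.0, 4.0])              # height head
mask = contour_mask(conv1x1(features, weights=[1.0, 0.0], bias=-2.5))  # contour head
```

Both heads read the same kind of decoder feature map but apply separate weights, so the height head produces a continuous value per pixel while the contour head produces a per-pixel score.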
[0166] This invention trains a collaborative extraction network model using a dataset of building height and contour extraction, resulting in a trained collaborative extraction network model. SAR and optical image data of the target building are then input into the trained collaborative extraction network model for information extraction, yielding the target building's height and contour extraction results. The collaborative extraction network model includes: a dual-branch coding module for feature extraction from SAR and optical image data; a spatial-frequency fusion module for fusing dual-domain features; and a dual-branch decoding module for outputting the building height and contour extraction results. Compared to existing technologies, this invention introduces feature extraction in both the spatial and frequency domains, and the spatial-frequency fusion module promotes the full fusion of complementary features from optical and SAR image data, thereby improving the accuracy of collaborative extraction of building height and contour.
[0167] The above description represents the preferred embodiments of the present invention. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principles of the present invention, and these improvements and modifications should also be considered within the scope of protection of the present invention.
Claims
1. A method for collaborative extraction of building height and contour, characterized in that it comprises: Step 1: Obtain a building height and contour extraction dataset, which includes vertical polarization band data and vertical-horizontal polarization band data from the SAR image data of the building, four-band data from the optical image data of the building, and reference height data and reference contour data of the building; Step 2: Use the building height and contour extraction dataset to train the constructed collaborative extraction network model to obtain the trained collaborative extraction network model; Step 3: Input the SAR image data and optical image data of the target building into the trained collaborative extraction network model to extract information, and obtain the height extraction result and the contour extraction result of the target building. The collaborative extraction network model includes: a dual-branch encoding module for feature extraction from SAR image data and optical image data, a spatial frequency fusion module for fusing dual-domain features, and a dual-branch decoding module for outputting building height extraction results and building contour extraction results; The dual-branch encoding module includes: a first encoding unit, a second encoding unit, a third encoding unit, a fourth encoding unit, a fifth encoding unit, a sixth encoding unit, a seventh encoding unit, and an eighth encoding unit; a first downsampling unit, a second downsampling unit, a third downsampling unit, a fourth downsampling unit, a fifth downsampling unit, and a sixth downsampling unit; a first feature fusion unit, a second feature fusion unit, and a third feature fusion unit; The input terminals of the first encoding unit and the fifth encoding unit are both input terminals of the dual-branch encoding module; The first output terminal of the first encoding unit and the first output terminal of the fifth encoding unit are both connected to the input terminal of the first feature fusion unit. 
The output terminal of the first feature fusion unit is connected to the second input terminal of the dual-branch decoding module. The second output terminal of the first encoding unit is connected to the input terminal of the first downsampling unit. The output terminal of the first downsampling unit is connected to the input terminal of the second encoding unit. The second output terminal of the fifth encoding unit is connected to the input terminal of the fourth downsampling unit. The output terminal of the fourth downsampling unit is connected to the input terminal of the sixth encoding unit. The first output terminal of the second encoding unit and the first output terminal of the sixth encoding unit are both connected to the input terminal of the second feature fusion unit. The output terminal of the second feature fusion unit is connected to the third input terminal of the dual-branch decoding module. The second output terminal of the second encoding unit is connected to the input terminal of the second downsampling unit. The output terminal of the second downsampling unit is connected to the input terminal of the third encoding unit. The second output terminal of the sixth encoding unit is connected to the input terminal of the fifth downsampling unit. The output terminal of the fifth downsampling unit is connected to the input terminal of the seventh encoding unit. The first output terminal of the third encoding unit and the first output terminal of the seventh encoding unit are both connected to the input terminal of the third feature fusion unit. The output terminal of the third feature fusion unit is connected to the fourth input terminal of the dual-branch decoding module. The second output terminal of the third encoding unit is connected to the input terminal of the third downsampling unit. The output terminal of the third downsampling unit is connected to the input terminal of the fourth encoding unit. 
The second output terminal of the seventh encoding unit is connected to the input terminal of the sixth downsampling unit. The output terminal of the sixth downsampling unit is connected to the input terminal of the eighth encoding unit. The output terminals of the fourth encoding unit and the eighth encoding unit are both connected to the input terminal of the spatial frequency fusion module. The spatial frequency fusion module includes: a spatial domain fusion unit, a frequency domain fusion unit, and an adaptive spatial frequency fusion unit; The input terminal of the spatial domain fusion unit is connected to the output terminal of the fourth encoding unit and the output terminal of the eighth encoding unit, respectively, and the output terminal of the spatial domain fusion unit is connected to the input terminal of the adaptive spatial frequency fusion unit. The input terminal of the frequency domain fusion unit is connected to the output terminal of the fourth encoding unit and the output terminal of the eighth encoding unit, respectively, and the output terminal of the frequency domain fusion unit is connected to the input terminal of the adaptive spatial frequency fusion unit. The output terminal of the adaptive spatial frequency fusion unit is connected to the first input terminal of the dual-branch decoding module.
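The encoder connections recited in claim 1 can be checked with a short symbolic trace. This is a sketch under stated assumptions: strings stand in for feature maps, every label (`enc1`, `down1`, `fuse1`, `spatial_fusion`, `frequency_fusion`, `adaptive_fusion`) is a hypothetical name, and the two inputs are called A and B because the claim does not state which branch receives the SAR data and which the optical data.

```python
# Symbolic trace of the dual-branch encoder wiring. All labels are hypothetical.

def trace_encoder(branch_a_input, branch_b_input):
    """Return the bottleneck expression and the three skip expressions."""
    skips = []
    a = f"enc1({branch_a_input})"  # first encoding unit
    b = f"enc5({branch_b_input})"  # fifth encoding unit
    for stage in (1, 2, 3):
        # First outputs of both branches feed the feature fusion unit of this stage.
        skips.append(f"fuse{stage}({a}, {b})")
        # Second outputs are downsampled and passed to the next encoding units.
        a = f"enc{stage + 1}(down{stage}({a}))"
        b = f"enc{stage + 5}(down{stage + 3}({b}))"
    spatial = f"spatial_fusion({a}, {b})"
    frequency = f"frequency_fusion({a}, {b})"
    bottleneck = f"adaptive_fusion({spatial}, {frequency})"
    return bottleneck, skips


bottleneck, skips = trace_encoder("A", "B")
print("decoder input 1 <-", bottleneck)
for index, expression in enumerate(skips, start=2):
    print(f"decoder input {index} <-", expression)
```

The trace yields four decoder inputs: one from the adaptive spatial frequency fusion unit and one from each of the three feature fusion units, in the order given by the claim.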
2. The method for collaborative extraction of building height and contour according to claim 1, characterized in that the first encoding unit has the same structure as the fifth encoding unit; the first encoding unit includes a first convolutional block, a first half-instance normalized residual block, a second half-instance normalized residual block, and a third half-instance normalized residual block connected in sequence; the input of the first convolutional block is the input of the first encoding unit, and the output of the third half-instance normalized residual block serves as both the first output and the second output of the first encoding unit.
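The claim names the half-instance normalized residual block without describing its internals. The sketch below assumes the commonly used half-instance-normalization design, in which instance normalization is applied to the first half of the channels while the second half passes through unchanged. The function names and the `eps` value are hypothetical, and the convolutions and the residual addition of the full block are omitted.

```python
def instance_norm(channel, eps=1e-5):
    """Normalize one H x W channel to zero mean and (approximately) unit variance."""
    values = [v for row in channel for v in row]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    scale = (variance + eps) ** -0.5
    return [[(v - mean) * scale for v in row] for row in channel]


def half_instance_norm(feature_map):
    """Normalize the first half of the channels and keep the second half as is."""
    half = len(feature_map) // 2
    normalized = [instance_norm(channel) for channel in feature_map[:half]]
    untouched = [[list(row) for row in channel] for channel in feature_map[half:]]
    return normalized + untouched


# Toy 2-channel, 2x2 feature map.
hin_input = [
    [[1.0, 3.0], [5.0, 7.0]],
    [[2.0, 4.0], [6.0, 8.0]],
]
hin_output = half_instance_norm(hin_input)
```

Under this assumed design, half of the features keep their original statistics, which is the usual motivation for normalizing only part of the channels.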
3. The method for collaborative extraction of building height and contour according to claim 2, characterized in that the first feature fusion unit has the same structure as the second feature fusion unit and the third feature fusion unit; the first feature fusion unit includes a concatenation block, a second convolutional block, a batch normalization block, and an activation function block connected in sequence.
4. The method for collaborative extraction of building height and contour according to claim 3, characterized in that the second encoding unit has the same structure as the third encoding unit, the fourth encoding unit, the sixth encoding unit, the seventh encoding unit, and the eighth encoding unit; the second encoding unit includes a patch embedding block, a first Mamba block for feature learning, a patch inverse embedding block, and a third convolutional block connected in sequence; the input of the patch embedding block is the input of the second encoding unit, and the output of the third convolutional block is the output of the second encoding unit.
5. The method for collaborative extraction of building height and contour according to claim 4, characterized in that the spatial domain fusion unit includes: a first cross-attention block consisting of a first normalization layer, a second normalization layer, a first convolutional layer, a second convolutional layer, a third convolutional layer, a fourth convolutional layer, a fifth convolutional layer, a sixth convolutional layer, a seventh convolutional layer, a first reshaping layer, a second reshaping layer, a third reshaping layer, a fourth reshaping layer, a first matrix multiplier, a second matrix multiplier, and a first adder; and a spatial feedforward network consisting of a third normalization layer, an eighth convolutional layer, a ninth convolutional layer, a tenth convolutional layer, an eleventh convolutional layer, a twelfth convolutional layer, a first GELU activation layer, a second adder, and a third adder; The input of the first normalization layer is connected to the output of the fourth encoding unit, the output of the first normalization layer is connected to the input of the first convolutional layer, the output of the first convolutional layer is connected to the input of the fourth convolutional layer, the output of the fourth convolutional layer is connected to the input of the first reshaping layer, and the output of the first reshaping layer is connected to the first input of the first matrix multiplier. The output of the eighth encoding unit is connected to the input of the second normalization layer and to the first input of the first adder, respectively. The output of the second normalization layer is connected to the input of the second convolutional layer and the input of the third convolutional layer, respectively. The output of the second convolutional layer is connected to the input of the fifth convolutional layer. The output of the fifth convolutional layer is connected to the input of the second reshaping layer. 
The output of the second reshaping layer is connected to the second input of the first matrix multiplier. The output of the first matrix multiplier is connected to the first input of the second matrix multiplier. The output of the third convolutional layer is connected to the input of the sixth convolutional layer. The output of the sixth convolutional layer is connected to the input of the third reshaping layer. The output of the third reshaping layer is connected to the second input of the second matrix multiplier. The output of the second matrix multiplier is connected to the input of the fourth reshaping layer. The output of the fourth reshaping layer is connected to the input of the seventh convolutional layer. The output of the seventh convolutional layer is connected to the second input of the first adder. The first output of the first adder is connected to the first input of the third adder. The second output of the first adder is connected to the input of the third normalization layer. The output of the third normalization layer is connected to the input of the eighth convolutional layer and the input of the ninth convolutional layer, respectively. The output of the eighth convolutional layer is connected to the input of the tenth convolutional layer, and the output of the tenth convolutional layer is connected to the first input of the second adder. The output of the ninth convolutional layer is connected to the input of the eleventh convolutional layer, the output of the eleventh convolutional layer is connected to the input of the first GELU activation layer, the output of the first GELU activation layer is connected to the second input of the second adder, the output of the second adder is connected to the input of the twelfth convolutional layer, the output of the twelfth convolutional layer is connected to the second input of the third adder, and the output of the third adder is connected to the input of the adaptive spatial frequency fusion unit.
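The first cross-attention block recited above takes its query from one branch and its key and value from the other, combines them with two matrix multiplications, and adds the result to the un-normalized features of the key/value branch. The following is a minimal sketch of that pattern on token lists. Several simplifications are assumptions: the normalization and convolutional projection layers are replaced by the identity, and a row-wise softmax is inserted between the two matrix multiplications even though the claim does not name one. All function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply an n x k matrix by a k x m matrix (lists of lists)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m):
    return [list(row) for row in zip(*m)]


def softmax_rows(m):
    result = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


def cross_attention(query_tokens, context_tokens):
    """query_tokens, context_tokens: N tokens x C channels (reshaped feature maps)."""
    scores = matmul(query_tokens, transpose(context_tokens))  # first matrix multiplier
    weights = softmax_rows(scores)                            # assumed, not in the claim
    attended = matmul(weights, context_tokens)                # second matrix multiplier
    # First adder: residual connection with the context-branch features.
    return [
        [x + y for x, y in zip(att_row, ctx_row)]
        for att_row, ctx_row in zip(attended, context_tokens)
    ]


query_tokens = [[1.0, 0.0], [0.0, 1.0]]
context_tokens = [[1.0, 2.0], [3.0, 4.0]]
fused_tokens = cross_attention(query_tokens, context_tokens)
```

Each output token is the corresponding context token plus a weighted mixture of all context tokens, where the weights are driven by the other modality's features.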
6. The method for collaborative extraction of building height and contour according to claim 5, characterized in that the frequency domain fusion unit includes: a second cross-attention block consisting of a fourth normalization layer, a fifth normalization layer, a sixth normalization layer, a seventh normalization layer, a thirteenth convolutional layer, a fourteenth convolutional layer, a fifteenth convolutional layer, a sixteenth convolutional layer, a seventeenth convolutional layer, an eighteenth convolutional layer, a nineteenth convolutional layer, a fifth reshaping layer, a sixth reshaping layer, a seventh reshaping layer, an eighth reshaping layer, a first Fourier transform layer, a second Fourier transform layer, a first inverse Fourier transform layer, a first element-wise multiplier, a second element-wise multiplier, and a fourth adder; and a feedforward network consisting of an eighth normalization layer, a ninth reshaping layer, a tenth reshaping layer, a third Fourier transform layer, a second inverse Fourier transform layer, a twentieth convolutional layer, a twenty-first convolutional layer, a twenty-second convolutional layer, a twenty-third convolutional layer, a twenty-fourth convolutional layer, a second GELU activation layer, a third element-wise multiplier, a fourth element-wise multiplier, and a fifth adder. 
The input of the fourth normalization layer is connected to the output of the eighth coding unit, the output of the fourth normalization layer is connected to the input of the thirteenth convolutional layer, the output of the thirteenth convolutional layer is connected to the input of the sixteenth convolutional layer, the output of the sixteenth convolutional layer is connected to the input of the fifth reshaping layer, the output of the fifth reshaping layer is connected to the input of the first Fourier transform layer, and the output of the first Fourier transform layer is connected to the first input of the first element-wise multiplier. The input of the fifth normalization layer is connected to the output of the fourth encoding unit and the first input of the fourth adder, respectively. The output of the fifth normalization layer is connected to the input of the fourteenth convolutional layer and the fifteenth convolutional layer, respectively. The output of the fourteenth convolutional layer is connected to the input of the seventeenth convolutional layer. The output of the seventeenth convolutional layer is connected to the input of the sixth reshaping layer. The output of the sixth reshaping layer is connected to the input of the second Fourier transform layer. The output of the second Fourier transform layer is connected to the second input of the first element-wise multiplier. The output of the first element-wise multiplier is connected to the input of the first inverse Fourier transform layer. The output of the first inverse Fourier transform layer is connected to the input of the eighth reshaping layer. The output of the eighth reshaping layer is connected to the input of the sixth normalization layer. The output of the sixth normalization layer is connected to the first input of the second element-wise multiplier. 
The output of the fifteenth convolutional layer is connected to the input of the eighteenth convolutional layer, the output of the eighteenth convolutional layer is connected to the input of the seventh reshaping layer, the output of the seventh reshaping layer is connected to the second input of the second element-wise multiplier, the output of the second element-wise multiplier is connected to the input of the nineteenth convolutional layer, and the output of the nineteenth convolutional layer is connected to the second input of the fourth adder. The output of the fourth adder is connected to the input of the seventh normalization layer and the first input of the fifth adder. The output of the seventh normalization layer is connected to the input of the ninth reshaping layer. The output of the ninth reshaping layer is connected to the input of the third Fourier transform layer. The output of the third Fourier transform layer is connected to the first input of the third element-wise multiplier. The second input of the third element-wise multiplier is connected to the learnable parameter matrix. The output of the third element-wise multiplier is connected to the input of the second inverse Fourier transform layer. The output of the second inverse Fourier transform layer is connected to the input of the tenth reshaping layer. The output of the tenth reshaping layer is connected to the inputs of the twentieth and twenty-first convolutional layers. The output of the twentieth convolutional layer is connected to the input of the twenty-second convolutional layer. The output of the twenty-second convolutional layer is connected to the first input of the fourth element-wise multiplier. The output of the twenty-first convolutional layer is connected to the input of the twenty-third convolutional layer. 
The output of the twenty-third convolutional layer is connected to the input of the second GELU activation layer. The output of the second GELU activation layer is connected to the second input of the fourth element-wise multiplier. The output of the fourth element-wise multiplier is connected to the input of the twenty-fourth convolutional layer. The output of the twenty-fourth convolutional layer is connected to the input of the adaptive spatial frequency fusion unit.
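In the second cross-attention block, two feature streams are Fourier transformed, multiplied element by element, and transformed back. By the convolution theorem, this is equivalent to a circular convolution of the two streams in the original domain, so every output position depends on all input positions. The sketch below demonstrates the equivalence on one-dimensional signals with a direct discrete Fourier transform. The one-dimensional setting, the plain (non-conjugated) product, and all function names are assumptions for illustration; the patent operates on feature maps and does not specify these details here.

```python
import cmath


def dft(signal, inverse=False):
    """Direct O(n^2) discrete Fourier transform of a list of numbers."""
    n = len(signal)
    sign = 1 if inverse else -1
    result = []
    for k in range(n):
        acc = sum(
            signal[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n)
        )
        result.append(acc / n if inverse else acc)
    return result


def frequency_interaction(q, k):
    """Multiply the two spectra element by element and transform back."""
    product = [a * b for a, b in zip(dft(q), dft(k))]
    return [value.real for value in dft(product, inverse=True)]


def circular_convolution(q, k):
    """Reference result computed directly in the original domain."""
    n = len(q)
    return [sum(q[m] * k[(i - m) % n] for m in range(n)) for i in range(n)]


q_signal = [1.0, 2.0, 3.0, 4.0]
k_signal = [0.0, 1.0, 0.0, 0.0]
via_frequency = frequency_interaction(q_signal, k_signal)
via_space = circular_convolution(q_signal, k_signal)
```

With a fast Fourier transform in place of the direct transform, the frequency-domain route costs O(n log n) rather than the O(n²) of an explicit pairwise interaction, which is one plausible reason for performing this interaction in the frequency domain.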
7. The method for collaborative extraction of building height and contour according to claim 6, characterized in that the dual-branch decoding module includes: a first decoding unit, a second decoding unit, a third decoding unit, a fourth decoding unit, a fifth decoding unit, a sixth decoding unit, a seventh decoding unit, an eighth decoding unit, a height regression unit, and a contour extraction unit; The first input terminal of the first decoding unit and the first input terminal of the second decoding unit are both connected to the output terminal of the adaptive spatial frequency fusion unit, and the second input terminal of the first decoding unit and the second input terminal of the second decoding unit are both connected to the output terminal of the third feature fusion unit. The first input terminal of the third decoding unit is connected to the output terminal of the first decoding unit, the first input terminal of the fourth decoding unit is connected to the output terminal of the second decoding unit, and the second input terminals of the third decoding unit and the fourth decoding unit are both connected to the output terminal of the second feature fusion unit. The first input terminal of the fifth decoding unit is connected to the output terminal of the third decoding unit, the first input terminal of the sixth decoding unit is connected to the output terminal of the fourth decoding unit, and the second input terminals of the fifth decoding unit and the sixth decoding unit are both connected to the output terminal of the first feature fusion unit. 
The input terminal of the seventh decoding unit is connected to the output terminal of the fifth decoding unit, the input terminal of the eighth decoding unit is connected to the output terminal of the sixth decoding unit, the output terminal of the seventh decoding unit is connected to the input terminal of the height regression unit, the output terminal of the height regression unit is the first output terminal of the dual-branch decoding module, the output terminal of the eighth decoding unit is connected to the input terminal of the contour extraction unit, and the output terminal of the contour extraction unit is the second output terminal of the dual-branch decoding module.
8. The method for collaborative extraction of building height and contour according to claim 7, characterized in that, The first decoding unit has the same structure as the second decoding unit, the third decoding unit, the fourth decoding unit, the fifth decoding unit, the sixth decoding unit, the seventh decoding unit, and the eighth decoding unit; The first decoding unit includes a bilinear interpolation block, a third convolution block, and a second Mamba block connected in sequence; The input terminal of the bilinear interpolation block is the input terminal of the first decoding unit, and the output terminal of the second Mamba block is the output terminal of the first decoding unit.
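The bilinear interpolation block at the start of each decoding unit enlarges the feature map before the later blocks process it. The sketch below upsamples a single channel with bilinear interpolation. The scale factor of 2 and the align-corners convention, in which the corner samples of the output coincide with those of the input, are assumptions; neither is stated in this section, and the function name is hypothetical.

```python
def bilinear_upsample(grid, scale=2):
    """Upsample an H x W grid by an integer factor using bilinear interpolation.

    Assumes the align-corners convention: output corners equal input corners.
    """
    h, w = len(grid), len(grid[0])
    out_h, out_w = h * scale, w * scale
    result = []
    for i in range(out_h):
        # Map the output row index back to a fractional input row position.
        y = i * (h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, h - 1)
        fy = y - y0
        row = []
        for j in range(out_w):
            x = j * (w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, w - 1)
            fx = x - x0
            # Interpolate along x on the two neighboring rows, then along y.
            top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
            bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        result.append(row)
    return result


coarse = [[0.0, 3.0], [6.0, 9.0]]
fine = bilinear_upsample(coarse)
```

Applied once per decoding stage, an interpolation of this kind restores spatial resolution step by step, so that the skip features from the feature fusion units can be concatenated at matching sizes.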