Building roof extraction method based on optical and SAR remote sensing images
Adaptive feature alignment and cross-modal multi-scale feature fusion modules address the neglect of low-level features and the semantic gap between modalities in multimodal remote sensing image fusion, improving the accuracy of building roof extraction.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ZHEJIANG UNIV
- Filing Date
- 2024-12-16
- Publication Date
- 2026-06-16
AI Technical Summary
Existing multimodal remote sensing image fusion methods neglect low-level features and other supplementary information in building roof extraction, and semantic gaps exist between different modalities, resulting in the failure to fully utilize the complementary advantages of information.
An adaptive feature alignment module and a cross-modal multi-scale feature fusion module are adopted. The adaptive feature alignment module aligns high-level features of optical and SAR images, and the cross-modal multi-scale feature fusion module performs feature fusion. Combined with the channel self-attention mechanism, discriminative features are selectively fused.
It improves the accuracy of building roof extraction, makes full use of the complementary information of optical and SAR images, and enhances the feature representation capability.
Figure CN120014257B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of remote sensing image semantic segmentation technology, specifically relating to a method for extracting building roofs based on optical and SAR remote sensing images.
Background Technology
[0002] The construction of distributed rooftop photovoltaic systems plays an important role in the transition to renewable energy power generation. The extraction of building roof area is the first problem to be solved in the assessment of the potential of distributed rooftop photovoltaics, which directly affects the accuracy of rooftop photovoltaic resource assessment.
[0003] High-resolution remote sensing images mainly include optical images and synthetic aperture radar (SAR) images. Optical images provide high spatial resolution and rich spectral and textural information, but are easily affected by weather conditions. SAR, on the other hand, can operate in all weather conditions and penetrate certain ground objects. Therefore, optical and SAR images are complementary in terms of ground information, and multimodal segmentation that fuses the two can improve segmentation accuracy compared to single-modal image segmentation.
[0004] In recent years, multimodal fusion methods have become an active research topic, and several multimodal fusion semantic segmentation algorithms have been proposed and applied to the extraction of building roofs. However, they still face challenges when fusing data of different modalities. First, existing methods typically focus on high-level features while neglecting low-level features and other supplementary information. Second, because the imaging mechanisms differ, there is a semantic gap between data of different modalities, and direct feature fusion does not fully utilize the complementary information between them.
Summary of the Invention
[0005] To address the shortcomings of existing technologies, this invention provides a method for extracting building rooftops based on optical and SAR remote sensing images. It designs an adaptive feature alignment module and a cross-modal multi-scale feature fusion module, which are used for the alignment of heterogeneous modal data and the feature fusion of multimodal data, respectively, making full use of the complementary information provided by optical and SAR images.
[0006] The technical solution for achieving the objective of this invention is as follows:
[0007] A method for extracting building rooftops based on optical and SAR remote sensing imagery, comprising:
[0008] Obtain the low-level features and the high-level features of the optical and SAR images;
[0009] Based on the high-level features, identify the relevant information between the two modalities in the spatial dimension and align the modal features;
[0010] Pass the aligned high-level features of the optical and SAR images through a dilated spatial convolution pooling pyramid (atrous spatial pyramid pooling, ASPP) to obtain multi-scale feature representations;
[0011] Group the low-level features and the multi-scale high-level features respectively, and use a channel self-attention mechanism to adaptively fuse discriminative features by assigning a weight to each modality and selectively discarding irrelevant components, thus obtaining the fused low-level and high-level features;
[0012] Upsample the fused high-level feature by a factor of 4, concatenate it with the fused low-level feature, pass the result through a convolutional layer, and upsample by a factor of 4 again, thereby forming the building roof extraction network.
[0013] According to the method described above, the adaptive feature alignment process is as follows:
[0014] The high-level features of the acquired optical and SAR images are converted to the reproducing kernel Hilbert space (RKHS);
[0015] Feature alignment is achieved by minimizing the distance between the transformed feature distributions of the two modalities. The calculation formula is as follows:
[0016]
[0017] Wherein: $F_{opt}$ denotes the high-level optical features, $F_{sar}$ denotes the high-level SAR features, and the feature mapping denotes the unified representation of the features after transformation to the RKHS;
[0018] The semantic distribution alignment loss is added to the loss function, and its calculation formula is as follows:
[0019] $L = L_{dice} + L_{sda}$
[0020] $L_{sda} = \beta\,\mathrm{MMD}^2(F_{opt}, F_{sar})$
[0021] Wherein: $L$ is the loss function, which consists of two parts, $L_{dice}$ and $L_{sda}$; $L_{dice}$ is the Dice loss, used as the loss function for class-imbalanced semantic segmentation; $L_{sda}$ is the semantic distribution alignment loss, obtained by multiplying the squared MMD distance between the feature distributions of the optical and SAR images by the coefficient $\beta$.
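For reference, the standard empirical form of the maximum mean discrepancy is sketched below. This is a hedged reconstruction rather than the patent's own formula: the symbols $\phi$ (the mapping into the RKHS $\mathcal{H}$), $n$ and $m$ (the numbers of optical and SAR feature samples), and the sample indices are assumptions introduced here for illustration.

```latex
\[
\mathrm{MMD}(F_{opt}, F_{sar}) =
\left\| \frac{1}{n}\sum_{i=1}^{n}\phi\!\left(F_{opt}^{(i)}\right)
      - \frac{1}{m}\sum_{j=1}^{m}\phi\!\left(F_{sar}^{(j)}\right) \right\|_{\mathcal{H}}
\]
```

Under this form, squaring the norm and scaling by $\beta$ gives a term of the shape used for $L_{sda}$.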
[0022] According to the method described above, the cross-modal multi-scale feature fusion process is as follows:
[0023] Superimpose $F_{opt}$ and $F_{sar}$ to obtain the superimposed features;
[0024] Perform the feature compression operation, which encodes all spatial features of a channel into one global feature;
[0025] Perform an adaptive calibration operation on the global features to learn the relationship between the channels and obtain the weights of the different channels;
[0026] Perform a weight calibration operation, multiplying the weights $E$ of the different channels by the original feature map;
[0027] Perform a convolution operation on the weighted features to obtain the fused features $F_{opt-sar}$.
[0028] According to the method described above, the calculation formula for feature compression operation is as follows:
[0029]
[0030] Wherein: $F_{sq}$ denotes the feature compression operation, $H$ is the height of the input image, $W$ is the width of the input image, and $i$ and $j$ index the current pixel position.
[0031] The calculation formula for adaptive calibration of global features is as follows:
[0032] $E = F_{ex}(S, W) = \sigma(g(S, W)) = \sigma(W_2\,\delta(W_1 S))$
[0033] Wherein: $F_{ex}$ denotes the adaptive calibration operation, $W_1$ and $W_2$ are the weight matrices of the two fully connected layers, $\delta$ is the ReLU function, and $\sigma$ is the Sigmoid function;
[0034] The formula for weight calibration is as follows:
[0035]
[0036] Wherein: $F_{scale}$ denotes the weight calibration operation, which is applied to the superimposed features, and $E$ denotes the learned weights.
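The compression and calibration steps follow the familiar squeeze-and-excitation pattern. As a hedged sketch only, the two operations could take the form below, assuming that compression is global average pooling and that calibration is a per-channel product. The symbol $\tilde{F}$ for the superimposed features and the channel index $c$ are assumptions introduced here, not notation from the patent.

```latex
\[
S_c = F_{sq}(\tilde{F}_c) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W}\tilde{F}_c(i, j),
\qquad
F_{scale}(\tilde{F}_c, E_c) = E_c \cdot \tilde{F}_c
\]
```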
[0037] A device for extracting building rooftops based on optical and SAR remote sensing imagery, comprising:
[0038] A feature acquisition module, used to obtain the low-level features and the high-level features of the optical and SAR images;
[0039] A feature alignment module, used to identify, based on the high-level features, the relevant information between the two modalities in the spatial dimension and align the modal features;
[0040] A multi-scale feature acquisition module, used to pass the high-level features of the optical and SAR images through a dilated spatial convolution pooling pyramid (atrous spatial pyramid pooling, ASPP) to obtain multi-scale feature representations;
[0041] A feature fusion module, used to obtain the fused high-level and low-level features separately: the low-level features and the multi-scale high-level features are each grouped, and a channel self-attention mechanism adaptively fuses discriminative features by assigning a weight to each modality and selectively discarding irrelevant components, thus obtaining the fused features;
[0042] An upsampling module, used to upsample the fused high-level feature by a factor of 4, concatenate it with the fused low-level feature, pass the result through a convolutional layer, and upsample by a factor of 4 again, thereby forming the building roof extraction network.
[0043] According to the aforementioned apparatus, the adaptive feature alignment process is as follows:
[0044] The high-level features of the acquired optical and SAR images are converted to the reproducing kernel Hilbert space (RKHS);
[0045] Feature alignment is achieved by minimizing the distance between the transformed feature distributions of the two modalities. The calculation formula is as follows:
[0046]
[0047] Wherein: $F_{opt}$ denotes the high-level optical features, $F_{sar}$ denotes the high-level SAR features, and the feature mapping denotes the unified representation of the features after transformation to the RKHS;
[0048] The semantic distribution alignment loss is added to the loss function, and its calculation formula is as follows:
[0049] $L = L_{dice} + L_{sda}$
[0050] $L_{sda} = \beta\,\mathrm{MMD}^2(F_{opt}, F_{sar})$
[0051] Wherein: $L$ is the loss function, which consists of two parts, $L_{dice}$ and $L_{sda}$; $L_{dice}$ is the Dice loss, used as the loss function for class-imbalanced semantic segmentation; $L_{sda}$ is the semantic distribution alignment loss, obtained by multiplying the squared MMD distance between the feature distributions of the optical and SAR images by the coefficient $\beta$.
[0052] According to the aforementioned apparatus, the cross-modal multi-scale feature fusion process is as follows:
[0053] Superimpose $F_{opt}$ and $F_{sar}$ to obtain the superimposed features;
[0054] Perform the feature compression operation, which encodes all spatial features of a channel into one global feature;
[0055] Perform an adaptive calibration operation on the global features to learn the relationship between the channels and obtain the weights of the different channels;
[0056] Perform a weight calibration operation, multiplying the weights $E$ of the different channels by the original feature map;
[0057] Perform a convolution operation on the weighted features to obtain the fused features $F_{opt-sar}$.
[0058] According to the aforementioned apparatus, the calculation formula for the feature compression operation is as follows:
[0059]
[0060] Wherein: $F_{sq}$ denotes the feature compression operation, $H$ is the height of the input image, $W$ is the width of the input image, and $i$ and $j$ index the current pixel position.
[0061] The calculation formula for adaptive calibration of global features is as follows:
[0062] $E = F_{ex}(S, W) = \sigma(g(S, W)) = \sigma(W_2\,\delta(W_1 S))$
[0063] Wherein: $F_{ex}$ denotes the adaptive calibration operation, $W_1$ and $W_2$ are the weight matrices of the two fully connected layers, $\delta$ is the ReLU function, and $\sigma$ is the Sigmoid function;
[0064] The formula for weight calibration is as follows:
[0065]
[0066] Wherein: $F_{scale}$ denotes the weight calibration operation, which is applied to the superimposed features, and $E$ denotes the learned weights.
[0067] Based on the above technical solution, the present invention has the following beneficial effects:
[0068] This invention addresses the semantic gap between data of different modalities. Through adaptive feature alignment and cross-modal multi-scale feature fusion, it can effectively fuse features from optical and SAR images, fully utilize the complementary information they provide, and improve the accuracy of building roof extraction.
Attached Figure Description
[0069] Figure 1 is a diagram of the network model structure of the present invention.
[0070] Figure 2 is a diagram of the cross-modal multi-scale feature fusion structure of the present invention.
Detailed Implementation
[0071] The technical solution of the present invention will now be clearly and completely described with reference to the accompanying drawings. The specific embodiments described herein are merely illustrative of the invention and do not limit the scope of protection of the invention.
[0072] Figure 1 shows the network model structure of this invention, which employs a classic encoder-decoder structure. The method includes the following steps:
[0073] (1) In the encoder section, optical and SAR images are input into two independent feature extractors to obtain low-level and high-level features of optical and SAR images, respectively:
[0074] (1.1) Preprocess the optical image and the SAR image, perform spatial registration, and crop the images to 512×512. Input the preprocessed images into a convolutional network with ResNet-50 as the backbone.
[0075] (1.2) Obtain the low-level features of the optical and SAR images from the second layer, with a size of 128×128, and the high-level features from the fourth layer, with a size of 32×32.
[0076] (2) Because the imaging mechanisms differ, images of different modalities represent the same object in different ways. These differences can lead the model to overlook the fact that the two input images share the same semantics, making it difficult to capture the underlying semantic representation. The adaptive feature alignment module aims to better learn the features of each single modality before fusion: it transforms the high-level features of the optical and SAR images into the same semantic space and aligns them by minimizing the distance between the two feature distributions. The high-level features of the optical and SAR images are passed to the adaptive feature alignment module, which identifies the relevant information between the two modalities in the spatial dimension and aligns the modal features:
[0077] (2.1) The high-level features of the acquired optical and SAR images are converted to the reproducing kernel Hilbert space (RKHS);
[0078] (2.2) The maximum mean discrepancy (MMD) is used to estimate the distance between the distributions, and feature alignment is performed by minimizing the distance between the two feature distributions. The calculation formula is as follows:
[0079]
[0080] Wherein: $F_{opt}$ denotes the high-level optical features, $F_{sar}$ denotes the high-level SAR features, and the feature mapping denotes the unified representation after feature transformation;
[0081] (2.3) Add the semantic distribution alignment loss to the loss function; its calculation formula is as follows:
[0082] $L = L_{dice} + L_{sda}$
[0083] $L_{sda} = \beta\,\mathrm{MMD}^2(F_{opt}, F_{sar})$
[0084] Wherein: $L$ is the loss function, which consists of two parts, $L_{dice}$ and $L_{sda}$; $L_{dice}$ is the Dice loss, commonly used as a loss function for class-imbalanced semantic segmentation; $L_{sda}$ is the semantic distribution alignment loss, obtained by multiplying the square of the previously calculated MMD by the coefficient $\beta$.
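The combined loss in step (2.3) can be illustrated with a minimal NumPy sketch. Several choices here are assumptions, not details from the patent: the Gaussian (RBF) kernel and its bandwidth `sigma`, the biased MMD² estimator, treating each spatial location of a feature map as one sample row, the soft Dice formulation, and the placeholder value `beta=0.1`. All function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """RBF kernel matrix k(x_i, y_j); the kernel choice is an assumption."""
    sq_dist = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(y**2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    # Clamp tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))

def mmd_squared(f_opt, f_sar, sigma=1.0):
    """Biased estimate of MMD^2 between sample sets of shape (n, d) and (m, d)."""
    k_xx = gaussian_kernel(f_opt, f_opt, sigma).mean()
    k_yy = gaussian_kernel(f_sar, f_sar, sigma).mean()
    k_xy = gaussian_kernel(f_opt, f_sar, sigma).mean()
    return k_xx + k_yy - 2.0 * k_xy

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss for a binary mask; pred holds probabilities in [0, 1]."""
    inter = np.sum(pred * target)
    return 1.0 - (2.0 * inter + eps) / (np.sum(pred) + np.sum(target) + eps)

def total_loss(pred, target, f_opt, f_sar, beta=0.1):
    """L = L_dice + beta * MMD^2(F_opt, F_sar); beta=0.1 is a placeholder."""
    return dice_loss(pred, target) + beta * mmd_squared(f_opt, f_sar)
```

With identical feature sets the alignment term is zero, so the loss reduces to the Dice term; the further apart the two feature distributions are, the larger the penalty scaled by `beta`.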
[0085] (3) The 32×32 high-level features of the optical and SAR images obtained from the fourth layer in (1.2) are passed through a dilated spatial convolution pooling pyramid (atrous spatial pyramid pooling, ASPP) to obtain multi-scale feature representations of the two images. The pyramid samples the given input in parallel with dilated convolutions at different sampling rates, which is equivalent to capturing the context of the feature map at multiple scales.
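The parallel dilated sampling described in step (3) can be sketched in NumPy on a single-channel map. This is a simplified illustration under stated assumptions: the dilation rates `(1, 6, 12, 18)` are borrowed from common ASPP configurations and are not given in the patent, one shared 3×3 kernel is used for every branch, and the pooling branch and the final projection of a full ASPP module are omitted. The function names are hypothetical.

```python
import numpy as np

def dilated_conv2d(x, kernel, rate):
    """3x3 dilated convolution on a single-channel map with zero 'same' padding."""
    h, w = x.shape
    # A 3x3 kernel with dilation `rate` reaches `rate` pixels in each direction.
    padded = np.pad(x, rate)
    out = np.zeros((h, w), dtype=float)
    for ki in range(3):
        for kj in range(3):
            # Tap (ki, kj) reads the input shifted by (ki - 1, kj - 1) * rate.
            out += kernel[ki, kj] * padded[ki * rate : ki * rate + h,
                                           kj * rate : kj * rate + w]
    return out

def aspp_branches(x, kernel, rates=(1, 6, 12, 18)):
    """Apply the same kernel at several dilation rates and stack the results."""
    return np.stack([dilated_conv2d(x, kernel, r) for r in rates], axis=0)
```

Each branch keeps the 32×32 spatial size while its nine taps are spread over a window of side `2 * rate + 1`, which is how larger rates capture wider context without extra parameters.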
[0086] (4) The 128×128 low-level features obtained from the second layer in (1.2) and the multi-scale high-level features obtained in (3) are each grouped and, as shown in Figure 2, input to the cross-modal multi-scale feature fusion module. This module is based on a channel attention mechanism and adaptively assigns a fusion weight to each modal feature according to its contribution. The weight coefficients are learned from statistical information extracted from global features and are used to adjust the channel feature responses and reduce interference, thereby enhancing the representational power of the fused features. Discriminative features are fused adaptively to obtain the fused features. Here, optical features are uniformly denoted $F_{opt}$ and SAR features are denoted $F_{sar}$. The processing procedure is as follows:
[0087] (4.1) Superimpose $F_{opt}$ and $F_{sar}$ to obtain the superimposed features;
[0088] (4.2) Feature compression encodes all spatial features of a channel into one global feature; the calculation formula is as follows:
[0089]
[0090] Wherein: $F_{sq}$ denotes the feature compression operation, $H$ is the height of the input image, $W$ is the width of the input image, and $i$ and $j$ index the current pixel position.
[0091] (4.3) Perform adaptive calibration on the global features, as shown in Formula 2, to learn the relationship between the channels and obtain the weights of the different channels. The calculation formula is as follows:
[0092] $E = F_{ex}(S, W) = \sigma(g(S, W)) = \sigma(W_2\,\delta(W_1 S))$
[0093] Wherein: $F_{ex}$ denotes the adaptive calibration operation, $W_1$ and $W_2$ are the weight matrices of the two fully connected layers, $\delta$ is the ReLU function, and $\sigma$ is the Sigmoid function;
[0094] (4.4) Perform weight calibration by multiplying the weights $E$ of the different channels by the original feature maps. The calculation formula is as follows:
[0095]
[0096] Wherein: $F_{scale}$ denotes the weight calibration operation, which is applied to the features superimposed in step (4.1), and $E$ denotes the weights learned in step (4.3);
(4.5) Perform a convolution operation on the weighted features to obtain the fused features $F_{opt-sar}$, namely the fused low-level features and the fused high-level features.
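Steps (4.1) to (4.5) can be sketched end to end in NumPy. The sketch rests on assumptions that the text does not confirm: "superimposing" is taken to be channel-wise concatenation, compression is taken to be global average pooling, and the final convolution is modelled as a 1×1 convolution written as a channel-mixing matrix. The function name and all weight shapes are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention_fuse(f_opt, f_sar, w1, w2, w_out):
    """Fuse two (C, H, W) feature maps with squeeze-excitation style channel weights."""
    # (4.1) Superimpose the two modalities (assumed: concatenate) -> (2C, H, W).
    stacked = np.concatenate([f_opt, f_sar], axis=0)
    # (4.2) Compress each channel to one global value (assumed: spatial mean).
    s = stacked.mean(axis=(1, 2))
    # (4.3) Adaptive calibration: E = sigmoid(W2 @ relu(W1 @ S)).
    e = sigmoid(w2 @ relu(w1 @ s))
    # (4.4) Weight calibration: rescale every channel by its learned weight.
    weighted = stacked * e[:, None, None]
    # (4.5) 1x1 convolution as a (C_out, 2C) channel-mixing matrix.
    fused = np.einsum("oc,chw->ohw", w_out, weighted)
    return fused, e
```

Because the weights pass through a sigmoid, each channel of each modality is scaled by a value between 0 and 1, which is how less informative channels are suppressed before the final convolution mixes the two modalities.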
[0097] (5) The 32×32 high-level fused feature obtained in (4.5) is upsampled by a factor of 4 to 128×128, matching the size of the 128×128 low-level fused feature obtained in (4.5). The two are concatenated and passed through a convolutional layer, whose output is 128×128. A further 4× upsampling yields the semantic segmentation result of size 512×512.
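The size bookkeeping of the decoder in step (5) can be checked with a short NumPy sketch. The interpolation mode (nearest neighbour), the use of a 1×1 convolution written as a channel-mixing matrix, and the channel counts in the usage note are assumptions made for illustration; the patent specifies only the spatial sizes. The function names are hypothetical.

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) array by an integer factor."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def decode(high_fused, low_fused, w_conv):
    """Upsample x4, concatenate with low-level features, convolve, upsample x4."""
    up = upsample_nearest(high_fused, 4)                # 32x32 -> 128x128
    merged = np.concatenate([up, low_fused], axis=0)    # channel-wise concatenation
    logits = np.einsum("oc,chw->ohw", w_conv, merged)   # 1x1 conv, stays 128x128
    return upsample_nearest(logits, 4)                  # 128x128 -> 512x512
```

For example, a high-level fused feature of shape `(8, 32, 32)`, a low-level fused feature of shape `(4, 128, 128)`, and a `w_conv` of shape `(1, 12)` produce a single-channel map of shape `(1, 512, 512)`, the same spatial size as the cropped input images.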
[0098] A device for extracting building rooftops based on optical and SAR remote sensing imagery, comprising:
[0099] A feature acquisition module, used to obtain the low-level features and the high-level features of the optical and SAR images;
[0100] A feature alignment module, used to identify, based on the high-level features, the relevant information between the two modalities in the spatial dimension and align the modal features;
[0101] A multi-scale feature acquisition module, used to pass the high-level features of the optical and SAR images through a dilated spatial convolution pooling pyramid (atrous spatial pyramid pooling, ASPP) to obtain multi-scale feature representations;
[0102] A feature fusion module, used to obtain the fused high-level and low-level features separately: the low-level features and the multi-scale high-level features are each grouped, and a channel self-attention mechanism adaptively fuses discriminative features by assigning a weight to each modality and selectively discarding irrelevant components, thus obtaining the fused features;
[0103] An upsampling module, used to upsample the fused high-level feature by a factor of 4, concatenate it with the fused low-level feature, pass the result through a convolutional layer, and upsample by a factor of 4 again, thereby forming the building roof extraction network.
[0104] The adaptive feature alignment process of the aforementioned device is as follows:
[0105] The high-level features of the acquired optical and SAR images are converted to the reproducing kernel Hilbert space (RKHS);
[0106] Feature alignment is achieved by minimizing the distance between the transformed feature distributions of the two modalities. The calculation formula is as follows:
[0107]
[0108] Wherein: $F_{opt}$ denotes the high-level optical features, $F_{sar}$ denotes the high-level SAR features, and the feature mapping denotes the unified representation of the features after transformation to the RKHS;
[0109] The semantic distribution alignment loss is added to the loss function, and its calculation formula is as follows:
[0110] $L = L_{dice} + L_{sda}$
[0111] $L_{sda} = \beta\,\mathrm{MMD}^2(F_{opt}, F_{sar})$
[0112] Wherein: $L$ is the loss function, which consists of two parts, $L_{dice}$ and $L_{sda}$; $L_{dice}$ is the Dice loss, used as the loss function for class-imbalanced semantic segmentation; $L_{sda}$ is the semantic distribution alignment loss, obtained by multiplying the squared MMD distance between the feature distributions of the optical and SAR images by the coefficient $\beta$.
[0113] According to the aforementioned apparatus, the cross-modal multi-scale feature fusion process is as follows:
[0114] Superimpose $F_{opt}$ and $F_{sar}$ to obtain the superimposed features;
[0115] Perform the feature compression operation, which encodes all spatial features of a channel into one global feature;
[0116] Perform an adaptive calibration operation on the global features to learn the relationship between the channels and obtain the weights of the different channels;
[0117] Perform a weight calibration operation, multiplying the weights $E$ of the different channels by the original feature map;
[0118] Perform a convolution operation on the weighted features to obtain the fused features $F_{opt-sar}$.
[0119] The calculation formula for the feature compression operation of the aforementioned device is as follows:
[0120]
[0121] Wherein: $F_{sq}$ denotes the feature compression operation, $H$ is the height of the input image, $W$ is the width of the input image, and $i$ and $j$ index the current pixel position.
[0122] The calculation formula for adaptive calibration of global features is as follows:
[0123] $E = F_{ex}(S, W) = \sigma(g(S, W)) = \sigma(W_2\,\delta(W_1 S))$
[0124] Wherein: $F_{ex}$ denotes the adaptive calibration operation, $W_1$ and $W_2$ are the weight matrices of the two fully connected layers, $\delta$ is the ReLU function, and $\sigma$ is the Sigmoid function;
[0125] The formula for weight calibration is as follows:
[0126]
[0127] Wherein: $F_{scale}$ denotes the weight calibration operation, which is applied to the superimposed features, and $E$ denotes the learned weights.
[0128] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code. The solutions in the embodiments of the present invention can be implemented using various computer languages, such as the object-oriented programming language Java and the interpreted scripting language JavaScript.
[0129] This invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create a device for implementing the functions specified in one or more processes of the flowchart and/or one or more boxes of the block diagram.
[0130] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowchart and/or one or more boxes of the block diagram.
[0131] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more boxes of the block diagram.
[0132] Although preferred embodiments of the invention have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including both the preferred embodiments and all changes and modifications falling within the scope of the invention.
[0133] Obviously, those skilled in the art can make various modifications and variations to this invention without departing from its spirit and scope. Therefore, if these modifications and variations fall within the scope of the claims of this invention and their equivalents, this invention also intends to include these modifications and variations.
Claims
1. A method for extracting building rooftops based on optical and SAR remote sensing imagery, characterized in that it comprises: obtaining the low-level features and the high-level features of the optical and SAR images; based on the high-level features, aligning the modal features by transforming them to the same semantic space and minimizing the feature distribution distance between the two; passing the aligned high-level features of the optical and SAR images through a dilated spatial convolution pooling pyramid to obtain multi-scale feature representations; grouping the low-level features and the multi-scale high-level features respectively, and using a channel self-attention mechanism to adaptively fuse discriminative features by assigning a weight to each modality and selectively discarding irrelevant components, thus obtaining the fused low-level and high-level features; and upsampling the fused high-level feature by a factor of 4, concatenating it with the fused low-level feature, passing the result through a convolutional layer, and upsampling by a factor of 4 again, to form the building roof extraction network.
2. The method according to claim 1, characterized in that the alignment process is as follows: the high-level features of the acquired optical and SAR images are converted to the RKHS; feature alignment is achieved by minimizing the MMD distance between the transformed feature distributions of the two modalities, wherein $F_{opt}$ denotes the high-level optical features, $F_{sar}$ denotes the high-level SAR features, and the feature mapping denotes the unified representation of the features after transformation to the RKHS; the semantic distribution alignment loss is added to the loss function, calculated as $L = L_{dice} + L_{sda}$ with $L_{sda} = \beta\,\mathrm{MMD}^2(F_{opt}, F_{sar})$; wherein $L$ is the loss function, which consists of two parts, $L_{dice}$ and $L_{sda}$; $L_{dice}$ is the Dice loss, used as the loss function for class-imbalanced semantic segmentation; and $L_{sda}$ is the semantic distribution alignment loss, which is the squared MMD distance between the feature distributions of the optical and SAR images multiplied by the coefficient $\beta$.
3. The method according to claim 1, characterized in that the cross-modal multi-scale feature fusion process is as follows: superimpose $F_{opt}$ and $F_{sar}$ to obtain the superimposed features; perform the feature compression operation, which encodes all spatial features of a channel into one global feature; perform an adaptive calibration operation on the global features to learn the relationship between the channels and obtain the weights of the different channels; perform a weight calibration operation, multiplying the weights $E$ of the different channels by the original feature maps to obtain the weighted features; and perform a convolution operation on the weighted features to obtain the fused features $F_{opt-sar}$.
4. The method according to claim 3, characterized in that the feature compression operation is denoted $F_{sq}$, where $H$ is the height of the input image, $W$ is the width of the input image, and $i$ and $j$ index the current pixel position; the calculation formula for adaptive calibration of the global features is $E = F_{ex}(S, W) = \sigma(g(S, W)) = \sigma(W_2\,\delta(W_1 S))$, where $F_{ex}$ denotes the adaptive calibration operation, $W_1$ and $W_2$ are the weight matrices of the two fully connected layers, $\delta$ is the ReLU function, and $\sigma$ is the Sigmoid function; and the weight calibration operation is denoted $F_{scale}$ and is applied to the superimposed features with the learned weights $E$.
5. A device for extracting building rooftops based on optical and SAR remote sensing images, characterized in that it comprises: a feature acquisition module, used to obtain the low-level features and the high-level features of the optical and SAR images; a feature alignment module, used to align the modal features of the high-level features by transforming them to the same semantic space and minimizing the feature distribution distance between the two; a multi-scale feature acquisition module, used to pass the high-level features of the optical and SAR images through a dilated spatial convolution pooling pyramid to obtain multi-scale feature representations; a feature fusion module, used to obtain the fused high-level and low-level features separately, wherein the low-level features and the multi-scale high-level features are each grouped, and a channel self-attention mechanism adaptively fuses discriminative features by assigning a weight to each modality and selectively discarding irrelevant components, thus obtaining the fused features; and an upsampling module, used to upsample the fused high-level feature by a factor of 4, concatenate it with the fused low-level feature, pass the result through a convolutional layer, and upsample by a factor of 4 again, to form the building roof extraction network.
6. The apparatus according to claim 5, characterized in that the alignment process is as follows: the high-level features of the acquired optical and SAR images are converted to the RKHS; feature alignment is achieved by minimizing the MMD distance between the transformed feature distributions of the two modalities, wherein $F_{opt}$ denotes the high-level optical features, $F_{sar}$ denotes the high-level SAR features, and the feature mapping denotes the unified representation of the features after transformation to the RKHS; the semantic distribution alignment loss is added to the loss function, calculated as $L = L_{dice} + L_{sda}$ with $L_{sda} = \beta\,\mathrm{MMD}^2(F_{opt}, F_{sar})$; wherein $L$ is the loss function, which consists of two parts, $L_{dice}$ and $L_{sda}$; $L_{dice}$ is the Dice loss, used as the loss function for class-imbalanced semantic segmentation; and $L_{sda}$ is the semantic distribution alignment loss, which is the squared MMD distance between the feature distributions of the optical and SAR images multiplied by the coefficient $\beta$.
7. The apparatus according to claim 5, characterized in that the cross-modal multi-scale feature fusion process is as follows: superimpose $F_{opt}$ and $F_{sar}$ to obtain the superimposed features; perform the feature compression operation, which encodes all spatial features of a channel into one global feature; perform an adaptive calibration operation on the global features to learn the relationship between the channels and obtain the weights of the different channels; perform a weight calibration operation, multiplying the weights $E$ of the different channels by the original feature maps to obtain the weighted features; and perform a convolution operation on the weighted features to obtain the fused features $F_{opt-sar}$.
8. The apparatus according to claim 7, characterized in that the feature compression operation is denoted $F_{sq}$, where $H$ is the height of the input image, $W$ is the width of the input image, and $i$ and $j$ index the current pixel position; the calculation formula for adaptive calibration of the global features is $E = F_{ex}(S, W) = \sigma(g(S, W)) = \sigma(W_2\,\delta(W_1 S))$, where $F_{ex}$ denotes the adaptive calibration operation, $W_1$ and $W_2$ are the weight matrices of the two fully connected layers, $\delta$ is the ReLU function, and $\sigma$ is the Sigmoid function; and the weight calibration operation is denoted $F_{scale}$ and is applied to the superimposed features with the learned weights $E$.