Method and system applied to automatic extraction of building bottom polygon in remote sensing image

By introducing an offset vector prediction branch and a polygon optimization module into the instance segmentation network, the accuracy and format problems of extracting building base surfaces from remote sensing images are addressed, yielding efficient and accurate vector polygon output suitable for GIS and 3D modeling.

CN122244467A · Pending · Publication Date: 2026-06-19 · MOGANSHAN DIXIN LABORATORY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
MOGANSHAN DIXIN LABORATORY
Filing Date
2026-05-21
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies for extracting the base of buildings from remote sensing images suffer from insufficient accuracy, incompatible formats, and low efficiency. In particular, roof and side areas are difficult to distinguish in oblique images, leading to mask contour shifts and making them unsuitable for direct use in GIS or 3D modeling.

Method used

An offset vector prediction branch and a Transformer-based polygon optimization module are introduced. The instance segmentation network synchronously predicts the roof mask and the offset vector from the roof to the bottom surface of the building. Combined with the contour iteration optimization and post-processing module, the geometrically structured polygon of the building's bottom surface is directly output.

Benefits of technology

It significantly improves the geometric accuracy and format consistency of building base extraction, realizes direct output from pixels to vectors, improves extraction efficiency, and generates polygons that are closer to the professional level of manual drawing. It adapts to remote sensing images with different resolutions and shooting angles, and meets the needs of large-scale and rapid updates.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention belongs to the fields of artificial intelligence and geographic information, specifically a method and system for automatically extracting polygons from the base of buildings in remote sensing images. The method includes the following steps: feature extraction from the input remote sensing image to obtain a multi-scale feature map; prediction of the building roof mask and the offset vector from the roof to the base based on an instance segmentation network with an integrated offset vector prediction branch; iterative contour optimization of the roof mask to obtain an optimized contour point sequence and information on whether the contour points belong to inflection points or not; and contour simplification, offset overlay, and polygon merging based on the contour point sequence, category information, and offset vectors to output a vector-formatted base polygon. By introducing an offset vector prediction branch into the instance segmentation network and Transformer-based polygon optimization, it automatically outputs geometrically structured polygons from remote sensing images without manual intervention.

Description

Technical Field

[0001] This invention belongs to the fields of artificial intelligence and geographic information, and specifically relates to a method and system for automatically extracting building base polygons from remote sensing images.

Background Technology

[0002] With the widespread acquisition of high-resolution remote sensing data, detailed building extraction has become an important foundation for applications such as basic mapping updates, 3D reconstruction, and digital twins.

[0003] Currently, vector data of building base surfaces mostly rely on manual or semi-automatic drawing. Although this method has high accuracy, it is inefficient, labor-intensive, and highly subjective, making it difficult to meet the needs of updating large-scale, multi-temporal remote sensing data. Especially in densely populated urban areas, manual operation often requires marking the outline of each building, which takes far longer than the time required for image acquisition and preprocessing.

[0004] In recent years, deep learning-based building segmentation methods (such as U-Net, Mask R-CNN, and Mask2Former) have made significant progress in semantic and instance segmentation of remote sensing images. These methods can output pixel-level building masks end-to-end, greatly improving extraction efficiency. However, for data with a stereoscopic perspective, such as oblique images, these models often extract the roof and side areas of buildings together, so the mask outline shifts outward, differs significantly from the building's base, has an irregular boundary, and lacks geometric accuracy. Furthermore, the mask output is raster data, merely a pixel-level binary image in the image domain. It lacks an explicit geometric structure representation and cannot be used directly in GIS or 3D modeling systems without additional vectorization and simplification steps.

Summary of the Invention

[0005] The purpose of this invention is to overcome the shortcomings of the prior art and provide a method and system for automatically extracting polygons from the base of buildings in remote sensing images. By introducing an offset vector prediction branch and a Transformer-based polygon optimization module into the instance segmentation network, it can automatically output geometrically structured polygons from the base of buildings in remote sensing images without human intervention.

[0006] To solve the above-mentioned technical problems, the present invention adopts the following technical solution. In one aspect, the present invention discloses a method for automatically extracting building base polygons from remote sensing images, comprising the following steps:
S1. Multi-scale feature extraction: Extract features from the input remote sensing image to obtain a multi-scale feature map;
S2. Joint prediction of mask and offset vector: Based on the multi-scale feature map, an instance segmentation network with an integrated offset vector prediction branch is used to simultaneously predict the building roof mask and the offset vector from the roof to the bottom;
S3. Contour iteration optimization: Perform contour iteration optimization on the roof mask to obtain the optimized contour point sequence and the category information indicating whether each contour point is an inflection point or a non-inflection point;
S4. Post-processing and vectorization: Based on the contour point sequence, category information and offset vector, contour simplification, offset overlay and polygon merging are performed to output the building base polygon in vector format.

[0007] Preferably, step S1 specifically includes: S11, normalizing the input remote sensing image; S12, extracting image features at multiple levels using a pre-trained visual Transformer backbone network; S13, unifying the image features at multiple levels to the same number of channels through convolutional layers to generate a multi-scale feature map.

[0008] More preferably, the multiple layers of image features extracted in step S12 include feature maps output from four different network layers selected from the visual Transformer backbone network.

[0009] Preferably, step S2 specifically includes: S21, generating a building roof mask through an instance segmentation network based on multi-scale feature maps; S22, adding an offset vector prediction branch to the instance segmentation network to generate an offset vector from the roof to the bottom surface based on the multi-scale feature maps; S23, inputting the roof mask and the offset vector into a spatial transformation network to generate a bottom surface mask; S24, training the network using a composite loss function that includes roof mask category loss, roof mask loss, offset vector loss, and bottom surface mask loss.

[0010] Preferably, the offset vector prediction branch is composed of multiple fully connected layers connected sequentially. Its input is the object query features processed by the Transformer decoder in the instance segmentation network, and its output is a two-dimensional offset vector corresponding to each building instance.

[0011] Preferably, step S3 specifically includes: S31, extracting the outer contour of the roof mask and interpolating it into a fixed number of contour points to obtain an initial contour point sequence; S32, combining the multi-scale feature map, performing multiple rounds of iterative optimization on the initial contour point sequence using a Transformer decoder to obtain an optimized contour point sequence and the probability of each contour point belonging to an inflection point or a non-inflection point category.

[0012] More preferably, the number of iterations in step S32 is six, each round uses a feature map of different resolutions, and the scaling factor of the offset vector for coordinate update decreases in each round, with the scaling factors being 1024, 512, 256, 128, 64, and 32 respectively.

[0013] Preferably, step S4 specifically includes: S41, adaptively simplifying the contour points according to the inflection point probability in the category information of the contour points, retaining key inflection points and controlling the change in contour area before and after simplification to within a preset proportion, wherein the inflection point category probability of the key inflection point is higher than the dynamically adjusted threshold, and deleting the inflection point will cause the change in contour area before and after simplification to exceed the preset proportion; S42, adding the simplified contour point coordinates to the offset vector point by point to obtain the bottom contour point set; S43, merging the ground polygons that overlap, and outputting a vector polygon set with consistent topology.

[0014] In another aspect, the present invention discloses a system for automatically extracting building base polygons from remote sensing images, comprising:
a feature extraction module, used to extract and fuse features from the input remote sensing images and output multi-scale feature maps;
a mask offset prediction module, based on an instance segmentation network with an integrated offset vector prediction branch, used to synchronously output the building roof mask and the offset vector from the roof to the bottom surface;
a contour iteration optimization module, used to perform contour initialization and multi-round iterative optimization on the roof mask and output the optimized contour point sequence and contour point category information;
a post-processing module, used to perform contour simplification, offset mapping and polygon merging based on the contour point sequence, contour point category information and offset vector to generate the building bottom polygon.

[0015] Preferably, the mask offset prediction module includes:
an instance segmentation network unit, based on the Mask2Former architecture and including a pixel decoder and a Transformer decoder, used to generate object query features and the corresponding roof masks;
an offset vector prediction branch unit, composed of multiple fully connected layers connected sequentially, whose input is the object query feature output by the Transformer decoder and whose output is the offset vector corresponding to each building instance;
a spatial transformation network unit, used during the training phase to perform a differentiable geometric transformation on the roof mask based on the offset vector field, generating a predicted bottom mask that provides a geometric supervision signal.

[0016] The beneficial effects of this invention are that it not only effectively solves the two major industry problems of "inaccurate outlines" and "incompatible formats", but also achieves significant improvements in three dimensions: accuracy, efficiency and practicality, providing a highly efficient, accurate and feasible integrated solution for remote sensing building extraction.

[0017] 1. Solved the problem of geometric offset in tilted images. By introducing a mask offset prediction branch and a spatial transformation network, the model can explicitly learn and correct the roof outline offset caused by stereoscopic viewpoint, directly establishing a precise geometric correspondence between the roof and the base at the image level, significantly improving geometric accuracy.

[0018] 2. It achieves direct output from pixels to vectors. Through the contour iteration optimization module and the post-processing module, the traditional separate "segmentation-vectorization-simplification" process is integrated into a one-step end-to-end learnable process, directly outputting structured and regularized vector polygons without manual intervention.

[0019] 3. Significantly improved the quality and practicality of the results. Through multi-scale contour iterative optimization and adaptive simplification based on inflection point categories, the generated building base polygons have fewer vertices and more regular shapes, approaching the professional level of manual drawing, greatly improving the usability and aesthetics of the results. Furthermore, through multi-scale feature fusion and end-to-end training, the model can adapt to remote sensing images of different resolutions, building types, and shooting angles, maintaining stable extraction accuracy in various complex scenarios such as densely populated urban areas and urban-rural fringe areas.

[0020] 4. Significantly improves operational efficiency and reduces application costs. It completely replaces traditional manual or semi-automatic sketching, meeting the business needs of rapid updates of large-scale, multi-temporal remote sensing data; the output results are in standard vector format, which can be directly imported into a GIS platform for spatial analysis and area calculation, eliminating the intermediate steps of format conversion and data processing.

Attached Figure Description

[0021] Figure 1 is a flowchart of the method of the present invention;
Figure 2 is a flowchart of the multi-scale feature extraction method;
Figure 3 is a flowchart of the mask offset prediction method;
Figure 4 is a flowchart of the mask offset vector prediction head;
Figure 5 is a flowchart of the contour iterative optimization method;
Figure 6 is a flowchart of the single-round contour optimization method;
Figure 7 is a flowchart of the post-processing module;
Figure 8 is a flowchart of the dynamic-threshold contour simplification;
Figure 9 is the original image of buildings extracted from remote sensing imagery;
Figure 10 shows the real-world image and vector image of the initial roof mask obtained by preliminary prediction on Figure 9;
Figure 11 shows the real-world image and vector image of the optimized roof mask obtained from Figure 10 through contour iteration optimization;
Figure 12 shows the real-world image and vector image of the simplified roof mask obtained from Figure 11 after contour simplification;
Figure 13 shows the real-world image and vector image of the simplified bottom mask obtained from Figure 12 after offset vector prediction processing.

Detailed Implementation

[0022] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention.

[0023] This invention provides a method for automatically extracting building base polygons from remote sensing images which, as shown in Figure 1, includes the following steps: Step 1: Data Preparation and Preprocessing: Acquire high-resolution remote sensing images of the target area (e.g., 0.5-meter resolution RGB images) and crop them into fixed-size image patches (e.g., … pixels). For each image patch, professional annotators draw the roof vector polygon of each building and measure the roof-to-bottom offset vector in pixels; the corresponding building bottom vector polygon is then calculated based on the image's solid geometry model. The resulting dataset is divided into training and test sets at a fixed ratio (e.g., 9:1).

[0024] Step 2: Multi-scale feature extraction: The process for this step is shown in Figure 2. Its main function is to extract features from the input remote sensing image to obtain multi-scale feature maps. Specifically, this includes: 2.1 The input image patch is normalized to scale the pixel values to the range required for model training. In this embodiment, the pixel values are standardized using the ImageNet mean [123.675, 116.28, 103.53] and standard deviation [58.395, 57.12, 57.375]. These statistics are expressed on the 0-255 scale; if the pixel values are first scaled to the [0,1] interval, the equivalent values are mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225].
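The normalization in 2.1 can be sketched as follows. This is a minimal illustration rather than the embodiment's code: the function and constant names are assumptions, and it only shows that standardizing on the 0-255 scale and standardizing after scaling to [0,1] give the same result when the mean and standard deviation are divided by 255.

```python
# Illustrative only: names below are assumptions, not taken from the patent.
MEAN_255 = (123.675, 116.28, 103.53)  # ImageNet mean on the 0-255 scale (from the text)
STD_255 = (58.395, 57.12, 57.375)     # ImageNet std on the 0-255 scale (from the text)


def normalize_pixel(rgb):
    """Standardize one 0-255 RGB pixel, channel by channel."""
    return tuple((v - m) / s for v, m, s in zip(rgb, MEAN_255, STD_255))


def normalize_pixel_unit(rgb):
    """Same result via the [0, 1] route: scale first, then use mean/std divided by 255."""
    return tuple(
        (v / 255.0 - m / 255.0) / (s / 255.0)
        for v, m, s in zip(rgb, MEAN_255, STD_255)
    )
```

A pixel equal to the channel means maps to zero in every channel, and both routes agree up to floating-point rounding.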

[0025] 2.2 Backbone Network Feature Extraction: A visual Transformer model pre-trained on a large remote sensing dataset is used as the backbone network; in this embodiment, it is the ViT-L model pre-trained on the Million-AID dataset. The normalized image is input into the backbone network for forward propagation, and the outputs of several layers at different depths of the backbone are selected as multi-scale features. For example, features from layers 7, 11, 15, and 23 are selected, with corresponding spatial resolutions of … and 1024 channels each.

[0026] 2.3 Convolutional layers unify the channel dimension of the image features at the above four levels to 256, generating the multi-scale feature maps.

[0027] Step 3: Joint prediction of mask and offset vector: The process of this step is shown in Figure 3. The core improvement lies in the enhancement of the Mask2Former instance segmentation network. Specifically, this includes: 3.1 Network Construction: Based on the standard Mask2Former architecture, an offset vector prediction branch is added in parallel after each "object query" feature output by its Transformer decoder. The structure of this branch is shown in Figure 4. It is typically composed of multiple fully connected layers (e.g., 5 layers). The first layers (e.g., layers 1-4) each have an input and output dimension of 256 and are followed by ReLU activation to add nonlinearity. The last layer (layer 5) has an input dimension of 256 and an output dimension of 2, directly outputting a two-dimensional offset vector (Δx, Δy), which represents the global offset vector of the building instance.
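The branch structure described in 3.1 (four 256-to-256 fully connected layers with ReLU, then one 256-to-2 layer) can be sketched in plain Python. This is a hedged illustration only: the weights are random stand-ins for trained parameters, and all function names are hypothetical.

```python
import random


def linear(x, weight, bias):
    """y = W x + b for a single feature vector x (weight is a list of rows)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def make_layer(n_in, n_out, rng):
    """Randomly initialised layer; stands in for trained weights (assumption)."""
    weight = [[rng.uniform(-0.05, 0.05) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weight, bias


def build_offset_head(rng, hidden=256, depth=5):
    """Four hidden-to-hidden layers followed by one hidden-to-2 layer."""
    layers = [make_layer(hidden, hidden, rng) for _ in range(depth - 1)]
    layers.append(make_layer(hidden, 2, rng))
    return layers


def offset_head(query, layers):
    """Map one 256-d object-query feature to a 2-D offset (dx, dy)."""
    x = query
    for weight, bias in layers[:-1]:
        x = relu(linear(x, weight, bias))
    weight, bias = layers[-1]
    return tuple(linear(x, weight, bias))  # no activation on the last layer


rng = random.Random(0)
head = build_offset_head(rng)
query = [rng.uniform(-1.0, 1.0) for _ in range(256)]
dx, dy = offset_head(query, head)
```

Because the head is applied to each object query independently, it produces one (Δx, Δy) pair per building instance, matching the "global offset vector" wording above.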

[0028] 3.2 Dual-Task Output: The improved network outputs two results simultaneously: Roof Mask: Output from the original mask prediction head, representing a binary segmentation map of the building's roof area; Offset Vector: Output by the newly added offset vector prediction head, representing the amount of translation required to move from the roof profile to the bottom profile.

[0029] 3.3 Bottom Mask Generation: To effectively supervise offset vector learning, a Spatial Transformer Network (STN) is introduced. During training, the predicted roof mask and the predicted offset vector field are input into the STN, which performs a differentiable spatial transformation on the roof mask based on the offset vector and generates a predicted bottom mask. The loss between this predicted bottom mask and the real bottom mask is then calculated, and this loss applies only to the offset vector prediction branch: the roof mask undergoes a detach operation so that no gradient is backpropagated to the roof mask prediction branch.

[0030] 3.4 Network Training: The network is trained end-to-end using a multi-task loss function, with the total loss defined as $L_{total} = \lambda_1 L_{cls} + \lambda_2 L_{mask}^{roof} + \lambda_3 L_{offset} + \lambda_4 L_{mask}^{foot}$, where $L_{cls}$ is the roof mask category loss, computed with Cross Entropy loss using a building weight of 1.0, a background weight of 0.1, and an overall weight of 2.0; $L_{mask}^{roof}$ is the roof mask loss, computed with Cross Entropy loss and Dice loss, both with a weight of 5.0; $L_{offset}$ is the offset vector loss, computed with Smooth L1 loss with a weight of 100.0; and $L_{mask}^{foot}$ is the bottom mask loss, computed with Cross Entropy loss and Dice loss, both with a weight of 5.0. The bottom mask loss is backpropagated only to the offset vector branch, constraining it so that the predicted offset vector generates the bottom mask at the correct position through the STN. $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ are the weighting coefficients of the respective losses.
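The weighted combination in 3.4 can be illustrated with a small sketch. Two points are assumptions rather than statements from the text: the quoted weights (2.0, 5.0, 100.0, 5.0) are treated as the four weighting coefficients, and the Smooth L1 transition point `beta` is set to 1.0. Each mask term is taken to be the already-computed sum of its Cross Entropy and Dice parts.

```python
def smooth_l1(pred, target, beta=1.0):
    """Mean Smooth L1 loss over paired scalars (beta is an assumed transition point)."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)


# Weights quoted in the text; mapping them onto the four coefficients is an assumption.
WEIGHTS = {"cls": 2.0, "mask_roof": 5.0, "offset": 100.0, "mask_foot": 5.0}


def total_loss(losses, weights=WEIGHTS):
    """Weighted sum of the four per-task loss values."""
    return sum(weights[name] * losses[name] for name in weights)


# Example: an offset prediction that is 3 px off in x and exact in y.
offset_term = smooth_l1([13.0, 7.0], [10.0, 7.0])
loss = total_loss({"cls": 0.2, "mask_roof": 0.4, "offset": offset_term, "mask_foot": 0.5})
```

The comparatively large weight on the offset term reflects that it is a two-number regression competing with dense per-pixel mask losses.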

[0031] Example 1: In this example, to predict the geometric offset between the roof and the bottom surface of a building and obtain an accurate bottom surface mask, this invention proposes a conversion method based on offset vector prediction and a spatial transformation network (STN), which includes the following sub-steps. Definition of the offset vector: let $M_{roof}$ be the roof mask extracted from the remote sensing image and $M_{footprint}$ the manually annotated building base mask. For any pixel $p = (x, y)$ on the boundary of the roof mask whose corresponding point on the boundary of the bottom mask is $q = (x', y')$, the two-dimensional offset vector is $v = (\Delta x, \Delta y)$, with $\Delta x = x' - x$ and $\Delta y = y' - y$.

[0032] Offset prediction branch network structure: The offset vector prediction head includes four fully connected layers (256 input channels, 256 output channels) with ReLU activation and one fully connected layer (256 input channels, 2 output channels), outputting a two-channel offset prediction map $V_{pred}$ whose size is the same as that of the input image.

[0033] Generation of the initial bottom mask by the spatial transformation network: the predicted $V_{pred}$ is used as the sampling parameters of the spatial transformation network. For each pixel position $p$ of the roof mask $M_{roof}$, the target position $p' = p + V_{pred}(p)$ is calculated, and $M_{init}(p')$ is obtained from $M_{roof}$ by bilinear-interpolation sampling, which yields the initial bottom mask $M_{init}$. This transformation is differentiable, enabling end-to-end training.
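The warping step can be sketched for the simplest case of a single global offset per instance. This is an assumption-laden illustration, not the embodiment's STN: it uses backward sampling (each output pixel p' reads the roof mask at p' - v), treats pixels outside the grid as 0, and all names are hypothetical.

```python
import math


def bilinear_sample(mask, x, y):
    """Sample a 2-D grid at real-valued (x, y); outside the grid counts as 0."""
    h, w = len(mask), len(mask[0])
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0

    def at(ix, iy):
        return mask[iy][ix] if 0 <= ix < w and 0 <= iy < h else 0.0

    top = (1 - fx) * at(x0, y0) + fx * at(x0 + 1, y0)
    bottom = (1 - fx) * at(x0, y0 + 1) + fx * at(x0 + 1, y0 + 1)
    return (1 - fy) * top + fy * bottom


def shift_mask(roof_mask, dx, dy):
    """Move roof_mask by (dx, dy): output(p') = roof_mask(p' - v)."""
    h, w = len(roof_mask), len(roof_mask[0])
    return [
        [bilinear_sample(roof_mask, x - dx, y - dy) for x in range(w)]
        for y in range(h)
    ]


roof = [[0.0] * 5 for _ in range(5)]
roof[1][1] = 1.0                      # one roof pixel at (x=1, y=1)
moved = shift_mask(roof, 2, 1)        # integer shift: lands on (x=3, y=2)
half = shift_mask(roof, 0.5, 0.0)     # sub-pixel shift: split across two pixels
```

The interpolation weights are piecewise linear in dx and dy, which is what lets a mask loss on the warped output produce gradients with respect to the predicted offset.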

[0034] The loss function is then calculated within the effective area bounded by the roof mask in order to train the network.

[0035] This embodiment enables the automatic conversion of roof masks of arbitrary shapes into initial masks that are closer to the actual building floor, effectively solving the geometric offset problems caused by roof overhangs, parapet wall obstructions, etc.

[0036] Step 4: Contour Iteration Optimization: The process for this step is shown in Figure 5. The aim is to refine a rough roof mask into a regular sequence of contour points. Specifically, this includes: 4.1 Contour Initialization: For the roof mask predicted in step 3, use a contour extraction algorithm (such as OpenCV's findContours) to obtain the outer contour of its largest connected region. Perform uniform interpolation on this contour to obtain an ordered contour point sequence P_init = {(x1, y1), …, (x128, y128)} with a fixed number of points (e.g., 128).
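The uniform interpolation in 4.1 can be sketched as resampling a closed contour at equal arc-length intervals. The sketch below assumes the contour has already been extracted as an ordered list of (x, y) vertices with non-zero perimeter; the function name is hypothetical and the contour extraction itself is not shown.

```python
import math


def resample_closed_contour(points, n=128):
    """Return n points spaced evenly by arc length along a closed polyline."""
    closed = points + [points[0]]
    seg_len = [math.dist(a, b) for a, b in zip(closed, closed[1:])]
    step = sum(seg_len) / n
    out = []
    seg = 0       # index of the segment currently being walked
    start = 0.0   # arc length at the start of that segment
    for i in range(n):
        target = i * step
        # Advance to the segment that contains the target arc length.
        while seg < len(seg_len) - 1 and start + seg_len[seg] <= target:
            start += seg_len[seg]
            seg += 1
        t = (target - start) / seg_len[seg] if seg_len[seg] > 0 else 0.0
        (x0, y0), (x1, y1) = closed[seg], closed[seg + 1]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out


square = [(0, 0), (4, 0), (4, 4), (0, 4)]
p_init = resample_closed_contour(square, n=128)
```

A fixed point count gives the downstream decoder a fixed-length sequence regardless of how many vertices the raw contour had.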

[0037] 4.2 Optimization Network Construction: A contour iterative optimization network is constructed. This network uses the multi-scale feature map formed in step 2 as contextual information. After processing by a deformable attention encoder, it outputs four multi-scale features with resolutions of [resolutions to be filled in], each with 256 channels.

[0038] Example 2: Feature maps F1, F2, F3, and F4 at four scales are extracted from the encoder of the backbone network. Their resolutions are 1/4, 1/8, 1/16, and 1/32 of the original image, respectively, and each has 1024 channels. Convolutional processing then compresses the channels of the features at each scale, uniformly adjusting the number of channels to 256. Using the adjusted multi-scale features as input, a deformable attention encoder performs cross-scale feature interaction and fusion, computing attention over only a small number of key sampling points at each query location. This yields fused feature maps at four scales with resolutions of [resolutions to be filled in], each with 256 channels.

[0039] 4.3 Multi-round Iterative Optimization: The initial contour point sequence P_init from step 4.1 and the multi-scale features from step 4.2 are input into an optimization module composed of multiple Transformer decoders connected in series, and multiple rounds (e.g., 6 rounds) of iterative optimization are performed. The optimization process for a single round is shown in Figure 6:
4.3.1 Point Feature Extraction: Based on the current point coordinates, features at the corresponding positions are extracted from feature maps of different scales using bilinear interpolation;
4.3.2 Transformer Decoding: Point features and multi-scale features are fed into the Transformer decoder, enabling each point feature to perceive global context and local details;
4.3.3 Dual-head Prediction: The decoded point features are fed into two prediction heads respectively:
Offset prediction head: predicts the coordinate offset vector of each point in the current round; after Tanh activation and scaling by a coefficient that decreases in each round, it is added to the current coordinates to obtain the optimized point coordinates;
Category prediction head: predicts the probability that each point is an inflection point (building corner) or a non-inflection point (building straight edge).

[0040] In Example 3, the point category prediction branch consists of 5 linear layers during each optimization. The first 4 layers have an input and output dimension of 256, and are then activated by the ReLU function. The 5th layer has an input dimension of 256 and an output dimension of 2. The point offset vector prediction branch also consists of 5 linear layers. The first 4 layers are the same as the point category prediction branch. The 5th layer has an input dimension of 256 and an output dimension of 2, and is then activated by the Tanh function. The output is a normalized offset vector, which is then multiplied by a coefficient to obtain the pixel value of the offset vector. This value is then added to the contour point coordinates of the current input to obtain the updated contour points.

[0041] In the first iteration, the input multi-scale feature resolution is 32×32, and the normalized point offset vector is multiplied by a coefficient of 1024. In the second iteration, the input multi-scale feature resolution is 64×64, and the normalized point offset vector is multiplied by a coefficient of 512. In the third iteration, the input multi-scale feature resolution is 128×128, and the normalized point offset vector is multiplied by a coefficient of 256. In the fourth iteration, the input multi-scale feature resolution is 256×256, and the normalized point offset vector is multiplied by a coefficient of 128. In the fifth iteration, the input multi-scale feature resolution is 256×256, and the normalized point offset vector is multiplied by a coefficient of 64. In the sixth iteration, the input multi-scale feature resolution is 256×256, and the normalized point offset vector is multiplied by a coefficient of 32.
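The six-round schedule above can be sketched as a coarse-to-fine coordinate update. In this illustration the Transformer decoder and offset head are replaced by a hypothetical callback `predict_raw(points, feat_size)` that returns one raw (pre-Tanh) pair per point; only the Tanh-and-scale update rule and the schedule values come from the text.

```python
import math

# (feature-map side length, offset scale) per round, as listed in the text.
SCHEDULE = [(32, 1024), (64, 512), (128, 256), (256, 128), (256, 64), (256, 32)]


def update_points(points, raw_offsets, scale):
    """One round: p <- p + tanh(raw) * scale, applied per coordinate."""
    return [
        (x + math.tanh(rx) * scale, y + math.tanh(ry) * scale)
        for (x, y), (rx, ry) in zip(points, raw_offsets)
    ]


def refine(points, predict_raw):
    """Run all six rounds; predict_raw is a hypothetical stand-in for the decoder."""
    for feat_size, scale in SCHEDULE:
        points = update_points(points, predict_raw(points, feat_size), scale)
    return points


# With a predictor that always outputs zero, the contour is left unchanged.
still = refine([(10.0, 20.0)], lambda pts, size: [(0.0, 0.0)] * len(pts))
```

Since Tanh is bounded by 1 in magnitude, each round can move a point by at most its scale, so early rounds allow large corrections and later rounds only fine adjustments.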

[0042] The output of round t serves as the input of round t+1. Through multiple iterations, the vertex position gradually approaches the boundary of the real bottom surface and is regularized into a regular geometric shape.

[0043] 4.4 Loss Optimization: During training of this module, a composite loss function is used to supervise the point coordinate offsets and the class prediction, ensuring that the optimized point sequence closely approximates the real regularized roof outline in both location and structure. The composite loss function consists of three weighted parts: $L_{poly} = \alpha_1 L_{offset\_all} + \alpha_2 L_{offset\_corner} + \alpha_3 L_{cls\_point}$, where $L_{offset\_all}$ is the global point offset loss, calculated as the Smooth L1 loss on the predicted offset vectors of all 128 contour points; $L_{offset\_corner}$ is the inflection point offset loss, for which the Smooth L1 loss is calculated only on the contour points corresponding to the manually annotated true inflection points; and $L_{cls\_point}$ is the point category loss, for which the category (inflection point / non-inflection point) of all contour points is predicted and evaluated using the Cross Entropy loss function with category weights. The weight of the inflection point category is set to 1.0, and the weight of the non-inflection point category is set to 0.1. $\alpha_1$, $\alpha_2$ and $\alpha_3$ are the weight coefficients of the respective loss terms, where $\alpha_1$ is 1.0, $\alpha_2$ is 0.5 and $\alpha_3$ is 1.0.
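The three-part loss in 4.4 can be sketched for a single contour. Several details are assumptions: losses are averaged over the selected coordinates, the weighted cross entropy is normalized by the sum of class weights, probabilities are assumed to lie strictly between 0 and 1, and the supervision targets are given directly as coordinates. Function names are hypothetical.

```python
import math


def smooth_l1(a, b, beta=1.0):
    d = abs(a - b)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta


def point_offset_loss(pred, target, indices):
    """Mean Smooth L1 over x and y of the selected contour points."""
    terms = [smooth_l1(pred[i][k], target[i][k]) for i in indices for k in (0, 1)]
    return sum(terms) / len(terms) if terms else 0.0


def weighted_ce(probs, labels, w_corner=1.0, w_edge=0.1):
    """Class-weighted cross entropy; probs[i] is P(inflection point) for point i."""
    num = den = 0.0
    for p, is_corner in zip(probs, labels):
        w = w_corner if is_corner else w_edge
        num += -w * math.log(p if is_corner else 1.0 - p)
        den += w
    return num / den


def poly_loss(pred, target, probs, labels, a1=1.0, a2=0.5, a3=1.0):
    """a1 * all-point offset + a2 * corner-only offset + a3 * point category loss."""
    all_idx = range(len(pred))
    corner_idx = [i for i, c in enumerate(labels) if c]
    return (
        a1 * point_offset_loss(pred, target, all_idx)
        + a2 * point_offset_loss(pred, target, corner_idx)
        + a3 * weighted_ce(probs, labels)
    )
```

Counting the annotated inflection points twice (once in the all-point term, once in the corner-only term) places extra emphasis on getting building corners right.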

[0044] Step 5: Post-processing and vectorization output: The process for this step is shown in Figure 7; the optimized intermediate results are converted into the final vector polygons. Specifically, this includes: 5.1 Contour Simplification: Based on the inflection point category probability of each contour point output in step 4, adaptive contour simplification is performed, as shown in Figure 8.

[0045] 5.1.1 Set an initial probability threshold (e.g., 0.5), and retain points with probabilities higher than the threshold as candidate inflection points; 5.1.2 Calculate the rate of change of polygon area before and after simplification. If the change exceeds the set threshold (e.g., 5%), appropriately reduce the probability threshold and re-filter until the area change is less than the threshold. 5.1.3 The filtered point sequence is further geometrically simplified using the Douglas-Peucker algorithm to remove collinear points and obtain a simplified roof outline that expresses the shape with the fewest points.
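Steps 5.1.1 to 5.1.3 can be sketched as a threshold search followed by Douglas-Peucker simplification. The threshold decrement (0.05), the fallback to the unsimplified contour, and the way the closed ring is handled (closing it with its first point before simplification) are assumptions; the contour is assumed to have non-zero area.

```python
def polygon_area(pts):
    """Absolute area of a closed polygon via the shoelace formula."""
    s = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:] + pts[:1]):
        s += x0 * y1 - x1 * y0
    return abs(s) / 2.0


def select_corners(points, corner_prob, start=0.5, max_change=0.05, step=0.05):
    """Lower the probability threshold until the area change is within max_change."""
    full_area = polygon_area(points)
    thr = start
    while True:
        kept = [p for p, pr in zip(points, corner_prob) if pr > thr]
        if len(kept) >= 3:
            if abs(polygon_area(kept) - full_area) / full_area <= max_change:
                return kept, thr
        if thr <= 0.0:
            return list(points), thr  # fall back to the unsimplified contour
        thr = max(0.0, thr - step)


def _point_line_dist(p, a, b):
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    norm = (dx * dx + dy * dy) ** 0.5
    if norm == 0.0:
        return ((px - ax) ** 2 + (py - ay) ** 2) ** 0.5
    return abs(dx * (py - ay) - dy * (px - ax)) / norm


def douglas_peucker(line, eps):
    """Classic recursive Douglas-Peucker on an open polyline."""
    if len(line) < 3:
        return list(line)
    dists = [_point_line_dist(p, line[0], line[-1]) for p in line[1:-1]]
    k = max(range(len(dists)), key=dists.__getitem__) + 1
    if dists[k - 1] <= eps:
        return [line[0], line[-1]]
    return douglas_peucker(line[: k + 1], eps)[:-1] + douglas_peucker(line[k:], eps)


def simplify_ring(points, eps):
    """Apply Douglas-Peucker to a closed ring, then drop the repeated first point."""
    return douglas_peucker(points + [points[0]], eps)[:-1]
```

For a square sampled at its corners and edge midpoints, with high inflection probability only at the corners, `select_corners` keeps the four corners at the initial threshold because dropping the midpoints does not change the area.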

[0046] 5.2 Base Polygon Generation: The coordinates (x0, y0) of each point in the simplified roof outline are added point by point to the offset vector (Δx, Δy) of the corresponding instance predicted in step 3 to obtain the coordinate set of the base outline points. Connecting these points in sequence forms the base polygon of the building.

[0047] 5.3 Polygon Merging and Topology Repair: Check all generated bottom polygons. If there are overlaps or inclusion relationships, merge them according to their spatial location to ensure that each physical building corresponds to only one independent polygon. Remove internal loops and retain only the outer contour points of the polygons. Finally, output a set of building bottom vector polygons with consistent topology and no redundancy.

[0048] Using the above method, this invention also discloses a system for automatically extracting polygons from the base of buildings in remote sensing images, comprising: a feature extraction module for extracting and fusing features from the input remote sensing image and outputting a multi-scale feature map; a mask offset prediction module, which is based on an instance segmentation network and integrates an offset vector prediction branch, for simultaneously outputting a building roof mask and an offset vector from the roof to the base; a contour iteration optimization module for performing contour initialization and multi-round iterative optimization on the roof mask, outputting an optimized contour point sequence and contour point category information; and a post-processing module for performing contour simplification, offset mapping, and polygon merging based on the contour point sequence, contour point category information, and offset vector to generate polygons from the base of the building.

[0049] To verify the effectiveness of the method of this invention, tests were conducted on multiple public and self-built datasets. Using 0.5m resolution non-orthorectified satellite remote sensing images of a certain location in 2023 as inference data, a total of 182 1024×1024 pixel images were selected for inference. Some visualization results are shown below. Figure 9 As shown, from left to right, the image displays the original image, the initial roof mask, the optimized roof polygon, the simplified roof polygon, and the finally generated simplified base polygon. The comparison demonstrates that the method of this invention successfully optimizes and simplifies the initial coarse and offset mask step by step into a geometrically accurate, regularly shaped base vector polygon that can be directly used in GIS.

[0050] Figure 9 shows the original image: the original remote sensing image contains neatly arranged buildings of varying heights, as well as background interference such as roads and vegetation.

[0051] Figure 10 shows the initial roof mask: the building roof area initially predicted by the improved Mask2Former (in step 2). As the images and corresponding vector diagrams show, the model identifies most buildings, but the mask outline is rough, irregular, and jagged. Furthermore, because of the strongly tilted viewing angle, the mask deviates noticeably from the actual ground contour of the buildings (i.e., outline offset).

[0052] Figure 11 shows the real-world image and vector image of the optimized roof mask: after contour iteration optimization (in step 3), the real-world image and the corresponding vector image show that the contour lines fit the real edge of the roof more closely and that the inflection points of the contours become clear and distinct, laying the foundation for subsequent simplification.

[0053] Figure 12 shows the real-world image and vector image of the simplified roof mask: after the contour simplification step in the post-processing module (step 5), the real-world image and the corresponding vector image show that the number of vertices has been greatly reduced, and the roof contour is represented by a very small number of key inflection points and the straight lines connecting them.

[0054] Figure 13 shows the simplified bottom mask: this is the result of translating the simplified roof outline as a whole by the predicted offset vector. As the real-world image and the corresponding vector image show, the polygon has been moved from the roof position and aligned with the bottom position where the building contacts the ground, which resolves the initial mask offset problem. The final output is a regular, simplified, and geometrically accurate vector polygon that can be read directly by GIS software for area calculation, map drawing, or 3D modeling.
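The whole-outline translation described for Figure 13 can be sketched as follows. This is only an illustration: the function names, the pixel coordinates, and the offset value are hypothetical, and the sketch assumes one two-dimensional offset vector per building instance.

```python
def polygon_area(polygon):
    """Unsigned area of a closed polygon via the shoelace formula."""
    pairs = zip(polygon, polygon[1:] + polygon[:1])
    return abs(sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in pairs)) / 2.0


def roof_to_base(roof_polygon, offset):
    """Translate every roof vertex by the same roof-to-base offset (dx, dy)."""
    dx, dy = offset
    return [(x + dx, y + dy) for x, y in roof_polygon]


# Hypothetical simplified roof outline (pixel coordinates) and predicted offset.
roof = [(120.0, 80.0), (160.0, 80.0), (160.0, 110.0), (120.0, 110.0)]
base = roof_to_base(roof, offset=(-6.0, 9.0))
print(base)
```

Because the translation is rigid, the base polygon keeps the shape and area of the simplified roof outline; only its position changes.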

[0055] In addition, in this embodiment, the model of this invention was trained and validated on the BONAI dataset (containing 3300 aerial images of six representative cities, covering multiple scenes and building forms, with annotations for building roof vectors and offset vectors). The test used F1 score, precision, and recall as indicators. The accuracy results are shown in Table 1 and validate the effectiveness of this invention. Here, F1 = 2 × Precision × Recall / (Precision + Recall) combines precision and recall into a single score for evaluating the model; it ranges from 0 to 1, and a value closer to 1 indicates better model performance. Precision is the proportion of the areas identified by the model as buildings that are actually buildings. Recall is the proportion of all real buildings in an image that the model identifies.
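For reference, the standard definitions of the three indicators can be written as a short sketch. The function name and the counts in the example are hypothetical and are not results of this invention.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 80 correct building pixels, 20 false alarms, 20 misses.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
```

F1 is the harmonic mean of precision and recall, so it stays low whenever either of the two is low.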

[0056] Table 1. Accuracy indicators of this method

The above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several improvements and modifications without departing from the technical principles of the present invention, and these improvements and modifications should also be considered within the scope of protection of the present invention.

Claims

1. A method for automatically extracting polygons from the base of buildings in remote sensing images, characterized in that it includes the following steps: S1. Multi-scale feature extraction: extract features from the input remote sensing image to obtain a multi-scale feature map; S2. Joint prediction of mask and offset vector: based on the multi-scale feature map, an offset vector prediction branch is added to the instance segmentation network to generate offset vectors pointing from each pixel position of the roof mask to the corresponding bottom position; the roof mask and offset vectors are input into the spatial transformation network to generate the bottom mask; the network is trained end-to-end using a multi-task composite loss function; S3. Contour iteration optimization: perform contour iteration optimization on the roof mask to obtain the optimized contour point sequence and the category information indicating whether each contour point is an inflection point or a non-inflection point; each round uses feature maps of a different resolution, and the scaling factor of the offset vector updated in each round decreases in successive rounds; S4. Post-processing and vectorization: based on the contour point sequence, category information and offset vector, contour simplification, offset overlay and polygon merging are performed to output the building base polygon in vector format.

2. The method for automatically extracting polygons from the base of buildings in remote sensing images according to claim 1, characterized in that: Step S1 specifically includes: S11. Normalize the input remote sensing images; S12. Use a pre-trained visual Transformer backbone network to extract image features at multiple levels; S13. The image features of the multiple layers are unified to the same number of channels through the convolutional layer to generate a multi-scale feature map.

3. The method for automatically extracting polygons from the base of buildings in remote sensing images according to claim 2, characterized in that: The image features extracted in step S12 include feature maps output from four different network layers selected from the visual Transformer backbone network.

4. The method for automatically extracting polygons from the base of buildings in remote sensing images according to claim 1, characterized in that: Step S2 specifically includes: S21. Based on multi-scale feature maps, generate building roof masks through an instance segmentation network; S22. Add an offset vector prediction branch to the instance segmentation network to generate the offset vector from the roof to the bottom based on the multi-scale feature map; S23. Input the roof mask and the offset vector into a spatial transformation network to generate a bottom mask; S24. Train the network using a composite loss function that includes roof mask category loss, roof mask loss, offset vector loss, and bottom mask loss.

5. The method for automatically extracting polygons from the base of buildings in remote sensing images according to claim 1, characterized in that: The offset vector prediction branch is composed of multiple fully connected layers connected sequentially. Its input is the object query features processed by the Transformer decoder in the instance segmentation network, and its output is a two-dimensional offset vector corresponding to each building instance.

6. The method for automatically extracting polygons from the base of buildings in remote sensing images according to claim 1, characterized in that step S3 specifically includes: S31. Extract the outer contour of the roof mask and interpolate it to a fixed number of contour points to obtain an initial contour point sequence; S32. Using the multi-scale feature map, iteratively optimize the initial contour point sequence through the Transformer decoder to obtain the optimized contour point sequence and the probability of each contour point belonging to an inflection point or a non-inflection point.
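Step S31 can be illustrated with a minimal sketch that resamples a closed contour to a fixed number of points. The function name is hypothetical, and equal arc-length spacing is an assumption: the claim only requires interpolation to a fixed number of points and does not state how the points are spaced or how many are used.

```python
import math


def resample_closed_contour(contour, num_points):
    """Resample a closed contour to num_points points equally spaced along its perimeter."""
    edges = list(zip(contour, contour[1:] + contour[:1]))
    lengths = [math.dist(p, q) for p, q in edges]
    step = sum(lengths) / num_points

    resampled = []
    edge_idx, edge_start = 0, 0.0  # edge_start: perimeter length at the start of the current edge
    for k in range(num_points):
        target = k * step
        # Advance to the edge that contains the target arc length.
        while edge_idx < len(edges) - 1 and edge_start + lengths[edge_idx] <= target:
            edge_start += lengths[edge_idx]
            edge_idx += 1
        (x1, y1), (x2, y2) = edges[edge_idx]
        t = (target - edge_start) / lengths[edge_idx] if lengths[edge_idx] else 0.0
        resampled.append((x1 + t * (x2 - x1), y1 + t * (y2 - y1)))
    return resampled


# Hypothetical 4-vertex mask contour resampled to 8 points.
square = [(0, 0), (4, 0), (4, 4), (0, 4)]
initial_contour = resample_closed_contour(square, num_points=8)
```

A fixed point count gives the Transformer decoder an input sequence of constant length regardless of the building's size or shape.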

7. The method for automatically extracting polygons from the base of buildings in remote sensing images according to claim 6, characterized in that: In step S32, there are six iterations, each using a feature map of different resolutions, and the scaling factor of the offset vector for coordinate updates decreases progressively with each iteration. The scaling factors are 1024, 512, 256, 128, 64, and 32 respectively.
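The coarse-to-fine schedule in this claim can be sketched as follows. Only the six scaling factors come from the claim. The update rule (new coordinate = old coordinate + scale × predicted offset), the function names, and the toy predictor that stands in for the Transformer decoder are all assumptions made for illustration.

```python
SCALES = [1024, 512, 256, 128, 64, 32]  # scaling factors listed in claim 7


def refine_contour(points, predict_deltas, scales=SCALES):
    """Apply one coordinate update per round, scaling the predicted offsets by a shrinking factor."""
    history = []
    for round_idx, scale in enumerate(scales):
        deltas = predict_deltas(points, round_idx)  # assumed to be small, normalized offsets
        points = [(x + scale * dx, y + scale * dy)
                  for (x, y), (dx, dy) in zip(points, deltas)]
        history.append(points)
    return points, history


def toy_predictor(targets, limit=0.01):
    """Hypothetical stand-in for the decoder: steps toward known targets, bounded by `limit`."""
    def clamp(v):
        return max(-limit, min(limit, v))

    def predict(points, round_idx):
        scale = SCALES[round_idx]
        return [(clamp((tx - x) / scale), clamp((ty - y) / scale))
                for (x, y), (tx, ty) in zip(points, targets)]
    return predict


start = [(100.0, 100.0)]
target = [(115.0, 100.0)]
final, history = refine_contour(start, toy_predictor(target))
```

With a bounded normalized offset, the largest possible move shrinks from round to round, so early rounds make coarse corrections and later rounds make fine ones.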

8. The method for automatically extracting polygons from the base of buildings in remote sensing images according to claim 1, characterized in that step S4 specifically includes: S41. Based on the inflection point probability in the category information of the contour points, adaptively simplify the contour points, retain the key inflection points, and control the change in contour area before and after simplification within a preset ratio, where a key inflection point is a contour point retained after screening with a dynamically adjusted probability threshold that keeps the change in contour area before and after simplification within the preset ratio; S42. Add the offset vector to the simplified contour point coordinates point by point to obtain the bottom contour point set; S43. Merge overlapping bottom polygons and output a set of vector polygons with consistent topology.
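Step S41 can be illustrated with a minimal sketch. The claim does not say how the probability threshold is adjusted, so the descending search over candidate thresholds, the 5% area tolerance, and all names below are assumptions.

```python
def polygon_area(polygon):
    """Unsigned area of a closed polygon via the shoelace formula."""
    pairs = zip(polygon, polygon[1:] + polygon[:1])
    return abs(sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in pairs)) / 2.0


def simplify_by_inflection(points, probs, max_area_change=0.05,
                           thresholds=(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1)):
    """Keep points whose inflection probability passes the highest threshold
    that changes the contour area by no more than max_area_change."""
    original_area = polygon_area(points)
    if original_area == 0:
        return list(points), None
    for threshold in thresholds:  # strictest first, so the fewest points are kept
        kept = [p for p, prob in zip(points, probs) if prob >= threshold]
        if len(kept) < 3:
            continue  # not a polygon any more
        change = abs(polygon_area(kept) - original_area) / original_area
        if change <= max_area_change:
            return kept, threshold
    return list(points), None  # no threshold satisfied the area constraint


# Hypothetical L-shaped roof contour with one redundant collinear point
# and one true corner that received only a moderate probability.
contour = [(0, 0), (5, 0), (10, 0), (10, 4), (4, 4), (4, 10), (0, 10)]
probs = [0.95, 0.05, 0.95, 0.95, 0.60, 0.95, 0.95]
simplified, used_threshold = simplify_by_inflection(contour, probs)
```

In this example a threshold of 0.9 would drop the inner corner at (4, 4) and inflate the area by about 28%, so the search relaxes the threshold to 0.6. That keeps the corner and still removes the collinear point at (5, 0).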

9. A system for automatically extracting polygons from the base of buildings in remote sensing images, employing the method described in any one of claims 1-8, characterized in that it includes: a feature extraction module, used to extract and fuse features from the input remote sensing images and output multi-scale feature maps; a mask offset prediction module, based on an instance segmentation network and integrating an offset vector prediction branch, used to synchronously output the building roof mask and the offset vector from the roof to the bottom surface; a contour iteration optimization module, used to perform contour initialization and multi-round iterative optimization on the roof mask and output the optimized contour point sequence and contour point category information; and a post-processing module, used to perform contour simplification, offset mapping and polygon merging based on the contour point sequence, contour point category information and offset vector to generate the building bottom polygon.

10. The system for automatically extracting polygons from the base of buildings in remote sensing images according to claim 9, characterized in that the mask offset prediction module includes: an instance segmentation network unit, based on the Mask2Former architecture and including a pixel decoder and a Transformer decoder, used to generate object query features and the corresponding roof masks; an offset vector prediction branch unit, composed of multiple fully connected layers connected sequentially, whose input is the object query features output by the Transformer decoder and whose output is the offset vector corresponding to each building instance; and a spatial transformation network unit, connected to the outputs of the instance segmentation network unit and the offset vector prediction branch unit, which during the training phase performs a spatial transformation on the roof mask according to the offset vector to generate a bottom mask used to calculate the training loss, and which does not participate in the forward computation during the inference phase.