A 3D terrain construction method based on multimodal semantic constraints
By using a multimodal semantic constraint method, the problem of unmodeled semantic structural characteristics of land features in existing 3D terrain construction is solved. This enables geometric reconstruction of areas such as rivers and buildings and the associated mapping of engineering cost information, thereby improving the spatial consistency of 3D terrain data and the traceability of engineering attributes.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- JIANGSU DINONI INFORMATION TECH CO LTD
- Filing Date
- 2026-03-19
- Publication Date
- 2026-06-30
AI Technical Summary
Existing 3D terrain construction methods fail to effectively and systematically model the semantic structural characteristics of features in space, resulting in elevation anomalies, structural overlap, or unclear semantic boundaries in areas such as rivers and buildings. Furthermore, there is a lack of stable correspondence between engineering cost information and the spatial extent of 3D terrain, which limits the expansion capabilities of 3D terrain data in engineering management and comprehensive applications.
A multimodal semantic constraint method is adopted: optical remote sensing images undergo texture enhancement and geometric consistency correction, semantic constraint information is constructed, a set of semantic masks is generated, and a mapping relationship between semantic categories and surface physical attributes is established. Combined with semantic parsing of engineering cost documents, engineering cost information is associated with three-dimensional terrain entities under spatial consistency and semantic consistency constraints.
The method achieves regional geometric reconstruction of rivers, buildings and non-building areas, avoids the structural distortion caused by decoupling semantics from geometry, and maps engineering cost information onto three-dimensional terrain entities, so that the three-dimensional terrain data both expresses terrain and carries engineering attributes.
Figure CN121883752B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of 3D terrain modeling technology, and specifically to a 3D terrain construction method based on multimodal semantic constraints.
Background Technology
[0002] With the development of remote sensing mapping and 3D terrain modeling technologies, 3D terrain data has been widely applied in engineering planning, spatial analysis, and decision support. Existing 3D terrain construction methods typically use optical remote sensing imagery or digital surface models as the main data source, generating 3D terrain through image matching, elevation interpolation, or surface fitting. Their focus is primarily on the continuity and overall accuracy of terrain geometry. However, these methods often fail to systematically model the semantic structural characteristics of land features in space and lack effective constraints on the spatial representation differences of different land cover types. This leads to problems such as elevation anomalies, structural overlap, or unclear semantic boundaries in areas with typical land features such as rivers and buildings.
[0003] On the other hand, engineering attribute data such as engineering cost usually exist independently in the form of text or tables, and their spatial location depends on manual experience or rough annotation. Existing technologies lack effective means to semantically analyze engineering cost information and establish a stable correspondence with the three-dimensional terrain spatial range, making it difficult to ensure the consistency and traceability between engineering attributes and terrain entities, thus limiting the expansion capabilities of three-dimensional terrain results in engineering management and comprehensive applications.
Summary of the Invention
[0004] To address the shortcomings of existing technologies, this invention proposes a three-dimensional terrain construction method based on multimodal semantic constraints.
[0005] The technical solution to achieve the purpose of this invention is as follows:
[0006] A method for constructing 3D terrain based on multimodal semantic constraints includes the following steps:
[0007] Acquire optical remote sensing images and a digital surface model of the target area, perform texture enhancement processing on the optical remote sensing images, use the digital surface model as a unified elevation benchmark, perform geometric consistency correction on the enhanced optical remote sensing images, and form spatial reference data.
[0008] Based on spatial reference data, semantic constraint information is constructed to define the semantic boundaries, structural morphology and spatial representation priors of ground features. Under the constraint of the semantic constraint information, joint semantic parsing and structural consistency correction are performed on optical remote sensing images to generate a set of semantic masks corresponding to each ground feature cover type.
[0009] Based on a set of semantic masks, a mapping relationship between semantic categories and surface physical attributes is established, and regional geometric reconstruction is performed on the digital surface model according to the mapping relationship to generate three-dimensional terrain entities.
[0010] Obtain the engineering cost document corresponding to the target area, and perform layout analysis, character recognition, semantic understanding and table structure restoration on the engineering cost document to form an engineering cost data set.
[0011] Based on the engineering cost data set, under the constraints of spatial and semantic consistency, the engineering cost information is associated with the corresponding three-dimensional terrain entities to form a three-dimensional terrain data result.
[0012] Furthermore, semantic constraint information construction and joint semantic parsing include:
[0013] Based on the spatial reference data, the spatial distribution patterns and continuity characteristics of typical land features such as rivers and buildings within the target area are analyzed, together with their correlation with topographic elevation. The spatial representation characteristics of the typical land features are extracted and, on this basis, a multi-dimensional semantic constraint rule base is constructed that encompasses semantic boundaries, structural morphology, and spatial representation priors. The semantic boundaries define the spatial boundaries of different land feature coverage types, the structural morphology constrains the continuity and geometric features of land features in space, and the spatial representation priors define the inherent correspondence between land features and topographic elevation and slope.
[0014] The constraint rules in the semantic constraint rule base are quantified and transformed into constraint factors that can participate in model optimization calculations. These constraint factors are then embedded into the semantic segmentation model training and inference process. Pixel-level joint semantic parsing is performed on optical remote sensing images to obtain the initial distribution results of various land features in space. Subsequently, based on the inherent consistency rules of land feature spatial expression, structural consistency correction is performed on the breaks and missegmented areas in the initial distribution results caused by image noise, local occlusion, or imaging differences to generate structurally continuous and semantically consistent land feature distribution results.
[0015] Furthermore, establishing a mapping relationship between semantic categories and surface physical attributes to perform regional geometric reconstruction includes:
[0016] Combining the topographic features of the target area and the engineering attributes corresponding to different land features, the inherent physical nature of the core semantic categories such as rivers, buildings and non-buildings in the semantic mask set is defined, and a four-dimensional mapping rule library containing semantic categories, physical attribute types, quantitative indicators and constraint types is constructed to clarify the physical constraints of different semantic categories in the geometric reconstruction process.
[0017] The mapping rule base is assigned pixel by pixel based on the semantic mask set, and the semantic category and corresponding physical attribute constraint are transformed into a semantic-physical attribute constraint matrix with precise pixel-by-pixel constraints. Under the constraint of this matrix, the corresponding regional geometric reconstruction processing is performed on the river area, building area and non-building area respectively.
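As a non-limiting illustration, the pixel-by-pixel assignment described above can be sketched as a lookup from a semantic mask into a rule base. The rule names, attribute types and indicator values below are hypothetical placeholders chosen for the example and are not specified by the method.

```python
from typing import NamedTuple

class PhysicalRule(NamedTuple):
    attribute_type: str    # physical attribute type
    indicator: float       # quantitative indicator (illustrative value only)
    constraint_type: str   # kind of constraint applied during reconstruction

# Hypothetical rule base: semantic category -> (attribute, indicator, constraint).
RULE_BASE = {
    "river": PhysicalRule("lateral_elevation_spread_m", 0.0, "equipotential"),
    "building": PhysicalRule("min_height_above_base_m", 0.5, "separation"),
    "non_building": PhysicalRule("local_relief_limit_m", 1.0, "continuity"),
}

def build_constraint_matrix(mask, rule_base=RULE_BASE):
    """Return a matrix with the same shape as `mask`, holding one rule per pixel."""
    return [[rule_base[category] for category in row] for row in mask]

# A 2 x 3 semantic mask with one category label per pixel.
mask = [
    ["river", "river", "non_building"],
    ["building", "non_building", "non_building"],
]
matrix = build_constraint_matrix(mask)
print(matrix[0][0].constraint_type)  # constraint type assigned to the first pixel
```

Keeping the rule base separate from the mask means the per-category parameters can be changed without regenerating the semantic masks.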
[0018] Further, the geometric reconstruction of the river region includes:
[0019] The set of river-covered pixels is extracted using the river semantic mask. Combined with the elevation information of the corresponding pixels in the spatial reference data, a hydrological analysis algorithm is used to analyze the elevation distribution of the river area and determine the main direction of the river in space.
[0020] Under the constraint of the determined main river direction, a directional consistency constraint that is continuous along the main direction and maintains equipotential in the lateral direction is applied to the set of pixels covering the river. A river area elevation correction model is constructed to correct elevation anomalies caused by image matching errors or noise. Local gaps are filled by interpolation to generate a continuous three-dimensional elevation surface of the river that meets the spatial expression characteristics of the river.
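As a non-limiting illustration, the directional consistency constraint can be sketched on a river that has already been resampled into cross-sections ordered along its main direction. The sketch makes three assumptions not stated in the method: each cross-section is levelled to its median elevation, gaps are filled by linear interpolation between neighbouring cross-sections, and the water surface is forced to be non-increasing downstream. The function names are hypothetical.

```python
from statistics import median

def fill_gaps(levels):
    """Linearly interpolate None entries between the nearest known levels."""
    known = [i for i, z in enumerate(levels) if z is not None]
    if not known:
        raise ValueError("no valid cross-section to interpolate from")
    filled = list(levels)
    for i, z in enumerate(levels):
        if z is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = levels[right]
        elif right is None:
            filled[i] = levels[left]
        else:
            t = (i - left) / (right - left)
            filled[i] = levels[left] + t * (levels[right] - levels[left])
    return filled

def correct_river(cross_sections):
    """cross_sections: rows of elevations ordered upstream to downstream; None marks a gap."""
    # Lateral equipotential: one level per cross-section (median resists outliers).
    levels = []
    for section in cross_sections:
        valid = [z for z in section if z is not None]
        levels.append(median(valid) if valid else None)
    levels = fill_gaps(levels)
    # Continuity along the main direction: running minimum, so the level never rises downstream.
    corrected, running = [], float("inf")
    for level in levels:
        running = min(running, level)
        corrected.append(running)
    return [[level] * len(section) for level, section in zip(corrected, cross_sections)]

sections = [[10.0, 10.2, 9.8], [None, None, None], [9.0, 14.0, 9.0], [9.5, 9.5, 9.5]]
print(correct_river(sections))
```

In the example, the 14.0 m spike in the third cross-section is suppressed by the median, the empty second cross-section is interpolated, and the last cross-section is lowered so that the surface does not rise downstream.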
[0021] Further, the geometric reconstruction of the building area includes:
[0022] The building cover pixel set is extracted based on the building semantic mask, and the ground base elevation corresponding to the building area is calculated by using the neighborhood elevation mean method with the elevation information of the non-building area around the building as a reference.
[0023] Based on the ground base elevation, the main building and the ground base are separated. A second-order trend surface fitting is performed on the ground base to restore the continuous ground shape. A transition zone is set in the building edge area to smooth the elevation connection between the main building and the ground base, so as to ensure the rationality and continuity of the geometric structure of the building area.
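As a non-limiting illustration, the neighborhood elevation mean and the separation of building body from ground base can be sketched as follows. The ring width of one pixel and the function names are assumptions made for the example; the second-order trend surface fitting and the edge transition zone are not shown.

```python
def ground_base_elevation(dsm, building_mask, ring=1):
    """Mean elevation of the non-building pixels within `ring` pixels of the building."""
    rows, cols = len(dsm), len(dsm[0])
    neighbours = set()
    for r in range(rows):
        for c in range(cols):
            if not building_mask[r][c]:
                continue
            for dr in range(-ring, ring + 1):
                for dc in range(-ring, ring + 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols and not building_mask[rr][cc]:
                        neighbours.add((rr, cc))
    if not neighbours:
        raise ValueError("building has no non-building neighbours")
    return sum(dsm[r][c] for r, c in neighbours) / len(neighbours)

def split_building(dsm, building_mask, ring=1):
    """Return (base elevation, ground grid, building height grid)."""
    base = ground_base_elevation(dsm, building_mask, ring)
    rows, cols = len(dsm), len(dsm[0])
    ground = [[base if building_mask[r][c] else dsm[r][c] for c in range(cols)]
              for r in range(rows)]
    height = [[dsm[r][c] - base if building_mask[r][c] else 0.0 for c in range(cols)]
              for r in range(rows)]
    return base, ground, height

dsm = [[10.0, 10.0, 10.0], [10.0, 16.0, 10.0], [10.0, 10.0, 10.0]]
mask = [[False, False, False], [False, True, False], [False, False, False]]
base, ground, height = split_building(dsm, mask)
print(base, height[1][1])
```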
[0024] Further, the geometric reconstruction of non-building areas includes:
[0025] For non-building areas in the semantic mask set, an elevation continuity constraint is applied to the whole area, and a constraint is applied to the local undulations. A global trend surface analysis algorithm is used to fit the overall elevation distribution of the non-building areas and identify unstructured disturbance points caused by noise or error.
[0026] The perturbation points are corrected by a weighted smoothing algorithm, and the terrain of non-building areas is adjusted in a targeted manner according to the parameter settings of the corresponding semantic category in the semantic-physical attribute constraint matrix, so that the reconstructed terrain maintains the overall continuity and conforms to the spatial expression characteristics of various land features.
[0027] Further, the analysis of the layout and character recognition of engineering cost documents includes:
[0028] The acquired multi-format engineering cost documents are subjected to uniform format normalization processing, and image enhancement algorithms are used to optimize the image quality of paper scans in order to eliminate adverse factors such as noise and distortion.
[0029] Based on the optimized document image, a deep learning model based on Mask R-CNN is used to segment the document layout elements. Optical character recognition processing is then performed on the segmented text blocks and table areas to generate text units bound to the corresponding spatial locations and page number information.
[0030] Furthermore, semantic understanding and table structure restoration of engineering cost documents include:
[0031] A language model fine-tuned on a corpus from the engineering cost field is used to parse the field correspondence and context inheritance relationships between the text units.
[0032] By combining the parsing results with the row and column coordinate information and semantic association rules of the document layout, the table structure in the document is restored, and cross-page tables are spliced together. At the same time, the discrete fields in the restored tables are reorganized to form logically continuous standardized table data.
[0033] Furthermore, the engineering cost data set is formed, including:
[0034] Core cost information such as project name, geographical location description, quantity of work, unit of measurement and cost are extracted from the standardized table data and text units, and the extracted cost information is converted into numerical values and standardized in terms of unit of measurement.
[0035] A geocoding engine is used to convert textual geographic location descriptions in engineering cost information into latitude and longitude coordinates, and semantic mask boundaries are used to correct the coordinate range, thus constructing an engineering cost data set with clear spatial orientation.
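As a non-limiting illustration, the numerical conversion and unit standardization of extracted cost fields can be sketched as below. The unit table, field names and base units are hypothetical; an actual implementation would use the units found in the project's own cost documents. Geocoding is not shown.

```python
import re

# Hypothetical conversion table: recognised unit -> (base unit, multiplier).
UNIT_TO_BASE = {
    "m": ("m", 1.0),
    "km": ("m", 1000.0),
    "m2": ("m2", 1.0),
    "ha": ("m2", 10000.0),
    "m3": ("m3", 1.0),
}

def parse_number(text):
    """Convert a recognised numeric string such as '1,250,000.00' to a float."""
    return float(re.sub(r"[,\s]", "", text))

def normalize_item(quantity_text, unit_text, cost_text):
    """Return quantity in base units, the base unit, and the cost as a number."""
    unit_key = unit_text.strip().lower().replace("²", "2").replace("³", "3")
    if unit_key not in UNIT_TO_BASE:
        raise ValueError(f"unknown unit: {unit_text!r}")
    base_unit, factor = UNIT_TO_BASE[unit_key]
    return {
        "quantity": parse_number(quantity_text) * factor,
        "unit": base_unit,
        "cost": parse_number(cost_text),
    }

print(normalize_item("1.5", "km", "1,250,000.00"))
```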
[0036] Furthermore, by mapping engineering cost information with three-dimensional terrain entities, three-dimensional terrain data results are generated, including:
[0037] The spatial location descriptions in the engineering cost data set are parsed and broken down into standardized location elements to determine the spatial range of the engineering project. Under the dual constraints of spatial consistency and semantic consistency, the spatial range is matched to the corresponding three-dimensional terrain entity area.
[0038] Based on three-dimensional terrain entity areas, a multi-dimensional association mapping relationship is established between engineering cost data and the three-dimensional terrain entities to which the area belongs. Engineering cost information is written into the attribute fields of the corresponding three-dimensional terrain entities and association mapping records are generated. After integrating terrain geometric information and engineering economic attributes, consistency verification is performed, and finally, three-dimensional terrain data results integrating spatial geometric information and engineering economic attributes are obtained.
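As a non-limiting illustration, matching under the dual constraints can be sketched with axis-aligned bounding boxes: a cost item is attached to the terrain entity of the same semantic category (semantic consistency) whose extent overlaps the project extent the most (spatial consistency). The data structure, field names and the use of bounding boxes instead of exact mask geometry are assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class TerrainEntity:
    entity_id: str
    category: str                 # semantic category, e.g. "river"
    bbox: tuple                   # (min_x, min_y, max_x, max_y)
    attributes: dict = field(default_factory=dict)

def overlap_area(a, b):
    """Area of intersection of two bounding boxes, 0.0 if they do not overlap."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return width * height if width > 0 and height > 0 else 0.0

def associate(cost_item, entities):
    """Write the cost item into the best-matching entity and return a mapping record."""
    scored = [(overlap_area(e.bbox, cost_item["bbox"]), e)
              for e in entities if e.category == cost_item["category"]]
    scored = [(area, e) for area, e in scored if area > 0]
    if not scored:
        return None  # no entity satisfies both constraints
    area, best = max(scored, key=lambda pair: pair[0])
    best.attributes.setdefault("cost_items", []).append(
        {"project": cost_item["project"], "cost": cost_item["cost"]})
    return {"entity_id": best.entity_id, "project": cost_item["project"], "overlap": area}

entities = [TerrainEntity("R1", "river", (0, 0, 10, 2)),
            TerrainEntity("B1", "building", (4, 1, 6, 3))]
item = {"project": "Dredging", "category": "river", "bbox": (3, 0, 8, 2), "cost": 120000.0}
print(associate(item, entities))
```

Returning `None` when no entity satisfies both constraints leaves unmatched items available for the subsequent consistency verification instead of forcing an association.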
[0039] Compared with the prior art, the significant advantages of this invention are:
[0040] 1. By constructing multidimensional semantic constraints and quantifying and embedding them into the semantic parsing and terrain reconstruction process, the semantic categories and surface physical attributes are kept consistent in the three-dimensional terrain construction, realizing the regional geometric reconstruction of rivers, buildings and non-building areas, and avoiding the structural distortion problem caused by the decoupling of semantics and geometry.
[0041] 2. By performing semantic parsing on engineering cost documents and constructing spatially oriented engineering cost data, under the constraints of spatial and semantic consistency, engineering cost information is associated and mapped with three-dimensional terrain entities, so that the three-dimensional terrain data has both terrain expression capabilities and engineering attribute carrying capabilities.
Attached Figure Description
[0042] Figure 1 is a flowchart of the 3D terrain construction method based on multimodal semantic constraints.
[0043] Figure 2 is a flowchart of the semantic constraint construction and joint semantic parsing process in this invention.
[0044] Figure 3 is a flowchart of the regional geometric reconstruction process in this invention.
[0045] Figure 4 is a flowchart illustrating the mapping process between engineering cost information and three-dimensional terrain entities in this invention.
Detailed Implementation
[0046] The present invention will be further described in detail below with reference to the accompanying drawings and embodiments.
[0047] This invention discloses a method for constructing three-dimensional terrain based on multimodal semantic constraints, comprising the following steps:
[0048] S1: Acquire optical remote sensing images and a digital surface model of the target area, perform texture enhancement processing on the optical remote sensing images, and use the digital surface model as the sole elevation reference to perform geometric consistency correction on the enhanced optical remote sensing images, so that the image pixels and the surface elevation correspond one-to-one within the same spatial reference frame, forming spatial reference data.
[0049] S2: Based on spatial reference data and according to the spatial expression characteristics of target land cover types, semantic constraint information is constructed. The semantic constraint information is used to limit the semantic boundaries, structural forms and spatial expression priors of different land cover types in space. Under the constraint of the semantic constraint information, joint semantic analysis is performed on optical remote sensing images to obtain the initial spatial distribution results of different land cover types, including rivers and buildings. According to the inherent consistency rules of land cover spatial expression, the initial distribution results are corrected for structural consistency to repair semantic breaks and missegmentation caused by image noise, local occlusion and imaging differences, and generate semantic mask sets corresponding to each land cover type.
[0050] S3: Based on a set of semantic masks, a mapping relationship between semantic categories and surface physical attributes is established, and regional geometric reconstruction is performed on the digital surface model according to this mapping relationship to generate three-dimensional terrain entities. Specifically, under the constraint of the river semantic mask, a directional consistency constraint that is continuous along the mainstream direction and maintains equipotential laterally is applied to the elevation of the river area to correct elevation anomalies caused by image matching errors. Under the constraint of the building semantic mask, terrain structure separation and reconstruction is performed on the building-covered area to restore the continuous ground base. In non-building areas, a continuity constraint that is continuously maintained overall and locally undulating is applied to the terrain elevation to suppress unstructured disturbances.
[0051] S4: Obtain the engineering cost document corresponding to the target area, perform layout analysis and optical character recognition processing on the engineering cost document to form text units containing text content and spatial location information; based on the text units, use a language model to parse the field correspondence and context inheritance relationship between the text units, perform table structure restoration and field consistency reorganization on the discrete text content to form logically continuous engineering cost table data; based on the engineering cost table data, extract engineering cost information, including project name, geographical location description, engineering quantity, unit of measurement and total cost, perform numerical processing on the extracted engineering cost information to form a spatially oriented engineering cost data set;
[0052] S5: Based on the engineering cost data set, the spatial location description of the engineering project is analyzed to determine its corresponding spatial range in the three-dimensional terrain. Under the constraints of spatial consistency and semantic consistency, the engineering cost information is associated and mapped with the corresponding three-dimensional terrain entities to form a three-dimensional terrain data result containing terrain geometric information and engineering economic attributes.
[0053] Refer to Figure 1, which is a flowchart of the 3D terrain construction method based on multimodal semantic constraints.
[0054] Step S1 involves acquiring and enhancing optical remote sensing images, performing geometric correction based on the digital surface model, and ultimately generating spatial reference data with a one-to-one correspondence between pixels and elevations, including:
[0055] S101: Target area data source acquisition and preprocessing.
[0056] Acquire optical remote sensing images and digital surface models of the target area. The optical remote sensing images are panchromatic remote sensing images with a resolution of not less than 0.5 meters, covering the entire spatial range of the target area, and the images are free from large-area cloud shadows, fog obscuration, and stripe noise.
[0057] The digital surface model is a product generated based on lidar measurement or stereo remote sensing image matching. Its elevation accuracy is better than 1 meter, and its spatial resolution is consistent with that of optical remote sensing images. Its spatial reference system adopts the WGS-84 coordinate system and the EGM96 geoid model.
[0058] The acquired optical remote sensing images are preprocessed by converting the image grayscale values into surface reflectance through radiometric calibration to eliminate the influence of sensor response differences. At the same time, the digital surface model is preprocessed by removing abnormal elevation points and filling the resulting data gaps through Kriging interpolation, generating a complete and continuous preprocessed digital surface model. Abnormal elevation points include convex points that are higher than three times the standard deviation of the surrounding surface and concave points that are lower than three times the standard deviation of the surrounding surface.
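As a non-limiting illustration, the detection of abnormal elevation points can be sketched as a local three-sigma test. The sketch assumes that "surrounding surface" means a square window around the pixel with the pixel itself excluded, and that deviation is measured from the window mean; the window radius is a hypothetical parameter. The Kriging gap filling is not shown.

```python
from statistics import mean, pstdev

def flag_abnormal_points(dsm, radius=2, k=3.0):
    """Flag pixels deviating from their neighbourhood mean by more than k standard deviations."""
    rows, cols = len(dsm), len(dsm[0])
    flags = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Neighbourhood window clipped at the grid edge, centre pixel excluded.
            window = [dsm[rr][cc]
                      for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                      for cc in range(max(0, c - radius), min(cols, c + radius + 1))
                      if (rr, cc) != (r, c)]
            mu, sigma = mean(window), pstdev(window)
            flags[r][c] = abs(dsm[r][c] - mu) > k * sigma
    return flags

# A flat 5 x 5 surface at 10 m with a single 50 m spike in the centre.
dsm = [[10.0] * 5 for _ in range(5)]
dsm[2][2] = 50.0
flags = flag_abnormal_points(dsm)
print(sum(flag for row in flags for flag in row))  # number of flagged pixels
```

Excluding the centre pixel from its own window keeps a large spike from inflating the standard deviation it is tested against.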
[0059] S102: Texture enhancement processing of optical remote sensing images.
[0060] A multi-scale Retinex algorithm is used to enhance the texture of preprocessed optical remote sensing images. A multi-scale Gaussian pyramid is constructed to decompose the images into low-frequency and high-frequency components at different scales. The low-frequency components are adaptively compressed to suppress texture blurring caused by uneven illumination. The high-frequency components are weighted and enhanced using the Laplacian enhancement operator to strengthen the edges of ground features and texture details. The enhancement coefficient is adaptively adjusted according to the image gray-level variance. The larger the gray-level variance, the smaller the enhancement coefficient to avoid noise amplification; the smaller the gray-level variance, the larger the enhancement coefficient to improve texture recognition.
[0061] The enhanced image undergoes adaptive contrast adjustment, and a histogram equalization algorithm is used to optimize the image grayscale distribution. Based on the statistical characteristics of the image grayscale histogram, the grayscale values are mapped to the [0,255] range, so that both dark and bright details of the image are clearly presented. Finally, the adjusted image is smoothed by a median filtering algorithm with a window size of 3×3 to remove Gaussian noise introduced during the enhancement process, preserve the integrity of the edges of ground features, and generate an enhanced optical remote sensing image.
[0062] S103: Geometric consistency correction based on digital surface model.
[0063] Using the preprocessed digital surface model as the sole elevation benchmark, a unified spatial reference frame is established to determine the spatial correspondence between image pixels and digital surface model grid cells. First, the geographic coordinate information (latitude and longitude) and elevation information of the digital surface model are extracted to construct a three-dimensional spatial grid for the target area. The grid spacing is consistent with the pixel spacing of the optical remote sensing image, so that each grid cell corresponds to a unique elevation value and geographic coordinate.
[0064] Geometric distortion correction is performed on the texture-enhanced optical remote sensing image. A rational function model (RFM) is used to establish the mapping relationship between image pixel coordinates and geographic coordinates. Based on the geographic coordinates of the digital surface model, no fewer than 20 evenly distributed ground control points are selected. These control points must be clearly distinguishable in both the optical remote sensing image and the digital surface model, and must cover the edge and center of the target area. The parameters of the rational function model are solved by the least squares method to correct the translation, rotation, scaling and affine distortion of the image, so that the image pixel coordinates are accurately mapped to the corresponding geographic coordinates.
[0065] Elevation-pixel consistency calibration is then performed. For each image pixel, the corresponding elevation value is extracted from the digital surface model based on its mapped geographic coordinates, and a three-part relationship of "pixel coordinates-geographic coordinates-elevation value" is established. A bilinear interpolation algorithm is used to optimize the matching accuracy between pixels and elevation and to eliminate matching deviations caused by grid misalignment, ensuring that each image pixel corresponds one-to-one with a unique surface elevation value within the same spatial reference frame, without spatial offset or elevation mismatch.
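As a non-limiting illustration, the bilinear interpolation step can be sketched as follows. The sketch assumes the mapped geographic coordinate has already been converted to fractional grid coordinates (`x` along columns, `y` along rows); the function name is hypothetical.

```python
import math

def bilinear_elevation(dsm, x, y):
    """Interpolate elevation at fractional grid position (x = column, y = row)."""
    rows, cols = len(dsm), len(dsm[0])
    if not (0 <= x <= cols - 1 and 0 <= y <= rows - 1):
        raise ValueError("coordinate outside the grid")
    c0, r0 = int(math.floor(x)), int(math.floor(y))
    # Clamp the upper neighbours so positions on the last row or column stay in range.
    c1, r1 = min(c0 + 1, cols - 1), min(r0 + 1, rows - 1)
    tx, ty = x - c0, y - r0
    top = dsm[r0][c0] * (1 - tx) + dsm[r0][c1] * tx
    bottom = dsm[r1][c0] * (1 - tx) + dsm[r1][c1] * tx
    return top * (1 - ty) + bottom * ty

dsm = [[10.0, 20.0], [30.0, 40.0]]
print(bilinear_elevation(dsm, 0.5, 0.5))  # centre of the four grid nodes
```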
[0066] S104: Spatial reference data generation and verification.
[0067] By integrating geometrically consistent optical remote sensing images with associated digital surface elevation information, spatial reference data is generated. This data includes image texture information, geographic coordinate information, and corresponding elevation information. The three maintain spatial consistency and can be directly used for subsequent semantic constraint construction and 3D terrain reconstruction.
[0068] The generated spatial reference data is subjected to accuracy verification. Fifty verification points are randomly selected, and the elevation values of the verification points in the spatial reference data are compared with the original elevation values of the digital surface model. The root mean square error (RMSE) of elevation is calculated and is required to be ≤0.3 meters. At the same time, the mapping accuracy between image pixels and geographic coordinates is checked to ensure that the pixel positioning error is ≤1 pixel. If the verification fails, the process returns to step S103 to re-optimize the rational function model parameters until the accuracy requirements are met, and qualified spatial reference data is finally output.
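As a non-limiting illustration, the elevation check can be sketched as below, using the 50 verification points and the 0.3 meter limit stated above. The fixed random seed and the function names are assumptions made so that the example is reproducible; the pixel positioning check is not shown.

```python
import math
import random

def elevation_rmse(reference, candidate, points):
    """Root mean square error of elevation over the given (row, col) points."""
    squared = [(candidate[r][c] - reference[r][c]) ** 2 for r, c in points]
    return math.sqrt(sum(squared) / len(squared))

def verify(reference, candidate, n_points=50, rmse_limit=0.3, seed=0):
    """Sample verification points at random and test the RMSE against the limit."""
    rng = random.Random(seed)
    rows, cols = len(reference), len(reference[0])
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    points = rng.sample(cells, min(n_points, len(cells)))
    rmse = elevation_rmse(reference, candidate, points)
    return rmse, rmse <= rmse_limit

reference = [[100.0] * 10 for _ in range(10)]
candidate = [[100.2] * 10 for _ in range(10)]
print(verify(reference, candidate))
```

When the second returned value is `False`, the caller would return to the geometric correction step and re-solve the model parameters.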
[0069] As shown in Figure 2, step S2 involves constructing semantic constraints based on spatial reference data to guide joint semantic parsing of images and obtain initial ground feature distributions. After structural consistency correction, a precise set of semantic masks is finally output, including:
[0070] S201: Extraction of spatial representation characteristics of target features and construction of semantic constraint information.
[0071] Based on spatial reference data, the spatial characteristics of typical land features within the target area are extracted, including: the linear continuity of rivers, their orientation along low-lying areas, the range of width gradients, and their adjacency with surrounding land features. The width gradient range is 2-50 meters, adjusted according to the regional topography. The blocky discreteness of buildings, their verticality (elevation more than 0.5 meters above the surrounding ground surface), their regularity of outlines, and their clustered distribution characteristics in dense areas are also extracted. At the same time, the spatial features of auxiliary land features such as vegetation and roads are extracted as semantic distinction references.
[0072] Construct multi-dimensional semantic constraint information to form a constraint rule base that includes semantic boundary constraints, structural morphological constraints, and spatial representation prior constraints. Specifically:
[0073] Semantic boundary constraints are defined as follows: spatial topological relationships are used to limit the boundaries of features. River boundaries must be consistent with the direction of topographic contour lines. Building boundaries must not cross the semantic range of rivers and must maintain a distance threshold of ≥1 meter from road boundaries.
[0074] The structural morphology constraints are defined as follows: rivers must satisfy the requirement of a continuous linear structure, and a break gap of ≤2 pixels is considered valid continuity; buildings must satisfy the requirement of an outline concavity / convexity of ≤0.3, where concavity / convexity is the value of the actual length of the outline divided by the perimeter of the smallest circumscribed rectangle; the pixel area of a single building is ≥8 pixels, corresponding to an actual area of ≥2 square meters;
[0075] The spatial representation prior constraints are defined as follows: rivers are preferentially distributed in low-lying areas where the elevation is less than three times the standard deviation of the surrounding area, buildings are preferentially distributed in gentle areas with a terrain slope of ≤15°, and the difference between the building elevation and the ground base elevation is ≥0.5 meters.
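As a non-limiting illustration, two of the structural morphology rules can be checked as follows: the river continuity rule (break gaps of ≤2 pixels count as continuous) and the minimum building size (≥8 pixels, which at a 0.5 meter resolution is 8 × 0.25 m² = 2 m²). The sketch assumes the river has been reduced to a one-dimensional sequence of flags along its centerline; the function names are hypothetical.

```python
def max_gap(river_flags):
    """Longest run of non-river pixels lying between two river pixels."""
    longest = current = 0
    seen_river = False
    for flag in river_flags:
        if flag:
            if seen_river:
                longest = max(longest, current)
            seen_river = True
            current = 0
        elif seen_river:
            current += 1  # trailing gaps are never closed, so they are not counted
    return longest

def river_is_continuous(river_flags, max_gap_px=2):
    """True when every break between river pixels is at most max_gap_px long."""
    return max_gap(river_flags) <= max_gap_px

def building_size_check(pixel_count, resolution_m=0.5, min_pixels=8):
    """Return (meets the pixel threshold, ground area in square meters)."""
    return pixel_count >= min_pixels, pixel_count * resolution_m ** 2

print(river_is_continuous([True, False, False, True]))
print(building_size_check(8))
```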
[0076] S202: Semantic constraint information quantification and total loss calculation.
[0077] To achieve accurate guidance of semantic constraint information for semantic parsing models, the aforementioned multi-dimensional semantic constraint rules are quantified into constraint factors that can be incorporated into the model training process. By constructing a weighted constraint loss system, semantic rules are embedded into the model loss function. This allows the model to follow the spatial representation rules of ground features from the training level, avoiding the semantic bias caused by traditional semantic segmentation models that rely solely on pixel texture features, and significantly improving the accuracy and rationality of the parsing results.
[0078] The three types of constraints (semantic boundary, structural morphology, and spatial representation prior) are converted into numerical form, so that qualitative rules such as topological relationships, morphological parameters, and elevation priors become computable loss terms. Among these, the semantic boundary constraint loss $L_{b}$ is calculated from the topological relationship deviations between ground features: the difference between the predicted boundary and the standard topological boundary is quantified using the two-way Hausdorff distance. The standard topological boundary is determined by the semantic boundary constraint rules, such as river boundaries aligning with contour lines and a building-road spacing of ≥1 meter. The calculation formula is as follows:
[0079] $$L_{b} = \max\{\,h(B_{p}, B_{s}),\ h(B_{s}, B_{p})\,\}, \qquad h(B_{p}, B_{s}) = \max_{a \in B_{p}} \min_{b \in B_{s}} d(a, b),$$
[0080] where $B_{p}$ is the set of ground feature boundary pixels predicted by the model, $B_{s}$ is the standard topological boundary pixel set, and $h(\cdot,\cdot)$ is the one-way Hausdorff distance, which uses the Euclidean distance $d(a, b)$ to compute the maximum of the minimum distances between boundary pixels; a larger distance indicates a more significant boundary deviation. $a$ is any pixel coordinate point in $B_{p}$, and $b$ is any pixel coordinate point in $B_{s}$. A smaller loss value indicates higher accuracy in the model's boundary prediction.
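As a non-limiting illustration, the two-way Hausdorff distance between a predicted boundary and a standard boundary can be computed by brute force as below. Taking the maximum of the two one-way distances is the usual definition and is an assumption about the formula used here; the function names are hypothetical, and a practical implementation on large boundaries would use a spatial index.

```python
import math

def directed_hausdorff(source, target):
    """One-way distance: the largest nearest-neighbour distance from source to target."""
    return max(min(math.dist(p, q) for q in target) for p in source)

def hausdorff(a, b):
    """Two-way Hausdorff distance between two pixel sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

predicted = [(0, 0), (1, 0), (4, 0)]   # predicted boundary pixels (x, y)
standard = [(0, 0), (1, 0)]            # standard topological boundary pixels
print(hausdorff(predicted, standard))
```

In the example the stray predicted pixel at (4, 0) lies 3 pixels from the nearest standard boundary pixel, so the distance is 3 even though the other pixels coincide.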
[0081] The structural form constraint loss is calculated using a weighted summation method that integrates the constraints of building outline regularity and river linear continuity. The formula is as follows:
[0082] $$L_{s} = \alpha L_{bld} + (1 - \alpha) L_{riv},$$
[0083] where $\alpha$ is a weighting coefficient with a value of 0.5, used to balance the morphological constraints of the two types of land features. $L_{bld}$ is the building form constraint loss, computed from the actual concavity/convexity of the building outline, which is the quotient of the actual length of the building outline divided by the perimeter of the building's minimum bounding rectangle. The standard concavity/convexity threshold for the building outline is 0.3; once the threshold is exceeded, the loss value increases progressively. $L_{riv}$ is the river morphology constraint loss, quantified from river break gaps using the number of pixels in the gaps, $N_{gap}$, and the total number of river pixels, $N_{riv}$: for a gap of at most 2 pixels the value approaches 0, while a larger gap causes a significant increase in the loss value. Together, the two terms constrain the regularity of building outlines and the linear continuity of rivers.
[0084] The spatial representation prior constraint loss is calculated based on the matching deviation between the distribution of ground features and prior rules such as elevation and slope. The cross-entropy function is used to quantify the proportion of pixels that do not conform to the prior distribution. The formula is as follows:
[0085] $$L_{p} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_{i} \log p_{i} + (1 - y_{i}) \log (1 - p_{i}) \right],$$
[0086] where $N$ is the total number of pixels in the target area; $y_{i}$ is the prior label of the $i$-th pixel, with $y_{i} = 1$ if the pixel conforms to the prior rule (for example, the elevation of a river pixel is less than three times the standard deviation of the surrounding area, or the slope of a building pixel is ≤ 15°) and $y_{i} = 0$ otherwise; and $p_{i}$ is the predicted probability that the $i$-th pixel conforms to the prior rules. Through this loss the model learns the inherent relationship between ground features and terrain features, thereby reducing the probability of predictions that do not conform to the prior rules.
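As a rough illustration of the cross-entropy prior term, the sketch below assumes the standard binary cross-entropy form over per-pixel prior labels and predicted conformity probabilities. The function name, the clipping constant `eps`, and the sample values are assumptions and do not come from the source.

```python
import numpy as np

def prior_loss(y_prior: np.ndarray, p_conform: np.ndarray, eps: float = 1e-7) -> float:
    """Mean binary cross-entropy between prior labels (0/1) and predicted probabilities."""
    p = np.clip(p_conform, eps, 1.0 - eps)  # avoid log(0); eps is an assumed safeguard
    return float(-np.mean(y_prior * np.log(p) + (1 - y_prior) * np.log(1 - p)))

# Hypothetical labels: 1 = pixel conforms to the elevation/slope prior, 0 = it does not.
y = np.array([1.0, 1.0, 0.0, 0.0])
uncertain = prior_loss(y, np.full(4, 0.5))
confident = prior_loss(y, np.array([0.9, 0.9, 0.1, 0.1]))
print(uncertain, confident)
```

Predictions that agree with the prior labels yield the smaller loss, which is the behaviour the paragraph above relies on.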
[0087] Based on the iterative optimization of the influence weights of each constraint on the semantic parsing results, structural morphology constraints are determined as the core influencing factor, while semantic boundary constraints and spatial representation prior constraints are auxiliary supplements. Finally, the total loss function is formed, and the calculation formula is shown below:
[0088] $$L_{total} = \lambda_{1} L_{b} + \lambda_{2} L_{s} + \lambda_{3} L_{p},$$
[0089] where $L_{total}$ is the total loss value of the model, measuring the comprehensive deviation of the model's predictions from the actual distribution of ground features and from the semantic constraint rules; $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are the weights of the semantic boundary, structural morphology, and spatial representation prior constraint factors, respectively. Values of 0.3, 0.4, and 0.3 were verified over multiple experiments, which ensures the dominant role of the core constraint and the supplementary effect of the auxiliary constraints. $L_{b}$ is the semantic boundary constraint loss, $L_{s}$ is the structural morphology constraint loss, and $L_{p}$ is the spatial representation prior constraint loss.
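The weighted combination with the stated weights 0.3, 0.4, and 0.3 can be sketched as below. The function name and the three input loss values are hypothetical placeholders.

```python
def total_constraint_loss(l_boundary: float, l_structure: float, l_prior: float,
                          weights: tuple = (0.3, 0.4, 0.3)) -> float:
    """Weighted sum of the boundary, structural morphology, and prior constraint losses."""
    w_b, w_s, w_p = weights
    return w_b * l_boundary + w_s * l_structure + w_p * l_prior

# Hypothetical per-term loss values.
combined = total_constraint_loss(1.0, 2.0, 3.0)
print(combined)
```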
[0090] S203: Construct a multi-scale semantic segmentation model and perform pre-training.
[0091] A multi-scale semantic segmentation model based on Transformer is constructed, adopting a three-stage architecture of "feature extraction-hierarchical fusion-accurate prediction" to adapt to the multi-dimensional feature input requirements of spatial benchmark data and achieve deep coupling of land texture, elevation and geographic coordinate information.
[0092] The specific structural design and quantitative parameters are as follows:
[0093] The input layer receives the historical spatial reference data processed in step S1. This data is a 4-channel input comprising 3 RGB texture channels and 1 elevation channel. The input is standardized with Z-score normalization, which rescales each channel to a mean of 0 and a variance of 1 and eliminates the interference of differing units and value ranges on model training. The size of the preprocessed data is the same as that of the input spatial reference data.
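A minimal sketch of per-channel Z-score normalization for a 4-channel tile follows. The tile size, the random data, and the small constant `eps` are assumptions for illustration.

```python
import numpy as np

def zscore_per_channel(data: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Rescale an (H, W, C) array so that each channel has mean 0 and variance 1."""
    mean = data.mean(axis=(0, 1), keepdims=True)
    std = data.std(axis=(0, 1), keepdims=True)
    return (data - mean) / (std + eps)  # eps guards against constant channels

rng = np.random.default_rng(0)
tile = rng.uniform(0, 255, size=(32, 32, 4))  # hypothetical RGB + elevation tile
normalized = zscore_per_channel(tile)
print(normalized.shape)
```

The output keeps the input shape, matching the statement that the preprocessed data has the same size as the input.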
[0094] The encoder module uses Vision Transformer (ViT-Base) as the backbone network. Its quantitative parameters are set as follows: 12 encoder layers are stacked, each configured with 12 attention heads; the dimension of a single attention head is 64, giving a total attention dimension of 768. The input data is divided into image patches of 16×16 pixels, and each patch is transformed into a feature embedding vector of dimension 768 through a 16×16×768 linear projection matrix. During embedding, a learnable positional encoding of dimension 768 is added, and normalized latitude and longitude coordinates are incorporated to strengthen spatial location correlation. Each encoder layer is configured with a layer normalization (LayerNorm) module, and Dropout regularization is introduced to suppress overfitting. The encoder finally extracts global semantic features and the spatial dependencies of ground features.
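The patch and embedding dimensions can be checked with a small shape-only sketch. It assumes a 224×224 input tile and a projection applied to flattened 16×16×4 patches; both are illustrative assumptions, and the zero-valued projection matrix stands in for learned weights.

```python
import numpy as np

PATCH, CHANNELS, EMBED_DIM, HEADS = 16, 4, 768, 12

def patchify(image: np.ndarray) -> np.ndarray:
    """Split an (H, W, C) image into flattened non-overlapping PATCH x PATCH patches."""
    h, w, c = image.shape
    gh, gw = h // PATCH, w // PATCH
    grid = image[:gh * PATCH, :gw * PATCH].reshape(gh, PATCH, gw, PATCH, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(gh * gw, PATCH * PATCH * c)

image = np.zeros((224, 224, CHANNELS), dtype=np.float32)   # assumed tile size
tokens = patchify(image)                                    # one row per patch
projection = np.zeros((tokens.shape[1], EMBED_DIM), dtype=np.float32)
embeddings = tokens @ projection                            # one 768-d vector per patch
print(tokens.shape, embeddings.shape, EMBED_DIM // HEADS)
```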
[0095] To balance local texture details with global semantics, a Feature Pyramid Network (FPN) is connected after the encoder to construct four feature levels, corresponding to 1 / 4, 1 / 8, 1 / 16, and 1 / 32 resolutions of the input images. The number of feature channels in each level is set to 256, 512, 1024, and 2048, respectively. The encoder output feature channels are uniformly mapped to 256 dimensions through lateral connections. Then, bilinear interpolation from top to bottom and lateral feature fusion are performed to complete the feature information of land features at different scales. Small-scale features focus on details such as building edges and river ridges, while large-scale features enhance the global discrimination of land cover types.
[0096] The decoder module uses a 4-layer deconvolutional network to perform upsampling. Each deconvolutional kernel is 4×4 in size, with a stride of 2 and 256 output channels. At the same time, skip connections are embedded to concatenate the FPN fusion features of each level of the encoder with the corresponding level features of the decoder by channel. Then, dimensionality reduction and integration are performed by a 1×1 convolutional kernel to correct the feature loss during the upsampling process and gradually restore the feature map to the resolution of the input image.
[0097] The prediction head adopts a "1×1 convolution + BatchNorm + Softmax" structure, where the number of 1×1 convolution kernels corresponds to the number of land cover types, the convolution stride is 1, the momentum of the BatchNorm module is set to 0.9, and finally the probability map of each pixel corresponding to the land cover is output through the Softmax activation function.
[0098] Throughout model training, a dual semantic constraint mechanism of "loss constraint + output guidance" is constructed, tightly binding the semantic constraint rule base to the model computation so that the constraint information controls the semantic parsing results. During the training phase, the weighted total constraint loss function constructed in S202 and the cross-entropy loss function are combined by weighted fusion as the model's total objective loss. The model parameters are iteratively optimized through backpropagation, enabling the model to learn the spatial representation patterns of land features. Specifically, for the river category, the river morphology constraint loss is used to suppress isolated pixel predictions and strengthen the learning of linear continuous features; for the building category, the elevation channel features are combined with the building form constraint loss: areas with an elevation 0.5 meters higher than the surrounding ground are selected as the candidate set for buildings, and misjudgments such as low vegetation and mounds are eliminated through the loss penalty.
[0099] S204: Joint semantic parsing and initial distribution result acquisition based on semantic constraints.
[0100] Based on the above model architecture and constraint mechanism, a multi-dimensional joint semantic parsing process is executed to achieve pixel-level accurate identification of the land cover type in the target area.
[0101] The preprocessed spatial reference data is used as input data and input into the trained semantic segmentation model. After the encoder, feature fusion and decoder operations, the probability distribution matrix of the land cover type corresponding to each pixel is output. The land cover type includes rivers, buildings, vegetation, roads and bare land.
[0102] The probability distribution matrix is binarized using a preset probability threshold of 0.5 to filter out candidate pixel regions for various land features. At the same time, a three-dimensional correlation matrix of "pixel coordinates-land feature coverage type-probability value" is established by combining geographic coordinate information. For the two core land features, rivers and buildings, additional constraint verification is performed to retain regions that meet the morphological and continuity constraints, remove isolated misjudged pixels and small interference areas, and highlight their spatial distribution range and boundary contours.
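The thresholding step and the "pixel coordinates, land cover type, probability value" association can be sketched as follows. The class order, the function name, and the sample probabilities are assumptions; the additional morphological verification for rivers and buildings is not shown.

```python
import numpy as np

CLASSES = ["river", "building", "vegetation", "road", "bare_land"]  # assumed channel order

def candidate_records(prob: np.ndarray, threshold: float = 0.5) -> list:
    """List (row, col, class name, probability) for every class probability >= threshold."""
    rows, cols, classes = np.nonzero(prob >= threshold)
    return [(int(r), int(c), CLASSES[k], float(prob[r, c, k]))
            for r, c, k in zip(rows, cols, classes)]

# Hypothetical 1x2 probability map with 5 class channels.
prob = np.array([[[0.7, 0.1, 0.1, 0.05, 0.05],
                  [0.3, 0.3, 0.2, 0.1, 0.1]]])
records = candidate_records(prob)
print(records)
```

Pixels where no class reaches the threshold produce no record and would be left for later handling.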
[0103] The integrated and verified classification results form the initial distribution results of different land cover types. These results are stored in the form of a two-dimensional matrix and bound to the corresponding geographic coordinates and elevation information.
[0104] S205: Structural consistency correction based on intrinsic consistency rules.
[0105] Establish rules for the inherent consistency of spatial representation of ground features, including rules for spatial continuity, morphological rationality, and elevation correlation, among which:
[0106] The spatial continuity rule of ground features can be understood as the proportion of adjacent pixels of the same type of ground feature. The formula for calculating the ratio of adjacent pixels is as follows:
[0107] $$R = \frac{N_{same}}{N_{nbr}},$$
[0108] where $N_{same}$ is the number of pixels of the same land cover type within the neighborhood of the target pixel, and $N_{nbr}$ is the total number of pixels in the neighborhood.
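A small sketch of the adjacent-pixel ratio follows. It assumes a 3×3 neighborhood with the target pixel itself excluded from both counts; the source does not state the neighborhood size or whether the center is counted.

```python
import numpy as np

def same_class_ratio(labels: np.ndarray, row: int, col: int, radius: int = 1) -> float:
    """Share of neighborhood pixels (center excluded) that match the target pixel's class."""
    r0, r1 = max(row - radius, 0), min(row + radius + 1, labels.shape[0])
    c0, c1 = max(col - radius, 0), min(col + radius + 1, labels.shape[1])
    window = labels[r0:r1, c0:c1]
    total = window.size - 1                                  # neighbors only
    same = int((window == labels[row, col]).sum()) - 1       # drop the center itself
    return same / total

# Hypothetical label map: 1 = river, 0 = other.
labels = np.array([[1, 1, 0],
                   [1, 1, 0],
                   [0, 0, 0]])
center_ratio = same_class_ratio(labels, 1, 1)
corner_ratio = same_class_ratio(labels, 0, 0)
print(center_ratio, corner_ratio)
```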
[0109] The morphological rationality rule can be understood as conforming to the structural morphological constraints of S201; the elevation correlation rule can be understood as the degree of matching between land cover type and elevation characteristics. The matching degree formula is as follows:
[0110] $$M = \frac{N_{match}}{N_{class}},$$
[0111] where $N_{match}$ is the number of pixels of a land feature type that match the corresponding elevation features, and $N_{class}$ is the total number of pixels of that land feature type.
[0112] Based on consistency rules, the initial classification results that conflict with semantic constraints are eliminated. For example, vegetation pixels located within the semantic range of rivers are reclassified as rivers, and building pixels in areas with a slope greater than 15° are corrected to bare land or vegetation, ensuring that the overall distribution of land features is logically consistent.
[0113] S206: Semantic Mask Set Generation and Accuracy Verification.
[0114] Based on the corrected land cover distribution results, binary semantic masks are generated for each land cover type. A pixel value of 1 in the mask indicates the corresponding land cover type, and a value of 0 indicates that it is not in that category. Each land cover type corresponds to an independent semantic mask, forming a set of semantic masks. The mask resolution is consistent with the spatial reference data and carries the corresponding geographic coordinate information.
[0115] A stratified random sampling method is used to verify the accuracy of the semantic mask set, ensuring that the sample points uniformly cover the center, edges, and typical distribution areas of the various land features in the target area. A total of 100 verification sample points are selected: 40 for rivers, 40 for buildings, and 20 for the other land cover types such as vegetation, roads, and bare land. Each sample point corresponds to a unique pixel coordinate and land cover type label. The manually labeled land cover types are used as the ground truth, and the semantic mask outputs are used as the predicted values. Three core accuracy indicators are calculated from the confusion matrix: overall accuracy (OA), the Kappa coefficient (K), and the intersection-over-union (IoU) for rivers and buildings. The OA requirement ensures the overall accuracy of classification; the Kappa requirement ensures a high level of consistency between the semantic mask results and the manual annotations; and the IoU requirement for the core categories of rivers and buildings ensures the accuracy of spatial distribution boundaries and range delineation for key features. During verification, the calculation data for each indicator is recorded. Sample points with ambiguous labels or outlying values are removed and replaced with additional sample points of the same category to ensure the reliability of the verification results.
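The three indicators can be computed from a confusion matrix as sketched below, using the usual definitions of OA, Cohen's Kappa, and per-class IoU. The confusion counts are hypothetical and only mirror the 40/40/20 sample split; they are not results from the patent.

```python
import numpy as np

def accuracy_metrics(cm: np.ndarray):
    """Return (OA, Kappa, per-class IoU); cm[i, j] counts true class i predicted as class j."""
    cm = cm.astype(float)
    n = cm.sum()
    oa = np.trace(cm) / n
    expected = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2  # chance agreement
    kappa = (oa - expected) / (1.0 - expected)
    tp = np.diag(cm)
    iou = tp / (cm.sum(axis=0) + cm.sum(axis=1) - tp)
    return oa, kappa, iou

# Hypothetical counts; rows and columns are river, building, other.
cm = np.array([[38, 1, 1],
               [2, 36, 2],
               [1, 1, 18]])
oa, kappa, iou = accuracy_metrics(cm)
print(oa, kappa, iou)
```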
[0116] If the accuracy does not meet the requirements, return to S202 to adjust the semantic boundary constraint factor weights, and re-execute the semantic parsing and correction process until the accuracy meets the requirements. Finally, a qualified semantic mask set is output for subsequent 3D terrain reconstruction.
[0117] As shown in Figure 3, step S3 establishes, based on the semantic mask set, a mapping relationship between semantic categories and surface physical attributes, and performs regional geometric reconstruction on the digital surface model according to this mapping relationship to generate three-dimensional terrain entities. It includes:
[0118] S301: Establishment of the mapping relationship between semantic categories and surface physical attributes.
[0119] Based on the semantic mask set generated in step S2, and combined with the topographic features and engineering attributes of the target area, the inherent physical essence of the core semantic categories (rivers, buildings, vegetation, roads, bare land) is clarified, and a four-dimensional mapping rule base of "semantic category - surface physical attribute - quantitative index - constraint type" is constructed. This mapping relationship is based on the objective physical characteristics of the land features, integrating engineering standards from hydrology, civil engineering, and surveying, to transform qualitative semantic information into quantifiable and computable physical constraint parameters. This provides precise attribute constraint support for subsequent regional topographic geometric reconstruction, ensuring that the reconstruction results both accurately reflect the true form of the land features and meet the accuracy requirements of engineering applications.
[0120] The specific definition and quantization parameters of the mapping relationship are as follows:
[0121] River semantic category: The corresponding surface physical attributes are "flow-adaptive low-resistivity surface, linear continuous elevation field, and lateral equipotential distribution". The core quantitative indicators and constraints are defined as follows: the lateral elevation gradient is less than or equal to 0.1 to ensure the lateral balance of water flow, the mainstream elevation slope is less than or equal to 5° to conform to the gentle extension law of natural rivers, and the constraint type is dual constraint of elevation continuity and physical characteristics.
[0122] Building semantic category: The corresponding surface physical attributes are "discrete rigid protruding structure, separable from the base topography, and regular outline". The core quantitative indicators and constraints are defined as follows: the elevation difference between the main building and the ground base is greater than or equal to 0.5 meters, consistent with the S2 semantic constraints, to distinguish buildings from low-rise features; the verticality tolerance of the building outline is less than or equal to 0.5°, to control the vertical accuracy of building entities; and the continuity error of the ground base elevation is less than or equal to 0.3 meters, to ensure the smoothness of base reconstruction. The constraint type is structural separation and morphological accuracy constraint.
[0123] Vegetation semantic category: The corresponding surface physical attribute is "flexible cover attached to the terrain, locally low and undulating, without rigid structure". The core quantitative indicators and constraints are defined as follows: the local undulation amplitude is less than or equal to 1.5 meters, which is suitable for the height range of typical vegetation such as herbaceous plants and shrubs. The elevation connection error between the vegetation area and the surrounding non-building area is less than or equal to 0.2 meters to ensure the continuity of the terrain. The constraint type is undulation amplitude and connection smoothness constraint.
[0124] Road semantic category: The corresponding surface physical attributes are "terrain-adaptive gentle passage, balanced lateral elevation, and natural transition". The core quantitative indicators and constraints are defined as follows: longitudinal slope less than or equal to 8°, in accordance with highway engineering design specifications; lateral elevation gradient less than or equal to 0.05, to ensure driving stability; and the length of the transition section with the surrounding terrain greater than or equal to 3 pixels, corresponding to an actual length of 1.5 meters, to avoid abrupt elevation changes. The constraint type is slope and transition smoothness constraint.
[0125] Bare land semantic category: The corresponding surface physical attributes are "naturally formed undulating surface, no artificial intervention structure, and controllable curvature". The core quantitative indicators and constraints are defined as follows: the local terrain curvature is less than or equal to 0.02, which represents the flatness of the terrain and avoids false undulations; the deviation between the elevation change gradient and the global terrain trend is less than or equal to 0.3. The constraint type is curvature and trend consistency constraint.
[0126] The above four-dimensional mapping rule base is transformed into a computable semantic-physical attribute constraint matrix. The matrix structure is designed as $C \in \mathbb{R}^{H_{m} \times W_{m} \times K}$, where $H_{m}$ and $W_{m}$ correspond to the height and width of the semantic mask set, i.e., the pixel dimensions, and $K$ is the constraint parameter dimension: dimension 1 corresponds to the roughness coefficient, dimension 2 to the elevation gradient, dimension 3 to the slope / undulation amplitude, dimension 4 to the curvature, and dimension 5 to the connection error. Each element $C(i, j, k)$ in the matrix is assigned pixel by pixel using the semantic masks: if pixel $(i, j)$ belongs to a given semantic category, $C(i, j, k)$ takes the corresponding quantitative indicator threshold; if it does not, $C(i, j, k)$ is set to infinity, indicating the absence of this constraint. The result is coupled data with precise pixel-by-pixel constraints. This constraint matrix serves as the core input and is embedded in the objective function and accuracy criteria of the subsequent terrain reconstruction algorithm for each region, so that the mapping relationship is applied quantitatively and controlled throughout the process.
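The pixel-by-pixel assignment can be sketched as below. The thresholds are the values listed for each category in S301; the assignment of each threshold to a particular dimension index, the dictionary layout, and the omission of building and roughness entries are assumptions made for illustration.

```python
import numpy as np

# 0-based indices for: roughness, elevation gradient, slope/undulation, curvature, connection error.
ROUGH, GRAD, SLOPE, CURV, CONN = range(5)

# Thresholds from the S301 category definitions; the dimension mapping is assumed.
THRESHOLDS = {
    "river": {GRAD: 0.1, SLOPE: 5.0},
    "vegetation": {SLOPE: 1.5, CONN: 0.2},
    "road": {GRAD: 0.05, SLOPE: 8.0},
    "bare_land": {CURV: 0.02},
}

def build_constraint_matrix(masks: dict) -> np.ndarray:
    """Fill an (H, W, 5) matrix with per-pixel thresholds; infinity means unconstrained."""
    h, w = next(iter(masks.values())).shape
    c = np.full((h, w, 5), np.inf)
    for name, limits in THRESHOLDS.items():
        inside = masks[name].astype(bool)
        for dim, value in limits.items():
            c[inside, dim] = value
    return c

# Hypothetical 1x2 masks: left pixel is river, right pixel is vegetation.
masks = {
    "river": np.array([[1, 0]]),
    "vegetation": np.array([[0, 1]]),
    "road": np.array([[0, 0]]),
    "bare_land": np.array([[0, 0]]),
}
constraints = build_constraint_matrix(masks)
print(constraints[0, 0], constraints[0, 1])
```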
[0127] S302: Elevation Anomaly Correction and Reconstruction Based on River Semantic Mask.
[0128] First, the set of river pixels within the target area is extracted based on the river semantic mask. Then, combined with the elevation information in the spatial reference data, the main flow direction of the river is extracted using a hydrological analysis algorithm.
[0129] Considering that the main river direction is determined by the natural convergence trend of water flow, the direction with the largest cumulative water flow is the true main river direction. Therefore, a formula for determining the main river direction is constructed using a flow direction accumulation matrix to accurately capture the river's extension trend. The formula is designed based on the "principle of maximizing cumulative water flow," that is, by statistically analyzing the cumulative water flow of eight potential flow directions, the direction corresponding to the optimal cumulative value is selected as the main river direction. The calculation formula is shown below:
[0130] $$\theta_{main} = \arg\max_{\theta \in \{0^{\circ}, 45^{\circ}, \ldots, 315^{\circ}\}} A(\theta),$$
[0131] where $\theta_{main}$ is the direction of the main stream of the river (unit: degrees); $\theta$ is a potential flow direction taking values from 0° to 315° at 45° intervals, i.e., the eight standard flow directions preset by the hydrological analysis algorithm; and $A(\theta)$ is the cumulative water flow in the corresponding direction, obtained from the flow accumulation matrix computed from the elevation information of the spatial reference data. The matrix is generated with the region defined by the river semantic mask as its boundary.
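Selecting the direction with the largest cumulative flow among the eight standard directions can be sketched as follows. The accumulation values are hypothetical; computing them from a flow accumulation matrix is outside this sketch.

```python
import numpy as np

DIRECTIONS_DEG = np.arange(0, 360, 45)  # 0, 45, ..., 315

def main_flow_direction(accumulated) -> int:
    """Return the direction (degrees) whose cumulative flow is largest."""
    accumulated = np.asarray(accumulated, dtype=float)
    return int(DIRECTIONS_DEG[int(np.argmax(accumulated))])

# Hypothetical cumulative flow per direction, ordered 0, 45, ..., 315 degrees.
flow = [12.0, 85.5, 40.2, 3.1, 0.0, 0.0, 7.8, 20.4]
direction = main_flow_direction(flow)
print(direction)
```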
[0132] An elevation correction model is constructed by applying directional consistency constraints that keep the elevation continuous along the main flow direction and equipotential in the lateral direction. Throughout the process, the constraint matrix is invoked and its parameters for the river region are used as hard constraints, including the lateral elevation gradient limit (≤ 0.1) and the mainstream slope limit (≤ 5°) defined in S301.
[0133] For the main river flow direction, to ensure elevation continuity and comply with slope constraints, a two-point linear interpolation algorithm is used to correct outlier points. The formula is designed based on the "gradual elevation change characteristics of the main river flow direction," meaning that the elevation of a normal river changes gradually and linearly along its flow direction. The elevation of the point to be corrected is derived through the linear relationship between adjacent normal elevation points, avoiding abrupt elevation changes. The correction formula is shown below:
[0134] $$H(p) = H_{1} + \frac{H_{2} - H_{1}}{\sqrt{(x_{2} - x_{1})^{2} + (y_{2} - y_{1})^{2}}} \cdot d,$$
[0135] where $H(p)$ is the corrected elevation of the point $p$ to be corrected; $(x_{1}, y_{1})$ and $(x_{2}, y_{2})$ are the coordinates of the adjacent normal elevation points along the main flow direction, which must satisfy the condition that the slope between the two points is less than or equal to 5°; $H_{1}$ and $H_{2}$ are the elevations of these two normal elevation points; and $d$ is the straight-line distance from the point $p$ to $(x_{1}, y_{1})$.
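Two-point linear interpolation along the flow direction can be sketched as below. The function name and the sample coordinates and elevations are assumptions; the slope check between the two normal points is not included.

```python
import math

def corrected_elevation(p, p1, h1, p2, h2) -> float:
    """Linearly interpolate the elevation at p between normal points p1 (h1) and p2 (h2)."""
    span = math.dist(p1, p2)   # distance between the two normal elevation points
    d = math.dist(p, p1)       # distance from the point to be corrected to p1
    return h1 + (h2 - h1) * d / span

# Hypothetical points along the main flow direction (pixel coordinates, meters of elevation).
elevation = corrected_elevation((4, 0), (0, 0), 10.0, (10, 0), 9.0)
print(elevation)
```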
[0136] For the lateral direction, the lateral elevation gradient constraint (≤ 0.1) is strictly followed: outliers are adjusted using a mean smoothing algorithm, and the Manning roughness coefficient of the corrected region is verified to ensure it does not exceed its limit. If the conditions are not met, a second correction is triggered.
[0137] A secondary outlier check is performed on the corrected river area elevation, with the constraint matrix as the core criterion. A 3×3 neighborhood window is used to compute the standard deviation of elevation, and residual outliers beyond the mean ± 2 standard deviations are removed; at the same time, the mainstream slope is verified to be less than or equal to 5° and the lateral elevation gradient less than or equal to 0.1, ensuring that all parameters conform to the semantic mapping attributes of the river. Gaps are filled using Kriging interpolation, ultimately generating a continuous, smooth, and physically constrained 3D elevation surface for the river that transitions naturally into the surrounding terrain.
[0138] S303: Terrain structure separation and reconstruction based on architectural semantic mask.
[0139] Building-covered pixel sets are extracted based on a semantic mask, and a morphological dilation algorithm is used to expand the boundaries, avoiding the omission of building edge pixels. To accurately obtain the ground base elevation below the building-covered area, considering that the elevation of the non-building areas surrounding the building reflects the actual ground trend, the neighborhood elevation mean method is used to calculate the ground base elevation. The formula is designed based on the "continuity principle of ground elevation around the building," reducing local disturbances by expanding the neighborhood range and ensuring that the ground base elevation closely matches the actual terrain. The calculation formula is shown below:
[0140] $$H_{base} = \frac{1}{n} \sum_{k=1}^{n} H_{k},$$
[0141] where $H_{base}$ is the ground base elevation of the building-covered area, i.e., the mean computed by the formula; $\Omega$ is the neighborhood surrounding the building, set to a 5×5 pixel range extended outward from the boundary of the building semantic mask, with the extension range optimized according to the actual terrain correlation of the 0.5-meter resolution image; $n$ is the number of non-building pixels in $\Omega$, obtained by pixel-by-pixel determination and counting using the building semantic mask; and $H_{k}$ is the elevation of the $k$-th non-building pixel in $\Omega$.
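The neighborhood mean can be sketched with NumPy as below. It interprets the 5×5 range as all non-building pixels within 2 pixels of any building pixel; that interpretation, the function name, and the toy grid are assumptions.

```python
import numpy as np

def ground_base_elevation(elevation: np.ndarray, building: np.ndarray, radius: int = 2) -> float:
    """Mean elevation of non-building pixels within `radius` pixels of the building mask."""
    building = building.astype(bool)
    h, w = building.shape
    padded = np.pad(building, radius)            # pads with False
    near = np.zeros_like(building)
    for dr in range(2 * radius + 1):             # union of shifted masks = square dilation
        for dc in range(2 * radius + 1):
            near |= padded[dr:dr + h, dc:dc + w]
    ring = near & ~building                      # neighborhood pixels outside the building
    return float(elevation[ring].mean())

# Hypothetical 7x7 tile: one building pixel at the center, distant border at 100 m.
elev = np.full((7, 7), 25.0)
elev[0, :] = elev[6, :] = elev[:, 0] = elev[:, 6] = 100.0
elev[3, 3] = 40.0
bld = np.zeros((7, 7), dtype=bool)
bld[3, 3] = True
base = ground_base_elevation(elev, bld)
print(base)
```

The building pixel and the distant border are both excluded, so only the surrounding ground contributes to the mean.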
[0142] Terrain structure separation and reconstruction is then performed: the main building is separated from the ground base, and the constraint matrix parameters for the building area are invoked as the accuracy control standard. For the main building, the elevation difference threshold from the mapping relationship (at least 0.5 meters above the ground base) is applied, the main building elevation is preserved, and a block-shaped building entity model is constructed whose outline strictly follows the building semantic mask boundary, with a verticality deviation less than or equal to 0.5°. For the ground base, the elevation influence of the main building pixels is eliminated, and a quadratic trend surface fitting algorithm is used to restore the continuous ground. The formula is designed based on the characteristic that "ground terrain often exhibits a quadratic curved surface distribution." The fitting formula is shown below:
[0143] $$H(x, y) = a_{0} + a_{1} x + a_{2} y + a_{3} x^{2} + a_{4} x y + a_{5} y^{2},$$
[0144] where $H(x, y)$ is the fitted elevation of a point on the ground base; $x$ and $y$ are the pixel coordinates of that point; and $a_{0}$ to $a_{5}$ are the fitting coefficients obtained using the least squares method. After fitting, the continuity error of the ground base elevation is verified to be less than or equal to 0.3 meters, ensuring conformity with the mapping attribute constraints of the building semantic category.
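A least-squares fit of a quadratic trend surface can be sketched with `numpy.linalg.lstsq`. The coefficient ordering, the function name, and the synthetic surface used to exercise the fit are assumptions.

```python
import numpy as np

def fit_quadratic_surface(x: np.ndarray, y: np.ndarray, z: np.ndarray) -> np.ndarray:
    """Least-squares coefficients for z = a0 + a1*x + a2*y + a3*x^2 + a4*x*y + a5*y^2."""
    design = np.column_stack([np.ones_like(x), x, y, x ** 2, x * y, y ** 2])
    coeffs, *_ = np.linalg.lstsq(design, z, rcond=None)
    return coeffs

# Synthetic ground samples on a 5x5 pixel grid, generated from known coefficients.
ys, xs = np.mgrid[0:5, 0:5]
x = xs.ravel().astype(float)
y = ys.ravel().astype(float)
true_coeffs = np.array([25.0, 0.1, -0.05, 0.01, 0.002, -0.003])
z = (true_coeffs[0] + true_coeffs[1] * x + true_coeffs[2] * y
     + true_coeffs[3] * x ** 2 + true_coeffs[4] * x * y + true_coeffs[5] * y ** 2)
fitted = fit_quadratic_surface(x, y, z)
print(np.round(fitted, 4))
```

In practice only non-building ground pixels would be passed in, so the building footprint is filled by the fitted surface.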
[0145] The reconstructed building area is then seamlessly integrated with the ground base. A 2-pixel-wide transition zone is created at the building edges, and a gradient interpolation algorithm is used to eliminate abrupt elevation changes. During interpolation, the connection error parameter of the constraint matrix is used to ensure that the connection error between the transition zone and both the main building and the surrounding ground is less than or equal to the specified value. This allows the building to blend naturally with the ground without obvious splicing marks, while keeping the verticality of the building outline within the constraints.
[0146] Example: within the 5×5 neighborhood of a building, the number of non-building pixels is 20, and their elevations are 25.3 meters, 25.5 meters, 25.4 meters, ..., 25.6 meters (a total of 20 values). The ground base elevation is calculated as the mean of these values, and this elevation, in meters, serves as the reference for separating the main building from the ground base.
[0147] S304: Continuity constraints and reconstruction based on non-building areas.
[0148] Non-building areas are defined as other areas in the semantic mask set besides building semantics, including rivers, vegetation, roads, and bare land. A dual-constraint reconstruction model is constructed by applying a continuity constraint to this area that is continuous overall but restricts local undulations, taking into account both the natural characteristics of the terrain and the suppression of disturbances.
[0149] A global trend surface analysis algorithm is used to fit the elevation of non-building areas. The residual between the fitted elevation and the original elevation is calculated. Pixels with an absolute residual value ≥ 0.5 meters are identified as unstructured disturbance points. To suppress disturbances while preserving reasonable terrain undulations, a weighted smoothing algorithm is used for correction. The formula is designed based on the principle that "the larger the residual of a disturbance point, the stronger its impact on terrain realism, and therefore it needs to be assigned a lower weight." By designing the weights to be inversely proportional to the absolute value of the residual, excessive smoothing that could lead to the loss of terrain details is avoided. The correction formula is shown below:
[0150] $$H'(p) = \frac{\sum_{q \in W} w_{q} H_{q}}{\sum_{q \in W} w_{q}},$$
[0151] where $H'(p)$ is the smoothed elevation of the point $p$ to be corrected; $W$ is the neighborhood window for smoothing, set to 5×5 pixels, a size chosen to balance the smoothing effect against the preservation of terrain detail; $w_{q}$ is the weighting coefficient of pixel $q$ within the neighborhood window, inversely proportional to the absolute value of its residual, $w_{q} \propto 1 / |r_{q}|$; $H_{q}$ is the original elevation of pixel $q$ within the neighborhood window; and $r_{q}$ is the elevation residual of neighboring pixel $q$, i.e., the difference between its original elevation and the elevation fitted to the global trend surface.
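Residual-weighted smoothing with weights inversely proportional to the absolute residual can be sketched as follows. The exact weight formula is not recoverable from the text, so the form `1 / (|r| + eps)` and the constant `eps` are assumptions, as are the function name and the toy data.

```python
import numpy as np

def smooth_point(elevation: np.ndarray, residual: np.ndarray, row: int, col: int,
                 radius: int = 2, eps: float = 1e-6) -> float:
    """Weighted mean of window elevations; larger |residual| gives a smaller weight."""
    r0, r1 = max(row - radius, 0), min(row + radius + 1, elevation.shape[0])
    c0, c1 = max(col - radius, 0), min(col + radius + 1, elevation.shape[1])
    z = elevation[r0:r1, c0:c1]
    w = 1.0 / (np.abs(residual[r0:r1, c0:c1]) + eps)  # assumed weight form
    return float((w * z).sum() / w.sum())

# Hypothetical 5x5 window: flat ground at 10 m with one 2 m disturbance at the center.
elev = np.full((5, 5), 10.0)
elev[2, 2] = 12.0
resid = np.zeros((5, 5))
resid[2, 2] = 2.0
smoothed = smooth_point(elev, resid, 2, 2)
print(smoothed)
```

The disturbed pixel contributes almost nothing to the average, so the smoothed value stays close to the surrounding ground.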
[0152] The constraint matrix is invoked according to semantic category, and the parameters of the corresponding dimensions are used as thresholds: for vegetation areas, a local undulation amplitude ≤ 1.5 meters and a connection error ≤ 0.2 meters; for road areas, a longitudinal slope ≤ 8° and a lateral elevation gradient ≤ 0.05; for bare land areas, a local curvature ≤ 0.02.
[0153] A local terrain filtering algorithm is used to smooth areas exceeding the thresholds, and the adjusted areas are then verified pixel by pixel against the constraint matrix parameters. This ensures that the connection error between the vegetation area and the surrounding area is less than or equal to 0.2 meters, the length of the road transition section is greater than or equal to 3 pixels, and the curvature of the bare land is less than or equal to 0.02. Reasonable terrain undulations are preserved while the mapping attributes of each semantic category are strictly observed, effectively suppressing unstructured disturbances.
[0154] S305: 3D terrain entity integration and accuracy verification.
[0155] The reconstruction results of rivers, buildings, and non-building areas are integrated, and regional fusion is performed based on semantic mask boundaries to ensure smooth elevation transitions and consistent structural logic among regions. This generates an initial 3D terrain entity containing various 3D landforms. The entity data format uses dual storage of point cloud and mesh, with point cloud density consistent with the resolution of the digital land surface model and mesh accuracy of 0.5m × 0.5m.
[0156] Accuracy verification is performed on the initial 3D terrain entity using two types of verification indicators, both referenced to the semantic-physical attribute constraint matrix. The first is elevation accuracy: eighty verification points are randomly selected (20 for rivers, 30 for buildings, and 30 for non-building areas), and the reconstructed elevations are compared with the original elevations in the digital surface model. The root mean square error (RMSE) of the elevation deviation is required to be less than or equal to 0.25 meters, and each point is verified against the constraint matrix parameters of its semantic category. The second is structural integrity and attribute compliance: manual annotation is combined with matrix verification to check whether the verticality of the building outline, the river slope, and the bare land curvature meet the matrix thresholds, ensuring that the structural compliance rate meets the standard and that the mapping relationship is effectively implemented.
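The elevation accuracy check against the 0.25-meter RMSE limit can be sketched as below. The four sample elevations are hypothetical; the procedure described above uses eighty verification points.

```python
import numpy as np

RMSE_LIMIT_M = 0.25  # limit stated in the text

def elevation_rmse(reconstructed: np.ndarray, reference: np.ndarray) -> float:
    """Root mean square error between reconstructed and reference elevations."""
    return float(np.sqrt(np.mean((reconstructed - reference) ** 2)))

# Hypothetical verification points (meters).
reconstructed = np.array([10.1, 9.8, 10.0, 10.3])
reference = np.array([10.0, 10.0, 10.0, 10.0])
rmse = elevation_rmse(reconstructed, reference)
passed = rmse <= RMSE_LIMIT_M
print(rmse, passed)
```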
[0157] If the verification fails, adjustments are made to the unqualified areas: if the elevation accuracy is not up to standard, the corresponding area reconstruction step is returned to optimize the constraint parameters, such as the river lateral elevation difference threshold and the smoothing weight of non-building areas; if the structural integrity is not up to standard, the semantic category and physical attribute mapping relationship is corrected, and the reconstruction process is re-executed until the verification requirements are met, and finally qualified 3D terrain entities are output.
[0158] Step S4 involves obtaining the engineering cost documents corresponding to the target area, and constructing an engineering cost data set through layout analysis and identification, semantic parsing, table structure restoration, information extraction, and numerical processing, including:
[0159] S401: Acquisition and Preprocessing of Engineering Cost Documents;
[0160] The complete engineering cost documents for the 3D terrain construction project of the target area are acquired, including paper scans, electronic PDFs, Word documents, and Excel files, ensuring that no key information is missing and no large areas are damaged. Format normalization is performed on the documents, which are uniformly converted to a standardized PDF format at 300 DPI. For paper scans, image quality is optimized using image enhancement algorithms, adaptive binarization is applied to eliminate paper noise and shadows, and scanning distortion is removed using perspective correction to ensure the accuracy of subsequent OCR recognition.
[0161] Based on project association rules, document validity is verified, and the consistency between the project name, geographical location, and the target area in step S1 is checked. Redundant documents are removed, and the integrity of metadata is verified. Qualified documents are subject to number management, and an association index consistent with the spatial partitioning rules of the semantic mask set in step S2 is established, laying the foundation for spatially directional association.
[0162] S402: Page Layout Analysis and Optical Character Recognition Processing;
[0163] A deep learning layout analysis model based on Mask R-CNN is used to perform pixel-level segmentation on the preprocessed document pages. The model follows a four-stage architecture of "feature extraction - region proposal - classification and regression - mask generation".
[0164] The input layer receives a 300 DPI normalized PDF page image, which is normalized and mapped to the [0,1] pixel value range, and outputs a 3-channel RGB feature map.
[0165] The backbone network uses ResNet50 and consists of 5 residual stages. Stage 1 to Stage 4 contain 1, 3, 4 and 6 residual blocks respectively. Each residual block is equipped with batch normalization and ReLU activation function, and finally outputs a deep feature map with 2048 channels. Stage 5 introduces dilated convolution to expand the receptive field and ensure complete capture of features in large table regions.
[0166] The Feature Pyramid (FPN) connects the outputs of each stage of the backbone network across layers, constructing five feature levels from P2 to P6. The number of channels is uniformly adjusted to 256. Through top-down interpolation and bottom-up feature fusion, the feature representation of elements at different scales, such as tables and text blocks, is enhanced.
[0167] The model employs a dual-branch detection head together with a mask branch. The classification branch consists of two 1×1 convolutional layers and outputs the probability distribution over eight types of layout elements. The regression branch predicts the bounding box coordinate offsets using four 3×3 convolutional layers. The mask branch combines 3×3 convolutions with transposed convolutions to generate a 14×14 pixel mask, achieving pixel-level segmentation. The region proposal network generates candidate regions based on an anchor box strategy, with anchor boxes set at three scales and three aspect ratios. Candidate boxes are filtered using non-maximum suppression with the IoU threshold set to 0.7.
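The candidate-box filtering step can be sketched as plain greedy non-maximum suppression. This is an illustrative sketch, not the embodiment's code: the box format `(x1, y1, x2, y2)` and the function names are assumptions, and only the 0.7 IoU threshold is taken from the description.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.7):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep a box only if it does not overlap any kept box too strongly.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept
```

A production pipeline would normally rely on the detection framework's own batched implementation; the loop above only shows the rule being applied.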
[0168] The segmentation process incorporates prior constraints on the layout of engineering cost documents, and strengthens the spatial correlation between titles and tables through an attention mechanism, thereby improving the accuracy of table area segmentation and text-non-text area segmentation. The segmented elements are marked with the coordinates of the top-left and bottom-right corners of the page and page number information, forming a standardized list of layout elements.
[0169] A multilingual recognition engine is employed to recognize text blocks and table areas. The recognition model is optimized by integrating a professional engineering cost terminology database, reducing the misrecognition rate of technical terms. A structured recognition strategy is used for table areas, extracting content cell by cell while preserving row and column relationships. Text blocks are split into independent segments based on semantic paragraphs. The recognition results are bound to page element coordinates and page numbers to generate text units, which are then processed according to confidence levels: units with a confidence level of 0.95 or higher are directly retained; those between 0.8 and 0.95 undergo manual review and correction; and those below 0.8 are re-recognized after optimizing preprocessing parameters, ensuring a text unit accuracy rate of at least 99%.
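The confidence-based handling of text units amounts to a three-way routing rule. The sketch below assumes a confidence value in the range 0 to 1; the function name and the returned labels are hypothetical, while the 0.95 and 0.8 thresholds come from the description.

```python
def route_text_unit(confidence):
    """Decide how an OCR text unit is handled, based on its confidence."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if confidence >= 0.95:
        return "retain"          # kept directly
    if confidence >= 0.8:
        return "manual_review"   # reviewed and corrected by a person
    return "re_recognize"        # recognized again after tuning preprocessing
```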
[0170] S403: Semantic parsing of text units and restoration of table structure;
[0171] The text unit semantic parsing is performed based on the fine-tuned BERT model. The model adopts a three-layer architecture of "embedding layer-encoder layer-output layer" to adapt to the semantic understanding needs of the engineering cost field.
[0172] The embedding layer consists of word embedding, position embedding, and segment embedding. The word embedding dimension is set to 768. The engineering cost professional vocabulary is expanded based on the BERT-base pre-trained vocabulary, and new terms such as "work quantity" and "comprehensive unit price" are added, expanding the total vocabulary size to 30,000. The position embedding is generated by sine and cosine functions, covering the maximum length of 512 text units on the document page. The segment embedding is used to distinguish different text units and realize cross-unit semantic association.
[0173] The encoder layer contains 12 stacked units, each with a built-in multi-head self-attention mechanism and a feedforward neural network. The multi-head attention mechanism has 12 attention heads, each with a dimension of 64. It calculates the semantic weights within and between text units by scaling the dot product attention, thus mitigating the semantic decay of long texts. The feedforward neural network consists of two linear transformation layers with a GELU activation function in between. The hidden layer dimension is set to 3072 to enhance the model's non-linear fitting ability. After each encoder layer, layer normalization and Dropout regularization are configured, with the Dropout probability set to 0.1 to suppress overfitting.
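The attention computation named above is the standard scaled dot-product form, softmax(QK^T / sqrt(d_k))V, applied per head. The following pure-Python sketch shows one head on small lists; it is an illustration only, the function names are assumptions, and a real model would use a tensor library. Note that 12 heads of dimension 64 give the 768-dimensional embedding stated earlier.

```python
import math

NUM_HEADS, HEAD_DIM = 12, 64
MODEL_DIM = NUM_HEADS * HEAD_DIM  # matches the 768-dimensional embedding


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(q, k, v):
    """Single-head attention; q, k, v are lists of vectors, one per token."""
    d_k = len(k[0])
    width = len(v[0])
    output = []
    for q_i in q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d_k) for k_j in k]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append([sum(w * v_j[c] for w, v_j in zip(weights, v)) for c in range(width)])
    return output
```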
[0174] The output layer uses linear transformation and the Softmax activation function to output semantic similarity scores and entity association probabilities. During model fine-tuning, a corpus from the engineering cost domain is used for training. The corpus covers various cost document texts, tabular data, and field association samples. The AdamW optimizer is used, with an initial learning rate of 2e-5, a batch size of 16, and 50 iterations. An early stopping strategy is employed to prevent overfitting. A domain semantic association rule base is constructed, clearly defining the core logic of one-to-one correspondence between project names and geographical locations, binding of quantities with units of measurement and comprehensive unit prices, the cumulative relationship between total cost and sub-item costs, and the relationship between sub-item engineering and geographical region affiliation. Combined with the semantic similarity scores and entity linking techniques output by the model, a text unit association index is established to address the problem of fragmented logic in discrete text.
[0175] Based on page layout row and column coordinates and semantic association rules, a matrix matching algorithm is used to reconstruct the table structure. The alignment of rows and columns is reconstructed using the title row and column identifiers as a basis, filling in missing and misaligned cells. For multi-page tables, page numbers and border features are used for splicing. The reconstructed table undergoes field consistency reorganization, standardizing field naming conventions, removing duplicate fields and blank rows. The reorganized table data is stored in a standardized JSON format, preserving the row and column indices and their association with the original text units, thus supporting information extraction.
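A simplified sketch of the table restoration idea follows. It is not the matrix matching algorithm itself: it groups recognized cells into rows by their vertical coordinate, assigns each cell to the nearest header column, fills missing cells with `None`, and serializes the result as JSON. The cell dictionary layout (`x`, `y`, `text`) and the `row_tolerance` parameter are assumptions for illustration.

```python
import json


def rebuild_table(cells, row_tolerance=5.0):
    """Rebuild rows from OCR cells; each cell has 'x', 'y' (pixels) and 'text'."""
    if not cells:
        return []
    rows = []
    for cell in sorted(cells, key=lambda c: (c["y"], c["x"])):
        # Cells whose y is close to the current row's first cell share a row.
        if rows and abs(cell["y"] - rows[-1][0]["y"]) <= row_tolerance:
            rows[-1].append(cell)
        else:
            rows.append([cell])
    header_cells = sorted(rows[0], key=lambda c: c["x"])
    records = []
    for index, row in enumerate(rows[1:], start=1):
        fields = {h["text"]: None for h in header_cells}  # None marks a missing cell
        for cell in row:
            nearest = min(header_cells, key=lambda h: abs(h["x"] - cell["x"]))
            fields[nearest["text"]] = cell["text"]
        records.append({"row": index, "fields": fields})
    return records


def table_to_json(cells):
    """Serialize the rebuilt table, keeping non-ASCII text readable."""
    return json.dumps(rebuild_table(cells), ensure_ascii=False)
```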
[0176] S404: Key information extraction and cost information digitization;
[0177] A dual strategy of rule matching and entity recognition is employed to selectively extract five core information categories from tabular data and text units: project name, geographical location description, quantity of work, unit of measurement, and total cost. The extraction process incorporates spatial semantic constraints, and the accuracy of the geographical location description is optimized by combining the semantic mask set of feature boundaries in step S2. Quantities of work and units of measurement are extracted in pairs according to the row and column relationships in the table to avoid matching bias.
[0178] The extracted cost information undergoes numerical and standardization processing. Chinese capitalized cost figures are converted to Arabic numerals, and the currency unit is standardized to RMB with two decimal places. For cost ranges, the median value is used as the standard value, and the range attribute is labeled. For ambiguous descriptions of quantities, modifiers are removed, and the core values are retained and recorded. Standardized national units of measurement are adopted, and non-standard units are converted, ultimately forming standardized information entries containing complete attributes.
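The numerical conversion can be sketched as follows. The converter handles Chinese capitalized amounts up to the 亿 (10^8) section with 角 and 分 fractions, and the range helper returns the midpoint of a cost range together with a range label. Function names and the returned dictionary layout are assumptions for illustration, and unusual notations are not covered.

```python
from decimal import Decimal, ROUND_HALF_UP

DIGITS = {ch: i for i, ch in enumerate("零壹贰叁肆伍陆柒捌玖")}
SMALL_UNITS = {"拾": 10, "佰": 100, "仟": 1000}
CENT = Decimal("0.01")


def capital_to_yuan(text):
    """Convert a capitalized amount such as '叁拾万元整' to a Decimal in yuan."""
    total = section = digit = fen = 0
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in SMALL_UNITS:
            section += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch == "万":
            total += (section + digit) * 10**4
            section = digit = 0
        elif ch == "亿":
            total = (total + section + digit) * 10**8
            section = digit = 0
        elif ch in "元圆":
            total += section + digit
            section = digit = 0
        elif ch == "角":
            fen, digit = fen + digit * 10, 0
        elif ch == "分":
            fen, digit = fen + digit, 0
        elif ch not in "整正":
            raise ValueError(f"unsupported character: {ch!r}")
    total += section + digit  # amounts written without a trailing 元
    return (Decimal(total) + Decimal(fen) / 100).quantize(CENT)


def range_to_standard(low, high):
    """Use the midpoint of a cost range as the standard value and label it."""
    low, high = Decimal(str(low)), Decimal(str(high))
    value = ((low + high) / 2).quantize(CENT, rounding=ROUND_HALF_UP)
    return {"value": value, "is_range": True, "range": (low, high)}
```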
[0179] S405: Generation and Verification of Spatial Direction Engineering Cost Data Sets;
[0180] Based on the spatial reference data from step S1, a geocoding engine that combines map APIs with a proprietary spatial database converts text-based geographic location descriptions into latitude and longitude bounding box coordinates. The coordinate accuracy is consistent with the spatial reference data, retaining six decimal places. For geographic locations containing feature references, the coordinate range is corrected by associating it with the river semantic mask boundary from step S2 to ensure accurate alignment with the three-dimensional terrain spatial range.
[0181] By integrating standardized information entries and geographic coordinate ranges, a dual-structure dataset of "spatial index + attribute information" is constructed. The spatial index is associated with the S3 3D terrain entity spatial grid, and the attribute information covers all core fields. The data is encapsulated in GeoJSON-LD format and backed up in JSON format to ensure compatibility with mainstream GIS platforms and engineering software.
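As a rough illustration of the "spatial index + attribute information" structure, the sketch below wraps a bounding box and a set of attribute fields in a plain GeoJSON Feature. The linked-data context that a GeoJSON-LD encapsulation would add is omitted, and the function name and property names are hypothetical.

```python
import json


def bbox_to_feature(west, south, east, north, properties):
    """Build a GeoJSON Feature whose geometry is the bounding-box polygon."""
    west, south, east, north = (round(v, 6) for v in (west, south, east, north))
    # Exterior ring in counterclockwise order, closed on its first vertex.
    ring = [[west, south], [east, south], [east, north], [west, north], [west, south]]
    return {
        "type": "Feature",
        "bbox": [west, south, east, north],
        "geometry": {"type": "Polygon", "coordinates": [ring]},
        "properties": dict(properties),
    }


def feature_to_json(feature):
    """Serialize the feature, keeping non-ASCII project names readable."""
    return json.dumps(feature, ensure_ascii=False)
```

Coordinates are ordered longitude first, then latitude, and rounded to six decimal places to match the stated precision.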
[0182] The dataset undergoes dual precision verification. Spatial accuracy verification involves randomly selecting 30 entries, requiring at least 95% spatial overlap with the semantic mask region and a coordinate deviation of no more than 0.5 meters. Attribute consistency verification requires a single cost error of no more than 0.1%, a total cost error of no more than 0.05%, and a field association accuracy of no less than 99%. If any verification fails, the corresponding steps are backtracked and optimized until the precision requirements are met, at which point a qualified engineering cost dataset is output.
[0183] As shown in Figure 4, step S5 involves determining the three-dimensional terrain range by parsing the spatial location in the engineering cost data, and associating the cost information with the terrain entity under dual constraints to form a three-dimensional terrain data result, including:
[0184] S501: Analysis of spatial location description and determination of spatial range for engineering projects.
[0185] Based on the engineering cost data set, the project name and geographical location description in each information item are extracted. The geographical location description is then subjected to structured parsing, and administrative division terms, road and water system reference terms, mileage marker terms, orientation and distance terms, and landmark constraint terms are broken down into standardized location elements.
[0186] Based on the location elements, the geocoding engine is invoked to output a set of candidate latitude and longitude coordinates. The spatial reference frame of the spatial benchmark data is used to unify the coordinates and normalize the precision of the candidate coordinate set. The range words and scale words contained in the geographic location description are then applied to the candidate coordinate set to generate a set of candidate spatial ranges for the project, expressed in two forms: latitude and longitude rectangles and polygon boundaries. Finally, the candidate spatial ranges are projected onto the spatial grid of the three-dimensional terrain entity to obtain a grid index range consistent with the three-dimensional terrain entity.
[0187] For example: Parse the geographical location description of "XX Industrial Park Jing San Road (from XX Avenue in the north to XX River in the south)" and break it down to obtain standardized location elements: administrative division "XX City XX District XX Industrial Park", road name "Jing San Road", and boundary reference "from XX Avenue in the north to XX River in the south". Convert it into latitude and longitude rectangular coordinates (118°23′10″-118°24′30″ east longitude, 32°18′20″-32°19′40″ north latitude) through a geocoding engine. After projecting it onto the three-dimensional terrain entity space grid, the corresponding grid index range is X: 1200-1500, Y: 800-1100.
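The degree-minute-second coordinates in this example can be converted to decimal degrees at six decimal places with a short helper. The function name and the accepted input pattern are assumptions for illustration.

```python
import re

# Matches strings such as 118°23′10″ (degree sign, prime, double prime).
DMS_PATTERN = re.compile(r"(\d+)°(\d+)′(\d+(?:\.\d+)?)″")


def dms_to_decimal(text, places=6):
    """Convert a degrees-minutes-seconds string to decimal degrees."""
    match = DMS_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a DMS coordinate: {text!r}")
    degrees = int(match.group(1))
    minutes = int(match.group(2))
    seconds = float(match.group(3))
    return round(degrees + minutes / 60 + seconds / 3600, places)
```

For instance, 118°23′10″ evaluates to 118 + 23/60 + 10/3600, which is about 118.386111 degrees.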
[0188] S502: Spatial consistency constraint verification and spatial range correction.
[0189] Based on the semantic mask set and spatial reference data, spatial consistency constraint verification is performed on the candidate spatial range set. The spatial consistency constraint is used to limit the correspondence between the candidate spatial range and the spatial boundary, spatial distribution of ground features, and grid index continuity of the target area. By calculating the spatial overlap and boundary fit between the candidate spatial range and the river semantic mask, building semantic mask, and road semantic mask in the semantic mask set, candidate spatial ranges that significantly conflict with the semantic mask set are identified, and range correction processing is performed on the conflicting candidate spatial ranges. The range correction processing includes: performing boundary clipping processing on the candidate spatial range according to the boundary of the semantic mask set, performing hole filling and fragment merging processing on the candidate spatial range according to the geographic coordinate continuity of the spatial reference data, and performing connected component filtering processing on the candidate spatial range according to the connectivity of the grid index, and finally determining the project spatial range that meets the spatial consistency constraint.
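Parts of the range correction can be sketched on sets of grid cells. The sketch assumes that a candidate range and a semantic mask are both represented as sets of `(x, y)` grid indices, which is an illustrative simplification; the function names are hypothetical, and hole filling is not shown.

```python
def overlap_ratio(candidate, mask):
    """Share of candidate grid cells that also belong to the mask."""
    return len(candidate & mask) / len(candidate) if candidate else 0.0


def clip_to_mask(candidate, mask):
    """Boundary clipping: keep only candidate cells inside the mask."""
    return candidate & mask


def largest_connected_component(cells):
    """Connected-component filtering using 4-neighbour grid connectivity."""
    remaining = set(cells)
    best = set()
    while remaining:
        start = remaining.pop()
        component, stack = {start}, [start]
        while stack:
            x, y = stack.pop()
            for neighbour in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if neighbour in remaining:
                    remaining.remove(neighbour)
                    component.add(neighbour)
                    stack.append(neighbour)
        if len(component) > len(best):
            best = component
    return best
```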
[0190] S503: Semantic consistency constraint verification and 3D terrain entity matching.
[0191] Based on the quantities and units of measurement in the engineering cost dataset, engineering attribute type discrimination rules are constructed to map quantities and units of measurement to engineering attribute types. Under the constraints of engineering attribute types, the dominant feature coverage type within the spatial range of the engineering project is determined based on a semantic mask set, and a semantic consistency constraint is established between the engineering attribute type and the dominant feature coverage type. The candidate matching set between the spatial range of the engineering project and the three-dimensional terrain entity is filtered through the semantic consistency constraint. Three-dimensional terrain entity areas whose dominant feature coverage type and engineering attribute type satisfy the semantic consistency constraint are retained, while three-dimensional terrain entity areas with inconsistent semantics are eliminated, forming the three-dimensional terrain entity matching result corresponding to the engineering project. The three-dimensional terrain entity matching result is represented by the grid index set of the three-dimensional terrain entity and the boundary polygon.
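The semantic consistency filter can be illustrated with a small rule table. Both dictionaries below are invented examples: the description does not list the actual discrimination rules, so the units, attribute types, and compatible cover types are assumptions chosen only to make the sketch runnable.

```python
from collections import Counter

# Hypothetical discrimination rules: unit of measurement -> engineering attribute type.
UNIT_TO_ATTRIBUTE = {"m3": "earthwork", "m2": "paving", "m": "linear_works"}

# Hypothetical compatibility table: attribute type -> acceptable dominant cover types.
COMPATIBLE_COVER = {
    "earthwork": {"bare_land", "vegetation"},
    "paving": {"road"},
    "linear_works": {"river", "road"},
}


def dominant_cover(labels):
    """Most frequent semantic label among the grid cells of a region."""
    return Counter(labels).most_common(1)[0][0]


def matching_regions(unit, regions):
    """Return ids of regions whose dominant cover is consistent with the unit.

    regions maps a region id to the list of semantic labels of its grid cells.
    """
    attribute = UNIT_TO_ATTRIBUTE.get(unit)
    allowed = COMPATIBLE_COVER.get(attribute, set())
    return [rid for rid, labels in regions.items()
            if labels and dominant_cover(labels) in allowed]
```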
[0192] S504: Establishment of the association mapping between engineering cost information and three-dimensional terrain entities.
[0193] Based on the matching results between the spatial scope of the engineering project and the three-dimensional terrain entities, an association mapping relationship is established between the engineering cost data set and the three-dimensional terrain entities. The association mapping relationship includes the mapping relationship from the project name to the three-dimensional terrain entity grid index set, the attribute attachment relationship from the total cost to the matching results of the three-dimensional terrain entities, and the attribute labeling relationship from the quantity and unit of measurement to the local area of the three-dimensional terrain entity. According to the association mapping relationship, the project name, quantity, unit of measurement and total cost in the engineering cost data set are written into the attribute fields corresponding to the three-dimensional terrain entities, and a spatial index identifier and a semantic category identifier are written for each mapping relationship to form a traceable association mapping record.
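One possible shape for a traceable association mapping record is sketched below. The class name, field names, and the `attach` helper are hypothetical; they simply mirror the items the paragraph says are written into the attribute fields.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AssociationRecord:
    """One cost entry linked to a matched 3D terrain entity area."""
    project_name: str
    grid_indices: tuple       # grid index set of the matched entity area
    quantity: float
    unit: str
    total_cost: float
    spatial_index_id: str     # spatial index identifier
    semantic_category: str    # semantic category identifier


def attach(entity_attributes, record):
    """Return a copy of the entity attributes with the record fields written in."""
    updated = dict(entity_attributes)
    updated.update(asdict(record))
    return updated
```

Returning a copy keeps the original entity attributes untouched, which makes it easier to trace what each association added.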
[0194] S505: Generation and Consistency Verification of 3D Terrain Data Results.
[0195] Based on the associated mapping record, the topographic geometric information of the three-dimensional terrain entity and the engineering cost information are fused and encapsulated to generate preliminary three-dimensional terrain data results. The preliminary three-dimensional terrain data results include the point cloud and grid data of the three-dimensional terrain entity, the spatial range of the engineering project, and the engineering economic attribute fields bound to the matching results of the three-dimensional terrain entity.
[0196] The initial 3D terrain data results undergo consistency verification, which includes: verifying the coordinate closure and grid index continuity of the project's spatial extent in the spatial reference data; verifying the consistency of the dominant feature coverage type between the project's spatial extent and the semantic mask set; and verifying the field integrity and numerical consistency of the project cost data set. If the consistency verification fails to meet the requirements, the process returns to step S502 to re-execute the spatial consistency constraint verification and spatial extent correction, and simultaneously returns to step S503 to re-execute the semantic consistency constraint verification and 3D terrain entity matching, until the consistency verification passes, and finally, the 3D terrain data results are output.
[0197] This invention discloses a three-dimensional terrain construction method based on multimodal semantic constraints. It constructs integrated spatial reference data of optical remote sensing imagery and digital surface models, performs joint semantic parsing of ground features under semantic constraints, generates a semantic mask set, and further performs regional geometric reconstruction of the digital surface model based on the mapping relationship between semantic categories and surface physical attributes, forming structurally reasonable three-dimensional terrain entities. Under spatial and semantic consistency constraints, the parsed engineering cost information is associated and mapped with the corresponding three-dimensional terrain entities, thereby achieving a unified expression of three-dimensional terrain geometric information and engineering economic attributes, improving the completeness and practicality of three-dimensional terrain results in engineering applications.
[0198] The above description is merely a preferred embodiment of the present invention. The scope of protection of the present invention is not limited to the above embodiments. All technical solutions falling within the scope of the present invention's concept are within the scope of protection of the present invention. It should be noted that for those skilled in the art, any improvements and modifications made without departing from the principles of the present invention should also be considered within the scope of protection of the present invention.
Claims
1. A method for constructing 3D terrain based on multimodal semantic constraints, characterized in that it includes the following steps: acquiring optical remote sensing images and a digital land model of the target area, performing texture enhancement processing on the optical remote sensing images, using the digital land model as a unified elevation benchmark, and performing geometric consistency correction on the enhanced optical remote sensing images to form spatial benchmark data; constructing, based on the spatial benchmark data, semantic constraint information that defines the semantic boundaries, structural morphology and spatial representation priors of ground features, and, under the constraint of the semantic constraint information, performing joint semantic parsing and structural consistency correction on the optical remote sensing images to generate a set of semantic masks corresponding to each ground feature cover type; establishing, based on the set of semantic masks, a mapping relationship between semantic categories and surface physical attributes, and performing regional geometric reconstruction on the digital surface model according to the mapping relationship to generate three-dimensional terrain entities; obtaining the engineering cost document corresponding to the target area, and performing layout analysis, character recognition, semantic understanding and table structure restoration on the engineering cost document to form an engineering cost data set; and, based on the engineering cost data set, under the constraints of spatial consistency and semantic consistency, associating the engineering cost information with the corresponding three-dimensional terrain entities to form a three-dimensional terrain data result.
The semantic constraint information construction and joint semantic parsing include: extracting the spatial expression characteristics of typical land features based on spatial benchmark data, constructing a multi-dimensional semantic constraint rule base covering semantic boundaries, structural morphology and spatial expression priors; quantifying the constraint rules into computable constraint factors and embedding them into the semantic segmentation model; performing pixel-level joint semantic parsing on optical remote sensing images to obtain initial distribution results; correcting them according to the inherent consistency rules of land feature spatial expression; and generating structurally continuous and semantically consistent land feature distribution results. By mapping engineering cost information with 3D terrain entities, a 3D terrain data product is generated, including: parsing the spatial location description in the engineering cost data set, splitting standardized location elements to determine the spatial range of the project, and matching the corresponding 3D terrain entity area after double constraint verification of spatial consistency and semantic consistency; based on the 3D terrain entity area, establishing a multi-dimensional association mapping relationship between the engineering cost data and the 3D terrain entity to which the area belongs, writing attribute fields to generate association records; integrating the terrain geometric information and engineering economic attributes of the 3D terrain entity, performing consistency verification, and outputting the 3D terrain data product.
2. The three-dimensional terrain construction method based on multimodal semantic constraints as described in claim 1, characterized in that, Establishing a mapping relationship between semantic categories and surface physical attributes to perform regional geometric reconstruction includes: By combining topographic features and engineering attributes of landforms, the inherent physical essence of core semantic categories in the semantic mask set is defined, and a four-dimensional mapping rule base is constructed. The semantic mask is used to assign values to the mapping rule base pixel by pixel, which is then transformed into a semantic-physical attribute constraint matrix with precise pixel-by-pixel constraints. Geometric reconstruction of river region, building region and non-building region is then performed.
3. The three-dimensional terrain construction method based on multimodal semantic constraints as described in claim 2, characterized in that, Geometric reconstruction of the river region, including: The set of river-covered pixels is extracted based on river semantic mask, and the elevation information in spatial reference data is combined with hydrological analysis algorithms to determine the main direction of the river. Based on the application of a directional consistency constraint that is continuous and laterally equipotential along the mainstream direction of the river, an elevation correction model is constructed to correct the elevation anomalies corresponding to the pixel set and fill the gaps by interpolation to generate a three-dimensional elevation surface of the river.
4. The three-dimensional terrain construction method based on multimodal semantic constraints as described in claim 2, characterized in that, Geometric reconstruction of the building area, including: The set of building-covered pixels is extracted based on the building semantic mask, and the ground base elevation of the corresponding area is calculated by the neighborhood elevation mean method. Based on the ground base elevation, the main building and the ground base are separated. A second trend surface fitting is performed on the base to restore the continuous ground, and the elevation connection is handled by setting a transition zone at the edge of the building.
5. A three-dimensional terrain construction method based on multimodal semantic constraints as described in claim 2, characterized in that, Geometric reconstruction of non-building areas, including: A dual continuity constraint of overall continuity and local undulation restriction is applied to the non-building areas in the semantic mask set. A global trend surface analysis algorithm is used to fit the regional elevation and identify non-structural disturbance points. The perturbation points are corrected by a weighted smoothing algorithm, and the terrain is adjusted in a targeted manner according to the constraint matrix parameters of the corresponding semantic category.
6. The three-dimensional terrain construction method based on multimodal semantic constraints as described in claim 1, characterized in that, Engineering cost document layout analysis and character recognition, including: The format normalization process is performed on multi-format engineering cost documents, and the image enhancement algorithm is used to optimize the image quality of paper scans, eliminating noise and distortion. Based on the optimized document image, a deep learning model based on Mask R-CNN is used to segment the document layout elements. Optical character recognition is performed on the segmented text blocks and table regions to generate text units bound with spatial location and page number information.
7. A three-dimensional terrain construction method based on multimodal semantic constraints as described in claim 6, characterized in that, Semantic understanding and table structure restoration of engineering cost documents, including: A language model fine-tuned on a corpus from the engineering cost field is used to analyze the field correspondence and context inheritance relationships between text units. Based on the parsing results, page row and column coordinates, and semantic association rules, the table structure is restored and cross-page tables are spliced together. Consistency reorganization is performed on the discrete fields in the restored table to form logically continuous standardized table data.
8. The three-dimensional terrain construction method based on multimodal semantic constraints as described in claim 7, characterized in that, The engineering cost data set is formed, including: Extract core cost information from standardized tabular data and text cells; perform numerical conversion and unit standardization on the extracted information; A geocoding engine is used to convert textual geographic location descriptions in the information into latitude and longitude coordinates. Combined with semantic mask boundary correction, a spatially oriented engineering cost data set is constructed.