Three-dimensional model reconstruction method, device, storage medium and computer equipment of building roof
By extracting explicit geometric features from point cloud data and combining them with multimodal feature fusion, the method addresses the trade-off between topological integrity and category recognition accuracy in building roof reconstruction and generates a high-quality 3D roof model.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- TIANFU JIANGXI LAB
- Filing Date
- 2026-05-15
- Publication Date
- 2026-06-12
AI Technical Summary
Existing methods for reconstructing building roofs struggle to balance topological integrity with the accuracy of roof category identification, resulting in inconsistent geometric quality of the reconstructed roof model.
Explicit geometric features are extracted from point cloud data to construct multidimensional geometric features, which are combined with projected images from multiple perspectives. Multimodal feature fusion is performed using the visual encoder and text encoder of a pre-trained visual language model. A geometry-driven prompt alignment module and a probabilistic geometric correction mechanism then determine the roof category, from which a 3D roof model is generated.
The method improves the accuracy of zero-shot roof classification, enhances classification robustness under noise, occlusion and local defects, and generates topologically consistent and boundary-complete 3D roof models.
Figure CN122199833A_ABST
Description
Technical Field
[0001] This application relates to the fields of remote sensing point cloud processing, 3D building reconstruction and computer vision technology, and in particular to a method, apparatus, storage medium and computer equipment for 3D model reconstruction of a building roof.
Background Technology
[0002] With the rapid development of smart cities, digital twins, and 3D spatial modeling, the automatic identification and 3D reconstruction of building roofs based on airborne LiDAR point clouds has become an important research direction in photogrammetry and remote sensing. The roof is a key component of a building, and the accuracy of its category determination and geometric reconstruction directly determines the structural correctness and representational quality of the final model.
[0003] Existing methods for reconstructing building roofs can be broadly categorized into data-driven and model-driven approaches. Data-driven methods typically extract geometric elements such as planes and ridges from point clouds and then progressively assemble them into a mesh model. However, this type of method is sensitive to thresholds, errors tend to accumulate layer by layer, and it struggles to consistently maintain the topological integrity of regular roofs. Its modeling capabilities are particularly insufficient for curved roofs such as cylinders, cones, and hemispheres. Model-driven methods, on the other hand, use predefined roof geometric primitives for parameter fitting to generate regular CAD-style models. However, this type of method relies heavily on accurate roof type identification; otherwise, incorrect category identification will lead to significant topological mismatches in subsequent parameter fitting.
[0004] On the other hand, visual language models have made significant progress in zero-shot recognition in recent years. A typical approach is to project 3D point clouds into multi-view depth maps, and then use a pre-trained visual encoder to compare and match with text prompts to complete the category determination. However, this type of method is mainly suitable for general objects with obvious morphological differences, such as airplanes and chairs. It is less adaptable to targets like building roofs, where the appearance between classes is highly similar and the key differences lie in the 3D structural relationships. Different roof types often have similar outer contours and aspect ratios, making it difficult to distinguish ridge structures, slope relationships, and curvature features using only 2D projected images. At the same time, airborne LiDAR point clouds generally suffer from uneven point density, occlusion, and noise interference, further weakening the reliability of zero-shot classification.
[0005] As a result, existing methods for reconstructing building roofs struggle to balance the topological integrity of the roof with the accuracy of roof category identification, making it difficult to guarantee the geometric quality of the reconstructed roof model.
Summary of the Invention
[0006] In view of this, the present application provides a method, apparatus, storage medium and computer equipment for reconstructing a three-dimensional model of a building roof. The main purpose is to solve the technical problem that existing building roof reconstruction methods struggle to balance the topological integrity of the roof with the accuracy of roof category identification, which makes the geometric quality of the reconstructed roof model difficult to guarantee.
[0007] According to one aspect of this application, a method for reconstructing a three-dimensional model of a building roof is provided, the method comprising:
acquiring point cloud data of the building roof to be modeled and preprocessing the point cloud data, wherein the point cloud data is acquired by an airborne LiDAR;
extracting explicit geometric features from the preprocessed point cloud data and constructing multidimensional geometric features;
generating projected images of the building roof from multiple perspectives, inputting each projected image into the visual encoder of a pre-trained visual language model to obtain visual embeddings, and constructing visual features by concatenation;
constructing a geometrically enhanced text prompt for each roof category, inputting the geometrically enhanced text prompt of each roof category into the text encoder of the visual language model to obtain text embeddings, and concatenating the text embedding of the same category multiple times to obtain text features with the same dimension as the visual features;
projecting the multidimensional geometric features into the feature space of the visual language model using a multilayer perceptron, reorganizing the visual features into multi-view visual features, performing geometry-to-vision cross-attention and vision-to-geometry reverse correction, and obtaining multimodal fusion features through residual fusion and layer normalization;
calculating the cosine similarity between the multimodal fusion features and the text features of each category to obtain a basic semantic score, and injecting the geometric matching score between the geometric descriptor of the building roof and the prior geometric template of each category into the basic semantic score to obtain the roof category of the building roof;
selecting, from a predefined roof primitive library, the parametric geometric primitive corresponding to the roof category of the building roof; and
constructing a parameter optimization objective based on the geometric deviation between the point cloud data and the parametric geometric primitive, and optimizing the shape parameters, position parameters, and orientation parameters of the parametric geometric primitive to generate a three-dimensional roof model of the building roof.
[0008] According to another aspect of this application, a three-dimensional model reconstruction device for a building roof is provided, the device comprising:
a point cloud data acquisition module, used to acquire point cloud data of the building roof to be modeled and to preprocess the point cloud data, wherein the point cloud data is acquired by an airborne LiDAR;
a geometric feature construction module, used to extract explicit geometric features from the preprocessed point cloud data and construct multidimensional geometric features;
a visual feature construction module, used to generate projected images of the building roof from multiple perspectives, input each projected image into the visual encoder of a pre-trained visual language model to obtain visual embeddings, and construct visual features by concatenation;
a text feature construction module, used to construct a geometrically enhanced text prompt for each roof category, input the geometrically enhanced text prompt of each roof category into the text encoder of the visual language model to obtain text embeddings, and concatenate the text embedding of the same category multiple times to obtain text features with the same dimension as the visual features;
a multimodal feature fusion module, used to project the multidimensional geometric features into the feature space of the visual language model using a multilayer perceptron, reorganize the visual features into multi-view visual features, perform geometry-to-vision cross-attention and vision-to-geometry reverse correction, and obtain multimodal fusion features through residual fusion and layer normalization;
a roof category identification module, used to calculate the cosine similarity between the multimodal fusion features and the text features of each category to obtain a basic semantic score, and to inject the geometric matching score between the geometric descriptor of the building roof and the prior geometric template of each category into the basic semantic score to obtain the roof category of the building roof; and
a roof model reconstruction module, used to select the parametric geometric primitive corresponding to the roof category of the building roof from a predefined roof primitive library, construct a parameter optimization objective based on the geometric deviation between the point cloud data and the parametric geometric primitive, optimize the shape parameters, position parameters and orientation parameters of the parametric geometric primitive, and generate a three-dimensional roof model of the building roof.
[0009] According to another aspect of this application, a storage medium is provided that stores a computer program thereon, which, when executed by a processor, implements the above-described method for reconstructing a three-dimensional model of a building roof.
[0010] According to another aspect of this application, a computer device is provided, including a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor, wherein the processor executes the program to implement the above-described method for reconstructing a three-dimensional model of a building roof.
[0011] By employing the above technical solutions, the embodiments of this application provide a method, apparatus, storage medium, and computer device for reconstructing a 3D model of a building roof. Extracting explicit geometric features from the raw point cloud data, constructing multidimensional geometric features of the building roof, and combining them with projected images from multiple perspectives enhances the ability of the multimodal representation to perceive roof topological features. Furthermore, a pre-trained visual encoder and text encoder are used to obtain visual and textual features of the building roof, a geometry-driven prompt alignment module achieves trimodal fusion of geometry, vision, and semantics, and a probabilistic geometric correction mechanism replaces rigid rule filtering when determining the roof category. This can effectively improve the accuracy of zero-shot roof classification and enhance classification robustness under noise, occlusion, and local defects. In addition, the semantic prior provided by the roof classification result drives the parametric fitting of the corresponding roof geometric primitive, so that planar and curved roofs can be modeled in a unified, regularized way, generating a topologically consistent, boundary-complete, and CAD-compatible 3D roof model. The method therefore covers the modeling needs of both planar and curved roofs and can effectively improve both the accuracy of roof classification and the geometric quality of the 3D roof model.
[0012] The above description is only an overview of the technical solution of this application. In order to better understand the technical means of this application and to implement it in accordance with the contents of the specification, and to make the above and other objects, features and advantages of this application more obvious and understandable, the following are specific embodiments of this application.
Description of the Drawings
[0013] The accompanying drawings, which are included to provide a further understanding of this application and form part of this application, illustrate exemplary embodiments and are used to explain this application, but do not constitute an undue limitation of this application. In the drawings:
Figure 1 is a flowchart of a method for reconstructing a three-dimensional model of a building roof according to an embodiment of this application;
Figure 2 is a schematic diagram of a multidimensional geometric feature extraction process provided in an embodiment of this application;
Figure 3 is a schematic diagram of another multidimensional geometric feature extraction process provided in an embodiment of this application;
Figure 4 is a schematic diagram of a multi-view projection of a building roof provided in an embodiment of this application;
Figure 5 is a flowchart of geometry-driven prompt alignment and probabilistic geometric correction for a building roof according to an embodiment of this application;
Figure 6 is a schematic diagram of a parametric geometric primitive provided in an embodiment of this application;
Figure 7 is a schematic diagram of a reconstruction result of a three-dimensional roof model provided in an embodiment of this application.
Detailed Implementation
[0014] The present application will be described in detail below with reference to the accompanying drawings and embodiments. It should be noted that, unless otherwise specified, the embodiments and features described in the embodiments of the present application can be combined with each other.
[0015] In one embodiment, as shown in the flowchart of Figure 1, a method for reconstructing a 3D model of a building roof is provided. Taking the application of this method to a computer device as an example, the method includes the following steps:
Step 1: Obtain point cloud data of the building roof to be modeled and preprocess the point cloud data. The point cloud data is acquired by airborne LiDAR.
[0016] Point cloud data refers to a set of three-dimensional spatial points obtained by scanning the roof of a single building using an airborne LiDAR. Each point in the point cloud data contains three-dimensional coordinate information. Preprocessing includes denoising, normalization, and coordinate alignment of the point cloud data. Denoising is used to remove isolated and outlier points from the point cloud, normalization is used to scale the point cloud coordinates to a uniform scale, and coordinate alignment is used to transform the point cloud to a standard coordinate system.
[0017] Specifically, when performing 3D reconstruction of a building roof, it is first necessary to obtain the airborne LiDAR point cloud data of the roof to be modeled, which can be denoted as:
$$P = \{\, p_i \mid i = 1, 2, \dots, N \,\}, \qquad p_i = (x_i, y_i, z_i)$$
[0018] where $P$ is the point cloud data, $i$ is the point index, $N$ is the number of points, $p_i$ is the three-dimensional coordinate of the $i$-th point, and $x_i$, $y_i$, $z_i$ are the coordinates of the $i$-th point on the three coordinate axes.
[0019] Furthermore, after acquiring the point cloud data, denoising operations can be performed on the input point cloud data, such as using statistical filtering or radius filtering to remove outliers. Then, coordinate normalization is performed on the point cloud data, that is, scaling the minimum bounding box of the point cloud to a unit scale. Finally, coordinate alignment is performed so that the principal axis direction of the point cloud data is consistent with the coordinate axis direction of the global coordinate system. By preprocessing the point cloud data, the input requirements for subsequent geometric feature extraction, multi-view projection generation, and parametric fitting can be met.
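The preprocessing chain described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the function names, the neighbour count `k`, the ratio `std_ratio`, and the choice to align before scaling (so that the axis-aligned box can stand in for the minimum bounding box) are all hypothetical, and the brute-force pairwise distances are only suitable for small clouds.

```python
import numpy as np

def remove_statistical_outliers(points, k=8, std_ratio=2.0):
    """Statistical filter: drop points whose mean k-nearest-neighbour distance is unusually large."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)  # (N, N), O(N^2) memory
    knn = np.sort(dist, axis=1)[:, 1:k + 1]        # column 0 is the distance to the point itself
    mean_knn = knn.mean(axis=1)
    keep = mean_knn <= mean_knn.mean() + std_ratio * mean_knn.std()
    return points[keep]

def align_to_principal_axes(points):
    """Centre the cloud and rotate it about Z so its dominant horizontal direction lies along X."""
    centred = points - points.mean(axis=0)
    _, eigvecs = np.linalg.eigh(np.cov(centred[:, :2].T))  # eigenvalues in ascending order
    major = eigvecs[:, -1]                                  # direction of largest horizontal spread
    angle = np.arctan2(major[1], major[0])
    c, s = np.cos(-angle), np.sin(-angle)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return centred @ rot.T

def normalize_to_unit_box(points):
    """Scale uniformly so the longest side of the axis-aligned bounding box has length 1."""
    extent = points.max(axis=0) - points.min(axis=0)
    return points / extent.max()

def preprocess(points):
    """Denoise, align, then normalise (ordering is an assumption of this sketch)."""
    return normalize_to_unit_box(align_to_principal_axes(remove_statistical_outliers(points)))
```

Calling `preprocess(cloud)` on an `(N, 3)` array returns a cleaned cloud whose longest horizontal side runs along the X axis with unit length.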
[0020] Step 2: Extract explicit geometric features from the preprocessed point cloud data and construct multidimensional geometric features.
[0021] Explicit geometric features are numerical indicators that describe the structural attributes of the roof and are calculated directly from the point cloud data. The multidimensional geometric feature is a vector composed of several such indicators, for example the total number of roof patches, the number of horizontal roof patches, the number of sloping roof patches, the number of triangular roof patches, the number of quadrilateral roof patches, the symmetry indicator, the contour concavity indicator, the valley line structure indicator, and the curved-surface existence indicator.
[0022] Specifically, a region growing algorithm constrained by normal vector consistency and projection coplanarity can be used to extract a set of roof patches from the unordered point cloud data. For each roof patch in the set, the angle between its normal vector and the vertical direction is calculated. If the angle is less than a preset angle (e.g., 5°), the roof patch is counted as a horizontal roof patch; otherwise, it is counted as a sloping roof patch. The total number of roof patches is also recorded. Next, the point cloud of each roof patch is projected onto a local two-dimensional plane, and an initial vertex sequence is obtained using a boundary extraction operator. Redundant vertices are then removed using recursive angle discrimination. If the number of polygon vertices finally retained is 3, the roof patch is counted as a triangular roof patch; if it is 4, it is counted as a quadrilateral roof patch.
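The horizontal versus sloping count can be illustrated with a short NumPy sketch. It assumes the roof patches have already been segmented (the region growing step is not shown) and estimates each patch normal by a least-squares plane fit via SVD, which is a choice of this sketch rather than something the text specifies. The names `patch_normal` and `count_patch_types` are hypothetical.

```python
import numpy as np

def patch_normal(patch_points):
    """Least-squares plane normal of one patch, oriented to point upwards (+Z)."""
    centred = patch_points - patch_points.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    normal = vt[-1]                      # right-singular vector of the smallest singular value
    return normal if normal[2] >= 0 else -normal

def count_patch_types(patches, max_horizontal_angle_deg=5.0):
    """Count horizontal and sloping patches from the angle between each normal and the vertical."""
    counts = {"total": len(patches), "horizontal": 0, "sloped": 0}
    for patch in patches:
        n = patch_normal(patch)
        angle = np.degrees(np.arccos(np.clip(n[2], -1.0, 1.0)))  # angle to the +Z axis
        counts["horizontal" if angle < max_horizontal_angle_deg else "sloped"] += 1
    return counts
```

Each element of `patches` is an `(M, 3)` array of points belonging to one patch; the 5° default mirrors the example threshold given in the text.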
[0023] Furthermore, the point cloud is rotated 180 degrees about the Z-axis through its center point, and the average distance between the rotated point set and the original point set is calculated. If this distance is less than a preset threshold, the symmetry indicator is set to 1; otherwise, it is 0. Next, the roof outline area and the area of its minimum bounding rectangle are extracted, and their ratio is calculated. If the ratio is less than a preset threshold, the contour concavity indicator is set to 1; otherwise, it is 0. Then, for two spatially adjacent slopes, a concavity measure is constructed based on the relative relationship between the elevation of their center points and the elevation of their intersection line. If the concavity measure is greater than a preset threshold, the valley line structure indicator is set to 1; otherwise, it is 0. Finally, for any point in the point cloud, the target tangential direction is solved within its neighborhood, the tangential displacement and normal vector deviation of the neighborhood points along this direction are calculated, and a quadratic curve is fitted. If the fitting error is consistently lower than a preset threshold in multiple neighborhoods, the curved-surface existence indicator is set to 1; otherwise, it is 0.
[0024] In this embodiment, the thresholds for the various parameters can be adaptively adjusted based on point cloud density and building scale; no specific limitations are imposed in this embodiment. By quantifying the implicit three-dimensional topological structure of the point cloud data as explicit geometric descriptors, this step can compensate for the shortcomings of two-dimensional projection in expressing structural information such as ridges, valleys, and curvature, thereby providing interpretable geometric priors for subsequent multimodal fusion.
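As an illustration of the first indicator only, the rotational symmetry test can be sketched as below. The sketch assumes "average distance" means the mean nearest-neighbour distance from each rotated point to the original set, and the `threshold` value is a placeholder; neither detail is fixed by the text.

```python
import numpy as np

def symmetry_indicator(points, threshold=0.05):
    """Return 1 if the cloud maps onto itself under a 180-degree rotation about the vertical axis."""
    centre = points.mean(axis=0)
    rotated = points.copy()
    # A 180-degree rotation about the Z-axis through the centre mirrors x and y about the centre.
    rotated[:, :2] = 2.0 * centre[:2] - points[:, :2]
    dist = np.linalg.norm(rotated[:, None, :] - points[None, :, :], axis=2)  # brute force, O(N^2)
    mean_nn = dist.min(axis=1).mean()    # mean distance from each rotated point to its nearest original
    return int(mean_nn < threshold)
```

A gable roof sampled on a symmetric grid yields 1, whereas a single-pitch (shed) roof yields 0 because its high and low eaves swap under the rotation.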
[0025] Step 3: Generate projected images of the building roof from multiple perspectives, input each projected image into the visual encoder of the pre-trained visual language model to obtain visual embeddings, and construct visual features by concatenation.
[0026] In this context, a projected image refers to a depth map or range image generated by mapping 3D point cloud data onto a 2D plane using methods such as perspective projection or orthographic projection. A visual language model is a pre-trained model capable of encoding both images and text and aligning them across modalities; specifically, it can be implemented using the CLIP model (Contrastive Language-Image Pre-training). The visual encoder is the neural network module in the visual language model that converts images into feature vectors. A visual embedding is the fixed-dimensional feature vector output by the visual encoder, and concatenation refers to joining multiple vectors end-to-end to form a higher-dimensional vector.
[0027] Specifically, projection images of a building roof from multiple perspectives can be generated using methods such as perspective projection or orthographic projection. For example, depth projection images of the building roof can be generated from six main perspectives and four random auxiliary perspectives, totaling ten projection images. The main perspectives include typical directions such as frontal, side, and top views, while the auxiliary perspectives are generated at random angles in space to increase perspective diversity. Each projection image can then be input into the visual encoder of a pre-trained visual language model, which outputs a 512-dimensional visual embedding for each image. Next, the ten 512-dimensional visual embeddings can be concatenated in a fixed order to construct a 5120-dimensional global visual feature vector. In other embodiments, the number of main perspectives can be adjusted to four or eight, and the number of auxiliary perspectives can be adjusted to two or six, etc., and this embodiment does not impose specific limitations. This step, by acquiring projection images from multiple perspectives, can capture the outer contour and depth variation information of the roof from multiple spatial directions, thereby enhancing the ability of visual features to represent the three-dimensional topological structure of the roof and alleviating the problem of difficulty in distinguishing roofs with similar appearances from a single perspective.
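A minimal sketch of the projection and concatenation steps follows, assuming orthographic projection and only three of the views for brevity. The pre-trained visual encoder is not reproduced: `placeholder_encoder` is a fixed random projection that merely stands in for it so the shapes can be followed, and all names, the image size, and the brightness mapping are assumptions of the sketch.

```python
import numpy as np

def rot_x(deg):
    a = np.radians(deg); c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def rot_y(deg):
    a = np.radians(deg); c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

# Each rotation turns the viewing direction into "looking down the Z axis from above".
VIEWS = {"top": np.eye(3), "from_plus_y": rot_x(90), "from_plus_x": rot_y(-90)}

def depth_image(points, rotation, size=32):
    """Orthographic depth image: brighter pixels are closer to the viewer, empty pixels are 0."""
    p = points @ rotation.T
    xy, z = p[:, :2], p[:, 2]
    lo = xy.min(axis=0)
    span = max((xy.max(axis=0) - lo).max(), 1e-9)
    cols, rows = np.round((xy - lo) / span * (size - 1)).astype(int).T
    closeness = 0.2 + 0.8 * (z - z.min()) / max(z.max() - z.min(), 1e-9)  # in [0.2, 1.0]
    image = np.zeros((size, size))
    np.maximum.at(image, (rows, cols), closeness)   # keep the closest point per pixel
    return image

def placeholder_encoder(image, dim=512):
    """Stand-in for the visual encoder (NOT CLIP): fixed random projection, L2-normalised."""
    rng = np.random.default_rng(0)
    v = rng.standard_normal((dim, image.size)) @ image.ravel()
    return v / (np.linalg.norm(v) + 1e-12)

def build_visual_feature(points, encode_image, views=VIEWS):
    """Encode one depth image per view and concatenate the embeddings in a fixed order."""
    return np.concatenate([encode_image(depth_image(points, r)) for r in views.values()])
```

With ten views and a 512-dimensional encoder, the same `build_visual_feature` call would yield the 5120-dimensional vector described above; with the three views here it yields 1536 dimensions.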
[0028] Step 4: For each roof category, construct a geometrically enhanced text prompt and input the geometrically enhanced text prompt of each roof category into the text encoder of the visual language model to obtain text embeddings. Concatenate the text embedding of the same category multiple times to obtain text features with the same dimension as the visual features.
[0029] Here, a geometrically enhanced text prompt is a descriptive text designed for each roof category that contains semantic information such as the roof's outline shape, slope relationship, number of ridges, depth variations, and surface features. The text encoder is the neural network module in the visual language model that converts text into feature vectors, and a text embedding is the fixed-dimensional feature vector output by the text encoder.
[0030] Specifically, for each predefined roof category, including gable roofs, hip roofs, cylindrical roofs, conical roofs, and hemispherical roofs, a geometrically enhanced text prompt is constructed. For example, for a gable roof, the constructed text prompt could be "a roof with two symmetrical sloping surfaces, a horizontal ridge, and a rectangular outline." Then, the geometrically enhanced text prompt of each category is input into the text encoder of the pre-trained visual language model, which outputs a 512-dimensional text embedding for each category. Since the visual features constructed in Step 3 are 5120-dimensional, the 512-dimensional text embedding of each category can be concatenated with itself 10 times to form a 5120-dimensional text feature vector with a matching dimension. By explicitly encoding the characteristic structural semantics of the roof into the text prompt, this step lets the text feature space carry rich topological attribute descriptions rather than relying solely on category names, thereby effectively improving the alignment between text features and visual and geometric features.
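The dimension matching can be sketched with `np.tile`. The example also checks a consequence worth knowing: if every per-view embedding and the text embedding are L2-normalised (an assumption of this sketch, not stated in the text), the cosine similarity between the concatenated 5120-dimensional vectors equals the mean of the ten per-view cosine similarities. Random vectors stand in for real encoder outputs.

```python
import numpy as np

def tile_text_embedding(text_embedding, n_views):
    """Repeat one category's text embedding n_views times to match the visual feature length."""
    return np.tile(text_embedding, n_views)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Random stand-ins for encoder outputs (hypothetical data, unit-normalised).
rng = np.random.default_rng(0)
views = rng.standard_normal((10, 512))
views /= np.linalg.norm(views, axis=1, keepdims=True)
text = rng.standard_normal(512)
text /= np.linalg.norm(text)

visual_feature = views.reshape(-1)              # 10 x 512 -> 5120, fixed view order
text_feature = tile_text_embedding(text, 10)    # 512 -> 5120

global_score = cosine(visual_feature, text_feature)
per_view_mean = float(np.mean(views @ text))    # average of the ten per-view cosines
```

Under that normalisation assumption, `global_score` and `per_view_mean` agree up to floating-point error, so tiling the text embedding amounts to averaging the per-view matches.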
[0031] Step 5: Use a multilayer perceptron to project the multidimensional geometric features into the feature space of the visual language model, reorganize the visual features into multi-view visual features, and perform geometry-to-vision cross-attention and vision-to-geometry reverse correction. Multimodal fusion features are obtained through residual fusion and layer normalization.
[0032] Here, a multilayer perceptron is a fully connected neural network consisting of an input layer, multiple hidden layers, and an output layer. The feature space is the common vector space used to represent visual and textual features in the visual language model. Cross-attention is an attention mechanism in which query vectors from one modality compute a weighted aggregation over key-value pairs from another modality, enabling information exchange between the two. Residual fusion refers to adding the input features element-wise to the transformed output features. Layer normalization refers to standardizing the components of a feature vector so that they have a mean of 0 and a variance of 1.
[0033] Specifically, a multilayer perceptron with four hidden layers can first be used to project the multidimensional geometric features obtained in Step 2 to the feature space dimension of the visual language model, such as 512 dimensions, to obtain geometric projection features. Then, the 5120-dimensional visual features obtained in Step 3 are reorganized by view into 10×512 multi-view visual features. Next, a geometry-to-vision cross-attention operation is performed: the geometric projection features serve as the query vector and the multi-view visual features serve as the keys and values, attention weights are calculated, and the visual features are aggregated with these weights to obtain geometrically enhanced visual features. Then, a vision-to-geometry reverse correction operation is performed: the geometrically enhanced visual features are averaged along the view dimension to obtain a 512-dimensional visual mean vector, this mean vector is used as the query vector, and the original geometric projection features are used as the key and value to calculate reverse attention, yielding visually enhanced geometric features. Finally, the visually enhanced geometric features are added to the original geometric projection features through a residual connection, and the sum is passed through layer normalization to obtain the final multimodal fusion features.
[0034] In this embodiment, a balancing coefficient can be set during residual fusion. This coefficient adjusts the relative weight between visual semantic features and geometric constraint features, and its value lies between 0 and 1, for example, 0.5. In other embodiments, the number of hidden layers in the multilayer perceptron can also be set to 3 or 5, and the balancing coefficient can be adjusted according to performance on the validation set. This embodiment does not impose specific limitations. Through the bidirectional cross-attention mechanism, this step enables deep collaboration among geometric, visual, and textual features: geometric features act as structural priors that guide visual features to focus on key topological regions, while visual features in turn correct the geometric representation, so that the fused features possess both the interpretability of explicit geometry and the richness of visual semantics.
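The building blocks of this step can be sketched in NumPy as below. The sketch is deliberately partial: it covers the geometry-to-vision attention, the weighted residual fusion, and layer normalization, while the multilayer perceptron, any learned query/key/value projection matrices, and the vision-to-geometry step are omitted. The scaling by the square root of the dimension, the way `alpha` enters the sum, and the function names are assumptions of the sketch.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def layer_norm(x, eps=1e-5):
    """Standardise the components of one feature vector to zero mean and unit variance."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def geometry_to_vision(geom_proj, visual_feature, n_views=10):
    """Use the projected geometry as the query over per-view visual embeddings (keys = values)."""
    views = visual_feature.reshape(n_views, -1)              # (10, 512)
    scores = views @ geom_proj / np.sqrt(geom_proj.size)     # one score per view
    weights = softmax(scores)                                # attention weights, sum to 1
    return weights, weights @ views                          # weighted aggregate, shape (512,)

def residual_fuse(geom_proj, attended, alpha=0.5):
    """Blend attended features with the geometric projection, then apply layer normalization."""
    return layer_norm(alpha * attended + (1.0 - alpha) * geom_proj)
```

The returned `weights` show which views the geometric query emphasises, which can be a convenient diagnostic when a roof is occluded from some directions.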
[0035] Step 6: Calculate the cosine similarity between the multimodal fusion features and the text features of each category to obtain the basic semantic score. Then, inject the geometric matching score between the geometric descriptor of the building roof and the prior geometric template of each category into the basic semantic score to obtain the roof category of the building roof.
[0036] Here, cosine similarity refers to the cosine of the angle between two vectors, used to measure the directional similarity of the vectors. The basic semantic score is the score vector composed of the cosine similarities between the multimodal fusion features and the text features of each category. The geometric descriptors are the components of the multidimensional geometric features constructed in Step 2. The prior geometric template is the range of values or ideal feature values of typical geometric features, derived in advance from statistics or defined manually for each roof category. The geometric matching score is the score obtained by calculating the similarity between the geometric descriptors of the current sample and the prior geometric template using a Gaussian kernel function. Additive bias refers to the operation of adding the geometric matching score as an additional term to the basic semantic score.
[0037] Specifically, after obtaining the multimodal fusion features, the cosine similarity between the multimodal fusion features and the text features of each category obtained in Step 4 can be calculated to obtain a basic semantic score vector. Each element in this vector corresponds to the semantic matching degree of one roof category. Then, for each roof category, the squared difference between each geometric descriptor of the input sample and the corresponding component of that category's prior geometric template is calculated and substituted into the Gaussian kernel function to obtain the matching score between the descriptor and the template. Next, the matching scores of all geometric descriptors are summed to obtain the total geometric matching score for the category. Then, a geometric intervention coefficient lambda is set, and the geometric matching score multiplied by lambda is added as an additive bias to the basic semantic score of the corresponding category to obtain the corrected classification score. Finally, the softmax (normalized exponential) function is applied to the corrected classification scores, and the roof category with the highest probability is taken as the final category determination result for the building roof.
[0038] In this embodiment, the lambda value can be set according to the actual situation, such as 0.3. In other embodiments, the geometric matching score can also be calculated using the negative exponential form of the Mahalanobis distance or the Euclidean distance. By employing a probabilistic geometric correction mechanism, this step allows the geometric prior to intervene in the classification decision as a smooth additive bias, thereby avoiding the problem of hard threshold filtering wrongly rejecting the correct category under point cloud noise, occlusion, or local defects, and significantly improving classification robustness in complex scenes.
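The scoring rule can be written compactly in NumPy. This is a hedged sketch: the kernel bandwidth `sigma` is an assumption (the text does not give one), the function name is hypothetical, and in practice the descriptor components would probably need rescaling so that counts and binary indicators contribute comparably.

```python
import numpy as np

def classify(fused, text_features, descriptor, templates, lam=0.3, sigma=1.0):
    """Cosine semantic score plus a Gaussian-kernel geometric bias, followed by softmax.

    fused:         (d,)   multimodal fusion feature
    text_features: (C, d) one text feature per roof category
    descriptor:    (D,)   geometric descriptor of the sample
    templates:     (C, D) prior geometric template per category
    """
    fused_n = fused / np.linalg.norm(fused)
    text_n = text_features / np.linalg.norm(text_features, axis=1, keepdims=True)
    semantic = text_n @ fused_n                                     # basic semantic score, (C,)
    sq_diff = (descriptor[None, :] - templates) ** 2                # (C, D)
    geometric = np.exp(-sq_diff / (2.0 * sigma ** 2)).sum(axis=1)   # summed kernel scores, (C,)
    corrected = semantic + lam * geometric                          # additive bias
    e = np.exp(corrected - corrected.max())
    probs = e / e.sum()                                             # softmax
    return int(np.argmax(probs)), probs
```

Because the geometric term only shifts the scores, a category whose template disagrees with the descriptor is down-weighted rather than eliminated, which is the contrast with hard threshold filtering drawn above.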
[0039] Step 7: Select the parametric geometric primitive corresponding to the roof category of the building roof from the predefined roof primitive library, construct the parameter optimization objective based on the geometric deviation between the point cloud data and the parametric geometric primitive, optimize the shape parameters, position parameters and orientation parameters of the parametric geometric primitive, and generate a three-dimensional roof model of the building roof.
[0040] In one implementation, the selected parametric geometric primitive can be sampled as a model point set, and the bidirectional chamfer distance between the point cloud data and the model point set can be used as the geometric error term to fit and optimize the shape parameters, position parameters, and orientation parameters of the parametric geometric primitive, thereby generating a three-dimensional roof model of the building roof.
[0041] The roof primitive library refers to a pre-established set of parametric geometric models, each of which corresponds to one type of roof structure and contains the mathematical expression and the optimizable parameter vector for that type of roof. A parametric geometric primitive is a mathematical model that describes a geometric shape using a finite number of parameters, and can include planar roof primitives and curved roof primitives. The model point set refers to the set of three-dimensional points obtained by sampling the parametric geometric primitive according to its current parameter values. The bidirectional chamfer distance refers to the sum of the two unidirectional chamfer distances, from the original point cloud to the model point set and from the model point set to the original point cloud, and is used to measure the shape difference between the two point sets.
[0042] Specifically, after the roof category of the building roof is obtained, the corresponding parametric geometric primitive can be selected from the predefined roof primitive library. For example, if the roof category is a planar roof type such as a gable roof or a four-sloped roof, a planar roof primitive is selected, whose parameter vector can include parameters such as length, width, height, ridge position, and slope angle; if the category is a curved roof type such as a cylindrical roof, conical roof, or hemispherical roof, a curved roof primitive is selected, whose parameter vector can include parameters such as length, width, height, surface radius, central axis direction, and curvature control.
[0043] Furthermore, a parameter optimization objective can be constructed based on the geometric deviation between the point cloud data and the parametric geometric primitive. The shape, position, and attitude parameters of the parametric geometric primitive are then optimized, and a three-dimensional roof model of the building roof is generated based on the optimized parameters. For example, in one implementation, the selected parametric geometric primitive can be sampled into a model point set according to its current parameter values, with a sampling density comparable to that of the original point cloud. The objective function is then defined as the bidirectional chamfer distance between the original point cloud and the model point set, i.e., the sum of the squared distances from each point in the original point cloud to its nearest point in the model point set, plus the sum of the squared distances from each point in the model point set to its nearest point in the original point cloud. The target parameters that minimize the objective function are then solved using an iterative optimization method such as gradient descent or the Levenberg-Marquardt algorithm. After the target parameters are obtained, a corresponding three-dimensional roof computer-aided design (CAD) model can be generated from them and output in the object file format (OBJ) or the polygon file format (PLY). In other implementations, the Hausdorff distance or the earth mover's distance can also be used as the objective function, and the optimization algorithm can be particle swarm optimization or a genetic algorithm.
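The objective described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of this application: `sample_flat_patch` is a hypothetical one-parameter primitive, the nearest-neighbor search is brute force, and a simple scan over candidate heights stands in for gradient descent or the Levenberg-Marquardt algorithm.

```python
def _nearest_squared_distance(point, cloud):
    """Squared Euclidean distance from `point` to its nearest neighbor in `cloud`."""
    return min(sum((a - b) ** 2 for a, b in zip(point, q)) for q in cloud)


def bidirectional_chamfer(cloud, model):
    """Sum of squared nearest-neighbor distances in both directions."""
    forward = sum(_nearest_squared_distance(p, model) for p in cloud)
    backward = sum(_nearest_squared_distance(q, cloud) for q in model)
    return forward + backward


def sample_flat_patch(length, width, height, step=1.0):
    """Hypothetical primitive: a horizontal rectangle sampled on a regular grid."""
    nx = int(round(length / step)) + 1
    ny = int(round(width / step)) + 1
    return [(i * step, j * step, height) for i in range(nx) for j in range(ny)]


def fit_height(cloud, length, width, candidates):
    """Pick the candidate height with the smallest chamfer objective."""
    return min(
        candidates,
        key=lambda h: bidirectional_chamfer(cloud, sample_flat_patch(length, width, h)),
    )


# Toy example: the "observed" cloud is the primitive itself at height 3.0.
observed = sample_flat_patch(2.0, 2.0, 3.0)
best_height = fit_height(observed, 2.0, 2.0, [2.0, 2.5, 3.0, 3.5])
```

A practical implementation would replace the brute-force search with a spatial index and optimize all shape, position, and attitude parameters jointly.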
[0044] This step uses the roof classification results to constrain the primitive types of parametric modeling, which avoids repeated trial and error between incompatible geometric models and fundamentally eliminates the topological mismatch problem caused by category errors. At the same time, it can accurately fit the regular boundaries of planar roofs and the curvature characteristics of curved roofs, and output a regularized model that can be used directly.
[0045] In a specific application example, suppose the input is airborne LiDAR point cloud data of the roof of a single building, which is actually a gable roof with an attic. First, the point cloud data undergoes preprocessing such as denoising, normalization, and coordinate alignment. Then, a region growing algorithm is used to extract four roof patches: four inclined patches and zero horizontal patches. Next, contour vertices are extracted to obtain two quadrilateral patches and two triangular patches. Furthermore, a symmetry index of 1, a contour concavity / convexity index of 0, a valley structure index of 0, and a surface existence index of 0 are extracted, thus constructing a 9-dimensional geometric feature vector. Then, depth projection maps for six primary views and four auxiliary views are generated and input into the visual encoder of the visual language model, resulting in ten 512-dimensional visual embeddings, which are concatenated to form a 5120-dimensional visual feature. Next, geometrically enhanced text prompts are constructed and processed by the text encoder of the visual language model to obtain the text features of each roof category. After bidirectional cross-attention fusion, the multimodal fusion feature shows the highest cosine similarity to the text feature of the four-sloped roof, so the input is preliminarily identified as a four-sloped roof. The geometric matching score calculation also shows a high degree of matching between the sample's geometric features and the four-sloped roof template, so the additive correction further raises the score of this category, and the final output category is a four-sloped roof. Finally, a parametric model of a four-sloped roof is selected from the primitive library, including parameters such as length, width, height, ridge offset, and slope angle. After bidirectional chamfer distance optimization, a regularized 3D roof model is generated and output as an OBJ file.
[0046] The above embodiments extract explicit geometric features from raw point cloud data to construct multi-dimensional geometric features of building roofs. By combining projected images from multiple perspectives, the perceptual ability of multimodal representation of roof topology features can be enhanced. Furthermore, by utilizing pre-trained visual and text encoders to acquire visual and textual features of building roofs, and through a geometry-driven cue alignment module, trimodal fusion of geometry, vision, and semantics is achieved. Finally, a probabilistic geometric correction mechanism replaces rigid rule filtering to complete roof category determination, effectively improving the accuracy of zero-shot roof classification and enhancing classification robustness under noise, occlusion, and local defects. In addition, by utilizing the semantic priors provided by the building roof classification results to drive the parametric fitting of corresponding roof geometric primitives, unified regularized modeling of planar and curved roofs can be achieved, thereby generating topologically consistent, boundary-complete, and CAD-compatible 3D roof models. The above methods address the modeling needs of both planar and curved roofs, effectively improving the accuracy of roof classification and the geometric quality of 3D roof models.
[0047] In one embodiment, in step 2, to enhance the interpretability and robustness of roof category determination, the following multidimensional geometric features are extracted from the point cloud data:
$$F_{geo} = [N_{total}, N_{h}, N_{s}, N_{tri}, N_{quad}, I_{sym}, I_{cc}, I_{val}, I_{surf}] \in \mathbb{R}^{9}$$
[0048] where $F_{geo}$ denotes the multidimensional geometric features; $N_{total}$ is the total number of roof panels; $N_{h}$ is the number of horizontal roof panels; $N_{s}$ is the number of sloping roof panels; $N_{tri}$ is the number of triangular roof panels; $N_{quad}$ is the number of quadrilateral roof panels; $I_{sym}$ is the symmetry index; $I_{cc}$ is the contour concavity / convexity index; $I_{val}$ is the valley line structure index; $I_{surf}$ is the surface existence index; and $\mathbb{R}^{9}$ is the vector space containing the multidimensional geometric features.
[0049] In this embodiment, step 2 includes the following steps: Step 2.1: Using a region growing method constrained by normal vector consistency and projection coplanarity, extract the roof patch set from the point cloud data and calculate the total number of roof patches in the roof patch set.
[0050] Step 2.2: Calculate the angle between the normal vector and the vertical vector of each roof panel in the set of roof panels, and determine the number of horizontal roof panels and the number of inclined roof panels based on the angle.
[0051] Specifically, a region growing method constrained by normal vector consistency and projection coplanarity can be used to extract a set of structurally stable roof patches from the unordered point cloud data. The $i$-th roof panel $P_i$ can be written as:
[0052] where $P_i$ denotes the $i$-th roof panel, which belongs to the set of roof panels $\mathcal{P}$; $R$ denotes the set of points currently being grown; $p$ denotes a candidate point; $\cup$ denotes the adjacent merge operation; $\mathbb{1}(\cdot)$ is an indicator function that takes the value 1 if the condition within the parentheses is true and 0 otherwise; $\Delta\theta$ denotes the angle difference between the normal vector $n_p$ of the candidate point and the current patch normal vector $n_R$; $d$ denotes the projected distance from the candidate point to the current patch fitting plane; and $\tau_n$ and $\tau_d$ denote the normal vector smoothing constraint threshold and the plane projection constraint threshold, respectively. A candidate point is merged into the current growth set only when both constraints are satisfied.
[0053] After the roof panels are extracted, the angle between the normal vector of each roof panel and the vertical vector can be calculated and used to determine the numbers of horizontal and sloping roof panels. For example, if the angle is less than 5°, the roof panel is judged to be a horizontal roof panel and is counted in the number of horizontal roof panels; otherwise, it is judged to be a sloping roof panel and is counted in the number of sloping roof panels. After all roof panels have been categorized, the total number of roof panels is obtained as the sum of the number of horizontal roof panels and the number of sloping roof panels.
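The panel counting rule above can be illustrated with the following short Python sketch. It assumes that each roof panel is already represented by one fitted normal vector and that the sign of the normal is irrelevant (hence the absolute value); the function names are hypothetical, and the 5° threshold is the example value given above.

```python
import math


def tilt_angle_deg(normal):
    """Angle in degrees between a panel normal and the vertical (Z) direction."""
    nx, ny, nz = normal
    length = math.sqrt(nx * nx + ny * ny + nz * nz)
    cos_angle = min(1.0, abs(nz) / length)  # abs(): normal orientation is ignored
    return math.degrees(math.acos(cos_angle))


def count_panels(normals, threshold_deg=5.0):
    """Return (horizontal count, sloping count, total count)."""
    horizontal = sum(1 for n in normals if tilt_angle_deg(n) < threshold_deg)
    sloping = len(normals) - horizontal
    return horizontal, sloping, horizontal + sloping


# Toy normals: two nearly flat panels and two panels tilted by about 30 degrees.
panel_normals = [
    (0.0, 0.0, 1.0),
    (0.02, 0.0, 1.0),
    (0.0, 0.5, 0.866),
    (0.0, -0.5, 0.866),
]
panel_counts = count_panels(panel_normals)
```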
[0054] Step 2.3: Project the point cloud of each roof panel onto a local two-dimensional plane, obtain the initial vertex sequence of each roof panel through the boundary extraction operator, remove redundant vertices by recursive angle discrimination, and determine the number of triangular roof panels and quadrilateral roof panels based on the number of polygon vertices retained in each roof panel.
[0055] Specifically, in this step, for each roof patch in the set of roof patches, the point cloud of that roof patch can be projected onto a local two-dimensional plane, and an initial vertex sequence can be obtained through the boundary extraction operator. Redundant vertices are then removed recursively. The specific method is as follows:
[0056] where $V^{(t)}$ is the vertex sequence currently being judged; $V^{(t-1)}$ is the vertex sequence from the previous round of discrimination; $v_k$ is the vertex currently being judged; $v_{k-1}$ and $v_{k+1}$ are the vertices preceding and following the vertex currently being judged; and $\theta_t$ is an angle threshold that changes dynamically with the current number of vertices. If the local interior angle at a vertex is too close to 180°, the vertex is more likely to lie inside a boundary segment than at a structural corner, and it should be removed. When the final number of retained polygon vertices is 3, the roof panel is counted in the number of triangular roof panels; when the final number of retained polygon vertices is 4, the roof panel is counted in the number of quadrilateral roof panels.
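A minimal Python sketch of the redundant vertex removal is given below. It makes two simplifying assumptions that are not taken from this application: a fixed tolerance `tolerance_deg` replaces the dynamically changing angle threshold, and all near-straight vertices found in one pass are removed together before the next pass. The function names are hypothetical.

```python
import math


def corner_angle_deg(prev_pt, pt, next_pt):
    """Unsigned angle in degrees between the two boundary edges meeting at `pt`."""
    ax, ay = prev_pt[0] - pt[0], prev_pt[1] - pt[1]
    bx, by = next_pt[0] - pt[0], next_pt[1] - pt[1]
    dot = ax * bx + ay * by
    cross = ax * by - ay * bx
    return math.degrees(math.atan2(abs(cross), dot))


def simplify_polygon(vertices, tolerance_deg=10.0):
    """Repeatedly drop vertices whose corner angle is within the tolerance of 180 degrees."""
    current = list(vertices)
    while len(current) > 3:
        n = len(current)
        kept = [
            current[k]
            for k in range(n)
            if 180.0 - corner_angle_deg(current[k - 1], current[k], current[(k + 1) % n])
            >= tolerance_deg
        ]
        if len(kept) == n or len(kept) < 3:
            break  # nothing left to remove, or the polygon would degenerate
        current = kept
    return current


def panel_shape(vertices, tolerance_deg=10.0):
    """Classify a projected panel outline by its number of retained vertices."""
    count = len(simplify_polygon(vertices, tolerance_deg))
    return {3: "triangle", 4: "quadrilateral"}.get(count, "other")


# A square outline with an extra vertex in the middle of every edge.
noisy_square = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]
square_shape = panel_shape(noisy_square)
```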
[0057] Step 2.4: Calculate the average distance between the point set after rotating the point cloud data 180° around the center point along the Z-axis and the point set before rotation, and determine the symmetry index based on the average distance.
[0058] Specifically, when calculating the symmetry index, the point cloud can be rotated 180° about the Z-axis through its center point to obtain the rotated point set. The average distance between the rotated point set and the original point set is then calculated, and the symmetry index of the current input is determined by comparing this average distance with a preset threshold. The average distance between the rotated point set and the point set before rotation is calculated as follows:
[0059] where $D_{sym}$ is the average distance between the point set obtained by rotating the point cloud 180° about the Z-axis through its center point and the point set before rotation; $P$ is the point set before rotation and $p$ is a point in it; and $P'$ is the rotated point set and $p'$ is a point in it. In this embodiment, if $D_{sym}$ is less than the threshold $\tau_{sym}$, the symmetry index $I_{sym}$ is set to 1; otherwise, it is set to 0. The symmetry index reflects the degree of spatial overlap between the point cloud and its rotated copy.
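The symmetry test can be sketched as follows. The sketch assumes that the center point is the centroid of the point cloud and that the average distance is taken from each rotated point to its nearest original point; both choices, the function names, and the threshold value are illustrative assumptions.

```python
import math


def rotate_180_about_z(points):
    """Rotate points by 180 degrees about the vertical axis through the centroid."""
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    return [(2.0 * cx - x, 2.0 * cy - y, z) for x, y, z in points]


def mean_nearest_distance(source, target):
    """Average distance from each source point to its nearest target point."""
    total = 0.0
    for p in source:
        total += min(math.dist(p, q) for q in target)
    return total / len(source)


def symmetry_index(points, threshold):
    """Return 1 if the rotated cloud overlaps the original closely, else 0."""
    rotated = rotate_180_about_z(points)
    return 1 if mean_nearest_distance(rotated, points) < threshold else 0


# A flat rectangle maps onto itself; adding one raised corner point breaks that.
rectangle = [(0.0, 0.0, 1.0), (2.0, 0.0, 1.0), (2.0, 1.0, 1.0), (0.0, 1.0, 1.0)]
lopsided = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (2.0, 1.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 3.0)]
```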
[0060] Step 2.5: Calculate the ratio between the roof outline area and the area of the smallest bounding rectangle of the roof outline area, and determine the outline concavity / convexity index based on the ratio.
[0061] Specifically, when calculating the contour concavity / convexity index, the roof outline area and the area of its smallest bounding rectangle are first extracted. The ratio between these two areas is then calculated, and the contour concavity / convexity index is determined based on this ratio. The ratio between the roof outline area and the area of its smallest bounding rectangle is calculated as follows:
$$r = \frac{A_{roof}}{A_{mbr}}$$
[0062] where $r$ is the ratio between the area of the building roof outline and the area of the smallest bounding rectangle of the roof outline; $A_{roof}$ is the area of the building roof outline; and $A_{mbr}$ is the area of the smallest bounding rectangle of the roof outline. In this embodiment, if $r$ is less than the threshold $\tau_{cc}$, the contour is determined to have obvious concavity and the contour concavity / convexity index $I_{cc}$ is set to 1; otherwise, it is set to 0.
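The area ratio can be illustrated with the Python sketch below, which assumes that the roof outline is available as an ordered list of 2D vertex tuples. The outline area uses the shoelace formula, and the smallest bounding rectangle is found by testing every edge direction of the convex hull. The threshold value 0.8 and all function names are assumptions made for illustration.

```python
import math


def polygon_area(vertices):
    """Shoelace area of a simple polygon given as ordered (x, y) tuples."""
    total = 0.0
    for k in range(len(vertices)):
        x1, y1 = vertices[k]
        x2, y2 = vertices[(k + 1) % len(vertices)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def min_bounding_rect_area(points):
    """Smallest-area enclosing rectangle; one side lies along some hull edge."""
    hull = convex_hull(points)
    best = float("inf")
    for k in range(len(hull)):
        x1, y1 = hull[k]
        x2, y2 = hull[(k + 1) % len(hull)]
        length = math.hypot(x2 - x1, y2 - y1)
        ux, uy = (x2 - x1) / length, (y2 - y1) / length
        along = [x * ux + y * uy for x, y in hull]
        across = [-x * uy + y * ux for x, y in hull]
        best = min(best, (max(along) - min(along)) * (max(across) - min(across)))
    return best


def concavity_index(outline, threshold=0.8):
    """Return (index, ratio); index is 1 when the outline is clearly concave."""
    ratio = polygon_area(outline) / min_bounding_rect_area(outline)
    return (1 if ratio < threshold else 0), ratio


# An L-shaped outline fills only three quarters of its bounding square.
l_shape = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)]
l_index, l_ratio = concavity_index(l_shape)
```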
[0063] Step 2.6: Based on the relative relationship between the elevation of the center point of two adjacent slopes of the building roof and the elevation of the intersection line, construct a depression measure, and determine the valley line structure index based on the depression measure.
[0064] Specifically, when calculating the valley line structure index, for roofs with concave water catchment structures, a depression measure can be constructed based on the relative relationship between the center elevations of adjacent slopes and the elevation of their intersection line, and the valley line structure index can be determined based on this depression measure. The depression measure is calculated as follows:
[0065] where $M_{val}$ is the depression measure, which represents the relative relationship between the elevations of the center points of two spatially adjacent slopes and the elevation of their intersection line; $S$ is the slope set; $S_a$ and $S_b$ are two adjacent slopes; $z_a$ and $z_b$ are the elevations of the center points of the two slopes, respectively; $z_{ab}$ is the elevation of the intersection line of the two slopes; and $z_{min}$ is the lowest global elevation, which enters the measure through a global elevation constraint term. In this embodiment, if $M_{val}$ is greater than the threshold $\tau_{val}$, the valley line structure index $I_{val}$ is set to 1; otherwise, it is set to 0.
[0066] Step 2.7: For any point on the building roof, solve for the target tangential direction in the neighborhood, calculate the tangential displacement and normal vector deviation of the neighborhood points along the target tangential direction, perform quadratic curve fitting, determine whether the local shape conforms to the surface characteristics based on the fitting error in multiple neighborhoods, and determine the surface existence index.
[0067] Specifically, when calculating the surface existence index, for any point $p$ on the building roof, let its tangent plane be $T_p$ and its normal vector be $n_p$. The target tangential direction $t^{*}$ is then solved within the neighborhood $N(p)$. The calculation method is as follows:
[0068] where $t^{*}$ is the target tangential direction within the neighborhood of the center point $p$; $p$ is the center point currently being analyzed; $q$ is any neighboring point of the center point $p$; $N(p)$ is the set of neighborhood points of the center point $p$; and $t_0$ is the initial tangential direction.
[0069] Furthermore, the tangential displacement of a neighboring point along the target tangential direction is defined as $s_q$, its deviation from the tangent plane along the normal direction is defined as $h_q$, and quadratic curve fitting is performed. The process is as follows:
[0070] $$s_q = (q - p) \cdot t^{*}, \qquad h_q = (q - p) \cdot n_p, \qquad \hat{h}(s) = a s^{2} + b s + c, \qquad E = \sum_{q \in N(p)} \left( h_q - \hat{h}(s_q) \right)^{2}$$
[0071] where $p$ is the center point currently being analyzed; $q$ is any neighboring point of the center point $p$; $N(p)$ is the set of neighborhood points of the center point $p$; $t^{*}$ is the target tangential direction solved within the neighborhood of the center point $p$; $n_p$ is the normal vector at the center point $p$, perpendicular to the tangent plane at the center point; $s_q$ is the tangential displacement of the neighboring point along the target tangential direction; $h_q$ is the normal deviation of the neighboring point from the tangent plane; $a$, $b$ and $c$ are the parameters of the fitted quadratic curve; $\hat{h}(s_q)$ is the normal deviation predicted by the quadratic fitting model at the corresponding tangential displacement; and $E$ is the sum of squared fitting errors, i.e., the sum over all neighboring points of the squared differences between the actual and predicted normal deviations, which measures how well the local surface fits a quadratic surface.
[0072] If the fitting error remains below the threshold $\tau_{surf}$ in multiple neighborhoods, this indicates that the local morphology conforms more to the characteristics of a second-order curved surface than to those of a planar surface, and the surface existence index $I_{surf}$ is set to 1; otherwise, it is set to 0.
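As a hedged illustration of the quadratic curve fitting in step 2.7, the following Python sketch projects neighboring points onto a given tangential direction and normal, fits a quadratic by least squares, and returns the sum of squared fitting errors. It assumes that the target tangential direction and the normal are already known unit vectors; solving for the tangential direction and the thresholding over multiple neighborhoods are not shown, and the function names are hypothetical.

```python
def local_profile(center, neighbors, tangent, normal):
    """Map each neighbor to (tangential displacement, normal deviation)."""
    profile = []
    for q in neighbors:
        offset = [qc - pc for qc, pc in zip(q, center)]
        s = sum(o * t for o, t in zip(offset, tangent))
        h = sum(o * n for o, n in zip(offset, normal))
        profile.append((s, h))
    return profile


def _det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (
        m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
        - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
        + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])
    )


def fit_quadratic(samples):
    """Least-squares fit of h = a*s**2 + b*s + c; returns (a, b, c, squared error sum)."""
    n = float(len(samples))
    s1 = sum(s for s, _ in samples)
    s2 = sum(s ** 2 for s, _ in samples)
    s3 = sum(s ** 3 for s, _ in samples)
    s4 = sum(s ** 4 for s, _ in samples)
    h0 = sum(h for _, h in samples)
    h1 = sum(s * h for s, h in samples)
    h2 = sum(s * s * h for s, h in samples)
    matrix = [[s4, s3, s2], [s3, s2, s1], [s2, s1, n]]
    rhs = [h2, h1, h0]
    det = _det3(matrix)
    if abs(det) < 1e-12:
        raise ValueError("degenerate neighborhood: cannot fit a quadratic")
    coeffs = []
    for col in range(3):  # Cramer's rule on the normal equations
        replaced = [row[:] for row in matrix]
        for r in range(3):
            replaced[r][col] = rhs[r]
        coeffs.append(_det3(replaced) / det)
    a, b, c = coeffs
    error = sum((h - (a * s * s + b * s + c)) ** 2 for s, h in samples)
    return a, b, c, error


# Neighbors lying exactly on the parabola h = 0.5 * s**2 above the tangent plane.
center_point = (0.0, 0.0, 0.0)
neighbor_points = [(float(s), 0.0, 0.5 * s * s) for s in (-2, -1, 0, 1, 2)]
profile = local_profile(center_point, neighbor_points, (1.0, 0.0, 0.0), (0.0, 0.0, 1.0))
a_fit, b_fit, c_fit, fit_error = fit_quadratic(profile)
```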
[0073] In this embodiment, the thresholds involved in each step can be adaptively adjusted according to parameters such as point cloud density and building scale. This embodiment does not impose specific limitations.
[0074] In a specific application scenario, Figure 2 shows examples of explicit geometric feature statistics for two different types of roofs. The frustum roof in the upper part has the following geometric feature statistics: a total of 5 roof panels, 1 horizontal panel, 4 sloping panels, 0 triangular panels, and 5 quadrilateral panels. That is, the frustum roof consists of 1 horizontal panel and 4 sloping panels, and all 5 panels are quadrilaterals. The ridge roof in the lower part has the following geometric feature statistics: a total of 6 roof panels, 0 horizontal panels, 6 sloping panels, 2 triangular panels, and 4 quadrilateral panels. That is, the ridge roof consists entirely of sloping panels, of which 2 are triangular and 4 are quadrilateral. Figure 2 intuitively demonstrates the structural differences in the surface composition of different roof types. These geometric feature vectors can effectively distinguish roof categories that are similar in appearance but different in topology, thus providing quantifiable geometric priors for subsequent multimodal fusion and roof classification.
[0075] In another specific application scenario, Figure 3 shows a comparison of the four geometric attribute indices for two other types of roofs. The M-shaped roof in the upper part has the following geometric attributes: symmetry index = 0, indicating no rotational symmetry; contour concavity / convexity index = 1, indicating a distinctly concave contour; valley structure index = 1, indicating the presence of valley structures such as concave drainage structures; and surface existence index = 0, indicating the absence of curved surface features and a primarily planar composition. The ellipsoidal roof in the lower part has the following geometric attributes: symmetry index = 1, indicating rotational symmetry; contour concavity / convexity index = 0, indicating a relatively regular contour with no obvious concavity; valley structure index = 0, indicating the absence of valley structures; and surface existence index = 1, indicating the presence of distinct curved surface features. Figure 3 intuitively illustrates the essential differences in geometric properties between planar and curved roofs, and demonstrates that these geometric feature vectors can effectively distinguish roof types with significant structural differences, thus providing quantifiable geometric priors for subsequent multimodal fusion classification.
[0076] The above embodiments distinguish between horizontal and sloping roof surfaces by the angle between normal vectors, and between triangular and quadrilateral facets by the number of vertices in the projected contour. Furthermore, the symmetry, concavity / convexity, valley structure, and surface existence of the roof are quantified by the rotation overlap distance, the ratio of the contour area to the minimum bounding rectangle area, the elevation relationship between adjacent slopes, and the quadratic curve fitting error. This can transform the implicit three-dimensional topological structure in the point cloud into explicit geometric feature vectors, thereby providing interpretable structural priors for subsequent multimodal fusion and compensating for the problem that two-dimensional projection is insufficient in expressing key information such as ridges, valleys, and curvature.
[0077] In one embodiment, step 3 includes the following steps: Step 3.1: Generate projected images of the building roof from multiple perspectives, including projected images from multiple main perspectives and projected images from multiple random auxiliary perspectives.
[0078] Step 3.2: Input each projected image into the visual encoder of the pre-trained visual language model to obtain the visual embedding of each projected image.
[0079] Step 3.3: Concatenate the visual embeddings of each projected image to obtain visual features, where the dimension of the visual features is the sum of the dimensions of the visual embeddings of each projected image.
[0080] Specifically, projected images of the building roof from multiple viewpoints can be generated using methods such as perspective projection or orthographic projection. The generated set of projected images can be represented as follows:
$$\mathcal{I} = \{ I_k \mid k = 1, 2, \dots, 10 \}$$
[0081] where $\mathcal{I}$ is the set of projected images, $I_k$ is a projected image, and $k$ is the sequence number of the projected image.
[0082] In a specific application scenario, Figure 4 illustrates two types of viewing angles for a building roof. The first type consists of six projected images taken from fixed directions aligned with the main axes of the roof, such as the front view, side view, and top view. These viewing angles are deterministic and standardized, ensuring a stable representation of the roof's basic outline and structure. The second type consists of four projected images from random auxiliary viewing angles, which are generated randomly in space and need not be aligned with the main axes. These viewing angles increase the diversity of observation and can capture local structural information that is difficult to observe from an axis-aligned viewpoint.
[0083] After the multi-view projected images are generated, each projected image can be input into a pre-trained visual language model. The visual encoder of the visual language model extracts features from each projected image to obtain its visual embedding, such as a 512-dimensional visual embedding. The visual features of the building roof are then constructed by concatenation. The visual features are expressed as follows:
$$F_{vis} = [e_1; e_2; \dots; e_{10}] \in \mathbb{R}^{5120}$$
[0084] where $F_{vis}$ denotes the visual features of the building roof; $e_1, e_2, \dots, e_{10}$ are the visual embeddings of the projected images numbered 1 to 10, respectively; and $\mathbb{R}^{5120}$ is the vector space containing the visual features of the building roof.
[0085] The above embodiments can capture the outer contour and depth variation information of the building roof from multiple spatial directions by generating projected images from multiple perspectives. By using a pre-trained visual language model visual encoder to extract the visual embeddings of each perspective and concatenate them into high-dimensional visual features, the ability of visual features to represent the three-dimensional topological structure of the roof can be enhanced, thereby alleviating the problem of difficulty in distinguishing roofs with similar appearances from a single perspective.
[0086] In one embodiment, step 4 includes the following steps: Step 4.1: For each type of roof, construct geometrically enhanced text hints, which include descriptive text about the outline shape, slope relationship, number of ridges, depth variation, and surface features.
[0087] Step 4.2: Input the geometrically enhanced text cues for each roof category into the text encoder of the visual language model to obtain the text embedding.
[0088] Step 4.3: Repeatedly concatenate text embeddings of the same category multiple times to obtain text features with the same dimension as the visual features.
[0089] Specifically, when extracting the text features of building roofs, geometrically enhanced text prompts can first be constructed. For each roof category, the constructed geometrically enhanced text prompt can simultaneously include multiple kinds of semantic information, such as the roof's outline shape, slope relationships, number of ridges, depth variations, and surface features. The geometrically enhanced text prompt of each roof category is then input into the text encoder of the visual language model to obtain a text embedding. The text embedding is expressed as follows:
$$t_c = E_{text}(T_c), \quad c \in \mathcal{C}, \quad t_c \in \mathbb{R}^{512}$$
[0090] where $t_c$ is the text embedding of the $c$-th category of building roof; $E_{text}$ is the text encoder of the pre-trained visual language model; $\mathcal{C}$ is the set of roof categories of the building roof; $T_c$ is the geometrically enhanced text prompt of the $c$-th category of building roof; and $\mathbb{R}^{512}$ is the vector space containing the text embedding of the building roof.
[0091] Furthermore, to match the dimensionality of the visual features of the 10 views, the text embedding of the same category can be concatenated 10 times to form a text feature with the same dimension as the visual features. The text features are expressed as follows:
$$F_{text}^{c} = [\underbrace{t_c; t_c; \dots; t_c}_{10}] \in \mathbb{R}^{5120}, \quad c \in \mathcal{C}$$
[0092] where $F_{text}^{c}$ is the text feature of the $c$-th category of building roof; $t_c$ is the text embedding of the $c$-th category of building roof; $\mathcal{C}$ is the set of roof categories of the building roof; and $\mathbb{R}^{5120}$ is the vector space containing the text features of the building roof.
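The dimension matching described above amounts to simple list operations, as the following Python sketch shows. The embeddings here are constant placeholder vectors rather than encoder outputs, and the function names are hypothetical.

```python
def concat_views(view_embeddings):
    """Concatenate per-view visual embeddings into one visual feature vector."""
    return [value for embedding in view_embeddings for value in embedding]


def tile_text_embedding(text_embedding, num_views):
    """Repeat a category's text embedding once per view."""
    return list(text_embedding) * num_views


# Placeholder data: ten 512-dimensional view embeddings and one text embedding.
view_embeddings = [[float(k)] * 512 for k in range(10)]
text_embedding = [0.5] * 512

visual_feature = concat_views(view_embeddings)
text_feature = tile_text_embedding(text_embedding, len(view_embeddings))
```

Both vectors have 10 × 512 = 5120 components, so they can be compared component by component.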
[0093] The above embodiments construct geometrically enhanced text cues for each type of roof, including outline shape, slope relationship, number of ridges, depth variation and surface features. The text encoder of the visual language model extracts the text embedding and repeatedly concatenates it into text features of the same dimension as the visual features. This allows the text features to contain rich descriptions of roof topology attributes, improving the alignment ability of text features with visual features and geometric features.
[0094] In one embodiment, step 5 includes the following steps: Step 5.1: Use a multilayer perceptron to project multidimensional geometric features onto the feature space of the visual language model to obtain geometric projection features, and reorganize the visual features into multi-view visual features.
[0095] Step 5.2: Perform geometric-to-visual cross-attention, using geometric projection features as queries and multi-view visual features as keys and values, to calculate attention-weighted visual semantic features.
[0096] Step 5.3: Perform reverse correction from vision to geometry, using the mean of the attention-weighted visual features as the query and the geometric projection features as the key and value, to calculate the geometric constraint features after reverse enhancement.
[0097] Step 5.4: Perform residual fusion on the attention-weighted visual semantic features, the inversely enhanced geometric constraint features, and the original geometric projection features, adjust the contributions of the visual semantic features and the geometric constraint features by preset weight coefficients, and obtain the multimodal fusion features after layer normalization.
[0098] Specifically, in the process of multimodal feature fusion, a multilayer perceptron, such as a 4-layer multilayer perceptron, can first be used to project the multidimensional geometric features onto the feature space of the visual language model. The geometric projection features are expressed as follows:
$$G = \mathrm{MLP}(F_{geo}) \in \mathbb{R}^{512}$$
[0099] where $G$ denotes the geometric projection features; $F_{geo}$ denotes the multidimensional geometric features of the building roof; $\mathrm{MLP}(\cdot)$ is the multilayer perceptron; and $\mathbb{R}^{512}$ is the vector space containing the geometric projection features.
[0100] Then, the visual features are restructured into multi-view visual features, and geometric-to-visual cross-attention is used. Geometric projection features are used as queries, and multi-view visual features are used as keys and values to calculate attention-weighted visual semantic features. The attention-weighted visual semantic features are expressed as follows:
[0101]
[0102]
[0103] where $Q$, $K$ and $V$ are the query matrix, the key matrix and the value matrix in the geometric-to-visual cross-attention process, respectively; $W_Q$, $W_K$ and $W_V$ are the weight matrices of the geometric-to-visual cross-attention process; $G$ denotes the geometric projection features; $F_{mv}$ denotes the multi-view visual features; $\sqrt{d}$ is the scaling factor in the attention mechanism; $A_{g \to v}$ is the attention weight matrix from geometry to vision; and $F_{att}$ denotes the attention-weighted visual semantic features.
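The geometric-to-visual cross-attention can be illustrated with a generic scaled dot-product attention in Python. This is only a sketch under strong assumptions: the learned weight matrices are replaced by identity mappings, a single geometric query attends over one key and value vector per view, and the toy vectors are two-dimensional. It is not asserted to be the exact formulation of this application.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of numbers."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values):
    """Scaled dot-product attention of one query over several key/value vectors."""
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(logits)
    dim = len(values[0])
    output = [sum(w * val[j] for w, val in zip(weights, values)) for j in range(dim)]
    return weights, output


# Toy example: the geometric query is aligned with the first of two views.
geometric_query = [1.0, 0.0]
view_features = [[1.0, 0.0], [0.0, 1.0]]
attention_weights, attended = cross_attention(geometric_query, view_features, view_features)
```

The view whose feature is most similar to the geometric query receives the largest weight, which is how geometric features can steer attention toward the most informative views.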
[0104] Next, a visual-to-geometric inverse correction is performed, expressed as follows:
[0105]
[0106]
[0107]
[0108] where $\bar{v}$ is the visual mean vector, i.e., the average of the attention-weighted visual semantic features over all 10 views; $v_i$ is the feature vector of the $i$-th view in the attention-weighted visual semantic features; $G$ denotes the geometric projection features; $i$ is the view number; $Q'$, $K'$ and $V'$ are the query matrix, the key matrix and the value matrix in the visual-to-geometric reverse correction, respectively; $W'_Q$, $W'_K$ and $W'_V$ are the weight matrices of the visual-to-geometric reverse correction; $\sqrt{d}$ is the scaling factor in the attention mechanism; $A_{v \to g}$ is the attention weight matrix from vision to geometry; and $G'$ denotes the geometric constraint features after reverse enhancement.
[0109] Finally, the multimodal fusion feature can be expressed as:
[0110] where $F_{fuse}$ denotes the multimodal fusion features; $\Phi$ is the geometric calibration operator; $\alpha$ and $\beta$ are the weighting coefficients of the visual semantic features and the geometric constraint features, respectively; $F_{att}$ denotes the attention-weighted visual semantic features; $G'$ denotes the geometric constraint features after reverse enhancement; and $\mathrm{LN}(\cdot)$ is the layer normalization operation.
[0111] In a specific application scenario, the upper part of Figure 5 illustrates the workflow of the geometry-driven cue alignment module. The multidimensional geometric features are projected onto a 512-dimensional feature space through a 4-layer fully connected multilayer perceptron. Simultaneously, the visual encoder of a pre-trained visual language model extracts 512-dimensional visual features from the multi-view projected images. These two sets of features are then fused through a bidirectional cross-attention mechanism to achieve multimodal alignment and obtain the multimodal fusion features.
[0112] The above embodiments use geometric-to-visual cross-attention to guide visual features to focus on key regions through geometric features, and then use visual-to-geometric reverse correction to enhance geometric representation with visual mean. Finally, multimodal fusion features are obtained through residual fusion and layer normalization. This can achieve deep synergy between explicit geometric priors and visual semantic features, so that the fusion features have both geometric interpretability and visual richness.
[0113] In one embodiment, step 6 includes the following steps:
Step 6.1: Calculate the cosine similarity between the multimodal fusion features and the text features of each category to obtain the basic semantic score.
[0114] Step 6.2: Calculate the difference between each geometric descriptor of the building roof and the corresponding component of the prior geometric template of each roof category, and substitute the squared difference into a Gaussian kernel function to obtain the geometric matching score, where the geometric descriptors are the components of the multidimensional geometric features.
[0115] Step 6.3: Multiply the geometric matching score by the preset geometric intervention coefficient and add the result as an additive bias to the basic semantic score to obtain the corrected classification semantic score.
[0116] Step 6.4: Normalize the corrected classification semantic scores and take the roof category with the highest probability as the roof category of the building roof.
[0117] Specifically, when identifying the roof category of a building, the cosine similarity between the multimodal fusion features and the text features of each category is first calculated to obtain a basic semantic score. The basic semantic score can be expressed in a form such as:
$$s_k = \frac{1}{\tau}\cdot\frac{F_{\mathrm{fuse}}\cdot T_k}{\lVert F_{\mathrm{fuse}}\rVert\,\lVert T_k\rVert}$$
[0118] where $s_k$ is the basic semantic score of the $k$-th category of building roof; $F_{\mathrm{fuse}}$ is the multimodal fusion feature; $T_k$ is the text feature of the $k$-th category of building roof; $\tau$ is a temperature coefficient that adjusts the scaling of the cosine similarity and thereby controls the smoothness of the predicted distribution; and $\lVert\cdot\rVert$ is the Euclidean length of a vector.
[0119] Suppose that the $j$-th geometric descriptor extracted from the input sample is $d_j$, and that the prior geometric template corresponding to the $k$-th category of building roof is $t_k$. The geometric matching score can then be defined in a form such as:
$$g_k = \exp\left(-\frac{1}{2\sigma^{2}}\sum_{j=1}^{M}\left(d_j - t_{k,j}\right)^{2}\right)$$
[0120] where $g_k$ is the geometric matching score of the $k$-th category of building roof; $d_j$ is the $j$-th geometric descriptor of the building roof, i.e., the $j$-th component of the multidimensional geometric features; $t_{k,j}$ is the $j$-th component of the prior geometric template corresponding to the $k$-th category of building roof; $\sigma$ is the bandwidth parameter of the Gaussian kernel function, which controls the sensitivity of the geometric matching score to geometric differences; $j$ is the index of the geometric descriptor; and $M$ is the total number of geometric descriptors.
[0121] Next, the geometric prior is injected into the classification decision as an additive bias, which can be expressed in a form such as:
[0122] $$\hat{s}_k = s_k + \lambda\, g_k$$
[0123] $$P(k \mid X) = \frac{\exp(\hat{s}_k)}{\sum_{c}\exp(\hat{s}_c)}$$
[0124] where $\hat{s}_k$ is the classification semantic score of the $k$-th category of building roof; $s_k$ is the basic semantic score of the $k$-th category; $g_k$ is the geometric matching score of the $k$-th category; $\lambda$ is the geometric intervention coefficient; and $P(k \mid X)$ is the posterior probability of the $k$-th category given the point cloud data $X$, obtained by normalizing the classification semantic scores over all categories (written here as a softmax). The final roof category is the category with the highest posterior probability.
[0125] In a specific application scenario, the lower half of Figure 5 illustrates the probabilistic geometric correction mechanism. The left side shows the text features of each category, which describe the depth-map appearance of the various roof types, for example: "top row: gray and dark-lined rectangles with no corners; side row: solid triangles with dark tips and gray bottoms; no hexagons or cross ridges." The cosine similarity between the multimodal fusion features and the text features of each category gives the basic semantic score. In parallel, the geometric descriptors extracted from the point cloud data (that is, the components of the multidimensional geometric features) are compared with the prior geometric template of each category, and the difference is fed into a Gaussian kernel function to obtain the geometric matching score. The geometric matching score is then multiplied by the geometric intervention coefficient and added to the basic semantic score to obtain the final classification semantic score, which is normalized to determine the roof category. In this way the geometric priors intervene in the classification decision in a soft, probabilistic manner rather than through hard threshold filtering, which maintains high robustness under noise and occlusion.
[0126] Compared to hard thresholding, the above scheme allows geometric priors to smoothly intervene in the reasoning process in a probabilistic manner, thus maintaining high robustness in the presence of noise, occlusion, and local geometric errors.
[0127] The above embodiments calculate the matching score between the geometric descriptor and the prior geometric template in the form of a Gaussian kernel, and then add this score as an additive bias to the basic semantic score. This enables probabilistic geometric correction and allows the geometric prior to be smoothly involved in the classification decision. This avoids the problem of hard threshold filtering misclassifying the correct category under point cloud noise or local defects, thereby improving the robustness of classification.
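Steps 6.1 to 6.4 can be illustrated with a short Python sketch. It is hedged throughout: the softmax normalization, the division by the temperature, the summed squared difference inside the Gaussian kernel, the parameter values (`tau`, `sigma`, `lam`), and the three-component toy descriptors and templates are all assumptions chosen for illustration, not values taken from this application.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    num = sum(x * y for x, y in zip(a, b))
    return num / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def classify(fused, text_feats, descriptors, templates, tau=0.07, sigma=1.0, lam=1.0):
    """Return (best category, posterior probabilities) with a soft geometric bias."""
    scores = {}
    for name, text in text_feats.items():
        base = cosine(fused, text) / tau                      # step 6.1
        sq = sum((d - t) ** 2 for d, t in zip(descriptors, templates[name]))
        geo = math.exp(-sq / (2.0 * sigma ** 2))              # step 6.2
        scores[name] = base + lam * geo                       # step 6.3
    top = max(scores.values())                                # step 6.4 (softmax)
    exps = {k: math.exp(v - top) for k, v in scores.items()}
    total = sum(exps.values())
    probs = {k: e / total for k, e in exps.items()}
    return max(probs, key=probs.get), probs

# Toy inputs: the text features of "gable" and "hip" are identical, so the
# basic semantic score cannot separate them; only the geometry can.
fused = [1.0, 0.0]
text_feats = {"gable": [1.0, 0.0], "hip": [1.0, 0.0], "flat": [0.0, 1.0]}
# Hypothetical descriptors: (total patches, horizontal patches, sloping patches).
descriptors = [2, 0, 2]
templates = {"gable": [2, 0, 2], "hip": [4, 0, 4], "flat": [1, 1, 0]}

label, probs = classify(fused, text_feats, descriptors, templates)
print(label, {k: round(p, 3) for k, p in probs.items()})
```

Because the bias is additive and bounded by `lam`, a poor geometric match lowers a category's probability without removing the category outright, which is the behavior the text contrasts with hard threshold filtering.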
[0128] In one embodiment, in step 107, the parameterized geometric primitives include a planar roof primitive and a curved roof primitive, wherein the parameter vector of the planar roof primitive includes at least length, width, height, ridge position and slope angle, and the parameter vector of the curved roof primitive includes at least length, width, height, surface radius, central axis direction and curvature control amount.
[0129] Specifically, after the roof category of a building roof is obtained, the parametric geometric primitive corresponding to that roof category can be selected from a predefined roof primitive library. Different roof categories can correspond to different parameter vectors. For example, the parameter vector of a certain planar roof primitive can be expressed in a form such as:
$$\theta = \left[\,l,\; w,\; h,\; \alpha_1,\; \alpha_2\,\right]^{T}$$
[0130] where $\theta$ is the parametric geometric primitive of a certain type of roof, $l$ is the length, $w$ is the width, $h$ is the height, and $\alpha_1$ and $\alpha_2$ are the inclination angles of the two slopes, respectively. In addition, the above parametric geometric primitives may also include other structural control parameters, which are not specifically limited here.
[0131] In a specific application scenario, Figure 6 shows typical parameters for planar roof primitives and curved roof primitives. Typical parameters for planar roof primitives may include ridge, height, center point, length, and width, which define the regular shape of the planar roof. Typical parameters for curved roof primitives may include radius, height, and center point, which define the curved shape of the curved roof. Figure 6 thus visually demonstrates the parameter vector composition required by the parametric geometric primitives in this embodiment. Different types of roofs correspond to different parameter sets, which provide clear optimization variables for the subsequent semantically guided parametric fitting.
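As an illustration of how such a parameter vector can drive a shape, the following Python sketch models a two-slope (gable-like) planar primitive and samples points from it. The class name `GableRoofPrimitive`, the reading of "height" as eave height, and the grid sampling scheme are assumptions made for the example only.

```python
import math
from dataclasses import dataclass

@dataclass
class GableRoofPrimitive:
    """Hypothetical planar primitive: length, width, eave height, two slope angles."""
    length: float
    width: float
    height: float   # assumed to mean the eave height
    alpha1: float   # inclination of the first slope, in radians
    alpha2: float   # inclination of the second slope, in radians

    def ridge_offset(self):
        """Distance of the ridge from the first eave, where both slopes meet."""
        t1, t2 = math.tan(self.alpha1), math.tan(self.alpha2)
        return self.width * t2 / (t1 + t2)

    def ridge_height(self):
        return self.height + self.ridge_offset() * math.tan(self.alpha1)

    def surface_z(self, y):
        """Roof elevation at cross-section coordinate y in [0, width]."""
        if y <= self.ridge_offset():
            return self.height + y * math.tan(self.alpha1)
        return self.height + (self.width - y) * math.tan(self.alpha2)

    def sample_points(self, nx=5, ny=9):
        """Sample a regular grid of model points on the two slopes."""
        points = []
        for i in range(nx):
            x = self.length * i / (nx - 1)
            for j in range(ny):
                y = self.width * j / (ny - 1)
                points.append((x, y, self.surface_z(y)))
        return points

roof = GableRoofPrimitive(10.0, 6.0, 3.0, math.radians(30), math.radians(30))
points = roof.sample_points()
print(round(roof.ridge_offset(), 3), round(roof.ridge_height(), 3), len(points))
```

With two slope angles and a width, the ridge position follows from requiring both slopes to reach the same elevation, so it does not need to be stored as a separate parameter in this particular sketch.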
[0132] Furthermore, after the corresponding parametric geometric primitive is selected, parametric fitting can be performed to output a 3D roof model. Assume that the selected parametric primitive with parameter vector $\theta$ is sampled as a model point set $Q(\theta)$. Parameter optimization can then be achieved by minimizing the bidirectional Chamfer distance between the original point cloud data and the model point set. The objective function can be written in a form such as:
[0133] $$\theta^{*} = \arg\min_{\theta} L_{\mathrm{CD}}(\theta),\qquad L_{\mathrm{CD}}(\theta) = \sum_{p \in P}\min_{q \in Q(\theta)}\lVert p - q\rVert_2^{2} + \sum_{q \in Q(\theta)}\min_{p \in P}\lVert q - p\rVert_2^{2}$$
[0134] where $\theta$ is the parameter vector of the parametric geometric primitive; $L_{\mathrm{CD}}$ is the objective function, namely the bidirectional Chamfer distance, which measures the shape difference between the original point cloud data and the model point set; $P$ is the original input point cloud data and $p$ is any point in it; $Q(\theta)$ is the model point set and $q$ is any point in it; $\theta^{*}$ is the target parameter vector, i.e., the parameter that minimizes the objective function; and $\lVert\cdot\rVert_2^{2}$ is the square of the Euclidean length.
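The bidirectional Chamfer objective can be illustrated with a small Python sketch. The one-parameter symmetric roof model (`sample_gable`), the unnormalized sum form of the distance, and the grid search in `fit_rise` are all simplifying assumptions. A practical implementation would more likely optimize the full parameter vector with a numerical optimizer and a spatial index for the nearest-neighbor queries.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two 3D points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def chamfer(P, Q):
    """Bidirectional Chamfer distance (sum form) between point sets P and Q."""
    forward = sum(min(sq_dist(p, q) for q in Q) for p in P)
    backward = sum(min(sq_dist(q, p) for p in P) for q in Q)
    return forward + backward

def sample_gable(rise, width=6.0, length=10.0, nx=5, ny=9):
    """Sample a symmetric two-slope roof whose ridge is `rise` above the eaves."""
    points = []
    for i in range(nx):
        x = length * i / (nx - 1)
        for j in range(ny):
            y = width * j / (ny - 1)
            z = rise * (1.0 - abs(2.0 * y / width - 1.0))
            points.append((x, y, z))
    return points

def fit_rise(P, candidates):
    """Pick the candidate rise whose sampled model is closest to P."""
    return min(candidates, key=lambda r: chamfer(P, sample_gable(r)))

# Synthetic "observed" cloud generated with rise = 2.0, then recovered by search.
observed = sample_gable(2.0)
candidates = [0.5 * k for k in range(9)]   # 0.0, 0.5, ..., 4.0
best_rise = fit_rise(observed, candidates)
print(best_rise, chamfer(observed, sample_gable(best_rise)))
```

The two directions matter: the forward term penalizes observed points that the model does not cover, and the backward term penalizes model regions with no observed support.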
[0135] Finally, after global optimization, the corresponding 3D roof CAD model can be generated and output as OBJ, PLY or other standard 3D roof model formats to obtain the final 3D roof model.
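For the export step, a minimal Python sketch of writing a fitted two-slope roof as Wavefront OBJ text is given here. The mesh layout and the function names `gable_roof_mesh` and `to_obj` are hypothetical. The sketch relies only on the basic OBJ records `v` (vertex) and `f` (face, with 1-based vertex indices).

```python
def gable_roof_mesh(length, width, eave_h, ridge_h):
    """Build vertices and faces of a simple two-slope roof with closed gable ends."""
    verts = [
        (0.0, 0.0, eave_h), (length, 0.0, eave_h),                # first eave
        (0.0, width, eave_h), (length, width, eave_h),            # second eave
        (0.0, width / 2, ridge_h), (length, width / 2, ridge_h),  # ridge line
    ]
    faces = [
        (1, 2, 6, 5),   # first slope
        (4, 3, 5, 6),   # second slope
        (1, 5, 3),      # gable end at x = 0
        (2, 4, 6),      # gable end at x = length
    ]
    return verts, faces

def to_obj(verts, faces):
    """Serialize vertices and faces as Wavefront OBJ text."""
    lines = ["# two-slope roof sketch"]
    lines += [f"v {x:.3f} {y:.3f} {z:.3f}" for x, y, z in verts]
    lines += ["f " + " ".join(str(i) for i in face) for face in faces]
    return "\n".join(lines) + "\n"

verts, faces = gable_roof_mesh(10.0, 6.0, 3.0, 4.732)
obj_text = to_obj(verts, faces)
print(obj_text)
```

The returned string can be written to a file with an `.obj` extension and opened in common mesh viewers. Exporting to PLY would need a different header and record layout.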
[0136] In a specific application scenario, Figure 7 shows the parametric reconstruction results of this embodiment on various types of planar and curved roofs. Planar roofs can include, for example, hexagonal pavilion roofs, hipped roofs, M-shaped roofs, butterfly roofs, flat roofs, zigzag roofs, saltbox roofs, and many other types. These roofs are mainly composed of planar patches, and their geometry can be described by parameters such as length, width, height, ridge position, and slope angle. Curved roofs can include, for example, suspended roofs, hemispherical roofs, ellipsoidal roofs, cylindrical roofs, conical roofs, and many other types. These roofs have curved surfaces, and their geometry can be described by parameters such as radius and curvature control amount.
[0137] The above embodiments, by selecting corresponding parametric geometric primitives based on the classification results of building roofs, can avoid repeated trial and error between incompatible geometric models and significantly reduce the probability of topological mismatch. Furthermore, by using bidirectional chamfer distance to optimize model parameters, the regular boundaries of planar roofs and the curvature characteristics of curved roofs can be accurately fitted, thereby generating a topologically consistent and boundary-complete 3D roof model.
[0138] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, stored data, displayed data, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties. In addition, the labels corresponding to each step in the above embodiments are only for identification purposes and are not intended to limit the execution order of the steps. The execution order of the steps in each embodiment can be set according to the actual situation.
[0139] Furthermore, as a specific implementation of the method shown in Figures 1 to 7, this application provides a three-dimensional model reconstruction device for a building roof, the device comprising:
The point cloud data acquisition module is used to acquire point cloud data of the building roof to be modeled and to preprocess the point cloud data, wherein the point cloud data is acquired by an airborne lidar.
The geometric feature construction module is used to extract explicit geometric features from the preprocessed point cloud data and construct multidimensional geometric features.
The visual feature construction module is used to generate projected images of the building roof from multiple perspectives, input each projected image into the visual encoder of the pre-trained visual language model to obtain visual embeddings, and construct visual features by concatenation.
The text feature construction module is used to construct geometrically enhanced text prompts for each type of roof, and input the geometrically enhanced text prompts of each roof category into the text encoder of the visual language model to obtain text embeddings. The text embeddings of the same category are repeatedly concatenated to obtain text features with the same dimension as the visual features.
The multimodal feature fusion module is used to project the multidimensional geometric features onto the feature space of the visual language model using a multilayer perceptron, reorganize the visual features into multi-view visual features, and perform geometry-to-vision cross-attention and vision-to-geometry reverse correction. Multimodal fusion features are obtained through residual fusion and layer normalization.
The roof category identification module is used to calculate the cosine similarity between the multimodal fusion features and the text features of each category to obtain the basic semantic score, and to inject the geometric matching score between the geometric descriptors of the building roof and the prior geometric template of each category into the basic semantic score to obtain the roof category of the building roof.
The roof model reconstruction module is used to select the parametric geometric primitive corresponding to the roof category of the building roof from a predefined roof primitive library, construct a parameter optimization target based on the geometric deviation between the point cloud data and the parametric geometric primitive, optimize the shape parameters, position parameters, and orientation parameters of the parametric geometric primitive, and generate a three-dimensional roof model of the building roof.
[0140] In specific application scenarios, the multidimensional geometric features include the total number of roof patches, the number of horizontal roof patches, the number of sloping roof patches, the number of triangular roof patches, the number of quadrilateral roof patches, a symmetry index, a contour concavity/convexity index, a valley line structure index, and a surface existence index. The geometric feature construction module is specifically used to:
extract a set of roof patches from the point cloud data using a region growing method constrained by normal vector consistency and projection coplanarity, and calculate the total number of roof patches in the set;
calculate the angle between the normal vector of each roof patch in the set and the vertical direction vector, and determine the number of horizontal roof patches and the number of sloping roof patches based on the angle;
project the point cloud of each roof patch onto a local two-dimensional plane, obtain the initial vertex sequence of each roof patch through a boundary extraction operator, remove redundant vertices through recursive angle discrimination, and determine the number of triangular roof patches and the number of quadrilateral roof patches based on the number of polygon vertices retained for each roof patch;
calculate the average distance between the point set obtained by rotating the point cloud data 180° about the Z-axis through the center point and the point set before rotation, and determine the symmetry index based on the average distance;
calculate the ratio between the roof outline area and the area of the minimum bounding rectangle of the roof outline, and determine the contour concavity/convexity index based on the ratio;
construct a concavity measure based on the relative relationship between the center-point elevation and the intersection-line elevation of two spatially adjacent slopes of the building roof, and determine the valley line structure index based on the concavity measure;
for any point on the building roof, solve the target tangential direction in its neighborhood, calculate the tangential displacement and normal vector deviation of the neighborhood points along the target tangential direction, fit a quadratic curve, determine from the fitting errors in multiple neighborhoods whether the local shape conforms to curved-surface characteristics, and determine the surface existence index.
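Two of the descriptors above lend themselves to a compact illustration: counting horizontal and sloping patches from patch normals, and measuring 180° rotational symmetry. The sketch is hedged: the 10° angle tolerance, the function names, and the brute-force nearest-neighbor search are assumptions, and the mapping from average distance to a symmetry index is left open because the text does not specify it.

```python
import math

def count_horizontal_sloping(normals, angle_tol_deg=10.0):
    """Count patches whose normal is within angle_tol_deg of vertical (horizontal
    patches) versus the rest (sloping patches). The tolerance is an assumed value."""
    horizontal = sloping = 0
    for nx, ny, nz in normals:
        norm = math.sqrt(nx * nx + ny * ny + nz * nz)
        angle = math.degrees(math.acos(min(1.0, abs(nz) / norm)))
        if angle <= angle_tol_deg:
            horizontal += 1
        else:
            sloping += 1
    return horizontal, sloping

def symmetry_distance(points):
    """Average nearest-neighbor distance between the cloud rotated 180 degrees
    about the vertical axis through its centroid and the original cloud."""
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    rotated = [(2 * cx - x, 2 * cy - y, z) for x, y, z in points]
    return sum(min(math.dist(r, p) for p in points) for r in rotated) / len(points)

# A symmetric two-slope roof and an asymmetric single-slope (shed) roof.
gable = [(x, y, 2.0 - abs(y - 2.0)) for x in range(3) for y in range(5)]
shed = [(x, y, 0.5 * y) for x in range(3) for y in range(5)]
normals = [(0.0, 0.0, 1.0), (0.0, 0.5, 0.866), (0.0, -0.5, 0.866)]

counts = count_horizontal_sloping(normals)
d_gable = symmetry_distance(gable)
d_shed = symmetry_distance(shed)
print(counts, d_gable, round(d_shed, 3))
```

A small average distance indicates that the roof maps onto itself under the rotation, so a symmetry index could be derived by thresholding or rescaling this value.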
[0141] In a specific application scenario, the visual feature construction module is specifically used to generate projected images of the building roof from multiple perspectives, wherein the projected images from multiple perspectives include projected images from multiple main perspectives and projected images from multiple random auxiliary perspectives; each projected image is input into the visual encoder of a pre-trained visual language model to obtain the visual embedding of each projected image; the visual embeddings of each projected image are concatenated to obtain the visual features, wherein the dimension of the visual features is the sum of the dimensions of the visual embeddings of each projected image.
[0142] In specific application scenarios, the text feature construction module is specifically used to construct geometrically enhanced text prompts for each type of roof. The geometrically enhanced text prompts include descriptive text about the outline shape, slope relationship, number of ridges, depth variation, and surface features. The geometrically enhanced text prompts for each roof category are input into the text encoder of the visual language model to obtain text embeddings. The text embeddings of the same category are repeatedly concatenated to obtain text features with the same dimension as the visual features.
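The dimension bookkeeping shared by the visual and text branches can be shown with a short Python sketch. The view count of 10 and the embedding width of 512 follow the figures mentioned earlier in this description. The function names and the constant placeholder embeddings are hypothetical.

```python
def concat_views(view_embeddings):
    """Concatenate per-view visual embeddings into one visual feature vector."""
    return [value for emb in view_embeddings for value in emb]

def tile_text(text_embedding, n_views):
    """Repeat a category's text embedding so it matches the visual feature length."""
    return list(text_embedding) * n_views

n_views, dim = 10, 512
# Placeholder embeddings; real ones would come from the visual and text encoders.
view_embeddings = [[float(v)] * dim for v in range(n_views)]
text_embedding = [0.5] * dim

visual_feature = concat_views(view_embeddings)
text_feature = tile_text(text_embedding, n_views)
print(len(visual_feature), len(text_feature))
```

Repeating the text embedding once per view keeps each 512-dimensional text block aligned with the block of the corresponding view, so both vectors can be compared position by position.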
[0143] In specific application scenarios, the multimodal feature fusion module is specifically used to project the multidimensional geometric features onto the feature space of the visual language model using a multilayer perceptron to obtain geometric projection features, and to reorganize the visual features into multi-view visual features; to perform geometry-to-vision cross-attention, using the geometric projection features as the query and the multi-view visual features as the key and value, to calculate the attention-weighted visual semantic features; to perform vision-to-geometric reverse correction, using the mean of the attention-weighted visual features as the query and the geometric projection features as the key and value, to calculate the reverse-enhanced geometric constraint features; to perform residual fusion of the visual semantic features, the geometric constraint features, and the geometric projection features, and to adjust the contributions of the visual semantic features and the geometric constraint features through preset weight coefficients, and to obtain the multimodal fused features through layer normalization.
[0144] In a specific application scenario, the roof category recognition module is specifically used to calculate the cosine similarity between the multimodal fusion features and the text features of each category to obtain a basic semantic score; calculate the difference between each geometric descriptor of the building roof and the prior geometric template corresponding to each category of roof, and substitute the square of the difference into the Gaussian kernel function to obtain a geometric matching score, wherein the geometric descriptor is each component in the multidimensional geometric features; multiply the geometric matching score by a preset geometric intervention coefficient and then add it as an additive bias to the basic semantic score to obtain a corrected classification semantic score; perform normalization processing on the corrected classification semantic score, and take the roof category with the highest probability as the roof category of the building roof.
[0145] In specific application scenarios, the parameterized geometric primitives include planar roof primitives and curved roof primitives. The parameter vector of the planar roof primitive includes at least one of length, width, height, ridge position, and slope angle. The parameter vector of the curved roof primitive includes at least one of length, width, height, surface radius, central axis direction, and curvature control value.
[0146] It should be noted that other descriptions of the functional units involved in the three-dimensional model reconstruction device for building roofs provided in this application embodiment can be found in the corresponding descriptions of the method shown in Figures 1 to 7 and will not be repeated here.
[0147] This application also provides a computer device, specifically a personal computer, server, network device, etc. The computer device includes a bus, processor, memory, and communication interface, and may also include input / output interfaces and a display device. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium. The database of the computer device stores location information. The network interface of the computer device is used for communication with external terminals via a network connection. When the computer program is executed by the processor, it implements the steps in the various method embodiments.
[0148] Those skilled in the art will understand that the structure of the computer device described above is only a partial structure related to the solution of this application, and does not constitute a limitation on the computer device to which the solution of this application is applied. A specific computer device may include more or fewer components, or combine certain components, or have different component arrangements.
[0149] In one embodiment, a computer-readable storage medium is provided, which may be non-volatile or volatile, having stored thereon a computer program that, when executed by a processor, implements the steps in the above method embodiments.
[0150] In one embodiment, a computer program product is provided, including a computer program that, when executed by a processor, implements the steps in the above method embodiments.
[0151] Those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium. When executed, the computer program can include the processes of the embodiments described above. Any references to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided in this application may include at least one type of relational database and non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be general-purpose processors, graphics processors, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, etc., and are not limited to these.
[0152] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
[0153] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of this patent application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.
Claims
1. A method for reconstructing a three-dimensional model of a building roof, characterized in that the method includes:
acquiring point cloud data of the building roof to be modeled and preprocessing the point cloud data, wherein the point cloud data is acquired by an airborne lidar;
extracting explicit geometric features from the preprocessed point cloud data and constructing multidimensional geometric features;
generating projected images of the building roof from multiple perspectives, inputting each projected image into the visual encoder of a pre-trained visual language model to obtain visual embeddings, and constructing visual features by concatenation;
constructing a geometrically enhanced text prompt for each type of roof, inputting the geometrically enhanced text prompt of each roof category into the text encoder of the visual language model to obtain text embeddings, and repeatedly concatenating the text embeddings of the same category to obtain text features with the same dimension as the visual features;
projecting the multidimensional geometric features onto the feature space of the visual language model using a multilayer perceptron, reorganizing the visual features into multi-view visual features, performing geometry-to-vision cross-attention and vision-to-geometry reverse correction, and obtaining multimodal fusion features through residual fusion and layer normalization;
calculating the cosine similarity between the multimodal fusion features and the text features of each category to obtain the basic semantic score, and injecting the geometric matching score between the geometric descriptors of the building roof and the prior geometric template of each category into the basic semantic score to obtain the roof category of the building roof;
selecting, from a predefined roof primitive library, the parametric geometric primitive corresponding to the roof category of the building roof, constructing a parameter optimization target based on the geometric deviation between the point cloud data and the parametric geometric primitive, and optimizing the shape parameters, position parameters, and orientation parameters of the parametric geometric primitive to generate a three-dimensional roof model of the building roof.
2. The method according to claim 1, characterized in that the multidimensional geometric features include the total number of roof patches, the number of horizontal roof patches, the number of sloping roof patches, the number of triangular roof patches, the number of quadrilateral roof patches, a symmetry index, a contour concavity/convexity index, a valley line structure index, and a surface existence index; and the extracting of explicit geometric features from the preprocessed point cloud data and constructing of multidimensional geometric features includes:
extracting a set of roof patches from the point cloud data using a region growing method constrained by normal vector consistency and projection coplanarity, and calculating the total number of roof patches in the set;
calculating the angle between the normal vector of each roof patch in the set and the vertical direction vector, and determining the number of horizontal roof patches and the number of sloping roof patches based on the angle;
projecting the point cloud of each roof patch onto a local two-dimensional plane, obtaining the initial vertex sequence of each roof patch through a boundary extraction operator, removing redundant vertices through recursive angle discrimination, and determining the number of triangular roof patches and the number of quadrilateral roof patches based on the number of polygon vertices retained for each roof patch;
calculating the average distance between the point set obtained by rotating the point cloud data 180° about the Z-axis through the center point and the point set before rotation, and determining the symmetry index based on the average distance;
calculating the ratio between the roof outline area of the building roof and the area of the minimum bounding rectangle of the roof outline, and determining the contour concavity/convexity index based on the ratio;
constructing a concavity measure based on the relative relationship between the center-point elevation and the intersection-line elevation of two spatially adjacent slopes of the building roof, and determining the valley line structure index based on the concavity measure;
for any point on the building roof, solving the target tangential direction in its neighborhood, calculating the tangential displacement and normal vector deviation of the neighborhood points along the target tangential direction, fitting a quadratic curve, determining from the fitting errors in multiple neighborhoods whether the local shape conforms to curved-surface characteristics, and determining the surface existence index.
3. The method according to claim 1, characterized in that, The process of generating projected images of the building roof from multiple viewpoints, inputting each projected image into the visual encoder of a pre-trained visual language model to obtain visual embeddings, and constructing visual features through concatenation includes: Generate projection images of the building roof from multiple perspectives, wherein the projection images from multiple perspectives include projection images from multiple main perspectives and projection images from multiple random auxiliary perspectives; Each projected image is input into the visual encoder of a pre-trained visual language model to obtain the visual embedding of each projected image. The visual embeddings of each projected image are concatenated to obtain the visual features, wherein the dimension of the visual features is the sum of the dimensions of the visual embeddings of each projected image.
4. The method according to claim 1, characterized in that the constructing of a geometrically enhanced text prompt for each type of roof, inputting the geometrically enhanced text prompt of each roof category into the text encoder of the visual language model to obtain text embeddings, and repeatedly concatenating the text embeddings of the same category to obtain text features with the same dimension as the visual features includes:
constructing a geometrically enhanced text prompt for each type of roof, wherein the geometrically enhanced text prompt includes descriptive text about the outline shape, slope relationship, number of ridges, depth variation, and surface features;
inputting the geometrically enhanced text prompt of each roof category into the text encoder of the visual language model to obtain the text embeddings;
repeatedly concatenating the text embeddings of the same category to obtain text features with the same dimension as the visual features.
5. The method according to claim 1, characterized in that the projecting of the multidimensional geometric features onto the feature space of the visual language model using a multilayer perceptron, reorganizing the visual features into multi-view visual features, performing geometry-to-vision cross-attention and vision-to-geometry reverse correction, and obtaining multimodal fusion features through residual fusion and layer normalization includes:
projecting the multidimensional geometric features onto the feature space of the visual language model using a multilayer perceptron to obtain geometric projection features, and reorganizing the visual features into multi-view visual features;
performing geometry-to-vision cross-attention, using the geometric projection features as the query and the multi-view visual features as the key and value, to calculate attention-weighted visual semantic features;
performing vision-to-geometry reverse correction, using the mean of the attention-weighted visual features as the query and the geometric projection features as the key and value, to calculate reverse-enhanced geometric constraint features;
performing residual fusion of the visual semantic features, the geometric constraint features, and the geometric projection features, adjusting the contributions of the visual semantic features and the geometric constraint features through preset weight coefficients, and obtaining the multimodal fusion features through layer normalization.
6. The method according to claim 1, characterized in that calculating the cosine similarity between the multimodal fusion features and the text features of each category to obtain a basic semantic score, and injecting the geometric matching score between the geometric descriptors of the building roof and the prior geometric template of each category into the basic semantic score to obtain the roof category of the building roof, comprises: calculating the cosine similarity between the multimodal fusion features and the text features of each category to obtain the basic semantic score; calculating the difference between each geometric descriptor of the building roof and the prior geometric template corresponding to each roof category, and substituting the square of the difference into a Gaussian kernel function to obtain the geometric matching score, wherein the geometric descriptors are the components of the multidimensional geometric features; and multiplying the geometric matching score by a preset geometric intervention coefficient and adding the result as an additive bias to the basic semantic score to obtain a corrected classification semantic score, normalizing the corrected classification semantic score, and taking the roof category with the highest probability as the roof category of the building roof.
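The scoring in claim 6 can be sketched as below. The claim does not say how the per-descriptor Gaussian kernel values are aggregated or which normalization is used, so this sketch assumes a mean over descriptor components and a softmax; the bandwidth `sigma`, the coefficient name `lam` and all numeric values are hypothetical.

```python
import numpy as np

def cosine_scores(fused, text_feats):
    """Basic semantic score: cosine similarity of the fused vector with each category."""
    f = fused / np.linalg.norm(fused)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    return t @ f                                         # (C,)

def geometric_match(descriptor, templates, sigma=1.0):
    """Gaussian kernel on squared differences, averaged over descriptor components."""
    sq_diff = (descriptor[None, :] - templates) ** 2     # (C, G)
    return np.exp(-sq_diff / (2.0 * sigma ** 2)).mean(axis=1)

def classify(fused, text_feats, descriptor, templates, sigma=1.0, lam=0.5):
    """Add the scaled geometric score as a bias, normalize, and pick the best category."""
    scores = cosine_scores(fused, text_feats) + lam * geometric_match(descriptor, templates, sigma)
    e = np.exp(scores - scores.max())
    probs = e / e.sum()
    return int(np.argmax(probs)), probs

# Toy case: the semantic scores tie, so the geometric prior decides.
text_feats = np.array([[1.0, 0.0], [0.0, 1.0]])
fused = np.array([1.0, 1.0])
descriptor = np.array([0.1, 2.0])                 # e.g. ridge count, height range
templates = np.array([[0.0, 2.0], [1.0, 0.0]])    # one prior template per category
label, probs = classify(fused, text_feats, descriptor, templates)
print(label, probs)
```

Because the geometric term enters as an additive bias before normalization, it shifts the ranking only when the semantic scores are close, and `lam` controls how far it can override them.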
7. The method according to claim 1, characterized in that the parameterized geometric primitives include planar roof primitives and curved roof primitives; the parameter vector of a planar roof primitive includes at least one of length, width, height, ridge position and slope angle; and the parameter vector of a curved roof primitive includes at least one of length, width, height, surface radius, central axis direction and curvature control value.
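One possible in-memory representation of the two primitive families in claim 7 is sketched below. The class names, field names, the choice of a 3-vector for the central axis direction and the `PRIMITIVE_LIBRARY` mapping are all assumptions for illustration. The claim requires only "at least one of" the listed parameters, whereas this sketch carries all of them.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class PlanarRoofPrimitive:
    length: float
    width: float
    height: float
    ridge_position: float   # offset of the ridge across the width
    slope_angle: float      # radians

    def to_vector(self):
        """Flatten the parameters into the vector handed to an optimizer."""
        return np.array([self.length, self.width, self.height,
                         self.ridge_position, self.slope_angle])

@dataclass
class CurvedRoofPrimitive:
    length: float
    width: float
    height: float
    surface_radius: float
    axis_direction: tuple   # assumed to be a unit 3-vector
    curvature_control: float

    def to_vector(self):
        return np.array([self.length, self.width, self.height,
                         self.surface_radius, *self.axis_direction,
                         self.curvature_control])

# Hypothetical mapping from recognized roof category to primitive type.
PRIMITIVE_LIBRARY = {"gable": PlanarRoofPrimitive, "barrel": CurvedRoofPrimitive}

gable = PRIMITIVE_LIBRARY["gable"](10.0, 6.0, 3.0, 0.0, np.deg2rad(30.0))
barrel = PRIMITIVE_LIBRARY["barrel"](12.0, 8.0, 4.0, 5.0, (1.0, 0.0, 0.0), 0.5)
print(gable.to_vector().shape, barrel.to_vector().shape)
```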
8. A three-dimensional model reconstruction device for a building roof, characterized in that the device comprises:
a point cloud data acquisition module, used to acquire point cloud data of the building roof to be modeled and to preprocess the point cloud data, wherein the point cloud data is acquired by an airborne lidar;
a geometric feature construction module, used to extract explicit geometric features from the preprocessed point cloud data and construct multidimensional geometric features;
a visual feature construction module, used to generate projected images of the building roof from multiple perspectives, input each projected image into the visual encoder of a pre-trained visual language model to obtain visual embeddings, and construct visual features by concatenation;
a text feature construction module, used to construct a geometrically enhanced text cue for each roof category, input the geometrically enhanced text cue of each roof category into the text encoder of the visual language model to obtain text embeddings, and repeatedly concatenate the text embeddings of the same category to obtain text features with the same dimension as the visual features;
a multimodal feature fusion module, used to project the multidimensional geometric features onto the feature space of the visual language model using a multilayer perceptron, reshape the visual features into multi-view visual features, perform geometry-to-vision cross-attention and vision-to-geometry inverse correction, and obtain multimodal fusion features through residual fusion and layer normalization;
a roof category identification module, used to calculate the cosine similarity between the multimodal fusion features and the text features of each category to obtain a basic semantic score, and to inject the geometric matching score between the geometric descriptors of the building roof and the prior geometric template of each category into the basic semantic score to obtain the roof category of the building roof; and
a roof model reconstruction module, used to select parameterized geometric primitives corresponding to the roof category of the building roof from a predefined roof primitive library, construct a parameter optimization target based on the geometric deviation between the point cloud data and the parameterized geometric primitives, optimize the shape parameters, position parameters and attitude parameters of the parameterized geometric primitives, and generate a three-dimensional roof model of the building roof.
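The parameter optimization performed by the roof model reconstruction module can be illustrated for one hypothetical primitive, a symmetric gable roof. The optimizer is not specified in the claims; this sketch substitutes a plain grid search over the attitude (`yaw`) and position (`offset`) combined with linear least squares for the shape parameters (`ridge_h`, `slope`), and uses the vertical RMSE between the points and the primitive as the geometric deviation. All names and values are assumptions.

```python
import numpy as np

def cross_ridge_coord(xy, yaw):
    """Coordinate of each point measured perpendicular to a ridge with heading yaw."""
    return -np.sin(yaw) * xy[:, 0] + np.cos(yaw) * xy[:, 1]

def gable_height(xy, yaw, offset, ridge_h, slope):
    """Height of a symmetric gable surface above each (x, y) location."""
    return ridge_h - slope * np.abs(cross_ridge_coord(xy, yaw) - offset)

def fit_gable(points, yaws, offsets):
    """Grid-search attitude and position; solve the shape parameters in closed form."""
    xy, z = points[:, :2], points[:, 2]
    best = None
    for yaw in yaws:
        v = cross_ridge_coord(xy, yaw)
        for offset in offsets:
            # With yaw and offset fixed, z = ridge_h - slope * |v - offset| is linear.
            A = np.column_stack([np.ones_like(v), -np.abs(v - offset)])
            sol, *_ = np.linalg.lstsq(A, z, rcond=None)
            rmse = float(np.sqrt(np.mean((A @ sol - z) ** 2)))
            if best is None or rmse < best["rmse"]:
                best = dict(yaw=float(yaw), offset=float(offset),
                            ridge_h=float(sol[0]), slope=float(sol[1]), rmse=rmse)
    return best

# Synthetic roof points with a little noise, standing in for lidar returns.
rng = np.random.default_rng(0)
xy = rng.uniform(-5.0, 5.0, size=(500, 2))
z = gable_height(xy, yaw=0.3, offset=0.5, ridge_h=5.0, slope=0.6)
points = np.column_stack([xy, z + rng.normal(0.0, 0.01, size=500)])
fit = fit_gable(points, yaws=np.arange(0.0, 1.01, 0.1), offsets=np.arange(-1.0, 1.01, 0.25))
print(fit)
```

A grid search is crude but makes the split between attitude, position and shape parameters explicit; a gradient-based or robust optimizer could replace it without changing the objective.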
9. A storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, the method of any one of claims 1 to 7 is implemented.
10. A computer device, comprising a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor, characterized in that, when the processor executes the computer program, the method of any one of claims 1 to 7 is implemented.