Fast reconstruction method of real scene three-dimensional model of oblique satellite image

By constructing a geographic digital base with semantic tags and performing multi-view correction and registration in parallel, and combining it with an image processor for 3D reconstruction, the problem of limited efficiency and accuracy in existing technologies has been solved, and high-precision real-scene 3D model generation has been achieved at high efficiency and low cost.

CN121304976B · Active · Publication Date: 2026-06-16 · SHAANXI TIRAIN TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHAANXI TIRAIN TECH CO LTD
Filing Date
2025-11-25
Publication Date
2026-06-16

Abstract figure: CN121304976B_ABST

Abstract

The application discloses a method for rapid reconstruction of a real-scene three-dimensional model from oblique satellite imagery, relating to the technical field of image reconstruction. The method comprises: constructing a geographic digital base with semantic identification and storing it in the reconstruction module of a three-dimensional reconstruction platform; receiving, at the platform's data terminal, oblique satellite images of a target area from multiple perspectives, performing multi-perspective correction and registration in parallel, determining a reconstruction image set, and importing it into the reconstruction module; triggering the image processor built into the reconstruction module to perform three-dimensional reconstruction on the reconstruction image set, with the geographic digital base as the reference, to determine a real-scene three-dimensional model; and visualizing the model on the display interface of the three-dimensional reconstruction platform. The method addresses the technical problem that the efficiency, complexity, and accuracy of three-dimensional reconstruction from oblique imagery are limited in the prior art and, by fully exploiting the advantages of multi-perspective oblique imagery, achieves efficient, low-cost, and high-precision real-scene three-dimensional model reconstruction.

Description

Technical Field

[0001] This invention relates to the field of image reconstruction technology, specifically to a method for rapid reconstruction of real-world 3D models from oblique satellite imagery.

Background Technology

[0002] Currently, most existing 3D reconstruction technologies employ traditional image processing and photogrammetry methods. While these methods can construct ground feature models, they suffer from low efficiency and insufficient accuracy when processing large-scale, multi-view, and high-resolution image data.

[0003] First, because processing is serial, the modeling cycle is long when the number of images is large, making it difficult to meet the needs of emergency response or rapid updates. Second, geometric correction and registration errors across multi-view images are large, especially at building corners and in complex terrain, which easily leads to stitching misalignment and degrades the overall accuracy of the model. Third, oblique images are affected by perspective distortion and lighting differences, so texture mapping often shows blurring or seams, reducing the realism of the model.

[0004] To address these issues, some studies have introduced point cloud densification, deep learning segmentation, or multi-source data fusion. However, most of these solutions suffer from high computational complexity, reliance on proprietary data sources, and significant costs, and they still struggle to balance efficiency and accuracy in practical applications.

[0005] Therefore, fully exploiting the advantages of multi-view oblique imagery, combining open-source geographic data with rapid modeling methods, and achieving efficient, low-cost, and high-precision reconstruction of real-world 3D models remain a pressing technical challenge.

Summary of the Invention

[0006] This application provides a method for rapid reconstruction of real-scene 3D models from oblique satellite imagery, which addresses the technical problem of limited efficiency, complexity, and accuracy in 3D reconstruction under oblique imagery scenarios in existing technologies.

[0007] In view of the above problems, this application provides a method for rapid reconstruction of real-scene 3D models from oblique satellite imagery.

[0008] This application provides a method for rapid reconstruction of a real-scene 3D model from oblique satellite imagery. The method includes: constructing a geographic digital base with semantic tags and storing it in the reconstruction module of a 3D reconstruction platform; receiving oblique satellite imagery of the target area from multiple perspectives at the platform's data terminal, performing multi-view correction and registration in parallel, determining the reconstructed image set, and importing it into the reconstruction module; triggering the image processor built into the reconstruction module to perform 3D reconstruction based on the geographic digital base and the reconstructed image set to determine the real-scene 3D model; wherein the 3D reconstruction stage includes image boundary segmentation and completion, dense point cloud reconstruction, lightweight mesh reconstruction, and optimized viewpoint mapping; and visualizing the real-scene 3D model on the display interface of the 3D reconstruction platform.

[0009] One or more technical solutions provided in this application have at least the following technical effects or advantages:

[0010] The method for rapid reconstruction of a real-scene 3D model from oblique satellite imagery provided in this application constructs a geographic digital base with semantic tags, stores it in the reconstruction module of a 3D reconstruction platform, receives oblique satellite imagery of the target area from multiple perspectives, performs multi-view correction and registration in parallel, determines the reconstructed image set and imports it into the reconstruction module, triggers the image processor built into the reconstruction module, and performs 3D reconstruction based on the geographic digital base and the reconstructed image set to determine the real-scene 3D model, which is then visualized on the display interface of the 3D reconstruction platform. This method addresses the technical problems of limited efficiency, complexity, and accuracy of 3D reconstruction in oblique imagery scenarios in the prior art. By fully utilizing the advantages of multi-view oblique imagery, it achieves the technical effect of efficient, low-cost, and high-precision real-scene 3D model reconstruction.

Attached Figure Description

[0011] Figure 1 is a flowchart of the method for rapid reconstruction of real-scene 3D models from oblique satellite imagery provided by this application.

[0012] Figure 2 is a schematic diagram of the construction process of the geographic digital base in the method for rapid reconstruction of real-scene 3D models from oblique satellite imagery provided by this application.

Detailed Implementation

[0013] This application provides a method for rapid reconstruction of real-scene 3D models from oblique satellite imagery, which addresses the technical limitations in efficiency, complexity, and accuracy of 3D reconstruction in oblique imagery scenarios in existing technologies.

[0014] As shown in Figure 1, this application provides a method for rapid reconstruction of a real-scene 3D model from oblique satellite imagery, the method comprising:

[0015] S1: Construct a geographic digital base with semantic tags and store it in the reconstruction module of the 3D reconstruction platform. The platform data terminal receives oblique satellite images of the target area from multiple perspectives, performs multi-view correction and registration in parallel, determines the reconstructed image set, and imports it into the reconstruction module.

[0016] In this application, a geographic digital base with semantic tags is proposed: a set of basic geographic information about the target area, stored and expressed digitally, which mainly includes the vector boundaries of ground features, elevation data, and spatial topological relationships.

[0017] Preferably, semantic tags are introduced to assign semantic attributes to the different geographic features in the geographic digital base. For example, building outlines can be marked as building features, road boundaries as road features, and terrain undulations can be described using elevation data. In this application, the semantic tags provide constraints and prior knowledge for the subsequent 3D modeling process, so that the boundaries and attributes of different features can be identified and kept consistent during modeling.

[0018] Subsequently, the geographic digital base is stored in the reconstruction module of the 3D reconstruction platform as the reference data source for subsequent image processing and model reconstruction.

[0019] Based on this, the platform's data terminal receives oblique satellite images of the target area from multiple perspectives. Specifically, oblique satellite images are images acquired by sensors onboard the satellite at non-vertical angles, and their perspectives may include frontal, forward, rearward, left, and right views, which can supplement the missing side information of ground features in traditional orthophotos.

[0020] For example, the texture of a building's facade is often obscured or compressed in orthophotos, but can be fully recorded in oblique images, thereby improving the realism and completeness of the model.

[0021] In the specific implementation process, when the platform's data terminal receives the above-mentioned multi-view images, it needs to support multiple input formats, such as TIFF or JPEG2000, in order to adapt to the output standards of different satellite data sources.

[0022] Subsequently, multi-view correction and registration are performed in parallel. Specifically, the correction mainly includes two aspects: radiometric correction and geometric correction. Radiometric correction is used to eliminate grayscale inconsistencies caused by atmospheric scattering, changes in illumination, and other factors, so that the multi-view images maintain uniformity in brightness and color. Geometric correction, on the other hand, uses satellite orbital parameters and camera imaging parameters, combined with ground control point data, to perform spatial geometric transformations on the images to eliminate distortions caused by sensor tilt or the curvature of the Earth, so that the spatial position of the images matches the actual geographical location.

[0023] Furthermore, the corrected multi-view images are registered; that is, images from different perspectives are spatially aligned, which can be achieved by feature point matching.

[0024] For example, scale- and rotation-invariant features such as building corners or road intersections can be extracted, and random sample consensus (RANSAC) can be used to eliminate mismatched points, so that images from different angles are mapped into a unified coordinate system. Because correction and registration are performed in parallel across viewpoints, this process significantly reduces processing time while ensuring spatial consistency between images.

[0025] Finally, the reconstructed image set is determined: a set of multi-view images that have undergone radiometric correction, geometric correction, and spatial registration. This set is strictly consistent in spatial location and has balanced texture representation, meeting the needs of subsequent 3D reconstruction.

[0026] For example, in modeling an urban area, the reconstructed image set may contain image data from five perspectives, each image carrying spatial positioning information referenced to the geographic digital base. After this image set is imported into the reconstruction module, the module can perform 3D geometric reconstruction and texture mapping against a unified reference under the constraints of the geographic digital base, forming a high-precision, semantically consistent real-world 3D model.

[0027] Furthermore, as shown in Figure 2, to construct the geographic digital base with semantic tags, step S1 of this application includes:

[0028] Semantic tags are determined, wherein the elements of the semantic tags include at least the vector boundaries of ground features and elevation data; for the target area, a lightweight geographic topology is constructed, and the semantic tags are introduced to identify key topological locations, the result serving as the geographic digital base.

[0029] In this application, the semantic tag refers to a set of attributes that semantically identify different elements in geospatial data. It includes not only spatial location but also abstract cognition of element categories.

[0030] In one feasible implementation proposed in this application, the elements of the semantic tag include at least the ground feature vector boundary and elevation data. The ground feature vector boundary is the outline of a feature expressed as vector data, such as the polygonal boundary of a building's exterior walls, the linear course of a road, or the boundary of a park or green space. Elevation data are numerical values reflecting the surface relief of the feature and can be derived from digital elevation models or lidar point cloud data. By combining the vector boundary and the elevation data, a semantic tag describes the spatial characteristics of a feature both in plan and in the vertical direction.

[0031] For example, a building semantic label can include not only its two-dimensional planar outline, but also the building's average height or vertex elevation.

[0032] Subsequently, for the target area, a lightweight geographic topology is constructed, that is, while ensuring the correctness of spatial topological relationships, simplified rules are used to model the relationship between ground features.

[0033] In this application, since the constructed geographic digital base is mainly used for image comparison reference in subsequent modeling, it is constructed using a lightweight topology and semantic tagging method to eliminate information redundancy while meeting processing requirements.

[0034] Specifically, lightweight geographic topology retains only key adjacency, containment, and intersection relationships to support the necessary constraints for 3D reconstruction. For example, within a city block, the relationship between buildings and roads only needs to be identified by indicating that roads pass along building boundaries, rather than establishing a full topological information network, thereby reducing data redundancy and improving processing efficiency.
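A minimal sketch of the lightweight topology idea follows. It assumes, purely for illustration, that each feature's vector boundary is simplified to an axis-aligned bounding box and its elevation to one representative value; the names `Feature`, `relation`, and `build_topology` are hypothetical. Only containment, intersection, and adjacency are recorded, and unrelated pairs store nothing, which is where the reduction in redundancy comes from.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Feature:
    fid: str
    tag: str          # semantic tag, e.g. "building", "road"
    bbox: tuple       # (xmin, ymin, xmax, ymax): simplified vector boundary (assumption)
    elevation: float  # representative elevation in metres (assumption)


def relation(a, b, tol=0.5):
    """Return the key relation of a to b, or None if the pair is unrelated."""
    ax0, ay0, ax1, ay1 = a.bbox
    bx0, by0, bx1, by1 = b.bbox
    if ax0 <= bx0 and ay0 <= by0 and ax1 >= bx1 and ay1 >= by1:
        return "contains"
    if bx0 <= ax0 and by0 <= ay0 and bx1 >= ax1 and by1 >= ay1:
        return "within"
    # Signed gaps along each axis; a negative gap means the boxes overlap on that axis.
    gap_x = max(bx0 - ax1, ax0 - bx1)
    gap_y = max(by0 - ay1, ay0 - by1)
    if gap_x < 0 and gap_y < 0:
        return "intersects"
    if gap_x <= tol and gap_y <= tol:
        return "adjacent"  # touching, or separated by less than tol
    return None


def build_topology(features, tol=0.5):
    """Lightweight topology: (fid, relation, fid) triples for related pairs only."""
    edges = []
    for a, b in combinations(features, 2):
        rel = relation(a, b, tol)
        if rel is not None:
            edges.append((a.fid, rel, b.fid))
    return edges


features = [
    Feature("b1", "building", (0, 0, 10, 10), 35.0),
    Feature("r1", "road", (10.2, 0, 14, 30), 12.0),
    Feature("p1", "park", (100, 100, 120, 120), 10.0),
]
print(build_topology(features))  # prints [('b1', 'adjacent', 'r1')]
```

Here the road running along the building's edge is recorded as a single adjacency, while the distant park produces no relation at all.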

[0035] Subsequently, semantic tags are introduced to identify key topological locations. In this application, key topological locations are spatial nodes or boundaries that constrain 3D modeling, such as road intersections, building corners, or abrupt terrain changes. Attaching semantic tags to these locations lets them serve as reference points during reconstruction, ensuring that the geometry remains consistent with the actual geographical scene.

[0036] For example, when meshing a building, points tagged with building-corner semantics can be used as control points for mesh generation, thereby avoiding wall misalignment or shape distortion caused by ambiguous data.

[0037] Finally, the result of the lightweight geographic topology construction, including its semantic tags, is used as the geographic digital base. This base serves as the benchmark for subsequent image processing and 3D modeling: it not only provides spatial constraints but also guides image segmentation, feature point extraction, and texture mapping through semantic information. As a result, the generated 3D model conforms more closely to the semantic attributes and spatial structure of the actual ground features, significantly improving the model's accuracy and expressiveness.

[0038] Furthermore, to receive oblique satellite images of the target area from multiple perspectives, perform multi-view correction and registration in parallel, and determine the reconstructed image set, step S1 of this application includes:

[0039] The oblique satellite imagery is received, and prior verification and correction processing are performed on the oblique satellite imagery from each viewpoint using radiometric and geometric correction to determine the corrected oblique satellite imagery. Using feature point matching as the registration method and the geographic digital base as a reference, spatial phase alignment registration is performed on the corrected oblique satellite imagery to determine the reconstructed image set.

[0040] In this application's embodiments, oblique satellite imagery refers to multi-angle remote sensing images acquired by satellite sensors at non-vertical angles. These images typically include multiple directions such as frontal, forward-looking, back-looking, left-looking, and right-looking. Their key feature is the ability to simultaneously provide information on the top and sides of ground features, resulting in richer data content compared to traditional orthophotos. The received imagery data may originate from different satellite platforms and be stored in various formats; therefore, the platform must possess multi-source compatible input capabilities to ensure smooth data import.

[0041] After importing the oblique satellite imagery, radiometric and geometric corrections are performed sequentially on the images from each viewpoint. Radiometric correction refers to correcting inconsistencies in brightness and color in the images caused by factors such as atmospheric scattering, differences in illumination intensity, and variations in sensor sensitivity.

[0042] For example, when photographing the same building at different times or from different angles, its walls may exhibit varying shades of gray. Radiometric correction can unify these variations into a consistent brightness level. In a preferred implementation, the average value from multiple viewing angles can be used as the standard for uniformity.
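Using the cross-view average as the uniformity standard can be sketched as a per-view gain and offset. This is a simplified assumption, not the patent's exact procedure: each view is treated as a single band, one global linear correction is applied per view, and no atmospheric model is used. The function name is hypothetical.

```python
import numpy as np


def normalize_to_cross_view_mean(images):
    """Rescale each single-band view so that its mean and standard deviation equal
    the averages taken across all views (one global gain/offset per view)."""
    means = np.array([img.mean() for img in images])
    stds = np.array([img.std() for img in images])
    target_mean, target_std = means.mean(), stds.mean()
    corrected = []
    for img, m, s in zip(images, means, stds):
        gain = target_std / s if s > 0 else 1.0  # leave flat images unscaled
        corrected.append((img - m) * gain + target_mean)
    return corrected
```

A local or histogram-based correction would handle uneven illumination within a view better; the global version only shows how the cross-view average acts as the common target.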

[0043] Similarly, geometric correction eliminates spatial distortions in images caused by imaging angles, Earth curvature, and terrain undulations by using sensor imaging models, satellite orbit parameters, and ground control points, so that the images maintain a correspondence with the actual locations of ground features in the geographic coordinate system.
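As an illustration of how ground control points constrain geometric correction, the sketch below fits a first-order polynomial (affine) mapping from image coordinates to ground coordinates by least squares. Rigorous satellite correction normally relies on a sensor model (for example rational polynomial coefficients) together with elevation data, so the affine model here is a deliberately simplified stand-in, and all function names are hypothetical.

```python
import numpy as np


def fit_affine_gcp(image_pts, ground_pts):
    """Least-squares first-order polynomial (affine) fit from ground control points.
    Returns a 2x3 matrix A such that [x, y] = A @ [col, row, 1]."""
    image_pts = np.asarray(image_pts, float)
    ground_pts = np.asarray(ground_pts, float)
    design = np.hstack([image_pts, np.ones((len(image_pts), 1))])  # N x 3
    coeffs, *_ = np.linalg.lstsq(design, ground_pts, rcond=None)    # 3 x 2
    return coeffs.T


def apply_affine(A, image_pts):
    """Map image (col, row) coordinates to ground (x, y) coordinates."""
    image_pts = np.asarray(image_pts, float)
    return image_pts @ A[:, :2].T + A[:, 2]


def gcp_rmse(A, image_pts, ground_pts):
    """Root-mean-square ground residual of the fitted model at the control points."""
    residual = apply_affine(A, image_pts) - np.asarray(ground_pts, float)
    return float(np.sqrt((residual ** 2).sum(axis=1).mean()))
```

At least three non-collinear control points are required; with more points, the least-squares residual reported by `gcp_rmse` gives a simple accuracy check.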

[0044] Finally, after radiometric and geometric correction, the images form the corrected oblique satellite imagery, which is highly consistent in both space and color.

[0045] Based on this, feature point matching is used as the registration method and the geographic digital base is used as the reference to perform spatial phase alignment registration on the corrected oblique satellite imagery. Specifically, scale- and rotation-invariant feature points, such as building corners, road intersections, or bridge edges, are extracted from the imagery, and correspondences between images from different viewpoints are found to spatially align the images.

[0046] In an optimized implementation, a random sample consensus (RANSAC) algorithm can be introduced to eliminate mismatched points and improve matching accuracy, so that the final registration result is stable and reliable.
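A minimal RANSAC sketch for rejecting mismatched points is shown below. It assumes the matches between two corrected views are related by a 2D affine transform, which is a simplification (the motion model is not specified in the source); `ransac_affine`, its pixel threshold, and its iteration count are hypothetical choices.

```python
import numpy as np


def fit_affine(src, dst):
    """Least-squares affine fit; returns a 3x2 matrix M with dst ~= [src, 1] @ M."""
    design = np.hstack([src, np.ones((len(src), 1))])
    M, *_ = np.linalg.lstsq(design, dst, rcond=None)
    return M


def ransac_affine(src, dst, n_iter=500, threshold=1.0, seed=0):
    """RANSAC: fit a model to 3 random matches per iteration, keep the model with the
    most inliers (residual < threshold pixels), then refit on all of its inliers."""
    src = np.asarray(src, float)
    dst = np.asarray(dst, float)
    design = np.hstack([src, np.ones((len(src), 1))])
    rng = np.random.default_rng(seed)
    best = np.zeros(len(src), dtype=bool)
    for _ in range(n_iter):
        sample = rng.choice(len(src), size=3, replace=False)
        M = fit_affine(src[sample], dst[sample])
        residual = np.linalg.norm(design @ M - dst, axis=1)
        inliers = residual < threshold
        if inliers.sum() > best.sum():
            best = inliers
    if best.sum() < 3:
        raise ValueError("no consistent affine model found")
    return fit_affine(src[best], dst[best]), best
```

The returned boolean mask marks the matches kept for registration; everything outside it is treated as a mismatch.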

[0047] Meanwhile, the geographic digital base serves as a reference benchmark, providing semantic boundaries and elevation information for ground features. This ensures that the spatial alignment of multi-view images is not only geometrically consistent but also semantically consistent, thereby reducing misalignment and overlap.

[0048] In summary, the reconstructed image set, namely a set of multi-view images after prior verification, radiometric correction, geometric correction and spatial registration, has a unified spatial reference, coordinated lighting performance and high-precision geometric consistency, and can be directly used in the subsequent 3D modeling process, thus laying the foundation for generating high-quality real-scene 3D models.

[0049] S2: Trigger the image processor built into the reconstruction module to perform three-dimensional reconstruction based on the geographic digital base and the reconstructed image set to determine the real-scene three-dimensional model. The three-dimensional reconstruction stage includes image boundary segmentation and completion, dense point cloud reconstruction, lightweight mesh reconstruction, and optimal viewpoint mapping.

[0050] S3: Visualize the real-world 3D model on the display interface of the 3D reconstruction platform.

[0051] In the modeling stage of this application, 3D reconstruction processing is performed by triggering the image processor built into the reconstruction module. The image processor refers to a multi-stage modeling pipeline implemented in the reconstruction module in a software and hardware collaborative manner. It has task orchestration and resource scheduling capabilities and can adaptively allocate computing threads and video memory resources according to the size of the input image and the complexity of the scene.

[0052] After the image processor is started, modeling is carried out based on the geographic digital base. That is, the geographic digital base with semantic labels is injected into the reconstruction process as both geometric and semantic constraints to limit the spatial search range of candidate geometry and guide subsequent texture selection.

[0053] For example, building polygon boundaries are used as hard constraints for mesh cutting; elevation grids are used as threshold references for point cloud filtering; semantic labels, such as roads and water bodies, are used to suppress mismatches and texture contamination in non-rigid regions. By using the base as a global reference, error propagation can be constrained in the early stages, reducing the overall geometric offset caused by local optima.

[0054] Subsequently, the reconstructed image set is used as input data to perform three-dimensional reconstruction, that is, the process of retrieving the three-dimensional geometric and texture attributes of the target area from two-dimensional multi-view images under a unified coordinate framework. Its input is a multi-view sequence after correction and registration, and the output is a real-scene three-dimensional model containing vertices, normals, texture coordinates and material mapping.

[0055] In a preferred exemplary approach, to ensure a clear data link, images can be grouped and scheduled according to viewpoint type and coverage blocks, with priority given to blocks with sufficient parallax and high overlap, and cross-block boundary stitching performed after reconstruction to avoid boundary misalignment.
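The grouping and scheduling described above could be organized as follows. The `parallax` and `overlap` fields, the thresholds, and all names are hypothetical; the sketch only shows grouping by coverage block and viewpoint type, then ranking blocks so that well-conditioned ones are reconstructed first. Cross-block boundary stitching is not shown.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SceneImage:
    name: str
    view: str        # viewpoint type: "front", "forward", "backward", "left", "right"
    block: str       # coverage block id
    parallax: float  # hypothetical parallax measure for this image
    overlap: float   # hypothetical fraction of the block covered, 0..1


def schedule_blocks(images, min_views=3, min_parallax=0.2):
    """Group images by coverage block and viewpoint type, then order blocks so that
    blocks with enough viewpoints and parallax come first, ties broken by mean overlap."""
    blocks = defaultdict(lambda: defaultdict(list))
    for img in images:
        blocks[img.block][img.view].append(img)

    def priority(block_id):
        by_view = blocks[block_id]
        imgs = [img for group in by_view.values() for img in group]
        well_conditioned = (
            len(by_view) >= min_views
            and max(img.parallax for img in imgs) >= min_parallax
        )
        mean_overlap = sum(img.overlap for img in imgs) / len(imgs)
        return (well_conditioned, mean_overlap)

    order = sorted(blocks, key=priority, reverse=True)
    return [(block_id, dict(blocks[block_id])) for block_id in order]
```

Sorting on the tuple `(well_conditioned, mean_overlap)` puts every sufficiently conditioned block ahead of the rest, and within each group prefers higher overlap.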

[0056] In this process, the 3D reconstruction stage comprises four consecutive sub-stages: image boundary segmentation and completion, dense point cloud reconstruction, lightweight mesh reconstruction, and optimized viewpoint mapping. Each sub-stage explicitly interacts with the geographic digital base. The image boundary segmentation and completion stage first partitions the image into regions based on semantic boundaries, separating areas such as buildings, roads, and water bodies to reduce cross-category texture interference. For areas where texture is missing because of occlusion or shadows, the corresponding regions are retrieved from adjacent views and completed by interpolation based on geometric consistency. Semantic consistency constraints are introduced when necessary, for example to prevent vegetation textures from being added to building facades by mistake. The output of this stage is a set of effective images with explicit segmentation masks and confidence scores, which improves the robustness of subsequent matching and projection.

[0057] Subsequently, the dense point cloud reconstruction stage performs multi-view stereo matching and triangulation, guided by the robust features and segmentation obtained in the previous stage, to determine the dense point cloud. Optionally, to suppress outliers and holes, a voxel grid aligned with the elevation grid can be used for local consistency checks, and semantic labels can be used to define reasonable intervals for the normal distribution. For example, in roof areas, points on near-horizontal surfaces (near-vertical normals) are preferentially retained, and in facade areas, points on near-vertical surfaces (near-horizontal normals) are preferentially retained, thereby improving the physical plausibility of the point cloud structure. The output of this stage is a dense point cloud with normal and intensity attributes, providing sufficient sampling for meshing.
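A sketch of elevation- and semantics-based filtering is given below. It assumes a north-up elevation grid indexed as `(row, col) = (floor((y - y0) / cell), floor((x - x0) / cell))`, and the per-label angular limits, `max_height`, and `tol` are hypothetical values. Tilt is measured between the point normal and the vertical axis, so roof points keep near-vertical normals and facade points keep near-horizontal normals.

```python
import numpy as np

# Hypothetical limits on the tilt of the normal from vertical, in degrees.
NORMAL_TILT_LIMITS = {"roof": (0.0, 30.0), "facade": (60.0, 90.0)}


def filter_points(points, normals, labels, dem, origin, cell, max_height=300.0, tol=1.0):
    """Return a boolean keep-mask for points that (1) lie between the elevation-grid
    surface (minus tol) and max_height above it, and (2) have a normal tilt inside the
    interval allowed for their semantic label. Unlisted labels skip the normal test."""
    points = np.asarray(points, float)
    normals = np.asarray(normals, float)
    labels = np.asarray(labels)
    cols = np.floor((points[:, 0] - origin[0]) / cell).astype(int)
    rows = np.floor((points[:, 1] - origin[1]) / cell).astype(int)
    inside = (rows >= 0) & (rows < dem.shape[0]) & (cols >= 0) & (cols < dem.shape[1])
    keep = inside.copy()
    # Height above the elevation grid, evaluated only for points on the grid.
    height = points[inside, 2] - dem[rows[inside], cols[inside]]
    keep[inside] = (height >= -tol) & (height <= max_height)
    unit = normals / np.linalg.norm(normals, axis=1, keepdims=True)
    tilt = np.degrees(np.arccos(np.clip(np.abs(unit[:, 2]), 0.0, 1.0)))
    for label, (lo, hi) in NORMAL_TILT_LIMITS.items():
        mask = labels == label
        keep[mask] &= (tilt[mask] >= lo) & (tilt[mask] <= hi)
    return keep
```

Using the absolute vertical component makes the test independent of whether normals point outward or inward.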

[0058] Subsequently, the lightweight mesh reconstruction stage performs incremental Poisson reconstruction on the dense point cloud, prioritizing the growth of surfaces from high-confidence sub-blocks, and using semantic boundaries as hard cut lines during the growth process to avoid surface connectivity across buildings and roads. The output of this stage is a topologically valid, detail-preserving, and face-controlled triangular mesh, providing a foundation for real-time rendering and network distribution.
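The growth order and hard cut lines can be illustrated on a grid of sub-blocks. This sketch does not perform Poisson surface reconstruction; it only assigns sub-blocks to surface patches, seeding from the highest-confidence sub-block and refusing to grow across a change of semantic label. The grid representation, `min_conf`, and the function name are hypothetical.

```python
import heapq

import numpy as np


def grow_patches(confidence, labels, min_conf=0.3):
    """Assign grid sub-blocks to surface patches, growing highest confidence first.
    Growth never crosses a change in semantic label (a hard cut line) and skips
    sub-blocks below min_conf. Returns patch ids, with -1 meaning not reconstructed."""
    h, w = confidence.shape
    patch = np.full((h, w), -1, dtype=int)
    seeds = sorted(((confidence[r, c], r, c) for r in range(h) for c in range(w)), reverse=True)
    next_id = 0
    for conf, r0, c0 in seeds:
        if conf < min_conf or patch[r0, c0] != -1:
            continue
        patch[r0, c0] = next_id
        heap = [(-conf, r0, c0)]  # max-heap on confidence via negation
        while heap:
            _, r, c = heapq.heappop(heap)
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if not (0 <= nr < h and 0 <= nc < w) or patch[nr, nc] != -1:
                    continue
                if labels[nr, nc] != labels[r, c] or confidence[nr, nc] < min_conf:
                    continue  # hard cut line, or unreliable sub-block
                patch[nr, nc] = next_id
                heapq.heappush(heap, (-confidence[nr, nc], nr, nc))
        next_id += 1
    return patch
```

In a full pipeline, each resulting patch would be meshed separately and then stitched, which is one way the cut lines keep surfaces from connecting across buildings and roads.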

[0059] Subsequently, the optimal viewpoint mapping stage performs texture source selection and projection mapping on the stabilized mesh; that is, for each target mesh fragment, it selects the candidate images that provide the best imaging conditions. Evaluation indicators include, but are not limited to, incidence angle, parallax, occlusion probability, and radiometric consistency. Semantic weights are incorporated into the evaluation; for example, the weight of oblique side views is increased for facade areas, and the weight of frontal views is increased for roof areas. After the main texture source is selected, geometric projection and seam optimization are performed: seam visibility is reduced by cross-view color adjustment and boundary feathering, and texture leakage is constrained by the segmentation mask during texture coordinate generation, ultimately producing realistic and consistent material maps.
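Viewpoint scoring for a single mesh fragment might look like the sketch below. The four indicators come from the paragraph above, but their normalization, the weights (0.4, 0.2, 0.2, 0.2), and the 1.5 semantic multipliers are illustrative assumptions rather than values from the source.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    image_id: str
    view: str                 # "front" (near-nadir) or an oblique side view name
    incidence_deg: float      # angle between the fragment normal and the viewing ray
    parallax: float           # normalized to 0..1 (assumption)
    occlusion_prob: float     # 0..1
    radiometric_score: float  # 0..1, agreement with neighbouring fragments


# Hypothetical semantic multipliers: oblique views for facades, frontal views for roofs.
SEMANTIC_BOOST = {("facade", "oblique"): 1.5, ("roof", "front"): 1.5}


def view_score(c, label):
    base = (
        0.4 * max(0.0, 1.0 - c.incidence_deg / 90.0)
        + 0.2 * c.parallax
        + 0.2 * (1.0 - c.occlusion_prob)
        + 0.2 * c.radiometric_score
    )
    kind = "front" if c.view == "front" else "oblique"
    return base * SEMANTIC_BOOST.get((label, kind), 1.0)


def select_main_source(candidates, label):
    """Pick the highest-scoring candidate image as the main texture source."""
    return max(candidates, key=lambda c: view_score(c, label))
```

Because the semantic term is a multiplier, it only tips the choice between candidates with comparable imaging conditions; a grazing view remains penalized through the incidence term.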

[0060] In summary, the identified real-world 3D model of the target area is a 3D representation entity that combines geometric accuracy and realistic texture under real spatial scale and coordinate reference, and is capable of undergoing further measurement, analysis, and publishing operations.

[0061] In a further preferred embodiment, to ensure model reusability, the system generates a metadata list before output, recording coordinate references, semantic legends, LOD layering, and texture resolution parameters to facilitate cross-platform loading and subsequent incremental updates.
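A minimal sketch of such a metadata manifest, serialized as JSON, is shown below. The field names and layout are hypothetical and follow no particular standard; `EPSG:32649` (WGS 84 / UTM zone 49N) appears only as an example coordinate reference.

```python
import json


def build_manifest(crs, semantic_legend, lod_face_counts, texture_resolution_m):
    """Serialize a hypothetical model manifest. lod_face_counts is ordered coarse -> fine."""
    manifest = {
        "coordinate_reference": crs,
        "semantic_legend": semantic_legend,
        "lod_levels": [
            {"level": i, "max_faces": faces} for i, faces in enumerate(lod_face_counts)
        ],
        "texture_resolution_m": texture_resolution_m,
    }
    return json.dumps(manifest, indent=2, sort_keys=True)


print(build_manifest("EPSG:32649", {"1": "building", "2": "road"}, [5000, 50000, 500000], 0.5))
```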

[0062] Subsequently, the real-world 3D model is visualized on the display interface of the 3D reconstruction platform. The display process is as follows: when loading the real-world 3D model, the display interface reads the aforementioned metadata, adaptively selects the LOD level according to device performance and network bandwidth, and enables view-frustum culling and on-demand texture streaming to keep the frame rate stable during interaction. To facilitate review and verification, the interface provides semantic layer switches and measurement tools, allowing users to toggle between layers and thus closing the loop from data to visible results.
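The adaptive LOD choice could reduce to a budget check like the one below, assuming each level advertises its face count and payload size. The GPU face budget, the 2-second load target, and the function name are hypothetical; view-frustum culling and texture streaming are not shown.

```python
def select_lod(levels, gpu_face_budget, bandwidth_mbps, max_load_s=2.0):
    """levels: list of (max_faces, size_mb), ordered coarse -> fine.
    Return the index of the finest level that fits both the GPU face budget and the
    load-time target at the given bandwidth; fall back to level 0 if none fits."""
    chosen = 0
    for i, (faces, size_mb) in enumerate(levels):
        load_s = size_mb * 8.0 / bandwidth_mbps  # megabytes -> megabits
        if faces <= gpu_face_budget and load_s <= max_load_s:
            chosen = i
    return chosen
```

With levels of 1 MB, 8 MB, and 60 MB, a 50 Mbps link selects the middle level (about 1.3 s to load), while a 1000 Mbps link allows the finest one.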

[0063] Furthermore, to construct the image processor before the image processor built into the reconstruction module is triggered, step S2 of this application includes:

[0064] Image boundary segmentation and completion is used as the first reconstruction stage, dense point cloud reconstruction as the second reconstruction stage, lightweight mesh reconstruction as the third reconstruction stage, and optimized viewpoint mapping as the fourth reconstruction stage. The underlying logic of the four reconstruction stages is deployed, and the image processor is obtained by sample-driven training until convergence.

[0065] The logic of the first reconstruction stage is segmentation based on ground feature vector boundaries and interpolation completion of occluded areas; the logic of the second reconstruction stage is the 3D distribution obtained by fusing multi-view feature points; the logic of the third reconstruction stage is incremental Poisson reconstruction; and the logic of the fourth reconstruction stage is dynamic selection of the main texture source under the preferred viewpoint together with semantically guided voxel rendering.

[0066] In this embodiment, the reconstruction process is divided into four consecutive stages, which trigger the pipeline execution of the image processor. Under a unified coordinate framework and semantic constraints, the progressive solution from two-dimensional image to three-dimensional geometry and then to texture representation is completed in sequence. The data structures before and after each stage are compatible, and the output is the input of the next stage, avoiding repeated calculations and reducing error accumulation.

[0067] The following is a specific implementation method provided in this application:

[0068] Firstly, the image boundary segmentation and completion in the first reconstruction stage adopts segmentation based on ground feature vector boundaries and interpolation completion based on occlusion areas as the core logic: the former ensures that the category boundary is consistent with the real boundary, and the latter uses cross-view redundancy to make up for missing areas.

[0069] Specifically, segmentation based on ground feature boundaries involves projecting the polygonal boundaries of buildings, the linear boundaries of roads, and the boundaries of water bodies from the geographic digital base onto the images from each viewpoint. These projections serve as a weakly or strongly supervised prior that guides a semantic segmentation network or a traditional segmentation operator to converge within the corresponding regions, yielding segmentation results with consistent categories and clear edges. Interpolation completion of occluded areas refers to filling areas that lack texture because of tree canopies, shadows, or extreme viewpoints with geometrically corrected corresponding regions from adjacent viewpoints, using cross-view projection and photometric consistency constraints. If necessary, semantic priors are superimposed to suppress color mixing between categories.
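Projecting a building footprint from the geographic digital base into one view and rasterizing it into a prior mask might look like this. The sketch assumes a generic 3x4 projective camera matrix as a stand-in for the satellite's actual imaging model (which is often expressed as rational polynomial coefficients) and rasterizes with even-odd ray casting; all names are hypothetical.

```python
import numpy as np


def project(P, pts_xyz):
    """Project Nx3 world points with a 3x4 camera matrix P; returns Nx2 pixel coordinates."""
    pts_xyz = np.asarray(pts_xyz, float)
    homo = np.hstack([pts_xyz, np.ones((len(pts_xyz), 1))]) @ P.T
    return homo[:, :2] / homo[:, 2:3]


def polygon_mask(poly_uv, shape):
    """Even-odd (ray casting) rasterization of a polygon into a boolean mask,
    testing pixel centres against each polygon edge."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    px, py = xs + 0.5, ys + 0.5
    inside = np.zeros(shape, dtype=bool)
    n = len(poly_uv)
    for i in range(n):
        x0, y0 = poly_uv[i]
        x1, y1 = poly_uv[(i + 1) % n]
        crosses = (y0 > py) != (y1 > py)  # edge straddles this pixel row
        with np.errstate(divide="ignore", invalid="ignore"):
            x_at = x0 + (py - y0) * (x1 - x0) / (y1 - y0)
        inside ^= crosses & (px < x_at)
    return inside
```

The resulting mask can then act as the region prior within which a segmentation network or operator is asked to converge.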

[0070] For example, when a building facade is obscured by roadside trees, the corresponding facade block is retrieved from the front-view or rear-view image. After registration, the texture and edge are jointly completed within the obscuring mask to improve the reliability of subsequent matching and projection.
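A minimal sketch of occlusion completion follows, assuming the donor view (for example, the front-view or rear-view image) has already been registered pixel-for-pixel to the target view. A single brightness gain estimated on unoccluded pixels stands in for the photometric consistency constraint; the joint edge completion described above is not shown, and the function name is hypothetical.

```python
import numpy as np


def complete_occlusion(target, donor, occlusion_mask):
    """Fill occluded pixels of `target` with pixels from a `donor` view that is already
    registered to it, after scaling the donor to the target's brightness on the
    pixels both views see (a single-gain photometric adjustment)."""
    target = np.asarray(target, float)
    donor = np.asarray(donor, float)
    visible = ~occlusion_mask
    gain = target[visible].mean() / donor[visible].mean()
    completed = target.copy()
    completed[occlusion_mask] = donor[occlusion_mask] * gain
    return completed
```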

[0071] Secondly, the dense point cloud reconstruction in the second reconstruction stage takes the three-dimensional distribution obtained by fusing multi-view feature points as its core logic: through robust solution of cross-view corresponding-point tracks and triangulation, a high-density, semantically consistent three-dimensional point cloud is obtained, providing sufficient sampling and normal priors for surface reconstruction.

[0072] Specifically, feature point fusion refers to extracting scale- and rotation-invariant corner and edge points from each viewpoint and filtering cross-class noise under segmentation mask constraints; robust tracks of corresponding points are then obtained through cross-view matching and RANSAC-based rejection of mismatches. The three-dimensional distribution refers to triangulating the corresponding points in a unified coordinate system to obtain a dense three-dimensional point set with normal and intensity attributes.
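Triangulating one corresponding-point track can be sketched with the linear direct linear transform (DLT) method, assuming each view is described by a 3x4 projective camera matrix (again a stand-in for the satellite sensor model). Each observation contributes two rows to a homogeneous system, and the right singular vector with the smallest singular value gives the 3D point; the function name is hypothetical.

```python
import numpy as np


def triangulate_dlt(cameras, observations):
    """Linear (DLT) triangulation of one track seen by two or more 3x4 cameras.
    observations: list of (u, v) pixel coordinates, one per camera."""
    rows = []
    for P, (u, v) in zip(cameras, observations):
        rows.append(u * P[2] - P[0])  # u * (p3 . X) - (p1 . X) = 0
        rows.append(v * P[2] - P[1])  # v * (p3 . X) - (p2 . X) = 0
    _, _, vt = np.linalg.svd(np.asarray(rows))
    X = vt[-1]  # least-squares null vector in homogeneous coordinates
    return X[:3] / X[3]
```

Adding more views simply adds rows, which is how a multi-view track of corresponding points is fused into one 3D position.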

[0073] In a preferred embodiment, an elevation grid serves as the voxel reference for consistency checks on outliers, and the normal prior is adjusted by semantic category, for example, making roofs more horizontal and facades more vertical. This makes the point cloud more physically plausible and easier to reconstruct into surfaces.
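A minimal sketch of the semantic normal prior, assuming unit normals in a local frame whose z axis points up. The blending rule, the `strength` parameter, and the category names are hypothetical illustrations of "roofs more horizontal, facades more vertical", not the application's own formula.

```python
import numpy as np

UP = np.array([0.0, 0.0, 1.0])

def adjust_normal(n, category, strength=0.5):
    """Pull a normal toward the semantic prior for its class (hypothetical rule)."""
    n = n / np.linalg.norm(n)
    if category == "roof":
        # Horizontal roof surface -> normal pointing straight up (or down).
        target = UP if n @ UP >= 0 else -UP
    elif category == "facade":
        # Vertical facade surface -> normal with no vertical component.
        h = n.copy()
        h[2] = 0.0
        if np.linalg.norm(h) < 1e-9:
            return n
        target = h / np.linalg.norm(h)
    else:
        return n                                   # other classes: no prior applied
    m = (1.0 - strength) * n + strength * target   # linear blend, then renormalize
    return m / np.linalg.norm(m)
```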

[0074] Third, lightweight mesh reconstruction in the third reconstruction stage is based on incremental Poisson reconstruction: while preserving surface continuity, semantic boundaries serve as topological cutting conditions and as references for error evaluation, and face reduction is performed in regions of gentle curvature. This balances detail fidelity against model size, which facilitates real-time rendering and network distribution.

[0075] Specifically, the incremental approach grows implicit surfaces from high-confidence sub-blocks and expands them gradually as more data is introduced, while alternating between local reconstruction and global stitching to keep memory usage and reconstruction stability under control. To preserve the geometric realism of semantic boundaries, building outlines, road boundaries, and similar features are used as hard cut lines that suppress cross-class connectivity during reconstruction. After the initial surface is obtained, adaptive surface simplification with edge preservation is applied based on curvature thresholds and projection errors, giving priority to high-value details such as roof ridges, eaves, and corners, and a topologically valid triangular mesh with a controlled face count is output.

[0076] Fourth, the optimal viewpoint mapping in the fourth reconstruction stage follows the logic of dynamic selection of the main texture source under the optimal viewpoint and semantically guided voxel rendering: the best imaging source is selected through viewpoint scoring, and texture fitting is completed with semantically adaptive color correction and seam optimization; in the facade area with high reflectivity or repetitive texture, voxel-level occlusion testing and semantic consistency regularization are introduced to reduce seam visibility and color drift, ensuring that the model has realism and consistency at both the macro and detail levels.

[0077] Specifically, dynamic selection means scoring each candidate view comprehensively on incidence angle, resolution (ground sampling distance), occlusion probability, radiometric consistency, and geometric reprojection error, with weights assigned by semantic category (for example, raising the score of side views for facades). The highest-scoring view becomes the primary texture source, and the second-best view is used for seam optimization and detail compensation. Semantically guided voxel rendering means establishing a one-to-one mapping between mesh fragments and voxel space and selecting differentiated color correction, anti-aliasing, and boundary feathering strategies by semantic category. This suppresses problems such as road texture leaking onto building facades or contamination from water reflections, and ultimately produces a texture mapping result with high consistency and low seam visibility.

[0078] Furthermore, the processing logic above is trained on samples until convergence. One feasible implementation provided in this application is as follows: a multi-view sample library covering multiple climates, landforms, and building volumes is constructed, containing the original images with their semantic ground truth as well as the corrected and registered geometric references and quality labels. During training, class balancing and a boundary enhancement loss are used for the segmentation network; a composite loss of photometric consistency and geometric consistency is introduced for matching and depth estimation; and seam visibility and color drift regularization terms are introduced for texture mapping. The convergence criterion can be that multiple indicators on the validation set stabilize simultaneously.

[0079] Once the above training process converges, the stage logic and parameters are solidified into a configurable image processor and embedded in the reconstruction module of the 3D reconstruction platform.

[0080] Furthermore, in the first reconstruction stage, performing three-dimensional reconstruction using the reconstructed image set in step S2 of this application includes:

[0081] Based on the geographic digital base, the reconstructed image set is segmented using the ground feature vector boundary as the segmentation target, and a segmentation identifier is added to each reconstructed image. The reconstructed image set is scanned to detect occlusion areas. If occlusion areas exist, interpolation is used to complete the occlusion areas and determine the valid image set. The interpolation is optimized using semantic tag-based guidance and reference to viewpoint-registered images.

[0082] In this embodiment, the reconstructed image set is positionally constrained by the geographic digital base, and each image is partitioned into regions using the ground-feature vector boundaries as the segmentation targets. Specifically, the feature vector boundaries are the set of boundaries in the geographic digital base described above, such as building outlines, road boundary (right-of-way) lines, and water-body extents, which can be projected into the image coordinate system and expressed in vector form.

[0083] In one feasible implementation provided in this application, by projecting the vector boundary onto images from various viewpoints under a unified coordinate reference, candidate segmentation regions consistent with the image pixel grid are obtained, thereby limiting the spatial search range of the segmentation solution, reducing the probability of cross-class missegmentation and improving the solution stability.
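As an illustration of projecting a vector boundary into an image and turning it into a candidate region on the pixel grid, the sketch below assumes a simple 3 x 4 projection matrix (a pinhole-style approximation; satellite imagery is often modeled with rational polynomial coefficients instead) and rasterizes the projected polygon with the even-odd rule. `project` and `polygon_mask` are hypothetical helper names.

```python
import numpy as np

def project(P, pts3d):
    """Project N x 3 world points with a 3 x 4 camera matrix (pinhole assumption)."""
    h = np.hstack([pts3d, np.ones((len(pts3d), 1))]) @ P.T
    return h[:, :2] / h[:, 2:3]

def polygon_mask(poly2d, height, width):
    """Rasterize a 2D polygon (M x 2, x/y) into a boolean mask with the even-odd rule."""
    ys, xs = np.mgrid[0:height, 0:width]
    px, py = xs + 0.5, ys + 0.5                # test at pixel centers
    inside = np.zeros((height, width), dtype=bool)
    x0, y0 = poly2d[:, 0], poly2d[:, 1]
    x1, y1 = np.roll(x0, -1), np.roll(y0, -1)  # edge end points
    for a, b, c, d in zip(x0, y0, x1, y1):
        if b == d:
            continue                           # horizontal edges never cross a scanline
        crosses = (b > py) != (d > py)         # edge spans this pixel row
        x_int = a + (py - b) * (c - a) / (d - b)
        inside ^= crosses & (px < x_int)       # toggle for each crossing to the right
    return inside
```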

[0084] After the targets are spatially located, the reconstructed image set is segmented to obtain pixel-level masks in one-to-one correspondence with land-cover categories. The localized segmentation proposed in this application uses the prior candidate regions obtained by vector boundary projection as strong or weak supervision to guide semantic segmentation, or traditional energy-minimization segmentation operators, to converge within a limited region, and combines edge enhancement with morphological refinement to output masks with continuous boundaries and controlled holes.

[0085] Subsequently, after the localization and segmentation are completed, segmentation identifiers are added to each reconstructed image. These identifiers are category and region metadata stored with the image, including at least category codes such as buildings, roads, and water bodies; region IDs; boundary polygons; and confidence scores. These identifiers are used for operator selection and weight adjustment in subsequent matching, depth estimation, and texture mapping stages.

[0086] Subsequently, occlusion detection is performed on the reconstructed image set to identify areas of missing visibility caused by tree canopies, shadows, extreme viewing angles, or other occluders. In one exemplary detection method proposed in this application, a pixel is judged to fall within an occlusion area when it exhibits a high reprojection error between the main view and the reference view, its photometric similarity is below a threshold, and the parallax of the corresponding triangulation is unstable.
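The three-cue occlusion rule above can be written as a per-pixel boolean test. In this sketch, photometric similarity is assumed to be normalized cross-correlation (NCC) between patches, "unstable parallax" is assumed to be a high standard deviation of parallax across view pairs, and all thresholds are hypothetical.

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation of two equal-size patches, in [-1, 1]."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def occlusion_mask(reproj_err, photo_sim, parallax_std,
                   err_thresh=1.5, sim_thresh=0.6, parallax_thresh=0.5):
    """Flag pixels as occluded only when all three cues agree (thresholds hypothetical)."""
    return ((reproj_err > err_thresh)          # high reprojection error between views
            & (photo_sim < sim_thresh)         # low photometric similarity
            & (parallax_std > parallax_thresh))  # unstable triangulation parallax
```

Requiring all three cues at once trades some recall for fewer false alarms on pixels that are merely poorly textured.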

[0087] Furthermore, when occlusion areas are detected, interpolation is used to complete the occlusion areas in order to restore the geometric and texture visibility of key areas.

[0088] Specifically, interpolation completion means filling in the content of missing pixels while maintaining geometric consistency. In a preferred embodiment, the reference views (that is, images from other viewpoints) are first scored on incidence angle, occlusion probability, and reprojection error. The highest-scoring views are selected as the main completion sources, sub-pixel alignment is performed in the overlapping area using local affine transforms or thin-plate splines, and pixel-level completion is then carried out by confidence-weighted fusion.

[0089] Preferably, to avoid cross-category texture leakage, the interpolation is optimized using semantic-tag-based guidance and reference to view-registered images. The former constrains completion to the mask of the same semantic category; for example, a building region is filled only with texture and edge information taken from building areas. The latter uses reference views registered to the unified coordinate system to check the completion results for consistency and to correct color drift.
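The confidence-weighted, category-constrained completion described in the two paragraphs above can be sketched as below. It assumes the reference views have already been registered and resampled into the target image's pixel grid, that each view carries a scalar score, and that semantic labels are integer codes; `complete_occlusion` is a hypothetical name.

```python
import numpy as np

def complete_occlusion(target, occ_mask, target_labels, sources, weights):
    """Fill occluded pixels by confidence-weighted averaging of registered views.

    sources: list of (image, labels) already resampled into the target pixel grid.
    weights: per-source view scores (e.g. from incidence angle and reprojection error).
    Only source pixels whose label matches the target label contribute.
    """
    out = target.astype(float).copy()
    num = np.zeros_like(out)
    den = np.zeros_like(out)
    for (img, labels), w in zip(sources, weights):
        valid = occ_mask & (labels == target_labels)   # same-category constraint
        num[valid] += w * img[valid]
        den[valid] += w
    filled = den > 0
    out[filled] = num[filled] / den[filled]
    return out, occ_mask & ~filled                     # also report pixels left unfilled
```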

[0090] In summary, through the aforementioned processes of localization and segmentation, occlusion detection, and optimized interpolation completion, a valid image set is determined. This set comprises multi-view images with complete segmentation labels, occlusion masks, and completion content within a unified coordinate framework. These images meet the input requirements for subsequent 3D reconstruction in terms of photometric representation, geometric consistency, and semantic consistency. This significantly reduces the incidence of mismatches and texture seams, and provides a reliable data foundation for generating high-precision, semantically consistent real-world 3D models.

[0091] Furthermore, in the second reconstruction stage, performing three-dimensional reconstruction using the reconstructed image set in step S2 of this application includes:

[0092] For the reconstructed image set, feature point extraction is performed, wherein the feature points include at least corner points and edge points, and the feature points have scale-rotation invariance; for the feature points, image feature extraction and multi-view fusion are performed to determine the fused point cloud features; triangulation under multi-view conditions is used to determine the three-dimensional point cloud distribution; based on the fused point cloud features and the three-dimensional point cloud distribution, a dense reconstructed point cloud is determined.

[0093] In this embodiment of the application, when entering the second stage of 3D reconstruction, feature point extraction is first performed on the reconstructed image set. The feature points refer to key pixel locations in the image that can be stably identified and have local saliency; typical types include at least corner points and edge points.

[0094] Specifically, corner points are typically located in areas of dramatic grayscale changes and discontinuous orientation in an image, such as the corners of buildings or the intersections of roads. They maintain high detectability even under scale and rotation transformations. Edge points, on the other hand, are local extrema formed along the contours of objects or the boundaries of texture changes, such as the boundary between roofs and walls or the edge contours of roads. To ensure cross-view consistency, these feature points must possess scale-rotation invariance, meaning that the point can be correctly matched using the feature descriptor regardless of different scaling ratios or rotation angles.

[0095] Subsequently, image features are extracted at the feature points: a high-dimensional feature vector, such as a grayscale distribution or a histogram of oriented gradients, is computed over the local neighborhood of each feature point, giving a stable representation that can distinguish different points. Next, the features from multiple views are fused: the corresponding features of the same physical point are found in images from different viewpoints, and these feature vectors are weighted and merged under the registered geometric constraints, reducing information loss caused by illumination or occlusion in any single view.

[0096] For example, the corner features of a building extracted from a frontal view may be blurred due to shadows, but the information of that point is complete in a side view. The fusion process can enhance the robustness of the overall features, ultimately forming a fused point cloud feature with cross-view robustness.

[0097] Furthermore, the 3D point cloud distribution is determined by triangulation from multiple viewpoints. Specifically, the 3D coordinates of a feature point are recovered geometrically by intersecting the back-projected rays of that point observed from at least two different viewpoints.

[0098] Specifically, when the pixel coordinates of the same feature point are known in different images, the camera's intrinsic and extrinsic parameter models are used to perform trigonometric calculations: the camera positions and the target position form a triangle, and the angles between its sides are used in sine or cosine relations to solve for the position of the feature point in the three-dimensional coordinate system. Triangulating a large number of feature points in this way yields a dense three-dimensional point cloud whose overall coverage is consistent with the reconstructed image set.
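The sine/cosine construction above is one way to intersect viewing rays; a common equivalent linear formulation is direct linear transformation (DLT) triangulation, sketched here as a swapped-in standard technique rather than the application's own solver. It assumes each view is described by a 3 x 4 projection matrix; `triangulate_dlt` is a hypothetical name.

```python
import numpy as np

def triangulate_dlt(Ps, xs):
    """Linear (DLT) triangulation of one point observed in two or more views.

    Ps: list of 3 x 4 projection matrices; xs: list of (u, v) pixel coordinates.
    """
    rows = []
    for P, (u, v) in zip(Ps, xs):
        rows.append(u * P[2] - P[0])   # each view contributes two linear equations
        rows.append(v * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.asarray(rows))
    X = vt[-1]                         # null-space direction = homogeneous 3D point
    return X[:3] / X[3]
```

With more than two views, each additional view simply adds two rows to the linear system, which matches the multi-view triangulation described in [0097].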

[0099] Finally, a densely reconstructed point cloud is determined from the fused point cloud features and the 3D point cloud distribution. In this process, the fused point cloud features provide reliable initial matching pairs, while the 3D point cloud distribution supplies spatial geometric constraints; combining the two effectively eliminates incorrect matches and fills sparse areas.

[0100] For example, in building facade areas, stereo matching is performed within a neighborhood of pixels guided by the fused features to supplement textured areas not covered by the sparse point cloud; in planar road areas, point cloud uniformity is improved through disparity consistency checks. The final dense reconstructed point cloud is continuous and complete in spatial distribution and, in terms of geometric accuracy and texture consistency, outperforms traditional sparse reconstruction results, providing a solid foundation for subsequent mesh generation and texture mapping.

[0101] Furthermore, in the third reconstruction stage, performing three-dimensional reconstruction using the reconstructed image set in step S2 of this application includes:

[0102] Using 3D point cloud distribution and elevation data, spatial recursion is performed to determine triangular mesh division, wherein a preset gradient difference is used to measure the height; the densely reconstructed point cloud and the triangular mesh are fused to determine the triangular mesh model.

[0103] In this embodiment, three-dimensional point cloud distribution and elevation data are used as input to perform spatial recursive partitioning, that is, the reconstruction domain is subdivided from top to bottom using a hierarchical spatial segmentation structure under a unified coordinate system.

[0104] The implementation can be quadtree / octree or adaptive block segmentation: when the point cloud density, elevation undulation or normal change of a local area exceeds the threshold, further subdivision is triggered, so that the mesh division is more dense in areas with complex details and remains simple in areas with gentle terrain.
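The recursive subdivision above can be sketched as a quadtree over a square elevation grid. For brevity this sketch triggers a split on elevation relief alone (point density and normal change are omitted), and `quadtree`, `max_relief`, and `min_size` are hypothetical.

```python
import numpy as np

def quadtree(dem, x0, y0, size, max_relief, min_size=2, out=None):
    """Recursively split a square DEM block while its elevation relief exceeds max_relief.

    Returns a list of (x0, y0, size) leaf blocks: dense where terrain is rough,
    coarse where it is gentle.
    """
    if out is None:
        out = []
    block = dem[y0:y0 + size, x0:x0 + size]
    if size <= min_size or block.max() - block.min() <= max_relief:
        out.append((x0, y0, size))      # leaf: flat enough or smallest allowed size
        return out
    h = size // 2
    for dy in (0, h):
        for dx in (0, h):
            quadtree(dem, x0 + dx, y0 + dy, h, max_relief, min_size, out)
    return out
```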

[0105] Preferably, to improve consistency between elevation layers, a constraint based on a preset gradient difference is introduced during recursion for contour quantization: the quantization step size is set from the gradient magnitude of the elevation field, and layered quantization with contour-band constraints is applied to the elevation values. Mesh vertices within the same quantization layer thus tend to be coplanar, and abrupt elevation changes at mesh edges that cross layers are expressed in a controlled manner. This reduces fluctuation and jitter caused by noise, produces regular layered structures in near-horizontal areas such as rooftops, terraces, and road surfaces, and provides good initial values for subsequent surface fitting and topology optimization.
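A minimal sketch of gradient-dependent layered quantization, assuming the step size scales linearly with the mean gradient magnitude of the elevation grid; the application does not give the formula, so `quantize_elevation`, `base_step`, `k`, and the scaling rule are hypothetical.

```python
import numpy as np

def quantize_elevation(dem, base_step=0.5, k=1.0):
    """Layered quantization of an elevation grid (hypothetical step rule).

    The step is base_step * (1 + k * mean gradient magnitude); every elevation
    is snapped to the nearest multiple of that step.
    """
    gy, gx = np.gradient(dem)              # finite-difference gradients per axis
    g = np.hypot(gx, gy).mean()            # mean gradient magnitude
    step = base_step * (1.0 + k * g)
    return np.round(dem / step) * step, step
```

Height noise smaller than half a step collapses onto a single layer, which is the behavior that makes near-flat rooftops and road surfaces coplanar.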

[0106] Finally, the triangular mesh is generated. Specifically, within each recursive sub-block, constrained triangulation is performed on the quantized elevation grid points and boundary control lines, preferably using the Delaunay criterion or a constrained Delaunay variant to ensure triangle shape quality. Feature vector boundaries marked as hard boundaries, such as building outlines and road edges, are inserted as constraint edges that cannot be crossed, avoiding cross-category mesh connectivity and geometric aliasing. At the same time, sampling density and edge-fidelity weights are increased in regions with high curvature or abrupt normal changes, so that features such as roof ridges, eaves, and corners are clearly expressed at the mesh level. The result is a triangular mesh skeleton that adapts spatially to point cloud density and is consistent with the contour quantization in the elevation dimension.

[0107] Subsequently, the densely reconstructed point cloud and the triangular mesh are fused. Specifically, each mesh vertex is first projected to its nearest point, or to a local plane fitted to the point cloud, with robust estimation suppressing the influence of outliers. A global optimization objective is then built from bidirectional point-to-face and edge-to-face distances as energy terms, combined with normal consistency and voxel occupancy consistency, and several rounds of geometric refinement are carried out by iterative closest point (ICP) or gradient descent so that the mesh fits the actually sampled surface more closely.
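The vertex-to-point-cloud fitting above can be approximated by a simplified ICP-style loop. This sketch keeps only the robust nearest-neighbor attraction term (Huber weights) and omits the normal-consistency and voxel-occupancy terms; it uses brute-force distances, so it is suitable only for small inputs. `fit_vertices` and its parameters are hypothetical.

```python
import numpy as np

def fit_vertices(verts, cloud, k=8, delta=0.5, step=0.5, iters=5):
    """Pull mesh vertices toward the dense point cloud (simplified ICP-style refinement).

    Each iteration takes the k nearest cloud points per vertex, down-weights far
    points with Huber weights, and moves the vertex a fraction `step` toward
    their weighted mean.
    """
    v = verts.astype(float).copy()
    for _ in range(iters):
        d = np.linalg.norm(v[:, None, :] - cloud[None, :, :], axis=2)  # V x N distances
        nn = np.argsort(d, axis=1)[:, :k]                              # k nearest indices
        nd = np.take_along_axis(d, nn, axis=1)
        w = np.where(nd <= delta, 1.0, delta / np.maximum(nd, 1e-12))  # Huber weights
        target = (w[:, :, None] * cloud[nn]).sum(axis=1) / w.sum(axis=1, keepdims=True)
        v += step * (target - v)
    return v
```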

[0108] Preferably, in order to prevent semantic boundaries and hard features from being overly smoothed, weight protection is applied to constrained edges and high curvature edges during the optimization process to achieve a balance between detail fidelity and controllable scale.

[0109] The final output triangular mesh model is highly consistent with the dense point cloud in terms of vertex position, normal field and topological connectivity, and meets the engineering requirements for subsequent texture mapping and multi-level detail distribution.

[0110] Furthermore, in the fourth reconstruction stage, performing three-dimensional reconstruction using the reconstructed image set in step S2 of this application includes:

[0111] A first mesh region in the triangular mesh model is determined as a first reconstruction partition, wherein the first mesh region is any triangular mesh region; for the first reconstruction partition, multi-view partition localization is performed in the valid image set to select the main texture source under the preferred view; and the first partition model is determined by projecting the main texture source of the first reconstruction partition and performing semantic-tag-based optimized rendering.

[0112] In this embodiment, a first mesh region is first determined in the triangular mesh model as a first reconstruction partition. The first mesh region refers to the smallest independently processable unit within any triangular mesh cell constituting the triangular mesh model, whose vertex coordinates, normals, and boundaries are its basic elements.

[0113] In this application, the reconstruction method of each triangular mesh region is the same, and the first mesh region will be used for detailed explanation here.

[0114] Specifically, multi-view partition localization is performed for the first reconstruction partition within the valid image set: the triangular facets of the partition are projected orthogonally into the image coordinate system to obtain the pixel coverage of the partition in each view, and visibility tests, such as backface culling and depth consistency checks against the dense point cloud, determine whether the partition is occluded. Preferably, to suppress cross-category misprojection, segmentation identifiers are introduced during localization for semantic consistency filtering, so that only view responses matching the semantics of the partition are retained.
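A per-facet visibility check combining backface culling, depth consistency, and the semantic filter described above might look like this. It assumes the depth rendered from the dense point cloud at the facet's pixel is already available; `facet_visible` and `rel_tol` are hypothetical.

```python
import numpy as np

def facet_visible(normal, centroid, cam_center, rendered_depth, facet_depth,
                  facet_label=None, view_label=None, rel_tol=0.02):
    """Visibility of one triangle facet in one view (hypothetical helper)."""
    if normal @ (cam_center - centroid) <= 0:
        return False                   # backface: facet points away from the camera
    if facet_label is not None and view_label is not None and facet_label != view_label:
        return False                   # semantic consistency filter
    # Occluded if the point-cloud depth at this pixel is noticeably nearer than the facet.
    return bool(facet_depth <= rendered_depth * (1.0 + rel_tol))
```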

[0115] After the candidate views are obtained, they are evaluated and the primary texture source under the preferred viewpoint is selected. The primary texture source is the image region, among the candidate views, with the best imaging conditions for the target region. One possible evaluation scheme is: the smaller the incidence angle (the angle between the surface normal and the line of sight), the higher the score; the lower the occlusion probability, the higher the score; the higher the radiometric consistency and color stability, the higher the score; and the smaller the reprojection error, the higher the score. The scoring weights can be adapted to the semantic category.

[0116] For example, facade partitions of street-facing buildings typically score higher in left or right oblique views, while roof partitions score higher in frontal views.
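The view scoring above can be sketched as a weighted sum of per-view cues. The weight table, the cue normalizations, and the dictionary layout of the candidates are all hypothetical; only the directions (smaller incidence angle, lower occlusion, higher radiometric consistency, and smaller reprojection error score higher) come from the text.

```python
import numpy as np

# Hypothetical per-class weights; not taken from the application.
WEIGHTS = {
    "roof":   {"angle": 0.5, "occlusion": 0.2, "radiometry": 0.15, "reproj": 0.15},
    "facade": {"angle": 0.3, "occlusion": 0.3, "radiometry": 0.2,  "reproj": 0.2},
}

def score_view(normal, to_camera, occ_prob, radiometric, reproj_err, category):
    """Higher is better. to_camera is a unit vector from the surface toward the sensor."""
    w = WEIGHTS[category]
    cos_inc = max(0.0, float(normal @ to_camera))   # small incidence angle -> near 1
    return (w["angle"] * cos_inc
            + w["occlusion"] * (1.0 - occ_prob)
            + w["radiometry"] * radiometric          # consistency score in [0, 1]
            + w["reproj"] / (1.0 + reproj_err))      # reprojection error in pixels

def pick_views(candidates, normal, category):
    """Best view = primary texture source; second best = seam and detail helper."""
    ranked = sorted(candidates, reverse=True,
                    key=lambda c: score_view(normal, c["dir"], c["occ"], c["rad"], c["err"], category))
    return ranked[0]["id"], (ranked[1]["id"] if len(ranked) > 1 else None)
```

For a facade facing +x, a view looking from the +x side wins because its line of sight is closest to the facade normal, while a roof partition prefers the frontal view, matching the example above.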

[0117] Using the main texture source of the first reconstruction partition as the reference, projection mapping and semantic-tag-based optimized rendering are performed to determine the first partition model. In the projection mapping stage, the partition triangles are parameterized into a unified texture coordinate domain, the corresponding pixels of the main texture source are written into the texture atlas, and sub-pixel alignment is corrected under a geometric reprojection error constraint. The optimized rendering stage is guided by semantic tags: for example, edge preservation and anisotropic sharpening are applied to building facades to retain window-frame and corner details; color drift correction and small-scale noise suppression are applied to roofs to improve surface consistency; and low-frequency smoothing is applied to roads while high-frequency repetitive textures are suppressed to avoid moiré patterns. At partition boundaries, color balancing and feathered transitions that draw on the second-best views reduce seam visibility. To avoid cross-category texture leakage, all sampling is constrained by the segmentation masks, and potentially occluded pixels are removed through a repeated visibility check.
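Seam handling at partition boundaries can be illustrated with simple gain matching and linear feathering. This is a sketch under assumptions: a per-pixel distance to the seam is precomputed, color balancing is reduced to a single multiplicative gain, and `feather_blend`, `match_gain`, and `width` are hypothetical.

```python
import numpy as np

def match_gain(src, ref, mask):
    """Scale src so its mean over the overlap mask matches ref (simple color balancing)."""
    return src * (ref[mask].mean() / max(src[mask].mean(), 1e-12))

def feather_blend(primary, secondary, dist_to_seam, width=8.0):
    """Linear feathering across a partition seam.

    dist_to_seam: per-pixel distance (pixels) inside the partition from its boundary.
    At the seam the two views mix 50/50; beyond `width` only the primary source is used.
    """
    alpha = np.clip(0.5 + 0.5 * dist_to_seam / width, 0.5, 1.0)
    return alpha * primary + (1.0 - alpha) * secondary
```

Balancing the gain before feathering keeps the blend from merely smearing a brightness step across the transition band.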

[0118] After completing the above steps, the output includes the first partition model containing geometric patches, normals, UV coordinates and optimized textures, and records the source view ID, rating log and rendering parameters of the partition as metadata for global compositing and quality backtracking.

[0119] For example, in an urban street scene, after the above process is applied to a single facade section, the brick joints and window frame edges are clearly expressed, and a smooth transition is achieved with the roof section at the eaves, thereby meeting the engineering requirements for subsequent display and measurement.

[0120] Furthermore, in step S2 of this application, determining the first partition model through the projection of the main texture source and semantic-tag-based optimized rendering includes:

[0121] Based on the semantic tags, the first rendering conditions under explicit and implicit expressions are determined for the first reconstruction partition; voxel clipping is performed on the main texture source; for each clipping region, a single point cloud is constructed and twin expansion within the clipping region is performed using texture projection of a reconstructed point cloud and optimized rendering based on the first rendering conditions, to determine the first partition model, wherein similar voxels are used as the clipping criteria.

[0122] In this embodiment, during the partition rendering stage, the first reconstruction partition is first parsed according to the semantic tags to determine the first rendering conditions under explicit and implicit expressions, based on the semantic attributes pre-assigned to elements of the geographic digital base, such as buildings, water bodies, and vegetation. Explicit expressions are the visible attributes specified directly by the tags; for example, building facades should keep sharp edges and road surfaces should show continuous grayscale. Implicit expressions are the indirect characteristics implied by the semantics; for example, water bodies should have smooth color and low-frequency reflection characteristics, and vegetation should have rich texture detail and random color variation.

[0123] By using explicit and implicit two-layer parsing, differentiated first rendering conditions can be formed, that is, different constraint strategies are adopted in terms of texture clarity, color consistency, and seam visibility control, so as to ensure that the output model is consistent with the real objects at both the visual and semantic levels.

[0124] Subsequently, voxel clipping is performed on the main texture source, with voxel similarity as the clipping criterion: voxels with similar characteristics are clipped into the same partition. In the processing method of this application, reconstruction and rendering are performed for only a single point cloud in each such partition, and the result is then twin-distributed to the remaining voxel positions in the same partition, which reduces redundant processing and balances reconstruction efficiency against quality.

[0125] Specifically, for each clipping region, a single point cloud is constructed by texture projection of a reconstructed point cloud and optimized rendering under the first rendering conditions. The point cloud within the clipping region is first projected onto a texture plane using the main texture source to generate a preliminary color and texture mapping. Optimized rendering is then applied according to the predetermined explicit and implicit rendering conditions: for example, edge sharpening and color equalization are strengthened in building facade areas, high-frequency noise is suppressed and smoothness enhanced in water areas, and texture detail is preserved with color perturbation applied in vegetation areas, giving a semantically consistent rendering output.

[0126] Subsequently, after the single point cloud is constructed, a twin expansion is performed within the clipping region. That is, the texture, normal, and color information of the rendered point cloud unit are extended to the adjacent similar voxel units to generate locally continuous point cloud fragments.

[0127] This method allows for the rapid filling of sparse gaps within the clipping region while maintaining spatial continuity and texture consistency. Since the expansion only applies to similar voxel sets, it avoids the propagation of errors across categories or boundaries.
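The similarity-bounded twin expansion described in the two paragraphs above resembles region growing on a voxel grid. The sketch below assumes per-voxel feature vectors (for example, color and normal), 6-connectivity, and a Euclidean similarity threshold measured against the seed voxel; `twin_expand` and `sim_thresh` are hypothetical.

```python
import numpy as np
from collections import deque

def twin_expand(features, seed, sim_thresh):
    """Grow a region of similar voxels from a rendered seed voxel.

    features: (X, Y, Z, C) array of per-voxel feature vectors.
    Returns a boolean mask of voxels that would receive the seed's rendered attributes.
    """
    shape = features.shape[:3]
    ref = features[seed]
    region = np.zeros(shape, dtype=bool)
    region[seed] = True
    queue = deque([seed])
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    while queue:
        x, y, z = queue.popleft()
        for dx, dy, dz in steps:                      # 6-connected neighbors
            n = (x + dx, y + dy, z + dz)
            if all(0 <= n[i] < shape[i] for i in range(3)) and not region[n]:
                if np.linalg.norm(features[n] - ref) <= sim_thresh:
                    region[n] = True
                    queue.append(n)
    return region
```

Because every voxel is compared with the seed rather than with its immediate neighbor, the region cannot creep gradually across a category boundary.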

[0128] Ultimately, the first partition model of the first reconstruction partition is obtained, which not only accurately fits the partition mesh in terms of geometry, but also meets the explicit and implicit rendering conditions guided by semantics in terms of texture representation, ensuring a high degree of consistency between the local reconstruction results and the global model in terms of vision, semantics and geometry.

[0129] Using the above method, every mesh region of the triangular mesh model is reconstructed and rendered, and the resulting partition models are stitched together to form the real-scene 3D model.

[0130] The method for rapid reconstruction of real-scene 3D models from oblique satellite imagery provided in this application has the following technical advantages:

[0131] 1. Construction of a semantically labeled geographic digital base: a geographic digital base containing semantic labels such as ground-feature vector boundaries and elevation data serves as the baseline framework for 3D reconstruction. It provides semantic anchors for subsequent image correction, segmentation, and reconstruction, keeping the model consistent with the real geographic topology and reducing meaningless geometric errors. Parallel correction and registration of multi-view oblique imagery: radiometric and geometric corrections are performed on the multi-view oblique satellite images, which are then spatially aligned and registered against the geographic digital base to generate the reconstructed image set. This removes interference from acquisition angles and lighting conditions, achieves accurate spatial matching of the multi-view images, and provides high-quality input for subsequent 3D reconstruction.

[0132] 2. Four-stage progressive image processor design: The underlying logic is deployed in four stages: image boundary segmentation and completion → dense point cloud reconstruction → lightweight mesh reconstruction → optimized viewpoint mapping. A dedicated image processor is built through sample training. The complex reconstruction task is broken down into a modular process to improve processing efficiency; each stage is optimized in a targeted manner, such as interpolation completion of occluded areas and multi-view fusion of feature points, to ensure the integrity and accuracy of reconstruction.

[0133] Through the foregoing detailed description of the method for rapid reconstruction of real-scene 3D models of oblique satellite imagery, those skilled in the art can clearly understand the method for rapid reconstruction of real-scene 3D models of oblique satellite imagery in this embodiment. As for the apparatus disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and relevant parts can be referred to the method section description.

[0134] The above description of the disclosed embodiments enables those skilled in the art to make or use this application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of this application. Therefore, this application is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for rapid reconstruction of a real-scene 3D model from oblique satellite imagery, characterized in that the method comprises: constructing a geographic digital base with semantic tags and storing it in the reconstruction module of a 3D reconstruction platform; receiving, at the platform data terminal, oblique satellite images of a target area from multiple viewpoints, performing multi-view correction and registration in parallel to determine a reconstruction image set, and importing the reconstruction image set into the reconstruction module; triggering the image processor built into the reconstruction module to perform 3D reconstruction on the reconstruction image set with the geographic digital base as the foundation, thereby determining the real-scene 3D model, wherein the 3D reconstruction comprises four reconstruction stages: image boundary segmentation and completion, dense point cloud reconstruction, lightweight mesh reconstruction, and optimal-viewpoint texture mapping; and visualizing the real-scene 3D model on the display interface of the 3D reconstruction platform;
wherein the first reconstruction stage comprises: based on the geographic digital base, segmenting the reconstruction image set with the vector boundaries of ground features as the segmentation targets, and adding a segmentation identifier to each reconstruction image; and scanning the reconstruction image set for occlusion and, where occlusion exists, completing the occluded regions by interpolation to determine a valid image set, the interpolation being optimized through semantic-tag guidance and by reference to viewpoint-registered images;
the second reconstruction stage comprises: extracting feature points from the reconstruction image set, the feature points including at least corner points and edge points and being invariant to scale and rotation; performing image feature extraction and multi-view fusion on the feature points to determine fused point cloud features; determining the 3D point cloud distribution by multi-view triangulation; and determining a densely reconstructed point cloud from the fused point cloud features and the 3D point cloud distribution;
the third reconstruction stage comprises: performing spatial recursion on the 3D point cloud distribution and the elevation data to determine a triangular mesh division, with a preset gradient difference serving as the height measure; and fusing the densely reconstructed point cloud with the triangular mesh to determine a triangular mesh model;
the fourth reconstruction stage comprises: taking a first mesh region of the triangular mesh model, which may be any triangular mesh region, as a first reconstruction partition; performing multi-view partitioning and localization for the first reconstruction partition within the valid image set to select the main texture source under the optimal viewpoint; and projecting the main texture source of the first reconstruction partition and applying semantic-tag-based optimized rendering to determine a first partition model.
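The second reconstruction stage of claim 1 obtains the 3D point cloud distribution by multi-view triangulation, but the patent does not say which triangulation algorithm is used. The sketch below assumes that each registered view has a known 3x4 projection matrix and applies the standard linear (DLT) formulation; `triangulate_dlt` is a hypothetical helper name, not part of the patent.

```python
import numpy as np


def triangulate_dlt(projections, points_2d):
    """Triangulate one 3D point seen in two or more views (linear DLT).

    projections: list of 3x4 camera projection matrices, one per view
    points_2d:   list of (x, y) observations of the same feature point
    """
    rows = []
    for P, (x, y) in zip(projections, points_2d):
        P = np.asarray(P, dtype=float)
        # Each observation gives two linear constraints on the homogeneous point X:
        # x * (P[2] @ X) - P[0] @ X = 0 and y * (P[2] @ X) - P[1] @ X = 0
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    A = np.vstack(rows)
    # Least-squares null vector: right singular vector of the smallest singular value
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # convert from homogeneous to Euclidean coordinates


if __name__ == "__main__":
    # Two synthetic cameras: one at the origin, one shifted 1 unit along x
    K = np.array([[1000.0, 0.0, 500.0], [0.0, 1000.0, 500.0], [0.0, 0.0, 1.0]])
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
    X_true = np.array([0.2, -0.1, 5.0, 1.0])
    obs = []
    for P in (P1, P2):
        p = P @ X_true
        obs.append((p[0] / p[2], p[1] / p[2]))
    print(triangulate_dlt([P1, P2], obs))
```

With noise-free observations the DLT solution recovers the point exactly; with real matched corner and edge points, the linear estimate would normally serve as an initial value for a reprojection-error refinement.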

2. The method for rapid reconstruction of a real-scene 3D model from oblique satellite imagery according to claim 1, characterized in that constructing the geographic digital base with semantic tags comprises: determining the semantic tags, the elements of which include at least the vector boundaries and elevation data of ground features; and constructing a lightweight geographic topology for the target area and introducing the semantic tags to identify key topological locations, thereby obtaining the geographic digital base.
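Claim 2 only states that each semantic tag carries at least a vector boundary and elevation data, and that key topological locations are identified within a lightweight topology. The sketch below is one hypothetical reading: `SemanticTag` and `GeoDigitalBase` are illustrative names, and treating vertices shared by two or more ground-feature boundaries as the key topological locations is an assumption, not the patent's definition.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class SemanticTag:
    # Hypothetical layout: the claim only requires a vector boundary and elevation data
    feature_id: str
    category: str
    boundary: list      # closed polygon as [(x, y), ...] in map coordinates
    elevation: float    # e.g. representative height of the feature in metres


@dataclass
class GeoDigitalBase:
    tags: dict = field(default_factory=dict)
    # Lightweight topology: boundary vertex -> ids of the features passing through it
    vertex_index: dict = field(default_factory=lambda: defaultdict(set))

    def add(self, tag: SemanticTag) -> None:
        self.tags[tag.feature_id] = tag
        for vertex in tag.boundary:
            self.vertex_index[vertex].add(tag.feature_id)

    def key_locations(self, min_features: int = 2) -> dict:
        """Vertices shared by at least `min_features` feature boundaries."""
        return {
            vertex: sorted(ids)
            for vertex, ids in self.vertex_index.items()
            if len(ids) >= min_features
        }


if __name__ == "__main__":
    base = GeoDigitalBase()
    base.add(SemanticTag("b1", "building", [(0, 0), (10, 0), (10, 10), (0, 10)], 12.0))
    base.add(SemanticTag("r1", "road", [(10, 0), (20, 0), (20, 10), (10, 10)], 0.5))
    print(base.key_locations())
```

Storing only a vertex-to-feature index keeps the topology small, which is one way to read "lightweight"; a production system would more likely use a spatial database or a planar-graph library.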

3. The method for rapid reconstruction of a real-scene 3D model from oblique satellite imagery according to claim 2, characterized in that receiving oblique satellite images of the target area from multiple viewpoints, performing multi-view correction and registration in parallel, and determining the reconstruction image set comprises: receiving the oblique satellite images and applying radiometric and geometric correction to perform prior verification and correction on the oblique satellite image of each viewpoint, thereby determining corrected oblique satellite images; and, using feature point matching as the registration method and the geographic digital base as the reference, performing spatial phase alignment registration on the corrected oblique satellite images to determine the reconstruction image set.
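Claim 3 names radiometric and geometric correction followed by feature-point registration against the geographic digital base, without giving formulas. As a minimal sketch, assume a linear gain/offset radiometric normalization and a 2D affine model fitted by least squares to matched points; `normalize_radiometry`, `estimate_affine` and `apply_affine` are hypothetical helpers.

```python
import numpy as np


def normalize_radiometry(image, reference):
    """Linear gain/offset so `image` matches the mean and std of `reference`."""
    image = np.asarray(image, dtype=float)
    reference = np.asarray(reference, dtype=float)
    gain = reference.std() / max(image.std(), 1e-12)  # guard against flat images
    return (image - image.mean()) * gain + reference.mean()


def estimate_affine(src_pts, dst_pts):
    """Least-squares 2D affine transform mapping matched src points onto dst points.

    src_pts: (N, 2) feature points in the corrected oblique image
    dst_pts: (N, 2) matching points in the geographic digital base, N >= 3
    Returns a 2x3 matrix M such that dst ~= M @ [x, y, 1].
    """
    src = np.asarray(src_pts, dtype=float)
    dst = np.asarray(dst_pts, dtype=float)
    design = np.hstack([src, np.ones((len(src), 1))])       # rows [x, y, 1]
    params, *_ = np.linalg.lstsq(design, dst, rcond=None)   # shape (3, 2)
    return params.T


def apply_affine(M, pts):
    """Map (N, 2) points through the 2x3 affine matrix M."""
    pts = np.asarray(pts, dtype=float)
    return pts @ M[:, :2].T + M[:, 2]
```

With real matches, outliers would normally be rejected first (for example with RANSAC), and a rigorous satellite sensor model may replace the affine approximation for strongly oblique views.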

4. The method for rapid reconstruction of a real-scene 3D model from oblique satellite imagery according to claim 1, characterized in that, before the image processor built into the reconstruction module is triggered, the image processor is constructed by: dividing the reconstruction process into four stages, namely image boundary segmentation and completion as the first reconstruction stage, dense point cloud reconstruction as the second reconstruction stage, lightweight mesh reconstruction as the third reconstruction stage, and optimal-viewpoint texture mapping as the fourth reconstruction stage; and deploying the underlying logic of the first to fourth reconstruction stages and training the image processor on samples until convergence.
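Claim 4 fixes the order of the four reconstruction stages and states that the image processor is trained on samples until convergence. The sketch below shows only that control flow; the stage names, the dictionary passed between stages, and the loss-change convergence test in `train_until_converged` are assumptions for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Stage names follow claim 4; the functions plugged into them are hypothetical.
STAGE_ORDER: Tuple[str, ...] = (
    "boundary_segmentation_and_completion",  # first reconstruction stage
    "dense_point_cloud_reconstruction",      # second reconstruction stage
    "lightweight_mesh_reconstruction",       # third reconstruction stage
    "optimal_viewpoint_texture_mapping",     # fourth reconstruction stage
)

StageFn = Callable[[Dict], Dict]


class ImageProcessor:
    """Runs the deployed stage logic strictly in the claimed order."""

    def __init__(self) -> None:
        self._stages: Dict[str, StageFn] = {}

    def register(self, name: str, fn: StageFn) -> None:
        if name not in STAGE_ORDER:
            raise ValueError(f"unknown stage: {name}")
        self._stages[name] = fn

    def run(self, base, images: List) -> Dict:
        missing = [n for n in STAGE_ORDER if n not in self._stages]
        if missing:
            raise RuntimeError(f"stages not deployed: {missing}")
        data: Dict = {"base": base, "images": images}
        for name in STAGE_ORDER:
            data = self._stages[name](data)  # each stage consumes the previous output
        return data


def train_until_converged(train_step: Callable[[], float], tol: float = 1e-4,
                          max_epochs: int = 1000):
    """Call train_step() until the loss changes by less than tol between epochs."""
    prev = float("inf")
    for epoch in range(1, max_epochs + 1):
        loss = train_step()
        if abs(prev - loss) < tol:
            return epoch, loss
        prev = loss
    return max_epochs, prev
```

Keeping the stage order in one tuple makes it impossible to run, for example, mesh reconstruction before the dense point cloud exists, which mirrors the dependency chain in claim 1.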

5. The method for rapid reconstruction of a real-scene 3D model from oblique satellite imagery according to claim 4, characterized in that: the logic of the first reconstruction stage is segmentation along the vector boundaries of ground features combined with interpolation completion of occluded areas; the logic of the second reconstruction stage is determining the 3D distribution through multi-view feature point fusion; the logic of the third reconstruction stage is incremental Poisson reconstruction; and the logic of the fourth reconstruction stage is dynamic selection of the main texture source from the optimal viewpoint combined with semantically guided voxel rendering.
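Claim 5 pairs ground-feature segmentation with interpolation completion of occluded areas, and claim 1 states that the interpolation is guided by semantic tags and by viewpoint-registered images. The patent does not name the interpolation method; the sketch below swaps in inverse-distance weighting (IDW) restricted to visible pixels of the same ground feature, with an already-registered image from another viewpoint as a fallback. The function name and parameters are hypothetical.

```python
import numpy as np


def fill_occlusion_idw(image, occluded, labels, reference=None, power=2.0, k=8):
    """Fill occluded pixels by semantic-guided inverse-distance weighting.

    image:     (H, W) array of pixel values
    occluded:  (H, W) bool mask, True where the pixel is occluded
    labels:    (H, W) int array of ground-feature ids from the segmentation step
    reference: optional (H, W) image from another viewpoint, registered to `image`
    """
    filled = np.asarray(image, dtype=float).copy()
    occluded = np.asarray(occluded, dtype=bool)
    labels = np.asarray(labels)
    for r, c in zip(*np.nonzero(occluded)):
        # Donors: visible pixels of the same ground feature (the semantic guidance)
        rr, cc = np.nonzero(~occluded & (labels == labels[r, c]))
        if rr.size == 0:
            # No same-feature donor in this view: use the registered image if given
            if reference is not None:
                filled[r, c] = float(np.asarray(reference)[r, c])
            continue
        dist = np.hypot(rr - r, cc - c)          # always >= 1, donors are visible
        nearest = np.argsort(dist)[:k]           # k nearest donors
        weights = 1.0 / dist[nearest] ** power
        filled[r, c] = np.sum(weights * filled[rr[nearest], cc[nearest]]) / np.sum(weights)
    return filled
```

Restricting donors to the same label keeps, for instance, road texture from bleeding into an occluded rooftop; the per-pixel donor search is quadratic and is meant for illustration, not for full satellite scenes.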

6. The method for rapid reconstruction of a real-scene 3D model from oblique satellite imagery according to claim 1, characterized in that projecting the main texture source and applying semantic-tag-based optimized rendering to determine the first partition model comprises: determining, based on the semantic tags, first rendering conditions for the first reconstruction partition under both explicit and implicit representations; performing voxel clipping on the main texture source, with voxel similarity as the clipping criterion; for each clipping region, constructing a single point cloud from the texture projection of the reconstructed point cloud and rendering it with optimization under the first rendering conditions; and performing twin expansion within the clipping region to determine the first partition model.
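Claim 6 clips the main texture source into voxels with voxel similarity as the clipping criterion. One possible interpretation, sketched below under that assumption, is to bucket colored points into a voxel grid and grow regions across face-adjacent voxels whose mean colors are within a threshold; `voxelize` and `clip_by_similarity` are hypothetical names, and the twin-expansion and rendering steps are not modeled.

```python
import math
from collections import defaultdict, deque


def voxelize(points, colors, voxel_size):
    """Bucket (x, y, z) points into a grid; return {voxel_index: mean (r, g, b)}."""
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0])
    for (x, y, z), (r, g, b) in zip(points, colors):
        key = (math.floor(x / voxel_size), math.floor(y / voxel_size),
               math.floor(z / voxel_size))
        acc = sums[key]
        acc[0] += r
        acc[1] += g
        acc[2] += b
        acc[3] += 1
    return {k: (a[0] / a[3], a[1] / a[3], a[2] / a[3]) for k, a in sums.items()}


def clip_by_similarity(voxels, max_color_dist):
    """Split occupied voxels into clipping regions by breadth-first region growing.

    A face-adjacent voxel joins a region when its mean color is within
    max_color_dist of the voxel it is reached from.
    """
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    region_of = {}
    regions = []
    for seed in voxels:
        if seed in region_of:
            continue
        region = [seed]
        region_of[seed] = len(regions)
        queue = deque([seed])
        while queue:
            v = queue.popleft()
            for dx, dy, dz in offsets:
                n = (v[0] + dx, v[1] + dy, v[2] + dz)
                if (n in voxels and n not in region_of
                        and math.dist(voxels[v], voxels[n]) < max_color_dist):
                    region_of[n] = len(regions)
                    region.append(n)
                    queue.append(n)
        regions.append(region)
    return regions
```

Because each neighbor is compared with the voxel it is reached from rather than with the seed, gradual color changes can chain into one region; comparing against a running region mean is a common alternative when that is undesirable.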