An extensible dynamic loading large scene three-dimensional reconstruction method
By performing block-based parallel training on large scene datasets and combining spatial distribution characteristics with dynamic loading and unloading of parameters, the problems of high training cost and low efficiency in large scene 3D reconstruction are solved, achieving efficient 3D reconstruction results.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHANGHAI ARTIFICIAL INTELLIGENCE INNOVATION CENT
- Filing Date
- 2024-11-22
- Publication Date
- 2026-06-23
AI Technical Summary
Existing technologies suffer from high training costs and low efficiency in large-scale 3D reconstruction, especially since a single GPU card cannot effectively load and process large-scale datasets, and existing block training methods fail to fully utilize spatial distribution characteristics.
A scalable dynamic loading method is adopted to uniformly divide the large scene dataset into blocks with overlap between them. The parameters of the non-overlapping parts are loaded and unloaded through GPU streaming. Training is carried out in combination with the spatial distribution characteristics, supporting model parallelism and data parallelism, and realizing hybrid parallelism between blocks.
It reduces the cost of repeated loading, improves training efficiency, and in theory allows scenes of unbounded area to be trained on a single GPU card, making it suitable for large-scene 3D reconstruction.
Figure CN119832143B_ABST
Description
Technical Field
[0001] This invention relates to the technical field of 3D reconstruction, and in particular to a scalable, dynamically loaded, large-scene 3D reconstruction method.
Background Technology
[0002] Thanks to advancements in GPU hardware, a single GPU card can effectively reconstruct objects and small scenes such as indoor spaces in the current field of 3D reconstruction. However, as the scale of the scene expands, particularly to large-scale 3D reconstruction of city-level scenes, the large dataset size, numerous parameters, and poor convergence results present significant challenges. A single GPU card cannot handle such large-scale scenes, making it a persistent major hurdle.
[0003] Currently, two main families of algorithms are used in 3D reconstruction: Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). NeRF represents the scene with a fully connected deep network that maps an input 5D coordinate (spatial position (x, y, z) and viewing direction (θ, φ)) to the volume density and view-dependent emitted radiance at that spatial location. It synthesizes views by querying 5D coordinates along camera rays and uses classic volume rendering techniques to project the output colors and densities onto the image. 3DGS, on the other hand, starts from the sparse points generated during camera calibration and represents the scene with 3D Gaussians. This preserves the desirable properties of continuous volumetric radiance fields for scene optimization while avoiding unnecessary computation in empty space. During training, it controls the density of the 3D Gaussians and optimizes their anisotropic covariance to achieve an accurate scene representation. In addition, 3DGS employs a fast visibility-aware rendering algorithm that supports anisotropic splatting, accelerating training and allowing real-time rendering.
[0004] Block-based training schemes such as MegaNeRF, BlockNeRF, VastGS, and HierarchicalGS are representative works of independent training: they divide the scene into blocks and train the blocks independently and in parallel. NeRF-XL, on the other hand, adopts a joint training approach, allocating NeRF parameters to disjoint regions and modifying the algorithm so that they are trained jointly on multiple GPUs. GrendelGS, leveraging the characteristics of the Gaussian splatting algorithm, bypasses the concept of block-based training by distributing the Gaussians that store the model parameters randomly and evenly across multiple GPUs for joint training. ZeRO-Offload and a large model training method based on an improved ZeRO-Offload technique (CN117992220A) apply the idea of dynamic loading to the training of large language models, allowing a single GPU to train more parameters by periodically loading and unloading parameters. Specifically, half-precision model parameters are stored on the GPU, while half-precision gradients, full-precision parameters, and all optimizer states are stored on the CPU. During training, the loss is computed on the GPU in the forward pass; during backpropagation, the gradients are offloaded to CPU memory. The full-precision parameters and optimizer states are then updated directly on the CPU, and the updated full-precision parameters are copied from CPU memory back to GPU memory.
[0005] Existing technologies can employ independent training, dividing the scene into blocks that are trained independently and in parallel. However, techniques like MegaNeRF, BlockNeRF, VastGS, and HierarchicalGS are not optimized for single-GPU training and do not exploit the optimization opportunities that arise when blocks are trained one after another. As a result, every load transfers a complete block, and the cost of repeated loading is high. Joint training methods like GrendelGS and NeRF-XL distribute parameters across multiple GPUs and require global synchronization of parameter information during computation, which introduces additional global message synchronization overhead. Furthermore, ZeRO-Offload, applied to large language models, splits the entire model and updates the parameters as a whole during training. In 3D reconstruction scenarios, this splitting approach takes no account of spatial distribution, making it less efficient than block-based methods.
Summary of the Invention
[0006] The purpose of this invention is to provide a scalable, dynamically loaded large-scene 3D reconstruction method to reduce the cost of repeated loading while improving training efficiency.
[0007] The objective of this invention can be achieved through the following technical solutions:
[0008] A scalable, dynamically loaded large-scene 3D reconstruction method, comprising the following steps:
[0009] S1. Obtain the large-scale scene dataset to be trained, and construct the entire training scene based on the dataset distribution;
[0010] S2. Divide the training scene evenly into blocks, and divide the large scene dataset into the corresponding blocks, with overlapping data and parameters between blocks. Select one block as the current block and load the parameters of this first current block;
[0011] S3. Train on the dataset of the current block on one GPU stream, and start a GPU stream responsible for preloading to load the parameters of the next block that do not intersect with the current block;
[0012] S4. After the current block is trained, unload the parameters that belong to the current block but not to the next block. Keep the overlapping parts of the parameters of the current block and the next block unchanged. Merge the non-overlapping parts of the parameters of the next block and the overlapping parts of the current block. Use the merged part as the parameters of the next block after it is loaded. Then, use the next block as the new current block and return to S3. Repeat this process until all blocks are traversed. Then repeat S3 and S4 again until training ends and the reconstructed scene is obtained.
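The bookkeeping in steps S2 to S4 can be illustrated with a minimal Python sketch. It is not the patented implementation: parameters are modelled as integer IDs, training is a placeholder comment, and the names `train_round` and `block_params` are hypothetical. It only shows which parameters are loaded, kept, and unloaded at each block transition.

```python
from typing import Dict, List, Set


def train_round(order: List[str], block_params: Dict[str, Set[int]]) -> List[dict]:
    """Simulate one pass of S3/S4 over the blocks listed in `order`.

    Returns a log of the parameter IDs loaded, kept, and unloaded
    at each transition from the current block to the next block.
    """
    resident = set(block_params[order[0]])      # S2: load the first block in full
    log = []
    for cur, nxt in zip(order, order[1:]):
        cur_p, nxt_p = block_params[cur], block_params[nxt]
        preload = nxt_p - cur_p                 # S3: fetched while `cur` is training
        # ... training on block `cur` would run here ...
        unload = cur_p - nxt_p                  # S4: leaves GPU memory
        keep = cur_p & nxt_p                    # S4: overlap stays where it is
        resident = (resident - unload) | preload  # S4: merge into the next block
        assert resident == nxt_p                # the next block is now fully resident
        log.append({"from": cur, "to": nxt,
                    "load": preload, "keep": keep, "unload": unload})
    return log


# Three blocks in a row, each sharing two parameters with its neighbour.
blocks = {"A": set(range(0, 6)), "B": set(range(4, 10)), "C": set(range(8, 14))}
for step in train_round(["A", "B", "C"], blocks):
    print(step["from"], "->", step["to"],
          "load", sorted(step["load"]), "unload", sorted(step["unload"]))
# A -> B load [6, 7, 8, 9] unload [0, 1, 2, 3]
# B -> C load [10, 11, 12, 13] unload [4, 5, 6, 7]
```

In this toy example only four of the six parameters of each block cross the host/GPU boundary at a transition; the two shared parameters never move.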
[0013] Furthermore, the parameters of the part of the next block that does not intersect with the current block are parameters that belong to the next block but not to the current block.
[0014] Furthermore, the method also includes: treating all blocks as current blocks, and performing steps S3 and S4 in parallel on multiple GPUs for all current blocks.
[0015] Furthermore, the step of executing S3 and S4 in parallel on multiple GPUs for all current blocks specifically involves: distributing all current blocks evenly across multiple GPUs and executing steps S3 and S4 in parallel.
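One possible reading of "distributing all current blocks evenly across multiple GPUs" is sketched below. The text does not say how the split is made; this sketch assumes contiguous chunks of the traversal order, so that each GPU still visits neighbouring blocks and can keep loading only difference sets. The function name `assign_blocks` is hypothetical.

```python
from typing import Dict, List


def assign_blocks(block_ids: List[int], num_gpus: int) -> Dict[int, List[int]]:
    """Split an ordered block list into contiguous chunks, one per GPU.

    Chunk sizes differ by at most one block.
    """
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    base, extra = divmod(len(block_ids), num_gpus)
    assignment: Dict[int, List[int]] = {}
    start = 0
    for gpu in range(num_gpus):
        size = base + (1 if gpu < extra else 0)   # first `extra` GPUs get one more
        assignment[gpu] = block_ids[start:start + size]
        start += size
    return assignment


print(assign_blocks(list(range(10)), 4))
# {0: [0, 1, 2], 1: [3, 4, 5], 2: [6, 7], 3: [8, 9]}
```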
[0016] Furthermore, repeat steps S3 and S4 until the training is complete, specifically:
[0017] After traversing all blocks, increment the iteration count by 1, and repeat S3 and S4 until the iteration count reaches the threshold.
[0018] Furthermore, all blocks are traversed according to the route.
[0019] Furthermore, the route is zigzag.
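The exact shape of the zigzag route is not specified here. A minimal sketch, assuming the blocks form a rectangular grid and the route is a row-by-row boustrophedon (left to right on even rows, right to left on odd rows), could look as follows. The name `zigzag_route` is hypothetical.

```python
from typing import List, Tuple


def zigzag_route(rows: int, cols: int) -> List[Tuple[int, int]]:
    """Return (row, col) block indices in boustrophedon order."""
    route: List[Tuple[int, int]] = []
    for r in range(rows):
        # Reverse the column direction on every other row.
        columns = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        route.extend((r, c) for c in columns)
    return route


print(zigzag_route(2, 3))
# [(0, 0), (0, 1), (0, 2), (1, 2), (1, 1), (1, 0)]
```

With this ordering every pair of consecutive blocks shares an edge, so the next block always overlaps the current one and the difference set stays small.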
[0020] Furthermore, the large scene dataset consists of scene images captured by a camera.
[0021] Furthermore, the parameters and dataset of the current block are loaded and trained on one GPU stream, while the GPU stream responsible for preloading simultaneously loads the parameters of the next block that do not intersect with the current block.
[0022] Furthermore, constructing the entire training scene based on the dataset distribution specifically involves:
[0023] Record the location of the camera center in a large scene dataset, and construct the entire training scene based on the location distribution of the camera center.
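How the scene is built from the camera-center distribution is not detailed. As a minimal sketch, one could assume that the camera centers are projected onto the ground plane and that the training scene is their axis-aligned bounding box, optionally padded. The function `scene_bounds` and its `margin` parameter are hypothetical.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]


def scene_bounds(camera_centers: Iterable[Point],
                 margin: float = 0.0) -> Tuple[float, float, float, float]:
    """Return (x_min, y_min, x_max, y_max) enclosing all camera centers."""
    centers = list(camera_centers)
    if not centers:
        raise ValueError("at least one camera center is required")
    xs, ys = zip(*centers)
    return (min(xs) - margin, min(ys) - margin,
            max(xs) + margin, max(ys) + margin)


print(scene_bounds([(0.0, 0.0), (10.0, 5.0), (3.0, 8.0)]))
# (0.0, 0.0, 10.0, 8.0)
```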
[0024] Compared with the prior art, the present invention has the following beneficial effects:
[0025] The independent training method of this invention avoids global synchronization overhead and allows individual blocks to be optimized independently. Compared with traditional independent training, the dynamic loading provided by this invention is designed for single-card training. Each loading and unloading operation involves only the parameters in the difference set between two blocks, and loading overlaps with computation, resulting in lower overall cost. Theoretically, it can support training and reconstructing scenes of unbounded area on a single GPU card. Furthermore, compared with dynamic loading methods for large language models, the dynamic loading method of this invention targets 3D scenes and is designed around their spatial distribution characteristics, resulting in higher training efficiency.
Attached Figure Description
[0026] Figure 1 is a flowchart of the present invention;
[0027] Figure 2 is a schematic diagram of the dynamic loading three-dimensional reconstruction of the present invention.
Detailed Implementation
[0028] The present invention will now be described in detail with reference to the accompanying drawings and specific embodiments. These embodiments are based on the technical solution of the present invention and provide detailed implementation methods and specific operating procedures. However, the scope of protection of the present invention is not limited to the following embodiments.
[0029] Benefiting from the advancements in GPU hardware, a single GPU card can effectively reconstruct objects and small scenes such as indoor spaces in the current field of 3D reconstruction. However, as the scale of scenes expands, particularly to large-scale 3D reconstruction of cities, the large dataset size, numerous parameters, and poor convergence results present significant challenges. A single GPU card cannot handle such large-scale scenes, making this a major hurdle. This invention aims to propose a scalable, dynamically loaded parameter training method to address the issues of high training costs and poor training performance in large-scale 3D reconstruction.
[0030] This invention proposes a scalable, dynamically loaded, large-scene 3D reconstruction method. As shown in Figure 1, the method includes the following steps:
[0031] S1. Obtain the large-scale scene dataset to be trained, and construct the entire training scene based on the dataset distribution;
[0032] S2. Divide the training scene into uniform blocks, and divide the large scene dataset into the corresponding blocks, with overlapping data and parameters between blocks. Select one block as the current block and load the parameters of this first current block;
[0033] S3. Train on the dataset of the current block on one GPU stream, and start a GPU stream responsible for preloading to load the parameters of the next block that do not intersect with the current block;
[0034] S4. After the current block is trained, unload the parameters that belong to the current block but not to the next block. Keep the overlapping parts of the parameters of the current block and the next block unchanged. Merge the non-overlapping parts of the parameters of the next block and the overlapping parts of the current block. Use the merged part as the parameters of the next block after it is loaded. Then use the next block as the new current block and return to S3. Repeat S3 and S4 until all blocks are traversed. Then repeat S3 and S4 again until the training ends and the reconstructed scene is obtained.
[0035] This invention divides the entire 3D scene into blocks and trains by unloading and loading individual blocks. A round is considered complete when all blocks have been loaded and trained. Unloading and loading operate only on the non-overlapping parts between blocks. While one GPU stream is training the current block, a second GPU stream preloads the non-overlapping parts of the next block. Based on this idea, this invention further proposes a way to scale from a single GPU to multiple GPUs. Specifically, it achieves hybrid parallelism between blocks (model parallelism plus data parallelism) by setting multiple blocks and multiple starting points, and achieves intra-block data parallelism by training on multiple camera inputs simultaneously.
[0036] In this invention, when loading the dataset required for training, the position of the camera center is recorded, and the distribution of the camera center positions in the dataset constructs the entire training scene.
[0037] When training begins, the parameters and dataset of the current block are first loaded and trained on one GPU stream. At the same time, a GPU stream responsible for preloading is started to load the parameters of the next block that do not intersect with the current training block (the part that belongs to the preloaded block but not to the current training block). Training and preloading overlap and execute simultaneously.
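The overlap of training and preloading can be emulated on the CPU with a background worker. The sketch below is only a stand-in: a single-worker `ThreadPoolExecutor` plays the role of the preloading GPU stream, a plain dictionary `host_store` plays the role of host memory, and `train` is a caller-supplied placeholder. None of these names come from the source, and a real implementation would use GPU streams and asynchronous host-to-device copies instead.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Set


def run_with_prefetch(order: List[str],
                      block_params: Dict[str, Set[int]],
                      host_store: Dict[int, float],
                      train: Callable[[str, Dict[int, float]], None]) -> Dict[int, float]:
    """Train blocks in `order`, prefetching the next block's missing parameters."""
    # Initial full load of the first block.
    resident = {p: host_store[p] for p in block_params[order[0]]}
    with ThreadPoolExecutor(max_workers=1) as preload_stream:
        for i, cur in enumerate(order):
            nxt = order[i + 1] if i + 1 < len(order) else None
            future = None
            if nxt is not None:
                missing = frozenset(block_params[nxt] - block_params[cur])
                # Submitted before training starts, so both run concurrently.
                future = preload_stream.submit(
                    lambda ids=missing: {p: host_store[p] for p in ids})
            train(cur, resident)
            if nxt is not None:
                fetched = future.result()          # wait for the prefetch
                for p in block_params[cur] - block_params[nxt]:
                    host_store[p] = resident.pop(p)  # unload and write back
                resident.update(fetched)           # merge with retained overlap
    return resident


params = {"A": {0, 1, 2}, "B": {2, 3}}
store = {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0}


def add_one(block: str, resident: Dict[int, float]) -> None:
    """Placeholder training step: bump every parameter of the block."""
    for p in params[block]:
        resident[p] += 1.0


print(run_with_prefetch(["A", "B"], params, store, add_one))
# {2: 2.0, 3: 1.0}
```

The prefetch only reads parameters that the current block does not own, and the write-back only touches parameters the next block does not own, so the two never touch the same entries.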
[0038] Once training of a block is complete, the parameters that belong to the current block but not to the next block are unloaded. The parameters shared by the current and next blocks are kept in place. The preloaded parameters of the next block that are disjoint from the current block are merged with the retained overlapping parameters, and training of the next block begins. Simultaneously, the GPU stream responsible for preloading begins prefetching for the block after that.
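A small worked count shows why moving only the difference set is cheaper than reloading whole blocks. The sketch below uses made-up parameter IDs and a hypothetical helper `transfer_counts`; it counts host-to-GPU parameter loads for one round under both strategies and says nothing about real transfer times.

```python
from typing import Dict, List, Set, Tuple


def transfer_counts(order: List[str],
                    block_params: Dict[str, Set[int]]) -> Tuple[int, int]:
    """Return (full_reload, difference_only) parameter loads for one round."""
    # Baseline: every block is loaded in full when it becomes current.
    full_reload = sum(len(block_params[b]) for b in order)
    # Difference loading: first block in full, then only what is new.
    difference_only = len(block_params[order[0]]) + sum(
        len(block_params[nxt] - block_params[cur])
        for cur, nxt in zip(order, order[1:]))
    return full_reload, difference_only


blocks = {"A": set(range(0, 6)), "B": set(range(4, 10)), "C": set(range(8, 14))}
print(transfer_counts(["A", "B", "C"], blocks))
# (18, 14)
```

The saving per transition equals the size of the overlap between the two blocks, so it grows with the overlap width and with the number of blocks in the route.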
[0039] The above steps are repeated until all blocks have been loaded and trained, which marks the end of one round. Training as a whole ends when the number of completed rounds reaches the specified number of training rounds.
[0040] This invention also extends the dynamic loading of the aforementioned single GPU in a parallel manner. All blocks of the scene can be distributed across multiple GPUs for parallel execution, achieving the effect of model parallelism combined with data parallelism. Within each block, this invention can further split the dataset to achieve data parallelism, further addressing the problem of low training efficiency for large scenes and large datasets.
[0041] This invention has been validated using both the NeRF and 3DGS algorithms. Modeling of a 4K scene covering 15 square kilometers was achieved using the NeRF algorithm, and modeling of a scene covering 25 square kilometers was achieved using the 3DGS algorithm. As shown in Figure 2, the entire scene is divided into blocks, and the dataset is assigned to the corresponding sub-blocks according to the locations of the camera centers. Training traverses the blocks in a zigzag order, dynamically loading and unloading parameters, and supports data-parallel training.
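Assigning images to overlapping sub-blocks by camera-center location could be sketched as follows. The grid layout, the fixed `overlap` margin added on every side of a block, and the function name `assign_cameras` are assumptions for illustration; the text does not specify how the overlap is sized.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]
Bounds = Tuple[float, float, float, float]   # x_min, y_min, x_max, y_max


def assign_cameras(centers: Sequence[Point], bounds: Bounds,
                   rows: int, cols: int,
                   overlap: float) -> Dict[Tuple[int, int], List[int]]:
    """Map each (row, col) block to the indices of the cameras it contains.

    Each block is its uniform grid cell expanded by `overlap` on every side,
    so a camera near a border is assigned to more than one block.
    """
    x0, y0, x1, y1 = bounds
    bw, bh = (x1 - x0) / cols, (y1 - y0) / rows   # uniform cell size
    blocks: Dict[Tuple[int, int], List[int]] = {
        (r, c): [] for r in range(rows) for c in range(cols)}
    for idx, (x, y) in enumerate(centers):
        for (r, c), members in blocks.items():
            in_x = x0 + c * bw - overlap <= x <= x0 + (c + 1) * bw + overlap
            in_y = y0 + r * bh - overlap <= y <= y0 + (r + 1) * bh + overlap
            if in_x and in_y:
                members.append(idx)
    return blocks


cams = [(2.0, 5.0), (5.0, 5.0), (9.0, 5.0)]
print(assign_cameras(cams, (0.0, 0.0, 10.0, 10.0), rows=1, cols=2, overlap=1.0))
# {(0, 0): [0, 1], (0, 1): [1, 2]}
```

The camera at x = 5.0 sits on the border between the two cells and therefore appears in both blocks, which is the kind of shared data the overlap is meant to provide.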
[0042] The preferred embodiments of the present invention have been described in detail above. It should be understood that those skilled in the art can make numerous modifications and variations based on the concept of the present invention without creative effort. Therefore, all technical solutions that can be obtained by those skilled in the art based on the concept of the present invention through logical analysis, reasoning, or limited experimentation on the basis of existing technology should be within the scope of protection defined by the claims.
Claims
1. A scalable, dynamically loaded large-scene 3D reconstruction method, characterized in that the method includes the following steps:
S1. Obtain the large-scale scene dataset to be trained, and construct the entire training scene based on the dataset distribution;
S2. Divide the training scene evenly into blocks, and divide the large scene dataset into the corresponding blocks, with overlapping data and parameters between blocks. Select one block as the current block and load the parameters of this first current block;
S3. Train on the dataset of the current block on one GPU stream, and start a GPU stream responsible for preloading to load the parameters of the next block that do not intersect with the current block;
S4. After the current block is trained, unload the parameters that belong to the current block but not to the next block. Keep the overlapping part of the parameters of the current block and the next block unchanged. Merge the non-overlapping part of the parameters of the next block with the overlapping part of the current block, and use the merged result as the parameters of the next block once it is loaded. Then use the next block as the new current block and return to S3. Repeat this process until all blocks are traversed. Then repeat S3 and S4 again until training ends and the reconstructed scene is obtained.
The method further includes: treating all blocks as current blocks and executing steps S3 and S4 in parallel on multiple GPUs for all current blocks;
The parameters and dataset of the current block are loaded and trained on one GPU stream, while the GPU stream responsible for preloading simultaneously loads the parameters of the next block that do not intersect with the current block.
2. The scalable, dynamically loaded large-scene 3D reconstruction method according to claim 1, characterized in that the parameters of the next block that do not intersect with the current block are those that belong to the next block but not to the current block.
3. The scalable, dynamically loaded large-scene 3D reconstruction method according to claim 1, characterized in that executing S3 and S4 in parallel on multiple GPUs for all current blocks specifically involves: distributing all current blocks evenly across multiple GPUs and executing steps S3 and S4 in parallel.
4. The scalable, dynamically loaded large-scene 3D reconstruction method according to claim 1, characterized in that repeating steps S3 and S4 until the training is complete specifically involves: after traversing all blocks, incrementing the iteration count by 1, and repeating S3 and S4 until the iteration count reaches the threshold.
5. The scalable, dynamically loaded large-scene 3D reconstruction method according to claim 1, characterized in that all blocks are traversed according to a route.
6. The scalable, dynamically loaded large-scene 3D reconstruction method according to claim 5, characterized in that the route is zigzag-shaped.
7. The scalable, dynamically loaded large-scene 3D reconstruction method according to claim 1, characterized in that the large scene dataset consists of scene images captured by cameras.
8. The scalable, dynamically loaded large-scene 3D reconstruction method according to claim 1, characterized in that constructing the entire training scene based on the dataset distribution specifically involves: recording the locations of the camera centers in the large scene dataset, and constructing the entire training scene based on the location distribution of the camera centers.