A Multi-View 3D Reconstruction Method Based on Video Diffusion Model

By combining a global 3D point cloud map and an explicit 3D geometric memory with a video diffusion model, the geometric consistency problem of 3D scene reconstruction under sparse perspective is solved, and efficient and robust reconstruction of large-scale scenes is achieved.

CN122066871B · Active · Publication Date: 2026-06-30 · SHANGHAI ARTIFICIAL INTELLIGENCE INNOVATION CENT

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: SHANGHAI ARTIFICIAL INTELLIGENCE INNOVATION CENT
Filing Date: 2026-04-21
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing technologies suffer from geometric ambiguity, information loss, and geometric distortion in 3D scene reconstruction under sparse perspectives, making it difficult to effectively utilize sparse observation data and resulting in poor reconstruction performance in large-scale complex scenes.

Method used

A multi-view 3D reconstruction method based on a video diffusion model is adopted. A global 3D point cloud map guides the video diffusion model in generating images. Combined with an explicit 3D geometric memory and a segmented processing strategy, and using an uncompressed 2D variational autoencoder and a context window sparse attention mechanism, the method supplements sparse viewpoints while maintaining geometric consistency.

Benefits of technology

It achieves robust reconstruction of sparse perspectives, ensuring geometric consistency and high quality of reconstruction results in large-scale scenarios, reducing computational complexity and improving reconstruction efficiency.



Abstract

This invention relates to a multi-view 3D reconstruction method based on a video diffusion model, comprising: extracting multi-scale point clouds from an original sparse image set and fusing them into a global 3D point cloud map; and dividing the reconstruction trajectory into multiple segments, each segment containing one or more new viewpoints, and sequentially performing image generation on each segment, including: projecting the corresponding point cloud in the global 3D point cloud map onto the current segment to obtain a rendering index map and a rendering color map, and selecting reference frames from a dynamically acquired view library; inputting the reference frames, the rendering color map, and the visibility mask of the rendering color map into an unordered context video diffusion model, and outputting an image sequence containing complete images of the new viewpoints and reference frames; and extracting the point cloud of the complete images of the new viewpoints to update the global 3D point cloud map to guide the image generation of the next segment. This method solves the problems of poor geometric consistency and insufficient scalability of reconstruction results when the input viewpoints are sparse, unordered, and involve long-distance displacement.


Description

Technical Field

[0001] This invention relates to the field of 3D scene reconstruction technology, and in particular to a multi-view 3D reconstruction method based on a video diffusion model.

Background Technology

[0002] With the rapid development of virtual reality (VR), augmented reality (AR), and film and television special effects, 3D scene reconstruction has become one of the core research tasks in computer vision. Existing techniques such as neural radiance fields (NeRF) and 3D Gaussian splatting can achieve high-fidelity novel view synthesis and meet the requirements of some application scenarios, but their performance depends heavily on dense, continuous multi-view observation data. Specifically, these techniques require as input many densely distributed images of the target scene captured from different angles, and then reconstruct and render the scene in 3D by fusing the multi-view information.

[0003] However, in real-world applications, the scene observation data collected by users is often sparse and irregularly distributed, with significant spans between different observation perspectives, making it difficult to meet the stringent requirements of current mainstream technologies for input data. When observation data is sparse, the 3D reconstruction process inevitably faces serious geometric ambiguity issues, and occluded areas in the scene may experience information loss, leading to artifacts, geometric distortions, and even geometric collapse in the reconstruction results, severely impacting reconstruction quality and subsequent application effectiveness.

[0004] Therefore, how to effectively utilize generative prior knowledge (such as diffusion models) to accurately supplement the missing viewpoint information in sparse observation data, while strictly maintaining the geometric consistency of the reconstructed scene and avoiding geometric distortion, has become a core bottleneck problem that urgently needs to be solved in the field of large-scale 3D scene reconstruction.

[0005] Currently, the relevant technical solutions for 3D scene reconstruction under sparse perspective can be mainly divided into the following two categories:

[0006] The first category is enhancement methods based on image diffusion priors. The core idea is to use the generative capability of a diffusion model to synthesize pseudo ground truth for new viewpoints inside the optimization loop of 3D reconstruction. The pseudo ground truth then guides the training and optimization of a neural radiance field or 3D Gaussian splatting model, compensating for the missing information in sparse observation data and improving the reconstruction.

[0007] The second category is geometry-aware video generation models. These methods use the temporal modeling capability of video diffusion models to keep multiple frames consistent. They also introduce coarse point cloud projections generated by models such as DUSt3R as geometric guidance, attempting to keep the scene geometrically plausible while supplementing missing viewpoint information.

[0008] While the two existing technical solutions mentioned above alleviate some of the problems of 3D reconstruction under sparse viewpoints and improve reconstruction quality to a certain extent, they still have significant limitations and cannot meet the practical reconstruction needs of large-scale, complex scenes. Specifically, they are limited by the capacity of the input reference frames, by defects in how the model's latent space is modeled, and by an inadequate long-range update mechanism in large-scale scenes. When processing observation data of complex scenes that is non-sequential and spans large viewpoint changes, they often have difficulty maintaining geometric consistency and suffer significant long-range cumulative errors.

Summary of the Invention

[0009] To address at least some of the aforementioned problems in the prior art, this invention provides a multi-view 3D reconstruction method based on a video diffusion model, comprising:

[0010] integrating an original sparse image set captured from multiple viewpoints into a dynamically acquired view library, extracting multi-scale point clouds from the original sparse image set, fusing them into a global 3D point cloud map, and storing the map in an explicit 3D geometric memory; and

[0011] dividing a user-specified reconstruction trajectory into multiple segments, each containing one or more new viewpoints, and performing image generation on each segment in turn, including:

[0012] projecting the corresponding point cloud in the global 3D point cloud map onto the current segment to obtain a rendering index map and a rendering color map, and using the rendering index map to select reference frames from the dynamically acquired view library through geometry-aware retrieval;

[0013] inputting the reference frames, the rendering color map, and the visibility mask of the rendering color map into an unordered context video diffusion model, and outputting an image sequence containing the complete images of the new viewpoints and the reference frames; and

[0014] extracting the point cloud of the complete images of the new viewpoints and adding it to the global 3D point cloud map to update the map, the updated global 3D point cloud map then guiding image generation for the next segment.

[0015] Furthermore, a feedforward point map estimation model is used to extract multi-scale point clouds from images in the original sparse image set and fuse them into a global 3D point cloud map.

[0016] Based on the new perspective of the current fragment, the 3D point cloud map in the current explicit 3D geometric memory is projected and rendered to generate a rendering index map and a rendering color map.

[0017] Furthermore, reference frames are selected using the rendering index map: for each image in the dynamically acquired view library, the ratio of the area covered by its effective 3D point cloud after projection onto the new viewpoint to the area of the rendering index map is calculated independently as a coverage score, and images whose coverage score exceeds a threshold are selected as reference frames.

[0018] Furthermore, the reference frame and the rendered color map are arranged into an input image sequence and then fed into the unordered context video diffusion model, where the reference frame is located at the beginning of the input image sequence.

[0019] Furthermore, the encoding and decoding module of the unordered context video diffusion model uses an uncompressed 2D variational autoencoder, which includes an uncompressed encoder and a decoder, wherein the uncompressed encoder is configured to extract latent features from the reference frames and the rendering color map, and the decoder is configured to restore the latent features to images.

[0020] Furthermore, the unordered context video diffusion model includes an uncompressed 2D variational autoencoder, a noise addition module, a noise prediction network, and a denoising module, wherein:

[0021] The uncompressed 2D variational autoencoder extracts latent features from the reference frame and the rendered color map; the noise-adding module adds noise to the latent features to obtain noisy latent features; the noisy latent features are input into the noise prediction network, which predicts noise; the denoising module removes the predicted noise from the noisy latent features and outputs denoised latent variables; the uncompressed 2D variational autoencoder decoder restores the denoised latent variables to the image, resulting in an image sequence containing the complete image from the new perspective and the reference frame.

[0022] Furthermore, the noise prediction network of the unordered context video diffusion model incorporates a context window sparse attention mechanism.

[0023] Furthermore, the unordered context video diffusion model employs a global scene storage mechanism to transform the latent features of the reference frame into key-value cache information and place it at the beginning of the attention sequence.

[0024] Furthermore, the unordered context video diffusion model is trained by a distribution matching distillation method, including: using a teacher model to guide the learning of a student model, and mapping the continuous denoising trajectory of the teacher model to four-step sampling denoising in the student model.

[0025] The present invention also provides a computer-readable storage medium having a computer program stored thereon, the computer program performing the steps of the above method when executed by a processor.

[0026] The present invention has at least the following beneficial effects:

[0027] The three-dimensional reconstruction method of the present invention breaks the strict limitation of the number and order of input viewpoints in traditional methods. It generates a complete image of a new viewpoint through a diffusion model, fills the spatial gaps between sparse viewpoints, and can achieve robust three-dimensional scene restoration of randomly shot and discontinuous viewpoints.

[0028] The three-dimensional reconstruction method of the present invention tightly couples two processes: image generation by a video diffusion model guided by the global three-dimensional point cloud map, and updating of the global three-dimensional point cloud map in the explicit three-dimensional geometric memory with the generated images. Together they form a closed-loop, iterative generation-reconstruction mechanism. This solves the problems of poor geometric consistency and insufficient scalability of reconstruction results when the input viewpoints are extremely sparse, unordered, and separated by long-distance displacement, and eliminates the drift problem of long trajectory generation.

[0029] For long reconstruction trajectories, this invention adopts a segmented processing strategy to divide the long reconstruction trajectory into small segments and perform image generation separately, thereby achieving efficient and coherent reconstruction of long reconstruction trajectories and complex large scenes.

[0030] The encoder of the video diffusion model of the present invention adopts an uncompressed encoder, which maintains a strict one-to-one mapping between latent features and pixel-level physical coordinates, thereby eliminating geometric alignment deviations and texture confusion caused by temporal dimension compression from the source and significantly improving the synthesis quality between viewpoints with a large span.

[0031] This invention introduces a distribution matching distillation technique to train the video diffusion model, compressing the denoising process to four sampling steps and significantly improving inference speed. It also incorporates a context window sparse attention mechanism that can handle variable-length conditions, giving the model strong generalization to different numbers of input viewpoints while supporting rapid, continuous reconstruction of large-scale, long-trajectory scenes.

Attached Figure Description

[0032] To further illustrate the above and other advantages and features of the various embodiments of the present invention, a more specific description of the embodiments of the invention will be presented with reference to the accompanying drawings. It is to be understood that these drawings depict only typical embodiments of the invention and are therefore not intended to limit its scope. In the drawings, identical or corresponding parts will be indicated by identical or similar reference numerals for clarity.

[0033] Figure 1 shows a flowchart of a multi-view 3D reconstruction method based on a video diffusion model according to an embodiment of the present invention.

[0034] Figure 2 shows a system architecture diagram of geometry-aware video diffusion for novel view synthesis according to an embodiment of the present invention.

[0035] Figure 3 shows a schematic diagram of point cloud rendering at a new viewpoint according to an embodiment of the present invention.

[0036] Figure 4 shows a comparison of the point cloud outputs with and without updating in the explicit three-dimensional geometric memory according to an embodiment of the present invention.

Detailed Implementation

[0037] It should be noted that the components in the accompanying drawings may be shown exaggerated for illustrative purposes and may not be to scale.

[0038] In this invention, the various embodiments are merely intended to illustrate the solutions of the invention and should not be construed as limiting.

[0039] In this invention, unless otherwise specified, the quantifiers “a” and “one” do not exclude scenarios involving multiple elements.

[0040] It should also be noted that, in the embodiments of the present invention, only a portion of the parts or components may be shown for clarity and simplicity. However, those skilled in the art will understand that, under the teachings of the present invention, the required parts or components can be added as needed for specific scenarios.

[0041] It should also be noted that within the scope of this invention, the terms "same", "equal", and "equal to" do not mean that the two values are absolutely equal, but allow for a certain reasonable error. In other words, the terms also cover "substantially the same", "substantially equal", and "substantially equal to".

[0042] It should also be noted that in the description of this invention, the terms "center," "longitudinal," "lateral," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," and "outer," etc., indicate the orientation or positional relationship based on the orientation or positional relationship shown in the accompanying drawings. They are used only for the convenience of describing the invention and for simplifying the description, and do not explicitly or implicitly suggest that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation. Therefore, they should not be construed as limitations on the invention. Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance.

[0043] Furthermore, the embodiments of the present invention describe the process steps in a specific order. However, this is only for the convenience of distinguishing each step, and is not a limitation on the order of each step. In different embodiments of the present invention, the order of each step can be adjusted according to the process.

[0044] While existing diffusion models have made some progress in sparse view synthesis, they still have significant limitations in large-scale real-world applications. First, they use input observations inefficiently and inflexibly: existing video generation models typically accept only a very small number of reference frames as conditions, making it difficult to capture the global context and fine details of complex scenes. Second, existing 3D reconstruction architectures rely heavily on temporal causal compression. For non-sequential, randomly sampled inputs with large viewpoint differences, such as handheld shots or internet videos, this compression easily disrupts the precise spatial correspondence between frames, so the generated viewpoint is not strictly aligned with the given pose, leading to geometric distortion and spatial drift. Third, in large-scale scenes (such as long trajectories exceeding 200 frames), existing methods lack an effective geometric memory update mechanism, so errors accumulate rapidly during segmented reconstruction and appear as geometric breaks and visual jumps between adjacent segments. Finally, as the number of reference views increases, the computational cost of attention grows quadratically, creating a bottleneck between high-quality generation and computational efficiency.

[0045] To address the aforementioned shortcomings, this invention proposes a multi-view 3D reconstruction method based on a video diffusion model. By using the global scene storage mechanism of the video diffusion model and an uncompressed 2D variational autoencoder (a latent encoder with temporal compression removed), the method supports any number of discontinuous observation inputs and ensures strict pose alignment and accurate frame-level feature correspondence even under large viewpoint spans and irregular sampling.

[0046] Multi-scale point clouds are extracted from the original sparse image set and fused into a global 3D point cloud map, which is stored in an explicit 3D geometric memory. The corresponding point clouds in the global 3D point cloud map are projected onto a new viewpoint to obtain a rendered color map. This rendered color map is input into a video diffusion model to generate a complete image from the new viewpoint. The point clouds from the complete image from the new viewpoint are extracted and added to the global 3D point cloud map. The global 3D point cloud map in the explicit 3D geometric memory is then updated to guide the next round of image generation. By utilizing an explicit 3D geometric memory and a back-projection mechanism, the generated output is updated with shared geometry in real time, and this back-projection guides the subsequent image generation, effectively suppressing accumulated errors in long-range reconstruction and ensuring seamless transitions in large-scale scenes.
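The generation-reconstruction loop described above can be summarized in a short Python sketch. The callables `estimate_points`, `render`, `retrieve`, and `generate` are hypothetical placeholders for the feedforward point map estimator, the point cloud renderer, the reference frame retrieval, and the video diffusion model; the patent does not define their interfaces. Representing the global map as a plain list, and having `generate` return only the completed new-view images, are simplifying assumptions.

```python
from typing import Any, Callable, List, Sequence, Tuple


def reconstruct(
    initial_images: Sequence[Any],
    segments: Sequence[Sequence[Any]],
    estimate_points: Callable[[Any], List[Any]],
    render: Callable[[List[Any], Any], Any],
    retrieve: Callable[[List[Any], List[Any]], List[Any]],
    generate: Callable[[List[Any], List[Any]], List[Any]],
) -> Tuple[List[Any], List[Any]]:
    """Run the closed loop: render from the map, generate, update the map."""
    view_library = list(initial_images)
    point_map: List[Any] = []                       # stands in for the explicit 3D geometric memory
    for image in view_library:
        point_map.extend(estimate_points(image))    # fuse the initial point clouds

    outputs: List[Any] = []
    for segment in segments:                        # segments are processed in order
        renders = [render(point_map, pose) for pose in segment]
        references = retrieve(view_library, renders)
        new_images = generate(references, renders)  # completed new-view images
        for image in new_images:
            point_map.extend(estimate_points(image))  # update before the next segment
        outputs.extend(new_images)
    return outputs, point_map


# Toy stand-ins: images and poses are integers, and each image yields one "point".
seen_map_sizes = []


def toy_render(point_map, pose):
    seen_map_sizes.append(len(point_map))
    return pose


images, final_map = reconstruct(
    initial_images=[1, 2],
    segments=[[10], [11, 12]],
    estimate_points=lambda image: [image],
    render=toy_render,
    retrieve=lambda library, renders: library[:1],
    generate=lambda references, renders: [100 + pose for pose in renders],
)
print(seen_map_sizes)  # [2, 3, 3]: the second segment renders from the updated map
```

The point of the toy run is only the ordering: the map grows after the first segment, so the second segment is rendered from geometry that already includes the generated view.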

[0047] A video diffusion model is trained using a distribution matching distillation method. This model employs a context window sparse attention mechanism, which significantly reduces computational complexity and greatly improves inference efficiency while maintaining high-fidelity reconstruction accuracy. Ultimately, this achieves a 3D reconstruction solution that can handle sparse and irregular observation images while maintaining high spatial consistency and computational economy in large-scale scenes.

[0048] Figure 1 shows a flowchart of a multi-view 3D reconstruction method based on a video diffusion model according to an embodiment of the present invention. Figure 2 shows a system architecture diagram of geometry-aware video diffusion for novel view synthesis according to an embodiment of the present invention. Figure 3 shows a schematic diagram of point cloud rendering at a new viewpoint according to an embodiment of the present invention. Figure 4 shows a comparison of the point cloud outputs with and without updating in the explicit three-dimensional geometric memory according to an embodiment of the present invention.

[0049] A multi-view 3D reconstruction method based on a video diffusion model includes the following steps:

[0050] Step S1: Preprocess the original sparse image set captured from multiple viewpoints and integrate it into a dynamically acquired view library. Use a feedforward point map estimation model to extract multi-scale point clouds from the original sparse image set and fuse them into a global three-dimensional point cloud map, which is stored in an explicit three-dimensional geometric memory.

[0051] The original sparse image set can contain multiple images that are non-sequential (unordered) and span a large range of viewpoints.

[0052] At the initial stage of the reconstruction task, the system first preprocesses the input raw sparse image set, integrating it into a dynamically acquired view library. Then, it invokes a high-performance feedforward point map estimation model, utilizing depth inference techniques to extract multi-scale point clouds from these initial views (images in the raw sparse image set). These multi-scale point clouds are fused into a 3D point cloud map and stored in an explicit 3D geometric memory, forming the system's spatial foundation. Unlike traditional implicit neural representations, this explicit 3D geometric memory allows the system to directly perform spatial occupancy queries and incremental updates, providing accurate geometric prior guidance for the video diffusion model in subsequent generation stages, ensuring the model is under the correct spatial constraints from the outset.

[0053] Step S2: As shown in Figure 2, the user-specified reconstruction trajectory is divided into multiple segments, each containing one or more new viewpoints, and image generation is performed on each segment in turn.

[0054] A segmented processing strategy is employed to divide a long reconstruction trajectory into smaller segments, and image generation is performed on each segment separately. The reconstruction trajectory is a trajectory composed of multiple new viewpoints.
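A minimal Python sketch of the segmentation step follows. The patent does not state how segment boundaries are chosen, so the fixed segment length and the non-overlapping split used here are assumptions, and `split_trajectory` is a hypothetical helper name.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_trajectory(poses: Sequence[T], segment_len: int) -> List[List[T]]:
    """Split an ordered list of new-view camera poses into consecutive segments."""
    if segment_len < 1:
        raise ValueError("segment_len must be at least 1")
    return [list(poses[i:i + segment_len]) for i in range(0, len(poses), segment_len)]


# Ten new viewpoints (represented by their indices), four per segment.
segments = split_trajectory(list(range(10)), segment_len=4)
print(segments)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```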

[0055] The reconstruction trajectory may differ from the capture trajectory of the original sparse image set.

[0056] Each image in the original sparse image set corresponds to an existing acquisition viewpoint. Users can specify and generate images from other shooting viewpoints (new viewpoints) to fill in the missing observation information between the original sparse viewpoints, thus enabling 3D scene reconstruction.

[0057] After a video shot from multiple viewpoints is input, images are extracted from it. Users can then freely explore the space and reconstruct multiple new viewpoints, which do not need to follow the trajectory of the original acquisition viewpoints. For example, suppose a mobile phone is used to film a room from the doorway while being moved horizontally, capturing images from several viewpoints along a horizontal trajectory. The reconstructed new viewpoints do not need to follow this horizontal trajectory; they can be viewpoints inside the room.

[0058] Step S2.1: Project the corresponding point cloud from the global 3D point cloud map onto the current segment to obtain a rendering index map and a rendering color map. Then use the rendering index map to select reference frames from the dynamically acquired view library through geometry-aware retrieval. The selected reference frames may comprise one or more images.

[0059] To use computing resources efficiently and reduce noise interference in large-scale scenes, this solution uses a precise retrieval strategy based on geometric contribution. As shown in Figure 3, based on the candidate camera pose to be generated (the new viewpoint of the current segment), the 3D point cloud map in the current explicit 3D geometric memory is projected and rendered (that is, the 3D point cloud is projected onto the new viewpoint), generating a rendering index map (visibility index map) and a rendering color map. The color of each pixel in the rendering index map indicates which original image the corresponding point came from. The rendering color map is colored with the original image colors, like a photograph.
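The projection step can be illustrated with a small NumPy sketch. It assumes a pinhole camera with intrinsics `K` and a world-to-camera pose `(R, t)`, one pixel per point, and a nearest-point-wins depth test; none of these details are specified in the patent, and `render_point_cloud` is a hypothetical name. Empty pixels are marked with -1 in the index map, which also yields the visibility mask.

```python
import numpy as np


def render_point_cloud(points, colors, source_ids, K, R, t, height, width):
    """Project world-space points into a pinhole camera, keeping the nearest point per pixel.

    points: (N, 3) world coordinates; colors: (N, 3) RGB values;
    source_ids: (N,) index of the image each point came from;
    K: (3, 3) intrinsics; R, t: world-to-camera rotation and translation.
    Returns (index_map, color_map, visibility_mask).
    """
    cam = points @ R.T + t                      # world -> camera coordinates
    in_front = cam[:, 2] > 1e-6                 # drop points behind the camera
    cam, cols, ids = cam[in_front], colors[in_front], source_ids[in_front]
    z = cam[:, 2]

    uvw = cam @ K.T                             # perspective projection
    u = np.floor(uvw[:, 0] / z).astype(int)
    v = np.floor(uvw[:, 1] / z).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z, cols, ids = u[inside], v[inside], z[inside], cols[inside], ids[inside]

    # Sort by pixel, then by depth, and keep the first (nearest) point of each pixel.
    flat = v * width + u
    order = np.lexsort((z, flat))
    flat_sorted = flat[order]
    first = np.ones(len(flat_sorted), dtype=bool)
    first[1:] = flat_sorted[1:] != flat_sorted[:-1]
    winners = order[first]

    index_flat = np.full(height * width, -1, dtype=int)    # -1 marks empty pixels
    color_flat = np.zeros((height * width, 3), dtype=float)
    index_flat[flat[winners]] = ids[winners]
    color_flat[flat[winners]] = cols[winners]
    index_map = index_flat.reshape(height, width)
    return index_map, color_flat.reshape(height, width, 3), index_map >= 0


# Two points fall on the same pixel; the nearer one (from image 1) wins.
pts = np.array([[0, 0, 2], [0, 0, 1], [1, 0, 1], [0, 0, -1]], dtype=float)
rgb = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]], dtype=float)
src = np.array([0, 1, 2, 3])
K = np.array([[1, 0, 1], [0, 1, 1], [0, 0, 1]], dtype=float)
index_map, color_map, visible = render_point_cloud(
    pts, rgb, src, K, np.eye(3), np.zeros(3), height=3, width=3
)
```

Pixels where `visible` is false are the holes that the diffusion model is later asked to fill.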

[0060] The rendering color map (e.g., Figure 4(c)) is an incomplete image with missing regions. It needs to be completed by the unordered context video diffusion model to generate a complete image of the new viewpoint (e.g., Figure 4(d)).

[0061] Use the rendering index map to select reference frames: for each image in the dynamically acquired view library, independently calculate the ratio of the area covered by its effective 3D point cloud after projection onto the new viewpoint to the area of the rendering index map as the coverage score, and select images whose coverage score exceeds a threshold as reference frames.

[0062] Not all point clouds in the dynamically acquired view library can be projected onto the new viewpoint. Point clouds that can be projected onto the new viewpoint are called effective 3D point clouds.

[0063] If part or all of the content captured by an acquisition camera does not appear in the rendering index map of the new view (that is, the camera did not capture what the new view sees), the point cloud corresponding to that content projects to no pixels in the new view.

[0064] When part or all of the content captured by an acquisition camera does appear in the rendering index map of the new view, the corresponding point cloud projects onto the new view and has corresponding pixels in the rendering index map. The area covered by these pixels is calculated, followed by the ratio of this coverage area to the area of the rendering index map. When the ratio is greater than a threshold, the corresponding image is selected as a reference frame.
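The coverage score can be sketched in a few lines of NumPy. This is one reading of the text, in which each image's coverage is counted directly from the pixels it owns in the rendering index map; the -1 marker for empty pixels, the threshold value, and the name `select_reference_frames` are assumptions.

```python
import numpy as np


def select_reference_frames(index_map, num_images, threshold):
    """Score each library image by the fraction of index-map pixels it covers."""
    owned = index_map[index_map >= 0]                    # ignore empty pixels
    counts = np.bincount(owned, minlength=num_images)    # pixels covered per image
    scores = counts[:num_images] / index_map.size        # coverage area / map area
    selected = [i for i in range(num_images) if scores[i] > threshold]
    return selected, scores


# A 4x4 index map: image 0 covers 8 pixels, image 1 covers 2, image 2 covers none.
index_map = np.array([
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, -1, -1],
    [-1, -1, -1, -1],
])
selected, scores = select_reference_frames(index_map, num_images=3, threshold=0.25)
print(selected)  # [0]
print(scores)    # coverage scores 0.5, 0.125 and 0.0
```

Image 1 does contribute some pixels, but at 12.5% coverage it falls below the assumed threshold and is not passed to the diffusion model.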

[0065] The system can intelligently identify and remove invalid reference frames (images with low relevance) that are obscured by walls or have excessively large viewpoint shifts, and select reference frames (images with high relevance). This method ensures that the video diffusion model only receives the view with the highest geometric value for the currently generated segment as a conditional constraint, significantly enhancing the physical realism and multi-viewpoint alignment of the synthesized view under complex occlusion relationships.

[0066] Step S2.2: As shown in Figure 2, the reference frames, the rendering color map, and the visibility mask of the rendering color map are input into the unordered context video diffusion model to generate complete images of the new viewpoints. The output is an image sequence containing the complete images of the new viewpoints and the reference frames.

[0067] The reference frame and the rendered color map are arranged into an input image sequence and then fed into the unordered context video diffusion model, where the reference frame is located at the beginning of the input image sequence.

[0068] The reference frame serves as the basis for supplementing missing information in the rendered color map.

[0069] To handle discontinuous, unordered inputs with large parallax, this invention redesigns the underlying architecture of the video diffusion model so that it maintains geometric consistency.

[0070] As shown in Figure 2(B), a global scene storage mechanism is introduced. It transforms the latent features of the retrieved reference frames into key-value cache information and places this information at the beginning of the attention sequence. This allows the target frame (the image of the new viewpoint) to skip time steps during generation and directly retrieve global spatial information for feature decoupling.

[0071] The encoding and decoding module of the unordered context video diffusion model uses an uncompressed 2D variational autoencoder. The uncompressed 2D variational autoencoder consists of an uncompressed encoder and a decoder; the uncompressed encoder extracts latent features from the reference frame and the rendered color map, and outputs a feature map with the same height and width as the corresponding reference frame and rendered color map.

[0072] To further improve accuracy, this invention abandons the temporal pooling operation of a compressed variational autoencoder, which blurs features, and adopts an uncompressed latent space encoding strategy: an uncompressed 2D variational autoencoder is applied independently to each frame. This design ensures a strict one-to-one mapping between latent features and pixel-level physical coordinates, eliminating at the source the geometric alignment deviations and texture confusion caused by temporal compression.

[0073] Because the acquisition viewpoints are discrete and temporal continuity is weak, the temporal compression of a traditional video variational autoencoder is a poor fit. The temporal compression is therefore removed from the encoder of the variational autoencoder, and an uncompressed encoder is adopted.

[0074] The unordered context video diffusion model includes an uncompressed 2D variational autoencoder, a noise addition module, a noise prediction network, and a denoising module, wherein:

[0075] An uncompressed 2D variational autoencoder extracts latent features from a reference frame and a rendered color map; a noise-adding module adds noise to the latent features to obtain noisy latent features; the noisy latent features are input into a noise prediction network, which predicts noise; a denoising module removes the predicted noise from the noisy latent features and outputs denoised latent variables; the uncompressed 2D variational autoencoder decoder restores the denoised latent variables to the image, resulting in an image sequence containing a complete image from the new perspective and the reference frame.
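The noise-adding and denoising modules can be illustrated with the standard DDPM noise-prediction formulation. The patent does not state which noise schedule or parameterization it uses, so the formulas below are an assumption, as are the latent tensor shape and the function names. The true noise stands in for the output of the noise prediction network to show that removing a correctly predicted noise recovers the clean latents.

```python
import numpy as np


def add_noise(x0, eps, alpha_bar_t):
    """Forward step: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps


def remove_predicted_noise(x_t, eps_pred, alpha_bar_t):
    """Invert the forward step using the predicted noise to estimate x0."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)


rng = np.random.default_rng(0)
latents = rng.normal(size=(5, 4, 8, 8))   # (frames, channels, height, width), hypothetical sizes
eps = rng.normal(size=latents.shape)

noisy = add_noise(latents, eps, alpha_bar_t=0.3)
# An oracle prediction (the true noise) replaces the noise prediction network here.
recovered = remove_predicted_noise(noisy, eps, alpha_bar_t=0.3)
print(np.allclose(recovered, latents))  # True
```

In the described pipeline, the recovered latents would then be passed to the decoder of the uncompressed 2D variational autoencoder, one frame at a time.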

[0076] To process continuous trajectories of hundreds of frames in large-scale scenes, a context window sparse attention mechanism is introduced into the underlying computational architecture of the video diffusion model. As shown in Figure 2(c), the video diffusion model does not compute the attention map over the entire sequence. Instead, it restricts the attention output to a weighted aggregation of two sources of information: first, the local context, in which the current frame interacts only with the frames inside a fixed window immediately before and after it, ensuring the temporal smoothness of the video; and second, the global geometric anchor, in which every frame performs cross-attention with the reference frame stored at the beginning of the sequence, maintaining spatial consistency. This sparse design reduces the computational complexity from quadratic to linear in the sequence length, significantly reducing memory overhead while keeping the generation of long trajectories robust.
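The quadratic-to-linear claim can be checked by counting frame-to-frame attention pairs. This is a rough sketch at frame granularity only; it assumes the reference frames occupy the first `num_refs` positions and that the window half-width is `k` (both names are illustrative), and it ignores the per-frame token count, which multiplies both totals equally.

```python
def attended_frames(j, num_frames, num_refs, k):
    # Frames visible to frame j: the reference frames plus the window [j - k, j + k].
    window = range(max(0, j - k), min(num_frames - 1, j + k) + 1)
    return set(range(num_refs)) | set(window)


def sparse_pairs(num_frames, num_refs, k):
    # Total number of (query frame, key frame) pairs under context window sparse attention.
    return sum(
        len(attended_frames(j, num_frames, num_refs, k))
        for j in range(num_frames)
    )


def full_pairs(num_frames):
    # Dense attention: every frame attends to every frame.
    return num_frames * num_frames


for n in (100, 200, 400):
    print(n, full_pairs(n), sparse_pairs(n, num_refs=1, k=2))
```

Each frame attends to at most `2 * k + 1 + num_refs` frames, so doubling the sequence length roughly doubles the sparse count while the dense count quadruples.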

[0077] In the specific computational logic, the context window sparse attention redistributes feature weights through a mask matrix: for frame j, the latent features of all frames other than the keyframe and the nearest-neighbor frames within [j-k, j+k] are masked out and excluded from the computation. As a result, the feature query vector Q of the current frame j (obtained by transforming the latent features of frame j) no longer searches the key-value pairs K and V of the entire sequence, but aggregates over a subset consisting of two parts. First, a sliding window locks onto the nearest-neighbor frames whose indices lie within [j-k, j+k]; extracting these local temporal features captures the fine-grained motion trend and edge continuity of objects, which concentrates the attention weights in the region near the diagonal of the context window sparse attention map (the gray semi-transparent part of Figure 2(c)). Second, the system forcibly activates a long-range attention path to the first reference frame of the sequence, associating the latent features of frame j with those of the reference frame that represents the global geometric skeleton. By concatenating the key-value tensors of the local window and the global anchor (reference frame), this dual mapping filters information selectively: at the microscale the model suppresses inter-frame flicker, which degrades the visual experience, and at the macroscale the anchor constraint prevents long-term drift of the object structure.
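The mask matrix described above can be sketched at frame granularity. The function names, the assumption that the reference frames sit at indices below `num_refs`, and the use of a masked softmax over per-frame scores are illustrative simplifications; a real model would apply the same pattern to blocks of tokens inside the noise prediction network.

```python
import math


def context_window_mask(num_frames, num_refs, k):
    # mask[j][i] is True when frame j may attend to frame i:
    # either i is a reference frame (global anchor) or i lies within [j - k, j + k].
    return [
        [(i < num_refs) or (abs(i - j) <= k) for i in range(num_frames)]
        for j in range(num_frames)
    ]


def masked_softmax(scores, allowed):
    # Softmax restricted to allowed positions; masked positions receive weight 0.
    peak = max(s for s, a in zip(scores, allowed) if a)
    exps = [math.exp(s - peak) if a else 0.0 for s, a in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


mask = context_window_mask(num_frames=8, num_refs=1, k=1)

# Attention weights of frame 5 when all raw scores are equal:
# only the reference frame 0 and the window frames 4, 5, 6 share the weight.
weights = masked_softmax([0.0] * 8, mask[5])
```

The True entries of each row form a band around the diagonal plus a leading column for the anchor, which matches the description of the sparse attention map.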

[0078] Specifically, the context window sparse attention mechanism is embedded in the noise prediction network. This mechanism is part of the core inference layer within the noise prediction network, helping the network to predict noise more accurately.

[0079] This invention introduces a video diffusion model to generate images that fill in the missing observation information between sparse viewpoints, enabling 3D scene reconstruction from any acquired video. It does not require the acquired viewpoint and the new viewpoint to be generated to be sequentially continuous, nor does it require them to form the same continuous video sequence. This eliminates the dependence of existing methods on fixed acquisition paths or continuous viewpoint inputs, and allows it to adapt to input conditions such as free acquisition and irregular trajectories.

[0080] Step S2.3: Use the point map estimation model to extract the point cloud of the complete image from the new perspective and add it to the global 3D point cloud map to update the global 3D point cloud map. Use the updated global 3D point cloud map to generate the image of the next segment. Continue until all segment image generation and global 3D point cloud map updates are completed.
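The segment-by-segment loop of Step S2.3 can be outlined as follows. Every callable here (`render`, `select_refs`, `generate`, `extract_points`) is a hypothetical placeholder for the projection step, the reference frame retrieval, the video diffusion model, and the point map estimation model respectively; the point cloud is simplified to a set of coordinate tuples, and the toy stand-ins exist only to make the control flow executable.

```python
def split_into_segments(viewpoints, size):
    # Divide the trajectory into consecutive segments of at most `size` new viewpoints.
    return [viewpoints[i:i + size] for i in range(0, len(viewpoints), size)]


def reconstruct(segments, point_map, view_library,
                render, select_refs, generate, extract_points):
    point_map = set(point_map)
    images = []
    for segment in segments:
        rendering = render(point_map, segment)         # project the current global map
        refs = select_refs(view_library, rendering)    # pick reference frames
        new_images = generate(refs, rendering, segment)
        for image in new_images:
            point_map |= extract_points(image)         # update the geometric memory
        images.extend(new_images)
    return images, point_map


# Toy stand-ins (assumptions for illustration only).
def toy_render(point_map, segment):
    return {"visible_points": len(point_map)}


def toy_select_refs(view_library, rendering):
    return view_library[:1]


def toy_generate(refs, rendering, segment):
    # Each "image" records its viewpoint and how many map points guided it.
    return [(view, rendering["visible_points"]) for view in segment]


def toy_extract_points(image):
    view, _ = image
    return {(float(view), 0.0, 0.0)}


segments = split_into_segments([0, 1, 2, 3], size=2)
images, final_map = reconstruct(
    segments, {(-1.0, 0.0, 0.0)}, ["captured_0"],
    toy_render, toy_select_refs, toy_generate, toy_extract_points,
)
```

In this toy run the second segment is rendered from a map that already contains the points extracted from the first segment, which is the dependency the closed loop relies on.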

[0081] The key to maintaining consistency in long-trajectory reconstruction lies in the system's ability to continuously evolve its memory. Figure 4 illustrates how the global 3D point cloud in the explicit 3D geometric memory evolves: after each round of synthesizing a complete image from a new perspective, the incremental geometric information carried in the image is immediately extracted using the point map estimation model, and the extracted point cloud is merged into the explicit global geometric memory to obtain the updated global 3D point cloud.

[0082] This invention utilizes global 3D point cloud information to uniformly constrain the generation results of different segments (multiple new perspectives), and updates the global 3D point cloud information after each new perspective is generated to guide the image generation of the next segment, thereby effectively reducing the cumulative error in long trajectory generation and improving the global consistency of cross-segment generation and reconstruction in large scenes.

[0083] The comparison in Figure 4 between the unupdated and updated point clouds shows that the back sides of objects (such as the deep structures of chair backs and table undersides), which were originally missing because of the sparse perspectives, are accurately filled in after geometric memory fusion. This self-consistent recursive process allows the updated explicit 3D geometric memory to serve directly as the reference base for the next iteration, providing a more complete scene context for subsequent retrieval and generation. Through this cycle, in which generation completes the geometry and the geometry in turn enhances generation, the system progresses from locally incomplete point clouds to high-precision, seamless full-scene 3D reconstruction.

[0084] In Figure 4, the first and second stages refer to different rounds. The first-stage output and the second-stage output are both complete new-perspective images output by the video diffusion model.

[0085] The second-stage output (not updated) refers to the case in which the point cloud of the complete new-perspective image output in the first stage is not extracted and added to the global 3D point cloud map; the complete new-perspective image generated subsequently is then inconsistent with the actual scene.

[0086] The second-stage output (updated) refers to the case in which the point cloud of the complete new-perspective image output in the first stage is extracted and added to the global 3D point cloud map, thereby updating the map; the complete new-perspective image generated subsequently is then consistent with the actual scene.

[0087] The 3D reconstruction method of the present invention deeply couples two processes: generating images with a video diffusion model guided by the global 3D point cloud map, and updating the global 3D point cloud map in the explicit 3D geometric memory with the generated images. Together they form a closed-loop generation-reconstruction iterative mechanism. This solves the problems of poor geometric consistency and insufficient scalability of reconstruction results when the input viewpoints are extremely sparse, unordered, and involve long-distance displacement.

[0088] The following section introduces the training method for the unordered context video diffusion model.

[0089] To meet the efficiency requirements of real-time reconstruction, this invention employs a Distribution Matching Distillation (DMD) strategy to shorten the sampling trajectory of the pre-trained video diffusion model and significantly improve inference efficiency. The strategy uses the original high-performance pre-trained model as the teacher model to guide the learning of the student model. Its core is a distribution alignment mechanism between the two, which maps the teacher model's complex continuous denoising trajectory onto a small number of discrete sampling steps in the student model. During distillation, a distribution matching loss function ensures that the images generated by the student model are statistically consistent with those of the teacher model, and it is combined with an adversarial training loss (a generative adversarial network loss function) in which a discriminator captures high-frequency details. This compensates for the blurred image edges and lost texture that traditional distillation methods often cause.

[0090] In the specific training process, this invention employs an alternating update strategy. First, the discriminator is optimized to strengthen its discrimination ability, so that it can accurately identify the detail loss or image blurring that the student model introduces once the sampling steps are reduced. Then, with the optimized discriminator fixed, the student model is updated using the detail feedback provided by the discriminator and the distribution guidance of the teacher model, forcing the student model to generate high-frequency textures consistent with the real image within a very small number of inference steps (four). Through this iterative adversarial learning, the student model learns to maintain high image reconstruction quality while significantly compressing the sampling trajectory.

[0091] The loss function used to optimize the discriminator (the generative adversarial network loss function) is as follows:

[0092] (The formula image is not reproduced in this text.) Its symbols denote, in order: the raw, noise-free output generated by the student model; the time step t, which specifies the level of noise added to the current sample; the noisy latent feature at time step t; the denoising result of the discriminator on the noisy latent feature at time step t; and the expectation over the sample and time-step distributions. Within the overall distillation framework, this loss function enables the discriminator to learn to recover high-fidelity features from noisy samples (noisy latent features).

[0093] The distribution matching loss function used during model training is formulated as a pseudo-regression objective with a gradient-blocking operator sg. (The formula image is not reproduced in this text.) Its symbols denote, in order: the time step t; the noisy latent feature; the denoising result of the discriminator on the noisy latent feature at time step t; the denoising result of the student model on the noisy latent feature at time step t; the denoising result of the teacher model on the noisy latent feature at time step t; the step size; the time-varying normalization factor; the expectation over the sample and time-step distributions; and the gradient-blocking operator sg.
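The role of the gradient-blocking operator in a pseudo-regression objective can be shown with a small numerical sketch. It assumes one common form of such an objective, 0.5 * ||x - sg(x - g)||^2, where x is the student output and g is a guidance direction. How the source assembles g from the discriminator, student, and teacher outputs and the normalization factor is not recoverable from this text, so g is a hypothetical placeholder here. The blocking operator is emulated by computing the target once and then treating it as a constant.

```python
def pseudo_regression_loss(x, target):
    # 0.5 * squared distance between the student output and a fixed target.
    return 0.5 * sum((xi - ti) ** 2 for xi, ti in zip(x, target))


def numerical_gradient(fn, x, eps=1e-6):
    # Central finite differences, used here in place of automatic differentiation.
    grad = []
    for i in range(len(x)):
        up = list(x)
        up[i] += eps
        down = list(x)
        down[i] -= eps
        grad.append((fn(up) - fn(down)) / (2 * eps))
    return grad


x = [0.4, -1.2, 2.0]    # toy student output
g = [0.1, -0.3, 0.05]   # hypothetical guidance direction (placeholder)

# sg(x - g): the target is computed from x once and then held constant,
# so no gradient flows through it.
target = [xi - gi for xi, gi in zip(x, g)]

grad = numerical_gradient(lambda v: pseudo_regression_loss(v, target), x)
```

Because the target is held constant, the gradient of the loss with respect to x equals g. This lets a direction that is only available as a gradient be applied through an ordinary regression-style loss.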

[0094] This improvement streamlines the original denoising process, which required fifty iterations (50 rounds of noise prediction and noise removal), into a fast sampling process that requires only four steps. While achieving an inference speedup of approximately twenty times, it preserves the original model's ability to generate complex geometric topologies and fine material details, making it possible to process long-sequence 3D reconstruction tasks stably and in real time with limited computing resources.

[0095] By training a video diffusion model using distribution matching distillation, the denoising process is compressed to four sampling steps, significantly improving inference speed. Combined with a context window sparse attention mechanism that can handle variable length conditions, the model has strong generalization ability for different numbers of input perspectives, while supporting rapid and continuous reconstruction of large-scale long trajectory scenes.

[0096] The 3D reconstruction method of this invention has been validated on authoritative large-scale real-world scene datasets such as DL3DV-10K and Tanks and Temples. Results show that when dealing with sparse and unordered input, this invention significantly outperforms existing techniques in image quality metrics (PSNR, SSIM) and geometric consistency evaluation. Through a closed-loop "generation-reconstruction" mechanism, it effectively eliminates geometric drift in long-trajectory reconstruction, successfully recovering structural details not observed from the initial viewpoint. Furthermore, by combining four-step sampling distillation with sparse attention techniques, it achieves approximately 20 times faster inference while maintaining high-precision reconstruction, demonstrating the technical feasibility and superior performance of this scheme in handling large-scale, long-sequence 3D reconstruction tasks.

[0097] While some embodiments of the present invention have been described in this application, those skilled in the art will understand that these embodiments are merely illustrative. Numerous variations, alternatives, and improvements will occur to those skilled in the art under the teachings of this invention without departing from its scope. The appended claims are intended to define the scope of the invention and thereby cover methods and structures within the scope of the claims themselves and their equivalents.

Claims

1. A multi-view 3D reconstruction method based on a video diffusion model, characterized in that it comprises:
integrating the original sparse image set captured from multiple perspectives into a dynamic acquisition view library, extracting multi-scale point clouds from the original sparse image set, fusing them into a global 3D point cloud map, and storing the map in an explicit 3D geometric memory; and
dividing the user-specified reconstruction trajectory into multiple segments, each containing one or more new viewpoints, and performing image generation on each segment sequentially, including:
projecting the corresponding point cloud in the global 3D point cloud map onto the current segment to obtain a rendering index map and a rendering color map;
using the rendering index map to select reference frames from the dynamic acquisition view library through geometry-aware retrieval, wherein, for each image in the dynamic acquisition view library, the ratio of the area covered by its valid 3D point cloud after projection onto the new view to the area of the rendering index map is calculated independently as a coverage score, and images with a coverage score greater than a threshold are selected as reference frames;
inputting the reference frames, the rendering color map, and the visibility mask of the rendering color map into an unordered context video diffusion model, which outputs an image sequence containing the complete images of the new viewpoints and the reference frames, wherein the noise prediction network of the unordered context video diffusion model embeds a context window sparse attention mechanism; the unordered context video diffusion model adopts a global scene storage mechanism that converts the latent features of the reference frames into key-value cache information and places it at the beginning of the attention sequence; and the encoding and decoding module of the unordered context video diffusion model uses an uncompressed 2D variational autoencoder having an uncompressed encoder and a decoder, the uncompressed encoder being configured to extract the latent features of the reference frames and the rendering color map, and the decoder being configured to restore the latent features into images; and
extracting the point cloud of the complete images of the new viewpoints and adding it to the global 3D point cloud map to update the global 3D point cloud map, and using the updated global 3D point cloud map to generate the images of the next segment.

2. The multi-view 3D reconstruction method based on a video diffusion model according to claim 1, characterized in that a feedforward point map estimation model is used to extract multi-scale point clouds from the images in the original sparse image set and fuse them into the global 3D point cloud map; and, based on the new perspective of the current segment, the 3D point cloud map in the current explicit 3D geometric memory is projected and rendered to generate the rendering index map and the rendering color map.

3. The multi-view 3D reconstruction method based on a video diffusion model according to claim 1, characterized in that the reference frame and the rendered color map are arranged into an input image sequence and then fed into the unordered context video diffusion model, wherein the reference frame is located at the beginning of the input image sequence.

4. The multi-view 3D reconstruction method based on a video diffusion model according to claim 1, characterized in that the unordered context video diffusion model includes an uncompressed 2D variational autoencoder, a noise addition module, a noise prediction network, and a denoising module, wherein: the uncompressed 2D variational autoencoder extracts latent features from the reference frame and the rendered color map; the noise addition module adds noise to the latent features to obtain noisy latent features; the noisy latent features are input into the noise prediction network, which predicts the noise; the denoising module removes the predicted noise from the noisy latent features and outputs denoised latent variables; and the decoder of the uncompressed 2D variational autoencoder restores the denoised latent variables to images, yielding an image sequence containing the complete image from the new perspective and the reference frame.

5. The multi-view 3D reconstruction method based on a video diffusion model according to claim 1, characterized in that the unordered context video diffusion model is trained by a distribution matching distillation method, including: using a teacher model to guide the learning of a student model, and mapping the continuous denoising trajectory of the teacher model to the four-step sampling denoising of the student model.

6. A computer-readable storage medium, characterized in that it stores a computer program that, when executed by a processor, performs the steps of the method according to any one of claims 1-5.