Single-view image three-dimensional reconstruction method and system based on dual-space collaborative diffusion
By employing a dual-space collaborative diffusion method, a high-quality 3D mesh model is generated using multi-band feature decoupling and cross-space attention mechanisms. This solves the occlusion ambiguity problem in 3D reconstruction of single-view images and achieves fast and low-cost 3D model generation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHENZHEN LINXI INTELLIGENT MODEL ARTIFICIAL INTELLIGENCE CO LTD
- Filing Date
- 2026-04-30
- Publication Date
- 2026-06-19
Figure CN122244384A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the fields of computer vision and 3D geometry generation, and in particular to a method and system for 3D reconstruction of single-view images based on dual-space cooperative diffusion.
Background Technology
[0002] In scenarios such as product concept design, industrial prototyping, virtual demonstration, and metaverse asset generation, rapidly converting 2D sketches or reference images into 3D digital models is a core step in improving production efficiency. Traditional 3D modeling workflows heavily rely on manual operation (such as proficiency in professional software like CAD, Maya, and Blender) or expensive professional scanning equipment (such as high-precision laser scanners and structured light scanners), resulting in the following significant shortcomings: First, they are time-consuming; generating a topologically sound, industrially usable 3D model from a concept sketch typically takes several hours or even days, making it difficult to meet the demands of rapid iteration. Second, they are costly; professional scanning equipment is expensive and requires highly skilled operators and technicians, limiting its adoption by small and medium-sized enterprises and individual creators. Third, their applicability is limited; existing multi-view stereo vision and photogrammetry methods heavily rely on multi-view overlapping images or long video sequences, making it difficult to handle single images, which are the most common input format in the early stages of design.
[0003] In recent years, with the development of deep learning, data-driven 3D reconstruction technology has made significant progress. Early research mainly used 3D voxels or point clouds to represent 3D objects. Voxel representations are limited to low resolutions because memory consumption grows cubically with resolution; point cloud representations are flexible but lose surface topological connectivity, making them difficult to import directly into rendering engines. Recently, novel view synthesis techniques such as neural radiance fields (NeRF) and 3D Gaussian splatting have attracted much attention, but they are essentially optimized for photometric rendering, which makes it very difficult to extract clean, noise-free, high-quality 3D meshes, especially for very thin structures (such as chair legs or human fingers).
[0004] Applying diffusion models to 3D generation is a cutting-edge exploration in this field. While existing 3D diffusion models (such as DreamFusion, Point-E, Shap-E, etc.) can generate diverse 3D shapes, they still face significant challenges under single-image conditions: due to the severe occlusion ambiguity of single 2D images (the back of the object cannot be observed), diffusion models based on implicit functions (such as signed distance fields or NeRF) often generate chaotic geometry on the back of the object, or exhibit surface breaks and non-manifold errors at complex topological changes (such as structures with multiple holes).
[0005] The fundamental reason is that existing generative models typically learn in a single representation space (implicit or explicit). Implicit spaces excel at representing continuous surfaces but lack global topology-aware biases; explicit spaces (such as meshes or graph structures) excel at maintaining topological connectivity but are difficult to deform and prone to self-intersection. Because the advantages of both representations cannot be combined during the generation process, the model struggles to make reasonable and accurate geometric inferences when faced with occlusion ambiguity in single-view inputs.
Summary of the Invention
[0006] To overcome the aforementioned deficiencies of the prior art, this invention provides a method and system for single-view image 3D reconstruction based on dual-space cooperative diffusion, in order to solve the problems existing in the background art.
[0007] This invention provides the following technical solution: a single-view image 3D reconstruction method based on dual-space cooperative diffusion, comprising:
Acquiring a single-view two-dimensional image of the target object, and performing multi-band feature decoupling on the single-view two-dimensional image to obtain two-dimensional global semantic features and local depth cue features;
Constructing an initial implicit signed distance field latent space distribution and an explicit topological deformation graph space distribution;
Using the two-dimensional global semantic features and the local depth cue features as guiding conditions, inputting the implicit signed distance field latent space distribution and the explicit topological deformation graph space distribution into a pre-trained dual-space cooperative diffusion model for joint reverse denoising to obtain denoised implicit geometric feature data and explicit topological feature data, wherein, during the joint reverse denoising, the implicit geometric feature data and the explicit topological feature data are bidirectionally aligned through a cross-space attention mechanism;
Based on the denoised implicit geometric feature data and the explicit topological feature data, generating a target three-dimensional mesh model using a differentiable isosurface extraction algorithm.
[0008] Furthermore, the step of performing multi-band feature decoupling on the single-view two-dimensional image includes:
Inputting the single-view two-dimensional image into a Vision Transformer network to extract low-frequency two-dimensional global semantic features, which characterize the overall category and macroscopic skeleton of the target object;
Inputting the single-view two-dimensional image into a monocular depth estimation subnetwork containing dense dilated convolutions to extract high-frequency local depth cue features, which characterize the geometric details of the visible surface of the target object.
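[0009] Furthermore, the joint reverse denoising process is expressed as follows. At the $t$-th time step, the reverse process of the implicit diffusion branch is:
$$z_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(z_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(z_t, g_t, c, t)\right) + \sigma_t\,\epsilon_z$$
The reverse process of the explicit graph diffusion branch is:
$$g_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(g_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\phi(g_t, z_t, c, t)\right) + \sigma_t\,\epsilon_g$$
where $z_t$ denotes the implicit signed distance field latent vector at step $t$; $g_t$ denotes the explicit topological deformation graph node features at step $t$; $c$ denotes the condition vector containing the two-dimensional global semantic features and the local depth cue features; $\alpha_t$ and $\sigma_t$ are the noise scheduling parameters of the diffusion model, with $\bar{\alpha}_t = \prod_{s \le t} \alpha_s$; $\epsilon_\theta$ and $\epsilon_\phi$ denote the implicit and explicit noise prediction networks, respectively; and $\epsilon_z$ and $\epsilon_g$ are standard Gaussian noise.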
[0010] Furthermore, the cross-space attention mechanism is used at each time step $t$ to compute the interaction between the implicit noise prediction network and the explicit noise prediction network, and its cross-space information transfer formula is expressed as follows:
$$\tilde{F}_I^{(l)} = \mathrm{softmax}\!\left(\frac{\left(F_I^{(l)} W_Q\right)\left(F_G^{(l)} W_K\right)^{\top}}{\sqrt{d_k}}\right) F_G^{(l)} W_V$$
where $F_I^{(l)}$ denotes the implicit geometric feature matrix of the $l$-th layer; $F_G^{(l)}$ denotes the explicit graph node feature matrix of the $l$-th layer; $W_Q$, $W_K$, and $W_V$ denote the learnable query, key, and value weight matrices, respectively; $d_k$ denotes the scaling factor; and $\tilde{F}_I^{(l)}$ denotes the implicit geometric features that incorporate explicit topological information, used to suppress implicit surface topological breaks caused by single-view occlusion.
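The cross-space information transfer described above can be sketched in NumPy as ordinary scaled dot-product cross-attention, with queries taken from the implicit features and keys and values taken from the graph node features. This is a minimal illustration under assumptions: the function name, the feature sizes, and the random weights are hypothetical, and the actual network layers of the invention are not specified in this text.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_space_attention(f_implicit, f_graph, w_q, w_k, w_v):
    """Implicit features (N, d) attend to graph-node features (M, d)."""
    q = f_implicit @ w_q                       # queries from the implicit branch, (N, d_k)
    k = f_graph @ w_k                          # keys from the explicit graph branch, (M, d_k)
    v = f_graph @ w_v                          # values from the explicit graph branch, (M, d_v)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # scaled similarity, (N, M)
    weights = softmax(scores, axis=-1)         # each implicit feature distributes weight over nodes
    return weights @ v, weights

# Hypothetical sizes: 6 implicit feature rows, 4 graph nodes, 8 channels.
rng = np.random.default_rng(0)
f_implicit = rng.normal(size=(6, 8))
f_graph = rng.normal(size=(4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) * 0.1 for _ in range(3))

fused, weights = cross_space_attention(f_implicit, f_graph, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Each row of `weights` sums to one, so every implicit feature receives a convex combination of graph-node values. The reverse direction (graph nodes attending to implicit features) would swap the roles of the two inputs.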
[0011] Furthermore, the graph structure in the explicit topological deformation graph space distribution is constrained by a Laplacian topological regularization term, which is expressed as follows:
$$\mathcal{L}_{\mathrm{lap}} = \left\lVert L V \right\rVert_F^2 + \lambda \sum_{(i,j) \in E} \left\lVert v_i - v_j \right\rVert_2^2$$
where $V$ denotes the three-dimensional coordinate matrix of the nodes in the deformation graph; $L$ denotes the Laplacian matrix of the graph; $E$ denotes the edge set of the graph; $v_i$ and $v_j$ denote the coordinates of adjacent nodes $i$ and $j$, respectively; and $\lambda$ is the regularization weight coefficient. The term is used to keep the 3D mesh surface smooth and free of self-intersections during reconstruction.
[0012] Furthermore, the step of generating the target 3D mesh model based on the denoised implicit geometric feature data and the explicit topological feature data using a differentiable isosurface extraction algorithm includes:
Decoding the denoised implicit geometric feature data into a signed distance scalar field on a continuous three-dimensional voxel grid;
Using the node coordinates in the explicit topological feature data as anchor points, locally deforming and aligning the signed distance scalar field;
Based on the aligned signed distance scalar field, extracting the zero isosurface using the differentiable Marching Cubes algorithm to obtain a base 3D mesh;
Projecting the local depth cue features onto the surface of the base 3D mesh and performing vertex-level fine-tuning to obtain the final target 3D mesh model.
[0013] A single-view image 3D reconstruction system based on dual-space cooperative diffusion includes:
A conditional feature parsing module, used to acquire a single-view two-dimensional image of the target object and to perform multi-band feature decoupling on the single-view two-dimensional image to obtain two-dimensional global semantic features and local depth cue features;
A space initialization module, used to construct the initial implicit signed distance field latent space distribution and the explicit topological deformation graph space distribution;
A dual-space cooperative diffusion module, used to take the two-dimensional global semantic features and the local depth cue features as guiding conditions, input the implicit signed distance field latent space distribution and the explicit topological deformation graph space distribution into the pre-trained dual-space cooperative diffusion model for joint reverse denoising, obtain denoised implicit geometric feature data and explicit topological feature data, and perform bidirectional feature alignment through a cross-space attention mechanism during the joint reverse denoising;
A differentiable mesh generation module, used to generate a target three-dimensional mesh model based on the denoised implicit geometric feature data and the explicit topological feature data, using a differentiable isosurface extraction algorithm.
[0014] Furthermore, the conditional feature parsing module is specifically used for: Low-frequency two-dimensional global semantic features are extracted using a pre-trained Vision Transformer; High-frequency local depth cue features are extracted using a dense dilated convolutional network based on multi-scale receptive fields.
[0015] An electronic device includes: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the method as described in any one of the preceding descriptions.
[0016] A computer-readable storage medium storing instructions that, when executed on a computer, cause the computer to perform the method as described in any one of the preceding descriptions.
[0017] The technical effects and advantages of this invention are as follows: This invention constructs a dual-space cooperative diffusion architecture combining an implicit signed distance field and an explicit topological deformation graph, and introduces a cross-space attention mechanism into the reverse denoising process, so that the model can refer to explicit topological constraints when generating invisible regions such as the back of an object, thereby reducing topological breaks, floating blocks, and non-manifold errors caused by occlusion ambiguity.
[0018] This invention employs a multi-band feature decoupling strategy to extract low-frequency global semantic features and high-frequency local depth cue features separately, and uses the local depth cues to perform vertex-level fine-tuning on the generated mesh. This helps preserve subtle geometric information in the input image, such as surface bumps and edge chamfers, and improves the detail fidelity of the reconstructed model.
[0019] This invention introduces a Laplacian topological regularization term into the explicit topological deformation graph branch to constrain the displacement smoothness of graph nodes and the distance between adjacent nodes. This reduces geometric defects such as mesh self-intersection and excessive local distortion during generation, resulting in better watertightness and surface smoothness of the output 3D mesh.
[0020] Compared to traditional manual modeling or multi-view photogrammetry methods, this invention uses a single two-dimensional image as input and directly generates a three-dimensional mesh model through an end-to-end dual-space cooperative diffusion model. It eliminates the need for multi-view acquisition or professional scanning equipment, thus reducing the time cost of generating three-dimensional assets.
Attached Figure Description
[0021] Figure 1 is a flowchart of the single-view image 3D reconstruction method based on dual-space cooperative diffusion provided in an embodiment of this application; Figure 2 is a schematic diagram of the principle and structure of the multi-band feature decoupling mechanism provided in an embodiment of this application; Figure 3 is a structural block diagram of the dual-space cooperative diffusion system provided in an embodiment of this application; Figure 4 is a hardware structure block diagram of an electronic device provided in an embodiment of this application.
Detailed Implementation
[0022] The technical solutions in the embodiments of this application will now be described with reference to the accompanying drawings.
[0023] It should be noted that similar reference numerals and letters in the following figures indicate similar items; therefore, once an item is defined in one figure, it does not need to be further defined and explained in subsequent figures. Furthermore, in the description of this application, terms such as "first," "second," etc., are used only to distinguish descriptions and should not be construed as indicating or implying relative importance.
[0024] Generally, conventional 3D reconstruction or generation algorithms face significant uncertainty when handling single-view input. This is because a single 2D image contains only the projection of the front of the target object and no information about its back side. Existing 3D generation methods based on implicit neural radiance fields or simple voxel diffusion tend to fill in the missing back side with homogeneous, blurred, or even severely distorted geometry. For example, when generating a "round-bottomed chair with holes" from a single front image, traditional algorithms often generate a completely solid cylinder because they cannot determine the back topology. Furthermore, purely data-driven generation lacks physical constraints, which easily leads to meshes with numerous non-manifold edges and self-intersecting triangles, making them unsuitable for direct use in 3D printing or game physics engines.
[0025] To address the aforementioned technical problems, embodiments of this application provide a method, system, electronic device, and computer-readable storage medium for single-view image 3D reconstruction based on dual-space cooperative diffusion. Referring to Figure 1, which is a flowchart of the three-dimensional reconstruction method provided by this application, the method includes the following steps:
S100: Obtain a single-view two-dimensional image of the target object, and perform multi-band feature decoupling on the single-view two-dimensional image to obtain two-dimensional global semantic features and local depth cue features.
[0026] For example, the input single-view 2D image can be a user-drawn design sketch or an RGB real-world image captured from the internet. After acquiring the image, instead of compressing it into a single low-dimensional feature vector as in traditional methods, multi-band feature decoupling is performed.
[0027] Since a single image contains both "macroscopic structure" (such as the outline of an object, aspect ratio, and component composition) and "microscopic details" (such as surface texture, slight bumps, and edge chamfers), this step extracts these details using two parallel neural network sub-architectures: First, the image is input into a pre-trained Vision Transformer (ViT) model to extract low-frequency two-dimensional global semantic features. ViT uses a self-attention mechanism to capture the global receptive field, and its output features encode the overall category bias of objects.
[0028] Secondly, the image is simultaneously input into a monocular depth estimation module based on dense atrous (dilated) convolution. Atrous convolution expands the receptive field without reducing the feature map resolution, thereby extracting local depth cue features containing dense high-frequency depth information. These two sets of features are combined to form the strong guiding condition $c$ for the subsequent diffusion model.
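The statement that atrous convolution enlarges the receptive field without shrinking the feature map can be illustrated with a small one-dimensional NumPy sketch. The helper names `dilated_conv1d` and `receptive_field` are hypothetical and the example is not the depth estimation subnetwork itself; it only shows that a 3-tap kernel with dilation 4 spans 9 input samples while the output keeps the input length.

```python
import numpy as np

def dilated_conv1d(signal, kernel, dilation):
    """'Same'-padded 1D cross-correlation with `dilation` samples between kernel taps."""
    span = (len(kernel) - 1) * dilation        # distance between first and last tap
    pad = span // 2
    padded = np.pad(signal, (pad, span - pad))
    out = np.zeros(len(signal), dtype=float)
    for i, w in enumerate(kernel):
        start = i * dilation
        out += w * padded[start:start + len(signal)]
    return out

def receptive_field(kernel_size, dilation):
    # Number of input samples covered by one output sample.
    return (kernel_size - 1) * dilation + 1

signal = np.zeros(32)
signal[16] = 1.0                               # unit impulse
kernel = np.ones(3)

responses = {}
for d in (1, 2, 4):
    y = dilated_conv1d(signal, kernel, d)
    responses[d] = np.flatnonzero(y).tolist()
    print(d, receptive_field(3, d), len(y), responses[d])
```

The impulse response spreads from three adjacent samples (dilation 1) to samples 12, 16, and 20 (dilation 4), while every output has the same 32 samples as the input.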
[0029] S200: Construct the initial implicit signed distance field latent space distribution and the explicit topological deformation graph space distribution.
[0030] For example, in a diffusion-based generative task, a “canvas” for the generative process must be defined. To overcome the inherent limitations of a single representation, this application designs an original dual-space initialization mechanism.
[0031] Implicit space initialization: Performing diffusion denoising directly in the huge 3D voxel space would lead to prohibitive memory consumption. Instead, a pre-trained autoencoder compresses the 3D mesh into a compact latent signed distance field (latent SDF) space, and the initial state $z_T$ is a pure noise vector sampled from the standard normal distribution $\mathcal{N}(0, I)$. The defining property of an SDF is that its value at any point in space is the shortest distance from that point to the object surface, positive outside the object and negative inside. The zero isosurface is the model surface, which keeps the generated surface watertight.
[0032] Explicit space initialization: Simultaneously, an explicit semantic deformation graph $\mathcal{G} = (V, E)$ is initialized. The nodes $V$ are initially distributed uniformly on the surface of a standard bounding sphere, and each node is assigned random features sampled from a normal distribution, which together form the initial node feature matrix $g_T$. The edge set $E$ is built by K-nearest-neighbor (KNN) connections based on the initial spatial distance between nodes.
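The sign convention of the signed distance field and the sampling of the initial latent state can be sketched as follows. This is a minimal illustration that assumes an analytic unit sphere in place of a learned SDF, and a hypothetical latent size of 64; neither value comes from the patent text.

```python
import numpy as np

def sphere_sdf(points, radius=1.0):
    """Signed distance to a sphere centred at the origin: negative inside, positive outside."""
    return np.linalg.norm(points, axis=-1) - radius

# Three probe points: the centre, a surface point, and an exterior point.
probes = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [2.0, 0.0, 0.0]])
distances = sphere_sdf(probes)
print(distances)

# Initial latent state: pure noise from a standard normal distribution.
latent_dim = 64                                # hypothetical size, not from the source
rng = np.random.default_rng(0)
z_init = rng.standard_normal(latent_dim)
print(z_init.shape)
```

The surface is exactly the set of points where the function returns zero, which is why extracting the zero isosurface of a valid SDF yields a closed surface.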
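The explicit graph initialization described above can be sketched in NumPy. The sketch assumes a Fibonacci lattice as one way to place nodes approximately uniformly on the unit sphere, and a brute-force nearest-neighbour search; the node count, neighbour count `k`, and feature size are hypothetical values chosen for illustration.

```python
import numpy as np

def fibonacci_sphere(n):
    """Approximately uniform points on the unit sphere (Fibonacci lattice)."""
    i = np.arange(n) + 0.5
    polar = np.arccos(1.0 - 2.0 * i / n)
    azimuth = np.pi * (1.0 + 5.0 ** 0.5) * i
    return np.stack([np.cos(azimuth) * np.sin(polar),
                     np.sin(azimuth) * np.sin(polar),
                     np.cos(polar)], axis=1)

def knn_edges(coords, k):
    """Undirected edge list connecting each node to its k nearest neighbours."""
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)               # a node is not its own neighbour
    nbrs = np.argsort(d2, axis=1)[:, :k]
    edges = set()
    for i in range(len(coords)):
        for j in nbrs[i]:
            j = int(j)
            edges.add((min(i, j), max(i, j)))  # store each undirected edge once
    return sorted(edges)

n_nodes, k, feat_dim = 64, 6, 16               # hypothetical sizes
rng = np.random.default_rng(0)
coords = fibonacci_sphere(n_nodes)             # node coordinates on the bounding sphere
edges = knn_edges(coords, k)                   # KNN connectivity
feats = rng.standard_normal((n_nodes, feat_dim))  # random initial node features
print(coords.shape, len(edges), feats.shape)
```

Because KNN is not symmetric, a node can end up with more than `k` incident edges after the edges are made undirected.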
[0033] S300: Using the two-dimensional global semantic features and the local depth cue features as guiding conditions, the implicit signed distance field latent space distribution and the explicit topological deformation graph space distribution are input into a pre-trained dual-space cooperative diffusion model for joint reverse denoising.
[0034] For example, this is the core mechanism of the present invention. During the step-by-step denoising from step $T$ to step $0$ (gradually removing Gaussian noise to restore the true features), the latent space branch is responsible for "sculpting" the fine geometric surface, while the graph space branch is responsible for "guiding" the overall topological skeleton. The two processes are not isolated: at each time step $t$, each branch depends not only on the time embedding of the current step and the conditional features $c$, but also on the state of the other space. The mathematical model is as follows:
$$z_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(z_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(z_t, g_t, c, t)\right) + \sigma_t\,\epsilon_z$$
$$g_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(g_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\phi(g_t, z_t, c, t)\right) + \sigma_t\,\epsilon_g$$
To achieve deep information interaction between the two prediction networks, a cross-space attention mechanism is incorporated into the network feature layers. The implicit features $F_I^{(l)}$ typically form a dense feature map, while the explicit features $F_G^{(l)}$ are a set of discrete nodes. Fusion is achieved through the following formula:
$$\tilde{F}_I^{(l)} = \mathrm{softmax}\!\left(\frac{\left(F_I^{(l)} W_Q\right)\left(F_G^{(l)} W_K\right)^{\top}}{\sqrt{d_k}}\right) F_G^{(l)} W_V$$
The physical significance of this step is that when the SDF space decides whether a "surface" should be generated at a certain point in space, it refers not only to the two-dimensional image but also to the feature attributes of the nearest topological nodes of the graph. If the graph nodes indicate that there should be no solid in a region (such as the hollow area between the backrest and the seat of a chair), the attention mechanism assigns a very high repulsion weight, suppressing the SDF from generating floating noise blocks at that location.
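The coupling between the two branches can be sketched as a loop in which each branch takes a reverse diffusion step whose noise estimate also sees the current state of the other branch. This sketch makes several assumptions that are not in the source: it uses the standard DDPM reverse update with a linear noise schedule, and the two predictor functions are untrained stand-ins (simple linear expressions) in place of the real networks, so the output has no geometric meaning.

```python
import numpy as np

def reverse_step(x_t, eps_pred, alpha_t, alpha_bar_t, sigma_t, rng):
    """One standard DDPM reverse update (assumed form, not confirmed by the source)."""
    mean = (x_t - (1.0 - alpha_t) / np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_t)
    return mean + sigma_t * rng.standard_normal(x_t.shape)

def eps_implicit(z, g, c, t):
    # Hypothetical stand-in for the implicit noise predictor: sees z, the graph state g, and c.
    return 0.5 * z + 0.1 * g.mean() + 0.1 * c.mean()

def eps_explicit(g, z, c, t):
    # Hypothetical stand-in for the explicit noise predictor: sees g, the latent state z, and c.
    return 0.5 * g + 0.1 * z.mean() + 0.1 * c.mean()

T = 50                                         # hypothetical number of steps
betas = np.linspace(1e-4, 0.02, T)             # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

rng = np.random.default_rng(0)
z = rng.standard_normal(16)                    # latent SDF vector, starts as pure noise
g = rng.standard_normal((8, 4))                # graph node features, start as pure noise
c = rng.standard_normal(32)                    # condition vector from the image features

for t in range(T - 1, -1, -1):
    sigma = np.sqrt(betas[t]) if t > 0 else 0.0   # no noise is added at the final step
    # Both predictions use the states from the same step t before either is updated.
    eps_z = eps_implicit(z, g, c, t)
    eps_g = eps_explicit(g, z, c, t)
    z = reverse_step(z, eps_z, alphas[t], alpha_bars[t], sigma, rng)
    g = reverse_step(g, eps_g, alphas[t], alpha_bars[t], sigma, rng)

print(z.shape, g.shape)
```

Computing both noise estimates before updating either state keeps the two branches synchronized on the same time step, which is one reasonable reading of "joint" denoising.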
[0035] Furthermore, to ensure the physical plausibility of the graph structure during subsequent deformation, a strict Laplacian topological regularization term is applied during the training of this denoising network:
$$\mathcal{L}_{\mathrm{lap}} = \left\lVert L V \right\rVert_F^2 + \lambda \sum_{(i,j) \in E} \left\lVert v_i - v_j \right\rVert_2^2$$
This loss function ensures that the displacement of nodes in the graph space remains smooth during diffusion and deformation, and that adjacent nodes do not cross or flip. Such defects are the fundamental reason that models generated by traditional methods require extensive manual post-processing. With this constraint, the model output is physically plausible by construction.
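How such a regularizer penalizes node flips can be checked on a toy graph. The sketch below assumes a loss of the form "squared Frobenius norm of the Laplacian applied to the node coordinates, plus a weight times the sum of squared edge lengths", built from an unweighted combinatorial Laplacian; the exact form used by the invention is not recoverable from this text, and all function names are hypothetical.

```python
import numpy as np

def graph_laplacian(n, edges):
    """Unweighted combinatorial Laplacian L = D - A."""
    lap = np.zeros((n, n))
    for i, j in edges:
        lap[i, i] += 1.0
        lap[j, j] += 1.0
        lap[i, j] -= 1.0
        lap[j, i] -= 1.0
    return lap

def edge_length_sum(coords, edges):
    # Sum of squared distances between adjacent nodes.
    return sum(float(np.sum((coords[i] - coords[j]) ** 2)) for i, j in edges)

def laplacian_regularizer(coords, lap, edges, lam):
    smoothness = float(np.sum((lap @ coords) ** 2))   # ||L V||_F^2
    return smoothness + lam * edge_length_sum(coords, edges)

# Toy deformation graph: 8 nodes on a ring, placed on a unit circle in the z = 0 plane.
n = 8
angles = 2.0 * np.pi * np.arange(n) / n
smooth = np.stack([np.cos(angles), np.sin(angles), np.zeros(n)], axis=1)
edges = [(i, (i + 1) % n) for i in range(n)]
lap = graph_laplacian(n, edges)

# Swap the positions of two adjacent nodes to imitate a local flip.
flipped = smooth.copy()
flipped[[3, 4]] = flipped[[4, 3]]

loss_smooth = laplacian_regularizer(smooth, lap, edges, lam=0.1)
loss_flipped = laplacian_regularizer(flipped, lap, edges, lam=0.1)
print(loss_smooth, loss_flipped)
```

The flipped configuration stretches two edges and breaks the local averaging pattern, so both terms of the loss increase relative to the regular ring.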
[0036] S400: Based on the denoised implicit geometric feature data and the explicit topological feature data, a target three-dimensional mesh model is generated using a differentiable isosurface extraction algorithm.
[0037] For example, when the diffusion process reaches $t = 0$, a clean latent SDF vector $z_0$ and a mature topological deformation graph $g_0$ are obtained. At this point, $z_0$ is fed into the decoder, which decodes it into continuous SDF scalar values at each voxel corner of a high-resolution three-dimensional voxel grid.
[0038] Subsequently, the surface is not simply extracted as-is. Instead, the nodes of the explicit graph $g_0$ serve as control anchors, and the SDF scalar field is fine-tuned through Laplacian deformation so that it strictly conforms to the topological skeleton. Finally, a differentiable isosurface extraction method in the style of Marching Cubes (such as the DMTet architecture) is used to extract the zero isosurface. Because the algorithm is fully differentiable, the local depth cue features can be applied in the final stages of both training and inference: the vertices of the extracted base mesh surface are projected onto the depth cues, like an "embossing", and sub-voxel vertex offsets are computed. This step gives the reconstructed model fine surface detail, such as the texture of clothing or surface carvings.
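The sub-voxel placement of surface vertices can be illustrated with the interpolation step that Marching Cubes-style methods use on grid edges where the SDF changes sign. The sketch below covers only this vertex placement along edges parallel to the x-axis, on an analytic sphere SDF; it is not a full isosurface extractor and does not build triangles. The grid size and function name are hypothetical.

```python
import numpy as np

def zero_crossings_along_x(sdf, xs, ys, zs):
    """Linearly interpolated zero crossings on grid edges parallel to the x-axis."""
    a, b = sdf[:-1], sdf[1:]                   # SDF values at both ends of each x-edge
    i, j, k = np.nonzero(a * b < 0)            # edges whose endpoints have opposite signs
    t = a[i, j, k] / (a[i, j, k] - b[i, j, k]) # fraction along the edge where the SDF is zero
    x = xs[i] + t * (xs[i + 1] - xs[i])
    return np.stack([x, ys[j], zs[k]], axis=1)

# Sample a unit-sphere SDF on a regular grid (stand-in for the decoded SDF field).
n = 33
axis = np.linspace(-1.5, 1.5, n)
X, Y, Z = np.meshgrid(axis, axis, axis, indexing="ij")
sdf = np.sqrt(X ** 2 + Y ** 2 + Z ** 2) - 1.0

points = zero_crossings_along_x(sdf, axis, axis, axis)
radial_error = np.abs(np.linalg.norm(points, axis=1) - 1.0)
print(len(points), radial_error.max())
```

The interpolated points lie much closer to the true surface than the grid spacing, and each coordinate is a smooth function of the SDF values at the edge endpoints, which is what allows gradients to flow from vertex positions back to the field. In practice a tested library routine would normally be preferred over a hand-written extractor.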
[0039] III. System Architecture Implementation: Referring to Figure 3, which is a structural block diagram of the single-view image 3D reconstruction system based on dual-space cooperative diffusion provided by this application, the system includes: The conditional feature parsing module 100, used to acquire a single-view two-dimensional image of the target object and to perform multi-band feature decoupling on the single-view two-dimensional image to obtain two-dimensional global semantic features and local depth cue features. Specifically, it runs on a GPU cluster equipped with ViT-Large and Dense-Atrous-Net, which supports fast feature extraction from ultra-high-definition 4K images.
[0040] The space initialization module 200 is used to construct the initial implicit signed distance field latent space distribution and the explicit topological deformation graph space distribution, and to initialize in memory the Gaussian noise matrix and the graph connectivity matrix of the spherical topology.
[0041] The dual-space cooperative diffusion module 300 takes the two-dimensional global semantic features and the local depth cue features as guiding conditions and inputs the initial dual-space distributions into the cooperative diffusion model for joint reverse denoising. This module performs large-scale parallel computation of the cross-space attention matrix through optimized CUDA operators.
[0042] The Differentiable Mesh Generation Module 400 is used to generate and output target 3D mesh models in standard formats (such as OBJ, FBX, etc.) based on denoised features using a differentiable isosurface extraction algorithm. It can directly interface with industrial-grade engines such as Unreal Engine 5.
[0043] IV. Hardware and Storage Media Examples: Referring to Figure 4, which is a hardware structure block diagram of an electronic device provided in an embodiment of this application, the electronic device includes a processor 510 (such as a high-performance GPU or AI accelerator card like the NVIDIA RTX A6000), a communication interface 520, a memory 530 (such as high-bandwidth HBM2 or DDR5 memory), and a communication bus 540. The processor 510 executes a computer program stored in the memory 530 to perform the aforementioned dual-space cooperative feature denoising, cross-space attention matrix multiplication, and final differentiable mesh extraction.
[0044] This application also provides a non-volatile computer-readable storage medium storing instruction code. When the instructions are executed on a computer, the computer performs the aforementioned 3D reconstruction method. The computer may be a local computing node or a distributed cloud computing cluster.
[0045] The above are merely specific embodiments of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. A method for three-dimensional reconstruction of a single-view image based on dual-space cooperative diffusion, characterized in that it comprises:
Acquiring a single-view two-dimensional image of the target object, and performing multi-band feature decoupling on the single-view two-dimensional image to obtain two-dimensional global semantic features and local depth cue features;
Constructing an initial implicit signed distance field latent space distribution and an explicit topological deformation graph space distribution;
Using the two-dimensional global semantic features and the local depth cue features as guiding conditions, inputting the implicit signed distance field latent space distribution and the explicit topological deformation graph space distribution into a pre-trained dual-space cooperative diffusion model for joint reverse denoising to obtain denoised implicit geometric feature data and explicit topological feature data, wherein, during the joint reverse denoising, the implicit geometric feature data and the explicit topological feature data are bidirectionally aligned through a cross-space attention mechanism;
Based on the denoised implicit geometric feature data and the explicit topological feature data, generating a target three-dimensional mesh model using a differentiable isosurface extraction algorithm.
2. The method according to claim 1, characterized in that the step of performing multi-band feature decoupling on the single-view two-dimensional image includes:
Inputting the single-view two-dimensional image into a Vision Transformer network to extract low-frequency two-dimensional global semantic features, which characterize the overall category and macroscopic skeleton of the target object;
Inputting the single-view two-dimensional image into a monocular depth estimation subnetwork containing dense dilated convolutions to extract high-frequency local depth cue features, which characterize the geometric details of the visible surface of the target object.
3. The method according to claim 1, characterized in that the dual-space cooperative diffusion model includes an implicit diffusion branch and an explicit graph diffusion branch, and the joint reverse denoising process is expressed as follows. At the $t$-th time step, the reverse process of the implicit diffusion branch is:
$$z_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(z_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(z_t, g_t, c, t)\right) + \sigma_t\,\epsilon_z$$
The reverse process of the explicit graph diffusion branch is:
$$g_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(g_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\phi(g_t, z_t, c, t)\right) + \sigma_t\,\epsilon_g$$
where $z_t$ denotes the implicit signed distance field latent vector at step $t$; $g_t$ denotes the explicit topological deformation graph node features at step $t$; $c$ denotes the condition vector containing the two-dimensional global semantic features and the local depth cue features; $\alpha_t$ and $\sigma_t$ are the noise scheduling parameters of the diffusion model, with $\bar{\alpha}_t = \prod_{s \le t} \alpha_s$; $\epsilon_\theta$ and $\epsilon_\phi$ denote the implicit and explicit noise prediction networks, respectively; and $\epsilon_z$ and $\epsilon_g$ are standard Gaussian noise.
4. The method according to claim 3, characterized in that the cross-space attention mechanism is used at each time step $t$ to compute the interaction between the implicit noise prediction network and the explicit noise prediction network, and its cross-space information transfer formula is expressed as follows:
$$\tilde{F}_I^{(l)} = \mathrm{softmax}\!\left(\frac{\left(F_I^{(l)} W_Q\right)\left(F_G^{(l)} W_K\right)^{\top}}{\sqrt{d_k}}\right) F_G^{(l)} W_V$$
where $F_I^{(l)}$ denotes the implicit geometric feature matrix of the $l$-th layer; $F_G^{(l)}$ denotes the explicit graph node feature matrix of the $l$-th layer; $W_Q$, $W_K$, and $W_V$ denote the learnable query, key, and value weight matrices, respectively; $d_k$ denotes the scaling factor; and $\tilde{F}_I^{(l)}$ denotes the implicit geometric features that incorporate explicit topological information, used to suppress implicit surface topological breaks caused by single-view occlusion.
5. The method according to claim 1 or 3, characterized in that the graph structure in the explicit topological deformation graph space distribution is constrained by a Laplacian topological regularization term, which is expressed as follows:
$$\mathcal{L}_{\mathrm{lap}} = \left\lVert L V \right\rVert_F^2 + \lambda \sum_{(i,j) \in E} \left\lVert v_i - v_j \right\rVert_2^2$$
where $V$ denotes the three-dimensional coordinate matrix of the nodes in the deformation graph; $L$ denotes the Laplacian matrix of the graph; $E$ denotes the edge set of the graph; $v_i$ and $v_j$ denote the coordinates of adjacent nodes $i$ and $j$, respectively; and $\lambda$ is the regularization weight coefficient. The term is used to keep the 3D mesh surface smooth and free of self-intersections during reconstruction.
6. The method according to claim 1, characterized in that the step of generating a target 3D mesh model based on the denoised implicit geometric feature data and the explicit topological feature data using a differentiable isosurface extraction algorithm includes:
Decoding the denoised implicit geometric feature data into a signed distance scalar field on a continuous three-dimensional voxel grid;
Using the node coordinates in the explicit topological feature data as anchor points, locally deforming and aligning the signed distance scalar field;
Based on the aligned signed distance scalar field, extracting the zero isosurface using the differentiable Marching Cubes algorithm to obtain a base 3D mesh;
Projecting the local depth cue features onto the surface of the base 3D mesh and performing vertex-level fine-tuning to obtain the final target 3D mesh model.
7. A single-view image 3D reconstruction system based on dual-space cooperative diffusion, characterized in that it comprises:
A conditional feature parsing module, used to acquire a single-view two-dimensional image of the target object and to perform multi-band feature decoupling on the single-view two-dimensional image to obtain two-dimensional global semantic features and local depth cue features;
A space initialization module, used to construct the initial implicit signed distance field latent space distribution and the explicit topological deformation graph space distribution;
A dual-space cooperative diffusion module, used to take the two-dimensional global semantic features and the local depth cue features as guiding conditions, input the implicit signed distance field latent space distribution and the explicit topological deformation graph space distribution into the pre-trained dual-space cooperative diffusion model for joint reverse denoising, obtain denoised implicit geometric feature data and explicit topological feature data, and perform bidirectional feature alignment through a cross-space attention mechanism during the joint reverse denoising;
A differentiable mesh generation module, used to generate a target three-dimensional mesh model based on the denoised implicit geometric feature data and the explicit topological feature data, using a differentiable isosurface extraction algorithm.
8. The system according to claim 7, characterized in that the conditional feature parsing module is specifically used for: extracting low-frequency two-dimensional global semantic features using a pre-trained Vision Transformer; and extracting high-frequency local depth cue features using a dense dilated convolutional network based on multi-scale receptive fields.
9. An electronic device, characterized in that it comprises: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method as claimed in any one of claims 1 to 6.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores instructions that, when executed on a computer, cause the computer to perform the method as claimed in any one of claims 1 to 6.