A three-dimensional object reconstruction system based on deep learning

By using a cascaded 3D reconstruction network and a depth map optimization module, combined with visibility perception and variance-based disparity range prediction, the problem of high memory and computational consumption in existing technologies is solved, achieving high-precision and complete 3D reconstruction results.

CN115359191B · Active · Publication Date: 2026-06-23 · CHONGQING UNIV OF TECH +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CHONGQING UNIV OF TECH
Filing Date
2022-09-13
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing deep learning-based 3D reconstruction methods have high memory and computational costs, making it difficult to achieve high-precision and highly complete reconstructions, and they fail to effectively handle the visibility problem of occluded areas in images.

Method used

A cascaded 3D reconstruction network is adopted, which combines a visibility-aware adaptive cost aggregation method and a depth map optimization module guided by residual and channel attention. Depth estimation is performed in stages from low to high resolution. In the cost volume generation stage, a similarity metric and a visibility-aware network are used, combined with variance-based disparity range prediction and depth map optimization, to generate a high-precision 3D dense point cloud.

Benefits of technology

It achieves high-precision and highly complete 3D reconstruction with low memory usage and low computational consumption, improving the robustness and accuracy of reconstruction, especially performing well in occluded areas and scenes with rich details.

✦ Generated by Eureka AI based on patent content.


Abstract

The present application relates to the technical field of three-dimensional reconstruction, and in particular to an object three-dimensional reconstruction system based on deep learning, which introduces an adaptive cost aggregation method with visibility perception for cost volume aggregation, acquires the visibility of pixel points in the view through a network, and can improve the reconstruction integrity of the occluded area; a variance-based method is used to predict the disparity range of each pixel, and a spatially-varying depth hypothesis surface is constructed for depth estimation in the next stage, and a residual and channel attention guided fusion depth map optimization module is proposed in the last stage to obtain an optimized depth map; an improved depth map fusion algorithm is used to combine the pixel point and 3D point re-projection error for consistency checking to obtain a dense point cloud. Quantitative and qualitative comparison results of the present application and other methods on the DTU dataset show that the present application can reconstruct a scene with better details, and achieves the purposes of reducing GPU memory consumption and computation time.

Description

Technical Field

[0001] This invention relates to the field of 3D reconstruction technology, and more specifically to a 3D object reconstruction system based on deep learning.

Background Technology

[0002] 3D reconstruction refers to the mathematical process and computer technology of recovering the three-dimensional information (shape, etc.) of an object from its two-dimensional projections. As a popular area of computer vision, 3D reconstruction technology is widely used in medicine, 3D printing, virtual reality, and 3D mapping and navigation. Traditional 3D reconstruction methods use similarity metrics and regularization methods such as normalized cross-correlation and semi-global matching to calculate photometric consistency and recover depth information. Although some current traditional algorithms perform well in terms of accuracy, they share common limitations, such as difficulty in reconstructing low-texture scenes, specular highlights, and reflective areas.

[0003] Compared to traditional algorithms, learning-based methods can learn to utilize global semantic information of the scene, including object material, specular reflectivity, and ambient lighting, to achieve more robust matching and more complete reconstruction. In recent years, the successful application of convolutional neural networks in various computer vision tasks has promoted improvements in multi-view stereo (MVS) methods. Stereo matching tasks are well-suited for deep learning-based methods because, once the images are rectified, the problem reduces to disparity estimation along the horizontal pixel direction, without needing to consider camera parameters.

[0004] In the field of deep learning-based 3D reconstruction, some researchers have proposed SurfaceNet, which pre-constructs colored voxel cubes, combining all image pixel color information and camera information into a single volume as the network input; others have proposed the learned stereo machine (LSM), which directly utilizes differentiable projection operations to achieve end-to-end training. However, both methods utilize volumetric representations on regular grids, which are limited by the huge memory consumption of 3D volumes, making their networks difficult to scale: LSM can only handle objects at low volumetric resolution, while SurfaceNet uses a heuristic divide-and-conquer strategy that requires a long time for large-scale reconstruction. In addition, some researchers have proposed end-to-end networks (such as MVSNet) that directly estimate the depth of a scene from a series of images, thereby achieving higher prediction accuracy.

[0005] While the accuracy of the aforementioned methods has been validated on various datasets, most methods utilize 3D convolutional neural networks (CNNs) to predict depth maps or voxel occupancy, leading to excessive memory consumption and limiting the improvement of estimated resolution. Subsequently, researchers proposed a novel scalable multi-view stereo framework based on recurrent neural networks, called R-MVSNet. By processing sequentially, the algorithm's online memory requirements are reduced from cubic to quadratic, enabling high-resolution reconstruction. However, this affects the completeness and accuracy of the reconstruction and also reduces the running speed. Subsequent work has used cascaded stereo networks for 3D reconstruction of multiple RGB images. However, in the process of fusion of 2D to 3D information, some shortcomings also exist, such as excessive memory consumption of the depth estimation network, inability to handle the visibility of occluded areas in the image, and excessive time consumption for calculating the depth map.

[0006] In summary, how to achieve high-precision and highly complete reconstruction with low memory usage and low computational consumption has become an urgent problem to be solved.

Summary of the Invention

[0007] To address the shortcomings of the existing technologies, this invention provides a deep learning-based 3D object reconstruction system that can achieve high-precision and highly complete reconstruction with low memory usage and low computational consumption.

[0008] To solve the above-mentioned technical problems, the present invention adopts the following technical solution:

[0009] A deep learning-based 3D object reconstruction system includes an input unit, a processing unit, a fusion unit, and a reconstruction unit; the input unit is used to input an initial image for 3D reconstruction, the initial image including a source image and a reference image;

[0010] The processing unit includes a cascaded 3D reconstruction network and a depth map optimization module. The cascaded 3D reconstruction network is used to perform depth estimation in stages from low to high resolution. Each stage of the cascaded 3D reconstruction network includes a feature extraction module, a cost volume construction module, an adaptive aggregation module, and a depth map construction module.

[0011] The feature extraction module is used to extract features from the initial image according to preset requirements and obtain the corresponding feature map. The preset requirement is that the feature extraction modules of the stages extract features sequentially in order of increasing resolution. The cost volume construction module is used to process the feature map of this stage, obtain the visibility of each pixel, and construct the corresponding cost volume. The adaptive aggregation module is used to analyze and process the cost volume of this stage to obtain the corresponding probability volume, and then uses variance-based disparity range prediction to predict the spatially varying disparity range of each pixel and construct the spatially varying depth hypothesis surface. The depth map construction module is used to predict the corresponding initial depth map based on the probability volume. If the cost volume construction module does not belong to the first stage of the cascaded 3D reconstruction network, it constructs the cost volume based on the feature map of this stage and the depth hypothesis surface of the previous stage. The depth map optimization module is used to optimize the initial depth map of the last stage to obtain the optimized depth map.

[0012] The fusion unit is used to generate a 3D dense point cloud based on the optimized depth map; the reconstruction unit is used to process the 3D dense point cloud to obtain a reconstructed 3D view.

[0013] Beneficial effects of the basic solution:

[0014] This invention proposes an adaptive cost aggregation method incorporating visibility awareness. In the cost volume generation stage, a similarity metric is employed, and a visibility-aware network determines the visibility of pixels in each view. Based on variance-based prediction of the per-pixel disparity range, the local depth range is divided into small, learned intervals, and depth estimation is performed in stages from low to high resolution. Finally, a depth map optimization module integrating residual and channel attention guidance is applied in the last stage to achieve a coarse-to-fine reconstruction. Quantitative and qualitative comparisons with other methods on the DTU dataset show that the proposed method can reconstruct scenes with better detail while reducing GPU memory consumption and computation time.

[0015] Compared with existing technologies, this method can achieve high-precision and highly complete reconstruction with low memory usage and low computational consumption.

[0016] Preferably, the feature extraction module includes an encoder and a feature extractor; the encoder includes a set of convolutional layers, the unified layer of the encoder is INPLACE-ABN, and the encoder is used to downsample the initial image by convolution with a preset stride; the feature extractor is used to extract feature maps from the decoder according to the preset requirements.

[0017] Beneficial effects: INPLACE-ABN replaces the BN+Activation combination commonly used in deep networks (batch normalization followed by a nonlinear activation layer) with a single merged layer. It stores only a small buffer of results: some intermediate results are discarded in the forward pass, and all the quantities required during backpropagation are recovered from this buffer by inverting the forward computation. Theoretically, this achieves a 50% memory saving in convolutional layers without introducing significant computational overhead, with only a 0.8-2% increase in computation time.
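
The recovery step described in [0017] can be illustrated with a minimal sketch. It assumes an invertible activation (leaky ReLU with a positive slope) and a nonzero scale `gamma`; the function names are hypothetical and are not the API of any INPLACE-ABN implementation. Only the layer output `z` is kept, and the normalized input is recomputed from it.

```python
def leaky_relu(y, slope=0.01):
    """Leaky ReLU; invertible because slope > 0 preserves the sign of y."""
    return y if y >= 0.0 else slope * y


def leaky_relu_inverse(z, slope=0.01):
    """Exact inverse of leaky_relu for slope > 0."""
    return z if z >= 0.0 else z / slope


def forward(x_hat, gamma, beta, slope=0.01):
    """Affine transform followed by the activation; only this output is stored."""
    return [leaky_relu(gamma * v + beta, slope) for v in x_hat]


def recover_normalized_input(z, gamma, beta, slope=0.01):
    """Invert the activation and the affine transform (assumes gamma != 0)."""
    return [(leaky_relu_inverse(v, slope) - beta) / gamma for v in z]


# Hypothetical normalized activations for one channel.
x_hat = [-1.5, -0.2, 0.0, 0.7, 2.0]
z = forward(x_hat, gamma=2.0, beta=0.5)
recovered = recover_normalized_input(z, gamma=2.0, beta=0.5)
```

Because `recovered` matches `x_hat` up to floating-point error, the normalized input does not need to be stored separately for the backward pass, which is the source of the memory saving.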

[0018] Preferably, the working process of the cost volume construction module in the first stage includes:

[0019] Establish a standard plane-sweep volume: $L$ depth hypothesis layers are obtained by uniform sampling from a predefined depth interval $[d_{\min}, d_{\max}]$. The corresponding cost volume is obtained by warping the feature maps of the source views onto the reference image according to the pixel correspondence between them; this pixel correspondence is as follows:

[0020] $p_{i,l} = K_i \cdot (R_i \cdot (K^{-1} \cdot p \cdot d_l) + t_i)$; where $p_{i,l}$ is the pixel in the i-th source image corresponding to pixel $p$ of the reference image at the depth $d_l$ of the l-th layer; $K$ and $K_i$ are the intrinsic parameter matrices of the reference image and the i-th source image; and $R_i$ and $t_i$ are the rotation and translation between the reference image and the i-th source image.

[0021] Preferably, except for the first stage, the working process of the cost volume construction module in the remaining stages includes:

[0022] After dividing the feature channels into $G$ groups, calculate the similarity $S_i(p,l)^g$ of the g-th group between the feature $F(p)$ of the reference image and the feature map $F_i(p_{i,l})$ of the i-th source view warped onto the l-th depth hypothesis surface:

[0023] Where $H$ is the number of feature channels; $G$ is the number of feature channel groups;

[0024] Calculate the final similarity of each group between pixel $p$ and the l-th depth hypothesis surface, where $S_i(p,l)$ represents the similarity between the reference image feature at pixel $p$ and the feature map of the i-th source image at the l-th layer; $n$ represents the number of initial images; and $w_i(p)$ is the visibility mask of the i-th source image;

[0025] Calculate the cost volume of the i-th source image, which represents the final per-group similarity of the i-th source view at the l-th depth hypothesis surface;

[0026] Recalculate the cost volume $C$:

[0027] Beneficial Effects: In most known MVS methods, the cost volume is generated by warping all extracted feature maps onto the feature map of a reference image. This invention uses a different feature aggregation method and examines cost volume generation in depth. By mapping the features extracted from the source views to the reference view through a plane-sweep algorithm, multiple cost volumes are constructed at multiple scales. Furthermore, considering the specific characteristics of the construction module at each stage, different cost volume construction methods are designed for the first stage and for the subsequent stages. This construction method ensures both computational efficiency and the quality of the constructed cost volume.

[0028] Preferably, the adaptive aggregation module uses a similarity metric calculated by group-wise average correlation to represent the structural weight cost, and then uses a visibility perception network to determine whether pixels in the source image are visible.

[0029] The step of obtaining whether a pixel is visible in a view through the visibility perception network includes: inputting the similarity $S_i(p,l)$ between the reference image features $F(p)$ and the source image features $F_i(p_{i,l})$ into the visibility-aware network, which outputs the visibility mask of view $i$. Weights are shared across all pixels, and the visibility of each pixel is predicted independently. In the visibility mask, $w_i(p) = \max\{P_i(p,l) \mid l = 0, 1, \ldots, L-1\}$, where $P_i(p,l)$ represents the value of pixel $p$ in the i-th source image at the l-th depth hypothesis surface, and $L$ is the number of depth hypothesis surfaces at this stage.

[0030] Beneficial Effects: In existing technologies, such as MVSNet, the multi-view features of all views are fed to a variance-based cost metric regardless of pixel visibility. Unresolved visibility issues inevitably degrade the final reconstruction. To address this, this invention proposes a novel aggregation operation that learns the visibility of source view pixels in the reference image during cost aggregation, thereby achieving robustness.

[0031] Preferably, the adaptive aggregation module processes the cost volume using a 3D CNN, and then applies a softmax along the depth direction at the end of the 3D CNN to analyze the predicted depth of each pixel and obtain the corresponding probability volume.

[0032] Preferably, the formula for calculating the predicted depth $Q_k(p)$ of pixel $p$ in the k-th stage is: Where $L$ is the number of depth hypothesis surfaces in this stage; $Q_{k,l}$ represents the l-th hypothesis surface in the k-th stage; $Q_{k,l}(p)$ represents the value of $Q_{k,l}$ at pixel $p$; and $P_{k,l}(p)$ is the probability that the depth of pixel $p$ lies at $Q_{k,l}$.

[0033] Preferably, the adaptive aggregation module uses variance-based disparity range prediction to predict the spatially varying disparity range of each pixel and constructs a spatially varying depth hypothesis surface, specifically including:

[0034] Calculate the variance $v_k(p)$ of the probability distribution of pixel $p$ in stage $k$: where $P_{k,l}(p)$ is the probability that the depth of pixel $p$ lies at $Q_{k,l}$, and $Q_k(p)$ is the predicted depth of pixel $p$ at stage $k$; then calculate the corresponding standard deviation $\sigma_k(p)$.

[0035] Use a variance-based confidence interval to measure the disparity range prediction:

[0036] $c_k(p) = [Q_k(p) - \lambda\sigma_k(p),\ Q_k(p) + \lambda\sigma_k(p)]$; where $\lambda$ is a preset scalar parameter used to determine the size of the confidence interval;

[0037] Then, for each pixel $p$, uniformly sample $L_{k+1}$ depth values from the confidence interval $c_k(p)$ of the k-th stage to obtain the depth values $Q_{k+1,1}(p), Q_{k+1,2}(p), \ldots$ of the hypothesis surfaces at stage $k+1$ for that pixel, and construct the corresponding depth hypothesis surfaces.

[0038] Beneficial effects: This approach creates a probabilistic local space around the ground truth surface, within which the ground truth depth lies in the predicted disparity range with high confidence. Since the variance-based disparity range estimation is differentiable, the network of this invention can learn to adjust the probability predictions at each stage during end-to-end training, yielding optimized intervals and corresponding depth hypotheses for subsequent stages and thereby achieving efficient spatial partitioning.

[0039] Preferably, the depth map optimization module includes an attention-guided deep residual network; training the deep residual network includes optimizing the residuals that the network learns to output and adding these residuals to the depth map estimate.

[0040] The depth map optimization module works as follows. The reference image is passed through a 2D convolutional layer to obtain a feature map $D \in \mathbb{R}^{H \times W \times C}$, which is concatenated with the initial depth map $D_{pre}$ generated in the final stage along the spatial dimension to obtain the concatenated feature map $D_1$. Then, feature compression, global average pooling, and a 1×1 convolution are performed on $D_1$ to obtain a tensor $w_c$ of size $1 \times 1 \times C$, which represents the weights of the corresponding channels of the concatenated feature map $D_1$. The tensor $w_c$ is normalized using the sigmoid function and multiplied by the concatenated feature map $D_1$, so that each channel of $D_1$ is multiplied by its weight. Finally, the weighted concatenated feature map $D_1$ is added to the initial depth map $D_{pre}$ generated in the final stage to generate the optimized depth map $D_c$.
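
The channel-attention step in [0040] resembles squeeze-and-excitation style attention followed by a residual addition. The sketch below is only an illustration under several assumptions that the text does not confirm: the concatenation is treated as channel-wise, the 1×1 convolution on the pooled vector is modeled as a small linear map (`mix`, `bias`), and the weighted channels are summed into a single-channel residual before being added to `depth_pre`. All names are hypothetical.

```python
import math


def global_average_pool(feature_map):
    """feature_map[c][y][x] -> one mean value per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in feature_map]


def channel_weights(pooled, mix, bias):
    """Linear map on the pooled vector (a 1x1 convolution), then sigmoid."""
    weights = []
    for row, b in zip(mix, bias):
        s = sum(m * v for m, v in zip(row, pooled)) + b
        weights.append(1.0 / (1.0 + math.exp(-s)))
    return weights


def refine_depth(features, depth_pre, mix, bias):
    """Weight each channel, collapse to a residual, and add it to depth_pre."""
    stacked = features + [depth_pre]  # assumption: concatenate along channels
    weights = channel_weights(global_average_pool(stacked), mix, bias)
    refined = []
    for y in range(len(depth_pre)):
        row = []
        for x in range(len(depth_pre[0])):
            # assumption: the residual is the sum of the weighted channels
            residual = sum(w * ch[y][x] for w, ch in zip(weights, stacked))
            row.append(depth_pre[y][x] + residual)
        refined.append(row)
    return refined


# Hypothetical 2x2 example with one feature channel.
features = [[[1.0, 2.0], [3.0, 4.0]]]
depth_pre = [[5.0, 6.0], [7.0, 8.0]]
zero_mix = [[0.0, 0.0], [0.0, 0.0]]
closed = refine_depth(features, depth_pre, zero_mix, [-50.0, -50.0])
opened = refine_depth(features, depth_pre, zero_mix, [50.0, 50.0])
```

With strongly negative biases the sigmoid weights are close to zero and the output stays at `depth_pre`; with strongly positive biases every channel contributes fully to the residual.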

[0041] Beneficial effects: The depth map optimization module with residual and channel attention guidance fusion proposed in this invention can optimize the initial depth map of the last stage while minimizing the risk of gradient vanishing and gradient explosion in deep learning algorithms.

[0042] Preferably, the fusion unit is further configured to perform consistency screening on the 3D points generated from the optimized depth map, wherein the consistency screening specifically includes:

[0043] Obtain the depth value $d_i(p)$ of image $i$ at pixel $p$ in the optimized depth map; then, using the projection matrix $P_i = [M_i \mid t_i]$ of the camera parameters, back-project pixel $p$ of image $i$ into 3D space to generate the 3D point $T_{ref}(x,y,z)$:

[0044] Then, project the 3D point $T_{ref}(x,y,z)$ onto a neighborhood view of image $i$ to generate the projected pixel $q$: where $P_j$ represents the camera parameters of the neighborhood view and $d$ is the projection depth. Next, back-project the projected pixel $q$ of the neighborhood view into 3D space according to its estimated depth $d_j(q)$ and project it back onto the reference image to generate pixel $p'$: where $d'$ is the depth value at the reprojected pixel $p'$ on the reference image. The reprojection error of the pixel is calculated as $\xi_p = \|p - p'\|_2$; pixels of image $i$ with reprojection error $\xi_p > \theta_1$ are filtered out, where $\theta_1$ is the forward projection error threshold.

[0045] Then, back-project the pixel at the same location in the neighborhood view into 3D space according to its estimated depth $d_j(q)$ to generate the 3D point $T_{src}(x',y',z')$: Then, warp the 3D point of the neighborhood view onto image $I_i$ using a homography matrix to obtain the point $T_{proj}(x'',y'',z'')$ in 3D space, and calculate the reprojection error of point $T_{proj}$: $\xi_n = (x''-x)^2 + (y''-y)^2 + (z''-z)^2$. The 3D points $T_{proj}$ with reprojection error $\xi_n > \theta_2$ are filtered out, where $\theta_2$ is the back-projection error threshold;

[0046] Global multi-view geometric consistency $\eta(p)$ is obtained by aggregating the 3D point matching consistency from all neighboring views: where $n$ is the number of initial images. Then, the 3D points $T_{ref}$ corresponding to $\eta(p) \geq \tau$ are deleted, where $\tau$ is the global multi-view geometric consistency error threshold.
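
The pixel reprojection check of [0044] and the 3D point error of [0045] can be sketched as follows. This is an illustrative sketch, not the patented fusion algorithm: cameras are given as hypothetical `(M, t)` pairs matching $P = [M \mid t]$, `depth_nbr_at` is a hypothetical callable returning the neighbor's estimated depth at a pixel, and the aggregation into $\eta(p)$ is omitted because its formula is not given in the text.

```python
def mat_vec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[r][c] * v[c] for c in range(3)) for r in range(3)]


def inverse3(m):
    """Inverse of a 3x3 matrix via the adjugate (assumes nonzero determinant)."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    return [[(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
            [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
            [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det]]


def back_project(pixel, depth, m, t):
    """Solve depth * [u, v, 1] = M X + t for the 3D point X."""
    rhs = [pixel[0] * depth - t[0], pixel[1] * depth - t[1], depth - t[2]]
    return mat_vec(inverse3(m), rhs)


def project(point, m, t):
    """Return the pixel (u, v) and depth of a 3D point under P = [M | t]."""
    h = [a + b for a, b in zip(mat_vec(m, point), t)]
    return (h[0] / h[2], h[1] / h[2]), h[2]


def pixel_reprojection_error(p, depth_ref, cam_ref, cam_nbr, depth_nbr_at):
    """xi_p = ||p - p'||_2 after a reference -> neighbor -> reference round trip."""
    t_ref = back_project(p, depth_ref, *cam_ref)
    q, _ = project(t_ref, *cam_nbr)
    t_src = back_project(q, depth_nbr_at(q), *cam_nbr)
    p_back, _ = project(t_src, *cam_ref)
    return ((p[0] - p_back[0]) ** 2 + (p[1] - p_back[1]) ** 2) ** 0.5


def point_error(t_ref, t_proj):
    """xi_n: squared distance between two 3D points."""
    return sum((a - b) ** 2 for a, b in zip(t_ref, t_proj))


# Hypothetical cameras: shared intrinsics, neighbor shifted by 0.1 along x.
K = [[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]]
cam_ref = (K, [0.0, 0.0, 0.0])
cam_nbr = (K, [10.0, 0.0, 0.0])  # t = K * [0.1, 0, 0]
err_consistent = pixel_reprojection_error((60.0, 40.0), 2.0, cam_ref, cam_nbr, lambda q: 2.0)
err_inconsistent = pixel_reprojection_error((60.0, 40.0), 2.0, cam_ref, cam_nbr, lambda q: 2.5)
```

A pixel would then be kept only if its error does not exceed the threshold $\theta_1$; in this example the consistent neighbor depth gives an error near zero, while the inconsistent one shifts the reprojected pixel by one pixel.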

[0047] Beneficial effects: Compared with the depth map fusion methods in the prior art, this invention fully considers geometric consistency and combines the calculation of the pixel reprojection error and the 3D point reprojection error, which significantly improves the robustness, completeness, and accuracy of the reconstructed 3D point cloud.

Attached Figure Description

[0048] To make the objectives, technical solutions, and advantages of the invention clearer, the invention will now be described in further detail with reference to the accompanying drawings, wherein:

[0049] Figure 1 is a schematic diagram of the processing unit in the embodiment;

[0050] Figure 2 is a schematic diagram of the adaptive aggregation module in the embodiment;

[0051] Figure 3 is a schematic diagram of the depth hypothesis surface that varies spatially with the pixel;

[0052] Figure 4 is a schematic diagram of the attention-guided deep residual network in the depth map optimization module;

[0053] Figure 5 shows the comparison results of accuracy error and memory consumption in the experiment;

[0054] Figure 6 shows the results of accuracy error and runtime memory consumption in the experiment;

[0055] Figures 7-8 show the qualitative comparison results of the four networks tested;

[0056] Figure 9 shows the processing results of this model for the 22 scenes in the experiment;

[0057] Figure 10 shows a comparison of the depth maps in the experiment.

Detailed Implementation

[0058] The following detailed explanation illustrates the specific implementation methods:

[0059] Example:

[0060] This embodiment discloses a deep learning-based 3D object reconstruction system, including an input unit, a processing unit, a fusion unit, and a reconstruction unit.

[0061] The input unit is used to input the initial images for 3D reconstruction, which include source images and one reference image. For ease of later explanation, in specific implementation, the number of initial images is n, that is, the number of source images is n-1.

[0062] As shown in Figure 1, the processing unit includes a cascaded 3D reconstruction network and a depth map optimization module. The cascaded 3D reconstruction network is used to perform depth estimation in stages from low to high resolution. Each stage of the cascaded 3D reconstruction network includes a feature extraction module, a cost volume construction module, an adaptive aggregation module, and a depth map construction module. In specific implementation, the cascaded 3D reconstruction network includes three stages.

[0063] The feature extraction module is used to extract features from the initial image according to preset requirements, obtaining the corresponding feature map. The preset requirement is that the feature extraction modules of the stages extract features sequentially in order of increasing resolution. In specific implementation, the feature extraction module includes an encoder and a feature extractor. The encoder includes a set of convolutional layers, and the unified layer of the encoder is INPLACE-ABN. The encoder is used to downsample the initial image by convolution with a preset stride. The feature extractor is used to extract the feature maps from the decoder according to the preset requirements.

[0064] Previous methods typically employ multi-layer 2D CNN downsampling or UNet for feature extraction at a single resolution. To achieve high-resolution feature extraction by learning an upsampling process that appropriately merges information from lower resolutions, this invention proposes a multi-scale feature extractor. It first uses an eight-layer downsampling convolutional network similar to FPN and then, following UNet, uses feature information from the previous stage in each stage of multi-stage depth prediction, thus performing reasonable high-frequency feature extraction. The encoder consists of a set of convolutional layers and uses convolutions with stride=2 to downsample the original image size twice. Previous networks heavily utilize combinations of BN layers followed by activation layers, but existing deep learning frameworks suffer from inadequate memory management. This invention employs a novel unified layer (INPLACE-ABN), which replaces the common BN+Activation combination in deep networks (batch normalization followed by a nonlinear activation layer) with a single merged layer. It stores only a small buffer of results: some intermediate results are discarded in the forward pass, and all the quantities required during backpropagation are recovered from the buffer by inverting the forward computation. Theoretically, this achieves a 50% memory saving in convolutional layers without introducing significant computational overhead, with only a 0.8-2% increase in computation time. The input consists of a reference image and N-1 source images. The feature extractor extracts feature maps F1, F2, and F3 at three scales from the decoder for cost volume construction. Denoting the size of the original image as W×H, F1, F2, and F3 have resolutions of ... and W×H.

[0065] The cost volume construction module processes the feature map of this stage, obtains the visibility of each pixel, and constructs the corresponding cost volume. Specifically, the working process of the first-stage cost volume construction module includes:

[0066] Establish a standard plane-sweep volume: $L$ depth hypothesis layers are obtained by uniform sampling from a predefined depth interval $[d_{\min}, d_{\max}]$. The corresponding cost volume is obtained by warping the feature maps of the source views onto the reference image according to the pixel correspondence between them; this pixel correspondence is as follows:

[0067] $p_{i,l} = K_i \cdot (R_i \cdot (K^{-1} \cdot p \cdot d_l) + t_i)$; where $p_{i,l}$ is the pixel in the i-th source image corresponding to pixel $p$ of the reference image at the depth $d_l$ of the l-th layer; $K$ and $K_i$ are the intrinsic parameter matrices of the reference image and the i-th source image; and $R_i$ and $t_i$ are the rotation and translation between the reference image and the i-th source image.
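
The uniform depth sampling of [0066] and the pixel correspondence of [0067] can be sketched for a single pixel as follows. This is a minimal illustration, assuming zero-skew intrinsics and at least two depth layers; the helper names and camera values are hypothetical, and a real implementation would warp whole feature maps rather than one pixel.

```python
def mat_vec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[r][c] * v[c] for c in range(3)) for r in range(3)]


def invert_intrinsics(k):
    """Invert K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]] (zero skew assumed)."""
    fx, fy, cx, cy = k[0][0], k[1][1], k[0][2], k[1][2]
    return [[1.0 / fx, 0.0, -cx / fx],
            [0.0, 1.0 / fy, -cy / fy],
            [0.0, 0.0, 1.0]]


def uniform_depths(d_min, d_max, num_layers):
    """Sample num_layers >= 2 depth hypotheses uniformly in [d_min, d_max]."""
    step = (d_max - d_min) / (num_layers - 1)
    return [d_min + l * step for l in range(num_layers)]


def warp_pixel(p, depth, k_ref, k_src, r_src, t_src):
    """p_{i,l} = K_i (R_i (K^{-1} p d_l) + t_i), returned as a pixel (u, v)."""
    p_scaled = [p[0] * depth, p[1] * depth, depth]  # homogeneous pixel times depth
    cam_ref = mat_vec(invert_intrinsics(k_ref), p_scaled)
    cam_src = [a + b for a, b in zip(mat_vec(r_src, cam_ref), t_src)]
    proj = mat_vec(k_src, cam_src)
    return (proj[0] / proj[2], proj[1] / proj[2])


# Hypothetical setup: shared intrinsics, source camera shifted along x.
K = [[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.1, 0.0, 0.0]
depths = uniform_depths(1.0, 2.0, 3)
warped = [warp_pixel((50.0, 50.0), d, K, K, R, t) for d in depths]
```

In this example the reference pixel at the principal point maps to a source pixel whose horizontal offset shrinks as the hypothesized depth grows, which is the expected parallax behavior.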

[0068] Apart from the first stage, the working process of the cost volume construction module in the remaining stages includes:

[0069] After dividing the feature channels into $G$ groups, calculate the similarity $S_i(p,l)^g$ of the g-th group between the feature $F(p)$ of the reference image and the feature map $F_i(p_{i,l})$ of the i-th source view warped onto the l-th depth hypothesis surface:

[0070] Where $H$ is the number of feature channels; $G$ is the number of feature channel groups;

[0071] Calculate the final similarity of each group between pixel $p$ and the l-th depth hypothesis surface, where $S_i(p,l)$ represents the similarity between the reference image feature at pixel $p$ and the feature map of the i-th source image at the l-th layer; $n$ represents the number of initial images; and $w_i(p)$ is the visibility mask of the i-th source image;

[0072] Calculate the cost volume of the i-th source image, which represents the final per-group similarity of the i-th source view at the l-th depth hypothesis surface;

[0073] Recalculate the cost volume $C$:

[0074] In most known MVS methods, the cost volume is generated by warping all extracted feature maps onto the feature map of a reference image. This invention uses a different feature aggregation method and examines cost volume generation in depth. By mapping the features extracted from the source views to the reference view using a plane-sweep algorithm, multiple cost volumes are constructed at multiple scales. Furthermore, considering the specific characteristics of the construction module at each stage, different cost volume construction methods are designed for the first stage and for the subsequent stages. This construction method ensures both computational efficiency and the quality of the constructed cost volume.
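
The formulas for the group-wise similarity and its aggregation are not reproduced in the text of [0069]-[0073]. The sketch below therefore uses one commonly used form as an assumption: the per-group similarity is the mean of the element-wise products over the channels of that group, that is $(G/H)\langle F(p)^g, F_i(p_{i,l})^g \rangle$, and the views are combined by a visibility-weighted average. Function names and feature values are hypothetical.

```python
def groupwise_similarity(ref_feat, src_feat, num_groups):
    """Return one similarity per channel group for a single pixel and depth."""
    channels = len(ref_feat)
    assert channels == len(src_feat) and channels % num_groups == 0
    size = channels // num_groups
    sims = []
    for g in range(num_groups):
        lo, hi = g * size, (g + 1) * size
        dot = sum(a * b for a, b in zip(ref_feat[lo:hi], src_feat[lo:hi]))
        sims.append(dot * num_groups / channels)  # mean over the group's channels
    return sims


def aggregate_with_visibility(sims_per_view, weights, eps=1e-8):
    """Visibility-weighted average of the per-view group similarities."""
    total = sum(weights) + eps  # eps guards against all-zero weights
    num_groups = len(sims_per_view[0])
    return [sum(w * s[g] for w, s in zip(weights, sims_per_view)) / total
            for g in range(num_groups)]


# Hypothetical features with H = 4 channels split into G = 2 groups.
ref = [1.0, 2.0, 3.0, 4.0]
src_a = [1.0, 2.0, 3.0, 4.0]
src_b = [4.0, 3.0, 2.0, 1.0]
sims_a = groupwise_similarity(ref, src_a, 2)
sims_b = groupwise_similarity(ref, src_b, 2)
# View b is treated as occluded at this pixel (weight 0), so only view a counts.
aggregated = aggregate_with_visibility([sims_a, sims_b], [1.0, 0.0])
```

Setting the weight of a view to zero removes its contribution entirely, which is how an occluded source view would be kept from corrupting the aggregated similarity under this assumed form.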

[0075] The adaptive aggregation module is used to analyze and process the cost volume of this stage to obtain the corresponding probability volume, and then uses variance-based disparity range prediction to predict the spatially varying disparity range of each pixel and construct the spatially varying depth hypothesis surface.

[0076] In specific implementation, as shown in Figure 2, the adaptive aggregation module uses a similarity metric calculated by group-wise average correlation to represent the structural weight cost, and then uses a visibility-aware network to determine whether pixels in the source image are visible. Determining whether pixels in a view are visible through the visibility-aware network includes: inputting the similarity $S_i(p,l)$ between the reference image features $F(p)$ and the source image features $F_i(p_{i,l})$ into the visibility-aware network, which outputs the visibility mask of view $i$. Weights are shared across all pixels, and the visibility of each pixel is predicted independently. In the visibility mask, $w_i(p) = \max\{P_i(p,l) \mid l = 0, 1, \ldots, L-1\}$, where $P_i(p,l)$ represents the value of pixel $p$ in the i-th source image at the l-th depth hypothesis surface, and $L$ is the number of depth hypothesis surfaces at this stage. In existing technologies, such as MVSNet, the multi-view features of all views are fed to a variance-based cost metric regardless of pixel visibility. Unresolved visibility issues inevitably degrade the final reconstruction. To address this, this invention proposes a novel aggregation operation that learns the visibility of source view pixels in the reference image during cost aggregation, thereby achieving robustness.
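
The mask definition in [0076], where the visibility weight of a pixel is the maximum of $P_i(p,l)$ over the depth hypotheses, can be sketched as a reduction over the depth axis. The nested-list layout `scores[l][y][x]` and the sample values are assumptions made for illustration; the network that produces the scores is not modeled.

```python
def visibility_mask(scores):
    """scores[l][y][x] holds P_i(p, l); returns w_i(p) = max over l per pixel."""
    num_layers = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [[max(scores[l][y][x] for l in range(num_layers))
             for x in range(width)]
            for y in range(height)]


# Hypothetical network output: L = 3 depth hypotheses for a 1x2 image.
scores = [
    [[0.1, 0.9]],
    [[0.4, 0.2]],
    [[0.3, 0.5]],
]
mask = visibility_mask(scores)
```

A pixel that scores highly at some depth hypothesis receives a weight near that peak, while a pixel with uniformly low scores, as would be expected under occlusion, receives a low weight.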

[0077] The adaptive aggregation module processes the cost volume using a 3D CNN and applies a softmax along the depth direction at the end of the 3D CNN to analyze the predicted depth of each pixel, thereby obtaining the corresponding probability volume.

[0078] The predicted depth $Q_k(p)$ of pixel p in stage k is calculated as: $Q_k(p) = \sum_{l=1}^{L} Q_{k,l}(p)\, P_{k,l}(p)$, where L is the number of depth hypothesis surfaces in this stage; $Q_{k,l}$ denotes the l-th hypothesis surface in the k-th stage; $Q_{k,l}(p)$ denotes the value of $Q_{k,l}$ at pixel p; and $P_{k,l}(p)$ denotes the probability that the depth of pixel p is $Q_{k,l}(p)$.
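
A minimal NumPy sketch of these two steps follows, assuming that the predicted depth is the probability-weighted sum of the hypothesis depths and that a higher 3D CNN output score means a more likely hypothesis. The function names, the (L, H, W) array layout, and the toy depth values are illustrative assumptions.

```python
import numpy as np

def depth_softmax(scores):
    """Softmax over the depth axis (axis 0) of an (L, H, W) score volume."""
    shifted = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=0, keepdims=True)

def expected_depth(prob, hyp):
    """Probability-weighted sum of hypothesis depths; both inputs are (L, H, W)."""
    return (prob * hyp).sum(axis=0)

# Toy example: 3 hypotheses at 400, 500, 600 mm for a single pixel.
hyp = np.array([400.0, 500.0, 600.0]).reshape(3, 1, 1)
scores = np.zeros((3, 1, 1))   # equal scores give a uniform distribution
prob = depth_softmax(scores)
depth = expected_depth(prob, hyp)
```

With equal scores the distribution is uniform, so the predicted depth falls at the middle hypothesis.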

[0079] The adaptive aggregation module uses variance-based disparity range prediction to estimate the spatially varying disparity range of each pixel and constructs the spatially varying depth hypothesis surfaces, specifically including:

[0080] Calculate the variance $v_k(p)$ of the probability distribution of pixel p in stage k: $v_k(p) = \sum_{l=1}^{L} P_{k,l}(p)\,\bigl(Q_{k,l}(p) - Q_k(p)\bigr)^2$, where $P_{k,l}(p)$ denotes the probability that the depth of pixel p is $Q_{k,l}(p)$, and $Q_k(p)$ denotes the predicted depth of pixel p at stage k; the corresponding standard deviation is $\sigma_k(p) = \sqrt{v_k(p)}$.

[0081] Use variance-based confidence intervals to measure disparity range prediction:

[0082] $c_k(p) = [\,Q_k(p) - \lambda\sigma_k(p),\; Q_k(p) + \lambda\sigma_k(p)\,]$, where λ is a preset scalar parameter used to determine the size of the confidence interval;

[0083] Then, for each pixel p, $L_{k+1}$ depth values are uniformly sampled from the confidence interval $c_k(p)$ of the k-th stage to obtain the depth values $Q_{k+1,1}(p), Q_{k+1,2}(p), \ldots, Q_{k+1,L_{k+1}}(p)$ of the hypothesis surfaces at stage k+1 for that pixel, and the corresponding depth hypothesis surfaces are constructed.

[0084] In this way, the present invention constructs $L_{k+1}$ depth hypothesis surfaces $Q_{k+1,l}$ that vary spatially from pixel to pixel, as shown in Figure 3. This method defines a probabilistic local space around the ground truth surface, in which the ground truth depth lies within the disparity range with high confidence. Since the variance-based disparity range estimation is differentiable, the network of this invention can learn during end-to-end training to adjust the probability predictions at each stage so as to obtain optimized intervals and corresponding depth hypothesis surfaces for the subsequent stages, thereby achieving efficient spatial partitioning.
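
The variance-based range prediction can be sketched in NumPy as follows. The sketch assumes the predicted depth is the probability-weighted mean of the hypothesis depths, the variance is the probability-weighted squared deviation from that mean, and the uniform samples include both interval endpoints. The function name `next_stage_hypotheses` and the value λ = 1.0 are illustrative only.

```python
import numpy as np

def next_stage_hypotheses(prob, hyp, num_next, lam):
    """Sample num_next depth hypotheses per pixel from the confidence interval.

    prob, hyp: (L, H, W) probability volume and hypothesis depths of stage k.
    Returns an array of shape (num_next, H, W) for stage k+1.
    """
    q = (prob * hyp).sum(axis=0)                 # predicted depth Q_k(p)
    var = (prob * (hyp - q) ** 2).sum(axis=0)    # variance v_k(p)
    sigma = np.sqrt(var)                         # standard deviation sigma_k(p)
    lower, upper = q - lam * sigma, q + lam * sigma
    steps = np.linspace(0.0, 1.0, num_next).reshape(-1, 1, 1)
    return lower + steps * (upper - lower)       # endpoints included (assumption)

hyp = np.array([400.0, 500.0, 600.0]).reshape(3, 1, 1)
prob = np.array([0.25, 0.5, 0.25]).reshape(3, 1, 1)
next_hyp = next_stage_hypotheses(prob, hyp, num_next=3, lam=1.0)
```

A sharply peaked distribution gives a small standard deviation and therefore a narrow interval, so confident pixels are refined over a finer depth range than uncertain ones.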

[0085] The depth map construction module is used to predict the corresponding initial depth map based on the probability volume. The depth map optimization module is used to optimize the initial depth map of the final stage to obtain the optimized depth map. In specific implementations, the depth map optimization module includes an attention-guided deep residual network, as shown in Figure 4. Training the deep residual network involves optimizing it to learn an output residual, which is added to the depth map estimate of the deep residual network.

[0086] As shown in Figure 5, the working process of the depth map optimization module includes: the reference image is passed through a 2D convolutional layer to obtain a feature map $D \in \mathbb{R}^{H \times W \times C}$, which is concatenated along the spatial dimension with the initial depth map $D_{pre}$ generated in the final stage to obtain the concatenated feature map D1. Then, feature compression, global average pooling, and a 1×1 convolution are applied to D1 to obtain a tensor $w_c \in \mathbb{R}^{1 \times 1 \times C}$, which represents the weights of the corresponding channels of the concatenated feature map D1. The tensor $w_c$ is then normalized with the sigmoid function and multiplied by the concatenated feature map D1, so that each channel of D1 is multiplied by its weight. Finally, the weighted concatenated feature map D1 is added to the initial depth map $D_{pre}$ generated in the final stage to generate the optimized depth map $D_c$.
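
The channel-weighting step alone can be sketched in NumPy as below. This covers only global average pooling, a 1×1 convolution, sigmoid normalization, and per-channel multiplication. The feature map D1 is taken as a given (H, W, C) array; how D1 is assembled, the feature compression step, and how the weighted map is combined with the initial depth map are not modeled. The bias-free convolution weights `conv_weight` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(d1, conv_weight):
    """Weight each channel of d1 by a learned scalar in (0, 1).

    d1: (H, W, C) feature map.
    conv_weight: (C, C) weights of a bias-free 1x1 convolution (hypothetical).
    """
    pooled = d1.mean(axis=(0, 1))          # global average pooling -> (C,)
    w_c = sigmoid(pooled @ conv_weight)    # 1x1 convolution, then sigmoid
    return d1 * w_c, w_c                   # broadcast over H and W

d1 = np.ones((2, 2, 3))
weighted, w_c = channel_attention(d1, np.zeros((3, 3)))
```

On a pooled 1×1×C tensor a 1×1 convolution reduces to a matrix product, which is why the sketch uses `pooled @ conv_weight`.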

[0087] The depth map optimization module proposed in this invention, which integrates residual and channel attention guidance, can optimize the initial depth map of the last stage while minimizing the risk of gradient vanishing and gradient explosion in deep learning algorithms.

[0088] The fusion unit is used to generate a 3D dense point cloud based on the optimized depth map. The fusion unit is also used to perform consistency screening on the 3D points generated from the optimized depth map, specifically including:

[0089] Obtain the depth value $d_i(p)$ of image i at pixel p in the optimized depth map; then, using the projection matrix $P_i = [M_i \mid t_i]$ of the camera parameters, back-project pixel p of image i into 3D space to generate the 3D point $T_{ref}(x, y, z)$.
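
A minimal sketch of such a back-projection is given below. It assumes the usual pinhole convention in which a 3D point X projects as d·[u, v, 1]ᵀ = M·X + t, so that X = M⁻¹(d·[u, v, 1]ᵀ − t); this convention, the function name `back_project`, and the toy camera values are assumptions and are not stated in the text.

```python
import numpy as np

def back_project(pixel, depth, M, t):
    """Back-project pixel (u, v) with the given depth into 3D space.

    Assumes the projection d * [u, v, 1]^T = M @ X + t for P = [M | t].
    """
    p_h = np.array([pixel[0], pixel[1], 1.0])   # homogeneous pixel
    return np.linalg.solve(M, depth * p_h - t)  # X = M^{-1} (d * p_h - t)

# Toy camera: focal length 500 px, principal point (320, 240), no translation.
M = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
t = np.zeros(3)
T_ref = back_project((320.0, 240.0), 600.0, M, t)
```

The principal-point pixel at depth 600 lands on the optical axis, which is a quick sanity check on the convention.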

[0090] Then, the 3D point $T_{ref}(x, y, z)$ is projected onto a neighboring view of image i to generate the projected pixel q, where $P_j$ denotes the camera parameters of the neighboring view and d is the projection depth. Next, the projected pixel q of the neighboring view is back-projected into 3D space according to its estimated depth $d_j(q)$ and projected back onto the reference image to generate pixel p', where d' is the depth value at the reprojected pixel p' on the reference image. The reprojection error of the pixel is calculated as $\xi_p = \lVert p - p' \rVert_2$. Pixels in image i whose reprojection error satisfies $\xi_p > \theta_1$ are filtered out, where $\theta_1$ is the forward projection error threshold.
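
The round trip described above (reference pixel to 3D, to the neighboring view, back to 3D using the neighbor's depth, and back to the reference image) can be sketched as follows. The pinhole convention d·[u, v, 1]ᵀ = M·X + t, the helper names, the callable `depth_src` standing in for a lookup into the neighbor's depth map, the toy cameras, and the threshold value are all assumptions.

```python
import numpy as np

def back_project(pixel, depth, M, t):
    p_h = np.array([pixel[0], pixel[1], 1.0])
    return np.linalg.solve(M, depth * p_h - t)

def project(X, M, t):
    x = M @ X + t
    return x[:2] / x[2], x[2]   # pixel coordinates and projection depth

def pixel_reprojection_error(p, d_ref, cam_ref, cam_src, depth_src):
    """Return (xi_p, d') for reference pixel p with estimated depth d_ref."""
    X_ref = back_project(p, d_ref, *cam_ref)          # 3D point from reference
    q, _ = project(X_ref, *cam_src)                   # pixel q in neighbor view
    X_src = back_project(q, depth_src(q), *cam_src)   # lift q with d_j(q)
    p_prime, d_prime = project(X_src, *cam_ref)       # back onto reference
    xi_p = float(np.linalg.norm(np.asarray(p, dtype=float) - p_prime))
    return xi_p, d_prime

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
cam_ref = (K, np.zeros(3))
cam_src = (K, K @ np.array([-50.0, 0.0, 0.0]))  # neighbor shifted sideways
theta1 = 1.0                                     # hypothetical threshold (pixels)

err_ok, _ = pixel_reprojection_error((320.0, 240.0), 600.0, cam_ref, cam_src, lambda q: 600.0)
err_bad, _ = pixel_reprojection_error((320.0, 240.0), 600.0, cam_ref, cam_src, lambda q: 630.0)
keep_ok, keep_bad = err_ok <= theta1, err_bad <= theta1
```

When the neighbor's depth agrees with the reference depth the pixel returns to its starting position; a 30 mm disagreement moves it by about two pixels in this toy setup, so it is filtered out.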

[0091] Then, the pixel at the same location in the neighboring view is back-projected into 3D space according to its estimated depth $d_j(q)$ to generate the 3D point $T_{src}(x', y', z')$. The 3D point of the neighboring view is then warped onto image $I_i$ using a homography matrix to obtain the point $T_{proj}(x'', y'', z'')$ in 3D space, and the reprojection error of point $T_{proj}$ is calculated as $\xi_n = (x'' - x)^2 + (y'' - y)^2 + (z'' - z)^2$. 3D points $T_{proj}$ whose reprojection error satisfies $\xi_n > \theta_2$ are filtered out, where $\theta_2$ is the back projection error threshold;
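
The 3D point check reduces to a squared distance and a threshold, sketched below in NumPy. The points are assumed to be stacked as (N, 3) arrays that have already been brought into a common coordinate frame; the function name, the toy coordinates, and the threshold value are illustrative.

```python
import numpy as np

def point_reprojection_filter(T_ref, T_proj, theta2):
    """Return (xi_n, keep_mask) for N point pairs given as (N, 3) arrays.

    xi_n is the squared Euclidean distance, following the formula in the text
    (no square root is taken).
    """
    xi_n = ((T_proj - T_ref) ** 2).sum(axis=-1)
    return xi_n, xi_n <= theta2

T_ref = np.array([[0.0, 0.0, 600.0], [10.0, 0.0, 500.0]])
T_proj = np.array([[0.0, 0.0, 600.5], [13.0, 4.0, 500.0]])
xi_n, keep = point_reprojection_filter(T_ref, T_proj, theta2=1.0)
```

Because the error is a squared distance, the threshold θ2 carries squared length units, unlike the pixel threshold θ1.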

[0092] The global multi-view geometric consistency $\eta(p)$ is obtained by aggregating the 3D point matching consistency from all neighboring views, where n is the number of initial images; then, the 3D points $T_{ref}$ corresponding to $\eta(p) \ge \tau$ are deleted, where τ is the global multi-view geometric consistency error threshold.

[0093] Compared with existing depth map fusion methods, this invention fully considers geometric consistency and combines the pixel reprojection error with the 3D point reprojection error, thus significantly improving the robustness, completeness and accuracy of the reconstructed 3D point cloud.

[0094] The reconstruction unit is used to process 3D dense point clouds to obtain a reconstructed 3D view.

[0095] Compared with existing technologies, this invention proposes an adaptive cost aggregation method that incorporates visibility awareness. In the cost volume generation stage, a similarity measurement method is employed, using a visibility awareness network to determine the visibility of pixels in the view. Based on variance prediction, the disparity range per pixel is divided into small, learned intervals, and depth estimation is performed in stages from low to high resolution. In the final stage, a depth map optimization module is proposed, using residual and channel attention-guided fusion to achieve reconstruction in a coarse-to-fine manner. An improved depth map fusion algorithm is also employed, combining pixel and 3D point reprojection errors for geometric consistency checks.

[0096] Experiments demonstrate that, based on quantitative and qualitative comparisons with other methods on the DTU dataset, the proposed method can reconstruct scenes with better detail while achieving the goal of reducing GPU memory consumption and computation time. The specific details of the experiments are as follows:

[0097] Experimental setup

[0098] The network in this invention is trained on the DTU dataset. The DTU dataset contains a wide variety of scenes and objects, including very similar scenes such as house models, allowing for the exploration of intra-class variability. The dataset is divided into training, validation, and test sets. Following previous deep learning-based methods, scenes {3,5,17,21,28,35,37,38,40,43,56,59,66,67,82,86,106,117} are used as the validation set, scenes {1,4,9,10,11,12,13,15,23,24,29,32,33,34,48,49,62,75,77,110,114,118} are used as the test set, and the training set consists of the remaining 78 scenes. The training input images have a resolution of 640×512 and the number of views is 3; plane-sweep volumes are constructed using N1=64, N2=32, and N3=8. The complete three-stage network was trained end-to-end for 30 epochs with an initial learning rate of 0.0016.

[0099] Experimental conclusions

[0100] The system was evaluated on the DTU test set with n=5 views, image size W=1600, H=1184, and initial depth range $d_{min}$ = 425 mm, $d_{max}$ = 933.8 mm. The distance metric in MSG-Net was used to compare the accuracy of the final reconstruction. The following metrics were computed on point cloud models, with the ground truth being the point cloud model obtained by structured light scanning. Completeness (Comp.) is the distance from each point of the structured light scanning model to the nearest point of the MVS-reconstructed model; accuracy (Acc.) is the distance from each MVS-reconstructed point within the visible mask to the nearest point of the structured light scanning model; Overall measures the combined performance of accuracy and completeness. Traditional methods and learning-based methods were compared, and the quantitative results are shown in Table 1. Although Gipuma [27] performed best in terms of accuracy, the method of this invention outperformed the other methods in terms of completeness and achieved competitive performance in terms of overall quality. Note that, with the same input, the depth map size predicted by MVSNet and R-MVSNet is only 1. The final depth map of the present invention is estimated at the original image size, which has a much higher resolution and results in significantly better completeness.
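
The two distance metrics can be illustrated with a brute-force NumPy sketch. It assumes each metric is reported as the mean nearest-neighbor distance and that Overall is the average of accuracy and completeness; the text does not state either detail. The visible mask and any outlier handling used in the actual DTU evaluation are omitted, and the function names are hypothetical.

```python
import numpy as np

def mean_nn_distance(src, dst):
    """Mean distance from each point in src (N, 3) to its nearest point in dst (M, 3)."""
    pairwise = np.linalg.norm(src[:, None, :] - dst[None, :, :], axis=-1)  # (N, M)
    return pairwise.min(axis=1).mean()

def reconstruction_scores(recon, gt):
    acc = mean_nn_distance(recon, gt)    # reconstruction -> ground truth
    comp = mean_nn_distance(gt, recon)   # ground truth -> reconstruction
    overall = 0.5 * (acc + comp)         # assumed definition of Overall
    return acc, comp, overall

gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
recon = np.array([[0.0, 0.0, 0.1]])
acc, comp, overall = reconstruction_scores(recon, gt)
```

In the toy example the single reconstructed point is accurate but leaves one ground truth point uncovered, so completeness is much worse than accuracy. The pairwise distance matrix is only practical for small clouds; a spatial index would be needed at DTU scale.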

[0101] Table 1. Quantitative results of reconstruction quality on the DTU evaluation dataset (lower is better).

[0103]

[0104] Meanwhile, the overall accuracy error of 3D reconstruction was compared against memory consumption and against runtime. Compared with other methods, the model of this invention has relatively lower memory consumption and runtime. As shown in Figures 5 and 6, the model's memory consumption and runtime are reduced by 36.64% and 22.95% compared to CasMVSNet, 39.54% and 61.48% compared to CVP-MVSNet, and 14.84% and 16.07% compared to UCSNet, respectively. At the same time, the overall error is reduced by 9.30%, 6.40%, and [missing data] compared to CasMVSNet, CVP-MVSNet, and UCSNet, respectively. Regarding the quality of the generated point cloud, the 3D reconstruction results of this invention are qualitatively compared with UCSNet, CasMVSNet, and the ground truth on scan15, scan23, and scan32 of the DTU dataset. As shown in Figures 7-8, (a) is CasMVSNet, (b) is UCSNet, (c) is the ground truth, and (d) is the present invention. These examples achieve considerable completeness, and because the method can handle high input resolution, the results of the present invention are denser, with finer details in weakly textured areas such as doors, banners, and beverage bottles; the occluded areas in Figure 7 are also reconstructed well and can be more easily identified from the 3D reconstruction results.

[0105] To demonstrate the effectiveness of the trained network model, depth maps were predicted for scenes 1, 4, 9, 10, 11, 12, 13, 15, 23, 24, 29, 32, 33, 34, 48, 49, 62, 75, 77, 110, 114, and 118 of the DTU dataset and then converted into point cloud models for demonstration. A total of 22 scenes are presented, as shown in Figure 9.

[0106] Ablation Experiment Analysis

[0107] Ablation experiments and quantitative analyses are provided to evaluate the advantages and limitations of the key components of the framework of this invention, including adaptive cost aggregation, the depth map optimization module, and the improved depth map fusion algorithm. In all of the following studies, experiments were conducted and evaluated on the DTU dataset, and accuracy and completeness were used to measure reconstruction quality. The number of groups G was set to 4, and all other settings were the same as those used previously. The results are shown in Table 2.

[0108] Table 2 Comparison of Model Ablation Experiments

[0110]

[0111] Table 2 shows that the algorithm proposed in this invention significantly improves performance over the baseline network. Meanwhile, as shown in Figure 10, for the adaptive cost aggregation method, a depth map with an output size of 1200x1986 was visualized to illustrate the network's perception of global and visible information in the image. From left to right: RGB image; depth map of the baseline network; depth map using adaptive cost aggregation; depth map of the network model after adding the depth map optimization module. The comparison shows that the network model proposed in this invention produces a more complete depth map with fewer holes and clearer edges, resulting in better prediction results.

[0112] The results show that this invention reduces GPU memory consumption and improves computation speed while enhancing the overall accuracy of the prediction results. The reconstruction of point cloud text, and of scene details in weakly textured and occluded areas, is significantly improved. Compared with most learning-based MVS methods, this invention achieves competitive performance.

[0113] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and not to limit the technical solutions. Those skilled in the art should understand that any modifications or equivalent substitutions to the technical solutions of the present invention without departing from the spirit and scope of the present invention should be covered within the scope of the claims of the present invention.

Claims

1. A deep learning-based 3D object reconstruction system, characterized in that: it includes an input unit, a processing unit, a fusion unit, and a reconstruction unit; the input unit is used to input an initial image for 3D reconstruction, the initial image including a source image and a reference image;
the processing unit includes a cascaded 3D reconstruction network and a depth map optimization module; the cascaded 3D reconstruction network is used to perform depth estimation in stages from low to high resolution; each stage of the cascaded 3D reconstruction network includes a feature extraction module, a cost volume construction module, an adaptive aggregation module, and a depth map construction module; the feature extraction module extracts features from the initial image according to preset requirements to obtain the corresponding feature map, the preset requirements being that the feature extraction modules of the stages extract features sequentially in order of increasing resolution; the cost volume construction module processes the feature map of this stage to obtain the visibility of each pixel and construct the corresponding cost volume; the adaptive aggregation module analyzes and processes the cost volume of this stage to obtain the corresponding probability volume, then uses variance-based disparity range prediction to predict the spatially varying disparity range of each pixel and constructs the spatially varying depth hypothesis surfaces; the depth map construction module predicts the corresponding initial depth map based on the probability volume; if the cost volume construction module does not belong to the first stage of the cascaded 3D reconstruction network, the cost volume construction module constructs the cost volume based on the feature map of this stage and the depth hypothesis surfaces of the previous stage; the depth map optimization module optimizes the initial depth map of the last stage to obtain an optimized depth map;
the fusion unit is used to generate a 3D dense point cloud based on the optimized depth map; the reconstruction unit is used to process the 3D dense point cloud to obtain a reconstructed 3D view;
wherein the adaptive aggregation module using variance-based disparity range prediction to predict the spatially varying disparity range of each pixel and constructing the spatially varying depth hypothesis surfaces includes: calculating the variance $v_k(p)$ of the probability distribution of pixel p in stage k: $v_k(p) = \sum_{l=1}^{L} P_{k,l}(p)\,\bigl(Q_{k,l}(p) - Q_k(p)\bigr)^2$; wherein $P_{k,l}(p)$ denotes the probability that the depth of pixel p is $Q_{k,l}(p)$; $Q_k(p)$ denotes the predicted depth of pixel p at stage k; the corresponding standard deviation is $\sigma_k(p) = \sqrt{v_k(p)}$; L represents the number of depth hypothesis surfaces in this stage; $Q_{k,l}$ denotes the l-th hypothesis surface in the k-th stage; $Q_{k,l}(p)$ denotes the value of $Q_{k,l}$ at pixel p; using a variance-based confidence interval to measure the disparity range prediction: $c_k(p) = [\,Q_k(p) - \lambda\sigma_k(p),\; Q_k(p) + \lambda\sigma_k(p)\,]$, where λ is a preset scalar parameter used to determine the size of the confidence interval; then, for each pixel p, uniformly sampling $L_{k+1}$ depth values from the confidence interval $c_k(p)$ of the k-th stage to obtain the depth values $Q_{k+1,1}(p), Q_{k+1,2}(p), \ldots, Q_{k+1,L_{k+1}}(p)$ of the hypothesis surfaces of the pixel in stage k+1, and constructing the corresponding depth hypothesis surfaces;
the depth map optimization module includes an attention-guided deep residual network; the training of the deep residual network includes optimizing the deep residual network to learn an output residual and adding the residual to the depth map estimate of the deep residual network; the working process of the depth map optimization module includes: passing the reference image through a 2D convolutional layer to obtain a feature map $D \in \mathbb{R}^{H \times W \times C}$, which is concatenated along the spatial dimension with the initial depth map $D_{pre}$ generated in the final stage to obtain the concatenated feature map D1; then applying feature compression, global average pooling, and a 1×1 convolution to D1 to obtain a tensor $w_c \in \mathbb{R}^{1 \times 1 \times C}$, the tensor $w_c$ representing the weights of the corresponding channels of the concatenated feature map D1; then normalizing the tensor $w_c$ with the sigmoid function and multiplying it by the concatenated feature map D1, so that each channel of D1 is multiplied by its weight; finally, adding the weighted concatenated feature map D1 to the initial depth map $D_{pre}$ generated in the final stage to generate the optimized depth map $D_c$;
the fusion unit is also used to perform consistency screening on the 3D points generated from the optimized depth map, the consistency screening specifically including: obtaining the depth value $d_i(p)$ of image i at pixel p in the optimized depth map, and then, using the projection matrix $P_i = [M_i \mid t_i]$ of the camera parameters, back-projecting pixel p of image i into 3D space to generate the 3D point $T_{ref}(x, y, z)$; then projecting the 3D point $T_{ref}(x, y, z)$ onto a neighboring view of image i to generate the projected pixel q, wherein $P_j$ denotes the camera parameters of the neighboring view and d is the projection depth; then back-projecting the projected pixel q of the neighboring view into 3D space according to its estimated depth $d_j(q)$ and projecting it back onto the reference image to generate pixel p', wherein d' is the depth value at the reprojected pixel p' on the reference image; calculating the reprojection error of the pixel: $\xi_p = \lVert p - p' \rVert_2$; filtering out the pixels in image i whose reprojection error satisfies $\xi_p > \theta_1$, wherein $\theta_1$ is the forward projection error threshold; then back-projecting the pixel at the same location in the neighboring view into 3D space according to its estimated depth $d_j(q)$ to generate the 3D point $T_{src}(x', y', z')$; then warping the 3D point of the neighboring view onto image $I_i$ using a homography matrix to obtain the point $T_{proj}(x'', y'', z'')$ in 3D space, and calculating the reprojection error of point $T_{proj}$: $\xi_n = (x'' - x)^2 + (y'' - y)^2 + (z'' - z)^2$; filtering out the 3D points $T_{proj}$ whose reprojection error satisfies $\xi_n > \theta_2$, wherein $\theta_2$ is the back projection error threshold; obtaining the global multi-view geometric consistency $\eta(p)$ by aggregating the 3D point matching consistency from all neighboring views, wherein n is the number of initial images; then deleting the 3D points $T_{ref}$ corresponding to $\eta(p) \ge \tau$, wherein τ is the global multi-view geometric consistency error threshold.

2. The deep learning-based 3D object reconstruction system as described in claim 1, characterized in that: the feature extraction module includes an encoder and a feature extractor; the encoder includes a set of convolutional layers, with the normalization layer of the encoder being INPLACE-ABN, and the encoder is used to downsample the initial image size by convolution with a preset stride; the feature extractor is used to extract feature maps from the decoder according to the preset requirements.

3. The deep learning-based 3D object reconstruction system as described in claim 2, characterized in that: the working process of the cost volume construction module in the first stage includes: establishing a standard plane-sweep volume by uniformly sampling L hypothetical depth layers from a predefined depth interval; and obtaining the corresponding cost volume by warping the feature maps of the source views according to the pixel correspondence between the reference image and the feature maps of the source views; wherein the pixel correspondence gives, for pixel p of the reference image under the l-th depth hypothesis, the corresponding pixel $p_{i,l}$ in the i-th source image, and is determined by the intrinsic parameter matrices of the reference image and the i-th source image and by the rotation and translation between the reference image and the i-th source image.

4. The deep learning-based 3D object reconstruction system as described in claim 3, characterized in that: apart from the first stage, the working process of the cost volume construction modules in the remaining stages includes: after dividing the feature channels into G groups, calculating the similarity in group g between the features $F(p)$ of the reference image and the feature map $F_i(p_{i,l})$ of the i-th source view warped onto the l-th depth hypothesis surface, wherein C represents the number of feature channels and G represents the number of feature channel groups; calculating the final similarity between pixel p and the l-th depth hypothesis over all groups, wherein $S_i(p,l)$ denotes the similarity between the features of the reference image and of the i-th source image on the l-th layer feature maps, n represents the number of initial images, and $w_i(p)$ is the visibility mask for the i-th source image; calculating the cost volume of the i-th source image from the final similarity of the i-th source view at the l-th depth hypothesis surface; and recalculating the cost volume.

5. The deep learning-based 3D object reconstruction system as described in claim 4, characterized in that: the adaptive aggregation module uses a similarity metric calculated by group-wise average correlation to represent the structural weight cost, and then uses a visibility-aware network to determine whether pixels in the source image are visible; determining whether pixels in a view are visible through the visibility-aware network includes: inputting the similarity $S_i(p,l)$ between the reference image features $F(p)$ and the source image features $F_i(p_{i,l})$ into the visibility-aware network and outputting the visibility mask of view i; furthermore, weights are shared across all pixels, and the visibility of each pixel is predicted independently; in the visibility mask, $w_i(p) = \max\{P_i(p,l) \mid l = 0, 1, \ldots, L-1\}$; wherein $P_i(p,l)$ represents the pixel value of pixel p in the i-th source image at the l-th depth hypothesis surface; L is the number of depth hypothesis surfaces in this stage.

6. The deep learning-based 3D object reconstruction system as described in claim 5, characterized in that: The adaptive aggregation module processes the volume using a 3D CNN, and then applies a softmax algorithm in the depth direction at the end of the 3D CNN to analyze the predicted depth of each pixel, thus obtaining the corresponding probability volume.

7. The deep learning-based 3D object reconstruction system as described in claim 6, characterized in that: the predicted depth $Q_k(p)$ of pixel p in stage k is calculated as: $Q_k(p) = \sum_{l=1}^{L} Q_{k,l}(p)\, P_{k,l}(p)$, where L is the number of depth hypothesis surfaces in this stage; $Q_{k,l}$ denotes the l-th hypothesis surface in the k-th stage; $Q_{k,l}(p)$ denotes the value of $Q_{k,l}$ at pixel p; and $P_{k,l}(p)$ denotes the probability that the depth of pixel p is $Q_{k,l}(p)$.