Depth-prior-guided multi-view stereo 3D reconstruction method

The depth-prior-guided multi-view stereo 3D reconstruction method uses a Transformer to generate depth prior features at different resolution stages. This addresses the inaccurate depth estimation of existing methods in complex scenes and yields more efficient and stable 3D reconstruction.

CN120125635B · Active · Publication Date: 2026-06-30 · XIDIAN UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: XIDIAN UNIV
Filing Date: 2025-01-23
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing multi-view stereo 3D reconstruction methods suffer from inaccurate depth estimation when dealing with complex scenes such as drastic lighting changes and occlusion. Existing cost aggregation methods lack adaptability, leading to a decrease in reconstruction accuracy and stability.

Method used

A depth-prior-guided multi-view stereo 3D reconstruction method is adopted. Multi-view features are obtained through a feature extraction network, and depth prior features are generated at different resolution stages using Transformer. Depth maps are estimated step by step, and cost volumes are aggregated by combining depth priors and feature maps to generate dense 3D point clouds.

Benefits of technology

It improves the quality of depth maps and point clouds, avoids error propagation introduced by external models, improves the efficiency and stability of the model, and generates more accurate and complete 3D point clouds.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention provides a depth prior-guided multi-view stereo 3D reconstruction method. It acquires N original images of the target to be reconstructed from different perspectives; it iterates through the N original images, using each image as a reference image and the remaining N-1 images as source images, inputting them into a trained depth map generation network to obtain N depth maps; and then fuses these N depth maps to obtain a 3D point cloud of the target to be reconstructed. Based on its network architecture, the subsequent depth map generation module uses the depth maps output by the preceding depth map generation module to generate a depth prior. This depth prior guides the Transformer in the subsequent depth map generation module to complete cost volume aggregation, generating the final depth map, thereby improving the final quality of the depth map and point cloud.

Description

Technical Field

[0001] This invention relates to the field of image technology, and more specifically to a depth-prior-guided multi-view stereo 3D reconstruction method.

Background Technology

[0002] Multi-view stereo matching (MVS) is a technique for recovering the 3D structure of a scene from multiple 2D images, widely used in computer vision, remote sensing, robot navigation, and 3D reconstruction. Traditional MVS methods primarily rely on geometric algorithms to estimate the depth and structure of a 3D scene by matching feature points or regions in multiple images with significantly different viewpoints. However, traditional methods often face problems such as unstable matching and error propagation when dealing with lighting, occlusion, and large-scale changes, leading to a decline in the quality of reconstructed dense point clouds.

[0003] Despite significant progress in deep learning-based MVS methods across various fields, inaccurate depth estimation remains a challenge under difficult conditions, such as scenes with complex geometry, specular reflection, and varying lighting. In these scenarios, cost aggregation is a crucial step affecting depth estimation accuracy. However, existing cost aggregation methods (such as simple additive aggregation or variance aggregation) have inherent limitations: they fail to adequately account for differences between views and lack adaptive handling of factors like occlusion and lighting variations. This makes cost aggregation ineffective at suppressing noise in scenes with drastic geometric and lighting changes, leading to depth estimation bias and consequently impacting the accuracy and stability of 3D reconstruction.

[0004] Therefore, improving the adaptability of the cost aggregation process, especially in scenarios with drastic changes in geometry and lighting, has become one of the main technical challenges faced by current deep learning-based MVS methods.

[0005] Currently, multi-view stereo 3D reconstruction methods such as MVSNet and CasMVSNet typically aggregate all cost volumes using variance aggregation, while DPSNet aggregates them by simple addition. Whether variance or additive aggregation is used, the core principle is to treat the contributions of all views equally. To reduce invalid matches in cost volume aggregation, PVA-MVSNet uses gated convolution to aggregate cost volumes adaptively. This method adjusts the weights of occluded regions in the matching process through a weight map, so that occluded regions receive relatively small weights. The weight map is generated from cost volume information and follows a self-attention mechanism; this adaptive approach depends on trained parameters rather than purely on dynamic updates from the data itself. Vis-MVSNet, on the other hand, explicitly introduces a visibility measure for cost volumes, derived from the uncertainty or confidence of the probability distribution. Some researchers have also proposed an adaptive aggregation module to make cost volume aggregation more reliable, while adding an edge detection branch to constrain the consistency of edge features along the epipolar direction.

[0006] However, existing algorithms have significant limitations. Some simply aggregate cost volumes, which often leads to invalid matching relationships. While some methods aggregate cost volumes adaptively, the aggregation operation itself often requires additional training of adaptive parameters. Other methods introduce external models, such as monocular depth estimation networks or edge detection networks, to enhance the aggregation effect, but this may introduce error propagation from the external models.

Summary of the Invention

[0007] To address the aforementioned problems in the existing technology, this invention provides a depth-prior-guided multi-view stereo 3D reconstruction method, specifically comprising:

[0008] In a first aspect, the present invention provides a depth-prior-guided multi-view stereo 3D reconstruction method, comprising:

[0009] Obtain N original images of the target to be reconstructed, taken from different perspectives, where N is a positive integer greater than or equal to 2;

[0010] Iterate through the N original images obtained, take each original image as a reference image, and take the remaining N-1 original images as source images. Input them into the trained depth map generation network to obtain N depth maps.

[0011] After fusing N depth maps, a 3D point cloud of the target to be reconstructed is obtained;

[0012] The depth map generation network includes a feature extraction network, a coarse resolution depth map generation module, a first refined resolution depth map generation module, and a second refined resolution depth map generation module.

[0013] During the generation of any first depth map among N depth maps:

[0014] The feature extraction network is used to obtain three different scales of reference feature maps corresponding to the input reference image and three different scales of source feature maps corresponding to the input source image, based on the input reference image and the input source image.

[0015] The coarse-resolution depth map generation module is used to obtain a first intermediate depth map, based on its internal Transformer, from the smallest-scale reference feature map and each smallest-scale source feature map.

[0016] The first refined resolution depth map generation module is used to obtain a first depth prior based on the first intermediate depth map and the input reference image; to complete cost volume aggregation, based on its internal Transformer, according to the first depth prior, the middle-scale reference feature map, and each middle-scale source feature map; and to obtain a second intermediate depth map based on the aggregation result.

[0017] The second refined resolution depth map generation module is used to obtain a second depth prior based on the second intermediate depth map and the input reference image, and based on the internal Transformer, to complete the cost volume aggregation according to the second depth prior, the reference feature map with the largest scale and the source feature maps with the largest scale, and to obtain the first depth map based on the aggregation result.

[0018] In a second aspect, the present invention also provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus.

[0019] Memory, used to store computer programs;

[0020] The processor, when executing a program stored in memory, implements any of the methods provided in the first aspect.

[0021] The beneficial effects of this invention are:

[0022] The depth prior-guided multi-view stereo 3D reconstruction method provided by this invention acquires N original images of the target to be reconstructed from different perspectives; it iterates through the acquired N original images, using each original image as a reference image and the remaining N-1 original images as source images, and inputs them into a trained depth map generation network to obtain N depth maps; the N depth maps are then fused to obtain a 3D point cloud of the target to be reconstructed. Based on its network architecture, the subsequent depth map generation module uses the depth map output by the preceding depth map generation module to generate a depth prior. The depth prior guides the Transformer in the subsequent depth map generation module to complete cost volume aggregation and generate the subsequent depth map, thereby effectively calculating the contribution weights of different views to the cost volume, improving the final quality of the depth map and point cloud. At the same time, the generation of the depth prior depends entirely on the internal structural design of the network, without relying on external models, avoiding the error propagation that may be caused by external models, and improving the efficiency and stability of the overall model.

[0023] The present invention will be further described in detail below with reference to the accompanying drawings and embodiments.

Description of the Drawings

[0024] Figure 1 is a flowchart of a depth-prior-guided multi-view stereo 3D reconstruction method provided by the present invention;

[0025] Figure 2 is a schematic diagram of a simulation result provided by the present invention;

[0026] Figure 3 is a schematic diagram of another simulation result provided by the present invention;

[0027] Figure 4 is a schematic diagram of yet another simulation result provided by the present invention.

Detailed Implementation

[0028] The present invention will be further described in detail below with reference to specific embodiments, but the implementation of the present invention is not limited thereto.

[0029] To address the problems existing in the prior art, this invention provides a depth-prior-guided multi-view stereo 3D reconstruction method. This method first utilizes a feature extraction network to acquire multi-view features and construct a feature volume. In the initial stage of coarse resolution, the feature volume is directly aggregated using a Transformer to generate a cost volume. In subsequent stages of refined resolution, depth prior features are generated using the depth information estimated in the previous stage and the RGB image, further enhancing the Transformer to achieve more accurate cost volume aggregation. Depth maps are progressively estimated at each stage through 3D regularization. Finally, the final depth maps are fused to generate a dense 3D point cloud.

[0030] Figure 1 is a flowchart of a depth-prior-guided multi-view stereo 3D reconstruction method provided by this invention. As shown in Figure 1, the method includes:

[0031] S101. Obtain N original images of the target to be reconstructed taken from different perspectives.

[0032] Where N is a positive integer greater than or equal to 2.

[0033] S102. Traverse the N original images obtained, take each original image as a reference image, and take the remaining N-1 original images as source images, input them into the trained depth map generation network to obtain N depth maps.

[0034] For example, suppose three original images are acquired: image A, image B, and image C. With image A as the reference image and images B and C as the source images, inputting them into the trained depth map generation network yields one depth map. With image B as the reference image and images A and C as the source images, the network yields a second depth map. With image C as the reference image and images A and B as the source images, the network yields a third depth map, for a total of three depth maps.

[0035] S103. After fusing N depth maps, a 3D point cloud of the target to be reconstructed is obtained.
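Steps S101 to S103 can be sketched as follows. This is a minimal illustration, not the patented implementation: `depth_network` is a hypothetical stand-in for the trained depth map generation network, the per-view dictionary layout (`image`, `K`, `R`, `t`) is an assumption, cameras are assumed to follow the convention x_cam = R · x_world + t, and fusion is reduced to back-projecting each depth map and concatenating the points. The geometric consistency filtering commonly applied during fusion is omitted.

```python
import numpy as np

def reference_source_splits(views):
    """Yield (reference, sources) so that each view is the reference exactly once."""
    for k, reference in enumerate(views):
        yield reference, views[:k] + views[k + 1:]

def backproject(depth, K, R, t):
    """Lift a depth map (H, W) to world-space points, assuming x_cam = R @ x_world + t."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    pixels = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(np.float64)
    z = depth.reshape(-1)
    cam_points = (np.linalg.inv(K) @ pixels) * z        # scale each viewing ray by its depth
    world_points = R.T @ (cam_points - t.reshape(3, 1))
    return world_points.T[z > 0]                         # keep pixels with a valid depth

def reconstruct(views, depth_network):
    """views: list of dicts with keys 'image', 'K', 'R', 't' (assumed layout)."""
    if len(views) < 2:
        raise ValueError("at least two views are required")
    clouds = []
    for reference, sources in reference_source_splits(views):
        depth = depth_network(reference, sources)        # one depth map per reference view
        clouds.append(backproject(depth, reference["K"], reference["R"], reference["t"]))
    return np.concatenate(clouds, axis=0)                # naive fusion: merge all points
```

With three views A, B, and C, `reference_source_splits` produces the three pairings described in the example above, so `reconstruct` calls the network three times.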

[0036] The depth map generation network includes a feature extraction network, a coarse resolution depth map generation module, a first refined resolution depth map generation module, a second refined resolution depth map generation module, and a depth map generation module.

[0037] During the generation of any first depth map among N depth maps:

[0038] The feature extraction network is used to obtain three different scales of reference feature maps corresponding to the input reference image and three different scales of source feature maps corresponding to the input source image, based on the input reference image and the input source image.

[0039] The coarse-resolution depth map generation module is used to obtain the first intermediate depth map, based on its internal Transformer, from the smallest-scale reference feature map and each smallest-scale source feature map.

[0040] The first refined resolution depth map generation module is used to obtain a first depth prior based on the first intermediate depth map and the input reference image; to complete cost volume aggregation, based on its internal Transformer, according to the first depth prior, the middle-scale reference feature map, and each middle-scale source feature map; and to obtain a second intermediate depth map based on the aggregation result.

[0041] The second refined resolution depth map generation module is used to obtain a second depth prior based on the second intermediate depth map and the input reference image, and based on the internal Transformer, to complete the cost volume aggregation according to the second depth prior, the reference feature map with the largest scale and the source feature maps with the largest scale, and to obtain the first depth map based on the aggregation result.

[0042] The depth map generation module is used to obtain a depth map based on the aggregated cost volume.

[0043] Optionally, the feature extraction network is a multi-scale feature network, specifically a feature pyramid structure. That is, after inputting a single-scale image, the downsampling module of UNet performs progressive convolution operations to extract feature maps of three different scales. Then, through the upsampling module's convolution operations and skip connections, these feature maps are synthesized and output to obtain feature maps of three different scales.

[0044] Furthermore, in order to make the predicted depth closer to the actual depth and obtain a more accurate depth map, before step S102, the method further includes setting the number of depth samples, the depth sampling range, and the depth sampling interval for the coarse resolution depth map generation module, the first refined resolution depth map generation module, and the second refined resolution depth map generation module, respectively.

[0045] Optionally, the coarse resolution depth map generation module, the first refined resolution depth map generation module, and the second refined resolution depth map generation module are connected sequentially. Wherein, if the number of depth samples in the first module is D, the depth of the depth map is d, and the sampling interval is L, then the sampling interval in the second module is... Depth sampling range is

[0046] Furthermore, this invention uses a multi-stage cascaded structure consisting of a coarse-resolution depth map generation module, a first refined-resolution depth map generation module, and a second refined-resolution depth map generation module. In the initial stage, the coarse-resolution depth map generation module works on small-scale features; in the progressively refined stages, the first and second refined-resolution depth map generation modules successively work on larger-scale features. The resolution of each stage is twice that of the previous stage. Optionally, the resolutions of the three modules are H/4×W/4, H/2×W/2, and H×W, respectively, where H represents the image height and W represents the image width. Correspondingly, the features extracted at the different scales have sizes H/4×W/4×C, H/2×W/2×C, and H×W×C, where C represents the number of feature channels.
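As a small worked example of the cascade, the per-stage resolutions follow from two statements in the text: each stage doubles the resolution of the previous one, and the last stage runs at the full H×W. The function name and the integer division are illustrative assumptions.

```python
def stage_resolutions(height, width, num_stages=3):
    """Return (h, w) for each stage, coarse to fine; the last stage is full resolution."""
    return [(height // 2 ** k, width // 2 ** k) for k in range(num_stages - 1, -1, -1)]

resolutions = stage_resolutions(512, 640)
```

For a 512×640 input this gives 128×160 for the coarse stage, 256×320 for the first refined stage, and 512×640 for the second refined stage.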

[0047] Furthermore, in the process of generating any first depth map among the N depth maps, the coarse resolution depth map generation module is specifically used to perform the following steps A1-A4:

[0048] A1. Based on the smallest-scale reference feature map and each smallest-scale source feature map, obtain the smallest-scale reference feature volume and each smallest-scale source feature volume.

[0049] A2. Based on the Transformer within the coarse-resolution depth map generation module, use the smallest-scale reference feature volume as the query vector and each smallest-scale source feature volume as a key vector to determine the correlation weight between each smallest-scale source feature volume and the smallest-scale reference feature volume. The corresponding expression is:

[0050] $$w_i = \mathrm{Softmax}\left(\frac{Q_r K_i^{T}}{\sqrt{c}}\right)$$

[0051] Among them, $w_i$ represents the correlation weight between the i-th smallest-scale source feature volume and the smallest-scale reference feature volume; Softmax(·) represents a function that normalizes the correlation weights to a probability distribution; $K_i$ represents the key vector transformed from the i-th smallest-scale source feature volume; $Q_r$ represents the query vector transformed from the smallest-scale reference feature volume; c represents the number of channels of the key vector and the query vector; T represents the transpose; and i represents the index of the smallest-scale source feature volume.

[0052] A3. Based on the smallest-scale reference feature volume and each smallest-scale source feature volume, determine the single-view cost volume corresponding to each smallest-scale source feature volume. The corresponding expression is:

[0053] $$C_i = \langle F_i, F_r \rangle$$

[0054] Among them, $C_i$ represents the single-view cost volume corresponding to the i-th smallest-scale source feature volume, $F_i$ represents the i-th smallest-scale source feature volume, $F_r$ represents the smallest-scale reference feature volume, and $\langle \cdot, \cdot \rangle$ represents the inner product.

[0055] A4. Aggregate the single-view cost volumes corresponding to the smallest-scale source feature volumes, and obtain the first intermediate depth map based on the aggregation result.
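Steps A2 to A4 can be sketched in NumPy as follows. This is a sketch under several stated assumptions rather than the patented network: the query and key projections are taken to be the identity, the scores are scaled by the square root of the channel count as in standard scaled dot-product attention, the softmax runs over the depth-hypothesis axis, and the aggregation divides the weighted sum by the sum of the weights. Feature volumes are assumed to have shape (D, H, W, C).

```python
import numpy as np

def softmax(x, axis):
    shifted = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def correlation_weight(query_vol, key_vol):
    """A2: scaled dot-product score per hypothesis and pixel, softmax over depth (assumed)."""
    c = query_vol.shape[-1]
    scores = (query_vol * key_vol).sum(axis=-1) / np.sqrt(c)   # (D, H, W)
    return softmax(scores, axis=0)

def single_view_cost(src_vol, ref_vol):
    """A3: C_i = <F_i, F_r>, the inner product over the channel axis."""
    return (src_vol * ref_vol).sum(axis=-1)                    # (D, H, W)

def aggregate(costs, weights, eps=1e-8):
    """A4: weighted combination of single-view cost volumes (normalisation is an assumption)."""
    costs, weights = np.stack(costs), np.stack(weights)
    return (weights * costs).sum(axis=0) / (weights.sum(axis=0) + eps)

rng = np.random.default_rng(0)
D, H, W, C = 4, 3, 5, 8
ref_vol = rng.standard_normal((D, H, W, C))                    # reference feature volume
src_vols = [rng.standard_normal((D, H, W, C)) for _ in range(2)]

weights = [correlation_weight(ref_vol, s) for s in src_vols]
costs = [single_view_cost(s, ref_vol) for s in src_vols]
cost_agg = aggregate(costs, weights)                           # (D, H, W)
```

Because the weights are computed from the feature volumes themselves, the aggregation step in this sketch has no trainable parameters of its own.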

[0056] Furthermore, during the generation of any first depth map among the N depth maps, the first refined resolution depth map generation module is specifically used to execute the following steps B1-B6:

[0057] B1. Sample the input reference image to obtain a first reference image whose scale is the same as that of the middle-scale reference feature map.

[0058] B2. Based on the first reference image and the first intermediate depth map, obtain the first depth prior.

[0059] B3. Based on the middle-scale reference feature map and each middle-scale source feature map, obtain the middle-scale reference feature volume and each middle-scale source feature volume.

[0060] B4. Based on the Transformer within the first refined resolution depth map generation module, use the first depth prior as the query vector and each middle-scale source feature volume as a key vector to determine the correlation weight between each middle-scale source feature volume and the middle-scale reference feature volume. The corresponding expression is:

[0061]

[0062] Here, the terms of the expression denote, in order: the correlation weight between the i1-th middle-scale source feature volume and the middle-scale reference feature volume; the key vector transformed from the i1-th middle-scale source feature volume; the query vector transformed from the middle-scale reference feature volume; and the query vector transformed from the first depth prior. i1 represents the index of the middle-scale source feature volume.

[0063] B5. Based on the middle-scale reference feature volume and each middle-scale source feature volume, determine the single-view cost volume corresponding to each middle-scale source feature volume. The corresponding expression is:

[0064]

[0065] Here, the terms of the expression denote, in order: the single-view cost volume corresponding to the i1-th middle-scale source feature volume; the i1-th middle-scale source feature volume; and the middle-scale reference feature volume.

[0066] B6. Aggregate the single-view cost volumes corresponding to the middle-scale source feature volumes, and obtain the second intermediate depth map based on the aggregation result.

[0067] This method uses the depth map generated by the previous depth map generation module and the corresponding reference image to generate a depth prior, and then introduces the depth prior as the query vector of the Transformer, thereby obtaining more accurate adaptive weights and improving the cost volume aggregation effect.
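The change relative to the coarse stage is that the query comes from the depth prior rather than from the reference feature volume. A minimal sketch, assuming the depth prior is a per-pixel feature map of shape (H, W, C) that is broadcast across the D depth hypotheses, that no learned projection is applied, and that the softmax runs over the depth axis with square-root scaling:

```python
import numpy as np

def prior_guided_weight(prior_feat, src_vol):
    """prior_feat: (H, W, C) depth prior features; src_vol: (D, H, W, C) source feature volume."""
    c = src_vol.shape[-1]
    # The prior acts as the query and is broadcast over the depth axis of the key volume.
    scores = (prior_feat[None] * src_vol).sum(axis=-1) / np.sqrt(c)   # (D, H, W)
    scores = scores - scores.max(axis=0, keepdims=True)               # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=0, keepdims=True)

rng = np.random.default_rng(1)
src_vol = rng.standard_normal((4, 3, 5, 8))
prior_feat = rng.standard_normal((3, 5, 8))
w = prior_guided_weight(prior_feat, src_vol)      # one weight per hypothesis and pixel
```

An uninformative prior (all zeros) yields uniform weights over the depth hypotheses in this sketch, so any deviation from uniform comes from agreement between the prior and the source features.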

[0068] In the process of generating any first depth map from N depth maps, the processing of the second refined resolution depth map generation module is similar to that of the first refined resolution depth map generation module, specifically including the following steps C1-C6:

[0069] C1. Sample the input reference image to obtain a second reference image. The scale of the second reference image is the same as that of the reference feature map with the largest scale.

[0070] C2. Based on the second reference image and the second intermediate depth map, obtain the second depth prior.

[0071] C3. Based on the reference feature map with the largest scale and the source feature maps with the largest scale, obtain the reference feature volume with the largest scale and the source feature volume with the largest scale.

[0072] C4. Based on the Transformer within the second refined resolution depth map generation module, use the second depth prior as the query vector and each largest-scale source feature volume as a key vector to determine the correlation weight between each largest-scale source feature volume and the largest-scale reference feature volume. The corresponding expression is:

[0073]

[0074] Here, the terms of the expression denote, in order: the correlation weight between the i2-th largest-scale source feature volume and the largest-scale reference feature volume; the key vector transformed from the i2-th largest-scale source feature volume; the query vector transformed from the largest-scale reference feature volume; and the query vector transformed from the second depth prior. i2 represents the index of the largest-scale source feature volume.

[0075] By incorporating depth prior features, the depth prior can guide cost aggregation, reduce noise interference, and more accurately capture the geometric distribution of the scene.

[0076] C5. Based on the largest-scale reference feature volume and each largest-scale source feature volume, determine the single-view cost volume corresponding to each largest-scale source feature volume. The corresponding expression is:

[0077]

[0078] Here, the terms of the expression denote, in order: the single-view cost volume corresponding to the i2-th largest-scale source feature volume; the i2-th largest-scale source feature volume; and the largest-scale reference feature volume.

[0079] C6. Aggregate the single-view cost volumes corresponding to the largest-scale source feature volumes, and obtain the first depth map based on the aggregation result.

[0080] Specifically, the first refined resolution depth map generation module includes several convolutional layers (the encoder) and deconvolutional layers (the decoder), and the first depth prior is generated by this encoder and decoder. The encoder part is connected to the decoder part via skip connections to enhance the transmission of detailed information. The second refined resolution depth map generation module likewise consists of several convolutional layers (the encoder) and deconvolutional layers (the decoder), and its input includes the first depth prior passed from the first refined resolution depth map generation module. Notably, the decoder part of the first refined resolution depth map generation module is concatenated with the encoder part of the second refined resolution depth map generation module along the feature dimension, so that the encoder of the second module receives more information from the decoder of the first. As a result, the depth prior in the second refined resolution depth map generation module fuses more comprehensive information, which further optimizes its expression. Finally, the second depth prior is generated by the encoder and decoder parts of the second refined resolution depth map generation module, which also employs skip connections between the encoder and decoder to enhance the expressive power of the features.

[0081] Furthermore, the coarse resolution depth map generation module, the first refined resolution depth map generation module, and the second refined resolution depth map generation module all involve how to obtain feature volumes, how to aggregate single-view cost volumes, and how to obtain depth maps based on the aggregation results of single-view cost volumes. The implementation processes of each module are similar, as detailed below:

[0082] Optionally, the feature volume is obtained from the feature map, including: constructing a homography matrix based on the camera pose and depth assumptions corresponding to the feature map; and converting the feature map into a feature volume using the homography matrix.

[0083] The corresponding expression is:

[0084]

[0085] Where H represents the homography matrix; K2 represents the camera intrinsic parameter matrix of the viewpoint corresponding to feature map 2; R1 represents the rotation matrix corresponding to feature map 1; R2 represents the rotation matrix corresponding to feature map 2; I represents the identity matrix; t1 represents the translation vector corresponding to feature map 1; t2 represents the translation vector corresponding to feature map 2; n represents the normal vector corresponding to feature map 1; d1 represents the depth hypothesis corresponding to feature map 1; K1 represents the camera intrinsic parameter matrix of the viewpoint corresponding to feature map 1; and the superscript T denotes the transpose. Feature map 2 is the reference feature map; feature map 1 is either feature map 2 itself or any source feature map with the same scale as feature map 2.

[0086] The coarse resolution depth map generation module, the first refined resolution depth map generation module, and the second refined resolution depth map generation module all perform homography transformation on the corresponding reference feature map and the source feature map using the corresponding homography matrix, so as to project the pixel features on the pixel plane corresponding to the source feature map onto the pixel plane corresponding to the reference feature map, thereby obtaining multiple feature volumes.
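For intuition, the plane-induced homography used in plane-sweep stereo can be written in a few lines. The sketch below is the standard textbook form, not necessarily term-for-term the expression in this patent; it assumes world-to-camera poses x_cam = R · x_world + t and a fronto-parallel plane z = depth in the reference camera frame, and all function and variable names are illustrative.

```python
import numpy as np

def plane_sweep_homography(K_ref, R_ref, t_ref, K_src, R_src, t_src, depth):
    """Map homogeneous reference pixels to source pixels for the plane z = depth."""
    R_rel = R_src @ R_ref.T                 # rotation from reference frame to source frame
    t_rel = t_src - R_rel @ t_ref           # translation from reference frame to source frame
    n = np.array([0.0, 0.0, 1.0])           # plane normal: the reference optical axis
    return K_src @ (R_rel + np.outer(t_rel, n) / depth) @ np.linalg.inv(K_ref)

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
theta = np.deg2rad(5.0)
R_src = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                  [0.0, 1.0, 0.0],
                  [-np.sin(theta), 0.0, np.cos(theta)]])
t_src = np.array([0.2, 0.0, 0.0])
R_ref, t_ref = np.eye(3), np.zeros(3)
depth = 2.0

H_mat = plane_sweep_homography(K, R_ref, t_ref, K, R_src, t_src, depth)

point_ref = np.array([0.3, -0.2, depth])          # a 3D point lying on the swept plane
pix_ref = K @ point_ref                           # its homogeneous pixel in the reference view
pix_src = K @ (R_src @ point_ref + t_src)         # its homogeneous pixel in the source view
warped = H_mat @ pix_ref                          # reference pixel mapped by the homography
```

Evaluating this homography once per depth hypothesis and sampling the source feature map at the warped pixel locations stacks the source features into a volume aligned with the reference view.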

[0087] For example, the dimensions of the feature volumes at the three stages are D1×H/4×W/4×C, D2×H/2×W/2×C, and D3×H×W×C, where D1, D2, and D3 represent the number of depth hypotheses of the coarse-resolution depth map generation module, the first refined-resolution depth map generation module, and the second refined-resolution depth map generation module, respectively. The feature volumes of each depth map generation module are divided into one reference feature volume $F_r$ and N-1 source feature volumes $F_i$, where $F_i$ represents the feature volume corresponding to the i-th source feature map. The reference feature volume is a copy of the original feature map along the depth direction and plays the role of the ground truth: the closer a depth hypothesis is to the true depth, the higher the similarity between the source feature volume and the reference feature volume.

[0088] Optionally, aggregate the cost volumes of each single view, represented as:

[0089]

[0090] Among them, $C_{agg}$ represents the aggregated result of the single-view cost volumes; $w_j$ represents the correlation weight between the source feature volume and the reference feature volume corresponding to the j-th single-view cost volume; $C_j$ represents the j-th single-view cost volume; and J represents the total number of single-view cost volumes used for aggregation.

[0091] This invention does not involve any additional parameter learning in the cost aggregation process itself. Instead, it dynamically computes the feature volumes and the depth prior feature volume and uses them to update the cost volume directly, avoiding the learning burden of additional parameters while keeping the cost aggregation process flexible and adaptive.

[0092] Optionally, the depth map is obtained from the aggregation results of single-view cost volumes, including: using a multi-scale 3D convolutional network to denoise the aggregation results of single-view cost volumes and converting the denoised single-view cost volumes into probability volumes; obtaining depth hypotheses based on preset depth sampling ranges, depth sampling intervals, and the number of depth samples; calculating depth probabilities based on the probability volumes; and predicting image depth based on the calculated depth probabilities and depth hypotheses to obtain the depth map, with the corresponding expression being:

[0093]

[0094] where $D_{yc}$ denotes the predicted image depth, $[l_{min}, l_{max}]$ denotes the preset depth sampling range, $l$ denotes a depth hypothesis of the current image, and $P(l)$ denotes the probability under depth hypothesis $l$.
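By way of non-limiting illustration, the steps of building depth hypotheses, converting scores into probabilities, and predicting a depth can be sketched in Python for a single pixel. This sketch assumes uniformly spaced hypotheses, a softmax over the denoised scores, and a probability-weighted mean as the prediction; selecting the most probable hypothesis instead would be an alternative reading. The function names and numeric values are illustrative only.

```python
import math


def depth_hypotheses(l_min, interval, num_samples):
    """Uniformly spaced hypotheses starting at l_min (spacing is an assumption)."""
    return [l_min + k * interval for k in range(num_samples)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def expected_depth(hypotheses, probabilities):
    """Probability-weighted mean of the depth hypotheses for one pixel."""
    return sum(l * p for l, p in zip(hypotheses, probabilities))


hyps = depth_hypotheses(l_min=425.0, interval=2.5, num_samples=5)
probs = softmax([0.0, 1.0, 3.0, 1.0, 0.0])  # toy denoised scores for one pixel
depth = expected_depth(hyps, probs)
```

Because the toy scores are symmetric about the middle hypothesis, the predicted depth equals that hypothesis.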

[0095] It's important to note that depth assumptions are applied to each pixel, and each pixel corresponds to a set of depth assumptions. By selecting appropriate depth assumptions for each pixel from this set, a depth map that more closely approximates reality can be obtained.

[0096] Optionally, the loss function of the depth map generation network can be expressed as:

[0097]

[0098] where $L_{total}$ denotes the loss function of the depth map generation network, and the remaining three terms denote the loss functions of the coarse-resolution depth map generation module, the first refined-resolution depth map generation module, and the second refined-resolution depth map generation module, respectively.

[0099] The depth maps output by each depth map generation module have varying degrees of error compared to the true values. Cross-entropy can be used as the loss function for each depth map generation module, expressed as:

[0100]

[0101] where the expression gives the loss function of the g-th stage, ρ denotes the set of valid pixels, $P_{GT}(z)$ denotes the ground-truth probability distribution, and $P(z)$ denotes the predicted probability distribution.
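By way of non-limiting illustration, a per-stage cross-entropy restricted to valid pixels can be sketched in Python as follows. Averaging over the valid pixels, the constant `eps`, the function name, and the toy distributions are assumptions of this sketch.

```python
import math


def stage_cross_entropy(pred, gt, valid_mask, eps=1e-12):
    """Cross-entropy summed over depth bins, averaged over valid pixels.

    pred, gt: lists of per-pixel probability distributions over depth bins.
    valid_mask: list of booleans marking pixels with valid ground truth.
    Averaging (rather than summing) over pixels is an assumption.
    """
    total, count = 0.0, 0
    for p_pred, p_gt, valid in zip(pred, gt, valid_mask):
        if not valid:
            continue  # invalid pixels contribute nothing to the loss
        total += -sum(g * math.log(p + eps) for g, p in zip(p_gt, p_pred))
        count += 1
    return total / count if count else 0.0


gt = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]      # one-hot ground truth (toy values)
pred = [[0.25, 0.5, 0.25], [0.1, 0.8, 0.1]]  # predicted distributions (toy values)
mask = [True, False]                          # the second pixel is ignored
loss = stage_cross_entropy(pred, gt, mask)
```

Only the first pixel is valid here, so the loss reduces to the negative logarithm of the probability assigned to the true depth bin.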

[0102] The depth prior-guided multi-view stereo 3D reconstruction method provided by this invention acquires N original images of the target to be reconstructed from different perspectives; it iterates through the acquired N original images, using each original image as a reference image and the remaining N-1 original images as source images, and inputs them into a trained depth map generation network to obtain N depth maps; the N depth maps are then fused to obtain a 3D point cloud of the target to be reconstructed. Based on its network architecture, the subsequent depth map generation module uses the depth map output by the preceding depth map generation module to generate a depth prior. The depth prior guides the Transformer in the subsequent depth map generation module to complete cost volume aggregation and generate the subsequent depth map, thereby effectively calculating the contribution weights of different views to the cost volume, improving the final quality of the depth map and point cloud. At the same time, the generation of the depth prior depends entirely on the internal structural design of the network, without relying on external models, avoiding the error propagation that may be caused by external models, and improving the efficiency and stability of the overall model.
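By way of non-limiting illustration, the traversal in which each of the N original images serves once as the reference image while the remaining N-1 images serve as source images can be sketched in Python. The function name and the placeholder image handles are hypothetical, and the depth map generation network itself is not modelled.

```python
def reference_source_pairs(images):
    """Yield (reference, sources) with each image taken once as the reference."""
    for i, reference in enumerate(images):
        sources = images[:i] + images[i + 1:]
        yield reference, sources


views = ["view_0", "view_1", "view_2", "view_3"]  # placeholder image handles
pairs = list(reference_source_pairs(views))
```

Each of the N pairs would be passed to the trained network to produce one depth map, giving N depth maps for fusion.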

[0103] The present invention also provides an electronic device structure, including a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus.

[0104] The memory is used to store a computer program.

[0105] The processor, when executing the program stored in the memory, implements the steps provided in the above method embodiments.

[0106] The communication interface is used for communication between the aforementioned electronic devices and other devices.

[0107] To further demonstrate the beneficial effects of the present invention, a set of simulation experiments is also provided, as follows:

[0108] Quantitative experiments were conducted on the DTU dataset, an indoor dataset with multi-view images and camera poses. It contains 124 scenes, each captured from 49 or 64 views under 7 lighting conditions. Following the MVSNet protocol, the dataset is split into 79 training scenes, 18 validation scenes, and 22 test scenes, giving a total of 27,097 training samples.
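As a brief consistency check, the stated number of training samples agrees with counting one sample per training scene, view, and lighting condition. The Python sketch below assumes that every training scene contributes 49 views, which the description does not state explicitly.

```python
train_scenes = 79
views_per_scene = 49      # assumption: each training scene has 49 views
lighting_conditions = 7

training_samples = train_scenes * views_per_scene * lighting_conditions
print(training_samples)  # 27097
```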

[0109] The network input consists of N = 5 images, each 1152×864 pixels. Depth maps are filtered using geometric and photometric constraints, and Gipuma's depth fusion is used to obtain the final 3D point cloud. The accuracy (Acc.), completeness (Comp.), and overall score (Overall) of the reconstructed point cloud are computed with the official MATLAB code provided by DTU. The overall score is the average of accuracy and completeness (lower is better): Overall = (Acc. + Comp.) / 2.

[0110] The reconstruction method of this invention was compared with traditional methods and several learning-based MVS methods; the quantitative results are shown in Table 1. The reconstruction method of this invention outperforms the listed traditional and learning-based methods in terms of completeness (0.291 mm) and achieves the best overall score (0.336 mm). Figure 2 shows the result of the CasMVSNet method, Figure 3 shows the result of the MVSNet method, and Figure 4 shows the result of the method provided by this invention. As shown in Figures 2-4, the details highlighted within the rectangles demonstrate that the reconstruction results of this invention are more complete and that the geometric structure is clearer.

[0111] Table 1. Quantitative results of the method of the present invention and other methods on the DTU test set.

[0112]
Method        Acc. (mm)   Comp. (mm)   Overall (mm)
Furu          0.613       0.941        0.777
Gipuma        0.283       0.873        0.578
COLMAP        0.400       0.664        0.532
MVSNet        0.396       0.527        0.462
CasMVSNet     0.325       0.385        0.355
PVA-MVSNet    0.379       0.336        0.357
Vis-MVSNet    0.369       0.361        0.365
Ours          0.382       0.291        0.336
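As a brief consistency check, the Overall column of Table 1 can be recomputed as the average of the accuracy and completeness columns. The Python sketch below copies the values from Table 1; the variable and function names are illustrative.

```python
# (Acc., Comp., Overall) in mm, copied from Table 1.
table_1 = {
    "Furu": (0.613, 0.941, 0.777),
    "Gipuma": (0.283, 0.873, 0.578),
    "COLMAP": (0.400, 0.664, 0.532),
    "MVSNet": (0.396, 0.527, 0.462),
    "CasMVSNet": (0.325, 0.385, 0.355),
    "PVA-MVSNet": (0.379, 0.336, 0.357),
    "Vis-MVSNet": (0.369, 0.361, 0.365),
    "Ours": (0.382, 0.291, 0.336),
}


def overall(acc, comp):
    """Average of accuracy and completeness (lower is better)."""
    return (acc + comp) / 2


# Largest gap between the recomputed and the reported overall score.
max_deviation = max(abs(overall(a, c) - o) for a, c, o in table_1.values())
best_method = min(table_1, key=lambda name: table_1[name][2])
```

The recomputed values agree with the reported column to within rounding at the third decimal place.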

[0113] As the electronic device embodiment is basically similar to the method embodiment, the description is relatively simple. For details and beneficial effects, please refer to the description of the method embodiment.

[0114] The terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of this invention, "a plurality of" means two or more, unless otherwise explicitly specified.

[0115] The above description, in conjunction with specific preferred embodiments, provides a further detailed explanation of the present invention. It should not be construed that the specific implementation of the present invention is limited to these descriptions. For those skilled in the art, various simple deductions or substitutions can be made without departing from the concept of the present invention, and all such modifications and substitutions should be considered within the scope of protection of the present invention.

Claims

1. A depth-prior-guided multi-view stereo 3D reconstruction method, characterized by comprising:
obtaining N original images of a target to be reconstructed, taken from different perspectives, where N is a positive integer greater than or equal to 2;
traversing the N original images, using each original image in turn as a reference image and the remaining N-1 original images as source images, and inputting them into a trained depth map generation network to obtain N depth maps;
fusing the N depth maps to obtain a three-dimensional point cloud of the target to be reconstructed;
wherein the depth map generation network includes a feature extraction network, a coarse-resolution depth map generation module, a first refined-resolution depth map generation module, and a second refined-resolution depth map generation module, and during the generation of any first depth map among the N depth maps:
the feature extraction network is used to obtain, based on the input reference image and the input source images, reference feature maps at three different scales corresponding to the input reference image and source feature maps at three different scales corresponding to each input source image;
the coarse-resolution depth map generation module is used to obtain a first intermediate depth map, based on its internal Transformer, from the smallest-scale reference feature map and the smallest-scale source feature maps;
the first refined-resolution depth map generation module is used to obtain a first depth prior based on the first intermediate depth map and the input reference image, to complete cost volume aggregation, based on its internal Transformer, from the first depth prior, the intermediate-scale reference feature map, and the intermediate-scale source feature maps, and to obtain a second intermediate depth map based on the aggregation result;
the second refined-resolution depth map generation module is used to obtain a second depth prior based on the second intermediate depth map and the input reference image, to complete cost volume aggregation, based on its internal Transformer, from the second depth prior, the largest-scale reference feature map, and the largest-scale source feature maps, and to obtain the first depth map based on the aggregation result;
wherein, during the generation of any first depth map among the N depth maps, the first refined-resolution depth map generation module is specifically used for:
sampling the input reference image to obtain a first reference image whose scale is the same as that of the intermediate-scale reference feature map;
obtaining the first depth prior based on the first reference image and the first intermediate depth map;
obtaining an intermediate-scale reference feature volume and intermediate-scale source feature volumes based on the intermediate-scale reference feature map and each intermediate-scale source feature map;
based on the Transformer within the first refined-resolution depth map generation module, using the first depth prior as the query vector and each intermediate-scale source feature volume as the key vector, determining the correlation weight between each intermediate-scale source feature volume and the intermediate-scale reference feature volume according to a corresponding expression whose terms denote: the correlation weight between a given intermediate-scale source feature volume and the intermediate-scale reference feature volume; the key vector converted from that source feature volume; the query vector converted from the intermediate-scale reference feature volume; the index of the source feature volume; and the query vector converted from the first depth prior;
determining, based on the intermediate-scale reference feature volume and each intermediate-scale source feature volume, the single-view cost volume corresponding to each intermediate-scale source feature volume according to a corresponding expression whose terms denote: the single-view cost volume corresponding to a given intermediate-scale source feature volume; that source feature volume; and the intermediate-scale reference feature volume;
aggregating the single-view cost volumes corresponding to the intermediate-scale source feature volumes, and obtaining the second intermediate depth map based on the aggregation result.

2. The method according to claim 1, characterized in that, during the generation of any first depth map among the N depth maps, the coarse-resolution depth map generation module is specifically used for:
obtaining a smallest-scale reference feature volume and smallest-scale source feature volumes based on the smallest-scale reference feature map and each smallest-scale source feature map;
based on the Transformer within the coarse-resolution depth map generation module, using the smallest-scale reference feature volume as the query vector and each smallest-scale source feature volume as the key vector, determining the correlation weight between each smallest-scale source feature volume and the smallest-scale reference feature volume according to a corresponding expression whose terms denote: the correlation weight between a given smallest-scale source feature volume and the smallest-scale reference feature volume; a function that normalizes the correlation weights into a probability distribution; the key vector converted from that source feature volume; the query vector converted from the smallest-scale reference feature volume; the number of channels of the key vector and the query vector; the transpose operation; and the index of the smallest-scale source feature volume;
determining, based on the smallest-scale reference feature volume and each smallest-scale source feature volume, the single-view cost volume corresponding to each smallest-scale source feature volume according to a corresponding expression whose terms denote: the single-view cost volume corresponding to a given smallest-scale source feature volume; that source feature volume; the smallest-scale reference feature volume; and the inner product;
aggregating the single-view cost volumes corresponding to the smallest-scale source feature volumes, and obtaining the first intermediate depth map based on the aggregation result.

3. The method according to claim 2, characterized in that the single-view cost volumes are aggregated according to a corresponding expression, in which $C_{agg}$ denotes the aggregated result of the single-view cost volumes, $w_j$ denotes the correlation weight between the source feature volume and the corresponding reference feature volume for the j-th single-view cost volume, $C_j$ denotes the j-th single-view cost volume, and $J$ denotes the total number of single-view cost volumes used for aggregation.

4. The method according to claim 3, characterized in that obtaining the depth map based on the aggregation result of the single-view cost volumes comprises:
using a multi-scale 3D convolutional network to denoise the aggregation result of the single-view cost volumes, and converting the denoised result into a probability volume;
obtaining depth hypotheses based on a preset depth sampling range, depth sampling interval, and number of depth samples;
calculating depth probabilities based on the probability volume;
predicting the image depth based on the calculated depth probabilities and the depth hypotheses to obtain the depth map, according to a corresponding expression in which $D_{yc}$ denotes the predicted image depth, $[l_{min}, l_{max}]$ denotes the preset depth sampling range, $l$ denotes a depth hypothesis of the current image, and $P(l)$ denotes the probability under depth hypothesis $l$.

5. The method according to claim 4, characterized in that the coarse-resolution depth map generation module, the first refined-resolution depth map generation module, and the second refined-resolution depth map generation module are connected sequentially, wherein the sampling interval and the depth sampling range of a module are determined from the number of depth samples, the depth map, and the sampling interval of the preceding module.

6. The method according to claim 5, characterized in that the loss function of the depth map generation network is given by a corresponding expression, in which $L_{total}$ denotes the loss function of the depth map generation network, and the remaining three terms denote the loss functions of the coarse-resolution depth map generation module, the first refined-resolution depth map generation module, and the second refined-resolution depth map generation module, respectively.

7. The method according to claim 6, characterized in that obtaining a feature volume from a feature map comprises:
constructing a homography matrix based on the camera poses and depth hypotheses corresponding to the feature maps, according to a corresponding expression whose terms denote: the homography matrix; the camera extrinsic parameter matrix corresponding to the viewpoint of feature map 2; the rotation matrix corresponding to feature map 1; the rotation matrix corresponding to feature map 2; the identity matrix; the translation matrix corresponding to feature map 1; the translation matrix corresponding to feature map 2; the normal vector corresponding to feature map 1; the depth hypothesis corresponding to feature map 1; and the transpose operation, indicated by a superscript; wherein feature map 2 serves as the reference feature map, and feature map 1 is either feature map 2 itself or any source feature map with the same scale as feature map 2;
converting the feature map into a feature volume using the homography matrix.

8. The method according to claim 7, characterized in that the resolutions of the coarse-resolution depth map generation module, the first refined-resolution depth map generation module, and the second refined-resolution depth map generation module are each expressed in terms of the image height and the image width.

9. An electronic device, characterized by comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus; the memory is used to store a computer program; and the processor, when executing the program stored in the memory, implements the method described in any one of claims 1-8.