Unsupervised multi-view stereo method based on frequency domain-aware contrast consistency
By employing an unsupervised multi-view stereo method based on frequency domain-aware contrast consistency, and utilizing a feature pyramid module and a frequency domain loss function to optimize depth estimation, the problem of occluded region reconstruction is solved, generating a high-quality point cloud model and achieving high-precision 3D reconstruction.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ANHUI UNIV
- Filing Date
- 2026-04-03
- Publication Date
- 2026-06-26
AI Technical Summary
Existing unsupervised multi-view stereo reconstruction methods suffer from limited reconstruction accuracy in occluded areas due to the failure of the photometric consistency assumption, making it difficult to generate high-quality point cloud models. Furthermore, they fail to fully utilize the semantic information of deep features and the detailed information of shallow features, and lack effective utilization of the prior knowledge of the global structure in the frequency domain.
An unsupervised multi-view stereo method based on frequency domain-aware contrast consistency is adopted. Multi-scale features are extracted by the feature pyramid module, and feature consistency is enhanced by the embedded hybrid attention module. A loss function constructed in the frequency domain provides pseudo-supervision signals that optimize depth estimation, from which a high-quality point cloud model is generated.
High-quality point cloud models of occluded scenes are generated without ground-truth depth labels, improving the global structural consistency and local geometric detail of occluded areas and achieving high-precision point cloud reconstruction.
[Figure: abstract drawing, CN122289554A_ABST]
Abstract
Description
Technical Field
[0001] This invention relates to the fields of multi-view 3D reconstruction, computer graphics, and computer vision, and specifically to an unsupervised multi-view stereo method based on frequency domain-aware contrast consistency.
Background Technology
[0002] Multi-view Stereo (MVS) is a technique for reconstructing a dense 3D point cloud model of a scene from multiple viewpoint images. Traditional MVS methods mainly rely on prior knowledge, multi-view photometric consistency, and regularization techniques to improve the quality of the reconstructed point cloud. However, due to challenges such as image quality variations and illumination interference, traditional MVS methods often struggle to generate high-precision and complete point clouds.
[0003] In recent years, deep learning-based MVS methods have made substantial progress. Based on the data format of the output reconstruction results, these methods can generally be categorized as follows: (1) voxel-based methods; (2) implicit representation-based methods; and (3) depth map fusion-based methods. Supervised learning-based MVS methods rely on large-scale labeled datasets to train neural networks for depth map estimation, which are then fused into a dense point cloud model. However, the performance and generalization ability of supervised MVS methods are often limited by the need for large-scale, high-quality labeled data. Therefore, unsupervised MVS methods are increasingly attracting research attention. These methods reduce the dependence on ground truth depth by employing effective loss functions and optimization strategies, and can generate high-quality point clouds.
[0004] Existing unsupervised methods are primarily based on the photometric consistency assumption, which states that the photometric properties (such as grayscale or color) of the same 3D point remain relatively stable when viewed from different viewpoints or in adjacent frames. However, in occluded scenes, mutual occlusion between objects prevents some viewpoints from obtaining accurate surface information of the target area, resulting in pixel differences between views. Therefore, the photometric consistency assumption fails, severely reducing the accuracy of depth estimation.
[0005] Some methods attempt to enhance the model's inference ability regarding occluded regions by introducing auxiliary geometric constraints or rendering consistency, thereby mitigating the reliance on pure photometric consistency. While these methods alleviate the problem of photometric consistency assumption failure to some extent by utilizing different feature priors, they are still inherently limited by local feature representations in the spatial domain. Therefore, under large-scale occlusion, these methods still struggle to generate high-quality depth maps, resulting in incomplete point cloud reconstructions and even large holes.
[0006] In other words, existing unsupervised multi-view stereo reconstruction methods still face the following challenges when performing depth estimation: (1) they are based on the photometric consistency assumption, which fails between viewpoints in occluded areas, so reconstruction accuracy is limited and the requirements of complex real-world scenes are difficult to meet; (2) they fail to fully combine the semantic information of deep features with the detailed information of shallow features, so depth estimation is not robust in occluded and texture-deficient areas; (3) they are limited to spatial domain optimization and make no effective use of the global structural prior in the frequency domain, so their ability to recover the geometry of occluded areas is insufficient.
Summary of the Invention
[0007] Purpose of the invention: The purpose of this invention is to address the shortcomings of the existing technology by providing an unsupervised multi-view stereo method based on frequency domain-aware contrast consistency.
[0008] Technical solution: The present invention provides an unsupervised multi-view stereo method based on frequency domain-aware contrast consistency, characterized by comprising the following steps:
[0009] Step 1: Input the multi-view image dataset and the camera parameters;
[0010] The dataset contains N multi-view images, each identified by a sequence number; one image serves as the reference view and the others are source views. The camera parameters comprise the intrinsic parameter matrix of the k-th camera and the rotation and translation matrices between pairs of cameras;
[0011] Step 2:
[0012] For the multi-view images obtained in Step 1, a feature pyramid module is used to extract multi-scale features. The feature pyramid module embeds a hybrid attention module and is applied in a total of three stages; in each stage, this attention-based feature pyramid module extracts the two-dimensional feature maps of the multi-view images. After the three stages, the enhanced multi-view image features are obtained;
[0013] Step 3: Generate depth hypothesis planes from the known depth range of the original multi-view images. The planes are evenly distributed between the minimum and maximum depth values, and each plane is expanded to the same resolution as the multi-view images.
[0014] Step 4: Use a differentiable homography transformation to project the multi-view image features obtained in Step 2 onto the expanded depth hypothesis planes, yielding a set of reprojected features. The reprojected features are then compared with the corresponding features of the reference view to construct the cost volume;
[0015] Then, a 3D convolutional neural network is used to aggregate the cost volume to obtain the probability volume. The probability volume is then normalized to obtain the initial depth map. The initial depth map here is the result of a single-stage estimation.
[0016] Step 5: Repeat steps 2 to 4 to obtain the final depth map. The final depth map here is a high-precision depth estimation result obtained through multi-stage progressive refinement and local depth resampling optimization in the cascaded structure.
[0017] Step 6: During the unsupervised learning training process, use photometric consistency loss to optimize the final depth map obtained in step 5;
[0018] Step 7: During the unsupervised learning training process, the final depth map obtained in Step 5 is optimized using frequency domain-aware contrast consistency loss.
[0019] Step 8: Repeat steps 2 to 7 to calculate the depth maps of all views, fuse the depth maps of all views, and output a globally consistent 3D point cloud model.
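The depth hypothesis planes of Step 3 and the local depth resampling mentioned in Step 5 can be illustrated with a minimal NumPy sketch. The function names, the depth range (425 to 935), the plane counts, and the resampling interval are illustrative assumptions and are not values stated in this document.

```python
import numpy as np

def uniform_depth_hypotheses(d_min, d_max, num_planes, height, width):
    """Evenly spaced depth planes in [d_min, d_max], expanded to image resolution.

    Returns an array of shape (num_planes, height, width).
    """
    depths = np.linspace(d_min, d_max, num_planes)
    return np.broadcast_to(depths[:, None, None], (num_planes, height, width)).copy()

def resample_around(prev_depth, interval, num_planes):
    """Per-pixel hypotheses centred on a previous depth estimate.

    Assumed form of the cascade refinement: a later stage searches a narrow
    band around the depth predicted by the earlier stage.
    """
    offsets = (np.arange(num_planes) - (num_planes - 1) / 2.0) * interval
    return prev_depth[None, :, :] + offsets[:, None, None]

# Illustrative values only (hypothetical depth range and plane counts).
planes = uniform_depth_hypotheses(425.0, 935.0, 48, 4, 5)
refined = resample_around(planes[24], interval=2.0, num_planes=8)
```

The first stage covers the whole known depth range coarsely; later stages spend the same number of planes on a much narrower band, which is one common way to obtain a finer depth resolution at bounded cost.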
[0020] Steps 6 and 7 above construct complementary constraints in the spatial domain and the frequency domain for the same final depth map. The two are optimized together through a joint loss function (the photometric consistency loss plus the frequency domain-aware contrast consistency loss) to generate the optimized depth result.
[0021] Furthermore, in step 2, the feature pyramid module first extracts multi-scale features from the input image through the feature pyramid network to generate basic feature maps with different receptive fields; then, a hybrid attention module is embedded at each scale level. The hybrid attention module comprises a channel attention module SE (squeeze-and-excitation) and a spatial attention module, and it refines and enhances the features.
[0022] The specific process is as follows:
[0023] Step 2.1: Input the two-dimensional multi-view image into the feature pyramid network. Through stepwise downsampling and convolution operations, a multi-scale feature hierarchy is constructed to generate basic feature representations with different receptive fields. This multi-scale structure is not only used to extract hierarchical information from local details to global semantics, but also provides a unified structural expression basis for subsequent cross-view feature matching.
[0024] Step 2.2: Embed a channel attention mechanism at each scale level of the feature pyramid. The feature maps of each scale are input into the attention module SE, which aggregates global context information along the channel dimension through global average pooling and learns the dependencies between channels through fully connected mappings, generating adaptive channel weights that recalibrate the original features.
[0025] Step 2.3: Input the feature map refined in Step 2.2 into the spatial attention module. The spatial attention module calculates the attention weights in the spatial dimension, highlights the key response positions in the local area along the spatial axis, and adaptively focuses on the context-related area, thereby enhancing the feature consistency and depth information integrity among multiple views.
[0026] After the above steps, the multi-view image features are output. Their resolution is expressed in terms of the width of the multi-view image, the height of the multi-view image, and the number of stages.
[0027] Each of the three stages described above does not independently generate a completely new set of feature maps, but rather optimizes and enhances the features from the previous stage step by step. Specifically, the first stage extracts initial multi-scale features from the original multi-view images; subsequent stages use these features as input, further enhancing feature representation capabilities through attention mechanisms and multi-scale fusion. The final output is still a set of feature representations (N in total) corresponding to each input image. The feature pyramid module based on the attention mechanism adopts a multi-level structure design, which essentially corresponds to the step-by-step construction and fusion process of pyramid features. This invention embeds the hybrid attention mechanism into the multi-scale feature pyramid in a structured manner and makes targeted designs for multi-view geometric consistency, thereby achieving a significant improvement in the stability of cross-view feature matching, rather than a simple enhancement of a single feature representation.
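Steps 2.2 and 2.3 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the network of the invention: the weight matrices `w1` and `w2`, the reduction to `C // r` hidden units, and the spatial attention built from channel-wise mean and max maps with two scalar weights (a simplification of the convolutional form used in CBAM-style modules) are all hypothetical choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_channel_attention(feat, w1, w2):
    """Squeeze-and-excitation on a (C, H, W) feature map.

    w1: (C // r, C) and w2: (C, C // r) are the two fully connected mappings.
    """
    squeezed = feat.mean(axis=(1, 2))          # global average pooling -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)    # first FC layer + ReLU
    weights = sigmoid(w2 @ hidden)             # second FC layer + sigmoid -> (C,)
    return feat * weights[:, None, None]       # recalibrate each channel

def spatial_attention(feat, w_avg, w_max):
    """Simplified spatial attention (assumed form).

    Mixes the channel-wise mean and max maps with two scalar weights and
    rescales every spatial position by the resulting weight in (0, 1).
    """
    avg_map = feat.mean(axis=0)                # (H, W)
    max_map = feat.max(axis=0)                 # (H, W)
    attn = sigmoid(w_avg * avg_map + w_max * max_map)
    return feat * attn[None, :, :]

def hybrid_attention(feat, w1, w2, w_avg=0.5, w_max=0.5):
    """Channel attention followed by spatial attention, as in Steps 2.2 and 2.3."""
    return spatial_attention(se_channel_attention(feat, w1, w2), w_avg, w_max)
```

In a trained network the weights would be learned, and the same block would be applied at every scale level of the pyramid.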
[0028] Furthermore, the photometric consistency loss function is expressed as follows:
[0029] ;
[0030] where the terms denote the original reference view and the source image after inverse warping to the reference view, and M represents the mask used to filter out invalid pixels (such as occluded or boundary areas).
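A masked photometric consistency loss of the kind described here can be sketched as below. The exact expression did not survive in this text, so the L1 difference, the per-view normalization by the number of valid pixels, and the averaging over source views are assumptions made for illustration; the function name is hypothetical.

```python
import numpy as np

def masked_photometric_loss(ref_img, warped_imgs, masks):
    """Average masked L1 difference between the reference view and warped views.

    ref_img:     (H, W) reference image.
    warped_imgs: list of (H, W) source images inversely warped to the reference view.
    masks:       list of (H, W) arrays, 1 for valid pixels and 0 for invalid ones.
    """
    total = 0.0
    for warped, mask in zip(warped_imgs, masks):
        diff = np.abs(ref_img - warped) * mask      # invalid pixels contribute 0
        total += diff.sum() / max(mask.sum(), 1.0)  # mean over valid pixels
    return total / len(warped_imgs)
```

The mask M keeps occluded and out-of-bounds pixels from contributing, which is why the loss alone gives no signal in exactly the regions that the frequency domain term is meant to cover.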
[0031] Furthermore, the frequency domain-aware contrast consistency loss function provides pseudo-supervision signals. The specific optimization steps are as follows:
[0032] First, the final depth map output in step 5 is taken as the anchor sample depth map. A local region of the anchor sample depth map is randomly selected and masked, generating the corresponding augmented sample depth map, which serves as the positive sample. A two-dimensional Fourier transform is then applied to the anchor sample depth map and to the augmented sample depth map, mapping them from the spatial domain to the frequency domain and yielding their frequency domain representations;
[0033] ;
[0034] where the operator denotes the two-dimensional Fourier transform;
[0035] Then, a hard negative sample depth map is constructed by superimposing on the anchor sample depth map random noise that follows a normal distribution with a mean of 0 and a standard deviation of 0.1;
[0036] ;
[0037] where the noise term is a value randomly sampled from a normal distribution with a mean of 0 and a standard deviation of 0.1.
[0038] A two-dimensional Fourier transform is applied to the hard negative sample depth map, and high-frequency components are removed by low-pass filtering so that only the central low-frequency region is retained, yielding a frequency domain representation of the hard negative sample that matches the characteristics of occluded regions;
[0039] ;
[0040] where the filtering operator denotes the operation that preserves the central low-frequency region;
[0041] Next, the frequency domain-aware contrast consistency loss is calculated. The loss uses a norm to compute the frequency domain distance between the anchor sample and the positive sample and the frequency domain distance between the anchor sample and the hard negative sample. The aim is to minimize the former while maximizing the latter.
[0042] ;
[0043] where the coefficient denotes the weight of the frequency domain-aware contrast consistency loss.
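The construction above (masked positive sample, noisy low-pass-filtered hard negative, norm-based distances in the frequency domain) can be sketched in NumPy. The loss expression itself is missing from this text, so several parts are assumptions: the mean absolute (L1-style) spectral distance, the hinge form with a `margin`, the square mask of side `mask_size`, and the `keep_ratio` of the low-pass window. Only the noise statistics (mean 0, standard deviation 0.1) come from the description.

```python
import numpy as np

def shifted_fft2(depth):
    """2D Fourier transform with the zero frequency moved to the centre."""
    return np.fft.fftshift(np.fft.fft2(depth))

def low_pass_center(spectrum, keep_ratio=0.25):
    """Keep only a centred low-frequency window of a shifted 2D spectrum."""
    h, w = spectrum.shape
    kh, kw = max(1, int(h * keep_ratio)), max(1, int(w * keep_ratio))
    top, left = h // 2 - kh // 2, w // 2 - kw // 2
    out = np.zeros_like(spectrum)
    out[top:top + kh, left:left + kw] = spectrum[top:top + kh, left:left + kw]
    return out

def frequency_contrast_loss(depth, rng, mask_size=8, sigma=0.1, margin=1.0, weight=1.0):
    """Assumed triplet-style form: pull anchor/positive together, push negative away."""
    h, w = depth.shape
    # Positive sample: the anchor with one randomly placed local region masked out.
    y = rng.integers(0, h - mask_size + 1)
    x = rng.integers(0, w - mask_size + 1)
    positive = depth.copy()
    positive[y:y + mask_size, x:x + mask_size] = 0.0
    # Hard negative sample: the anchor plus N(0, sigma^2) noise, then low-pass filtered.
    negative = depth + rng.normal(0.0, sigma, size=depth.shape)
    f_anchor = shifted_fft2(depth)
    f_pos = shifted_fft2(positive)
    f_neg = low_pass_center(shifted_fft2(negative))
    d_pos = np.abs(f_anchor - f_pos).mean()   # anchor-to-positive spectral distance
    d_neg = np.abs(f_anchor - f_neg).mean()   # anchor-to-negative spectral distance
    return weight * max(d_pos - d_neg + margin, 0.0)
```

A standard deviation of 0.1 is only meaningful relative to the depth scale, so this sketch implicitly assumes depth values normalized to roughly unit range.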
[0044] Beneficial effects: Compared with the prior art, the present invention has the following advantages:
[0045] (1) This invention provides an unsupervised multi-view stereo (MVS) method that can compute a high-quality point cloud model of an occluded scene from multi-view images without requiring ground-truth depth.
[0046] (2) The present invention employs a feature pyramid module based on an attention mechanism, which can adaptively fuse multi-scale features across views, effectively enhancing the global structural consistency and local geometric details under occlusion conditions.
[0047] (3) This invention introduces a frequency domain-aware contrast consistency strategy and designs a dedicated loss function to provide a pseudo-supervision signal. By explicitly modeling the consistency of multi-view features in the frequency domain, the blurred features of occluded areas are effectively captured.
Description of the Attached Figures
[0049] Figure 1 is a schematic diagram of the overall process of the present invention.
[0050] Figure 2 is a diagram of the overall network structure of the present invention.
[0051] Figure 3 is a sample of the multi-view image data in the embodiment.
[0052] Figure 4 is a sample of the depth map in the embodiment.
[0053] Figure 5 is the final point cloud model output in the embodiment.
[0054] Figure 6 is the point cloud model output by the existing method.
[0055] Figure 7 is the point cloud model output by the existing unsupervised method RC-MVSNet.
[0056] Figure 8 is the point cloud model output by the technical solution of this invention.
Detailed Implementation
[0058] The technical solution of the present invention will be described in detail below, but the scope of protection of the present invention is not limited to the embodiments described.
[0059] As shown in Figure 1 and Figure 2, the unsupervised multi-view stereo method based on frequency domain-aware contrast consistency of the present invention includes the following steps:
[0060] Step 1: Input the multi-view image dataset and the camera parameters;
[0061] The dataset contains N multi-view images, each identified by a sequence number; one image serves as the reference view and the others are source views. The camera parameters comprise the intrinsic parameter matrix of the k-th camera and the rotation and translation matrices between pairs of cameras;
[0062] Step 2: For the multi-view images obtained in Step 1, use the feature pyramid module to extract multi-scale features. The feature pyramid module embeds a hybrid attention module and is applied in a total of three stages; in each stage, this attention-based feature pyramid module extracts the two-dimensional feature maps of the multi-view images. After the three stages, the enhanced multi-view image features are obtained;
[0063] Step 3: Generate depth hypothesis planes from the known depth range of the original multi-view images. The planes are evenly distributed between the minimum and maximum depth values, and each plane is expanded to the same resolution as the multi-view images.
[0064] Step 4: Use a differentiable homography transformation to project the multi-view image features obtained in Step 2 onto the expanded depth hypothesis planes, yielding a set of reprojected features. The reprojected features are then compared with the corresponding features of the reference view to construct the cost volume;
[0065] Then, a 3D convolutional neural network is used to aggregate the cost volume to obtain the probability volume. The probability volume is then normalized to obtain the initial depth map.
[0066] Step 5: Repeat steps 2 to 4 to obtain the final depth map;
[0067] Step 6: During unsupervised training, use the photometric consistency loss to optimize the final depth map obtained in Step 5;
[0068] Step 7: During unsupervised training, use the frequency domain-aware contrast consistency loss to optimize the final depth map obtained in Step 5;
[0069] Step 8: Repeat steps 2 to 7 to calculate the depth maps of all views, fuse the depth maps of all views, and output a globally consistent 3D point cloud model.
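The path in Step 4 from reprojected features to an initial depth map can be sketched as follows. Two parts are assumptions borrowed from common learned MVS practice rather than taken from this text: the comparison is done with a per-plane feature variance across views, and the depth is regressed as the expectation of the depth values under the normalized probability volume. The channel mean is only a stand-in for the 3D convolutional neural network.

```python
import numpy as np

def variance_cost_volume(ref_feat, warped_feats):
    """Variance of features across views for every depth plane.

    ref_feat:     (C, H, W) reference-view features.
    warped_feats: (V, D, C, H, W) source features warped onto D depth planes.
    Returns a cost volume of shape (D, C, H, W).
    """
    d = warped_feats.shape[1]
    ref = np.broadcast_to(ref_feat[None, None], (1, d) + ref_feat.shape)
    stack = np.concatenate([ref, warped_feats], axis=0)   # (V + 1, D, C, H, W)
    return stack.var(axis=0)

def regress_depth(cost_volume, depth_values):
    """Turn a cost volume into a probability volume and an expected depth map."""
    cost = cost_volume.mean(axis=1)            # stand-in for 3D CNN aggregation
    logits = -cost                             # low cost -> high probability
    logits = logits - logits.max(axis=0, keepdims=True)
    prob = np.exp(logits)
    prob /= prob.sum(axis=0, keepdims=True)    # normalize along the depth axis
    depth = (prob * depth_values[:, None, None]).sum(axis=0)
    return depth, prob
```

When the warped features agree with the reference features at one plane, the variance there is smallest, the probability mass concentrates on that plane, and the expected depth lands on its depth value.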
[0070] In step 2 of this embodiment, the feature pyramid module first performs multi-scale feature extraction on the input image through a feature pyramid network to generate basic feature maps with different receptive fields; then, a hybrid attention module is embedded at each scale level, comprising a channel attention module SE (squeeze-and-excitation) and a spatial attention module. The specific process is as follows:
[0071] Step 2.1: Input the two-dimensional multi-view image into the feature pyramid network, and construct multi-scale feature layers through stepwise downsampling and convolution operations to generate basic feature representations with different receptive fields;
[0072] Step 2.2: Embed channel attention mechanism in each scale feature level of the feature pyramid. Input the feature maps of each scale into the attention module SE. The attention module SE aggregates global context information along the channel dimension through global average pooling and learns the dependencies between channels using fully connected mapping to generate adaptive channel weights. Recalibrate the original features to obtain the refined feature map.
[0073] Step 2.3: Input the refined feature map into the spatial attention module. The spatial attention module calculates the attention weights in the spatial dimension, highlights the key response positions in the local area along the spatial axis, and adaptively focuses on the context-related area, thereby enhancing the feature consistency and depth information integrity among multiple views.
[0074] After the above steps, the multi-view image features are output. Their resolution is expressed in terms of the width of the multi-view image, the height of the multi-view image, and the number of stages.
[0075] The photometric consistency loss function described in this embodiment is expressed as follows:
[0076] ;
[0077] where the terms denote the original reference view and the source image after inverse warping to the reference view, and M represents the mask used to filter out invalid pixels.
[0078] The frequency domain-aware contrast consistency loss function described in this embodiment provides pseudo-supervision signals. The specific optimization steps are as follows:
[0079] First, the final depth map output in step 5 is taken as the anchor sample depth map. A local region of the anchor sample depth map is randomly selected and masked, generating the corresponding augmented sample depth map, which serves as the positive sample. A two-dimensional Fourier transform is then applied to the anchor sample depth map and to the augmented sample depth map, mapping them from the spatial domain to the frequency domain and yielding their frequency domain representations;
[0080] ;
[0081] where the operator denotes the two-dimensional Fourier transform;
[0082] Then, a hard negative sample depth map is constructed by superimposing on the anchor sample depth map random noise that follows a normal distribution with a mean of 0 and a standard deviation of 0.1;
[0083] ;
[0084] where the noise term is a value randomly sampled from a normal distribution with a mean of 0 and a standard deviation of 0.1.
[0085] A two-dimensional Fourier transform is applied to the hard negative sample depth map, and high-frequency components are removed by low-pass filtering so that only the central low-frequency region is retained, yielding a frequency domain representation of the hard negative sample that matches the characteristics of occluded regions;
[0086] ;
[0087] where the filtering operator denotes the operation that preserves the central low-frequency region;
[0088] Next, the frequency domain-aware contrast consistency loss is calculated. The loss uses a norm to compute the frequency domain distance between the anchor sample and the positive sample and the frequency domain distance between the anchor sample and the hard negative sample. The aim is to minimize the former while maximizing the latter.
[0089] ;
[0090] where the coefficient denotes the weight of the frequency domain-aware contrast consistency loss.
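For reference, the quantities described in this embodiment can be written compactly in the following notation. Every symbol here is an assumption, since the original symbols and expressions did not survive in this text: D_a, D_p, D_n denote the anchor, augmented (positive), and hard negative depth maps, F_a, F_p, F_n their frequency domain representations, LP the central low-pass operation, lambda the loss weight, and m a margin. The hinge form and the L1 norm in the last line are one plausible choice, not the stated expression.

```latex
\begin{aligned}
F_a &= \mathcal{F}(D_a), \qquad F_p = \mathcal{F}(D_p), \\
D_n &= D_a + \epsilon, \qquad \epsilon \sim \mathcal{N}(0,\ 0.1^2), \\
F_n &= \mathrm{LP}\big(\mathcal{F}(D_n)\big), \\
\mathcal{L}_{\mathrm{freq}} &= \lambda \,\max\!\big(\lVert F_a - F_p \rVert_1 - \lVert F_a - F_n \rVert_1 + m,\ 0\big).
\end{aligned}
```

The first three lines follow directly from the description; only the last line involves a modelling choice.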
[0091] In this embodiment, the unsupervised multi-view stereo method based on frequency domain-aware contrast consistency is applied to the multi-view images shown in Figure 3 and includes the following steps:
[0092] Step S1: Input the multi-view image data and the camera parameters; the multi-view image data comprise one reference view and a number of source views;
[0093] where each multi-view image is identified by its sequence number within the input data, and the camera parameters comprise the intrinsic parameter matrix of the k-th camera and the rotation and translation matrices between pairs of cameras;
[0094] Step S2: For each image in the input data, the attention-based feature pyramid module is used to compute multi-scale features of the multi-view images, and this stage is repeated three times. A hybrid attention module is embedded within the attention-based feature pyramid module.
[0095] In each stage, the attention-based feature pyramid module described above extracts the two-dimensional feature maps of the multi-view images. After three stages, the features of the multi-view images are obtained.
[0096] Step S3: A series of depth hypothesis planes is generated from the known depth range of the input multi-view images. These planes are evenly distributed between the minimum and maximum depth values, and all of them are expanded to the same resolution as the input multi-view images;
[0097] Step S4: Use a differentiable homography transformation to project the multi-view image features obtained in step S2 onto the depth hypothesis planes expanded in step S3, yielding a set of reprojected features. The reprojected features are then compared with the corresponding features of the reference view to construct the cost volume;
[0098] Step S5: Based on the cost volume obtained in step S4, use a 3D convolutional neural network to aggregate the cost volumes to obtain the probability volume, and normalize the probability volume to obtain the initial depth map.
[0099] Step S6: Repeat steps S2 to S5 to obtain the final depth map;
[0100] Step S7: During the unsupervised learning training process, the final depth map obtained in step S6 is optimized using photometric consistency loss.
[0101] Step S8: During the unsupervised learning training process, the final depth map obtained in step S6 is optimized using frequency domain-aware contrast consistency loss.
[0102] Step S9: Repeat steps S2 to S8 to calculate the depth maps of all views, fuse the depth maps of all views, and output a globally consistent 3D point cloud reconstruction result.
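The homography used in step S4 can be sketched for a single fronto-parallel depth plane. The convention is an assumption chosen for this illustration: the reference camera frame is the world frame, a point maps to the source camera as X_src = R X_ref + t, and the plane is z = depth in the reference frame. The function name is hypothetical.

```python
import numpy as np

def plane_homography(K_ref, K_src, R, t, depth):
    """Homography from reference pixels to source pixels for the plane z = depth.

    For points on the plane n^T X = depth with n = (0, 0, 1), the rigid motion
    X_src = R X + t equals (R + t n^T / depth) X, which gives the 3x3 matrix below.
    """
    n = np.array([0.0, 0.0, 1.0])
    return K_src @ (R + np.outer(t, n) / depth) @ np.linalg.inv(K_ref)

def warp_pixel(H, u, v):
    """Apply a homography to one pixel and return inhomogeneous coordinates."""
    p = H @ np.array([u, v, 1.0])
    return p[:2] / p[2]
```

Evaluating this matrix once per depth hypothesis and sampling the source feature map at the warped coordinates yields the reprojected features; the operation is differentiable because it consists only of matrix products and interpolation.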
[0103] This invention first designs a feature pyramid module based on an attention mechanism to extract multi-scale features from multi-view images. Unlike traditional feature pyramid networks that use a top-down path and a fixed fusion strategy, the attention-based feature pyramid module introduces an adaptive weight allocation mechanism during feature extraction. This mechanism can dynamically adjust the contribution of features at different scales based on their importance, significantly improving the discriminative power of feature representation.
[0104] Secondly, to overcome the limitations of spatial domain methods in handling occlusion, this invention innovatively proposes a frequency domain-aware contrast consistency strategy. This strategy explicitly models the consistency of multi-view features in the frequency domain space, utilizing the global structural prior implied in the frequency domain to guide feature matching. Through a specially designed loss function, the model can effectively capture features in occluded areas that become blurred due to texture loss, thus providing a reliable optimization signal even when the photometric consistency assumption fails.
[0105] Ultimately, this invention enables the recovery of more complete and accurate depth maps from multi-view images without the need for real depth labels, thereby achieving high-quality and complete point cloud reconstruction.
[0106] As can be seen from the above embodiments, the overall network structure of the method of the present invention is shown in Figure 2. By fully utilizing the multi-scale adaptive fusion capability of the attention-based feature pyramid module and the global prior constraint of frequency domain-aware contrast consistency, the method effectively addresses the depth estimation bias that arises in existing unsupervised multi-view stereo reconstruction methods when the photometric consistency assumption fails in occluded and texture-deficient regions, and thereby computes an accurate and complete point cloud model.
[0107] This invention first extracts multi-scale feature maps from multi-view input images using a feature pyramid module based on an attention mechanism; then, it aggregates global information to calculate an initial depth map and a confidence map; next, it constructs a frequency-domain perceptual contrast consistency loss in the frequency domain space, using global structural information in the frequency domain to provide robust pseudo-supervision signals for depth estimation, effectively guiding depth optimization of occluded regions; finally, it performs point cloud fusion based on the optimized depth map to obtain a high-quality 3D point cloud model.
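The final point cloud fusion starts from lifting each depth map into 3D. The sketch below shows only that back-projection for one view, assuming the convention X_cam = R X_world + t and a simple validity mask; the cross-view consistency filtering that a full fusion step would normally apply is omitted, and the function name is hypothetical.

```python
import numpy as np

def backproject_depth(depth, K, R, t, valid=None):
    """Lift a depth map to world-space points.

    depth: (H, W) depth along the camera z axis.
    K:     (3, 3) intrinsic matrix; R, t: world-to-camera rotation and translation.
    valid: optional (H, W) boolean mask; defaults to depth > 0.
    Returns an (M, 3) array of world points for the valid pixels.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(np.float64)
    cam = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)   # camera-frame points
    world = R.T @ (cam - t.reshape(3, 1))                   # invert X_cam = R X + t
    if valid is None:
        valid = depth > 0
    return world.T[valid.reshape(-1)]
```

Concatenating the points from all views, after discarding pixels whose depths disagree across views, gives the fused point cloud.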
[0108] In this embodiment, a sample of the input image data is shown in Figure 3, which contains images from four different viewpoints. The resulting depth map is shown in Figure 4, and the point cloud model output in this embodiment is shown in Figure 5. If the existing technology is used to process the images of Figure 3, the output is as shown in Figure 6. The point cloud model reconstructed from the image data in this embodiment has a high degree of geometric consistency with the real scene.
[0109] The final experimental results are shown in Figure 8. For the same view sample as in Figure 3, processing with the existing technique RC-MVSNet produces the result shown in Figure 7, in which large holes exist in regions such as the roof, whereas the output of the method provided by this invention is complete in regions such as the "roof" and "windows".
[0110] As described above, although the invention has been shown and described with reference to specific preferred embodiments, it should not be construed as limiting the invention itself. Various changes in form and detail may be made without departing from the spirit and scope of the invention as defined in the appended claims.
Claims
1. An unsupervised multi-view stereo method based on frequency domain-aware contrast consistency, characterized in that it comprises the following steps:
Step 1: Input a multi-view image dataset and camera parameters. The dataset consists of a number of multi-view images, each identified by a sequence number, of which one is the reference view and the others are source views. The camera parameters comprise the intrinsic parameter matrix of the k-th camera and the rotation and translation matrices between pairs of cameras;
Step 2: For the multi-view images obtained in Step 1, use the feature pyramid module, in which a hybrid attention module is embedded, to extract multi-scale features. A total of three stages are used; in each stage, the feature pyramid module extracts two-dimensional feature maps of the multi-view images on the basis of the previous stage. After three stages, the enhanced multi-view image features are obtained;
Step 3: Generate depth hypothesis planes based on the known depth range of the original multi-view images. The depth hypothesis planes are uniformly distributed over the interval between the minimum and maximum depth values and are extended to the same resolution as the multi-view images;
Step 4: Use differentiable homography transformation to project the multi-view image features obtained in Step 2 onto the extended depth hypothesis planes, yielding a set of reprojected features. The reprojected features are then compared with the corresponding features of the reference view to construct the cost volume. A 3D convolutional neural network then aggregates the cost volume to obtain the probability volume, and the probability volume is normalized to obtain the initial depth map;
Step 5: Repeat Steps 2 to 4 to obtain the final depth map;
Step 6: During unsupervised training, use the photometric consistency loss to optimize the final depth map obtained in Step 5;
Step 7: During unsupervised training, use the frequency domain-aware contrast consistency loss to optimize the final depth map obtained in Step 5;
Step 8: Repeat Steps 2 to 7 to calculate the depth maps of all views, fuse the depth maps of all views, and output a globally consistent 3D point cloud model.
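For illustration only, the depth hypothesis generation of Step 3 and the normalization of the probability volume in Step 4 can be sketched as follows. The function names, the depth range of 425 to 935, and the use of a softmax followed by an expected-depth regression are assumptions made for this sketch; the claim states only that the planes are uniformly distributed and that the probability volume is normalized.

```python
import numpy as np

def depth_hypotheses(d_min, d_max, num_planes, height, width):
    """Uniformly spaced depth planes, extended to image resolution (D, H, W)."""
    planes = np.linspace(d_min, d_max, num_planes, dtype=np.float64)
    return np.broadcast_to(planes[:, None, None], (num_planes, height, width))

def regress_depth(cost_logits, hypotheses):
    """Assumed normalization: softmax over the depth axis, then expected depth."""
    shifted = cost_logits - cost_logits.max(axis=0, keepdims=True)  # numerical stability
    prob = np.exp(shifted)
    prob /= prob.sum(axis=0, keepdims=True)   # probability volume sums to 1 per pixel
    return (prob * hypotheses).sum(axis=0)    # (H, W) depth map

# Hypothetical values: 4 planes over a 2x3 image.
hyp = depth_hypotheses(425.0, 935.0, 4, 2, 3)
logits = np.zeros((4, 2, 3))                  # uninformative aggregated cost
depth = regress_depth(logits, hyp)            # falls at the middle of the depth range
```

With uninformative costs every plane is equally likely, so the regressed depth sits at the midpoint of the range; a sharply peaked cost pulls the depth to the corresponding plane.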
2. The unsupervised multi-view stereo method based on frequency domain-aware contrast consistency according to claim 1, characterized in that, in Step 2, the feature pyramid module first extracts multi-scale features from the input images using a feature pyramid network to generate basic feature maps with different receptive fields, and a hybrid attention module is then embedded at each scale level. The hybrid attention module includes a channel attention module SE and a spatial attention module. The specific process is as follows:
Step 2.1: Input the two-dimensional multi-view images into the feature pyramid network, and construct multi-scale feature levels through stepwise downsampling and convolution operations to generate basic feature representations with different receptive fields;
Step 2.2: Embed a channel attention mechanism at each scale level of the feature pyramid. The feature maps of each scale are input into the attention module SE, which aggregates global context information along the channel dimension through global average pooling and learns the dependencies between channels through fully connected mappings to generate adaptive channel weights. These weights recalibrate the original features to obtain the refined feature maps;
Step 2.3: Input the refined feature maps into the spatial attention module. The spatial attention module calculates attention weights in the spatial dimension, highlights the key response positions within local regions, and adaptively focuses on context-related regions, thereby enhancing feature consistency and the integrity of depth information across the multiple views.
After the above steps, the multi-view image features are output, with resolutions expressed in terms of the width of the multi-view image, the height of the multi-view image, and the number of stages.
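Steps 2.2 and 2.3 can be sketched, under stated assumptions, as a channel recalibration followed by a spatial reweighting. The weight matrices `w1`, `w2` and the two-element `mix` vector are hypothetical untrained parameters; the spatial branch below combines channel-wise mean and max maps with a weighted sum as a stand-in for a learned convolution, which the claim does not specify.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_channel_attention(feat, w1, w2):
    """feat: (C, H, W); w1: (C//r, C); w2: (C, C//r)."""
    squeezed = feat.mean(axis=(1, 2))          # global average pooling -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)    # fully connected + ReLU
    weights = sigmoid(w2 @ hidden)             # adaptive channel weights in (0, 1)
    return feat * weights[:, None, None]       # recalibrate the original features

def spatial_attention(feat, mix):
    """Assumed form: sigmoid of a weighted sum of channel-wise mean and max maps."""
    avg_map = feat.mean(axis=0)                # (H, W)
    max_map = feat.max(axis=0)                 # (H, W)
    attn = sigmoid(mix[0] * avg_map + mix[1] * max_map)
    return feat * attn[None, :, :]             # reweight every spatial position

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 5, 2                        # hypothetical sizes and reduction ratio
feat = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))
refined = se_channel_attention(feat, w1, w2)
enhanced = spatial_attention(refined, np.array([0.5, 0.5]))
```

Both attention weights lie between 0 and 1, so the modules only rescale responses and never change the shape of the feature map.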
3. The unsupervised multi-view stereo method based on frequency domain-aware contrast consistency according to claim 1, characterized in that the photometric consistency loss function is expressed in terms of the original reference view, each warped image, and M, where M denotes the mask used to filter out invalid pixels.
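As a rough illustration of how such a mask enters a photometric loss, the sketch below averages the absolute difference between the reference view and each warped image over valid pixels only. This masked L1 form, the function name, and the toy images are assumptions for this example and are not taken from the claim's expression.

```python
import numpy as np

def masked_photometric_loss(ref, warped_list, mask_list, eps=1e-8):
    """Mean absolute difference over valid pixels, averaged over warped views."""
    total = 0.0
    for warped, mask in zip(warped_list, mask_list):
        diff = np.abs(ref - warped) * mask        # invalid pixels contribute 0
        total += diff.sum() / (mask.sum() + eps)  # normalize by valid pixel count
    return total / len(warped_list)

# Hypothetical 4x4 grayscale example: the left half is valid, the right half is not.
ref = np.ones((4, 4))
warped = ref + 0.5
warped[:, 2:] = 100.0                             # garbage outside the valid region
mask = np.zeros((4, 4))
mask[:, :2] = 1.0
loss = masked_photometric_loss(ref, [warped], [mask])
```

The large values in the masked-out half leave the loss unchanged, which is the purpose of filtering invalid pixels.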
4. The unsupervised multi-view stereo method based on frequency domain-aware contrast consistency according to claim 1, characterized in that the frequency domain-aware contrast consistency loss function provides pseudo-supervision signals, and the specific optimization steps are as follows:
First, the final depth map output in Step 5 is taken as the anchor sample depth map, and a local region is randomly selected from the anchor sample depth map and masked to generate the corresponding augmented sample depth map, which serves as the positive sample;
Subsequently, a two-dimensional Fourier transform is applied to the anchor sample depth map and to the augmented sample depth map, mapping each from the spatial domain to the frequency domain and yielding the corresponding frequency domain representations;
Then, a hard negative sample depth map is constructed by superimposing on the anchor sample depth map random noise sampled from a normal distribution with a mean of 0 and a standard deviation of 0.1;
A two-dimensional Fourier transform is performed on the hard negative sample depth map, and high-frequency components are removed by low-pass filtering so that only the central low-frequency region is retained, thereby obtaining a frequency domain representation of the hard negative sample that matches the characteristics of occluded regions;
Next, the frequency domain-aware contrast consistency loss is calculated. The loss uses a norm to compute the frequency domain distance between the anchor sample and the positive sample and the frequency domain distance between the anchor sample and the hard negative sample, with the aim of minimizing the former while maximizing the latter, and is scaled by the weight assigned to the frequency domain-aware contrast consistency loss.
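The sample construction described in claim 4 can be sketched as below. The patch size, the low-pass radius, the mean-absolute-value distance between spectra, and the ratio used to combine the two distances are all assumptions for this example; the claim specifies only the masking, the noise with standard deviation 0.1, the low-pass filtering, and the minimize/maximize objective.

```python
import numpy as np

def mask_random_patch(depth, patch, rng):
    """Positive sample: zero out one randomly placed patch of the anchor."""
    h, w = depth.shape
    top = rng.integers(0, h - patch + 1)
    left = rng.integers(0, w - patch + 1)
    out = depth.copy()
    out[top:top + patch, left:left + patch] = 0.0
    return out

def centered_spectrum(depth):
    """2D Fourier transform with the zero frequency shifted to the center."""
    return np.fft.fftshift(np.fft.fft2(depth))

def keep_low_frequencies(spectrum, radius):
    """Low-pass filter: retain only the central low-frequency square."""
    h, w = spectrum.shape
    cy, cx = h // 2, w // 2
    out = np.zeros_like(spectrum)
    rows = slice(cy - radius, cy + radius + 1)
    cols = slice(cx - radius, cx + radius + 1)
    out[rows, cols] = spectrum[rows, cols]
    return out

def frequency_contrast_loss(anchor, rng, patch=8, sigma=0.1, radius=4,
                            weight=1.0, eps=1e-8):
    positive = mask_random_patch(anchor, patch, rng)
    negative = anchor + rng.normal(0.0, sigma, size=anchor.shape)  # hard negative
    f_anchor = centered_spectrum(anchor)
    f_pos = centered_spectrum(positive)
    f_neg = keep_low_frequencies(centered_spectrum(negative), radius)
    d_pos = np.abs(f_anchor - f_pos).mean()   # anchor-to-positive distance
    d_neg = np.abs(f_anchor - f_neg).mean()   # anchor-to-negative distance
    # Assumed combination: small when d_pos is small and d_neg is large.
    return weight * d_pos / (d_neg + eps)

# Hypothetical 32x32 anchor depth map with strictly positive values.
anchor = 1.0 + np.random.default_rng(0).random((32, 32))
loss = frequency_contrast_loss(anchor, np.random.default_rng(1))
```

A ratio is only one way to pull the positive closer while pushing the negative away; a margin-based difference of the two distances would serve the same stated aim.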