Unsupervised multi-view stereo method based on frequency domain perceptual contrast consistency

By employing an unsupervised multi-view stereo method based on frequency domain-aware contrast consistency, and utilizing a feature pyramid module and a frequency domain loss function to optimize depth estimation, the problem of occluded region reconstruction is solved, generating a high-quality point cloud model and achieving high-precision 3D reconstruction.

CN122289554A · Pending · Publication Date: 2026-06-26 · ANHUI UNIV


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: ANHUI UNIV
Filing Date: 2026-04-03
Publication Date: 2026-06-26

Application Information

Patent Timeline

03 Apr 2026: Application
26 Jun 2026: Publication (CN122289554A)

IPC: G06T17/00; G06T7/593; G06T7/80; G06V10/52; G06V10/77; G06T5/50; G06V10/82; G06N3/045; G06N3/0464; G06N3/088

AI Tagging

Technology Topics

Point cloud; Confidence map

Technical Efficacy Phrases

Improve consistency; Enhance local geometric details


AI Technical Summary

Technical Problem

Existing unsupervised multi-view stereo reconstruction methods suffer from limited reconstruction accuracy in occluded areas due to the failure of the photometric consistency assumption, making it difficult to generate high-quality point cloud models. Furthermore, they fail to fully utilize the semantic information of deep features and the detailed information of shallow features, and lack effective utilization of the prior knowledge of the global structure in the frequency domain.

Method used

An unsupervised multi-view stereo method based on frequency domain perception contrast consistency is adopted. Multi-scale features are extracted through the feature pyramid module, and the feature consistency is enhanced by combining the hybrid attention module. A loss function is constructed in the frequency domain to provide pseudo-supervisory signals, optimize depth estimation, and generate a high-quality point cloud model.

Benefits of technology

High-quality point cloud models of occluded scenes were generated without the need for real depth labels, improving the global structural consistency and local geometric details of the occluded areas and achieving high-precision point cloud reconstruction.


Patent Text Reader

Abstract

This invention discloses an unsupervised multi-view stereo method based on frequency domain-aware contrast consistency. It uses an attention-based feature pyramid module to extract multi-scale features from multi-view images; aggregates global information to calculate depth maps and confidence maps; constructs positive and hard-negative samples based on anchor point depth maps and maps them to the frequency domain, establishing frequency domain-aware contrast consistency constraints; and provides pseudo-supervisory signals by calculating frequency domain-aware contrast consistency loss to optimize the depth maps. Finally, it fuses the multi-view depth maps to obtain a high-precision point cloud model. This invention fully utilizes the frequency domain contrast learning mechanism, effectively mitigating the problem of photometric consistency failure in occluded scenes by modeling the inherent consistency between multiple views in the frequency domain, thereby significantly improving the accuracy and robustness of depth estimation and ultimately achieving a high-quality, complete point cloud model. This invention can generate a high-precision point cloud model by processing multi-view images without relying on real depth information.


Description

Technical Field

[0001] This invention relates to the fields of multi-view 3D reconstruction, computer graphics, and computer vision, and specifically to an unsupervised multi-view stereo method based on frequency domain-aware contrast consistency.

Background Technology

[0002] Multi-view Stereo (MVS) is a technique for reconstructing a dense 3D point cloud model of a scene from multiple viewpoint images. Traditional MVS methods mainly rely on prior knowledge, multi-view photometric consistency, and regularization techniques to improve the quality of the reconstructed point cloud. However, due to challenges such as image quality variations and illumination interference, traditional MVS methods often struggle to generate high-precision and complete point clouds.

[0003] In recent years, deep learning-based MVS methods have made substantial progress. Based on the data format of the output reconstruction results, these methods can generally be categorized as follows: (1) voxel-based methods; (2) implicit representation-based methods; and (3) depth map fusion-based methods. Supervised learning-based MVS methods rely on large-scale labeled datasets to train neural networks for depth map estimation, which are then fused into a dense point cloud model. However, the performance and generalization ability of supervised MVS methods are often limited by the need for large-scale, high-quality labeled data. Therefore, unsupervised MVS methods are increasingly attracting research attention. These methods reduce the dependence on ground truth depth by employing effective loss functions and optimization strategies, and can generate high-quality point clouds.

[0004] Existing unsupervised methods are primarily based on the photometric consistency assumption, which states that the photometric properties (such as grayscale or color) of the same 3D point remain relatively stable when viewed from different viewpoints or in adjacent frames. However, in occluded scenes, mutual occlusion between objects prevents some viewpoints from obtaining accurate surface information of the target area, resulting in pixel differences between views. Therefore, the photometric consistency assumption fails, severely reducing the accuracy of depth estimation.

[0005] Some methods attempt to enhance the model's inference ability regarding occluded regions by introducing auxiliary geometric constraints or rendering consistency, thereby mitigating the reliance on pure photometric consistency. While these methods alleviate the problem of photometric consistency assumption failure to some extent by utilizing different feature priors, they are still inherently limited by local feature representations in the spatial domain. Therefore, under large-scale occlusion, these methods still struggle to generate high-quality depth maps, resulting in incomplete point cloud reconstructions and even large holes.

[0006] In other words, existing unsupervised multi-view stereo reconstruction methods still face the following challenges when performing depth estimation: (1) they are based on the photometric consistency assumption, which fails between viewpoints in occluded areas, so reconstruction accuracy is limited and the requirements of complex real-world scenes are difficult to meet; (2) they fail to fully combine the semantic information of deep features with the detailed information of shallow features, so depth estimation is not robust in occluded and texture-deficient areas; (3) they are limited to spatial domain optimization and make no effective use of the global structural prior in the frequency domain, so their ability to recover the geometry of occluded areas is insufficient.

Summary of the Invention

[0007] Purpose of the invention: The purpose of this invention is to address the shortcomings of existing technologies and provide an unsupervised multi-view stereo method based on frequency domain-aware contrast consistency.

[0008] Technical solution: The present invention provides an unsupervised multi-view stereo method based on frequency domain-aware contrast consistency, characterized by comprising the following steps:

[0009] Step 1: Input the multi-view image dataset and the camera parameters;

[0010] The dataset contains N multi-view images, indexed by sequence number, comprising one reference view and the remaining source views; the camera parameters comprise the intrinsic parameter matrix of each camera (for the k-th camera) and the rotation and translation matrices between pairs of cameras;

[0011] Step 2:

[0012] For the multi-view images obtained in Step 1, a feature pyramid module with an embedded hybrid attention module is used to extract multi-scale features. Three stages are used in total; in each stage, the feature pyramid module extracts two-dimensional feature maps of the multi-view images, building on the output of the previous stage. After three stages, the enhanced multi-view image features are obtained;

[0013] Step 3: Generate depth hypothesis planes based on the known depth range of the original multi-view images. The depth hypothesis planes are evenly distributed between the minimum and maximum depth values, and each plane is extended to the same resolution as the multi-view images.

[0014] Step 4: Use a differentiable homography transformation to project the multi-view image features obtained in Step 2 onto the extended depth hypothesis planes, yielding a set of reprojected features. The reprojected features are then compared with the corresponding features of the reference view to construct the cost volume;

[0015] Then, a 3D convolutional neural network is used to aggregate the cost volume to obtain the probability volume. The probability volume is then normalized to obtain the initial depth map. The initial depth map here is the result of a single-stage estimation.

[0016] Step 5: Repeat steps 2 to 4 to obtain the final depth map. The final depth map here is a high-precision depth estimation result obtained through multi-stage progressive refinement and local depth resampling optimization in the cascaded structure.

[0017] Step 6: During the unsupervised learning training process, use photometric consistency loss to optimize the final depth map obtained in step 5;

[0018] Step 7: During the unsupervised learning training process, the final depth map obtained in Step 5 is optimized using frequency domain-aware contrast consistency loss.

[0019] Step 8: Repeat steps 2 to 7 to calculate the depth maps of all views, fuse the depth maps of all views, and output a globally consistent 3D point cloud model.

[0020] Steps 6 and 7 above construct complementary constraints from both the spatial and frequency domains for the same final depth map, and perform collaborative optimization through a joint loss function (i.e., photometric consistency loss + frequency domain perceptual contrast consistency loss) to finally generate the optimized depth result.
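Steps 3 and 4 can be illustrated with a minimal sketch for a single pixel. The text only states that the planes are evenly spaced and that the probability volume is normalized; the softmax followed by an expected-depth regression used here is an assumption borrowed from common learning-based MVS practice, and the names `depth_hypotheses` and `regress_depth`, the depth range, and the use of the peak probability as a confidence value are all hypothetical.

```python
import math

def depth_hypotheses(d_min, d_max, num_planes):
    """Return num_planes depth values evenly spaced over [d_min, d_max]."""
    if num_planes < 2:
        raise ValueError("need at least two depth planes")
    step = (d_max - d_min) / (num_planes - 1)
    return [d_min + i * step for i in range(num_planes)]

def regress_depth(scores, depths):
    """Normalize per-plane matching scores and regress one depth value.

    scores: one aggregated matching score per plane (higher = better match).
    Assumption: softmax normalization followed by the expected depth.
    """
    m = max(scores)                                  # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    depth = sum(p * d for p, d in zip(probs, depths))
    confidence = max(probs)                          # assumed confidence measure
    return depth, confidence

# Illustrative depth range and scores, not values from the patent.
planes = depth_hypotheses(425.0, 935.0, 4)
depth, conf = regress_depth([0.0, 5.0, 0.0, 0.0], planes)
```

In a cascaded setting, later stages would resample a narrower set of planes around the depth regressed here; that refinement is not shown.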

[0021] Furthermore, in step 2, the feature pyramid module first extracts multi-scale features from the input image through the feature pyramid network to generate basic feature maps with different receptive fields; then, a hybrid attention module is embedded at each scale level. The hybrid attention module comprises a squeeze-and-excitation (SE) channel attention module and a spatial attention module, and the features are refined and enhanced by passing through both.

[0022] The specific process is as follows:

[0023] Step 2.1: Input the two-dimensional multi-view image into the feature pyramid network. Through stepwise downsampling and convolution operations, a multi-scale feature hierarchy is constructed to generate basic feature representations with different receptive fields. This multi-scale structure is not only used to extract hierarchical information from local details to global semantics, but also provides a unified structural expression basis for subsequent cross-view feature matching.

[0024] Step 2.2: Embed a channel attention mechanism at each scale level of the feature pyramid. The feature maps of each scale are input into the SE attention module, which aggregates global context information along the channel dimension through global average pooling and uses fully connected mappings to learn the dependencies between channels, generating adaptive channel weights that recalibrate the original features.

[0025] Step 2.3: Input the feature map refined in Step 2.2 into the spatial attention module. The spatial attention module calculates the attention weights in the spatial dimension, highlights the key response positions in the local area along the spatial axis, and adaptively focuses on the context-related area, thereby enhancing the feature consistency and depth information integrity among multiple views.

[0026] After the above steps, the multi-view image features are output. Their resolution is expressed in terms of the width and the height of the multi-view images and the number of stages.

[0027] Each of the three stages described above does not independently generate a completely new set of feature maps, but rather optimizes and enhances the features from the previous stage step by step. Specifically, the first stage extracts initial multi-scale features from the original multi-view images; subsequent stages use these features as input, further enhancing feature representation capabilities through attention mechanisms and multi-scale fusion. The final output is still a set of feature representations (N in total) corresponding to each input image. The feature pyramid module based on the attention mechanism adopts a multi-level structure design, which essentially corresponds to the step-by-step construction and fusion process of pyramid features. This invention embeds the hybrid attention mechanism into the multi-scale feature pyramid in a structured manner and makes targeted designs for multi-view geometric consistency, thereby achieving a significant improvement in the stability of cross-view feature matching, rather than a simple enhancement of a single feature representation.
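The hybrid attention of Steps 2.2 and 2.3 can be sketched on a tiny C x H x W feature map held in nested lists. This is a simplified illustration, not the patented network: the fully connected weights `w1` and `w2` are hand-picked placeholders, and the spatial branch replaces the learned convolution that such modules typically use with a fixed combination of the channel-wise mean and maximum.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def se_channel_attention(feat, w1, w2):
    """Squeeze-and-excitation on a C x H x W feature map.

    w1 (C_r x C) and w2 (C x C_r) are the two fully connected weight
    matrices; here they are placeholders, not trained values.
    """
    # Squeeze: global average pooling gives one value per channel.
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feat]
    # Excitation: FC -> ReLU -> FC -> sigmoid yields per-channel weights.
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w1]
    weights = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]
    # Recalibration: scale every channel by its weight.
    out = [[[v * s for v in row] for row in ch] for ch, s in zip(feat, weights)]
    return out, weights

def spatial_attention(feat):
    """Scale every pixel by a weight derived from channel-wise mean and max.

    Simplification: a fixed sum of mean and max stands in for a learned filter.
    """
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])
    out = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for y in range(H):
        for x in range(W):
            vals = [feat[c][y][x] for c in range(C)]
            a = sigmoid(sum(vals) / C + max(vals))
            for c in range(C):
                out[c][y][x] = feat[c][y][x] * a
    return out

feat = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]]]
w1 = [[1.0, 1.0]]            # C_r = 1 hidden unit
w2 = [[1.0], [-1.0]]
recalibrated, weights = se_channel_attention(feat, w1, w2)
refined = spatial_attention(recalibrated)
```

The channel branch runs first and the spatial branch second, matching the order of Steps 2.2 and 2.3.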

[0028] Furthermore, the photometric consistency loss function is expressed as follows:

[0029] ;

[0030] where the terms denote the original reference view and the i-th source image after inverse warping to the reference view, and M represents the mask used to filter out invalid pixels (such as occluded or boundary areas).
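As a rough illustration of a masked photometric loss of this kind, the sketch below averages the absolute difference between the reference image and each warped source image over the pixels the mask marks as valid. The L1 difference, the per-view averaging, and the name `photometric_loss` are assumptions made for the example; the patent's exact expression is not reproduced here.

```python
def photometric_loss(ref, warped_views, masks):
    """Sum over views of the mean masked L1 difference to the reference.

    ref:          H x W reference image (nested lists of floats)
    warped_views: list of H x W source images already warped to the reference
    masks:        list of H x W masks, 1 for valid pixels and 0 for invalid
    """
    total = 0.0
    for warped, mask in zip(warped_views, masks):
        diff_sum = 0.0
        valid = 0
        for r_row, w_row, m_row in zip(ref, warped, mask):
            for r, w, m in zip(r_row, w_row, m_row):
                if m:                      # skip occluded or out-of-bounds pixels
                    diff_sum += abs(r - w)
                    valid += 1
        if valid:
            total += diff_sum / valid
    return total

ref = [[1.0, 2.0], [3.0, 4.0]]
warped = [[1.0, 2.0], [3.0, 9.0]]
loss_masked = photometric_loss(ref, [warped], [[[1, 1], [1, 0]]])
loss_full = photometric_loss(ref, [warped], [[[1, 1], [1, 1]]])
```

Masking the one mismatching pixel removes its contribution, which is the role M plays when a region is occluded in the source view.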

[0031] Furthermore, the frequency domain-aware contrast consistency loss function provides pseudo-supervision signals. The specific optimization steps are as follows:

[0032] First, the final depth map output in step 5 is taken as the anchor sample depth map. A local area is randomly selected from the anchor sample depth map and masked, generating the corresponding augmented (positive) sample depth map. The anchor sample depth map and the augmented sample depth map are then each subjected to a two-dimensional Fourier transform, mapping them from the spatial domain to the frequency domain and yielding their frequency domain representations;

[0033] ;

[0034] where the operator denotes the two-dimensional Fourier transform;

[0035] Then, the hard negative sample depth map is constructed by superimposing on the anchor sample depth map random noise that follows a normal distribution with a mean of 0 and a standard deviation of 0.1;

[0036] ;

[0037] where the noise term is a value randomly sampled from a normal distribution with a mean of 0 and a standard deviation of 0.1.

[0038] A two-dimensional Fourier transform is performed on the hard negative sample depth map, and high-frequency components are removed by low-pass filtering, retaining only the central low-frequency region, thereby obtaining a frequency domain representation of the hard negative sample that matches the characteristics of the occluded region;

[0039] ;

[0040] where the filtering operator denotes the operation that preserves the central low-frequency region;
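The sample construction described so far can be sketched on a toy 4 x 4 depth map using only the standard library. Several details are assumptions: masked pixels are set to zero, the masked patch is a 2 x 2 square, and the low-pass filter keeps a square band of radius 1 around the zero frequency. The helper names (`dft2`, `mask_patch`, `add_noise`, `low_pass`) are hypothetical, and the naive transform is only practical for tiny inputs.

```python
import cmath
import math
import random

def dft2(img):
    """Naive 2D discrete Fourier transform of an H x W real-valued map."""
    H, W = len(img), len(img[0])
    out = [[0j] * W for _ in range(H)]
    for u in range(H):
        for v in range(W):
            s = 0j
            for y in range(H):
                for x in range(W):
                    s += img[y][x] * cmath.exp(-2j * math.pi * (u * y / H + v * x / W))
            out[u][v] = s
    return out

def mask_patch(depth, top, left, size):
    """Positive sample: zero out a size x size patch (assumed masking value)."""
    return [[0.0 if top <= y < top + size and left <= x < left + size else d
             for x, d in enumerate(row)] for y, row in enumerate(depth)]

def add_noise(depth, rng, sigma=0.1):
    """Hard negative depth map: add N(0, sigma^2) noise to every pixel."""
    return [[d + rng.gauss(0.0, sigma) for d in row] for row in depth]

def low_pass(spec, radius):
    """Keep only low frequencies of an unshifted spectrum.

    Index u has wrapped distance min(u, H - u) from the zero frequency, so
    this equals keeping a centered square after shifting the spectrum.
    """
    H, W = len(spec), len(spec[0])
    return [[c if min(u, H - u) <= radius and min(v, W - v) <= radius else 0j
             for v, c in enumerate(row)] for u, row in enumerate(spec)]

rng = random.Random(0)
anchor = [[1.0 + 0.1 * (x + y) for x in range(4)] for y in range(4)]
positive = mask_patch(anchor, rng.randrange(3), rng.randrange(3), 2)
f_anchor = dft2(anchor)
f_pos = dft2(positive)
f_neg = low_pass(dft2(add_noise(anchor, rng)), 1)
```

The order of operations for the hard negative follows the text: noise is added in the spatial domain first, and filtering is applied after the transform.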

[0041] Next, the frequency domain-aware contrast consistency loss is calculated. The loss uses a norm to calculate the frequency domain distance between the anchor sample and the positive sample, and between the anchor sample and the hard negative sample, respectively. The aim is to minimize the frequency domain distance between the anchor and the positive sample while maximizing the frequency domain distance between the anchor and the hard negative sample.

[0042] ;

[0043] where the coefficient denotes the weight of the frequency domain-aware contrast consistency loss.
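One possible way to turn the two distances into a single loss is sketched below. The form used here, a weighted ratio of the anchor-positive distance to the anchor-negative distance under a mean L1 norm, is an assumption chosen because it decreases when the positive moves closer and when the negative moves further away; the patent's own expression and choice of norm may differ. The names `spectral_l1` and `freq_contrast_loss` and the parameters `weight` and `eps` are hypothetical.

```python
def spectral_l1(a, b):
    """Mean L1 distance between two complex spectra of equal H x W shape."""
    n = len(a) * len(a[0])
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n

def freq_contrast_loss(f_anchor, f_pos, f_neg, weight=1.0, eps=1e-7):
    """Assumed ratio-style contrast loss in the frequency domain.

    Small when the positive is close to the anchor and the hard negative
    is far from it; eps avoids division by zero.
    """
    d_pos = spectral_l1(f_anchor, f_pos)
    d_neg = spectral_l1(f_anchor, f_neg)
    return weight * d_pos / (d_neg + eps)

# Hand-made 2 x 2 spectra for illustration only.
f_a = [[4 + 0j, 1j], [0j, 2 + 0j]]
f_p = [[4 + 0j, 1j], [0j, 1 + 0j]]
f_n = [[0j, 0j], [0j, 0j]]
loss = freq_contrast_loss(f_a, f_p, f_n)
```

A margin-based (triplet) form would serve the same purpose; the ratio form is used here only because it needs no extra hyperparameter beyond the weight.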

[0044] Beneficial effects: Compared with the prior art, the present invention has the following advantages:

[0045] (1) This invention provides an unsupervised multi-view stereo (MVS) method that can calculate a high-quality point cloud model of an occluded scene from a multi-view image without requiring real depth.

[0046] (2) The present invention employs a feature pyramid module based on an attention mechanism, which can adaptively fuse multi-scale features across views, effectively enhancing the global structural consistency and local geometric details under occlusion conditions.

[0047] (3) This invention introduces a frequency domain-aware contrast consistency strategy and designs a dedicated loss function to provide a pseudo-supervision signal. By explicitly modeling the consistency of multi-view features in the frequency domain, the blurred features of the occluded area are effectively captured.

Attached Figure Description

[0049] Figure 1 is a schematic diagram of the overall process of the present invention.

[0050] Figure 2 is a diagram of the overall network structure of the present invention.

[0051] Figure 3 is a sample of multi-view image data in the embodiment.

[0052] Figure 4 is a sample of the depth map in the embodiment.

[0053] Figure 5 is the final point cloud model output for the embodiment.

[0054] Figure 6 is the point cloud model output by the existing method.

[0055] Figure 7 is a point cloud model output by the existing unsupervised method RC-MVSNet.

[0056] Figure 8 is the point cloud model output by the technical solution of this invention.

Detailed Implementation

[0058] The technical solution of the present invention will be described in detail below, but the scope of protection of the present invention is not limited to the embodiments described.

[0059] As shown in Figure 1 and Figure 2, the unsupervised multi-view stereo method based on frequency domain-aware contrast consistency of the present invention includes the following steps:

[0060] Step 1: Input the multi-view image dataset and the camera parameters;

[0061] The dataset contains N multi-view images, indexed by sequence number, comprising one reference view and the remaining source views; the camera parameters comprise the intrinsic parameter matrix of each camera (for the k-th camera) and the rotation and translation matrices between pairs of cameras;

[0062] Step 2: For the multi-view images obtained in Step 1, a feature pyramid module with an embedded hybrid attention module is used to extract multi-scale features. Three stages are used in total; in each stage, the feature pyramid module extracts two-dimensional feature maps of the multi-view images, building on the output of the previous stage. After three stages, the enhanced multi-view image features are obtained;

[0063] Step 3: Generate depth hypothesis planes based on the known depth range of the original multi-view images. The depth hypothesis planes are evenly distributed between the minimum and maximum depth values, and each plane is extended to the same resolution as the multi-view images.

[0064] Step 4: Use a differentiable homography transformation to project the multi-view image features obtained in Step 2 onto the extended depth hypothesis planes, yielding a set of reprojected features. The reprojected features are then compared with the corresponding features of the reference view to construct the cost volume;

[0065] Then, a 3D convolutional neural network is used to aggregate the cost volume to obtain the probability volume. The probability volume is then normalized to obtain the initial depth map.

[0066] Step 5: Repeat steps 2 to 4 to obtain the final depth map;

[0067] Step 6: During the unsupervised learning training process, use the photometric consistency loss to optimize the final depth map obtained in step 5;

[0068] Step 7: During the unsupervised learning training process, optimize the final depth map obtained in Step 5 using the frequency domain-aware contrast consistency loss;

[0069] Step 8: Repeat steps 2 to 7 to calculate the depth maps of all views, fuse the depth maps of all views, and output a globally consistent 3D point cloud model.

[0070] In step 2 of this embodiment, the feature pyramid module first performs multi-scale feature extraction on the input image through a feature pyramid network to generate basic feature maps with different receptive fields; then, a hybrid attention module is embedded at each scale level, which comprises a squeeze-and-excitation (SE) channel attention module and a spatial attention module. The specific process is as follows:

[0071] Step 2.1: Input the two-dimensional multi-view image into the feature pyramid network, and construct multi-scale feature layers through stepwise downsampling and convolution operations to generate basic feature representations with different receptive fields;

[0072] Step 2.2: Embed channel attention mechanism in each scale feature level of the feature pyramid. Input the feature maps of each scale into the attention module SE. The attention module SE aggregates global context information along the channel dimension through global average pooling and learns the dependencies between channels using fully connected mapping to generate adaptive channel weights. Recalibrate the original features to obtain the refined feature map.

[0073] Step 2.3: Input the refined feature map into the spatial attention module. The spatial attention module calculates the attention weights in the spatial dimension, highlights the key response positions in the local area along the spatial axis, and adaptively focuses on the context-related area, thereby enhancing the feature consistency and depth information integrity among multiple views.

[0074] After the above steps, the multi-view image features are output. Their resolution is expressed in terms of the width and the height of the multi-view images and the number of stages.

[0075] The photometric consistency loss function described in this embodiment is expressed as follows:

[0076] ;

[0077] where the terms denote the original reference view and the i-th warped source image, and M represents the mask used to filter out invalid pixels.

[0078] The frequency domain-aware contrast consistency loss function described in this embodiment provides pseudo-supervision signals. The specific optimization steps are as follows:

[0079] First, the final depth map output in step 5 is taken as the anchor sample depth map. A local area is randomly selected from the anchor sample depth map and masked, generating the corresponding augmented (positive) sample depth map. The anchor sample depth map and the augmented sample depth map are then each subjected to a two-dimensional Fourier transform, mapping them from the spatial domain to the frequency domain and yielding their frequency domain representations;

[0080] ;

[0081] where the operator denotes the two-dimensional Fourier transform;

[0082] Then, the hard negative sample depth map is constructed by superimposing on the anchor sample depth map random noise that follows a normal distribution with a mean of 0 and a standard deviation of 0.1;

[0083] ;

[0084] where the noise term is a value randomly sampled from a normal distribution with a mean of 0 and a standard deviation of 0.1.

[0085] A two-dimensional Fourier transform is performed on the hard negative sample depth map, and high-frequency components are removed by low-pass filtering, retaining only the central low-frequency region, thereby obtaining a frequency domain representation of the hard negative sample that matches the characteristics of the occluded region;

[0086] ;

[0087] where the filtering operator denotes the operation that preserves the central low-frequency region;

[0088] Next, the frequency domain-aware contrast consistency loss is calculated. The loss uses a norm to calculate the frequency domain distance between the anchor sample and the positive sample, and between the anchor sample and the hard negative sample, respectively. The aim is to minimize the frequency domain distance between the anchor and the positive sample while maximizing the frequency domain distance between the anchor and the hard negative sample.

[0089] ;

[0090] where the coefficient denotes the weight of the frequency domain-aware contrast consistency loss.

[0091] In this embodiment, the unsupervised multi-view stereo method based on frequency domain-aware contrast consistency is applied to the multi-view images shown in Figure 3, and includes the following steps:

[0092] Step S1: Input the multi-view image data and the camera parameters; the multi-view image data include one reference view and a number of source views;

[0093] where the multi-view images are indexed by sequence number up to the total number of multi-view images; the camera parameters comprise the intrinsic parameter matrix of the k-th camera and the rotation and translation matrices between pairs of cameras;

[0094] Step S2: For each image in the multi-view image data, the attention-based feature pyramid module is used to compute multi-scale features, and this stage is repeated three times. A hybrid attention module is embedded within the attention-based feature pyramid module.

[0095] In each stage, the attention-based feature pyramid module described above extracts two-dimensional feature maps of the multi-view images. After three stages, the features of the multi-view images are obtained.

[0096] Step S3: A series of depth hypothesis planes is generated from the known depth range of the input multi-view images. These planes are evenly distributed between the minimum and maximum depth values, and all of them are extended to the same resolution as the input multi-view images;

[0097] Step S4: Use a differentiable homography transformation to project the multi-view image features obtained in step S2 onto the depth hypothesis planes extended in step S3, yielding a set of reprojected features. The reprojected features are then compared with the corresponding features of the reference view to construct the cost volume;

[0098] Step S5: Based on the cost volume obtained in step S4, use a 3D convolutional neural network to aggregate the cost volumes to obtain the probability volume, and normalize the probability volume to obtain the initial depth map.

[0099] Step S6: Repeat steps S2 to S5 to obtain the final depth map;

[0100] Step S7: During the unsupervised learning training process, the final depth map obtained in step S6 is optimized using photometric consistency loss.

[0101] Step S8: During the unsupervised learning training process, the final depth map obtained in step S6 is optimized using frequency domain-aware contrast consistency loss.

[0102] Step S9: Repeat steps S2 to S8 to calculate the depth maps of all views, fuse the depth maps of all views, and output a globally consistent 3D point cloud reconstruction result.
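The homography used in step S4 can be illustrated for a single fronto-parallel depth plane. The sketch assumes zero-skew intrinsics, a relative pose following the convention X_src = R · X_ref + t, and a plane z = depth in the reference camera frame; under these assumptions the plane-induced homography is K_src · (R + t·nᵀ/depth) · K_ref⁻¹ with n = (0, 0, 1). The function names are hypothetical, and a real implementation would apply the mapping to whole feature maps with bilinear sampling so that gradients can flow.

```python
def matmul3(a, b):
    """Product of two 3 x 3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def inv_intrinsics(K):
    """Closed-form inverse of a zero-skew intrinsic matrix (assumption)."""
    fx, fy, cx, cy = K[0][0], K[1][1], K[0][2], K[1][2]
    return [[1.0 / fx, 0.0, -cx / fx],
            [0.0, 1.0 / fy, -cy / fy],
            [0.0, 0.0, 1.0]]

def plane_homography(K_ref, K_src, R, t, depth):
    """Map reference pixels to source pixels for the plane z = depth."""
    # With n = (0, 0, 1), t * n^T / depth only changes the third column of R.
    M = [[R[i][j] + (t[i] / depth if j == 2 else 0.0) for j in range(3)]
         for i in range(3)]
    return matmul3(matmul3(K_src, M), inv_intrinsics(K_ref))

def warp_pixel(Hm, u, v):
    """Apply a homography to pixel (u, v) and dehomogenize."""
    x = Hm[0][0] * u + Hm[0][1] * v + Hm[0][2]
    y = Hm[1][0] * u + Hm[1][1] * v + Hm[1][2]
    w = Hm[2][0] * u + Hm[2][1] * v + Hm[2][2]
    return x / w, y / w

K = [[100.0, 0.0, 50.0], [0.0, 100.0, 40.0], [0.0, 0.0, 1.0]]
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
Hm = plane_homography(K, K, I3, [0.2, 0.0, 0.0], depth=2.0)
u_src, v_src = warp_pixel(Hm, 30.0, 20.0)
```

For a pure sideways translation the pixel shift equals focal length times baseline divided by depth, which is the familiar stereo disparity relation; evaluating the homography for every depth plane gives the sampling positions used to build the cost volume.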

[0103] This invention first designs a feature pyramid module based on an attention mechanism to extract multi-scale features from multi-view images. Unlike traditional feature pyramid networks that use a top-down path and a fixed fusion strategy, the attention-based feature pyramid module introduces an adaptive weight allocation mechanism during feature extraction. This mechanism can dynamically adjust the contribution of features at different scales based on their importance, significantly improving the discriminative power of feature representation.

[0104] Secondly, to overcome the limitations of spatial domain methods in handling occlusion, this invention innovatively proposes a frequency domain-aware contrast consistency strategy. This strategy explicitly models the consistency of multi-view features in the frequency domain space, utilizing the global structural prior implied in the frequency domain to guide feature matching. Through a specially designed loss function, the model can effectively capture features in occluded areas that become blurred due to texture loss, thus providing a reliable optimization signal even when the photometric consistency assumption fails.

[0105] Ultimately, this invention enables the recovery of more complete and accurate depth maps from multi-view images without the need for real depth labels, thereby achieving high-quality and complete point cloud reconstruction.

[0106] As can be seen from the above embodiments, the overall network structure of the method of the present invention is shown in Figure 2. By fully utilizing the multi-scale adaptive fusion capability of the attention-based feature pyramid module and the global prior constraint of frequency domain-aware contrast consistency, the method effectively addresses the depth estimation bias that arises in existing unsupervised multi-view stereo reconstruction methods when the photometric consistency assumption fails in occluded and texture-deficient regions, and thereby computes an accurate and complete point cloud model.

[0107] This invention first extracts multi-scale feature maps from multi-view input images using a feature pyramid module based on an attention mechanism; then, it aggregates global information to calculate an initial depth map and a confidence map; next, it constructs a frequency-domain perceptual contrast consistency loss in the frequency domain space, using global structural information in the frequency domain to provide robust pseudo-supervision signals for depth estimation, effectively guiding depth optimization of occluded regions; finally, it performs point cloud fusion based on the optimized depth map to obtain a high-quality 3D point cloud model.
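The step of computing an initial depth map and a confidence map can be illustrated with a minimal NumPy sketch. It assumes the aggregated cost volume is available as a `(D, H, W)` array of scores, normalizes it with a softmax along the depth axis, and takes the expected depth. The function name `regress_depth`, the depth range, and the use of the maximum probability as confidence are assumptions made for illustration; the source does not specify how the confidence map is computed.

```python
import numpy as np

def regress_depth(cost_logits, depth_values):
    """Turn aggregated scores (D, H, W) into a depth map and a confidence map.

    depth_values: (D,) depth hypotheses, one per plane.
    """
    # Softmax along the depth axis gives a per-pixel probability volume.
    shifted = cost_logits - cost_logits.max(axis=0, keepdims=True)
    prob = np.exp(shifted)
    prob /= prob.sum(axis=0, keepdims=True)
    # Expected depth under the per-pixel distribution (soft argmax).
    depth = (depth_values[:, None, None] * prob).sum(axis=0)
    # Assumption: confidence is the peak probability at each pixel.
    confidence = prob.max(axis=0)
    return depth, confidence

# Hypothetical example: 48 depth planes on a 4 x 5 image with random scores.
depth_values = np.linspace(1.0, 10.0, 48)
rng = np.random.default_rng(0)
logits = rng.normal(size=(48, 4, 5))
depth, conf = regress_depth(logits, depth_values)
print(depth.shape, conf.shape)
```

Taking the expectation rather than a hard argmax keeps the depth estimate differentiable with respect to the scores, which is what allows the unsupervised losses to drive training.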

[0108] In this embodiment, a sample of the input image data is shown in Figure 3; the images in Figure 3 are taken from four different viewpoints. The resulting depth map is shown in Figure 4, and the point cloud model output in this embodiment is shown in Figure 5. If Figure 3 is processed using existing technology, the resulting depth map is as shown in Figure 6. As can be seen, the point cloud model reconstructed from the image data has a high degree of geometric consistency with the real scene.

[0109] According to the final experimental results presented in Figure 8, when the same view sample of Figure 3 is processed using the existing technology RC-MVSNet, the result is as shown in Figure 7, with large voids in areas such as the roof. By contrast, the output of the method provided by this invention is complete in areas such as the "roof" and "windows".

[0110] As described above, although the invention has been shown and described with reference to specific preferred embodiments, it should not be construed as limiting the invention itself. Various changes in form and detail may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. An unsupervised multi-view stereo method based on frequency domain-aware contrast consistency, characterized in that it includes the following steps:
Step 1: Input a multi-view image dataset and camera parameters, where the dataset is described by the number of multi-view images and their sequence numbers and consists of a reference view and source views, and the camera parameters comprise the intrinsic parameter matrix of the k-th camera and the rotation and translation matrices between cameras;
Step 2: For the multi-view images obtained in Step 1, use the feature pyramid module to extract multi-scale features from the multi-view images. The feature pyramid module embeds a hybrid attention module and uses a total of three stages; in each stage, two-dimensional feature maps of the multi-view images are extracted with the aforementioned feature pyramid module, and after three stages the enhanced multi-view image features are obtained;
Step 3: Generate depth hypothesis planes based on the known depth range of the original multi-view images. The depth hypothesis planes are evenly distributed within the interval between the minimum and maximum depth values, and are extended to the same resolution as the multi-view images;
Step 4: Use a differentiable homography transformation to project the multi-view image features obtained in Step 2 onto the extended depth hypothesis planes, yielding a set of reprojected features; then compare the reprojected features with the corresponding features of the reference view to construct the cost volume; then use a 3D convolutional neural network to aggregate the cost volume into a probability volume, and normalize the probability volume to obtain the initial depth map;
Step 5: Repeat steps 2 to 4 to obtain the final depth map;
Step 6: During the unsupervised learning training process, use the photometric consistency loss to optimize the final depth map obtained in step 5;
Step 7: During the unsupervised learning training process, use the frequency domain-aware contrast consistency loss to optimize the final depth map obtained in step 5;
Step 8: Repeat steps 2 to 7 to calculate the depth maps of all views, fuse the depth maps of all views, and output a globally consistent 3D point cloud model.
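Steps 3 and 4 of claim 1 can be sketched in NumPy as follows. The sketch builds uniformly spaced depth hypothesis planes expanded to image resolution and computes, for one fronto-parallel plane, the plane-induced homography that maps reference pixels into a source view. The function names, the convention that `R` and `t` map reference-camera coordinates to source-camera coordinates, and the toy intrinsics are assumptions for illustration. The bilinear sampling of source features at the warped coordinates, which a full differentiable warp also needs, is not shown.

```python
import numpy as np

def depth_hypotheses(d_min, d_max, num_planes, height, width):
    """Uniformly spaced depth planes, expanded to image resolution (D, H, W)."""
    planes = np.linspace(d_min, d_max, num_planes)
    return np.broadcast_to(planes[:, None, None], (num_planes, height, width))

def plane_homography(K_ref, K_src, R, t, d):
    """Homography induced by the fronto-parallel plane z = d in the reference camera.

    Assumed convention: X_src = R @ X_ref + t.
    """
    n = np.array([0.0, 0.0, 1.0])  # plane normal in the reference camera frame
    return K_src @ (R + np.outer(t, n) / d) @ np.linalg.inv(K_ref)

def warp_grid(H_mat, height, width):
    """Apply a 3x3 homography to every pixel; returns (2, H, W) source coordinates."""
    v, u = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)]).reshape(3, -1).astype(float)
    proj = H_mat @ pix
    return (proj[:2] / proj[2:3]).reshape(2, height, width)

# Hypothetical camera: focal length 100 px, principal point (2, 2).
K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 2.0], [0.0, 0.0, 1.0]])
planes = depth_hypotheses(1.0, 10.0, 4, 3, 5)
H_shift = plane_homography(K, K, np.eye(3), np.array([1.0, 0.0, 0.0]), 10.0)
coords = warp_grid(H_shift, 3, 5)
print(planes.shape, coords.shape)
```

With identity rotation and a unit translation along x, every pixel on the plane at depth 10 shifts horizontally by focal length divided by depth, which is the usual disparity relation and a convenient sanity check for the convention chosen here.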

2. The unsupervised multi-view stereo method based on frequency domain-aware contrast consistency according to claim 1, characterized in that, in step 2, the feature pyramid module first extracts multi-scale features from the input image using a feature pyramid network to generate basic feature maps with different receptive fields, and a hybrid attention module is then embedded at each scale level. The hybrid attention module includes a channel attention module SE and a spatial attention module. The specific process is as follows:
Step 2.1: Input the two-dimensional multi-view images into the feature pyramid network, and construct multi-scale feature layers through stepwise downsampling and convolution operations to generate basic feature representations with different receptive fields;
Step 2.2: Embed the channel attention mechanism at each scale level of the feature pyramid. The feature maps of each scale are input into the attention module SE, which aggregates global context information along the channel dimension through global average pooling and learns the dependencies between channels using fully connected mappings to generate adaptive channel weights; the original features are recalibrated with these weights to obtain the refined feature map;
Step 2.3: Input the refined feature map into the spatial attention module. The spatial attention module calculates attention weights in the spatial dimension, highlights key response positions in local areas, and adaptively focuses on context-related areas, thereby enhancing feature consistency and depth information integrity among multiple views.
After the above steps, the multi-view image features are output, with resolutions determined by the width of the multi-view image, the height of the multi-view image, and the number of stages.
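Steps 2.2 and 2.3 can be illustrated with a small NumPy sketch of a hybrid attention block applied to one feature map. The SE part follows the description above (global average pooling, fully connected mappings, channel reweighting). The spatial part is an assumption: it mixes the channel-wise mean and maximum maps with two scalar weights and a sigmoid, a simplified stand-in for a learned convolution, since the source does not give the exact layer. The reduction ratio `r` and all weights are hypothetical and untrained.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_channel_attention(feat, w1, w2):
    """feat: (C, H, W); w1: (C // r, C); w2: (C, C // r)."""
    squeeze = feat.mean(axis=(1, 2))        # global average pooling -> (C,)
    hidden = np.maximum(w1 @ squeeze, 0.0)  # fully connected + ReLU
    weights = sigmoid(w2 @ hidden)          # fully connected + sigmoid -> (C,)
    return feat * weights[:, None, None]    # recalibrate each channel

def spatial_attention(feat, w_avg=1.0, w_max=1.0, bias=0.0):
    """Assumed simplified form: one attention weight per spatial position."""
    avg_map = feat.mean(axis=0)             # (H, W)
    max_map = feat.max(axis=0)              # (H, W)
    attn = sigmoid(w_avg * avg_map + w_max * max_map + bias)
    return feat * attn[None, :, :]

def hybrid_attention(feat, w1, w2):
    """Channel attention followed by spatial attention."""
    return spatial_attention(se_channel_attention(feat, w1, w2))

rng = np.random.default_rng(0)
C, r = 8, 2                                 # hypothetical channel count and reduction ratio
feat = rng.normal(size=(C, 6, 7))
w1 = rng.normal(size=(C // r, C)) * 0.1
w2 = rng.normal(size=(C, C // r)) * 0.1
out = hybrid_attention(feat, w1, w2)
print(out.shape)
```

Both attention weights lie strictly between 0 and 1, so the block only rescales features and leaves the feature map resolution unchanged.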

3. The unsupervised multi-view stereo method based on frequency domain-aware contrast consistency according to claim 1, characterized in that the photometric consistency loss function is expressed in terms of the original reference view and the warped images of the source views, where M represents the mask used to filter out invalid pixels.
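Because the expression itself does not survive in the text above, the following is only a minimal sketch of one common masked form of a photometric consistency loss, not the claimed formula. It assumes grayscale images, an L1 difference, and averaging over valid pixels and over source views; the function name `photometric_loss` and the `eps` stabilizer are hypothetical.

```python
import numpy as np

def photometric_loss(ref, warped_list, mask_list, eps=1e-8):
    """Masked mean absolute difference between the reference view and warped views.

    ref: (H, W) reference image; warped_list: source images warped into the
    reference view; mask_list: binary masks M marking valid pixels.
    """
    total = 0.0
    for warped, mask in zip(warped_list, mask_list):
        diff = np.abs(warped - ref) * mask      # invalid pixels contribute nothing
        total += diff.sum() / (mask.sum() + eps)
    return total / len(warped_list)

# Hypothetical example: a perfect warp gives zero loss.
rng = np.random.default_rng(0)
ref = rng.uniform(size=(4, 4))
mask = np.ones((4, 4))
print(photometric_loss(ref, [ref.copy()], [mask]))
```

The mask matters because pixels that project outside the source image have no valid counterpart, and including them would penalize correct depths.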

4. The unsupervised multi-view stereo method based on frequency domain-aware contrast consistency according to claim 1, characterized in that the frequency domain-aware contrast consistency loss function provides pseudo-supervision signals, and the specific optimization steps are as follows:
First, the final depth map output in step 5 is taken as the anchor sample depth map, and a local region is randomly selected from the anchor sample depth map and masked, generating the corresponding augmented sample depth map, which serves as the positive sample;
Subsequently, a two-dimensional Fourier transform is applied to the anchor sample depth map and to the augmented sample depth map, mapping them from the spatial domain to the frequency domain and yielding their respective frequency domain representations;
Then, a hard negative sample depth map is constructed by superimposing on the anchor sample depth map random noise that follows a normal distribution with a mean of 0 and a standard deviation of 0, where the noise term is a value randomly sampled from that distribution;
A two-dimensional Fourier transform is then performed on the hard negative sample depth map, and high-frequency components are removed by low-pass filtering, retaining only the central low-frequency region, thereby obtaining a frequency domain representation of the hard negative sample that matches the characteristics of the occluded region;
Next, the frequency domain-aware contrast consistency loss is calculated. The loss uses a norm to compute the frequency domain distance between the anchor sample and the positive sample and the frequency domain distance between the anchor sample and the hard negative sample, respectively. The aim is to minimize the frequency domain distance between the anchor and the positive sample while maximizing the frequency domain distance between the anchor and the hard negative sample. A weight coefficient denotes the weight of the frequency domain-aware contrast consistency loss.
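The sample construction in claim 4 can be sketched in NumPy as below. The sketch builds the positive sample by zeroing a random block, builds the hard negative by adding Gaussian noise and low-pass filtering its centered spectrum, and compares spectra with a mean absolute difference. Several choices are assumptions, since the source does not preserve them: the L1-style distance, the margin-based hinge used to combine the two distances, the square low-pass window, masking by zero fill, and the parameters `sigma`, `mask_size`, `radius`, and `margin`. The noise scale `sigma` is set to a nonzero illustrative value so that the noise has an effect.

```python
import numpy as np

def random_mask(depth, rng, size):
    """Zero out one random size x size block to form the augmented (positive) sample."""
    h, w = depth.shape
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    out = depth.copy()
    out[top:top + size, left:left + size] = 0.0
    return out

def spectrum(x):
    """2D Fourier transform with the zero frequency shifted to the center."""
    return np.fft.fftshift(np.fft.fft2(x))

def low_pass(spec_shifted, radius):
    """Keep only a central square of half-width `radius` in a centered spectrum."""
    h, w = spec_shifted.shape
    cy, cx = h // 2, w // 2
    keep = np.zeros((h, w), dtype=bool)
    keep[cy - radius:cy + radius + 1, cx - radius:cx + radius + 1] = True
    return np.where(keep, spec_shifted, 0.0)

def freq_contrast_loss(anchor, rng, sigma=0.1, mask_size=4, radius=2, margin=1.0):
    positive = random_mask(anchor, rng, mask_size)
    negative = anchor + rng.normal(0.0, sigma, size=anchor.shape)
    f_a = spectrum(anchor)
    f_p = spectrum(positive)
    f_n = low_pass(spectrum(negative), radius)
    d_pos = np.abs(f_a - f_p).mean()   # anchor-positive frequency distance
    d_neg = np.abs(f_a - f_n).mean()   # anchor-negative frequency distance
    # Assumed hinge form: pull the positive closer, push the negative away.
    return max(d_pos - d_neg + margin, 0.0), d_pos, d_neg

rng = np.random.default_rng(0)
anchor = rng.uniform(1.0, 2.0, size=(16, 16))   # hypothetical depth map
loss, d_pos, d_neg = freq_contrast_loss(anchor, rng)
print(loss, d_pos, d_neg)
```

Because a local mask in the spatial domain spreads across all frequencies, the distance between the anchor and positive spectra reflects global structure rather than per-pixel agreement, which is the property the claim relies on for occluded regions.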