Image depth estimation method, device, system and readable storage medium

By generating pseudo-depth label sets, iteratively training depth estimation networks, and applying geometric consistency checks, the method addresses the low training efficiency and poor depth quality of multi-view image depth estimation methods and improves depth estimation quality efficiently.

CN115578434B · Active · Publication Date: 2026-06-16 · PEKING UNIV SHENZHEN GRADUATE SCHOOL

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
PEKING UNIV SHENZHEN GRADUATE SCHOOL
Filing Date
2022-09-30
Publication Date
2026-06-16

Abstract

The application discloses an image depth estimation method, device, system and readable storage medium. The method comprises the following steps: acquiring a training set, obtaining a first pseudo-depth label set during training based on the training set, and iteratively training a pre-created first depth estimation network according to the training set and the first pseudo-depth label set to obtain a first depth estimation model; obtaining a first depth image set based on the first depth estimation model and the training set, and determining a second pseudo-depth label set according to the first depth image set; iteratively training a pre-created second depth estimation network according to the training set and the second pseudo-depth label set to obtain a second depth estimation model; acquiring a to-be-estimated image set, inputting the to-be-estimated image set into the second depth estimation model, and obtaining depth images corresponding to the to-be-estimated image set. By training two depth estimation networks with two pseudo-depth label sets, the training efficiency and depth estimation quality of the model are improved.

Description

Technical Field

[0001] This invention relates to the field of computer vision technology, and in particular to an image depth estimation method, apparatus, system and readable storage medium.

Background Technology

[0002] Depth estimation is a classic research topic with wide applications in areas such as autonomous driving control. With the spread of deep learning, multi-view image depth estimation methods have been developed in recent years. Some of these methods require real depth images as supervision data to train the depth estimation model. However, real depth images are difficult to obtain, and it is hard to guarantee that the trained model applies and scales to new scenes, which leads to slow training and low estimation quality. Other multi-view image depth estimation methods replace real depth with an image reconstruction strategy that guides the network to learn depth estimation from image data alone. However, limited image resolution, geometric occlusion between views, and specular reflection can leave the image reconstruction strategy with insufficient supervision, which easily leads to low depth estimation quality. Improving the training efficiency and depth estimation quality of depth estimation models is therefore an urgent problem.

Summary of the Invention

[0003] The main objective of this invention is to propose an image depth estimation method, apparatus, system, and readable storage medium, aiming to solve the problem of how to improve the training efficiency and depth estimation quality of depth estimation models.

[0004] To achieve the above objectives, the present invention provides an image depth estimation method, which includes the following steps:

[0005] A training set is obtained, and a first pseudo-depth label set is obtained during the training process based on the training set. The first depth estimation network is iteratively trained according to the training set and the first pseudo-depth label set to obtain a first depth estimation model.

[0006] A first depth image set is obtained based on the first depth estimation model and the training set, and a second pseudo depth label set is determined based on the first depth image set.

[0007] The pre-created second depth estimation network is iteratively trained based on the training set and the second pseudo-depth label set to obtain the second depth estimation model.

[0008] Obtain a set of images to be estimated, input the set of images to be estimated into the second depth estimation model, and obtain the depth image corresponding to the set of images to be estimated.

[0009] Optionally, the step of obtaining the first pseudo-depth label set during training based on the training set includes:

[0010] The pre-created first depth estimation network is iteratively trained based on the training set, and, according to the training set, the first depth estimation network generates a first pseudo-depth label set corresponding to the target training viewpoint images in the training set.

[0011] Optionally, the step of iteratively training a pre-created first depth estimation network based on the training set and the first pseudo-depth label set to obtain a first depth estimation model includes:

[0012] The first depth estimation network is iteratively trained based on the training set and the first pseudo-depth label set to obtain the first depth estimation pre-model.

[0013] The first pseudo-depth label set is updated based on the first depth estimation pre-model, and the photometric loss value and depth consistency loss value corresponding to the first depth estimation pre-model are calculated based on the updated first pseudo-depth label set, the preset photometric loss function and the preset depth consistency loss function.

[0014] The first depth estimation pre-model is validated to obtain the validation results, and it is determined whether the validation results, the photometric loss value, and the depth consistency loss value meet the preset conditions.

[0015] If the verification result, the photometric loss value, and the depth consistency loss value meet the preset conditions, then the first depth estimation pre-model is used as the first depth estimation model.

[0016] If the verification result, the photometric loss value, or the depth consistency loss value does not meet the preset conditions, then based on the updated first pseudo-depth label set, the following steps are repeated: the pre-created first depth estimation network is iteratively trained according to the training set and the first pseudo-depth label set to obtain the first depth estimation pre-model.
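The loop described in [0012] to [0016] can be sketched as follows. This is only an illustrative outline: the callables `train_round`, `update_labels`, and `evaluate`, the three threshold parameters, and the `max_rounds` safety cap are hypothetical names introduced here, and the text does not state what the preset conditions actually are. The sketch assumes each condition is an upper bound on an error or loss value.

```python
from typing import Any, Callable, Tuple


def train_first_model(
    train_round: Callable[[Any], Any],
    update_labels: Callable[[Any], Any],
    evaluate: Callable[[Any, Any], Tuple[float, float, float]],
    labels: Any,
    max_val_error: float,
    max_photometric_loss: float,
    max_consistency_loss: float,
    max_rounds: int = 10,  # assumption: a cap so the loop always terminates
):
    """Repeat train -> update labels -> check until all preset conditions hold."""
    pre_model = None
    for round_idx in range(1, max_rounds + 1):
        # [0012] train with the current pseudo-depth labels
        pre_model = train_round(labels)
        # [0013] refresh the labels from the pre-model
        labels = update_labels(pre_model)
        # [0013]-[0014] validation result plus the two loss values
        val_error, photometric, consistency = evaluate(pre_model, labels)
        if (
            val_error <= max_val_error
            and photometric <= max_photometric_loss
            and consistency <= max_consistency_loss
        ):
            # [0015] all conditions met: the pre-model becomes the model
            return pre_model, labels, round_idx
        # [0016] otherwise repeat with the updated labels
    return pre_model, labels, max_rounds
```

With toy callables whose loss shrinks as 1/round and thresholds of 0.3, the loop stops in the fourth round, because 1/3 still exceeds 0.3 while 1/4 does not.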

[0017] Optionally, the first depth image set includes: a target depth image and a reference depth image set, and the step of determining the second pseudo depth label set based on the first depth image set includes:

[0018] Obtain the depth value and coordinate value corresponding to each pixel in the target depth image;

[0019] Projecting the depth values onto the reference depth image set yields a set of projected depth values and a set of projected coordinate values for each pixel in the target depth image.

[0020] The depth error set corresponding to each pixel is calculated based on the depth value and the set of projected depth values, and the coordinate offset set corresponding to each pixel is calculated based on the coordinate value and the set of projected coordinate values.

[0021] Based on the depth error set and the coordinate offset set, a reliable pixel set is determined in the target depth image, and a second pseudo depth label set is determined based on the reliable pixel set.
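As an illustration of [0018] to [0021], the sketch below uses a forward-backward reprojection check of the kind commonly used in multi-view stereo. It is not taken from the patent text, which does not give formulas in this section. Assumptions introduced here: pinhole intrinsics `K_t` and `K_r`, a pose `(R, t)` mapping target-camera coordinates to reference-camera coordinates, nearest-neighbour sampling of the reference depth image, depth images of equal size, a relative depth error, and the thresholds `max_offset`, `max_rel_err`, and `min_views`.

```python
import numpy as np


def reproject(depth_t, depth_r, K_t, K_r, R, t):
    """Project target pixels into one reference view and back again.

    Returns the reprojected depth, reprojected x and y coordinates, and a mask
    of pixels that land inside the reference image, each of shape (H, W).
    """
    h, w = depth_t.shape
    x, y = np.meshgrid(np.arange(w, dtype=float), np.arange(h, dtype=float))
    ones = np.ones(h * w)
    pix_t = np.stack([x.ravel(), y.ravel(), ones])             # 3 x N pixels
    pts_t = (np.linalg.inv(K_t) @ pix_t) * depth_t.ravel()     # 3D, target frame
    pts_r = R @ pts_t + t.reshape(3, 1)                        # 3D, reference frame
    proj_r = K_r @ pts_r
    x_r, y_r = proj_r[0] / proj_r[2], proj_r[1] / proj_r[2]
    xi, yi = np.rint(x_r), np.rint(y_r)
    inside = (xi >= 0) & (xi <= w - 1) & (yi >= 0) & (yi <= h - 1)
    xi = np.clip(xi, 0, w - 1).astype(int)
    yi = np.clip(yi, 0, h - 1).astype(int)
    d_r = depth_r[yi, xi]                                      # sampled reference depth
    pts_r2 = (np.linalg.inv(K_r) @ np.stack([x_r, y_r, ones])) * d_r
    pts_t2 = R.T @ (pts_r2 - t.reshape(3, 1))                  # back in target frame
    proj_t = K_t @ pts_t2
    shape = (h, w)
    return (
        pts_t2[2].reshape(shape),
        (proj_t[0] / proj_t[2]).reshape(shape),
        (proj_t[1] / proj_t[2]).reshape(shape),
        inside.reshape(shape),
    )


def reliable_mask(depth_t, K_t, ref_views, max_offset=1.0, max_rel_err=0.01, min_views=1):
    """Mark pixels whose depth error and coordinate offset are small in enough views.

    ref_views is a list of (depth_r, K_r, R, t) tuples, one per reference view.
    """
    h, w = depth_t.shape
    x, y = np.meshgrid(np.arange(w, dtype=float), np.arange(h, dtype=float))
    votes = np.zeros((h, w), dtype=int)
    for depth_r, K_r, R, t in ref_views:
        d_back, x_back, y_back, inside = reproject(depth_t, depth_r, K_t, K_r, R, t)
        offset = np.hypot(x_back - x, y_back - y)              # coordinate offset
        rel_err = np.abs(d_back - depth_t) / depth_t           # depth error
        votes += inside & (offset < max_offset) & (rel_err < max_rel_err)
    return votes >= min_views


def pseudo_labels(depth_t, mask):
    """Keep depth only at reliable pixels; 0 marks 'no label' (an assumption)."""
    return np.where(mask, depth_t, 0.0)
```

Pixels that fail the check receive no label, so the second network is supervised only where the first model's depth agrees across views.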

[0022] Optionally, the step of iteratively training a pre-created second depth estimation network based on the training set and the second pseudo-depth label set to obtain a second depth estimation model includes:

[0023] The pre-created second depth estimation network is iteratively trained based on the training set and the second pseudo-depth label set to obtain the second depth estimation pre-model.

[0024] A depth image is obtained based on the second depth estimation pre-model and the training set, and the second pseudo-depth label set is updated based on the depth image;

[0025] Obtain the reliability of the updated second pseudo-depth label set, and determine whether the reliability meets the preset conditions;

[0026] If the reliability meets the preset conditions, then the second depth estimation pre-model is used as the second depth estimation model;

[0027] If the reliability does not meet the preset conditions, then based on the updated second pseudo-depth label set, the following steps are repeated: the pre-created second depth estimation network is iteratively trained according to the training set and the second pseudo-depth label set to obtain the second depth estimation pre-model.

[0028] Optionally, the step of updating the second pseudo-depth label set based on the depth image includes:

[0029] A third set of pseudo-depth labels is determined based on the depth image, and the reliability of each third pseudo-depth label in the third set of pseudo-depth labels is calculated.

[0030] Obtain the reliability of each second pseudo-depth label in the second pseudo-depth label set;

[0031] The reliability of the second pseudo-depth label is compared with the reliability of the corresponding third pseudo-depth label to obtain the comparison result, and the second pseudo-depth label set is updated according to the comparison result.
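The comparison in [0029] to [0031] can be pictured with the minimal sketch below. It assumes, purely for illustration, that each pseudo-depth label carries a per-pixel reliability score (for example, the number of reference views that agree with it) and that the label with the higher score is kept. The text does not define the reliability measure or the update rule, and the function name `update_labels` is hypothetical.

```python
import numpy as np


def update_labels(second_depth, second_rel, third_depth, third_rel):
    """Per pixel, keep whichever pseudo-depth label has the higher reliability.

    All four arrays share the same shape. Ties keep the existing second label.
    """
    take_third = third_rel > second_rel
    new_depth = np.where(take_third, third_depth, second_depth)
    new_rel = np.where(take_third, third_rel, second_rel)
    return new_depth, new_rel
```

Under this rule a label is only ever replaced by a more reliable one, so the per-pixel reliability of the second pseudo-depth label set cannot decrease between rounds.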

[0032] Optionally, the step of inputting the set of images to be estimated into the second depth estimation model to obtain the depth image corresponding to the set of images to be estimated includes:

[0033] Input the set of images to be estimated into the second depth estimation model;

[0034] The second depth estimation model is used to extract the first feature map set of the target image to be estimated and the second feature map set of the reference image to be estimated from the set of images to be estimated.

[0035] The second depth estimation model determines the assumed depth range based on the scene range and camera parameters corresponding to the set of images to be estimated;

[0036] The second depth estimation model obtains the depth image corresponding to the set of images to be estimated based on the first feature map set, the second feature map set, and the assumed depth range.

[0037] Furthermore, to achieve the above objectives, the present invention also provides an image depth estimation apparatus, the image depth estimation apparatus comprising:

[0038] The first training module is used to acquire a training set, obtain a first pseudo-depth label set during training based on the training set, and iteratively train a pre-created first depth estimation network according to the training set and the first pseudo-depth label set to obtain a first depth estimation model.

[0039] The determination module is used to obtain a first depth image set based on the first depth estimation model and the training set, and to determine a second pseudo depth label set based on the first depth image set;

[0040] The second training module is used to iteratively train the pre-created second depth estimation network based on the training set and the second pseudo-depth label set to obtain the second depth estimation model.

[0041] The input module is used to acquire a set of images to be estimated, input the set of images to be estimated into the second depth estimation model, and obtain the depth image corresponding to the set of images to be estimated.

[0042] Furthermore, the first training module is also used for:

[0043] The pre-created first depth estimation network is iteratively trained based on the training set, and, according to the training set, the first depth estimation network generates a first pseudo-depth label set corresponding to the target training viewpoint images in the training set.

[0044] Furthermore, the first training module is also used for:

[0045] The first depth estimation network is iteratively trained based on the training set and the first pseudo-depth label set to obtain the first depth estimation pre-model.

[0046] The first pseudo-depth label set is updated based on the first depth estimation pre-model, and the photometric loss value and depth consistency loss value corresponding to the first depth estimation pre-model are calculated based on the updated first pseudo-depth label set, the preset photometric loss function and the preset depth consistency loss function.

[0047] The first depth estimation pre-model is validated to obtain the validation results, and it is determined whether the validation results, the photometric loss value, and the depth consistency loss value meet the preset conditions.

[0048] If the verification result, the photometric loss value, and the depth consistency loss value meet the preset conditions, then the first depth estimation pre-model is used as the first depth estimation model.

[0049] If the verification result, the photometric loss value, or the depth consistency loss value does not meet the preset conditions, then based on the updated first pseudo-depth label set, the following steps are repeated: the pre-created first depth estimation network is iteratively trained according to the training set and the first pseudo-depth label set to obtain the first depth estimation pre-model.

[0050] Furthermore, the determining module is also used for:

[0051] Obtain the depth value and coordinate value corresponding to each pixel in the target depth image;

[0052] Projecting the depth values onto the reference depth image set yields a set of projected depth values and a set of projected coordinate values for each pixel in the target depth image.

[0053] The depth error set corresponding to each pixel is calculated based on the depth value and the set of projected depth values, and the coordinate offset set corresponding to each pixel is calculated based on the coordinate value and the set of projected coordinate values.

[0054] Based on the depth error set and the coordinate offset set, a reliable pixel set is determined in the target depth image, and a second pseudo depth label set is determined based on the reliable pixel set.

[0055] Furthermore, the second training module is also used for:

[0056] The pre-created second depth estimation network is iteratively trained based on the training set and the second pseudo-depth label set to obtain the second depth estimation pre-model.

[0057] A depth image is obtained based on the second depth estimation pre-model and the training set, and the second pseudo-depth label set is updated based on the depth image;

[0058] Obtain the reliability of the updated second pseudo-depth label set, and determine whether the reliability meets the preset conditions;

[0059] If the reliability meets the preset conditions, then the second depth estimation pre-model is used as the second depth estimation model;

[0060] If the reliability does not meet the preset conditions, then based on the updated second pseudo-depth label set, the following steps are repeated: the pre-created second depth estimation network is iteratively trained according to the training set and the second pseudo-depth label set to obtain the second depth estimation pre-model.

[0061] Furthermore, the second training module is also used for:

[0062] A third set of pseudo-depth labels is determined based on the depth image, and the reliability of each third pseudo-depth label in the third set of pseudo-depth labels is calculated.

[0063] Obtain the reliability of each second pseudo-depth label in the second pseudo-depth label set;

[0064] The reliability of the second pseudo-depth label is compared with the reliability of the corresponding third pseudo-depth label to obtain the comparison result, and the second pseudo-depth label set is updated according to the comparison result.

[0065] Furthermore, the input module is also used for:

[0066] Input the set of images to be estimated into the second depth estimation model;

[0067] The second depth estimation model is used to extract the first feature map set of the target image to be estimated and the second feature map set of the reference image to be estimated from the set of images to be estimated.

[0068] The second depth estimation model determines the assumed depth range based on the scene range and camera parameters corresponding to the set of images to be estimated;

[0069] The second depth estimation model obtains the depth image corresponding to the set of images to be estimated based on the first feature map set, the second feature map set, and the assumed depth range.

[0070] In addition, to achieve the above objectives, the present invention also provides an image depth estimation system, the image depth estimation system comprising: a memory, a processor, and an image depth estimation method program stored in the memory and executable on the processor, wherein the image depth estimation method program, when executed by the processor, implements the steps of the image depth estimation method as described above.

[0071] In addition, to achieve the above objectives, the present invention also provides a readable storage medium storing an image depth estimation method program, which, when executed by a processor, implements the steps of the image depth estimation method as described above.

[0072] The proposed image depth estimation method involves acquiring a training set, obtaining a first set of pseudo-depth labels during training based on the training set, and iteratively training a pre-created first depth estimation network using the training set and the first set of pseudo-depth labels to obtain a first depth estimation model. A first set of depth images is then obtained based on the first depth estimation model and the training set, and a second set of pseudo-depth labels is determined based on the first set of depth images. The pre-created second depth estimation network is then iteratively trained using the training set and the second set of pseudo-depth labels to obtain a second depth estimation model. Finally, a set of images to be estimated is acquired and input into the second depth estimation model to obtain the depth images corresponding to the set of images to be estimated. This invention combines two multi-view depth estimation networks with two types of pseudo-depth labels to enhance the performance of the overall unsupervised method, strengthens the effect of using image data for unsupervised learning, and eliminates the need to use complex 3D reconstruction techniques to update pseudo-depth labels. Based on geometric consistency checks and iterative training, the target viewpoint depth estimation quality of the model is improved, thus increasing the training efficiency and depth estimation quality of the depth estimation model.

Description of the Drawings

[0073] Figure 1 is a schematic diagram of the device structure of the hardware operating environment involved in the embodiments of the present invention;

[0074] Figure 2 is a flowchart of the first embodiment of the image depth estimation method of the present invention;

[0075] Figure 3 is a schematic diagram of the structure of the first depth estimation network or the second depth estimation network of the present invention;

[0076] Figure 4 is a schematic diagram of the process for determining the reliability of pixel depth values in the present invention;

[0077] Figure 5 is a flowchart of the second embodiment of the image depth estimation method of the present invention;

[0078] Figure 6 is a schematic diagram of the image depth estimation device of the present invention.

[0079] The realization of the objective, functional features and advantages of the present invention will be further explained in conjunction with the embodiments and with reference to the accompanying drawings.

Detailed Description

[0080] It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

[0081] Figure 1 is a schematic diagram of the device structure of the hardware operating environment involved in the embodiments of the present invention.

[0082] The device in this embodiment of the invention can be a PC or a server.

[0083] As shown in Figure 1, the device may include: a processor 1001, such as a CPU; a network interface 1004; a user interface 1003; a memory 1005; and a communication bus 1002. The communication bus 1002 is used to enable communication between these components. The user interface 1003 may include a display screen or an input unit such as a keyboard; optionally, the user interface 1003 may also include a standard wired interface or a wireless interface. The network interface 1004 may optionally include a standard wired interface or a wireless interface (such as a Wi-Fi interface). The memory 1005 may be high-speed RAM or non-volatile memory, such as a disk drive. Optionally, the memory 1005 may also be a storage device independent of the aforementioned processor 1001.

[0084] Those skilled in the art will understand that the device structure shown in Figure 1 does not constitute a limitation on the device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.

[0085] As shown in Figure 1, the memory 1005, which serves as a computer storage medium, may include an operating system, a network communication module, a user interface module, and an image depth estimation method program.

[0086] The operating system is a program that manages and controls portable storage devices and software resources, and supports the operation of the network communication module, user interface module, image depth estimation method program, and other programs or software; the network communication module is used to manage and control the network interface 1004; and the user interface module is used to manage and control the user interface 1003.

[0087] In the device shown in Figure 1, the device calls the image depth estimation method program stored in the memory 1005 through the processor 1001, and performs the operations in the various embodiments of the image depth estimation method described below.

[0088] Based on the above hardware structure, an embodiment of the image depth estimation method of the present invention is proposed.

[0089] Referring to Figure 2, which is a flowchart of the first embodiment of the image depth estimation method of the present invention, the method includes:

[0090] Step S10: Obtain a training set, obtain a first pseudo-depth label set during the training process based on the training set, and iteratively train a pre-created first depth estimation network according to the training set and the first pseudo-depth label set to obtain a first depth estimation model.

[0091] Step S20: Obtain a first depth image set based on the first depth estimation model and the training set, and determine a second pseudo depth label set based on the first depth image set;

[0092] Step S30: Iteratively train the pre-created second depth estimation network based on the training set and the second pseudo-depth label set to obtain the second depth estimation model;

[0093] Step S40: Obtain the set of images to be estimated, input the set of images to be estimated into the second depth estimation model, and obtain the depth image corresponding to the set of images to be estimated.
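The data flow through steps S10 to S40 can be summarized in a short sketch. The four callables (`train_first`, `predict`, `select_reliable`, `train_second`) are hypothetical placeholders for the components described in this embodiment. The sketch shows only how the outputs of one step feed the next, not how any step is implemented.

```python
from typing import Any, Callable, Sequence


def run_pipeline(
    training_set: Sequence[Any],
    images_to_estimate: Sequence[Any],
    train_first: Callable[[Sequence[Any]], Any],
    predict: Callable[[Any, Sequence[Any]], Any],
    select_reliable: Callable[[Any], Any],
    train_second: Callable[[Sequence[Any], Any], Any],
):
    """Chain the four steps: two training stages, then inference."""
    # S10: the first network is trained and produces its own pseudo-depth labels
    first_model = train_first(training_set)
    # S20: depth images for the training set, filtered into the second label set
    first_depth_set = predict(first_model, training_set)
    second_labels = select_reliable(first_depth_set)
    # S30: the second network is trained with the filtered labels
    second_model = train_second(training_set, second_labels)
    # S40: only the second model is used on the images to be estimated
    return predict(second_model, images_to_estimate)
```

Only the second depth estimation model is used at inference time; the first model serves solely to produce the depth images from which the second pseudo-depth label set is derived.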

[0094] The image depth estimation method of this embodiment is applied to an image depth estimation system, which can run on smart devices such as terminal devices and PCs. For ease of description, the image depth estimation system is used as the example below. Before performing image depth estimation, the system first obtains a training set and trains a pre-created first depth estimation network on it. During this training, a first pseudo-depth label set is generated, and the first depth estimation network is iteratively trained according to the training set and the first pseudo-depth label set to obtain a first depth estimation model. The system then inputs the training set into the first depth estimation model, which outputs a first depth image set. The system checks the first depth image set and filters out unreliable depth values to obtain a second pseudo-depth label set, and iteratively trains the pre-created second depth estimation network according to the training set and the second pseudo-depth label set to obtain a second depth estimation model. Finally, the system obtains a set of images to be estimated and inputs it into the second depth estimation model to obtain the corresponding depth images. It should be noted that the training set includes a target training viewpoint image and several corresponding reference training viewpoint images. The target training viewpoint image is the image whose depth is to be estimated, and the reference training viewpoint images show the same content as the target training viewpoint image, captured from different viewpoints.

The pre-created first depth estimation network and the pre-created second depth estimation network each include: a multi-scale multi-view feature extraction network, a differentiable homography transformation module, a matching cost calculation module, a matching cost regularization network, and a depth probability normalization module. The first pseudo-depth label set and the second pseudo-depth label set include depth images of different scales corresponding to the target training viewpoint image in the training set.

[0095] This embodiment of the image depth estimation method involves: acquiring a training set; obtaining a first set of pseudo-depth labels based on the training set; iteratively training a pre-created first depth estimation network using the training set and the first set of pseudo-depth labels to obtain a first depth estimation model; obtaining a first set of depth images based on the first depth estimation model and the training set; determining a second set of pseudo-depth labels based on the first set of depth images; iteratively training a pre-created second depth estimation network using the training set and the second set of pseudo-depth labels to obtain a second depth estimation model; and acquiring a set of images to be estimated, inputting the set of images to be estimated into the second depth estimation model to obtain the depth images corresponding to the set of images to be estimated. This invention combines two multi-view depth estimation networks with two types of pseudo-depth labels to enhance the performance of the overall unsupervised method, strengthens the effect of using image data for unsupervised learning, and eliminates the need to use complex 3D reconstruction techniques to update pseudo-depth labels. Based on geometric consistency checks and iterative training, it improves the target viewpoint depth estimation quality of the model, thereby improving the training efficiency and depth estimation quality of the depth estimation model.

[0096] The following will provide a detailed explanation of each step:

[0097] Step S10: Obtain a training set, obtain a first pseudo-depth label set during the training process based on the training set, and iteratively train a pre-created first depth estimation network according to the training set and the first pseudo-depth label set to obtain a first depth estimation model.

[0098] In this embodiment, the image depth estimation system acquires a training set. During the training of a pre-created first depth estimation network based on the training set, a first pseudo-depth label set is generated. The pre-created first depth estimation network is then iteratively trained by combining the training set and the first pseudo-depth label set. The training results are analyzed using a preset validation set and loss function to obtain the first depth estimation model.

[0099] Specifically, the steps for obtaining the first pseudo-depth label set during training based on the training set include:

[0100] Step S101: Iteratively train the pre-created first depth estimation network based on the training set, and generate a first pseudo depth label set corresponding to the target training viewpoint image in the training set through the first depth estimation network according to the training set.

[0101] In this step, after the image depth estimation system acquires the training set, it iteratively trains the pre-created first depth estimation network with the training set, and generates a target training viewpoint depth image corresponding to the target training viewpoint image in the training set through the first depth estimation network. The image depth estimation system then generates a first pseudo-depth label set corresponding to the target training viewpoint image based on the target training viewpoint depth image.

[0102] Specifically, the step of iteratively training the pre-created first depth estimation network based on the training set and the first pseudo-depth label set to obtain the first depth estimation model includes:

[0103] Step S102: Iteratively train the pre-created first depth estimation network according to the training set and the first pseudo depth label set to obtain the first depth estimation pre-model;

[0104] In this step, after obtaining the first pseudo-depth label set, the image depth estimation system iteratively trains the pre-created first depth estimation network based on the training set and the first pseudo-depth label set to obtain the first depth estimation pre-model. As shown in Figure 3, the first depth estimation network includes: a multi-scale multi-view feature extraction network, a differentiable homography transformation module, a matching cost calculation module, a matching cost regularization network, and a depth probability normalization module.

[0105] The specific steps for iteratively training the pre-created first depth estimation network based on the training set and the first pseudo-depth label set include:

[0106] 1. The image depth estimation system inputs the training set into the first depth estimation network. The multi-scale multi-view feature extraction network then extracts features from the target training viewpoint image in the training set and from its corresponding reference training viewpoint images, yielding a target training viewpoint feature map set for the target training viewpoint image and a reference training viewpoint feature map set for the reference training viewpoint images. These sets may include downsampled, full-resolution, or multi-scale feature maps. As shown in Figure 3, the multi-scale multi-view feature extraction network extracts feature maps at three scales for both the target training viewpoint image and the reference training viewpoint images. Feature maps at a different number of scales can be extracted depending on the specific situation, and the number is not limited here. The shallow feature maps are used to learn information such as the color and texture observed in the target training viewpoint image, while the deep feature maps are used to learn its spatial layout and geometric cues.

[0107] 2. Input the target training viewpoint feature map set, the reference training viewpoint feature map set, and the training set into the differentiable homography transformation module. Based on the scene range and camera parameters corresponding to the target and reference training viewpoint images in the training set, the module determines the spatial position of each pixel in the target and reference training viewpoint images, and from this determines the depth hypothesis range corresponding to each pixel. The module also acquires the first-scale target training viewpoint feature map from the target training viewpoint feature map set and the first-scale reference training viewpoint feature maps from the reference training viewpoint feature map set. Based on the homography matrix corresponding to each hypothetical depth within the depth hypothesis range, it projects all first-scale reference training viewpoint feature maps onto the first-scale target training viewpoint feature map, thereby dividing a series of depth hypothesis planes within the depth hypothesis range as the basis for the subsequent calculation of matching costs. For example, if the depth hypothesis range of each pixel on the first-scale target training viewpoint feature map is determined to be 1 meter to 3 meters, then 1-meter, 2-meter, and 3-meter depth hypothesis planes can be divided within the depth hypothesis range. It should be noted that 1-meter, 1.5-meter, 2-meter, 2.5-meter, and 3-meter depth hypothesis planes can also be divided; that is, the specific number of depth hypothesis planes is determined according to the specific settings and is not limited here.

[0108] 3. Through the matching cost calculation module, the matching deviation of each pixel on the first-scale target training viewpoint feature map on each different depth hypothesis plane with the corresponding pixel on each first-scale reference training viewpoint feature map is calculated, and the matching cost is determined based on the deviation. The smaller the matching cost, the closer the depth of the pixel is to the true depth.

[0109] 4. Input the matching cost of each pixel in the first-scale target training viewpoint feature map on each different depth plane with the corresponding pixel in each first-scale reference training viewpoint feature map into the matching cost regularization network. The matching cost regularization network uses a deep neural network to adapt to and resist the unavoidable noise in the matching cost. By learning the maximum likelihood estimate between the matching cost distribution and the supervision data distribution, it outputs the discrete probability of each pixel in the first-scale target training viewpoint feature map falling on each different depth hypothesis plane. The larger the discrete probability, the greater the probability that the depth of the pixel corresponds to the depth of the depth hypothesis plane. It should be noted that the matching cost distribution is the result distribution of the matching cost of each pixel in the target training viewpoint feature map obtained by the matching cost calculation module. The supervision data distribution is the result distribution of the loss value of each pixel in the target training viewpoint feature map obtained based on the first pseudo-depth label set and photometric loss function, and based on the first pseudo-depth label set and depth consistency loss function.

[0110] 5. Input the discrete probability of each pixel in the first-scale target training viewpoint feature map falling on each different depth hypothesis plane into the depth probability normalization module. The depth probability normalization module obtains the floating-point depth of each pixel in the first-scale target training viewpoint feature map by weighted summation based on the discrete probabilities. For example, assuming that the discrete probabilities of a pixel in the first-scale target training viewpoint feature map falling on the 1-meter depth hypothesis plane, the 2-meter depth hypothesis plane, and the 3-meter depth hypothesis plane are 0.7, 0.2, and 0.1 respectively, the weighted summation based on the discrete probabilities is: 1*0.7 + 2*0.2 + 3*0.1, which gives the floating-point depth of the pixel as 1.4 meters.

[0111] 6. After obtaining the floating-point depth of each pixel in the first-scale target training viewpoint feature map, the image depth estimation system inputs the floating-point depth into the differentiable homography transformation module in the first depth estimation network. The differentiable homography transformation module obtains the second-scale target training viewpoint feature map from the target training viewpoint feature map set and the second-scale reference training viewpoint feature maps from the reference training viewpoint feature map set. Then, based on the second-scale reference training viewpoint feature maps, the second-scale target training viewpoint feature map, and the floating-point depth, the depth hypothesis range of each pixel in the second-scale target training viewpoint feature map is determined. Steps 3 to 5 are repeated until the feature maps at every scale in the target training viewpoint feature map set and the reference training viewpoint feature map set have been processed, and the first depth estimation pre-model is obtained.
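The numerical examples in steps 2, 5, and 6 above can be sketched as follows. This is a minimal illustration for a single pixel, not the network itself: the function names are hypothetical, the planes are assumed to be evenly spaced, and the rule in `narrowed_range` (centering the next scale's range on the current estimate with a fixed half-width) is an assumption, since the text does not specify how the range for the next scale is derived from the floating-point depth.

```python
def depth_hypothesis_planes(d_min, d_max, num_planes):
    """Evenly spaced hypothesis depths covering [d_min, d_max] (assumed spacing)."""
    if num_planes < 2:
        raise ValueError("need at least two planes")
    step = (d_max - d_min) / (num_planes - 1)
    return [d_min + k * step for k in range(num_planes)]


def expected_depth(planes, probabilities):
    """Floating-point depth as the probability-weighted sum of plane depths."""
    total = sum(probabilities)
    return sum(d * p for d, p in zip(planes, probabilities)) / total


def narrowed_range(depth, half_width, d_min, d_max):
    """Hypothetical refinement rule: centre the next range on the estimate."""
    return max(d_min, depth - half_width), min(d_max, depth + half_width)


planes = depth_hypothesis_planes(1.0, 3.0, 3)       # 1 m, 2 m, 3 m planes
depth = expected_depth(planes, [0.7, 0.2, 0.1])     # 1*0.7 + 2*0.2 + 3*0.1
next_range = narrowed_range(depth, 0.5, 1.0, 3.0)   # range for the next scale
```

With the probabilities from the example in step 5, `depth` evaluates to approximately 1.4 meters, and calling `depth_hypothesis_planes(1.0, 3.0, 5)` gives the denser 1, 1.5, 2, 2.5, 3 meter division mentioned in step 2.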

[0112] Step S103: Update the first pseudo-depth label set based on the first depth estimation pre-model, and calculate the photometric loss value and depth consistency loss value corresponding to the first depth estimation pre-model based on the updated first pseudo-depth label set, the preset photometric loss function and the preset depth consistency loss function.

[0113] In this step, the image depth estimation system acquires the depth image output by the first depth estimation pre-model, generates depth images of different scales according to the floating-point depth in the depth image, updates the first pseudo-depth label set, and calculates the photometric loss value and depth consistency loss value corresponding to the first depth estimation pre-model based on the updated first pseudo-depth label set, the preset photometric loss function, and the preset depth consistency loss function.

[0114] The preset photometric loss function is:

[0115]

[0116] Where p is the position index over all points in the target viewpoint; Ω is the binary mask indicating whether the projection of the reference training viewpoint feature map onto the target training viewpoint feature map is out of bounds; I_t is the target viewpoint image; I_{i→t} is the reconstructed viewpoint image obtained by projecting the reference training viewpoint feature map onto the target training viewpoint feature map; ε is the threshold of the Huber loss; α is the loss weight; and K indicates that the function ultimately keeps only the K smallest losses for each pixel on the target training viewpoint feature map. This resists the large contaminating losses caused by some reference viewpoints being occluded and avoids destabilizing network learning.
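The role of the Huber threshold and of keeping only the K smallest losses can be illustrated with a small sketch. This is one plausible reading of the symbol definitions above, not the formula of the application: averaging the K kept losses, averaging over valid pixels, and applying α as an overall multiplier are all assumptions, and images are flattened to plain lists of intensities for brevity.

```python
def huber(residual, eps):
    """Huber penalty: quadratic for |residual| <= eps, linear beyond it."""
    a = abs(residual)
    return 0.5 * a * a if a <= eps else eps * (a - 0.5 * eps)


def photometric_loss(target, warped_views, masks, eps=0.1, k=2, alpha=1.0):
    """target[p] is I_t(p); warped_views[i][p] is I_{i->t}(p);
    masks[i][p] is 1 when the projection of view i at p is in bounds."""
    total, counted = 0.0, 0
    for p, t in enumerate(target):
        losses = sorted(
            huber(t - view[p], eps)
            for view, mask in zip(warped_views, masks)
            if mask[p]
        )
        if not losses:
            continue                      # no valid reference view at p
        kept = losses[:k]                 # keep only the K smallest losses
        total += sum(kept) / len(kept)    # assumed: mean of the kept losses
        counted += 1
    return alpha * total / counted if counted else 0.0


target = [0.5, 0.2]
warped = [[0.5, 0.2], [0.6, 0.9], [0.0, 0.2]]   # view 1 is "occluded" at p=1
masks = [[1, 1], [1, 1], [1, 1]]
loss = photometric_loss(target, warped, masks, eps=0.1, k=2)
```

In this toy input the large residual of the second reference view at the second pixel is discarded by the K-smallest selection, so it does not contribute to the loss.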

[0117] The preset depth consistency loss function is:

[0118]

[0119] Where D_l is the depth estimation result for the target training viewpoint feature map at scale l, and D_{s→l} is the result of downsampling from the target training viewpoint feature map at scale s to the target training viewpoint feature map at scale l. The max operator selects the pair of depths with the largest difference when downsampling from scale s to scale l, which is used to enhance robustness.
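A possible reading of this loss is sketched below for illustration only. It assumes that depth maps at finer scales are average-pooled down to scale l, that the per-pixel penalty is the largest absolute difference across the finer scales, and that the result is averaged over pixels; none of these choices is confirmed by the text, and the function names are hypothetical.

```python
def downsample(depth, factor):
    """Average-pool a 2D depth map (list of rows) by an integer factor."""
    h, w = len(depth) // factor, len(depth[0]) // factor
    return [
        [
            sum(depth[y * factor + dy][x * factor + dx]
                for dy in range(factor) for dx in range(factor)) / factor ** 2
            for x in range(w)
        ]
        for y in range(h)
    ]


def depth_consistency_loss(coarse, finer_maps):
    """coarse is D_l; finer_maps is a list of (D_s, factor) pairs, where
    factor is the resolution ratio between scale s and scale l."""
    resampled = [downsample(d, f) for d, f in finer_maps]   # each D_{s->l}
    total, n = 0.0, 0
    for y, row in enumerate(coarse):
        for x, d_l in enumerate(row):
            # assumed: keep the largest disagreement across scales
            total += max(abs(d_l - r[y][x]) for r in resampled)
            n += 1
    return total / n


d_l = [[2.0]]
d_2x = [[2.0, 2.0], [2.0, 2.0]]
d_4x = [[2.5] * 4 for _ in range(4)]
consistency = depth_consistency_loss(d_l, [(d_2x, 2), (d_4x, 4)])
```

Here the 2x map agrees with D_l and the 4x map disagrees by 0.5, so the maximum selects the 4x disagreement.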

[0120] Step S104: Validate the first depth estimation pre-model, obtain the validation result, and determine whether the validation result, the photometric loss value, and the depth consistency loss value meet the preset conditions.

[0121] In this step, the image depth estimation system obtains a preset validation set, inputs the validation set into the first depth estimation pre-model to obtain a depth image, validates the depth image, obtains the validation result, and determines whether the validation result, photometric loss value, and depth consistency loss value meet the preset conditions. It should be noted that the preset conditions can be set according to specific circumstances and are not limited here.

[0122] Step S105: If the validation result, the photometric loss value, and the depth consistency loss value meet the preset conditions, then the first depth estimation pre-model is used as the first depth estimation model.

[0123] In this step, if the image depth estimation system determines that the validation result, the photometric loss value, and the depth consistency loss value meet the preset conditions, the first depth estimation pre-model is used as the first depth estimation model.

[0124] Step S106: If the validation result, the photometric loss value, or the depth consistency loss value does not meet the preset conditions, then based on the updated first pseudo-depth label set, the following step is executed again: the pre-created first depth estimation network is iteratively trained according to the training set and the first pseudo-depth label set to obtain the first depth estimation pre-model.

[0125] In this step, if the image depth estimation system determines that the validation result, the photometric loss value, or the depth consistency loss value does not meet the preset conditions, it re-executes the following step based on the updated first pseudo-depth label set: iteratively training the pre-created first depth estimation network according to the training set and the first pseudo-depth label set to obtain the first depth estimation pre-model. This is repeated until a first depth estimation pre-model that meets the preset conditions is obtained.
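The control flow of steps S102 to S106 can be summarized by the following sketch. All callables are hypothetical placeholders passed in by the caller, and the `max_rounds` cap is an added safeguard that the text does not mention; the toy stand-ins at the bottom exist only to make the loop executable.

```python
def train_first_stage(train_once, update_labels, compute_losses, validate,
                      meets_conditions, labels, max_rounds=10):
    """Repeat S102-S104 until the preset conditions hold (S105)."""
    for round_index in range(1, max_rounds + 1):
        model = train_once(labels)                            # S102
        labels = update_labels(model)                         # S103: refresh labels
        photometric, consistency = compute_losses(model, labels)
        result = validate(model)                              # S104
        if meets_conditions(result, photometric, consistency):
            return model, labels, round_index                 # S105
        # S106: fall through and retrain with the updated labels
    raise RuntimeError("preset conditions not met within max_rounds")


# Toy stand-ins: the "model" is a counter that improves by one each round.
model, labels, rounds = train_first_stage(
    train_once=lambda labels: labels + 1,
    update_labels=lambda model: model,
    compute_losses=lambda model, labels: (1.0 / model, 1.0 / model),
    validate=lambda model: model,
    meets_conditions=lambda result, photo, cons: (
        result >= 3 and photo < 0.5 and cons < 0.5),
    labels=0,
)
```

With these stand-ins the conditions are first met in the third round, at which point the pre-model is returned as the first depth estimation model.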

[0126] Step S20: Obtain a first depth image set based on the first depth estimation model and the training set, and determine a second pseudo depth label set based on the first depth image set;

[0127] In this embodiment, after training a first depth estimation model, the image depth estimation system inputs a training set into the first depth estimation model. The first depth estimation model then outputs a first depth image set based on the training set. The image depth estimation system then determines a second pseudo-depth label set based on the first depth image set. It should be noted that the first depth image set includes a target depth image and a reference depth image set. The target depth image is used to generate the second pseudo-depth label set, and the reference depth image set is used to assist in generating the second pseudo-depth label set. The second pseudo-depth label set includes depth images of different scales corresponding to the target depth image.

[0128] Specifically, the step of determining the second pseudo-depth label set based on the first depth image set includes:

[0129] Step S201: Obtain the depth value and coordinate value corresponding to each pixel in the target depth image;

[0130] In this step, the image depth estimation system obtains the depth value and coordinate value corresponding to each pixel in the target depth image. It can be understood that the target depth image is the depth image output by the first depth estimation model. Therefore, the depth value corresponding to each pixel in the target depth image is the floating-point depth of each pixel. The image depth estimation system can establish a spatial coordinate system that conforms to the actual situation of the target depth image and the reference depth image set based on the pixel distribution of the target depth image and the reference depth image set, and then determine the coordinate value corresponding to each pixel in the target depth image.

[0131] Step S202: Project the depth values based on the reference depth image set to obtain the projected depth value set and the projected coordinate value set corresponding to each pixel in the target depth image;

[0132] In this step, the image depth estimation system selects a reference depth image from the reference depth image set, obtains the depth value of each pixel in the target depth image, and, based on 3D projection and the relative camera pose, determines where the depth value of each pixel in the target depth image projects onto the reference depth image. It resamples the depth at that projected position on the reference depth image to obtain the coordinates and depth values projected onto the reference depth image, and then projects these coordinates and depth values back onto the target depth image to determine the projected depth value and projected coordinates of each pixel in the target depth image. Having done this for the currently selected reference depth image, the system selects another reference depth image from the set and repeats the above projection operation until all reference depth images in the set have been selected, thus obtaining the set of projected depth values and the set of projected coordinate values corresponding to each pixel in the target depth image. It should be noted that the system can project the depth value of each pixel in the target depth image onto the reference depth image either sequentially or simultaneously.

[0133] Step S203: Calculate the depth error set corresponding to each pixel based on the depth value and the set of projected depth values, and calculate the coordinate offset set corresponding to each pixel based on the coordinate value and the set of projected coordinate values.

[0134] In this step, after obtaining the set of projected depth values and the set of projected coordinate values corresponding to each pixel in the target depth image, the image depth estimation system calculates the depth error set of each pixel from its depth value and its set of projected depth values, and calculates the coordinate offset set of each pixel from its coordinate value and its set of projected coordinate values. For example, assume the reference depth image set includes three reference depth images at different locations, so that each pixel in the target depth image has three projected depth values and three projected coordinate values. The system calculates the depth error between the pixel's depth value and each of the three projected depth values to obtain the pixel's depth error set, and calculates the coordinate offset between the pixel's coordinate value and each of the three projected coordinate values to obtain the pixel's coordinate offset set.

[0135] Step S204: Based on the depth error set and the coordinate offset set, determine a reliable pixel set in the target depth image, and determine a second pseudo depth label set based on the reliable pixel set.

[0136] In this step, the image depth estimation system compares each depth error in each pixel's depth error set with a preset depth error threshold to obtain a first comparison result, and compares each coordinate offset in each pixel's coordinate offset set with a preset coordinate offset threshold to obtain a second comparison result. Based on the first and second comparison results, the system determines a reliable pixel set in the target depth image, retains the depth values corresponding to the reliable pixel set, and determines the second pseudo-depth label set from the depth value of each reliable pixel. Continuing the above example, in which each pixel has three depth errors and three coordinate offsets: if the first comparison result shows that the number of depth errors less than the preset depth error threshold is greater than a preset number threshold, and the second comparison result shows that the number of coordinate offsets less than the preset coordinate offset threshold is greater than the preset number threshold, then the pixel is determined to be a reliable pixel.
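The counting rule of step S204 can be sketched as follows, assuming the depth error set and coordinate offset set of each pixel have already been computed as in step S203. The function names, the threshold values in the example, and the use of `None` to mark pixels without a retained label are illustrative assumptions.

```python
def is_reliable(depth_errors, coord_offsets,
                max_depth_error, max_offset, min_count):
    """A pixel is reliable when enough reference views agree on both tests."""
    depth_ok = sum(e < max_depth_error for e in depth_errors)
    offset_ok = sum(o < max_offset for o in coord_offsets)
    return depth_ok > min_count and offset_ok > min_count


def build_pseudo_labels(depths, errors_per_pixel, offsets_per_pixel,
                        max_depth_error, max_offset, min_count):
    """Keep the depth of reliable pixels; mark the others as unlabeled."""
    return [
        d if is_reliable(errs, offs, max_depth_error, max_offset, min_count)
        else None
        for d, errs, offs in zip(depths, errors_per_pixel, offsets_per_pixel)
    ]


depths = [1.4, 2.0]
errors = [[0.01, 0.02, 0.50], [0.30, 0.40, 0.01]]   # three reference views
offsets = [[0.2, 0.5, 3.0], [0.1, 0.2, 0.3]]
labels = build_pseudo_labels(depths, errors, offsets,
                             max_depth_error=0.05, max_offset=1.0, min_count=1)
```

The first pixel passes both tests in two of the three reference views and is kept; the second passes the depth test in only one view and is discarded.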

[0137] Furthermore, in practical applications, steps S201 to S204 can be understood with reference to Figure 4. The image depth estimation system uses the relative camera pose from camera i (which acquires the target depth image) to camera j (which acquires a reference depth image), which includes the camera intrinsic parameter K_{i→j}, the rotation matrix R_{i→j}, and the translation matrix T_{i→j}. First, using 3D projection, the target depth image D_i is projected onto the reference depth image D_j to obtain the projection D′_j, so that a pixel p on the target depth image is projected to a point p′ on the reference depth image. The system then uses the location of the projected point p′ on the reference depth image D_j to resample the depth D_j(p′). Next, using the relative camera pose from camera j of the reference depth image to camera i of the target depth image, which includes the camera intrinsic parameter K_{j→i}, the rotation matrix R_{j→i}, and the translation matrix T_{j→i}, the point p′ and its resampled depth D_j(p′) are projected back onto the target depth image to obtain the point p″ and its projected depth value D(p″). The system determines whether the depth error between the projected depth value D(p″) and the depth D_i(p) is less than the threshold τ2, and whether the coordinate offset from point p″ to point p is less than the threshold τ1. This determines whether the depth value of each pixel in the target depth image is qualified. If the depth value is qualified when verified against multiple reference depth images, the depth value corresponding to the pixel is considered a reliable depth and is retained, thus obtaining the second pseudo-depth label set.
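The forward projection, resampling, and back projection described with reference to Figure 4 can be sketched for a single pixel as below. This is an illustrative sketch under several assumptions: each camera is modeled as a simple pinhole with intrinsics `(fx, fy, cx, cy)`, the pose from camera j to camera i is taken as the inverse of the pose from i to j, resampling uses the nearest pixel rather than interpolation, and all function names are hypothetical.

```python
import math


def backproject(u, v, depth, intr):
    """Pixel (u, v) with depth -> 3D point in that camera's frame."""
    fx, fy, cx, cy = intr
    return [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]


def project(point, intr):
    """3D point in a camera frame -> (u, v, depth) in that camera."""
    fx, fy, cx, cy = intr
    x, y, z = point
    return fx * x / z + cx, fy * y / z + cy, z


def transform(point, rot, trans):
    """Apply the rigid motion X' = R X + T."""
    return [sum(rot[r][c] * point[c] for c in range(3)) + trans[r]
            for r in range(3)]


def invert_pose(rot, trans):
    """Inverse rigid motion: R^T and -R^T T."""
    rot_t = [[rot[c][r] for c in range(3)] for r in range(3)]
    return rot_t, [-sum(rot_t[r][c] * trans[c] for c in range(3))
                   for r in range(3)]


def reprojection_check(p, depth_i, depth_j, intr_i, intr_j,
                       rot_ij, trans_ij, tau1, tau2):
    u, v = p
    d = depth_i[v][u]                                        # D_i(p)
    u1, v1, _ = project(
        transform(backproject(u, v, d, intr_i), rot_ij, trans_ij), intr_j)
    uj, vj = round(u1), round(v1)                            # p' (nearest pixel)
    if not (0 <= vj < len(depth_j) and 0 <= uj < len(depth_j[0])):
        return False                                         # p' out of bounds
    d_j = depth_j[vj][uj]                                    # resampled D_j(p')
    rot_ji, trans_ji = invert_pose(rot_ij, trans_ij)
    u2, v2, d2 = project(
        transform(backproject(u1, v1, d_j, intr_j), rot_ji, trans_ji), intr_i)
    offset = math.hypot(u2 - u, v2 - v)                      # |p'' - p|
    return offset < tau1 and abs(d2 - d) < tau2              # both thresholds


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
intr = (100.0, 100.0, 2.0, 2.0)
flat_2m = [[2.0] * 5 for _ in range(5)]
flat_3m = [[3.0] * 5 for _ in range(5)]
consistent = reprojection_check((2, 2), flat_2m, flat_2m, intr, intr,
                                identity, [0.02, 0.0, 0.0], 1.0, 0.01)
inconsistent = reprojection_check((2, 2), flat_2m, flat_3m, intr, intr,
                                  identity, [0.02, 0.0, 0.0], 1.0, 0.01)
```

In the toy scene both cameras look at a fronto-parallel plane. When the reference depth image agrees with the target (2 meters in both), the pixel passes; when the reference depth image reports 3 meters, the reprojected depth differs by 1 meter and the pixel is rejected.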

[0138] Step S30: Iteratively train the pre-created second depth estimation network based on the training set and the second pseudo-depth label set to obtain the second depth estimation model;

[0139] In this embodiment, after obtaining the second pseudo-depth label set, the image depth estimation system iteratively trains the pre-created second depth estimation network based on the training set and the second pseudo-depth label set to obtain the second depth estimation model.

[0140] Specifically, step S30 includes:

[0141] Step S301: Iteratively train the pre-created second depth estimation network according to the training set and the second pseudo-depth label set to obtain the second depth estimation pre-model;

[0142] In this step, the specific steps of iteratively training the pre-created second depth estimation network based on the training set and the second pseudo-depth label set include:

[0143] 1. The image depth estimation system inputs the training set into the second depth estimation network. The multi-scale multi-view feature extraction network extracts features from the target training viewpoint image in the training set and from its several corresponding reference training viewpoint images, yielding a target training viewpoint feature map set corresponding to the target training viewpoint image and a reference training viewpoint feature map set corresponding to the reference training viewpoint images. These sets include downsampled, full-resolution, or multi-scale feature maps. As shown in Figure 3, the multi-scale multi-view feature extraction network extracts feature maps at three scales for both the target training viewpoint image and the reference training viewpoint images. It can be understood that feature maps at a different number of scales can be extracted depending on the specific situation, and the specific number is not limited here. The shallow feature maps are used to learn information such as color and texture observed in the target training viewpoint image, while the deep feature maps are used to learn the spatial layout and geometric cues of the target training viewpoint image.

[0144] 2. Input the target training viewpoint feature map set, the reference training viewpoint feature map set, and the training set into the differentiable homography transformation module. Based on the scene range and camera parameters corresponding to the target and reference training viewpoint images in the training set, the module determines the spatial position of each pixel in the target and reference training viewpoint images, and from this determines the depth hypothesis range corresponding to each pixel. The module also acquires the first-scale target training viewpoint feature map from the target training viewpoint feature map set and the first-scale reference training viewpoint feature maps from the reference training viewpoint feature map set. Based on the camera parameters and shooting angle corresponding to the first-scale target training viewpoint feature map, the position in space of each pixel in the first-scale target training viewpoint feature map is determined, and all first-scale reference training viewpoint feature maps are projected onto the first-scale target training viewpoint feature map based on the homography matrix corresponding to each hypothetical depth within the depth hypothesis range. This divides the depth hypothesis range into a series of depth hypothesis planes as the basis for the subsequent calculation of the matching cost. For example, if the depth hypothesis range of each pixel on the first-scale target training viewpoint feature map is determined to be 1 meter to 3 meters, then 1-meter, 2-meter, and 3-meter depth hypothesis planes can be divided within the depth hypothesis range. It should be noted that 1-meter, 1.5-meter, 2-meter, 2.5-meter, and 3-meter depth hypothesis planes can also be divided; that is, the specific number of depth hypothesis planes is determined according to the specific settings and is not limited here.

[0145] 3. Through the matching cost calculation module, the matching deviation of each pixel on the first-scale target training viewpoint feature map on each different depth hypothesis plane with the corresponding pixel on each first-scale reference training viewpoint feature map is calculated, and the matching cost is determined based on the deviation. The smaller the matching cost, the closer the depth of the pixel is to the true depth.

[0146] 4. Input the matching cost of each pixel in the first-scale target training viewpoint feature map at each different depth hypothesis plane with the corresponding pixel in each first-scale reference training viewpoint feature map into the matching cost regularization network. This network utilizes a deep neural network to adapt to and resist unavoidable noise in the matching cost. By learning the maximum likelihood estimate between the matching cost distribution and the supervision data distribution, it outputs the discrete probability that each pixel in the first-scale target training viewpoint feature map falls on each different depth hypothesis plane. A larger discrete probability indicates a greater likelihood that the pixel's depth corresponds to the depth of the depth hypothesis plane. It should be noted that the matching cost distribution is the result distribution of the matching cost for each pixel in the target training viewpoint feature map obtained by the matching cost calculation module, while the supervision data distribution is the result distribution of the loss values for each pixel in the target training viewpoint feature map obtained based on the second pseudo-depth label set and the second pseudo-depth label loss function. The second pseudo-depth label loss function is:

[0147]

[0148] Here, the pseudo-depth label is downsampled to scale l, and the difference between it and the estimated result D_l at that scale is calculated.

[0149] 5. Input the discrete probability of each pixel in the first-scale target training viewpoint feature map falling on each different depth hypothesis plane into the depth probability normalization module. The depth probability normalization module obtains the floating-point depth of each pixel in the first-scale target training viewpoint feature map by weighted summation based on the discrete probabilities. For example, assuming that the discrete probabilities of a pixel in the first-scale target training viewpoint feature map falling on the 1-meter depth hypothesis plane, the 2-meter depth hypothesis plane, and the 3-meter depth hypothesis plane are 0.7, 0.2, and 0.1 respectively, the weighted summation based on the discrete probabilities is: 1*0.7 + 2*0.2 + 3*0.1, which gives the floating-point depth of the pixel as 1.4 meters.

[0150] 6. After obtaining the floating-point depth of each pixel in the first-scale target training viewpoint feature map, the image depth estimation system inputs the floating-point depth into the differentiable homography transformation module in the second depth estimation network. The differentiable homography transformation module obtains the second-scale target training viewpoint feature map from the target training viewpoint feature map set and the second-scale reference training viewpoint feature maps from the reference training viewpoint feature map set. Then, based on the second-scale reference training viewpoint feature maps, the second-scale target training viewpoint feature map, and the floating-point depth, the depth hypothesis range of each pixel in the second-scale target training viewpoint feature map is determined. Steps 3 to 5 are repeated until the feature maps at every scale in the target training viewpoint feature map set and the reference training viewpoint feature map set have been processed, and the second depth estimation pre-model is obtained.
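The supervision used in step 4 above compares the estimate at each scale with the pseudo-depth label downsampled to that scale. A minimal sketch of such a comparison follows. It is an assumption-laden illustration rather than the loss function of the application: the difference is taken as a mean absolute error, the per-scale terms are summed with equal weight, pixels without a reliable label (marked `None`) are skipped, and the labels are assumed to have been downsampled already.

```python
def pseudo_label_loss(estimates_by_scale, labels_by_scale):
    """Sum over scales of the mean |D_l - label_l| on labeled pixels."""
    total = 0.0
    for estimate, label in zip(estimates_by_scale, labels_by_scale):
        pairs = [(e, t) for e, t in zip(estimate, label) if t is not None]
        if pairs:                          # skip scales with no reliable label
            total += sum(abs(e - t) for e, t in pairs) / len(pairs)
    return total


estimates = [[1.5, 2.0], [1.4, 2.1, 3.0, 0.9]]      # two scales, flattened
labels = [[1.4, None], [1.4, 2.0, None, 1.0]]       # None = unreliable pixel
loss = pseudo_label_loss(estimates, labels)
```

Skipping unlabeled pixels reflects the fact that the second pseudo-depth label set retains depth values only for pixels that passed the geometric consistency check.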

[0151] Step S302: Obtain a depth image based on the second depth estimation pre-model and the training set, and update the second pseudo-depth label set based on the depth image;

[0152] In this step, the image depth estimation system inputs the training set into the second depth estimation pre-model, outputs a depth image through the second depth estimation pre-model, and updates the second pseudo-depth label set based on the floating-point depth in the depth image.

[0153] Step S303: Obtain the reliability of the updated second pseudo-depth label set, and determine whether the reliability meets the preset conditions;

[0154] In this step, the image depth estimation system obtains the reliability of the updated second pseudo-depth label set and the reliability of the second pseudo-depth label set before the update, and determines how similar the two reliability values are. If the two reliability values are similar and greater than a preset reliability threshold, step S302 is repeated multiple times to update the second pseudo-depth label set further. The system then determines whether the reliability values of the second pseudo-depth label set after these multiple updates are similar and greater than the preset reliability threshold. If so, the reliability of the second pseudo-depth label set meets the preset condition; otherwise, it does not.

[0155] Step S304: If the reliability meets the preset conditions, then the second depth estimation pre-model is used as the second depth estimation model.

[0156] In this step, if the image depth estimation system determines that the reliability of the second pseudo-depth label meets the preset conditions, then the second depth estimation pre-model is used as the second depth estimation model.

[0157] Step S305: If the reliability does not meet the preset conditions, then based on the updated second pseudo-depth label set, the following steps are repeated: the pre-created second depth estimation network is iteratively trained according to the training set and the second pseudo-depth label set to obtain the second depth estimation pre-model.

[0158] In this step, if the image depth estimation system determines that the reliability of the second pseudo-depth label does not meet the preset conditions, it will retrain the pre-created second depth estimation network based on the training set and the updated set of second pseudo-depth labels to obtain the second depth estimation pre-model, and perform subsequent steps until it is determined that the reliability of the second pseudo-depth label meets the preset conditions, and then use the second depth estimation pre-model as the second depth estimation model.
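The stopping test of step S303 can be sketched as a check on the history of reliability values recorded after each label update. The text does not define how reliability is measured or what counts as "similar", so the sketch assumes a scalar reliability per update (for instance, the fraction of pixels passing the geometric consistency check), a fixed tolerance for similarity, and a window of three consecutive updates; the function name and all numeric values are hypothetical.

```python
def reliability_converged(history, threshold, tolerance, window=3):
    """True when the last `window` reliability values are all above the
    threshold and differ from one another by at most `tolerance`."""
    if len(history) < window:
        return False
    recent = history[-window:]
    return min(recent) > threshold and max(recent) - min(recent) <= tolerance


still_changing = reliability_converged([0.62, 0.81, 0.85], 0.8, 0.02)
stable = reliability_converged([0.62, 0.81, 0.85, 0.86, 0.86], 0.8, 0.02)
```

When the check succeeds, the second depth estimation pre-model would be taken as the second depth estimation model (step S304); otherwise training continues with the updated labels (step S305).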

[0159] Step S40: Obtain the set of images to be estimated, input the set of images to be estimated into the second depth estimation model, and obtain the depth image corresponding to the set of images to be estimated.

[0160] In this embodiment, the image depth estimation system acquires a set of images to be estimated, which includes a target image to be estimated and a reference image to be estimated. The set of images to be estimated is then input into a second depth estimation model to obtain a depth image corresponding to the set of images to be estimated.

[0161] Specifically, step S40 includes:

[0162] Step S401: Input the set of images to be estimated into the second depth estimation model;

[0163] Step S402: Extract the first feature map set of the target image to be estimated and the second feature map set of the reference image to be estimated from the set of images to be estimated using the second depth estimation model;

[0164] Step S403: Determine the assumed depth range using the second depth estimation model based on the first feature map set and the second feature map set;

[0165] Step S404: The second depth estimation model obtains the depth image corresponding to the set of images to be estimated based on the first feature map set, the second feature map set, and the assumed depth range.

[0166] In steps S401 to S404, the image depth estimation system inputs the set of images to be estimated into the second depth estimation model, extracts the first feature map set of the target image to be estimated and the second feature map set of the reference image to be estimated from the set of images to be estimated through the second depth estimation model, determines the assumed depth range of each pixel in the target image to be estimated based on the first feature map set and the second feature map set, and obtains the depth image corresponding to the set of images to be estimated through the second depth estimation model based on the first feature map set, the second feature map set, and the assumed depth range.

[0167] The specific steps include:

[0168] 1. The image depth estimation system inputs the set of images to be estimated into the second depth estimation network. A multi-scale, multi-view feature extraction network extracts features from the target image to be estimated and from its several corresponding reference images in the set of images to be estimated, yielding a first feature map set corresponding to the target image to be estimated and a second feature map set corresponding to the reference images to be estimated, respectively. The first feature map set and the second feature map set include downsampled, full-resolution, or multi-scale feature maps.

[0169] 2. Input the first feature map set, the second feature map set, and the set of images to be estimated into the differentiable homography transformation module. Based on the scene range and camera parameters corresponding to the target image to be estimated and the reference image to be estimated, the module determines the spatial position of each pixel in the target image to be estimated and the reference image to be estimated, and then determines the depth assumption range corresponding to each pixel. The module then takes the first feature map at the first scale from the first feature map set and the second feature maps at the first scale from the second feature map set. Using the homography matrix corresponding to each assumed depth in the depth assumption range, it projects all second feature maps at the first scale onto the first feature map at the first scale, dividing the depth assumption range into a series of depth assumption planes that serve as the basis for the subsequent matching cost calculation.

[0170] 3. Through the matching cost calculation module, the matching deviation of each pixel on the first feature map of the first scale with the corresponding pixel on each different depth assumption plane is calculated, and the matching cost is determined based on the deviation. The smaller the matching cost, the closer the depth of the pixel is to the true depth.

[0171] 4. Input the matching cost of each pixel in the first feature map at the first scale and its corresponding pixel in each of the different depth assumption planes into the matching cost regularization network. This network uses a deep neural network to adapt to and resist unavoidable noise in the matching cost. By learning the maximum likelihood estimate between the matching cost distribution and the supervised data distribution, it outputs the discrete probability that each pixel in the first feature map at the first scale falls on each different depth assumption plane. The supervised data distribution is the result distribution of the loss values of each pixel in the target training viewpoint feature map, obtained based on the second pseudo-depth label set and the second pseudo-depth label loss function. The second pseudo-depth label loss function is:

[0172]

[0173] where the first term denotes the pseudo-depth label downsampled to scale l, which is compared with the estimated result D_l to calculate the difference.

[0174] 5. Input the discrete probability of each pixel on the first feature map of the first scale falling on each different depth hypothesis plane into the depth probability normalization module. The depth probability normalization module obtains the floating-point depth of each pixel on the first feature map of the first scale by weighted summation based on the discrete probabilities.

[0175] 6. After obtaining the floating-point depth of each pixel in the first feature map at the first scale, the image depth estimation system inputs the floating-point depth into the differentiable homography transformation module in the second depth estimation network. The differentiable homography transformation module obtains the first feature map at the second scale in the first feature map set and the second feature map at the second scale in the second feature map set. Then, based on the second feature map at the second scale, the first feature map at the second scale, and the floating-point depth, the depth assumption range of each pixel in the first feature map at the second scale is determined. Steps 3-5 are repeated until the feature maps at each scale in the first and second feature map sets have been calculated. Finally, the system combines the floating-point depth of each pixel obtained in the last calculation with the target image to be estimated and outputs the depth image corresponding to the image set to be estimated.
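Steps 3 to 5 and the range refinement in step 6 can be sketched for a single pixel as follows. This is a minimal illustration, not the patented implementation. It makes several assumptions: the reference features have already been warped onto each depth assumption plane, the matching cost is a mean squared feature difference, and a plain softmax over negative costs stands in for the learned matching cost regularization network. All function names are hypothetical.

```python
import math


def matching_costs(target_feature, warped_features):
    """Step 3 (assumed form): mean squared feature difference per depth plane.

    target_feature: feature vector of one pixel in the target feature map.
    warped_features: one warped reference feature vector per depth plane.
    """
    return [
        sum((a - b) ** 2 for a, b in zip(target_feature, warped)) / len(target_feature)
        for warped in warped_features
    ]


def plane_probabilities(costs):
    """Step 4 stand-in: softmax over negative costs.

    The source uses a learned regularization network here; the softmax is only
    an assumption that keeps the sketch self-contained.
    """
    lowest = min(costs)  # subtract the minimum for numerical stability
    weights = [math.exp(-(c - lowest)) for c in costs]
    total = sum(weights)
    return [w / total for w in weights]


def expected_depth(probabilities, plane_depths):
    """Step 5: floating-point depth as the probability-weighted sum of plane depths."""
    return sum(p * d for p, d in zip(probabilities, plane_depths))


def refine_hypotheses(center_depth, interval, num_planes):
    """Step 6 (assumed form): depth planes centred on the previous estimate."""
    half = (num_planes - 1) / 2
    return [center_depth + (i - half) * interval for i in range(num_planes)]


if __name__ == "__main__":
    depths = [1.0, 2.0, 3.0]
    costs = matching_costs([1.0, 0.0], [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0]])
    depth = expected_depth(plane_probabilities(costs), depths)
    print(depth, refine_hypotheses(depth, 0.5, 3))
```

Because the lowest cost receives the highest probability, the weighted sum is pulled toward the best-matching plane while still producing a continuous depth value, which the next scale then uses as the centre of a narrower range.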

[0176] Before performing image depth estimation, the image depth estimation system of this embodiment first acquires a training set and trains a pre-created first depth estimation network based on the training set. During the training process, a first pseudo-depth label set is generated, and the pre-created first depth estimation network is iteratively trained according to the training set and the first pseudo-depth label set to obtain a first depth estimation model. After obtaining the first depth estimation model, the image depth estimation system inputs the training set into the first depth estimation model, so that the first depth estimation model outputs a first depth image set based on the training set. The system then checks and filters out unreliable depth values in the first depth image set to obtain a second pseudo-depth label set. The image depth estimation system iteratively trains the pre-created second depth estimation network according to the training set and the second pseudo-depth label set to obtain a second depth estimation model. After obtaining the second depth estimation model, the image depth estimation system acquires a set of images to be estimated and inputs the set of images to be estimated into the second depth estimation model to obtain the depth images corresponding to the set of images to be estimated. This invention combines two multi-view depth estimation networks with two pseudo-depth labels to enhance the performance of the overall unsupervised method, strengthens the effect of using image data for unsupervised learning, and eliminates the need to use complex 3D reconstruction techniques to update pseudo-depth labels. Based on geometric consistency checks and iterative training, it improves the target viewpoint depth estimation quality of the model, thereby improving the training efficiency and depth estimation quality of the depth estimation model.

[0177] Further, referring to Figure 5, a second embodiment of the present invention is proposed. The difference between the second embodiment and the first embodiment is that the step of updating the second pseudo-depth label set according to the depth image includes:

[0178] Step S3021: Determine a third pseudo-depth label set based on the depth image, and calculate the reliability of each third pseudo-depth label in the third pseudo-depth label set;

[0179] Step S3022: Obtain the reliability of each second pseudo-depth label in the second pseudo-depth label set;

[0180] Step S3023: Compare the reliability of the second pseudo-depth label with the reliability of the corresponding third pseudo-depth label to obtain the comparison result, and update the second pseudo-depth label set according to the comparison result.

[0181] In this embodiment, after obtaining the depth image from the training set using the second depth estimation pre-model, the image depth estimation system determines a third pseudo-depth label set based on the depth image. It then calculates a first error map based on the third pseudo-depth label set and a preset depth quality comparison map, and determines the reliability of each third pseudo-depth label in the third pseudo-depth label set based on the first error map. The image depth estimation system calculates a second error map based on the second pseudo-depth label set and the preset depth quality comparison map, and determines the reliability of each second pseudo-depth label in the second pseudo-depth label set based on the second error map. The image depth estimation system compares the reliability of each second pseudo-depth label in the second pseudo-depth label set with the reliability of the corresponding third pseudo-depth label in the third pseudo-depth label set, obtains a comparison result, and removes second pseudo-depth labels in the second pseudo-depth label set whose reliability is lower than that of the corresponding third pseudo-depth label in the third pseudo-depth label set, replacing them with the corresponding third pseudo-depth label from the third pseudo-depth label set, thereby updating the second pseudo-depth label set.

[0182] The image depth estimation system in this embodiment determines a third set of pseudo-depth labels based on the depth image and calculates the reliability of each third pseudo-depth label in the set. It then obtains the reliability of each second pseudo-depth label in the second set. The system compares the reliability of the second pseudo-depth labels with the reliability of their corresponding third pseudo-depth labels to obtain a comparison result, and updates the second pseudo-depth label set based on this result. Unlike existing technologies that use 3D reconstruction to maintain the quality of pseudo-depth labels, this system uses existing reliable depth labels and employs multi-threshold, multi-viewpoint geometric consistency checks of varying stringency to determine the reliability of all current depth labels. These reliability values are then prioritized, and the most reliable depth value among the old and new pseudo-depth labels is selected for updating at each point. This significantly reduces computational load, improves model training efficiency, and ensures the quality of depth estimation.
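The per-point selection between old and new pseudo-depth labels can be sketched as below. This is only an illustration under stated assumptions: labels and reliabilities are given as flat, equally long lists with one entry per pixel, a higher reliability value means a more trustworthy label, and the old label is kept on ties. The function name `update_pseudo_labels` is hypothetical.

```python
def update_pseudo_labels(second_labels, second_reliability,
                         third_labels, third_reliability):
    """Replace a second pseudo-depth label only when it is less reliable
    than the corresponding third pseudo-depth label.

    Returns the updated label list and the matching reliability list.
    """
    updated_labels = []
    updated_reliability = []
    for old_depth, old_rel, new_depth, new_rel in zip(
            second_labels, second_reliability, third_labels, third_reliability):
        if old_rel < new_rel:
            # The newly estimated depth is more reliable: take it.
            updated_labels.append(new_depth)
            updated_reliability.append(new_rel)
        else:
            # Keep the existing label (ties keep the old value, an assumption).
            updated_labels.append(old_depth)
            updated_reliability.append(old_rel)
    return updated_labels, updated_reliability


if __name__ == "__main__":
    labels, reliability = update_pseudo_labels(
        [1.0, 2.0, 3.0], [0.9, 0.2, 0.5],
        [1.1, 2.2, 3.3], [0.5, 0.8, 0.5])
    print(labels, reliability)
```

Since only reliability values are compared per point, the update needs no 3D reconstruction pass, which is consistent with the reduced computational load described above.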

[0183] As shown in Figure 6, the present invention also provides an image depth estimation device. The image depth estimation device of the present invention includes:

[0184] The first training module 101 is used to acquire a training set, obtain a first pseudo-depth label set during training based on the training set, and iteratively train a pre-created first depth estimation network according to the training set and the first pseudo-depth label set to obtain a first depth estimation model.

[0185] The determining module 102 is used to obtain a first depth image set based on the first depth estimation model and the training set, and to determine a second pseudo depth label set based on the first depth image set;

[0186] The second training module 103 is used to iteratively train the pre-created second depth estimation network according to the training set and the second pseudo-depth label set to obtain the second depth estimation model.

[0187] The input module 104 is used to acquire a set of images to be estimated, input the set of images to be estimated into the second depth estimation model, and obtain the depth image corresponding to the set of images to be estimated.

[0188] Furthermore, the first training module is also used for:

[0189] The first depth estimation network is iteratively trained based on the training set, and the first depth estimation network generates a first set of pseudo-depth labels corresponding to the target training viewpoint images in the training set according to the training set.

[0190] Furthermore, the first training module is also used for:

[0191] The first depth estimation network is iteratively trained based on the training set and the first pseudo-depth label set to obtain the first depth estimation pre-model.

[0192] The first pseudo-depth label set is updated based on the first depth estimation pre-model, and the photometric loss value and depth consistency loss value corresponding to the first depth estimation pre-model are calculated based on the updated first pseudo-depth label set, the preset photometric loss function and the preset depth consistency loss function.

[0193] The first depth estimation pre-model is validated to obtain the validation results, and it is determined whether the validation results, the photometric loss value, and the depth consistency loss value meet the preset conditions.

[0194] If the verification result, the photometric loss value, and the depth consistency loss value meet the preset conditions, then the first depth estimation pre-model is used as the first depth estimation model.

[0195] If the verification result, the photometric loss value, or the depth consistency loss value does not meet the preset conditions, then based on the updated first pseudo-depth label set, the following steps are repeated: the pre-created first depth estimation network is iteratively trained according to the training set and the first pseudo-depth label set to obtain the first depth estimation pre-model.

[0196] Furthermore, the determining module is also used for:

[0197] Obtain the depth value and coordinate value corresponding to each pixel in the target depth image;

[0198] Projecting the depth values onto the reference depth image set yields a set of projected depth values and a set of projected coordinate values for each pixel in the target depth image.

[0199] The depth error set corresponding to each pixel is calculated based on the depth value and the set of projected depth values, and the coordinate offset set corresponding to each pixel is calculated based on the coordinate value and the set of projected coordinate values.

[0200] Based on the depth error set and the coordinate offset set, a reliable pixel set is determined in the target depth image, and a second pseudo depth label set is determined based on the reliable pixel set.
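The geometric consistency check described above can be sketched for one pixel and one reference view as follows. This is a hedged illustration rather than the patented procedure. It assumes a pinhole camera with shared intrinsics `K = (fx, fy, cx, cy)`, a relative pose with `X_ref = R * X_tgt + t`, nearest-neighbour sampling of the reference depth map, and a round-trip reprojection as the way to obtain the coordinate offset and depth error. The thresholds in `is_reliable` are placeholder values, and all names are hypothetical.

```python
import math


def backproject(u, v, depth, K):
    """Pixel (u, v) with depth -> 3D point in that camera's frame."""
    fx, fy, cx, cy = K
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def project(point, K):
    """3D point in a camera frame -> (u, v, depth) in that camera."""
    fx, fy, cx, cy = K
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy, z)


def to_reference(R, t, point):
    """Target frame -> reference frame: R * point + t."""
    return tuple(sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3))


def to_target(R, t, point):
    """Reference frame -> target frame: R^T * (point - t)."""
    diff = [point[j] - t[j] for j in range(3)]
    return tuple(sum(R[j][i] * diff[j] for j in range(3)) for i in range(3))


def reprojection_errors(u, v, depth_tgt, depth_ref_map, K, R, t):
    """Return (coordinate offset in pixels, relative depth error), or None
    when the pixel projects outside the reference view."""
    p_ref = to_reference(R, t, backproject(u, v, depth_tgt, K))
    if p_ref[2] <= 0:
        return None
    u_ref, v_ref, _ = project(p_ref, K)
    col, row = round(u_ref), round(v_ref)
    if not (0 <= row < len(depth_ref_map) and 0 <= col < len(depth_ref_map[0])):
        return None
    # Use the reference view's own depth estimate and map it back to the target.
    p_back = to_target(R, t, backproject(col, row, depth_ref_map[row][col], K))
    if p_back[2] <= 0:
        return None
    u_back, v_back, depth_back = project(p_back, K)
    offset = math.hypot(u_back - u, v_back - v)
    depth_error = abs(depth_back - depth_tgt) / depth_tgt
    return offset, depth_error


def is_reliable(errors_per_view, max_offset=1.0, max_depth_error=0.01, min_views=2):
    """A pixel is kept when enough reference views agree (thresholds are assumed)."""
    agreeing = sum(
        1 for e in errors_per_view
        if e is not None and e[0] < max_offset and e[1] < max_depth_error)
    return agreeing >= min_views
```

Pixels for which `is_reliable` returns `True` would form the reliable pixel set, and their depth values would become the second pseudo-depth labels.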

[0201] Furthermore, the second training module is also used for:

[0202] The pre-created second depth estimation network is iteratively trained based on the training set and the second pseudo-depth label set to obtain the second depth estimation pre-model.

[0203] A depth image is obtained based on the second depth estimation pre-model and the training set, and the second pseudo-depth label set is updated based on the depth image;

[0204] Obtain the reliability of the updated second pseudo-depth label set, and determine whether the reliability meets the preset conditions;

[0205] If the reliability meets the preset conditions, then the second depth estimation pre-model is used as the second depth estimation model;

[0206] If the reliability does not meet the preset conditions, then based on the updated second pseudo-depth label set, the following steps are repeated: the pre-created second depth estimation network is iteratively trained according to the training set and the second pseudo-depth label set to obtain the second depth estimation pre-model.
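The control flow of the second training module can be sketched as a loop over training rounds. This is a schematic outline only. The callables `train_round`, `infer_depth`, `update_labels`, and `reliability_of` are hypothetical placeholders for the operations named in the text, the "preset conditions" are modelled as a single reliability threshold, and the `max_rounds` cap is an added safety assumption that the source does not mention.

```python
def train_second_model(network, training_set, labels,
                       train_round, infer_depth, update_labels, reliability_of,
                       min_reliability, max_rounds=10):
    """Repeat: train, estimate depth, update pseudo-labels, check reliability."""
    pre_model = None
    for _ in range(max_rounds):
        # Train on the training set with the current second pseudo-depth labels.
        pre_model = train_round(network, training_set, labels)
        # Estimate depth images for the training set with the pre-model.
        depth_images = infer_depth(pre_model, training_set)
        # Update the second pseudo-depth label set from those depth images.
        labels = update_labels(labels, depth_images)
        # Stop once the updated labels are reliable enough.
        if reliability_of(labels) >= min_reliability:
            break
    return pre_model, labels


if __name__ == "__main__":
    # Toy stand-ins that only demonstrate the loop's behaviour.
    state = {"rounds": 0}

    def train_round(net, data, current_labels):
        state["rounds"] += 1
        return "pre-model-%d" % state["rounds"]

    model, final_labels = train_second_model(
        "net", [], [],
        train_round,
        lambda m, data: [state["rounds"]],
        lambda old, new: old + new,
        lambda current: len(current) / 4,
        min_reliability=0.75)
    print(model, final_labels)
```

Whether each round continues from the previous pre-model or restarts from the pre-created network is left to `train_round`, since the text does not specify it.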

[0207] Furthermore, the second training module is also used for:

[0208] A third set of pseudo-depth labels is determined based on the depth image, and the reliability of each third pseudo-depth label in the third set of pseudo-depth labels is calculated.

[0209] Obtain the reliability of each second pseudo-depth label in the second pseudo-depth label set;

[0210] The reliability of the second pseudo-depth label is compared with the reliability of the corresponding third pseudo-depth label to obtain the comparison result, and the second pseudo-depth label set is updated according to the comparison result.

[0211] Furthermore, the input module is also used for:

[0212] Input the set of images to be estimated into the second depth estimation model;

[0213] The second depth estimation model is used to extract the first feature map set of the target image to be estimated and the second feature map set of the reference image to be estimated from the set of images to be estimated.

[0214] The hypothetical depth range is determined by the second depth estimation model based on the first feature map set and the second feature map set;

[0215] The second depth estimation model obtains the depth image corresponding to the set of images to be estimated based on the first feature map set, the second feature map set, and the assumed depth range.

[0216] The present invention also provides an image depth estimation system.

[0217] The image depth estimation system includes: a memory, a processor, and an image depth estimation method program stored in the memory and executable on the processor. When the image depth estimation method program is executed by the processor, it implements the steps of the image depth estimation method as described above.

[0218] The method implemented when the image depth estimation method program running on the processor is executed can be referred to in various embodiments of the image depth estimation method of the present invention, and will not be repeated here.

[0219] The present invention also provides a readable storage medium.

[0220] The readable storage medium stores an image depth estimation method program, which, when executed by a processor, implements the steps of the image depth estimation method as described above.

[0221] The method implemented when the image depth estimation method program running on the processor is executed can be referred to in various embodiments of the image depth estimation method of the present invention, and will not be repeated here.

[0222] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or system. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or system that includes that element.

[0223] The sequence numbers of the above embodiments of the present invention are for descriptive purposes only and do not represent the superiority or inferiority of the embodiments.

[0224] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium (such as ROM / RAM, magnetic disk, optical disk) as described above, and includes several instructions to cause a terminal device (which may be a mobile phone, computer, server, or network device, etc.) to execute the methods described in the various embodiments of the present invention.

[0225] The above are merely preferred embodiments of the present invention and do not limit the patent scope of the present invention. Any equivalent structural or procedural transformations made based on the content of the present invention's specification and drawings, or direct or indirect applications in other related technical fields, are similarly included within the patent protection scope of the present invention.

Claims

1. An image depth estimation method, characterized in that the image depth estimation method includes the following steps: acquiring a training set, obtaining a first pseudo-depth label set during training based on the training set, and iteratively training a pre-created first depth estimation network according to the training set and the first pseudo-depth label set to obtain a first depth estimation model; obtaining a first depth image set based on the first depth estimation model and the training set, and determining a second pseudo-depth label set based on the first depth image set; iteratively training a pre-created second depth estimation network based on the training set and the second pseudo-depth label set to obtain a second depth estimation model; acquiring a set of images to be estimated, inputting the set of images to be estimated into the second depth estimation model, and obtaining the depth image corresponding to the set of images to be estimated; wherein the first depth image set includes a target depth image and a reference depth image set, and the step of determining the second pseudo-depth label set based on the first depth image set includes: obtaining the depth value and coordinate value corresponding to each pixel in the target depth image; projecting the depth values onto the reference depth image set to obtain a set of projected depth values and a set of projected coordinate values for each pixel in the target depth image; calculating the depth error set corresponding to each pixel based on the depth value and the set of projected depth values, and calculating the coordinate offset set corresponding to each pixel based on the coordinate value and the set of projected coordinate values; and determining a reliable pixel set in the target depth image based on the depth error set and the coordinate offset set, and determining the second pseudo-depth label set based on the reliable pixel set.

2. The image depth estimation method as described in claim 1, characterized in that, The step of obtaining the first pseudo-depth label set during training based on the training set includes: The first depth estimation network is iteratively trained based on the training set, and the first depth estimation network generates a first set of pseudo-depth labels corresponding to the target training viewpoint images in the training set according to the training set.

3. The image depth estimation method as described in claim 1, characterized in that, The step of iteratively training the pre-created first depth estimation network based on the training set and the first pseudo-depth label set to obtain the first depth estimation model includes: The first depth estimation network is iteratively trained based on the training set and the first pseudo-depth label set to obtain the first depth estimation pre-model. The first pseudo-depth label set is updated based on the first depth estimation pre-model, and the photometric loss value and depth consistency loss value corresponding to the first depth estimation pre-model are calculated based on the updated first pseudo-depth label set, the preset photometric loss function and the preset depth consistency loss function. The first depth estimation pre-model is validated to obtain the validation results, and it is determined whether the validation results, the photometric loss value, and the depth consistency loss value meet the preset conditions. If the verification result, the photometric loss value, and the depth consistency loss value meet the preset conditions, then the first depth estimation pre-model is used as the first depth estimation model. If the verification result, the photometric loss value, or the depth consistency loss value does not meet the preset conditions, then based on the updated first pseudo-depth label set, the following steps are repeated: the pre-created first depth estimation network is iteratively trained according to the training set and the first pseudo-depth label set to obtain the first depth estimation pre-model.

4. The image depth estimation method as described in claim 1, characterized in that, The step of iteratively training the pre-created second depth estimation network based on the training set and the second pseudo-depth label set to obtain the second depth estimation model includes: The pre-created second depth estimation network is iteratively trained based on the training set and the second pseudo-depth label set to obtain the second depth estimation pre-model. A depth image is obtained based on the second depth estimation pre-model and the training set, and the second pseudo-depth label set is updated based on the depth image; Obtain the reliability of the updated second pseudo-depth label set, and determine whether the reliability meets the preset conditions; If the reliability meets the preset conditions, then the second depth estimation pre-model is used as the second depth estimation model; If the reliability does not meet the preset conditions, then based on the updated second pseudo-depth label set, the following steps are repeated: the pre-created second depth estimation network is iteratively trained according to the training set and the second pseudo-depth label set to obtain the second depth estimation pre-model.

5. The image depth estimation method as described in claim 4, characterized in that, The step of updating the second pseudo-depth label set based on the depth image includes: A third set of pseudo-depth labels is determined based on the depth image, and the reliability of each third pseudo-depth label in the third set of pseudo-depth labels is calculated. Obtain the reliability of each second pseudo-depth label in the second pseudo-depth label set; The reliability of the second pseudo-depth label is compared with the reliability of the corresponding third pseudo-depth label to obtain the comparison result, and the second pseudo-depth label set is updated according to the comparison result.

6. The image depth estimation method as described in claim 1, characterized in that, The step of inputting the set of images to be estimated into the second depth estimation model to obtain the depth image corresponding to the set of images to be estimated includes: Input the set of images to be estimated into the second depth estimation model; The second depth estimation model is used to extract the first feature map set of the target image to be estimated and the second feature map set of the reference image to be estimated from the set of images to be estimated. The second depth estimation model determines the assumed depth range based on the scene range and camera parameters corresponding to the set of images to be estimated; The second depth estimation model obtains the depth image corresponding to the set of images to be estimated based on the first feature map set, the second feature map set, and the assumed depth range.

7. An image depth estimation device, characterized in that the image depth estimation device includes: a first training module, used to acquire a training set, obtain a first pseudo-depth label set during training based on the training set, and iteratively train a pre-created first depth estimation network according to the training set and the first pseudo-depth label set to obtain a first depth estimation model; a determination module, configured to obtain a first depth image set based on the first depth estimation model and the training set, and determine a second pseudo-depth label set based on the first depth image set, wherein the first depth image set includes a target depth image and a reference depth image set, and the determination module is further configured to obtain the depth value and coordinate value corresponding to each pixel in the target depth image; project the depth values based on the reference depth image set to obtain a projected depth value set and a projected coordinate value set corresponding to each pixel in the target depth image; calculate a depth error set corresponding to each pixel based on the depth value and the projected depth value set, and calculate a coordinate offset set corresponding to each pixel based on the coordinate value and the projected coordinate value set; and determine a reliable pixel set in the target depth image based on the depth error set and the coordinate offset set, and determine the second pseudo-depth label set based on the reliable pixel set; a second training module, used to iteratively train the pre-created second depth estimation network based on the training set and the second pseudo-depth label set to obtain the second depth estimation model; and an input module, used to acquire a set of images to be estimated, input the set of images to be estimated into the second depth estimation model, and obtain the depth image corresponding to the set of images to be estimated.

8. An image depth estimation system, characterized in that, The image depth estimation system includes: a memory, a processor, and an image depth estimation method program stored in the memory and executable on the processor. When the image depth estimation method program is executed by the processor, it implements the steps of the image depth estimation method as described in any one of claims 1 to 6.

9. A readable storage medium, characterized in that, The readable storage medium stores an image depth estimation method program, which, when executed by a processor, implements the steps of the image depth estimation method as described in any one of claims 1 to 6.