Method for estimating depth using monocular camera

By employing log-space decomposition and cross-reconstruction loss, the method addresses depth estimation inaccuracies in scenes with reflections and lighting variations, ensuring accurate depth estimation in complex environments with reduced sensor costs.

WO2026141816A1 · PCT designated stage · Publication Date: 2026-07-02 · DAEGU GYEONGBUK INSTITUTE OF SCIENCE AND TECHNOLOGY

Patent Information

Authority / Receiving Office: WO
Patent Type: Applications
Current Assignee / Owner: DAEGU GYEONGBUK INSTITUTE OF SCIENCE AND TECHNOLOGY
Filing Date: 2025-07-15
Publication Date: 2026-07-02

AI Technical Summary

Technical Problem

Conventional self-supervised monocular depth estimation methods struggle with accuracy in scenes containing optical reflections or transparent objects, and lighting variations lead to incorrect depth predictions due to pixel intensity inconsistencies.

Method used

The method employs log-space decomposition and cross-reconstruction loss to separate diffuse and residual images, using an intrinsic decoder to maintain consistent self-supervised learning and minimize lighting and reflection deviations, with techniques like auto-masking to exclude dynamic objects and optical outliers.

Benefits of technology

This approach enables accurate depth estimation in complex scenes with reflective surfaces and varying lighting conditions, reducing sensor costs and maintaining depth learning quality without large-scale labeled data.


Abstract

According to one embodiment of the present disclosure, a computer program stored in a computer-readable storage medium is disclosed. The computer program is configured to perform a method for estimating depth using a monocular camera for an image in which a reflective surface is present, the method comprising the steps of: acquiring a reference image and a source image; estimating a relative pose between the reference image and the source image using a pose network; generating a reference feature map from the reference image, and a source feature map from the source image, using a depth encoder; estimating a depth map for the reference image by providing the reference feature map to a depth decoder; acquiring a reference diffuse image and a reference residual image corresponding to the reference image, and a source diffuse image and a source residual image corresponding to the source image, by providing the reference feature map and the source feature map to an intrinsic decoder; and calculating a photometric reprojection loss between the reference image and the source image on the basis of the estimated relative pose, the depth map, the reference diffuse image, the reference residual image, the source diffuse image, and the source residual image, and performing self-supervised learning by updating parameters of a monocular camera depth estimation model to minimize the photometric reprojection loss.

Description

Monocular camera depth estimation method

[0001] The present invention belongs to the field of computer vision and machine learning, and in particular relates to a method for estimating reliable depth information using only images captured by a single RGB camera.

[0002] Recently, various types of sensors (e.g., LiDAR, stereo cameras) have been used in autonomous vehicles and augmented reality devices to determine 3D spatial information quickly and accurately. However, these devices are expensive or have demanding installation requirements, which limits their use on lightweight platforms such as smartphones or small robots. Consequently, research is actively underway to estimate depth using only a single RGB camera.

[0003] Traditional supervised learning methods require obtaining numerous depth labels (ground truth), but this entails high production costs and time consumption. Self-supervised methods learn depth by minimizing photometric reprojection loss through viewpoint shifts between images without additional labels. This approach has the advantage of significantly reducing labeling costs and being applicable to a wide variety of scenes.

[0004] Natural scenes contain metallic reflections, mirror surfaces, gloss, and transparent objects, and pixel brightness can vary significantly even for the same object depending on the lighting environment. Conventional self-supervision techniques alone face a problem where depth accuracy deteriorates in scenes containing optical reflections or transparent objects.

[0005] Furthermore, if the photometric reprojection loss is applied directly under conditions where lighting intensity varies, pixel intensity consistency is likely to break, leading to incorrect depth estimates.

[0006] Conventional self-supervised monocular depth estimation is highly likely to cause significant errors in scenes containing optical reflections or transparent objects. In particular, when non-Lambertian reflectors, such as metallic or mirror surfaces, are present, pixel brightness changes or becomes distorted more significantly than expected, making it difficult to simply apply photometric reprojection loss. Furthermore, in environments with varying lighting conditions, pixel intensity can differ greatly even for the same object, hindering accurate depth prediction.

[0007] This problem is more pronounced in self-supervised methods that obtain learning signals solely from viewpoint shifts between reference and source images. In scenes with significant lighting variations, even when applying loss functions based on simple pixel differences or SSIM (Structural Similarity), discrepancies can occur during the reprojection process due to optical reflections or highlight regions, which can distort the learning process.

[0008] The present disclosure is derived from the aforementioned background technology and aims to estimate a highly reliable depth map even in scenes where reflective surfaces exist, maintain consistent self-supervised learning in various environments where lighting conditions vary, and improve depth estimation accuracy by minimizing lighting and reflection deviations through log-space decomposition and cross-reconstruction loss.

[0009] According to one embodiment of the present disclosure, a computer program stored on a computer-readable storage medium is disclosed. The computer program performs a method for estimating monocular camera depth for an image having a reflective surface, the method comprising: acquiring a reference image and a source image; estimating a relative pose between the reference image and the source image using a pose network; generating a reference feature map from the reference image and generating a source feature map from the source image using a depth encoder; transmitting the reference feature map to a depth decoder to estimate a depth map for the reference image; transmitting the reference feature map and the source feature map to an intrinsic decoder to acquire a reference diffuse image and a reference residual image corresponding to the reference image, and a source diffuse image and a source residual image corresponding to the source image; and calculating a photometric reprojection loss between the reference image and the source image based on the estimated relative pose, the depth map, the reference diffuse image, the reference residual image, the source diffuse image, and the source residual image, and performing learning in a self-supervised manner by updating the parameters of the monocular camera depth estimation model to minimize the photometric reprojection loss.

[0010] Alternatively, the reference diffuse image and the source diffuse image include a component that is maintained regardless of optical change with respect to the same surface of the object in the reference image and the source image, and the reference residual image and the source residual image may include a component that changes due to optical change in the reference image and the source image.

[0011] Alternatively, the photometric reprojection loss can be calculated by measuring the optical difference between the source image and the reprojected image, where the reprojected image is generated by projecting the coordinate system of the reference image onto the coordinate system of the source image using the relative pose between the reference image and the source image and the depth map corresponding to the reference image.

[0012] Alternatively, the step of performing learning in the self-supervised manner may include additionally calculating a diffuse reconstruction loss to maintain a correspondence between the reference diffuse image and the source diffuse image, and a residual reconstruction loss to maintain a correspondence between the reference residual image and the source residual image, and updating the parameters of the monocular camera depth estimation model to minimize the diffuse reconstruction loss and the residual reconstruction loss.

[0013] Alternatively, the step of transmitting the reference feature map and the source feature map to the intrinsic decoder to obtain the reference diffuse image, the reference residual image, the source diffuse image, and the source residual image may include converting each feature map into logarithmic space before it is input to the intrinsic decoder, so that the diffuse image and the residual image are separated and high-frequency components, including highlights or reflected light, are removed.

[0014] Alternatively, in the process of calculating the photometric reprojection loss, dynamically moving objects or abnormal regions within the source image can be excluded using an auto-masking technique so that reprojection errors for static objects are reflected in the learning.
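
The text does not spell out the auto-masking rule. As a hedged illustration only, the sketch below uses one widely used criterion from self-supervised depth estimation (the Monodepth2-style auto-mask), which this disclosure does not confirm it adopts: a pixel is kept only if warping the source image lowers its photometric error compared with using the unwarped source image. The function name `auto_mask` and the toy arrays are hypothetical.

```python
import numpy as np

def auto_mask(reference, source, warped_source):
    """Return a binary mask that is 1 where warping reduces the error.

    All inputs are float arrays of shape (H, W). This is an assumed,
    Monodepth2-style criterion, not the disclosure's exact rule.
    """
    # Per-pixel L1 error after reprojection.
    reproj_err = np.abs(reference - warped_source)
    # Per-pixel L1 error when the source image is used without warping.
    identity_err = np.abs(reference - source)
    # Keep pixels whose appearance is better explained by the warp.
    return (reproj_err < identity_err).astype(np.float32)

reference = np.array([[0.2, 0.8], [0.5, 0.5]])
source = np.array([[0.6, 0.8], [0.1, 0.5]])
warped = np.array([[0.25, 0.3], [0.45, 0.5]])
mask = auto_mask(reference, source, warped)
```

In this toy case the right-hand column is masked out because the unwarped source already matches the reference there, which is the typical signature of an object moving with the camera.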

[0015] Alternatively, the depth encoder may process the reference image and the source image at multiple scales, respectively, and allow the depth decoder and the intrinsic decoder to each generate a multi-scale output from the feature map calculated for each scale.

[0016] Alternatively, the step of performing learning in the self-supervised manner may include additionally performing multi-scale learning through the photometric reprojection loss between the multi-scale outputs.

[0017] Alternatively, the step of performing learning in the self-supervised manner may include using Mahalanobis distance-based filtering to filter out regions where the optical error between the reference image and the source image, between the reference diffuse image and the source diffuse image, and between the reference residual image and the source residual image is above a threshold level.

[0018] Alternatively, the step of performing learning in the self-supervised manner may additionally apply a reflectance consistency loss, which keeps the reflectance of the object surface consistent between the reference diffuse image and the source diffuse image, and

[0019] a shading consistency loss, which constrains the illumination shading or reflected light in the reference residual image and the source residual image to change by a physically plausible amount under the viewpoint transformation, and may update the parameters of the monocular camera depth estimation model so that the reflectance consistency loss and the shading consistency loss are minimized.

[0020] Alternatively, the method may further include the step of calculating a geometric shading loss using normal information estimated from the reference image and the source image to make the reference residual image and the source residual image reflect the correlation between the light source direction and the surface normal.

[0021] Alternatively, the diffuse image and the residual image are decomposed based on Equation 2:

[0022]

[0023] Here, I represents the original image, which is the reference image or the source image; L is the diffuse component, which is an intrinsic component of the object; and R may represent the residual component, which changes due to optical variation.
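
Equation 2 is not reproduced in this text, so the exact decomposition is unknown. The sketch below assumes a multiplicative model, I = L * R, which becomes additive in log space (log I = log L + log R); that model, the helper names, and the `EPS` constant are assumptions for illustration, not the disclosure's definition.

```python
import numpy as np

EPS = 1e-6  # hypothetical constant that avoids log(0)

def log_residual(image, diffuse):
    """Log-space residual under the assumed model I = L * R."""
    return np.log(image + EPS) - np.log(diffuse + EPS)

def recompose(diffuse, log_res):
    """Invert the decomposition: I = L * exp(log R)."""
    return (diffuse + EPS) * np.exp(log_res) - EPS

image = np.array([[0.30, 0.90], [0.10, 0.45]])
diffuse = np.array([[0.30, 0.45], [0.20, 0.45]])
log_res = log_residual(image, diffuse)
restored = recompose(diffuse, log_res)
```

Under this assumption a positive log residual marks a pixel brighter than its diffuse component (for example a highlight) and a negative one marks a darker pixel (for example a shadow).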

[0024] Alternatively, the diffuse reconstruction loss is defined by Equation 3.

[0025] Equation 3:

[0026]

[0027] Here, the quantities appearing in Equation 3 are the reference image, the reference diffuse image, a comparison target image obtained by reprojecting or aligning a reference residual image, and the L1 norm.

[0028] Alternatively, to calculate the photometric reprojection loss, the coordinate system of the reference image and the coordinate system of the source image are aligned based on Equation 4.

[0029] Equation 4:

[0030]

[0031] Here, the quantities appearing in Equation 4 are the reprojected coordinate, the relative pose from the reference image to the source image, the depth value at pixel (u, v) of the reference image, the camera intrinsic parameter matrix K, and a function that projects homogeneous coordinates to 2D.
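
Equation 4 itself is missing from this text. The quantities listed for it match the standard pinhole reprojection used in self-supervised depth estimation, so the sketch below assumes that form: back-project the pixel with the inverse intrinsics and its depth, apply the relative pose, and project back to 2D. The function name, the 4x4 pose layout, and the example values are hypothetical.

```python
import numpy as np

def reproject_pixel(u, v, depth, K, T_ref_to_src):
    """Map pixel (u, v) of the reference image into the source image.

    depth: depth value at (u, v); K: 3x3 intrinsic matrix;
    T_ref_to_src: 4x4 rigid transform holding the relative pose [R|t].
    """
    pixel_h = np.array([u, v, 1.0])
    # Back-project to a 3D point in the reference camera frame.
    point_ref = depth * (np.linalg.inv(K) @ pixel_h)
    # Move the point into the source camera frame.
    point_src = T_ref_to_src[:3, :3] @ point_ref + T_ref_to_src[:3, 3]
    # Apply the intrinsics, then divide by the third coordinate
    # (homogeneous coordinates to 2D).
    proj = K @ point_src
    return proj[:2] / proj[2]

K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
T = np.eye(4)
T[0, 3] = 0.1  # hypothetical pose: point coordinates shift 0.1 along x
uv_src = reproject_pixel(50.0, 50.0, 2.0, K, T)
```

With an identity pose the function returns the input pixel unchanged, which is a quick sanity check for any implementation of this step.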

[0032] Alternatively, the diffuse reconstruction loss and the residual reconstruction loss are integrated into a cross-reconstruction loss according to Equation 5, and the cross-reconstruction loss is used for training.

[0033] Equation 5:

[0034]

[0035] Here, the quantities appearing in Equation 5 are the reference image, the reference residual image, and an image obtained by warping the source diffuse image into the reference coordinate system.
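
Equation 5 is not reproduced in this text. One reading consistent with the listed quantities is that the reference image is rebuilt from the warped source diffuse image and the reference residual image, and the rebuild error is penalized. The sketch below follows that reading under two assumptions that the text does not confirm: a multiplicative model (I = L * R) and a mean L1 penalty. All names are hypothetical.

```python
import numpy as np

def cross_reconstruction_loss(reference, warped_src_diffuse, ref_residual):
    """Mean L1 error of rebuilding the reference image from the warped
    source diffuse image and the reference residual image.

    Assumes the multiplicative model I = L * R (an assumption).
    """
    rebuilt = warped_src_diffuse * ref_residual
    return float(np.mean(np.abs(reference - rebuilt)))

ref_diffuse = np.array([[0.4, 0.5], [0.6, 0.2]])
ref_residual = np.array([[1.0, 1.5], [0.5, 1.0]])
reference = ref_diffuse * ref_residual

# A perfectly warped source diffuse image equals the reference diffuse image.
loss_consistent = cross_reconstruction_loss(reference, ref_diffuse, ref_residual)
# A diffuse image that is off by 0.1 everywhere produces a positive loss.
loss_mismatch = cross_reconstruction_loss(reference, ref_diffuse + 0.1, ref_residual)
```

The loss is zero only when the diffuse component stays the same across views, which is the cross-view consistency the surrounding text describes.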

[0036] Alternatively, a contrast loss that suppresses the variance between the diffuse image and the residual image produced by the intrinsic decoder is defined by Equation 6 and is used in training the monocular camera depth estimation model.

[0037] Equation 6:

[0038]

[0039] Here, the quantities appearing in Equation 6 include the margin; i and j identify images within different batches, b is the batch size, and the norm is the L2 norm.

[0040] Alternatively, in the training phase of the monocular camera depth estimation model, the intrinsic decoder is trained by a loss function defined by Equation 7.

[0041] Equation 7:

[0042]

[0043] Here, the terms of Equation 7 are the diffuse reconstruction loss according to Equation 3, the cross-reconstruction loss according to Equation 5, and the contrast loss according to Equation 6, together with the weight of each term.

[0044] Alternatively, the photometric reprojection loss is defined by Equation 8.

[0045] Equation 8:

[0046]

[0047] Here, the quantities appearing in Equation 8 are the reference image or reference diffuse image; an image obtained by projecting the source image (or source diffuse image) onto the coordinate system of the reference image; M, a mask for excluding dynamically moving objects or abnormal regions; SSIM, the structural similarity; and the weights of the SSIM term and the L1 term.
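
Equation 8 is missing from this text, but its listed ingredients (a mask, an SSIM term, an L1 term, and weights) match the photometric loss commonly used in self-supervised depth estimation. The sketch below is a minimal version under assumptions: a 3x3 mean-filter SSIM, the commonly used weight 0.85 on the SSIM term, and the usual stabilizing constants. None of these values is stated in the disclosure, and all function names are hypothetical.

```python
import numpy as np

def box3(x):
    """3x3 mean filter with edge replication."""
    p = np.pad(x, 1, mode="edge")
    h, w = x.shape
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Per-pixel structural similarity from local means and variances."""
    mu_x, mu_y = box3(x), box3(y)
    var_x = box3(x * x) - mu_x ** 2
    var_y = box3(y * y) - mu_y ** 2
    cov = box3(x * y) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den

def photometric_loss(target, warped, mask, alpha=0.85):
    """Masked blend of an SSIM term and an L1 term (alpha is assumed)."""
    ssim_term = np.clip((1.0 - ssim(target, warped)) / 2.0, 0.0, 1.0)
    l1_term = np.abs(target - warped)
    per_pixel = alpha * ssim_term + (1.0 - alpha) * l1_term
    # Average only over pixels that the mask keeps.
    return float(np.sum(mask * per_pixel) / max(np.sum(mask), 1.0))

target = np.linspace(0.0, 1.0, 16).reshape(4, 4)
warped = target.copy()
warped[0, 0] += 0.5  # simulate a highlight in one corner

mask_all = np.ones((4, 4))
mask_excl = np.ones((4, 4))
mask_excl[0:2, 0:2] = 0.0  # exclude the pixels touched by the highlight

loss_all = photometric_loss(target, warped, mask_all)
loss_masked = photometric_loss(target, warped, mask_excl)
```

Masking the affected corner brings the loss back to zero, which illustrates why excluding abnormal regions keeps a single highlight from driving the depth update.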

[0048] Alternatively, the method further includes the step of identifying non-Lambertian regions in the image based on Equation 9.

[0049] Equation 9:

[0050]

[0051] Here, P represents the photometric reprojection loss of Equation 8. The other quantities appearing in Equation 9 are the photometric error for the reference image and source image pair, the photometric error for the reference diffuse image and source diffuse image pair, the pseudo-diffuse image, the log value of the residual component, and an exponential function. Regions showing a large difference between the compared quantities can be regarded as non-Lambertian regions and excluded from training.

[0052] Alternatively, the Mahalanobis distance-based filtering is performed based on Equation 10.

[0053] Equation 10:

[0054]

[0055] Here, the quantities appearing in Equation 10 include the photometric error defined by Equation 9 and the mean vector of the photometric error. If α is 1, the region is considered a normal region; if α is 0, it can be filtered out as a region where the optical error is above a threshold level.
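
Equation 10 is not reproduced in this text. The sketch below shows a generic Mahalanobis distance filter under assumptions: each pixel carries a small vector of photometric errors, the mean and covariance are estimated over all pixels, and a pixel receives α = 0 when its distance exceeds a threshold. The threshold of 3.0, the regularization term, and the function name are hypothetical choices, not values from the disclosure.

```python
import numpy as np

def mahalanobis_mask(errors, threshold=3.0):
    """errors: (N, D) array, one D-dimensional photometric error vector
    per pixel. Returns alpha of shape (N,): 1 = normal, 0 = filtered."""
    mean = errors.mean(axis=0)
    # Small diagonal term keeps the covariance invertible.
    cov = np.cov(errors, rowvar=False) + 1e-8 * np.eye(errors.shape[1])
    diff = errors - mean
    # Squared Mahalanobis distance of every row.
    d2 = np.einsum("nd,de,ne->n", diff, np.linalg.inv(cov), diff)
    return (np.sqrt(d2) <= threshold).astype(np.float32)

# 100 pixels with small, similar errors plus one pixel with a large error
# in the first component (for example a specular highlight).
idx = np.arange(100)
inliers = np.stack([0.10 + 0.001 * (idx % 10),
                    0.10 + 0.001 * (idx // 10)], axis=1)
outlier = np.array([[0.90, 0.1045]])
errors = np.concatenate([inliers, outlier], axis=0)
alpha = mahalanobis_mask(errors)
```

Because the distance is measured relative to the spread of the errors, the same threshold can be reused across scenes with different overall brightness, which is one reason to prefer it over a fixed absolute cutoff.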

[0056] Alternatively, the step of updating the parameters of the monocular camera depth estimation model to perform learning in a self-supervised manner includes updating the parameters of the monocular camera depth estimation model using a depth estimation loss defined by Equation 11.

[0057] Equation 11:

[0058]

[0059] Here, the quantities appearing in Equation 11 are the Mahalanobis-based mask defined by Equation 10 and the photometric reprojection loss defined by Equation 8.

[0060] Alternatively, the step of performing learning in a self-supervised manner by updating the parameters of the monocular camera depth estimation model comprises updating the parameters of the monocular camera depth estimation model using a final loss function defined by Equation 12.

[0061] Equation 12:

[0062]

[0063] Here, the quantities appearing in Equation 12 are the depth estimation loss defined by Equation 11 and the intrinsic decoder loss defined by Equation 7.
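
Equations 7, 11, and 12 are missing from this text, so their exact forms are unknown. The sketch below only illustrates how the described pieces could fit together, assuming a weighted sum for the intrinsic decoder loss, a mask-weighted average for the depth estimation loss, and a sum of the two for the final loss. Every weight and function name here is hypothetical.

```python
import numpy as np

def intrinsic_loss(l_diffuse, l_cross, l_contrast, w=(1.0, 1.0, 0.1)):
    """Weighted sum in the spirit of Equation 7 (weights are assumed)."""
    return w[0] * l_diffuse + w[1] * l_cross + w[2] * l_contrast

def depth_loss(photometric_per_pixel, alpha):
    """Average the per-pixel photometric loss over pixels kept by the
    Mahalanobis-based mask alpha (assumed reading of Equation 11)."""
    p = np.asarray(photometric_per_pixel, dtype=float)
    a = np.asarray(alpha, dtype=float)
    return float(np.sum(a * p) / max(np.sum(a), 1.0))

def total_loss(l_depth, l_intrinsic, weight=1.0):
    """Final objective as depth loss plus weighted intrinsic loss
    (assumed reading of Equation 12)."""
    return l_depth + weight * l_intrinsic

per_pixel = np.array([0.1, 0.2, 0.9, 0.2])
alpha = np.array([1.0, 1.0, 0.0, 1.0])  # third pixel filtered as an outlier
l_depth = depth_loss(per_pixel, alpha)
l_intr = intrinsic_loss(0.3, 0.2, 0.5)
l_total = total_loss(l_depth, l_intr)
```

The filtered pixel (error 0.9) does not contribute to the depth loss, so a single non-Lambertian region cannot dominate the parameter update.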

[0064] The present invention separates diffuse and residual images through an intrinsic decoder, thereby preventing differences in reflected light or lighting intensity from being excessively reflected in simple pixel differences. Accordingly, the accuracy of depth estimation can be improved even in complex scenes containing mirrors, metal, glass, etc.

[0065] The present invention performs learning by utilizing viewpoint shift and optical decomposition without the need for separate depth labels. As a result, learning is possible even in environments without large-scale labeled data, and the method can respond more robustly to changes in lighting conditions than conventional self-supervised methods.

[0066] Through techniques proposed in this invention, such as log-space transformation, auto-masking, and the reflectance consistency and shading consistency losses, learning proceeds stably even in scenes with severe optical noise. Therefore, the quality of depth learning does not degrade even when indoor lighting changes periodically or sunlight intensity varies in outdoor environments.

[0067] The technology of the present invention can be applied to various fields such as autonomous driving, robotics, AR / VR, and 3D modeling. Even in industrial environments where reflective or transparent objects frequently appear, or in road and outdoor environments where lighting changes rapidly, the present invention enables highly accurate monocular depth estimation while reducing sensor costs.

[0068] FIG. 1 is a block diagram of a computer device according to one embodiment of the present disclosure.

[0069] FIG. 2 is a schematic diagram of a monocular camera depth estimation model of one embodiment of the present disclosure.

[0070] FIG. 3 is a flowchart of a monocular camera depth estimation method according to one embodiment of the present disclosure.

[0071] FIG. 4 illustrates a brief and general schematic diagram of an exemplary computing environment in which embodiments of the present disclosure may be implemented.

[0072] Various embodiments are now described with reference to the drawings. In this specification, various descriptions are provided to aid understanding of the present disclosure. However, it is evident that these embodiments can be practiced without such specific descriptions.

[0073] As used herein, terms such as “component,” “module,” “system,” etc. refer to computer-related entities, hardware, firmware, software, combinations of software and hardware, or executions of software. For example, a component may be, but is not limited to, a procedure executed on a processor, a processor, an object, an execution thread, a program, and / or a computer. For example, both an application executed on a computer device and the computer device itself may be a component. One or more components may reside within a processor and / or an execution thread. A component may be localized within a single computer. A component may be distributed among two or more computers. Additionally, these components may be executed from various computer-readable media having various data structures stored therein. Components may communicate through local and / or remote processes, for example, according to signals having one or more data packets (e.g., data from a component interacting with another component in a local system or distributed system, and / or data transmitted through signals to other systems and networks such as the Internet).

[0074] Furthermore, the term "or" is intended to mean an inclusive "or" rather than an exclusive "or." That is, unless otherwise specified or evident from the context, "X uses A or B" is intended to mean any of the natural inclusive permutations. In other words, if X uses A; if X uses B; or if X uses both A and B, "X uses A or B" may apply to any of these cases. Additionally, the term "and / or" as used herein should be understood to refer to and include all possible combinations of one or more of the enumerated related items.

[0075] Additionally, the terms “comprises” and / or “comprising” should be understood to mean that such features and / or components are present. However, the terms “comprises” and / or “comprising” should be understood not to exclude the presence or addition of one or more other features, components and / or groups thereof. Furthermore, unless otherwise specified or clearly evident from the context to indicate a singular form, the singular in this specification and claims should generally be interpreted to mean “one or more.”

[0076] In addition, the term “at least one of A or B” should be interpreted to mean “a case including only A,” “a case including only B,” or “a combination of A and B.”

[0077] Those skilled in the art should recognize that the various exemplary logical blocks, configurations, modules, circuits, means, logics, and algorithmic steps described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both. To clearly exemplify the interchangeability of hardware and software, various exemplary components, blocks, configurations, means, logics, modules, circuits, and steps have been generally described above in terms of their functionality. Whether such functionality is implemented in hardware or software depends on the specific application and design constraints imposed on the overall system. Skilled technicians may implement the described functionality in various ways for each specific application. However, such decisions regarding implementation should not be construed as going beyond the scope of this disclosure.

[0078] The description of the presented embodiments is provided to enable those skilled in the art to use or practice the present invention. Various modifications to these embodiments will be apparent to those skilled in the art. The general principles defined herein may be applied to other embodiments without departing from the scope of the present disclosure. Thus, the present invention is not limited to the embodiments presented herein. The present invention should be interpreted in the broadest possible scope consistent with the principles and novel features presented herein.

[0079] FIG. 1 is a block diagram of a computer device according to one embodiment of the present disclosure. The configuration of the computer device (100) shown in FIG. 1 is merely a simplified example. In one embodiment of the present disclosure, the computer device (100) may include other configurations for implementing a computing environment, and only some of the disclosed configurations may constitute the computer device (100).

[0080] A computer device (100) may include a processor (110), memory (130), and a network unit (150). The processor (110) may be composed of one or more cores and may include a processor for data analysis and deep learning, such as a central processing unit (CPU), a general-purpose graphics processing unit (GPGPU), or a tensor processing unit (TPU) of the computer device. The processor (110) may read a computer program stored in the memory (130) and perform data processing for machine learning according to one embodiment of the present disclosure. According to one embodiment of the present disclosure, the processor (110) may perform computations for training a neural network, such as processing input data for training in deep learning (DL), extracting features from input data, calculating errors, and updating the weights of the neural network using backpropagation. At least one of the CPU, GPGPU, and TPU of the processor (110) can process the training of a network function. For example, the CPU and GPGPU can together process the training of a network function and data classification using the network function. In addition, in one embodiment of the present disclosure, processors of a plurality of computer devices can be used together to process the training of a network function and data classification using the network function. In addition, a computer program executed in a computer device according to one embodiment of the present disclosure may be a CPU-, GPGPU-, or TPU-executable program.

[0081] According to one embodiment of the present disclosure, the memory (130) can store any form of information generated or determined by the processor (110) and any form of information received by the network unit (150).

[0082] According to one embodiment of the present disclosure, the memory (130) may include at least one type of storage medium among a flash memory type, a hard disk type, a multimedia card micro type, a card type memory (e.g., SD or XD memory), RAM (Random Access Memory), SRAM (Static Random Access Memory), ROM (Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), PROM (Programmable Read-Only Memory), magnetic memory, a magnetic disk, and an optical disk. The computer device (100) may operate in conjunction with web storage that performs the storage function of the memory (130) on the internet. The description of the memory described above is merely an example and the present disclosure is not limited thereto.

[0083] A network unit (150) according to one embodiment of the present disclosure can use various wired communication systems such as a public switched telephone network (PSTN), xDSL (x Digital Subscriber Line), RADSL (Rate Adaptive DSL), MDSL (Multi Rate DSL), VDSL (Very High Speed DSL), UADSL (Universal Asymmetric DSL), HDSL (High Bit Rate DSL), and a local area network (LAN).

[0084] In addition, the network unit (150) presented in this specification may use various wireless communication systems such as CDMA (Code Division Multiple Access), TDMA (Time Division Multiple Access), FDMA (Frequency Division Multiple Access), OFDMA (Orthogonal Frequency Division Multiple Access), SC-FDMA (Single Carrier-FDMA), and other systems.

[0085] In the present disclosure, the network unit (150) can be configured regardless of the communication mode, whether wired or wireless, and can be configured as various communication networks, such as a Personal Area Network (PAN) or a Wide Area Network (WAN). In addition, the network may be the well-known World Wide Web (WWW) and may utilize wireless transmission technology used for short-range communication, such as Infrared Data Association (IrDA) or Bluetooth.

[0086] Through the network unit (150) of the present disclosure, the computer device (100) can communicate with other computer devices, etc., for example, a data storage where data is stored, a cloud data storage, a cloud computer system for using computing power, etc. The technologies described in this specification can be used not only in the networks mentioned above but also in other networks.

[0087] Throughout this specification, the terms computational model, neural network, and network function may be used interchangeably. A neural network may consist of a set of interconnected computational units, which may generally be referred to as nodes. These nodes may also be referred to as neurons. A neural network is composed of at least one node. The nodes (or neurons) constituting the neural networks may be interconnected by one or more links.

[0088] In a neural network, one or more nodes connected via links can form relative input and output node relationships. The concepts of input and output nodes are relative; any node in an output node relationship with respect to one node may be in an input node relationship with respect to another node, and vice versa. As described above, the input node versus output node relationship can be generated based on links. One or more output nodes may be connected to a single input node via links, and vice versa.

[0089] In a relationship between an input node and an output node connected through a single link, the value of the output node's data can be determined based on the data input to the input node. Here, the link interconnecting the input node and the output node may have a weight. The weight can be variable and can be varied by the user or an algorithm to enable the neural network to perform the desired function. For example, if one or more input nodes are interconnected to a single output node by respective links, the output node's value can be determined based on the values input to the input nodes connected to the output node and the weights set on the links corresponding to each input node.
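
As a minimal sketch of the relationship described above, the function below computes an output node's value as the weighted sum of the values at its input nodes. The weighted-sum form is the usual choice and is assumed here, since the text only says the value is determined from the inputs and the link weights; the function name is hypothetical.

```python
def node_output(inputs, weights):
    """Value of an output node as the weighted sum of its input nodes."""
    if len(inputs) != len(weights):
        raise ValueError("each input node needs exactly one link weight")
    return sum(x * w for x, w in zip(inputs, weights))

# Three input nodes connected to one output node by three weighted links.
value = node_output([1.0, 2.0, 3.0], [0.5, -1.0, 2.0])
```

Training adjusts the link weights, so the same inputs produce a different output value as learning proceeds.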

[0090] FIG. 2 is a schematic diagram of a monocular camera depth estimation model of one embodiment of the present disclosure.

[0091] As illustrated in the drawing, the computer device (100) receives a reference image (10) and a source image (20) as inputs, respectively. The reference image (10) is an image used as a reference for estimating a depth map in the present invention, and the source image (20) refers to one of two consecutive image frames with different viewpoints. The computer device processes the two images to perform self-supervised depth estimation in a monocular camera environment.

[0092] The computer device (100) first extracts features of a reference image (10) through a depth encoder (210). The depth encoder (210) is implemented as a convolutional neural network (CNN) or the like, and compresses and abstracts pixel information within the image in multiple stages. Additionally, the computer device estimates the relative pose ([R|t]) between the reference image (10) and the source image (20) using a pose network (220). The pose network (220) performs the role of inferring the three-dimensional geometric relationship (rotation and translation) between the two images.

[0093] The computer device (100) transmits a feature map of a reference image (10) to a depth decoder (230) to generate a depth map (30) for the reference image. The depth decoder (230) predicts the depth value of each location by decoding the features extracted from the encoder (210) back into high-resolution pixel unit information. As a result, a reference depth map (30) in the form of a single channel (depth channel) corresponding to the reference image (10) is produced.

[0094] The computer device (100) simultaneously performs intrinsic decomposition through an intrinsic decoder (240). The intrinsic decoder (240) receives the feature map of the reference image (10) and the feature map of the source image (20) as inputs, respectively, and separates the object's intrinsic reflection component and the residual component caused by lighting. Specifically, the intrinsic decoder (240) generates an intrinsic image (300), which is largely composed of a reference diffuse image (310), a reference residual image (320), a source diffuse image (330), and a source residual image (340).

[0095] The reference diffuse image (310) contains the part corresponding to the object's intrinsic reflectance (albedo) of the reference image (10). It represents the color and reflectance components that must remain constant even when the lighting changes on the same object surface.

[0096] The reference residual image (320) represents optical elements of the reference image (10) that are highly variable, such as lighting, shadows, and highlights.

[0097] The source diffuse image (330) is the result of separating the object reflectance from the source image (20) in the same way.

[0098] The source residual image (340) refers to the residual components, such as light and shadow, of the source image (20).

[0099] The computer device combines the reference depth map (30) and the intrinsic image (300) generated by the intrinsic decoder (240) to perform self-supervised learning. For example, after aligning the reference and source images through a reprojection (warping) process, it calculates the photometric reprojection loss and suppresses learning errors occurring in areas with significant lighting changes or reflections by referencing the diffuse and residual information separated in the intrinsic image (300).

[0100] FIG. 3 is a flowchart of a monocular camera depth estimation method according to one embodiment of the present disclosure.

[0101] A computer device (100) can acquire a reference image and a source image (S100). The computer device can acquire images captured at regular time intervals through a single camera (e.g., a smartphone camera, a webcam, etc.). For example, the first frame or a frame at a specific point in time can be set as the reference image (10), and a frame at a subsequent point in time can be set as the source image (20). At this time, the computer device can record internal and external camera parameters (focal length, angle of view, position, attitude, etc.) together or estimate them based on the environment.

[0102] The computer device (100) can perform preprocessing on images as needed. Preprocessing may include resolution adjustment and normalization, color and brightness normalization, storage and metadata management, etc.

[0103] The reference image (10) serves as a reference frame, and a depth map (30) is estimated for this image. The computer device calculates depth information through the depth encoder (210) and depth decoder (230) using this reference image.

[0104] The source image (20) is used to calculate the photometric error by comparing (reprojecting) it with the reference image. The pose network (220) estimates the relative pose ([R∣t]) between the reference and source images, and the intrinsic decoder (240) also generates a diffuse / residual image for the source image.

[0105] When acquiring monocular camera images in various locations, such as outdoor scenes (outdoor roads, buildings) or indoor environments (homes, offices), significant differences in lighting intensity or reflectivity occur. Since the present invention corrects these lighting changes by using an intrinsic decomposition technique, high accuracy is maintained during the self-supervised learning process even if the reference image (10) and the source image (20) are captured under different lighting conditions.

[0106] The computer device (100) can estimate the relative pose between the reference image and the source image using a pose network (S200).

[0107] Generally, the pose network (220) extracts features from pairs (reference image, source image) using, for example, a CNN structure, and calculates relative rotation (R) and relative translation (t) parameters. The output of the pose network is in the form of (3D rotation vector, 3D translation vector) or is converted into a 3×3 rotation matrix using the Rodrigues formula, etc.
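
By way of non-limiting illustration, the conversion of a 3D rotation vector into a 3×3 rotation matrix with the Rodrigues formula, and its assembly with a translation vector into [R|t], can be sketched as follows. The function names `rodrigues` and `pose_matrix` are hypothetical and are not taken from the disclosure; the axis-angle convention (vector direction as axis, vector norm as angle in radians) is an assumption.

```python
import numpy as np

def rodrigues(rvec):
    """Convert a 3D rotation vector (axis * angle, radians) to a 3x3 rotation matrix."""
    rvec = np.asarray(rvec, dtype=float)
    theta = np.linalg.norm(rvec)
    if theta < 1e-12:
        return np.eye(3)  # negligible rotation
    k = rvec / theta  # unit rotation axis
    # Skew-symmetric cross-product matrix of the axis
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    # Rodrigues formula: R = I + sin(theta) K + (1 - cos(theta)) K^2
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def pose_matrix(rvec, tvec):
    """Assemble the 4x4 homogeneous transform corresponding to [R|t]."""
    T = np.eye(4)
    T[:3, :3] = rodrigues(rvec)
    T[:3, 3] = np.asarray(tvec, dtype=float)
    return T
```

For example, `pose_matrix([0.0, 0.0, np.pi / 2], [1.0, 0.0, 0.0])` describes a 90-degree rotation about the z axis combined with a unit translation along x.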

[0108] In the present invention, the parameters of the pose network (220) are optimized to minimize photometric reprojection loss during self-supervised learning. That is, since the correct rotation and translation must be predicted to accurately align (warp) the reference image and the source image, the pose network (220) consequently estimates the relative pose more precisely.

[0109] The relative pose may include a rotation, which indicates the angle by which the camera has rotated and can be expressed, for example, as a rotation vector with respect to the x, y, and z axes or as a quaternion, and a translation (3D movement), which is a value indicating how the camera has moved in space (forward, backward, lateral movement, etc.).

[0110] The computer device (100) can generate a reference feature map from the reference image and generate a source feature map from the source image using a depth encoder (S300).

[0111] The depth encoder (210) can be implemented as a convolutional neural network (CNN) or a vision transformer to extract low-dimensional representations within an image.

[0112] Some implementations extract feature maps at multiple stages for each scale, supporting subsequent stages (depth decoder, intrinsic decoder) to merge them at multiple scales. For example, 4 to 5 scales can be used to gradually reduce the resolution while enriching the features.

[0113] The computer device inputs a reference image (10) into a depth encoder (210). The encoder (210) outputs a reference feature map in the form of a multidimensional tensor through its convolutional layers. The feature map condenses the main boundaries, textures, and object contour information of the reference image (10), and subsequent steps (depth decoder (230), intrinsic decoder (240), etc.) can decode it to infer the depth and intrinsic images.

[0114] The computer device inputs the source image (20) into the depth encoder (210). A source feature map is generated by performing a convolution operation with the same network structure (or shared parameters). The source feature map reflects the viewpoint characteristics of the source image (20) and is used by the pose network (220) or the intrinsic decoder (240) to precisely interpret the viewpoint difference and reflection / lighting components.

[0115] The computer device (100) can transmit the reference feature map to a depth decoder to estimate a depth map for the reference image (S400).

[0116] The computer device (100) transmits the reference feature map obtained through the depth encoder in the previous step to the depth decoder. The reference feature map is a compressed form of low-dimensional representations (borders, textures, structural information of objects, etc.) inherent in the reference image, and the computer device (100) inputs this data so that the depth decoder can interpret it.

[0117] A depth decoder generally restores a reference feature map to a size similar to the original resolution through upsampling operations (such as deconvolution or interpolation-based upsampling). This enables the computer device (100) to predict depth values in pixel units. Depending on the implementation, the depth decoder may establish skip connections with intermediate features extracted at each stage of the encoder (similar to a U-Net structure). In this case, the computer device (100) supplies the depth decoder with not only low-resolution features but also medium- and high-resolution features, thereby accurately restoring depth boundaries or the depth of small objects. An activation function such as sigmoid or ReLU may be used in the final layer of the decoder. The computer device (100) finally converts the restored tensor into a depth map in the form of a single channel. The depth map values can be expressed in actual meters (m) or a normalized range (e.g., a 0 to 1 scale). To predict the actual scene depth, predefined depth ranges, scale factors, etc., can be applied.
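
As one non-limiting illustration of applying a predefined depth range to a sigmoid output, the sketch below maps the decoder's final-layer values to depth through a bounded inverse depth (disparity). This particular mapping is a common choice in self-supervised depth estimation and is an assumption here, not a formula stated in the disclosure; `logits_to_depth`, `min_depth`, and `max_depth` are hypothetical names and values.

```python
import numpy as np

def sigmoid(x):
    """Logistic function mapping real values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def logits_to_depth(logits, min_depth=0.1, max_depth=100.0):
    """Map raw decoder outputs to depth values inside [min_depth, max_depth].

    The sigmoid output is interpreted as a normalized inverse depth, so
    larger outputs correspond to closer surfaces (assumed convention).
    """
    s = sigmoid(np.asarray(logits, dtype=float))
    min_disp = 1.0 / max_depth  # inverse depth of the farthest allowed point
    max_disp = 1.0 / min_depth  # inverse depth of the nearest allowed point
    disp = min_disp + (max_disp - min_disp) * s
    return 1.0 / disp
```

Because the output is bounded by construction, no separate clipping step is needed for this mapping; a metric scale factor, if available, could be multiplied onto the result.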

[0118] The computer device (100) interprets the output tensor as a depth value (perspective information) corresponding to each pixel of the reference image. For example, the distance can be set closer as the pixel is brighter and farther as the pixel is darker (or vice versa). If necessary, the computer device (100) performs masking or clipping if the predicted depth value exceeds a certain range or if there are incorrect areas. During the self-supervised learning process, an auto-masking technique is used to improve the quality of the depth map by excluding dynamically moving objects or optical outlier areas.

[0119] The generated depth map (30) is compared with the source image (20) during the reprojection (warping) process to measure the photometric reprojection loss. The computer device (100) uses this loss for backpropagation to update the parameters of the depth encoder and decoder (and pose network, intrinsic decoder, etc.).

[0120] The computer device (100) can transmit the reference feature map and the source feature map to an intrinsic decoder to obtain a reference diffuse image and a reference residual image corresponding to the reference image, and obtain a source diffuse image and a source residual image corresponding to the source image (S500).

[0121] The computer device uses an intrinsic decoder to separate the object's intrinsic reflection component (albedo) and the residual component (shading and reflections) caused by lighting from an input image (or feature map). In this invention, by performing intrinsic decomposition simultaneously with monocular camera depth estimation, accurate depth learning is induced even in scenes with severe lighting or reflections.

[0122] The computer device inputs a reference feature map and a source feature map, respectively, to an intrinsic decoder. The intrinsic decoder can process the two inputs in parallel or process them in a branched manner within a single shared encoder-decoder structure. Although the module configuration may vary depending on the implementation, the present invention is not limited to a specific structure.

[0123] The computer device estimates the object's intrinsic reflectance (albedo) through a CNN (or Transformer)-based decoder that takes a feature map as input. As a result, a reference diffuse image (310) reflecting only the albedo of the reference image and a source diffuse image (330) reflecting only the albedo of the source image are obtained. In actual implementation, color information composed of three RGB channels is restored, or an 'intrinsic color map' that is not affected by lighting is learned.

[0124] In another branch within the intrinsic decoder, components that vary depending on the environment, such as lighting, shadows, and reflections, are extracted. This generates a reference residual image (320) and a source residual image (340). In residual images, highlights or shadows are often captured as the main characteristics. If necessary, correction techniques such as log-space transformation and normalization are used together to separate non-Lambertian reflections more stably (in one example, specular gloss is handled effectively using log-space-based decomposition).

[0125] The computer device acquires the diffuse-residual channel corresponding to the reference image (10) among the outputs of the intrinsic decoder as the reference diffuse image (310) and the reference residual image (320).

[0126] Diffuse (310) is the intrinsic reflectance (color) of the object surface, and residual (320) contains variable optical components such as light, shadow, or highlight.

[0127] The computer device separates the diffuse and residual components of the source image (20) by the same process to obtain the source diffuse image (330) and the source residual image (340).

[0128] This ensures that reflections and shadows are consistently separated even for images from different viewpoints, serving as a basis for correcting lighting differences in subsequent steps (reprojection loss calculation, reflectance and shadow consistency loss, etc.).

[0129] The computer device (100) can perform self-supervised learning by calculating a photometric reprojection loss between the reference image and the source image based on the estimated relative pose, the depth map, the reference diffuse image, the reference residual image, the source diffuse image, and the source residual image, and updating the parameters of the monocular camera depth estimation model to minimize the photometric reprojection loss (S600).

[0130] The computer device processes the reference and source images to estimate the depth map and the intrinsic (diffuse / residual) images, respectively. It also acquires the relative pose between the two images through the pose network. In this step, based on this information, the reference and source images are reprojected (warped) to measure pixel-level differences, and the entire model is trained to minimize these measured differences (the photometric reprojection loss). This training method is self-supervised and learns accurate depth from viewpoint differences and optical decomposition without using additional labels (such as ground-truth depth maps).

[0131] Calculate Photometric Reprojection Loss

[0132] The computer device uses the relative pose obtained from the pose network to align (warp) the coordinate system of the reference image with the coordinate system of the source image. At this time, by referencing the depth value (depth map) of each pixel, it estimates where the corresponding pixel is located in 3D space. Subsequently, this 3D point is converted to the source image coordinate system to find which pixel location corresponds to that point in the source image.

[0133] The computer device (100) compares the reprojected reference image with the actual source image to calculate the difference in brightness (or color) at the pixel level. In the present invention, by considering the reference diffuse and residual images together with the source diffuse and residual images, not only is pixel brightness matched, but differences in lighting and reflected light are also corrected. For example, the consistency of the object's inherent color (reflectance) is checked by comparing the diffuse images, and the residual images are used to check that optical changes such as shadows and highlights vary in a physically plausible way.

[0134] Correction techniques such as auto-masking

[0135] Since dynamic objects or areas of extreme reflections can interfere with learning, the computer device generates automatic masks to exclude errors in these areas. This enables learning of depth and pose based on a fixed background, such as a floor or wall, thereby producing stable results.

[0136] Update monocular camera depth estimation model parameters

[0137] The entire monocular camera depth estimation model consists of a structure including a depth encoder, a depth decoder, a pose network, and an intrinsic decoder. By jointly learning the parameters of these modules, the computer device ensures that depth estimation and optical decomposition remain consistent even in scenes with lighting variations.

[0138] Model parameters are updated by applying the backpropagation technique based on the calculated Photometric Reprojection Loss. At this time, diffuse and residual decomposition errors are also partially reflected, minimizing inaccurate depth estimations caused by reflections or shading issues.

[0139] The invention forms learning objectives using only viewpoint transformation and optical decomposition, without externally provided depth labels. As a result, large-scale labeling is not required, and the method can be easily scaled to various environments (indoor and outdoor, and under diverse lighting conditions).

[0140] A diffuse image may be an image that represents the intrinsic reflectance (albedo) of an object. In the present invention, a computer device separates input images (reference and source) using an intrinsic decoder, etc. Among the results, the diffuse image represents the intrinsic color of the object's surface and reflects the reflectance (albedo) of the object in an ideal Lambertian reflection model. Since the same object surface must maintain a similar color even with changes in lighting or shadows, the diffuse image may include a component that is maintained regardless of optical changes.

[0141] The reference image (10) and the source image (20) may differ in viewpoint and in the ambient lighting in which they were captured. However, the object's intrinsic color (diffuse component) must be essentially the same regardless of differences in light source or viewing angle. Accordingly, it is ideal for the reference diffuse image and the source diffuse image to remain unchanged on the same object surface.

[0142] Residual images may contain components that change due to optical variations in the image. Unlike the intrinsic color of an object's surface, residual images capture parts that change due to ambient light sources, reflected light, shadows, etc. For example, highlights reflected on a metal surface or reflected light generated under bright lighting correspond to residual components. By separating and learning these components separately, the present invention ensures that pixel value fluctuations caused by lighting differences are not considered errors in depth estimation.

[0143] Optical reflection or shadow patterns may vary depending on the time and angle at which the reference image (10) is taken and the time and angle at which the source image (20) is taken. As a result, the reference residual image and the source residual image are expressed differently depending on the angle at which light strikes or changes in light intensity, even on the same surface. Therefore, the residual images may contain components that change due to optical changes.

[0144] This invention processes light and shadow (residual) into different channels while maintaining the color (diffuse) of the same object surface through diffuse and residual separation. This corrects for changes in pixel intensity caused by differences in lighting, reflection, and shadow, preventing confusion during the depth estimation learning process.

[0145] Although optical reflection changes significantly in the presence of metallic surfaces, mirror surfaces, etc., the present invention reflects this effect in the residual image. As a result, the intrinsic reflectance (diffuse component) of the object surface remains relatively stable, and the depth estimation results do not fluctuate even when lighting or reflected light changes.

[0146] Separating diffuse and residual images according to one embodiment of the present disclosure has the effect of reducing illumination deviation when calculating reprojection loss. This enables stable learning using viewpoint difference and optical decomposition even without labels (depth maps).

[0147] The following describes the method for calculating pixel reprojection loss through reprojection (warping).

[0148] The computer device obtains the relative pose (rotation and translation) estimated through a pose network and a depth map for the reference image calculated through a depth decoder. Using this information, each pixel of the reference image is repositioned (warped) to the coordinate system of the source image. The process proceeds as follows.

[0149] Reference Pixel Coordinate Transformation: Each pixel location in the reference image is restored to a point in 3D space using the depth value provided by the depth map. The computer device projects this point into the source image coordinate system by applying the estimated relative pose. Based on the projection result, it finds which pixel in the source image corresponds to the point.

[0150] Reprojection Image Generation: The computer device constructs a reprojection image by mapping the pixel values of the reference image to the source coordinate system. This reprojection image is the result of transforming the reference image into the source viewpoint, and thus is placed in the same coordinate system as the source image.
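
The two steps above can be sketched, by way of non-limiting example, as follows. The sketch assumes a pinhole camera with a known 3×3 intrinsic matrix `K` shared by both frames and a 4×4 transform `T_ref_to_src` built from the estimated relative pose; both names, like `project_to_source` and `sample_nearest`, are hypothetical. For simplicity it samples the source image at the projected coordinates with nearest-neighbor lookup (inverse warping, a common implementation choice), whereas the text describes mapping reference pixel values into the source coordinate system; a trainable system would typically use differentiable bilinear sampling.

```python
import numpy as np

def project_to_source(depth, K, T_ref_to_src):
    """Return (H, W, 2) source-pixel coordinates (u, v) for every reference pixel."""
    H, W = depth.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(float)
    # Back-project each pixel to 3D: X = depth * K^-1 * [u, v, 1]^T
    cam = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])
    # Move the 3D points into the source camera frame
    src = (T_ref_to_src @ cam_h)[:3]
    # Perspective projection with the same intrinsics
    proj = K @ src
    uv = proj[:2] / np.clip(proj[2:3], 1e-6, None)
    return uv.T.reshape(H, W, 2)

def sample_nearest(image, uv):
    """Sample `image` at (u, v); out-of-bounds pixels are set to 0 and flagged invalid."""
    H, W = image.shape[:2]
    u = np.rint(uv[..., 0]).astype(int)
    v = np.rint(uv[..., 1]).astype(int)
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    out = np.zeros(uv.shape[:2] + image.shape[2:], dtype=image.dtype)
    out[valid] = image[v[valid], u[valid]]
    return out, valid
```

The returned `valid` mask marks pixels whose projection falls inside the source image; pixels outside it would normally be excluded from the loss.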

[0151] The computer device measures the difference in color (or brightness) at the pixel level by comparing the reprojected image, transformed into the source coordinate system, with the actual source image. This difference is defined as the photometric reprojection loss and quantitatively indicates how well the two images are aligned.

[0152] The computer device can use various measures such as the simple absolute difference, the squared difference, and structural similarity (SSIM). With any of these measures, pixels that are misaligned due to incorrect depth values or an inaccurate relative pose produce large errors.
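
As a non-limiting sketch of such measures, the following computes a per-pixel error that blends an absolute difference with an SSIM-based term for single-channel images with values in [0, 1]. The 3×3 averaging window, the SSIM constants, and the blend weight `alpha=0.85` are assumptions borrowed from common practice rather than values given in the disclosure; the function names are hypothetical.

```python
import numpy as np

def box3(x):
    """3x3 mean filter with edge (replicate) padding on a 2D array."""
    p = np.pad(x, 1, mode="edge")
    H, W = x.shape
    return sum(p[i:i + H, j:j + W] for i in range(3) for j in range(3)) / 9.0

def ssim_map(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """Per-pixel structural similarity using local 3x3 statistics."""
    mu_a, mu_b = box3(a), box3(b)
    var_a = box3(a * a) - mu_a ** 2
    var_b = box3(b * b) - mu_b ** 2
    cov = box3(a * b) - mu_a * mu_b
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return num / den

def photometric_error(a, b, alpha=0.85):
    """Blend of SSIM dissimilarity and absolute difference, per pixel."""
    l1 = np.abs(a - b)
    dssim = np.clip((1.0 - ssim_map(a, b)) / 2.0, 0.0, 1.0)
    return alpha * dssim + (1.0 - alpha) * l1
```

Setting `alpha=0.0` reduces the measure to the plain absolute difference; replacing `l1` with `(a - b) ** 2` would give the squared-difference variant.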

[0153] Areas with moving objects in the source image or severe distortion caused by non-Lambertian surfaces (highly reflective surfaces) can be excluded from the reprojection loss calculation by applying automatic masks. This allows the computer device to focus depth and pose learning primarily on the stable background (static areas).

[0154] When a reference image and a source image are aligned, if model parameters (depth encoder, decoder, pose network, etc.) are adjusted so that the pixel difference is minimized, the correct 3D structure and camera movement are naturally learned. This forms the basis of a self-supervised method that can learn even without labels (depth maps). By considering diffuse and residual images separated by an intrinsic decoder together, the present invention can correct pixel differences caused by changes in optical lighting or reflected light. As a result, lighting deviations that are difficult to explain by simple pixel differences alone are processed as residuals, and reprojection loss is concentrated on differences in the actual object structure.

[0155] The following describes self-supervised learning through diffuse and residual reconstruction loss.

[0156] The computer device compares the reference diffuse image (object intrinsic reflectance) generated by the intrinsic decoder with the source diffuse image to ensure that the same object surface maintains similar color and reflectance despite differences in viewpoint. The error resulting from this comparison process is called diffuse reconstruction loss, and the system learns to reduce unnecessary differences in the actual object color (albedo).

[0157] Computer devices assume that even if reference and source diffuse images are captured at different points in time, if they are of the same object surface, they must have the same (or similar) color. Therefore, color and brightness deviations in the diffuse channel are limited so that they are not excessive even when the viewpoint changes, and this ensures that the object's intrinsic reflectance is consistently expressed throughout the scene. The diffuse reconstruction loss minimizes only the difference in intrinsic reflectance (color), rather than lighting deviations such as shadows or reflected light. As a result, the present invention distributes color differences caused by optical factors (lighting and reflection) to a separate residual channel, thereby stably maintaining the object color in the diffuse image.

[0158] The computer device compares the reference residual image with the source residual image to learn how optical variations, such as lighting, shadows, and reflections, are physically and naturally transformed. This error is called residual reconstruction loss, and it adjusts changes in shading and reflections that occur during viewpoint shifts to within a reasonable range.

[0159] Residual images are constrained so that shading or reflection patterns do not change arbitrarily when the same object is viewed from different viewpoints. For example, even if highlights on a metallic surface shift with changes in viewpoint, this change is naturally represented through the residual image. The computer device learns both losses (diffuse and residual) together so that the object's intrinsic color is preserved in diffuse images, while the residual image fluctuates in a consistent pattern even when lighting and reflections change. This separates optical variations from the object's intrinsic color, suppressing unnecessary color deviations that negatively affect depth estimation.

[0160] The computer device introduces diffuse reconstruction loss and residual reconstruction loss as additional loss terms in addition to the existing Photometric Reprojection Loss. These additional terms improve the limitations of monocular depth estimation, which was vulnerable to lighting and reflections, by forcing the consistent representation of object-specific reflections and optical deviations. During the backpropagation process, diffuse and residual reconstruction losses influence the intrinsic decoder (diffuse branch, residual branch), depth encoder / decoder, and pose network. As a result, the monocular camera depth estimation model of the present invention converges more accurately and stably, even in lighting and reflection situations.
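
One possible way of combining the three terms is sketched below as a non-limiting example. The disclosure does not state the exact form of the reconstruction losses or their weights, so the absolute-difference form, the weights `w_diffuse` and `w_residual`, and all function names are assumptions. Inputs are single-channel arrays already aligned to a common pixel grid by the warping step, and `mask` is a boolean array of pixels kept after masking.

```python
import numpy as np

def masked_mean(err, mask):
    """Mean of `err` over pixels where `mask` is True (0.0 if the mask is empty)."""
    n = mask.sum()
    return float((err * mask).sum() / n) if n > 0 else 0.0

def total_loss(photo_err, diffuse_ref_warped, diffuse_src,
               residual_ref_warped, residual_src, mask,
               w_diffuse=0.1, w_residual=0.1):
    """Photometric term plus weighted diffuse and residual reconstruction terms."""
    l_photo = masked_mean(photo_err, mask)
    # Diffuse term: the same surface should keep a similar intrinsic color
    l_diffuse = masked_mean(np.abs(diffuse_ref_warped - diffuse_src), mask)
    # Residual term: lighting and reflection changes should stay within a reasonable range
    l_residual = masked_mean(np.abs(residual_ref_warped - residual_src), mask)
    return l_photo + w_diffuse * l_diffuse + w_residual * l_residual
```

In a trainable implementation the same expression would be written in an automatic-differentiation framework so that its gradient reaches the intrinsic decoder, the depth encoder and decoder, and the pose network.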

[0161] The present invention distinguishes and calculates the reconstruction error between reference and source diffuse images and the reconstruction error between reference and source residual images, thereby ensuring that the object's inherent color (diffuse) is maintained regardless of lighting changes or reflected light, and inducing the reasonable representation of optical variations (residual), such as shading and reflection.

[0162] The following describes the separation of diffuse and residuals through log space transformation.

[0163] The computer device converts the reference feature map and source feature map into log space before directly inputting them into the intrinsic decoder. This is to stably handle high-frequency components with very high pixel intensity, such as highlights or strong reflections.

[0164] Areas where highlights or reflections occur are problematic because they cause very large differences in pixel values in the original image (or feature map) or drive specific channels toward saturation. Converting to logarithmic space allows processing based on the ratio of large to small values, thereby mitigating extreme intensity differences.

[0165] Generally, the observed intensity of an object is interpreted as being determined by 'diffuse (inherent reflection) × illumination (residual)'. In logarithmic space, since multiplication is converted into an addition relationship, it becomes easier to separate diffuse (a specific channel upon log transformation) and residual (another channel upon log transformation).
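
The relationship described above can be written out as follows, where the symbols are introduced here only for illustration: $I(p)$ is the observed intensity at pixel $p$, $D(p)$ the diffuse component, and $R(p)$ the residual (illumination) component, all assumed strictly positive.

```latex
\begin{align}
I(p) &= D(p) \cdot R(p) \\
\log I(p) &= \log D(p) + \log R(p)
\end{align}
```

Under this assumed model, a decoder that outputs two additive channels in log space is in effect estimating $\log D$ and $\log R$, and the components can be recovered by exponentiation.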

[0166] The computer device obtains reference feature maps and source feature maps that have already been extracted through a depth encoder, etc. These feature maps undergo a multi-stage convolution and activation process to form tensors containing important information about the reference and source images. The computer device applies a logarithmic function to the elements (pixel or channel values) of each feature map to reconfigure the value range. At this time, for negative or near-zero values, an offset can be added in advance or a stabilization value can be applied so that the logarithm is defined. After logarithmic transformation, high-intensity reflected light regions are also mitigated, ensuring that the intrinsic decoder is not affected by excessive differences in optical components.
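
A minimal, non-limiting sketch of such a stabilized logarithmic transformation is given below. The choice of offset (shifting by the most negative value plus a small `eps`) is an assumption, as are the function names; the disclosure only states that an offset or stabilization value may be applied.

```python
import numpy as np

def to_log_space(features, eps=1e-6):
    """Shift values so they are strictly positive, then take the logarithm.

    Returns the log-transformed array and the offset that was added,
    so the transformation can be inverted later.
    """
    f = np.asarray(features, dtype=float)
    offset = max(0.0, -float(f.min())) + eps
    return np.log(f + offset), offset

def from_log_space(log_features, offset):
    """Invert `to_log_space` given the stored offset."""
    return np.exp(log_features) - offset
```

For an input whose values span from -1 to 1000, the transformed values span only about 21 units, which illustrates how extreme highlight intensities are compressed before the intrinsic decoder sees them.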

[0167] The reference feature map and source feature map, converted to log space, are divided into two (or multiple) channels corresponding to a 'diffuse image' (e.g., object intrinsic color) and a 'residual image' (illumination / reflective light) in an intrinsic decoder. Through this process, the present invention concentrates high-frequency components, including highlights or reflected light, onto one axis (residual), while maintaining the diffuse as a relatively stable and smooth channel.

[0168] The computer device inputs the log-transformed feature map into an intrinsic decoder to acquire diffuse and residual images corresponding to the reference image, and acquires diffuse and residual images corresponding to the source image. The diffuse image reflects the object's intrinsic color regardless of optical variations (illumination and reflection), while the residual image captures environmental factors such as illumination and reflection.

[0169] Since highlights and strong reflections are separated while softened by logarithmic transformation, the residual image accommodates high-frequency components more stably. As a result, unnecessary and excessive brightness variations (specular and reflected light) are eliminated in the diffuse channel, and a uniform color distribution on the object surface is maintained.

[0170] In the self-supervised depth estimation of the present invention, areas with significant optical deviations do not interfere with the learning process, and errors caused by lighting differences or reflected light are reduced when calculating pixel reprojection loss. Ultimately, through the log space transformation step, intrinsic decomposition and depth estimation operate more stably even in scenes with large reflection and lighting deviations.

[0171] The following describes dynamic object exclusion through auto-masking.

[0172] When a computer device learns monocular depth estimation in a self-supervised manner, it reprojects the reference and source images to update model parameters in a way that minimizes optical differences between pixels. However, if there are dynamically moving objects within the images, they actually appear in different positions and forms in the two images and move differently from the fixed background (static environment). Consequently, during the reprojection process, dynamic objects cause significant errors due to misalignment between the reference and source images, thereby distorting the learning process.

[0173] Computer devices determine areas where pixel differences between reference and source images are excessive or where inconsistencies that cannot be explained by viewpoint shifts occur as areas subject to auto-masking. Typical examples include cases where the positions of dynamic vehicles, people, or animals change significantly between frames.

[0174] In addition, areas with abnormal lighting or excessively strong reflections can also be targets for auto-masking. Since these areas interfere with the model learning the correct depth-pose relationship, the corresponding pixels are ignored (masked) when calculating the reprojection loss. Masking techniques can automatically configure the process by checking pixel differences above a certain threshold or by identifying points of inconsistency across multiple frames.

[0175] Through auto-masking, the computer device incorporates only the reprojection errors of static scenes (background walls, floors, fixed objects, etc.) into the learning process. Because only the parts explainable by viewpoint shifts are considered, the influence of dynamic objects, which cause significant errors, is reduced.

[0176] Regions where dynamic objects are excluded reflect the camera's ego-motion more purely. Therefore, the risk of the model falling into incorrect pose and depth inference early in training is reduced, and it effectively learns geometric consistency in a fixed environment.

[0177] In a simple way, areas where the pixel reprojection difference is greater than a certain standard (e.g., mean ± deviation, a specific threshold) are masked. The computer device can automatically update this at every step.
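
The simple threshold variant can be sketched as follows, as a non-limiting example. Here the criterion is taken to be the mean plus `k` standard deviations of the current error map; the factor `k=2.0` and the function names are assumptions.

```python
import numpy as np

def threshold_mask(reproj_err, k=2.0):
    """Keep pixels whose error is at most mean + k * std of the current error map."""
    thr = reproj_err.mean() + k * reproj_err.std()
    return reproj_err <= thr

def masked_mean_error(reproj_err, mask):
    """Average the reprojection error over the pixels that were kept."""
    kept = reproj_err[mask]
    return float(kept.mean()) if kept.size > 0 else 0.0
```

Because the threshold is recomputed from each error map, the mask updates automatically at every training step.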

[0178] In a more sophisticated way, there is a method to identify regions containing dynamic objects by calculating the optical flow between consecutive frames. Then, the corresponding regions are targeted for auto-masking and excluded from reprojection loss.

[0179] Since excessive masking in the initial stages can result in fewer pixels to learn, variations are also possible in which the computer device gradually applies stricter masking criteria as it converges stably.
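
One hypothetical way to realize such a gradual tightening is a schedule on the threshold factor, sketched below. Using the training step as a proxy for convergence, the linear interpolation, and the values `k_start` and `k_end` are all assumptions made for illustration.

```python
def mask_strictness(step, total_steps, k_start=4.0, k_end=2.0):
    """Linearly tighten the masking factor from k_start to k_end over training.

    A larger factor masks fewer pixels (lenient); a smaller one masks more (strict).
    `total_steps` must be positive.
    """
    frac = min(max(step / float(total_steps), 0.0), 1.0)
    return k_start + frac * (k_end - k_start)
```

The returned factor could be passed as the multiplier of the standard deviation in a mean-plus-deviation threshold, so that early training keeps most pixels and later training masks outliers more aggressively.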

[0180] The following describes multi-scale-based depth encoders.

[0181] The computer device extracts feature maps by downsampling the reference image and the source image in multiple stages through a depth encoder. For example, a structure (U-Net, Feature Pyramid Network, etc.) can be used to obtain features at each stage by reducing the resolution to 1 / 2, 1 / 4, 1 / 8, etc., as well as the original scale (1×), and then hierarchically accumulating them.
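
The resolution hierarchy can be illustrated, in a non-limiting way, with plain 2×2 average pooling on a single-channel array. An actual depth encoder would use learned strided convolutions or similar layers instead; the sketch only shows how the 1×, 1/2, 1/4, and 1/8 scales relate, and its function names are hypothetical.

```python
import numpy as np

def downsample2(x):
    """Halve resolution by 2x2 average pooling (an odd trailing row/column is cropped)."""
    H, W = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    x = x[:H, :W]
    return (x[0::2, 0::2] + x[1::2, 0::2] + x[0::2, 1::2] + x[1::2, 1::2]) / 4.0

def build_pyramid(image, num_scales=4):
    """Return [1x, 1/2, 1/4, ...] versions of a 2D array, finest first."""
    levels = [np.asarray(image, dtype=float)]
    for _ in range(num_scales - 1):
        levels.append(downsample2(levels[-1]))
    return levels
```

With a 16×16 input and `num_scales=4`, the levels have shapes 16×16, 8×8, 4×4, and 2×2.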

[0182] Low-resolution features (1 / 8, 1 / 16 scale) capture the overall geometry and layout of the entire scene over a wide coverage area, while high-resolution features (1 / 2, 1× scale) preserve minor details (object boundaries, small object shapes, etc.). Using these multi-scale features together increases the accuracy of both depth estimation and intrinsic decomposition.

[0183] The computer device is configured so that the depth decoder and the intrinsic decoder each receive and utilize the feature maps calculated for each scale. For example, the depth decoder calculates the final depth map by progressively upsampling to scales such as 1 / 16, 1 / 8, 1 / 4, and 1 / 2. The intrinsic decoder can also progressively separate optical deviations by generating diffuse and residual images at various resolutions such as 1 / 8, 1 / 4, and 1 / 2.

[0184] High-resolution information is used to accurately infer the depth and reflection components of fine areas, while low-resolution information reflects the overall scene structure (rough object placement, background-foreground distinction). This yields stable and refined depth and intrinsic decomposition results that cover a wide range, from large objects and backgrounds to small details.

[0185] For example, the computer device connects the intermediate scale output of the encoder (e.g., 1 / 4 scale features) to the decoder stage to restore details at intermediate resolution (similar to the U-Net structure). Since the depth decoder as well as the intrinsic decoder share multiple scales, illumination and reflection separation are also performed from a multi-scale perspective.

[0186] The computer device can also calculate reprojection loss stepwise from low-resolution to high-resolution using depth maps (or diffuse-residuals) predicted for each scale. Through this, it stably proceeds with learning by matching large structures initially (low resolution) and fine details later (high resolution).

[0187] Even during the inference phase, depth maps, diffuse, and residual results at various resolutions can be obtained and applied to subsequent processing (e.g., correction of sparse pixel regions, rapid prediction in downsampled environments, etc.). Furthermore, it can be utilized to quickly verify results from a lightly downsampled version and, if necessary, obtain high-resolution results to perform final, precise predictions.

[0188] The following is an exemplary implementation example.

[0189] Pyramid Encoder: A computer device receives reference and source images as input, downsamples them in steps (1 / 2, 1 / 4, 1 / 8, etc.), and generates feature maps by performing convolution, activation, and pooling at each step. The feature maps at each step can be shared by the depth decoder and the intrinsic decoder, or received independently, and are flexibly designed according to the model configuration.

[0190] Branch Decoder: Multiscale feature maps are distributed in parallel, with one entering the depth decoder and the other the intrinsic decoder. As a result, the depth and intrinsic decompositions are progressively upsampled at different resolutions to produce the final map.

[0191] Below, the calculation of reprojection loss through multi-scale learning utilizing the aforementioned multi-scale is explained.

[0192] The computer device utilizes feature maps generated by the depth encoder in multiple stages (multiple resolutions) to produce outputs of various scales from the depth decoder and the intrinsic decoder, respectively. At this time, the computer device can separately calculate pixel reprojection loss based on the results predicted at each resolution. Through this process, multi-scale learning is performed.

[0193] The computer device reprojects the depth map and intrinsic image at reduced resolution and compares them to a reduced version of the source image. At the low resolution level, it is advantageous for matching large structures (overall geometric information of the scene). The pixel reprojection loss (low resolution loss) calculated during this process quickly corrects errors in macroscopic scene composition.

[0194] When reprojecting through the results of a higher level (e.g., 1 / 2, 1× resolution), it is possible to verify whether object boundaries and details of small objects match accurately. The computer device updates the model to match even fine areas through pixel reprojection loss (high-resolution loss) of this high-resolution level.

[0195] The present invention may additionally apply multi-scale pixel reprojection loss to a self-supervised structure using reference-source image reprojection. That is, a loss function is defined that sums (or averages) errors for each stage scale, and the parameters of the depth encoder, depth decoder, intrinsic decoder, and pose network are updated through backpropagation.

[0196] A sequential approach that starts with low resolution to catch large errors first, followed by reinforcement of high-resolution details, contributes to both learning stability and accuracy. If necessary, the computer device can also use a strategy of giving more weight to low-resolution loss in the early epochs and increasing the weight of high-resolution loss in later epochs.
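
The epoch-dependent weighting strategy can be sketched as follows. This is only an illustrative assumption: the linear schedule, the function names `scale_weights` and `multi_scale_loss`, and the convention that index 0 is the coarsest scale are not taken from the disclosure.

```python
import numpy as np

def scale_weights(epoch, total_epochs, num_scales):
    """Per-scale weights, index 0 = coarsest scale.
    Early epochs favor coarse scales, late epochs favor fine scales.
    The linear schedule is an assumption; weights sum to 1."""
    t = epoch / max(total_epochs - 1, 1)            # 0 at start, 1 at end
    coarse_to_fine = np.linspace(0.0, 1.0, num_scales)
    raw = (1.0 - t) * (1.0 - coarse_to_fine) + t * coarse_to_fine + 1e-3
    return raw / raw.sum()

def multi_scale_loss(per_scale_losses, weights):
    """Weighted sum of the reprojection losses computed at each scale."""
    return float(np.dot(weights, per_scale_losses))

early = scale_weights(epoch=0, total_epochs=10, num_scales=4)
late = scale_weights(epoch=9, total_epochs=10, num_scales=4)
total = multi_scale_loss(np.array([1.0, 1.0, 1.0, 1.0]), early)
```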

[0197] The following describes self-supervised learning through Mahalanobis distance-based filtering.

[0198] The computer device uses a self-supervised method that calculates pixel differences (optical errors) by reprojecting between the reference image and the source image. However, if lighting or reflections are excessively strong, or if dynamic objects or anomalous reflectors are present, large errors occur in regions that are difficult to explain using standard reprojection equations. Such sets of pixels that deviate to statistical extremes (outliers) can distort the entire learning process.

[0199] Computer devices can represent optical differences between reference and source images, as well as between reference diffuse-residual images and source diffuse-residual images, as multidimensional vectors. For example, RGB (or YUV) channel differences or diffuse-residual channel differences are grouped into a single vector, and then the Mahalanobis distance is used to measure how far they are from the corresponding distribution.

[0200] Unlike simple Euclidean distance, the Mahalanobis distance identifies outliers by considering the variance and covariance of the data. In other words, it determines how far an optical error vector deviates from the mean and whether its direction is unusual when compared to the correlation between other channels. The computer device considers pixels with a Mahalanobis distance above a specific threshold as outliers and includes them in the filtering process. Regions identified in this way are excluded from reprojection loss calculations or diffuse and residual consistency learning, or their weights are reduced.
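
The filtering step can be sketched as below. This is a minimal sketch under stated assumptions: the function name `mahalanobis_mask`, the threshold value of 3.0, and the small diagonal regularizer are illustrative choices, and the error vectors are synthetic.

```python
import numpy as np

def mahalanobis_mask(errors, threshold=3.0):
    """errors: (N, C) array of per-pixel optical error vectors.
    Returns (N,) array: 1.0 for normal pixels, 0.0 for outliers.
    The threshold and the 1e-6 regularizer are assumed values."""
    mean = errors.mean(axis=0)
    cov = np.cov(errors, rowvar=False) + 1e-6 * np.eye(errors.shape[1])
    diff = errors - mean
    # Squared Mahalanobis distance: diff^T * cov^-1 * diff for each pixel.
    d2 = np.einsum("nc,cd,nd->n", diff, np.linalg.inv(cov), diff)
    return (np.sqrt(d2) <= threshold).astype(np.float32)

rng = np.random.default_rng(0)
errors = rng.normal(0.0, 0.05, size=(1000, 3))   # typical error vectors
errors[0] = [2.0, -2.0, 2.0]                      # one extreme pixel
alpha = mahalanobis_mask(errors)
```

The resulting array can be multiplied into the per-pixel loss to exclude outliers, or replaced by a soft weight if reduced weighting is preferred over exclusion.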

[0201] The reference diffuse image and source diffuse image represent the object's intrinsic color (albedo). If the two diffuse images differ significantly to a statistically inexplicable degree, it is highly likely that the light is excessively reflected or there are input outliers.

[0202] If the difference between the reference residual image and the source residual image is too large based on the Mahalanobis distance, the region may be an extreme highlight or an area of lighting reflection. The computer device treats this as anomalous optical variation and filters it out to prevent it from causing learning errors.

[0203] In actual implementation, the Mahalanobis distance can be calculated by constructing a multi-channel vector that includes the original reference / source image and the diffuse / residual images. This allows for more precise identification of outlier regions from various perspectives (original, diffuse, and residual).

[0204] The computer device handles outlier pixels identified by the Mahalanobis distance by excluding them or assigning lower weights during the reprojection loss calculation. This process enables the model to perform stable self-supervised learning based solely on optical errors in the normal region.

[0205] Since the data distribution is not stable during the early stages of training, the mean and covariance estimated for the Mahalanobis distance calculation may be inaccurate. As training progresses and parameters are updated, the computer device periodically updates these statistics (mean and covariance) to improve filtering accuracy.

[0206] Excessive reflected light may occur in certain areas due to mirror reflectors, metal surfaces, glass windows, etc. The present invention removes and mitigates these extreme reflection regions using Mahalanobis distance-based filtering, thereby ensuring that overall model performance is not compromised.

[0207] This makes learning robust by statistically excluding abnormal pixels, even in scenes containing dynamically moving objects, strong light reflections, or color distortion.

[0208] By considering diffuse and residual images together, it prevents outliers from distorting the lighting decomposition or making depth estimation inaccurate.

[0209] The Mahalanobis distance method reflects the correlation between each channel, so it offers higher flexibility and accuracy than simple filters based on a single threshold.

[0210] The following describes in detail self-supervised learning through reflectance and shading consistency losses.

[0211] As mentioned above, the computer device obtains reference diffuse and source diffuse images through an intrinsic decoder. These diffuse images reflect the intrinsic reflectance (albedo) of the object's surface. For the same object surface, the albedo should not change significantly even if the viewpoint changes.

[0212] Computer devices define a Reflectance Consistency Loss that ensures color and reflectance match between a reference diffuse image and a source diffuse image on the same surface. By training to minimize this loss, the diffuse image stably represents object colors regardless of lighting and shadow effects.

[0213] The present invention maintains the intrinsic color of an object's surface consistently through diffuse images, even under various angles and lighting conditions. This reduces the problem of the object's intrinsic color fluctuating due to lighting deviations or variations in reflected light, making the depth estimation model less sensitive to lighting variations.

[0214] Residual images represent dynamically changing optical elements on an object's surface, such as lighting, shadows, and reflections. Depending on viewpoint changes, the residual image can vary due to shifts in surface angles and light source positions. Computer devices define Shading Consistency Loss between the reference residual image and the source residual image to ensure that lighting and shading change naturally in response to viewpoint shifts. For example, the model learns how the shadow of the same object shifts to a certain extent depending on the viewpoint, or how reflections move to reasonable positions. If excessive reflections or abnormal shading changes occur, the Shading Consistency Loss increases, which adjusts parameters of the intrinsic decoder or depth and pose networks through backpropagation. As a result, it induces that only physically consistent shading and reflection patterns remain.

[0215] The computer device constructs the overall loss function by adding the reflectance consistency and shading consistency losses as additional terms to the existing reprojection loss (pixel difference). Reflectance consistency loss acts mainly on the diffuse channel, while shading consistency loss acts mainly on the residual channel, enabling the object's intrinsic color and the lighting variations to be separated and learned correctly. A higher reflectance or shading consistency loss indicates that the intrinsic decoder or the depth / pose network is performing the viewpoint shift or the optical decomposition inaccurately. By iteratively updating parameters to minimize these losses, the computer device enables stable monocular depth estimation even in environments with significant changes in lighting and reflection. However, overemphasizing only reflectance and shading consistency can disrupt the balance with other losses (such as reprojection error). Therefore, the computer device adjusts the weights (e.g., λ) of the reflectance and shading consistency losses to keep the various loss terms balanced during convergence.

[0216] In this disclosure, by managing the consistency of reflectance (diffuse) and shading (residual) as separate loss terms, illumination and shadow variations are induced to remain within a physically reasonable range. This can significantly reduce model errors, particularly on metal surfaces or in complex optical environments.

[0217] The following describes in detail the learning process using Geometric Shading Loss.

[0218] The computer device can indirectly calculate surface normals based on depth maps estimated for the reference and source images, respectively. For example, it derives the surface gradient at each point through the depth difference between adjacent pixels and obtains the normal vector by normalizing it.
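
A minimal sketch of deriving normals from depth differences between adjacent pixels is given below. It is a simplification under stated assumptions: gradients are taken in pixel units without camera intrinsics, and the function name `normals_from_depth` is hypothetical.

```python
import numpy as np

def normals_from_depth(depth):
    """depth: (H, W) array. Returns unit normals of shape (H, W, 3).
    Gradients are finite differences in pixel units (assumed simplification;
    a full implementation would back-project with the camera intrinsics)."""
    dz_dv, dz_du = np.gradient(depth)     # along rows (v), then columns (u)
    n = np.stack([-dz_du, -dz_dv, np.ones_like(depth)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

flat = normals_from_depth(np.ones((4, 4)))               # fronto-parallel plane
slope = normals_from_depth(np.tile(np.arange(5.0), (4, 1)))  # depth grows with u
```

For the flat plane every normal points along the optical axis, and for the sloped plane every normal tilts against the direction of increasing depth.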

[0219] Computer devices represent surface normals in either a camera coordinate system or a world coordinate system. While normal vectors may change depending on the viewpoint, the relative direction of the normal must maintain consistent characteristics for the same object surface.

[0220] In this invention, even without using explicit light source information (e.g., sun direction, internal lighting direction), the direction from which the light source is shining can be indirectly inferred by learning residual images. The model gradually learns the correlation between normals and residuals without additional sensors or labels.

[0221] Generally, shading depends on the angle between the light source vector and the surface normal. By comparing how the reference residual image renders a surface with a specific normal distribution with how the source residual image changes when the viewpoint changes, the model must predict optically reasonable shading changes.

[0222] The computer device evaluates whether pixels within the reference and source residual images correspond to normal vectors, or whether the two residuals change naturally on the same surface. Geometric Shading Loss induces the model to correct for areas where light intensity appears irregular due to surface gradient by assigning a large error.

[0223] If the surface remains the same regardless of the viewpoint, the residual image must change appropriately in correspondence with the light source-normal angle. Geometric shading loss quantifies this point, suppressing shading changes that are too abrupt or physically incorrect.

[0224] The computer device calculates the normal corresponding to each pixel in the reference and source images and associates the residual image with the corresponding location. If the intensity or distribution of shading differs excessively from the theoretical value estimated through dot operations between the normal vector and the light source direction, the loss is amplified.
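
The comparison between observed shading and the value expected from the normal and the light direction can be sketched as follows. This is an assumption-laden illustration: the clamped Lambertian term, the mean absolute difference, the function name `geometric_shading_loss`, and the explicit `light_dir` argument are hypothetical. In the disclosure the light direction is not supplied externally, so here it merely stands in for a quantity the model would infer.

```python
import numpy as np

def geometric_shading_loss(normals, residual, light_dir):
    """normals: (H, W, 3) unit vectors; residual: (H, W) shading values;
    light_dir: (3,) assumed light direction (hypothetical input)."""
    light = np.asarray(light_dir, dtype=np.float64)
    light = light / np.linalg.norm(light)
    expected = np.clip(normals @ light, 0.0, None)   # Lambertian n . l, clamped at 0
    return float(np.abs(residual - expected).mean())

normals = np.zeros((4, 4, 3))
normals[..., 2] = 1.0                                 # all normals face the camera
consistent = geometric_shading_loss(normals, np.ones((4, 4)), [0.0, 0.0, 1.0])
inconsistent = geometric_shading_loss(normals, np.zeros((4, 4)), [0.0, 0.0, 1.0])
```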

[0225] The computer device verifies whether the reference residual and the source residual exhibit geometrically consistent shading changes on the same object surface, even at different viewpoints. The loss term is designed based on normal information so that the amount of shading change falls within a reasonable range.

[0226] When shading loss increases, components such as the intrinsic decoder (residual branching), depth decoder (influencing normal estimation), and pose network (viewpoint transformation) are adjusted through training to reduce the corresponding error. This enables the model to learn physically valid lighting and shading phenomena.

[0227] The present invention can autonomously learn shading changes using normal estimation + residual images without providing the position or intensity of an external light source. Depending on the dataset, it can encompass various optical situations (indoor lighting, sunlight, reflective materials, etc.).

[0228] The present invention feeds the shading loss back to the entire monocular depth estimation model (depth encoder / decoder, intrinsic decoder, and pose network). As a result, the model explains lighting and reflection variations more precisely, and high accuracy can be achieved even through self-supervised learning. Even in challenging scenes such as mirror surfaces, metal surfaces, and interior structures with complex shading, the normal-based shading loss corrects extreme errors and ensures a natural optical decomposition.

[0229] Mathematical formulas related to the implementation of the present invention will be explained below.

[0230] Mathematical Equation 1 is a formula for expressing the original image as a multiplication relationship between the diffuse component (the intrinsic component of the object) and the residual component (optical component).

[0231] I = A ⊙ S (Equation 1)

[0232] Here, I is the original image, A is the diffuse component corresponding to the intrinsic reflectance (albedo) of the object surface, and S represents the optical component including lighting and shadow. The symbol ⊙ denotes element-wise multiplication of pixels.

[0233] Equation 1 is a physical assumption that the observation information of a pixel is combined in a multiplicative relationship with the object's intrinsic color (diffuse) and optical components affected by lighting, shadows, and reflected light. By using this multiplicative structure, changes in external lighting or differences in reflected light are concentrated in S, thereby reducing the problem of the object's intrinsic color (A) changing unnecessarily.

[0234] Equation 2 below shows how the original image (I) is divided into diffuse (L) and residual (R) in log space.

[0235] log(I) = log(L) + log(R) (Equation 2)

[0236] Here, I represents the original (observed) image containing the reference or source image, L represents the diffuse component corresponding to the object's intrinsic reflectance (albedo), and R represents the residual component that varies due to optical changes (illumination, reflected light, shadows, etc.).

[0237] At this time, by converting the multiplication relationship (Equation 1) of the original image into an addition form (Equation 2) in logarithmic space, high-frequency components such as strong reflections or highlights can be relatively reduced and separated.

[0238] The computer device represents the portion corresponding to the intrinsic color (albedo) of the object surface as log(L). Abrupt brightness fluctuations caused by changes in lighting, such as highlights, reflections, and shadows, are separated into the log(R) channel.

[0239] The present invention enables self-learning to satisfy the relationship log(I)≈log(L)+log(R). Through this, the model can be trained to restore a value close to the original image in the form exp(log(L)+log(R))=L⊙R.

[0240] The computer device can log-transform the original image (I) pixel by pixel, and perform minimal normalization (e.g., adding ϵ) so that there are no values less than or equal to 0. The transformed log(I) is input into an intrinsic decoder, etc., to separate and predict log(L) and log(R).

[0241] By separating diffuse (log(L)) and residual (log(R)), pixel intensity variations caused by illumination and reflections do not become direct noise but can be absorbed into the residual channel. As a result, the intrinsic color of objects is stably maintained, and variations in light and reflection do not have an excessive impact on self-supervised depth estimation or reprojection loss calculations.
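
The log transform with ϵ and the recomposition exp(log(L)+log(R)) can be sketched as follows. The value of `EPS` and the function names `to_log` and `recompose` are assumptions for illustration. The diffuse and residual arrays here are synthetic rather than decoder outputs.

```python
import numpy as np

EPS = 1e-6   # assumed value of the small constant added before the log

def to_log(image):
    """Pixel-wise log transform that stays finite for zero-valued pixels."""
    return np.log(np.clip(image, 0.0, None) + EPS)

def recompose(log_diffuse, log_residual):
    """Restore the image as exp(log(L) + log(R)) = L * R (element-wise)."""
    return np.exp(log_diffuse + log_residual)

rng = np.random.default_rng(0)
L = rng.uniform(0.2, 1.0, size=(4, 4))   # synthetic diffuse component
R = rng.uniform(0.5, 1.5, size=(4, 4))   # synthetic residual component
restored = recompose(np.log(L), np.log(R))
log_black = to_log(np.zeros((2, 2)))     # finite thanks to EPS
```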

[0242]

[0243] Equation 3 below represents the diffuse reconstruction loss to measure how much the reference diffuse image and reference residual image differ from the comparison target image obtained through the reprojection or alignment process.

[0244]

[0245] Here, the terms of Equation 3 are the reference image, the reference diffuse image, and a comparison target image obtained by reprojecting or aligning the reference residual image, and ‖·‖₁ denotes the L1 norm.

[0246] The computer device can compare the result of log-transforming the reference image with the result of compositing the diffuse and residual images output by the intrinsic decoder. In this process, by calculating the difference between the three terms using the L1 norm, it measures how closely the reference diffuse and residual images match the actually observed reference image.

[0247] If training is performed to reduce this loss, the decomposition of the reference image into its diffuse component and residual component becomes increasingly accurate. Consequently, the model is guided so that optical deviations such as illumination, highlights, and reflections are absorbed into the residuals, while the object's intrinsic color (diffuse) is stably maintained. The present invention updates model parameters by integrating the diffuse reconstruction loss with other losses (e.g., pixel reprojection loss). Through this, the discrepancy between the original image and the diffuse / residual decomposition results is reduced even in scenes with heavy illumination and reflections, thereby enabling an optical decomposition that is advantageous for monocular depth estimation.

[0248]

[0249] Alignment of the reference image coordinate system and the source image coordinate system for calculating pixel reprojection loss can be performed through Equation 4 below.

[0250]

[0251] Here, the left-hand side of Equation 4 is the reprojected coordinate, and the remaining terms are the relative pose from the reference image to the source image, the depth of pixel (u,v) of the reference image, the camera intrinsic parameter matrix K, and proj(∙), a function that projects from homogeneous coordinates to 2D.

[0252] The computer device estimates the 3D point corresponding to each pixel (u,v) in the reference image coordinate system, converts it to the source image coordinate system, and obtains the new, reprojected coordinates. In this process, the proj(∙) function projects homogeneous coordinates (a position in 3D space) onto pixel coordinates on a 2D plane. The relative pose is the reference→source transformation, and the depth term is the depth of pixel (u,v) in the reference image. Combining the two reveals where pixel (u,v) is located in 3D space and to which position in the source coordinate system it is mapped.

[0253] K is the camera intrinsic parameter matrix, which includes the focal length, principal point position, etc. Its inverse converts the reference pixel coordinates (u,v,1) into normalized coordinates, and K is finally applied to project the point in the source coordinate system onto 2D pixels.

[0254] The computer device uses the depth of the reference pixel (u,v) to estimate where the pixel is located in 3D space. It obtains a normalized ray for the pixel and multiplies it by the depth to obtain the 3D position in the reference coordinate system.

[0255] The relative pose contains the transformation from the reference to the source and converts the 3D point into the position and angle of the source camera. Multiplying the converted result by K again determines which pixel on the source image plane the point corresponds to.

[0256] The proj(∙) function projects homogeneous coordinates (e.g., (x,y,z)) onto 2D pixel coordinates. Using the resulting coordinates, the computer device moves the corresponding pixel data of the reference image to the source coordinate system and performs a comparison (pixel reprojection loss, etc.).

[0257] The computer device can improve the accuracy and stability of monocular depth estimation by using this to perform pixel-level reprojection (warping) and learning model parameters in a direction that minimizes pixel reprojection loss even without labels.
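
The reprojection steps described above (back-project with the inverse of K, scale by depth, apply the relative pose, project with K) can be sketched as follows. The function name `reproject`, the 4×4 matrix form of the pose `T`, and the toy intrinsics are assumptions for illustration and do not reproduce the notation of Equation 4.

```python
import numpy as np

def reproject(u, v, depth, K, T):
    """Map reference pixel (u, v) with the given depth to source pixel coordinates.
    K: (3, 3) intrinsics; T: (4, 4) reference-to-source rigid transform (assumed form)."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])      # normalized ray
    point_ref = depth * ray                              # 3D point, reference frame
    point_src = (T @ np.append(point_ref, 1.0))[:3]      # 3D point, source frame
    x, y, z = K @ point_src
    return x / z, y / z                                  # proj(.): divide by z

K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
T_identity = np.eye(4)
T_shift = np.eye(4)
T_shift[0, 3] = 0.1                                      # small translation along x

same = reproject(20.0, 30.0, 2.0, K, T_identity)
moved = reproject(50.0, 50.0, 2.0, K, T_shift)
```

With the identity pose the pixel maps to itself, and with the small sideways translation the principal-point pixel at depth 2 moves 5 pixels horizontally.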

[0258] Equation 5 below defines the cross-reconstruction loss by integrating the diffuse reconstruction loss and the residual reconstruction loss. The diffuse reconstruction loss and the residual reconstruction loss are integrated into the cross-reconstruction loss and can be utilized for training a monocular camera depth estimation model.

[0259]

[0260] Here, the terms of Equation 5 are the reference image, the reference residual image, and the image obtained by warping the source diffuse image into the reference coordinate system.

[0261] The computer device projects (warps) the source diffuse image onto the reference coordinate system and sums the warped result with the reference residual in log space to reconstruct the original reference image. By calculating the difference between this reconstructed result and the actual reference image using the L1 norm, the accuracy of the diffuse-residual decomposition is measured from the source-to-reference perspective. In Equation 5, restoration of the reference image is attempted from two elements: the warped source diffuse image and the reference residual image. As a result, if the diffuse and residual channels are not properly separated and aligned, a large difference appears and the loss increases. The present invention uses this for self-supervised learning so that the diffuse and residual decomposition becomes more precise.

[0262] Because the result of converting the source diffuse image to the reference coordinate system is combined with the reference residual and compared with the reference image, reconstruction errors arising during the viewpoint transformation (warping) process are naturally reflected. This drives training to maintain viewpoint transformation consistency for both the object's intrinsic color (diffuse) and the optical residual.

[0263] The computer device updates the entire model (depth encoder / decoder, intrinsic decoder, pose network, etc.) to minimize the cross-reconstruction loss. As the difference between the reconstructed image and the reference image decreases, the viewpoint transformation and the intrinsic decomposition can become more refined together.
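
The cross-reconstruction idea (warped source diffuse plus reference residual in log space, compared with the reference image under an L1 norm) can be sketched as follows. This is a sketch under assumptions: the warping is taken as already done, the L1 norm is averaged over pixels, and the function name and `eps` value are hypothetical.

```python
import numpy as np

def cross_reconstruction_loss(ref_image, warped_src_diffuse, ref_residual, eps=1e-6):
    """Mean L1 difference in log space between the reference image and the
    sum of the warped source diffuse and the reference residual.
    eps is an assumed constant that keeps the logarithm finite."""
    log_ref = np.log(ref_image + eps)
    log_recon = np.log(warped_src_diffuse + eps) + np.log(ref_residual + eps)
    return float(np.abs(log_ref - log_recon).mean())

rng = np.random.default_rng(0)
diffuse = rng.uniform(0.2, 1.0, size=(4, 4))
residual = rng.uniform(0.5, 1.5, size=(4, 4))
image = diffuse * residual

good = cross_reconstruction_loss(image, diffuse, residual)        # consistent decomposition
bad = cross_reconstruction_loss(image, 2.0 * diffuse, residual)   # diffuse off by a factor of 2
```

A consistent decomposition gives a loss near zero, while a diffuse image that is too bright by a factor of 2 gives a loss near log(2).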

[0264]

[0265] A contrast loss to suppress the variance between the diffuse image and the residual image produced by the intrinsic decoder is defined by Equation 6 and can be used to train the monocular camera depth estimation model.

[0266]

[0267] Here, the margin term sets the allowed distance, i and j identify different images within the batch, b is the batch size, and ‖·‖₂ denotes the L2 norm.

[0268] The computer device produces diffuse images estimated at different viewpoints (source→reference, reference→reference) through the intrinsic decoder. If the diffuse channel results pointing to the same object surface are unnecessarily far apart (if the pixel difference is large), the optical decomposition can be considered unstable. Equation 6 is a contrast loss designed to keep the distance between these diffuse results within a certain range (the margin).

[0269] Equation 6 induces the distance between the two diffuse results to be less than or equal to the margin. If the distance exceeds the margin, the loss takes a positive value, facilitating learning so that the model reduces the difference.

[0270] By iteratively calculating for various pairs of diffuse images (different i, j) within the same batch, the dispersion of all diffuse images is suppressed. As a result, the present invention reduces cases where the representation of object reflectance (diffuse) varies excessively within a batch.

[0271] The computer device generates a reference diffuse image and a source diffuse image in the intrinsic decoder, and the source diffuse image can be aligned by warping it to the reference coordinates. In this case, Equation 6 can be calculated by measuring the distance between diffuse images from two different viewpoints or different samples (surfaces of identical or similar objects).

[0272] A larger loss value may indicate that the diffuse channels are unnecessarily dispersed. The model is trained to minimize this loss, so that the distance between diffuse images is kept below the margin.

[0273] When this contrast loss is reduced, diffuse images representing the same object surface have similar color and brightness values. Consequently, in intrinsic decomposition, the phenomenon of noise caused by illumination deviations or reflected light seeping into the diffuse is reduced, and optical deviations are primarily absorbed by the residual (illumination channel).
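
A margin-based loss of this kind can be sketched as follows. The exact form of Equation 6 is not reproduced here: the hinge `max(0, distance - margin)`, the averaging over pairs, and the function name `contrast_loss` are assumptions chosen to match the behavior described above (zero loss within the margin, positive loss beyond it).

```python
import numpy as np

def contrast_loss(diffuse_batch, margin):
    """diffuse_batch: (b, H, W, C) diffuse images assumed to show the same surface.
    Penalizes pairwise L2 distances that exceed the margin (assumed hinge form)."""
    b = diffuse_batch.shape[0]
    flat = diffuse_batch.reshape(b, -1)
    total, pairs = 0.0, 0
    for i in range(b):
        for j in range(i + 1, b):
            dist = np.linalg.norm(flat[i] - flat[j])   # L2 distance of pair (i, j)
            total += max(0.0, dist - margin)
            pairs += 1
    return total / max(pairs, 1)

identical = contrast_loss(np.ones((3, 2, 2, 1)), margin=0.5)
spread = contrast_loss(np.stack([np.zeros((2, 2, 1)), np.ones((2, 2, 1))]), margin=0.5)
```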

[0274]

[0275] In training a monocular camera depth estimation model, the intrinsic decoder can be trained by a loss function defined by Equation 7.

[0276]

[0277] Here, the three loss terms are the diffuse reconstruction loss according to Equation 3, the cross-reconstruction loss according to Equation 5, and the contrast loss according to Equation 6, and each term is multiplied by its own weight.

[0278] The computer device updates the parameters of the intrinsic decoder (diffuse and residual branches) to minimize the loss defined by Equation 7. By doing so, optical deviations (illumination and reflection) are concentrated into the residuals, and the object's intrinsic color (diffuse) is stably separated, thereby obtaining highly consistent decomposition results even in environments with large lighting variations.

[0279] In the actual model, the pixel reprojection loss (depth and pose related) and the intrinsic decomposition loss defined in Equation 7 are combined to form the overall learning objective. As a result, monocular camera depth estimation and intrinsic decomposition are optimized complementarily.

[0280] The weights of these terms may change gradually during the learning process.

[0281]

[0282] In addition, the pixel reprojection loss used in training the monocular camera depth estimation model can be defined by Equation 8.

[0283]

[0284] Here, the first image is a reference image or reference diffuse image, the second is an image obtained by projecting the source image (or source diffuse image) onto the coordinate system of the reference image, M is a mask for excluding dynamically moving objects or abnormal regions, SSIM is Structural Similarity, and the remaining coefficient is the weight that balances the SSIM term and the L1 term.

[0285] The computer device evaluates the optical difference between the pixels of the two images not only from the perspective of simple pixel differences (e.g., L1) but also from the perspective of structural similarity (SSIM). By reflecting human visual characteristics to some extent and measuring how similar the structures are, SSIM evaluates image quality beyond simple brightness differences.

[0286] In Equation 8, M⊙[…] causes the loss to be ignored (or mitigated) for regions identified as dynamically moving or as optical outliers (abnormally large errors). This allows viewpoint movement to be interpreted around a fixed background, ensuring that spurious errors do not interfere with model training.

[0287] The image compared in Equation 8 can be the original (I) or the diffuse (L). The computer device can compare the original vs. projected image and the diffuse vs. projected diffuse, respectively, based on Equation 8.
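
A masked combination of an SSIM term and an L1 term can be sketched as follows. This is not the exact form of Equation 8: the weighting `alpha * (1 - SSIM) / 2 + (1 - alpha) * L1`, the value `alpha=0.85`, the 3×3 box window, and the SSIM constants are common choices assumed here for illustration, with image intensities assumed to lie in [0, 1].

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def box_mean(x, size=3):
    """Mean over size x size windows with edge padding; x has shape (H, W)."""
    padded = np.pad(x, size // 2, mode="edge")
    return sliding_window_view(padded, (size, size)).mean(axis=(-2, -1))

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Per-pixel SSIM map using local box-filter statistics."""
    mu_x, mu_y = box_mean(x), box_mean(y)
    var_x = box_mean(x * x) - mu_x ** 2
    var_y = box_mean(y * y) - mu_y ** 2
    cov = box_mean(x * y) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den

def photometric_loss(ref, warped, mask, alpha=0.85):
    """Masked mean of the per-pixel SSIM + L1 error; mask is 1 for valid pixels."""
    per_pixel = alpha * (1.0 - ssim(ref, warped)) / 2.0 \
        + (1.0 - alpha) * np.abs(ref - warped)
    return float((mask * per_pixel).sum() / max(mask.sum(), 1.0))

rng = np.random.default_rng(0)
ref = rng.random((8, 8))
ones = np.ones((8, 8))
same = photometric_loss(ref, ref, ones)
different = photometric_loss(ref, 1.0 - ref, ones)
all_masked = photometric_loss(ref, 1.0 - ref, np.zeros((8, 8)))
```

The same function can be called once with the original images and once with the diffuse images, matching the two comparisons described above.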

[0288]

[0289] The computer device can identify non-Lambertian regions in an image based on Equation 9 and exclude them from learning.

[0290] The computer device examines how different the photometric error of the original image (I) is from the photometric error of the pseudo-diffuse image (L′). Because illumination and reflection are separated out of the pseudo-diffuse image, the Lambertian reflection assumption is relatively valid in areas with small errors, whereas areas with large error differences are considered to be places where non-Lambertian reflection is strongly at play.

[0291] The first error term represents the photometric error in the reference image and source image pair, and the second represents the photometric error in the reference diffuse image and source diffuse image pair (or pseudo-diffuse). On a Lambertian surface, since the pseudo-diffuse (L′) has the illumination deviation removed, the two errors are similar or can be low. As non-Lambertian reflection becomes stronger, specular highlights are larger in the original image (I) while the pseudo-diffuse (L′) attempts to remove them, so the two error values can differ significantly. The present invention identifies regions where the two errors differ greatly as non-Lambertian regions and excludes (or masks) them from training. This prevents extreme reflected light from distorting model parameters.

[0292]

[0293] Here, P denotes the photometric reprojection loss of Equation 8. The first error term is the photometric error for the reference image and source image pair, and the second is the photometric error for the reference diffuse image and source diffuse image pair. L′ denotes the pseudo-diffuse image, log(R) denotes the log value of the residual component, and exp denotes the exponential function. Regions where the two errors differ greatly can be considered non-Lambertian regions and excluded from training.
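
The identification of non-Lambertian regions may be illustrated with the sketch below. It assumes a multiplicative decomposition I = L · R, so that the pseudo-diffuse image is recovered in log space as L′ = exp(log I − log R); this form, the function names, and the gap threshold are assumptions made for the example and are not taken from Equation 9 itself.

```python
import math


def pseudo_diffuse(image, log_residual, eps=1e-6):
    """Assumed form L' = exp(log(I) - log(R)), per pixel, on flattened lists."""
    return [math.exp(math.log(max(i, eps)) - lr)
            for i, lr in zip(image, log_residual)]


def non_lambertian_mask(err_image, err_diffuse, gap_threshold=0.1):
    """True where the original-image error and the pseudo-diffuse error disagree.

    err_image, err_diffuse: per-pixel photometric errors for the original pair
    and the pseudo-diffuse pair. gap_threshold is a hypothetical tuning value.
    """
    return [abs(a - b) > gap_threshold for a, b in zip(err_image, err_diffuse)]
```

Pixels flagged True would be the candidates for exclusion from training.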

[0294]

[0295] The computer device performs learning in a self-supervised manner and can filter out regions where the optical error between the reference image and the source image, the reference diffuse image and the source diffuse image, and the reference residual image and the source residual image is greater than a threshold level through Mahalanobis distance-based filtering. Here, Mahalanobis distance-based filtering can be performed based on Equation 10.

[0296]

[0297] Here, the error terms are the photometric errors defined by Equation 9, and the mean terms are the mean vectors of those photometric errors. If α is 1, the pixel is regarded as a normal region; if α is 0, it is filtered out as a region where the optical error is at or above the threshold level.

[0298] The computer device calculates the Mahalanobis distance, using the mean and covariance, for the original-image error and for the pseudo-diffuse-based error, respectively. By measuring these two distance values at each pixel, it is possible to determine how statistically close the photometric error is to an outlier.

[0299] When the condition of Equation 10 is satisfied, the present invention sets α to 0; that is, the pixel is regarded as having an abnormally large optical error and is filtered out. Conversely, if the condition is not satisfied, α is set to 1 and the pixel is regarded as a normal region. The mask α determined in this way is subsequently used in the reprojection loss calculation and elsewhere to ignore or exclude errors at or above the threshold level.

[0300] Through Equation 10, regions where non-Lambertian reflection or strong reflected light occurs (regions where the error difference is statistically large) are automatically identified and excluded from training, which reduces model distortion.

[0301] The first Mahalanobis distance indicates how far, statistically, the original-image error at pixel (u,v) deviates from the data mean and covariance. A large value indicates a high probability that the pixel is an outlier deviating significantly from the mean distribution.

[0302] The second Mahalanobis distance is the same measure applied to the pseudo-diffuse-based error. By comparing the two values, the computer device determines which of the original-image-based error and the pseudo-diffuse-based error is more abnormal.

[0303] The mask α takes the form of a 2D binary map, where 1 indicates normal (used for training) and 0 indicates abnormal (excluded from training). In this way, the present invention excludes extreme optical errors arising from dynamic objects, mirror reflections, metallic surfaces, and the like during self-supervised learning.

[0304] When α is 0 at pixel (u,v), the computer device ignores that pixel in the reprojection loss calculation, which prevents outliers from significantly affecting the model parameter update. As a result, depth and intrinsic decomposition are learned around static background and object surfaces, which promotes stable convergence.
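
Mahalanobis distance-based filtering of this kind may be sketched as follows. As a simplification, the sketch treats the two errors at each pixel as one 2D vector and computes a single joint distance, rather than two separate distances; the function name, the threshold tau, and the regularization term eps are hypothetical and are not taken from Equation 10.

```python
import math


def mahalanobis_mask(err_image, err_diffuse, tau=3.0, eps=1e-8):
    """Return alpha per pixel: 1 = normal (kept), 0 = filtered as an outlier."""
    n = len(err_image)
    mu_a = sum(err_image) / n
    mu_b = sum(err_diffuse) / n
    # Entries of the 2x2 covariance; eps on the diagonal keeps it invertible.
    saa = sum((a - mu_a) ** 2 for a in err_image) / n + eps
    sbb = sum((b - mu_b) ** 2 for b in err_diffuse) / n + eps
    sab = sum((a - mu_a) * (b - mu_b)
              for a, b in zip(err_image, err_diffuse)) / n
    det = saa * sbb - sab * sab
    alpha = []
    for a, b in zip(err_image, err_diffuse):
        da, db = a - mu_a, b - mu_b
        # Squared distance d^2 = e^T * inverse(Sigma) * e for a 2x2 Sigma.
        d2 = (sbb * da * da - 2.0 * sab * da * db + saa * db * db) / det
        alpha.append(1 if math.sqrt(max(d2, 0.0)) < tau else 0)
    return alpha
```

With a tight cluster of small errors and one pixel whose errors are far larger, only that pixel receives α = 0, which matches the intended behavior of excluding extreme reflections while keeping ordinary surfaces.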

[0305] In the early stages of training, statistics such as the mean and covariance may not yet be stable, so the statistics may also be updated gradually from epoch to epoch.
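
One possible way to update the statistics gradually is an exponential moving average, sketched below. The disclosure does not specify the update rule; the moving-average form, the momentum value, and the function name are assumptions made for illustration.

```python
def update_running_stats(running_mean, running_var, batch_errors, momentum=0.9):
    """Blend stored error statistics with those of the current epoch or batch.

    Pass None for running_mean on the first call to adopt the current
    statistics unchanged. momentum is a hypothetical smoothing factor.
    """
    n = len(batch_errors)
    batch_mean = sum(batch_errors) / n
    batch_var = sum((e - batch_mean) ** 2 for e in batch_errors) / n
    if running_mean is None:
        return batch_mean, batch_var
    new_mean = momentum * running_mean + (1.0 - momentum) * batch_mean
    new_var = momentum * running_var + (1.0 - momentum) * batch_var
    return new_mean, new_var
```

A higher momentum makes the mask less sensitive to noisy early estimates, at the cost of adapting more slowly.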

[0306]

[0307] The monocular camera depth estimation model of the present invention can be trained using a depth estimation loss defined by Equation 11.

[0308] Here, α is the Mahalanobis-based mask defined by Equation 10, and P is the photometric reprojection loss defined by Equation 8.

[0309] Equation 11, which combines the Mahalanobis mask and the photometric reprojection loss described above, enables the model to learn depth and pose around the background and static objects while ignoring the excluded regions of the image. Because the contribution of a pixel with α equal to 0 becomes zero, extreme regions such as dynamic objects or non-Lambertian reflections do not cause learning errors. As a result, the model can converge stably without being swayed by large errors arising from dynamic objects or mirror reflections.
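
The combination of the mask and the per-pixel photometric loss may be sketched as below. Normalizing by the number of kept pixels is an assumption made for the example, as is the function name; Equation 11 may aggregate differently.

```python
def masked_depth_loss(alpha, photometric):
    """Average of alpha * P over the pixels where alpha is 1.

    alpha: per-pixel binary mask (1 = normal, 0 = excluded).
    photometric: per-pixel photometric reprojection loss P.
    Returns 0.0 when every pixel is excluded.
    """
    kept = sum(alpha)
    if kept == 0:
        return 0.0
    return sum(a * p for a, p in zip(alpha, photometric)) / kept
```

In this form a very large error at an excluded pixel has no influence on the loss value or on the resulting gradient.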

[0310]

[0311] The monocular camera depth estimation model of the present invention can be trained using the final loss function defined by Equation 12.

[0312]

[0313] Here, the first term is the depth estimation loss defined by Equation 11, and the second term is the intrinsic decoder loss defined by Equation 7.

[0314] The present invention combines an intrinsic decoder (diffuse-residual decomposition) with a monocular camera-based depth estimation model. The intrinsic loss separates illumination and reflected light, and the depth loss includes the reprojection loss caused by viewpoint shifts and the like. The final loss is the sum of these two losses. By minimizing the final loss, stable depth estimation and optical decomposition are achieved simultaneously even in scenes with significant lighting variations.

[0315] As needed, a loss function in which each term is assigned a weight can also be used.
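
The weighted combination may be sketched as follows. The function names and default weights are hypothetical; with all weights set to 1.0 the final loss reduces to the plain sum of the depth loss and the intrinsic decoder loss described above.

```python
def intrinsic_loss(l_diffuse, l_cross, l_contrast, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three intrinsic decoder terms named for Equation 7."""
    w_diffuse, w_cross, w_contrast = weights
    return w_diffuse * l_diffuse + w_cross * l_cross + w_contrast * l_contrast


def final_loss(l_depth, l_intrinsic, w_depth=1.0, w_intrinsic=1.0):
    """Depth estimation loss plus intrinsic decoder loss, optionally weighted."""
    return w_depth * l_depth + w_intrinsic * l_intrinsic
```

Lowering w_intrinsic, for example, would let the depth term dominate the parameter update when the decomposition is already stable.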

[0316] This invention enables training on large-scale image data much more cost-effectively than existing supervised learning methods that require a large amount of labels or methods relying on stereo or LiDAR. Furthermore, it ensures stable monocular depth estimation performance even in complex indoor and outdoor environments where lighting changes, reflections, and dynamic objects are frequent.

[0317] Ultimately, the present invention provides a deep learning-based self-supervised learning technique that obtains high-quality 3D information even in real-world scenes with heavy optical noise, by jointly estimating depth information and the optical decomposition using only a monocular camera.

[0318]

[0319] FIG. 4 illustrates a brief and general schematic diagram of an exemplary computing environment in which embodiments of the present disclosure may be implemented.

[0320] Although the present disclosure has been described as generally being implementable by a computer device, those skilled in the art will understand that the present disclosure may be implemented in combination with computer-executable instructions and / or other program modules that can be executed on one or more computers and / or as a combination of hardware and software.

[0321] Generally, a program module includes routines, programs, components, data structures, etc., that perform a specific task or implement a specific abstract data type. Furthermore, those skilled in the art will be well aware that the method of the present disclosure may be implemented in other computer system configurations, including single-processor or multi-processor computer systems, minicomputers, mainframe computers, as well as personal computers, handheld computer devices, microprocessor-based or programmable consumer electronics, etc. (each of which may be connected to and operated with one or more associated devices).

[0322] The embodiments described in this disclosure may also be implemented in a distributed computing environment in which tasks are performed by remote processing devices connected via a communication network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.

[0323] Computers typically include various computer-readable media. Any medium accessible by a computer may be a computer-readable medium, and such computer-readable media include volatile and non-volatile media, transitory and non-transitory media, and removable and non-removable media. By way of example, and not limitation, computer-readable media may include computer-readable storage media and computer-readable transmission media. Computer-readable storage media include volatile and non-volatile media, transitory and non-transitory media, and removable and non-removable media implemented by any method or technique for storing information such as computer-readable instructions, data structures, program modules, or other data. Computer-readable storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, DVD (digital video disk) or other optical disk storage devices, magnetic cassettes, magnetic tapes, magnetic disk storage devices or other magnetic storage devices, or any other media that can be accessed by a computer and used to store desired information.

[0324] Computer-readable transmission media typically include all information transmission media that embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism. The term modulated data signal means a signal in which one or more of the characteristics of the signal are set or modified to encode information within the signal. By way of example, and not limitation, computer-readable transmission media include wired media, such as wired networks or direct-wired connections, and wireless media, such as acoustic, RF, infrared, and other wireless media. Any combination of the media described above is also considered to be within the scope of computer-readable transmission media.

[0325] An exemplary environment (1100) for implementing various aspects of the present disclosure, including a computer (1102), is shown, wherein the computer (1102) includes a processing unit (1104), system memory (1106), and a system bus (1108). The system bus (1108) connects system components, including system memory (1106) (but not limited thereto), to the processing unit (1104). The processing unit (1104) may be any processor among various commercial processors. Dual processors and other multiprocessor architectures may also be used as the processing unit (1104).

[0326] The system bus (1108) may be any of several types of bus structures that can be additionally interconnected to a local bus using any of the memory bus, peripheral bus, and various commercial bus architectures. System memory (1106) includes read-only memory (ROM) (1110) and random access memory (RAM) (1112). The basic input / output system (BIOS) is stored in non-volatile memory (1110), such as ROM, EPROM, EEPROM, etc., and this BIOS includes basic routines that help transfer information between components within the computer (1102) at times such as during startup. The RAM (1112) may also include high-speed RAM, such as static RAM, for caching data.

[0327] The computer (1102) also includes an internal hard disk drive (HDD) (1114) (e.g., EIDE, SATA)—this internal hard disk drive (1114) may also be configured for external use within a suitable chassis (not shown)—a magnetic floppy disk drive (FDD) (1116) (e.g., for reading from or writing to a removable diskette (1118)), and an optical disk drive (1120) (e.g., for reading from a CD-ROM disk (1122) or reading from or writing to other high-capacity optical media such as a DVD). The hard disk drive (1114), the magnetic disk drive (1116), and the optical disk drive (1120) may each be connected to the system bus (1108) by a hard disk drive interface (1124), a magnetic disk drive interface (1126), and an optical drive interface (1128). The interface (1124) for implementing an external drive includes at least one or both of USB (Universal Serial Bus) and IEEE 1394 interface technologies.

[0328] These drives and associated computer-readable media provide non-volatile storage of data, data structures, computer-executable instructions, etc. In the case of a computer (1102), the drives and media correspond to storing any data in a suitable digital format. Although the description of computer-readable media above refers to HDDs, removable magnetic disks, and removable optical media such as CDs or DVDs, those skilled in the art will know that other types of computer-readable media, such as zip drives, magnetic cassettes, flash memory cards, cartridges, etc., may also be used in exemplary operating environments and that any of these media may contain computer-executable instructions for performing the methods of the present disclosure.

[0329] A number of program modules, including an operating system (1130), one or more application programs (1132), other program modules (1134), and program data (1136), may be stored in the drives and RAM (1112). All or part of the operating system, applications, modules, and / or data may also be cached in RAM (1112). It will be appreciated that the present disclosure may be implemented with various commercially available operating systems or combinations of operating systems.

[0330] The user can input commands and information into the computer (1102) through one or more wired / wireless input devices, such as a keyboard (1138) and a pointing device such as a mouse (1140). Other input devices (not shown) may include a microphone, an IR remote control, a joystick, a game pad, a stylus pen, a touch screen, etc. These and other input devices are often connected to the processing unit (1104) via an input device interface (1142) connected to the system bus (1108), but may also be connected via other interfaces such as a parallel port, an IEEE 1394 serial port, a game port, a USB port, an IR interface, etc.

[0331] A monitor (1144) or other type of display device is also connected to the system bus (1108) via an interface such as a video adapter (1146). In addition to the monitor (1144), the computer generally includes other peripheral output devices (not shown), such as speakers, a printer, and so on.

[0332] The computer (1102) may operate in a networked environment using logical connections to one or more remote computers, such as remote computer(s) (1148), via wired and / or wireless communication. The remote computer(s) (1148) may be a workstation, a server computer, a router, a personal computer, a portable computer, a microprocessor-based entertainment device, a peer device, or other conventional network node, and generally include many or all of the components described for the computer (1102), but for brevity, only the memory storage device (1150) is illustrated. The illustrated logical connections include wired / wireless connections to a local area network (LAN) (1152) and / or a larger network, e.g., a wide area network (WAN) (1154). Such LAN and WAN networking environments are common in offices and companies and facilitate enterprise-wide computer networks, such as intranets, all of which can be connected to a global computer network, e.g., the Internet.

[0333] When used in a LAN networking environment, the computer (1102) is connected to the local network (1152) via a wired and / or wireless communication network interface or adapter (1156). The adapter (1156) may facilitate wired or wireless communication to the LAN (1152), and the LAN (1152) may also include a wireless access point installed therein to communicate with the wireless adapter (1156). When used in a WAN networking environment, the computer (1102) may include a modem (1158), be connected to a communication server on the WAN (1154), or have other means of establishing communication through the WAN (1154), such as through the Internet. The modem (1158), which may be an internal or external and a wired or wireless device, is connected to the system bus (1108) via a serial port interface (1142). In a networked environment, the program modules described for the computer (1102), or parts thereof, may be stored in a remote memory / storage device (1150). It will be appreciated that the illustrated network connections are exemplary and that other means of establishing a communication link between computers may be used.

[0334] The computer (1102) operates to communicate with any wireless device or object that is deployed and operated via wireless communication, for example, a printer, scanner, desktop and / or portable computer, PDA (portable data assistant), communication satellite, any equipment or place associated with a wireless detectable tag, and a telephone. This includes at least Wi-Fi and Bluetooth wireless technologies. Accordingly, the communication may be a predefined structure as in a conventional network, or simply ad hoc communication between at least two devices.

[0335] Wi-Fi (Wireless Fidelity) enables connectivity to the Internet and other sources without wires. Wi-Fi is a wireless technology, similar to that used in a cell phone, that allows devices such as computers to transmit and receive data indoors and outdoors, that is, anywhere within the coverage area of a base station. Wi-Fi networks use a wireless technology called IEEE 802.11 (a, b, g, etc.) to provide secure, reliable, and high-speed wireless connections. Wi-Fi can be used to connect computers to each other, to the Internet, and to wired networks (using IEEE 802.3 or Ethernet). Wi-Fi networks can operate in the unlicensed 2.4 GHz and 5 GHz radio bands, for example, at data rates of 11 Mbps (802.11b) or 54 Mbps (802.11a), or in products that include both bands (dual band).

[0336] Those skilled in the art of the present disclosure will understand that information and signals may be represented using any various different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced in the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.

[0337] Those skilled in the art will understand that the various exemplary logic blocks, modules, processors, means, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented by electronic hardware, various forms of programs or design code (referred to herein as software for convenience), or a combination of all such. To clearly illustrate this interoperability between hardware and software, various exemplary components, blocks, modules, circuits, and steps have been generally described above in relation to their functions. Whether such functions are implemented as hardware or software depends on the design constraints imposed on the specific application and the overall system. Those skilled in the art may implement the functions described in various ways for each specific application, but such implementation decisions should not be interpreted as being outside the scope of this disclosure.

[0338] The various embodiments presented herein may be implemented as methods, devices, or articles of manufacture using standard programming and / or engineering techniques. The term "article of manufacture" includes a computer program, a carrier, or a medium accessible from any computer-readable storage device. For example, computer-readable storage media include, but are not limited to, magnetic storage devices (e.g., hard disks, floppy disks, magnetic strips, etc.), optical discs (e.g., CDs, DVDs, etc.), smart cards, and flash memory devices (e.g., EEPROMs, cards, sticks, key drives, etc.). Additionally, the various storage media presented herein include one or more devices and / or other machine-readable media for storing information.

[0339] It should be understood that the specific order or hierarchy of steps in the presented processes is one example of possible approaches. It should be understood that the specific order or hierarchy of steps in the processes may be rearranged within the scope of this disclosure based on design priorities. The appended method claims present the elements of the various steps in a sample order, but are not meant to be limited to the specific order or hierarchy presented.

[0340] Description of the presented embodiments is provided so that a person skilled in the art may use or practice the present disclosure. Various modifications to these embodiments will be apparent to a person skilled in the art, and the general principles defined herein may be applied to other embodiments without departing from the scope of the present disclosure. Thus, the present disclosure is not limited to the embodiments presented herein, but should be interpreted in the broadest possible scope consistent with the principles and novel features presented herein.

[0341]

[0342] As described above, the relevant details have been described in the best mode for carrying out the invention.

[0343] The present invention can be used in technical fields where a technique for estimating depth in images is utilized.

Claims

1. A computer program stored on a computer-readable storage medium, wherein the computer program performs a method for estimating depth using a monocular camera for an image in which a reflective surface is present, the method comprising: acquiring a reference image and a source image; estimating a relative pose between the reference image and the source image using a pose network; generating a reference feature map from the reference image and a source feature map from the source image using a depth encoder; providing the reference feature map to a depth decoder to estimate a depth map for the reference image; providing the reference feature map and the source feature map to an intrinsic decoder to obtain a reference diffuse image and a reference residual image corresponding to the reference image, and a source diffuse image and a source residual image corresponding to the source image; and calculating a photometric reprojection loss between the reference image and the source image based on the estimated relative pose, the depth map, the reference diffuse image, the reference residual image, the source diffuse image, and the source residual image, and performing self-supervised learning by updating parameters of a monocular camera depth estimation model to minimize the photometric reprojection loss.

2. The computer program stored on a computer-readable storage medium of claim 1, wherein the reference diffuse image and the source diffuse image include a component that is maintained independently of optical changes for the same surface of an object in the reference image and the source image, and the reference residual image and the source residual image include a component that changes due to optical changes in the reference image and the source image.

3. The computer program stored on a computer-readable storage medium of claim 1, wherein the photometric reprojection loss is calculated by measuring the optical difference between the source image and a reprojected image generated by projecting the coordinate system of the reference image onto the coordinate system of the source image, using the relative pose between the reference image and the source image and the depth map corresponding to the reference image.

4. The computer program stored on a computer-readable storage medium of claim 1, wherein performing the self-supervised learning comprises: additionally calculating a diffuse reconstruction loss that maintains the correspondence between the reference diffuse image and the source diffuse image, and a residual reconstruction loss that maintains the correspondence between the reference residual image and the source residual image, and updating the parameters of the monocular camera depth estimation model to minimize the diffuse reconstruction loss and the residual reconstruction loss.

5. The computer program stored on a computer-readable storage medium of claim 1, wherein providing the reference feature map and the source feature map to the intrinsic decoder to obtain the reference diffuse image and the reference residual image corresponding to the reference image, and the source diffuse image and the source residual image corresponding to the source image, comprises: before inputting the reference feature map and the source feature map into the intrinsic decoder, converting each feature map into log space to separate the diffuse image and the residual image and to remove high-frequency components including highlights or reflected light.

6. The computer program stored on a computer-readable storage medium of claim 1, wherein, in calculating the photometric reprojection loss, dynamically moving objects or abnormal regions within the source image are excluded using an auto-masking technique so that reprojection errors for static targets are reflected in the learning.

7. The computer program stored on a computer-readable storage medium of claim 1, wherein the depth encoder processes the reference image and the source image at multiple scales, and the depth decoder and the intrinsic decoder each generate multi-scale outputs from the feature maps computed at each scale.

8. The computer program stored on a computer-readable storage medium of claim 7, wherein performing the self-supervised learning comprises: additionally performing multi-scale learning through the photometric reprojection loss between the multi-scale outputs.

9. The computer program stored on a computer-readable storage medium of claim 1, wherein performing the self-supervised learning comprises: filtering out, through Mahalanobis distance-based filtering, regions where the optical error between the reference image and the source image, between the reference diffuse image and the source diffuse image, and between the reference residual image and the source residual image is greater than or equal to a threshold level.

10. The computer program stored on a computer-readable storage medium of claim 1, wherein performing the self-supervised learning further uses a reflectance consistency loss that keeps the reflectance of the object surface consistent between the reference diffuse image and the source diffuse image, and a shading consistency loss that causes the illumination shading or reflected light between the reference residual image and the source residual image to vary by a physically consistent amount according to the viewpoint transformation, and comprises: updating the parameters of the monocular camera depth estimation model to minimize the reflectance consistency loss and the shading consistency loss.

11. The computer program stored on a computer-readable storage medium of claim 1, the method further comprising: additionally calculating a geometric shading loss using normal information estimated from the reference image and the source image so that the reference residual image and the source residual image reflect the correlation between the light source direction and the surface normal.

12. The computer program stored on a computer-readable storage medium of claim 1, wherein the diffuse image and the residual image are decomposed based on Equation 2, where I is the original image, comprising the reference image or the source image, L is the diffuse component, which is the intrinsic component of the object, and R is the residual component, which changes due to optical variation.

13. The computer program stored on a computer-readable storage medium of claim 4, wherein the diffuse reconstruction loss is defined by Equation 3, the terms of which are: the reference image; the reference diffuse image; a comparison target image obtained by reprojecting or aligning the reference residual image; and the L1 norm.

14. The computer program stored on a computer-readable storage medium of claim 1, wherein, to calculate the photometric reprojection loss, the coordinate system of the reference image and the coordinate system of the source image are aligned based on Equation 4, the terms of which are: the reprojected coordinate; the relative pose from the reference image to the source image; the depth map value at pixel (u,v) of the reference image; K, the camera intrinsic parameter matrix; and a function that projects from homogeneous coordinates to 2D.

15. The computer program stored on a computer-readable storage medium of claim 4, wherein the diffuse reconstruction loss and the residual reconstruction loss are integrated into a cross-reconstruction loss according to Equation 5, and the cross-reconstruction loss is used in training, the terms of Equation 5 being: the reference image; the reference residual image; and an image obtained by warping the source diffuse image into the reference coordinate system.

16. The computer program stored on a computer-readable storage medium of claim 1, wherein a contrast loss for suppressing variance between the diffuse images and the residual images produced by the intrinsic decoder is defined by Equation 6 and is used for training the monocular camera depth estimation model, the terms of Equation 6 being: a margin; indices i and j, which identify images within different batches; the batch size b; and the L2 norm.

17. The computer program stored on a computer-readable storage medium of claim 1, wherein, in the training phase of the monocular camera depth estimation model, the intrinsic decoder is trained by a loss function defined by Equation 7, the terms of which are: the diffuse reconstruction loss according to Equation 3; the cross-reconstruction loss according to Equation 5; the contrast loss according to Equation 6; and the weight of each term.

18. The computer program stored on a computer-readable storage medium of claim 1, wherein the photometric reprojection loss is defined by Equation 8, the terms of which are: the reference image or the reference diffuse image; an image obtained by projecting the source image (or the source diffuse image) onto the coordinate system of the reference image; M, a mask for excluding dynamically moving objects or abnormal regions; SSIM, the structural similarity; and the weights of the SSIM term and the L1 term.

19. The computer program stored on a computer-readable storage medium of claim 1, the method further comprising: identifying non-Lambertian regions in an image based on Equation 9, where P represents the photometric reprojection loss of Equation 8, the first error term is the photometric error for the reference image and source image pair, the second error term is the photometric error for the reference diffuse image and source diffuse image pair, L′ represents the pseudo-diffuse image, log(R) represents the log value of the residual component, and exp represents the exponential function, and wherein regions where the two error terms differ greatly are considered non-Lambertian regions and are excluded from training.

20. The computer program stored on a computer-readable storage medium of claim 9, wherein the Mahalanobis distance-based filtering is performed based on Equation 10, the terms of which are: the photometric errors defined by Equation 9; the mean vectors of the photometric errors; and α, where a pixel with α equal to 1 is regarded as a normal region and a pixel with α equal to 0 is filtered out as a region where the optical error is at or above the threshold level.

21. The computer program stored on a computer-readable storage medium of claim 1, wherein updating the parameters of the monocular camera depth estimation model to perform the self-supervised learning comprises: updating the parameters of the monocular camera depth estimation model using a depth estimation loss defined by Equation 11, the terms of which are: the Mahalanobis-based mask defined by Equation 10; and the photometric reprojection loss defined by Equation 8.

22. The computer program stored on a computer-readable storage medium of claim 21, wherein updating the parameters of the monocular camera depth estimation model to perform the self-supervised learning comprises: updating the parameters of the monocular camera depth estimation model using a final loss function defined by Equation 12, the terms of which are: the depth estimation loss defined by Equation 11; and the intrinsic decoder loss defined by Equation 7.