A unified indoor and outdoor monocular depth estimation method based on scale decoupling, electronic devices, and storage media.

By using a scale decoupling method, monocular depth estimation is divided into two parts: relative depth estimation and scale prediction. The CLIP image encoder and Transformer feature interactor are used to perform unified indoor and outdoor depth estimation, which solves the problem of poor model adaptability in existing technologies and achieves efficient depth estimation results.

CN118447066B · Active · Publication Date: 2026-06-30 · UNIV OF SCI & TECH OF CHINA

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
UNIV OF SCI & TECH OF CHINA
Filing Date
2024-04-30
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing monocular depth estimation methods are difficult to adapt to both indoor and outdoor scenes simultaneously, and existing technologies increase model complexity or are time-consuming and labor-intensive.

Method used

By using a scale decoupling method, absolute depth estimation is decoupled into relative depth estimation and scale prediction. Multi-scale fusion and semantic awareness are performed using a feature interactor built with CLIP image encoder and Transformer. The image-text similarity constraint is combined for training to achieve unified indoor and outdoor depth estimation.

Benefits of technology

It simplifies the model training process, improves the accuracy and generalization ability of depth estimation in scenes with large scale differences, and realizes joint depth estimation for indoor and outdoor scenes.

✦ Generated by Eureka AI based on patent content.

Figure CN118447066B_ABST

Abstract

This invention provides a unified indoor / outdoor monocular depth estimation method, electronic device, and storage medium based on scale decoupling. The method includes: using a feature interactor to interact with multiple depth partitioning queries and multi-scale fusion features of the target image; processing the updated depth partitioning queries using an adaptive relative depth estimation module; obtaining the center value of the depth interval based on the normalization result of the depth partitioning, and performing similarity calculation with the depth feature representation to obtain the probability of the depth interval; processing the probability of the depth interval with the center value of the depth interval to obtain the relative depth estimation result of the target image; realizing the interaction between the scale query and the multi-scale fusion features through the feature interactor; projecting the semantically aware scale query into a scale factor using a semantically aware scale prediction module, and calculating the scale factor and the relative depth estimation result to obtain the absolute depth estimation result of the target image.

Description

Technical Field

[0001] This invention relates to the fields of computer vision and artificial intelligence, and in particular to a unified indoor and outdoor monocular depth estimation method, electronic device, and storage medium based on scale decoupling.

Background Technology

[0002] Monocular depth estimation is a key technology in computer vision, used to accurately estimate the vertical distance between each pixel in an image and the camera using a single image. This technology has significant applications in fields such as autonomous driving, 3D reconstruction, and augmented reality.

[0003] Currently, depth estimation methods can be mainly divided into two categories: relative depth estimation and absolute depth estimation. Relative depth estimation lacks absolute scale information, making it unsuitable for complex applications requiring precise depth information. Absolute depth estimation methods typically rely on training on specific datasets using depth labels. This training paradigm has a significant problem: the resulting models often cannot simultaneously adapt to the depth estimation needs of different scenarios, such as indoor and outdoor environments. To address the technical problems of absolute depth estimation methods, researchers have proposed numerous solutions. For example, some methods attempt to design different regression heads for different scenarios, but this increases the model's complexity. Other methods try to improve the model's generalization ability by utilizing large datasets, but this is not only time-consuming and labor-intensive but also increases the difficulty of the training process.

Summary of the Invention

[0004] In view of the above problems, the present invention provides a unified indoor and outdoor monocular depth estimation method, electronic device and storage medium based on scale decoupling, in order to solve at least one of the above problems.

[0005] According to a first aspect of the present invention, a unified indoor and outdoor monocular depth estimation method based on scale decoupling is provided, comprising:

[0006] The CLIP image encoder of the trained monocular depth estimation model is used to obtain multi-scale image features of the target image, and the pixel decoder of the trained monocular depth estimation model is used to perform multi-scale fusion of multi-scale image features to obtain multi-scale fused features with global and local perception.

[0007] Multiple depth partitioning queries are initialized, and the feature interactor of the trained monocular depth estimation model is used to realize the interaction between the multiple depth partitioning queries and the multi-scale fusion features to obtain the updated depth partitioning queries, where each depth partitioning query corresponds to a depth interval.

[0008] The updated depth partition query is processed using the adaptive relative depth estimation module of the trained monocular depth estimation model to obtain the depth feature representation and the depth partition used to generate each depth interval.

[0009] The normalization result based on depth partitioning is used to obtain the center value of each depth interval. After flattening the multi-scale fusion feature, similarity calculation is performed with the depth feature representation to obtain the probability of each depth interval.

[0010] The probability of each depth interval is linearly weighted with the center value of each depth interval to obtain the relative depth estimation result of the target image.

[0011] According to embodiments of the present invention, the above-described unified indoor and outdoor monocular depth estimation method based on scale decoupling further includes:

[0012] A scale query is initialized, and the interaction between the scale query and the multi-scale fusion features is realized through the feature interactor of the trained monocular depth estimation model to obtain a semantically aware scale query, wherein the semantically aware scale query has global information of the target image.

[0013] The semantic perception scale prediction module of the trained monocular depth estimation model projects the semantic perception scale query into a scale factor, and calculates the scale factor and the relative depth estimation result to obtain the absolute depth estimation result of the target image. Here, the scale factor represents the scene depth range of the target image.

[0014] According to an embodiment of the present invention, the feature interactor of the monocular depth estimation model trained above is constructed based on Transformer, and the trained feature interactor includes a multi-head self-attention unit, a multi-head cross-attention unit, and a feedforward network.

[0015] The trained monocular depth estimation model has multiple feature interactors, and each feature interactor corresponds to one scale fusion feature in the multi-scale fusion features.

[0016] According to an embodiment of the present invention, initializing multiple depth partitioning queries and using the feature interactor of the trained monocular depth estimation model to realize the interaction between the multiple depth partitioning queries and the multi-scale fusion features, so as to obtain the updated depth partitioning queries, includes:

[0017] The feature interactor of the trained monocular depth estimation model is used to interact with the fusion features of each scale in the multi-scale fusion feature to update the depth partition query and obtain the updated depth partition query.

[0018] According to an embodiment of the present invention, processing the updated depth partition query using the adaptive relative depth estimation module of the trained monocular depth estimation model, to obtain the depth feature representation for generating each depth interval and the depth partition for generating each depth interval, includes:

[0019] The first multilayer perceptron of the adaptive relative depth estimation module of the trained monocular depth estimation model is used to process the updated depth partition query to obtain the depth feature representation used to generate each depth interval.

[0020] The updated depth partition query is processed by the second multilayer perceptron of the adaptive relative depth estimation module of the trained monocular depth estimation model to obtain the depth partition used to generate each depth interval.

[0021] According to an embodiment of the present invention, obtaining the center value of each depth interval based on the normalization result of the depth partition, and obtaining the probability of each depth interval by flattening the multi-scale fusion feature and performing a similarity calculation with the depth feature representation, includes:

[0022] The depth partition corresponding to each depth interval is normalized, and the center value of each depth interval is obtained from the normalization result using a predefined center value calculation method.

[0023] The fusion features of a predefined scale among the multi-scale fusion features are flattened, and a similarity calculation is performed between the flattened result and the depth feature representation.

[0024] The similarity calculation results are normalized on a predefined dimension to obtain the probability of pixels in the target image in each depth interval.

[0025] According to a second aspect of the present invention, a training method for a unified indoor / outdoor monocular depth estimation model based on scale decoupling is provided, applied to the aforementioned unified indoor / outdoor monocular depth estimation method based on scale decoupling, comprising:

[0026] Based on image-text similarity constraints, the monocular depth estimation model is trained using image samples, a predefined loss function, and a pre-trained CLIP text encoder, resulting in a trained monocular depth estimation model.

[0027] According to an embodiment of the present invention, training the monocular depth estimation model based on image-text similarity constraints, using image samples, a predefined loss function, and a pre-trained CLIP text encoder, to obtain the trained monocular depth estimation model includes:

[0028] Multi-scale fusion features of image samples are obtained using the CLIP image encoder and pixel decoder of the monocular depth estimation model. Based on the depth partition query, the feature interactor of the monocular depth estimation model is used to process the multi-scale fusion features to obtain the updated depth partition query.

[0029] The updated depth partition query is processed using the adaptive relative depth estimation module of the monocular depth estimation model to obtain the relative depth estimation results of the image samples. Based on the scale query, the multi-scale fusion features are processed using the feature interactor of the monocular depth estimation model to obtain the updated scale query.

[0030] Image-text similarity constraints are constructed by introducing image categories, and based on these constraints, the text features of the image samples are obtained using a pre-trained CLIP text encoder.

[0031] The scene category information of the image samples is obtained by using the text features of the image samples and the updated scale query. The scene category label and scene category information of the image samples are processed by the predefined cross-entropy loss function to obtain the cross-entropy loss value.

[0032] The multilayer perceptron of the semantic perception scale prediction module of the monocular depth estimation model projects the scale query of the image sample into the scale factor of the image sample, and multiplies the scale factor of the image sample and the relative depth estimation result element by element to obtain the absolute depth estimation result of the image sample.

[0033] The relative depth estimation results, absolute depth estimation results, and true depth values of image samples are processed using a predefined scale-invariant loss function to obtain the scale-invariant loss value. The scale-invariant loss value and the cross-entropy loss value are then used to update the parameters of the monocular depth estimation model until the preset training conditions are met, resulting in a trained monocular depth estimation model.
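The text does not spell out the predefined scale-invariant loss function. As a minimal sketch, assuming the commonly used scale-invariant log loss, it could look like the following. The function name `scale_invariant_loss`, the weight `lam`, and its default of 0.85 are illustrative assumptions, not values taken from this document, and how the relative and absolute results are each fed into the loss is likewise not specified here.

```python
import math

def scale_invariant_loss(pred, target, lam=0.85):
    """Scale-invariant log loss over pixels with valid (positive) ground truth.

    pred and target are flat lists of positive depth values.
    lam is an assumed weighting term, not a value given in the source.
    """
    diffs = [math.log(p) - math.log(t) for p, t in zip(pred, target) if t > 0]
    n = len(diffs)
    mean_sq = sum(d * d for d in diffs) / n
    mean = sum(diffs) / n
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(mean_sq - lam * mean * mean, 0.0))

target = [1.0, 2.0, 4.0]
doubled = [2.0 * t for t in target]

exact = scale_invariant_loss(target, target)            # 0.0 for a perfect prediction
fully_invariant = scale_invariant_loss(doubled, target, lam=1.0)
partly_invariant = scale_invariant_loss(doubled, target)
```

With `lam=1.0` a prediction that is wrong only by a global factor is not penalized, while smaller values of `lam` keep part of that penalty.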

[0034] According to a third aspect of the present invention, an electronic device is provided, comprising:

[0035] One or more processors;

[0036] A storage device for storing one or more programs.

[0037] Wherein, when the one or more programs are executed by the one or more processors, the one or more processors execute the aforementioned unified indoor and outdoor monocular depth estimation method based on scale decoupling and the training method of the unified indoor and outdoor monocular depth estimation model based on scale decoupling.

[0038] According to a fourth aspect of the present invention, a computer-readable storage medium is provided having executable instructions stored thereon, which, when executed by a processor, cause the processor to perform the above-described unified indoor / outdoor monocular depth estimation method based on scale decoupling and the training method of the unified indoor / outdoor monocular depth estimation model based on scale decoupling.

[0039] The unified indoor / outdoor monocular depth estimation method based on scale decoupling provided by this invention decouples the absolute depth of the target image into two parts: relative depth estimation and scale prediction. In the scale prediction part, textual information is introduced to guide the scale query to learn semantically aware global features. Through scale decoupling, the difficulty of absolute depth estimation in scenes with large scale differences is effectively reduced, achieving the goal of joint depth estimation in indoor and outdoor scenes using a unified and concise model.

Attached Figure Description

[0040] Figure 1 is a flowchart of a unified indoor and outdoor monocular depth estimation method based on scale decoupling according to an embodiment of the present invention;

[0041] Figure 2 is an architecture diagram of a unified indoor and outdoor monocular depth estimation method based on scale decoupling according to an embodiment of the present invention, and a training diagram of the semantically aware scale prediction module;

[0042] Figure 3 is a diagram of an electronic device suitable for implementing a unified indoor / outdoor monocular depth estimation method based on scale decoupling and a training method for a unified indoor / outdoor monocular depth estimation model based on scale decoupling, according to an embodiment of the present invention.

Detailed Implementation

[0043] To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention will be further described in detail below with reference to specific embodiments and accompanying drawings.

[0044] Currently, depth estimation methods can be mainly divided into two categories: relative depth estimation and absolute depth estimation. Relative depth estimation focuses primarily on inferring the relative depth relationships between objects in a scene. However, due to the lack of absolute scale information, these methods cannot be deployed in complex application scenarios requiring precise depth information, such as robot grasping and obstacle avoidance. Therefore, absolute depth estimation has gradually become the mainstream. Absolute depth estimation methods typically rely on training on specific datasets using depth labels, directly regressing pixel-by-pixel depth maps from images. However, this training paradigm has a significant problem: the resulting models often cannot simultaneously adapt to the depth estimation requirements of different scenes, such as indoors and outdoors. To address this issue, some methods attempt to design different regression heads for different scenes, which undoubtedly increases the complexity of the model. Other methods try to improve the model's generalization ability by utilizing large datasets, but this is not only time-consuming and labor-intensive but also increases the difficulty of the training process.

[0045] It is worth noting that the aforementioned methods generally neglect the crucial element of scene depth range (scale). In reality, relative depth information can be relatively easily recovered from image cues and can generalize across multiple scenes. Therefore, decoupling absolute depth estimation into two sub-tasks—relative depth estimation and scale prediction—simplifies the absolute depth estimation process and helps improve the accuracy of depth estimation in scenes with large scale variations. To this end, this invention, by decoupling absolute depth, designs a depth estimation framework that can adapt to different scenes, such as indoor and outdoor environments. Trained on only two datasets, it can achieve joint depth estimation for indoor and outdoor scenes, effectively improving the model's accuracy and generalization.

[0046] The purpose of this invention is to design a concise and efficient unified framework for joint indoor and outdoor monocular depth estimation, addressing the problems in existing technologies. The main components of this invention include: predicting the scene depth range using scene category semantic information and image information; estimating relative depth per unit depth using classification regression; and combining relative depth and depth range to obtain the absolute depth of each point in the scene.

[0047] Figure 1 is a flowchart of a unified indoor and outdoor monocular depth estimation method based on scale decoupling according to an embodiment of the present invention.

[0048] As shown in Figure 1, the above-mentioned unified indoor and outdoor monocular depth estimation method based on scale decoupling includes operations S110 to S170.

[0049] In operation S110, the CLIP image encoder of the trained monocular depth estimation model is used to obtain multi-scale image features of the target image, and the pixel decoder of the trained monocular depth estimation model is used to perform multi-scale fusion of the multi-scale image features to obtain multi-scale fused features with global and local perception.

[0050] CLIP (Contrastive Language-Image Pre-training) is a multimodal model pre-trained with contrastive learning on image-text pairs; it includes an image encoder and a text encoder.

[0051] The aforementioned multi-scale image features include multiple image features with different scales.

[0052] There are multiple pixel decoders, and each pixel decoder corresponds to the image feature at one scale.

[0053] In operation S120, multiple depth partitioning queries are initialized, and the feature interactor of the trained monocular depth estimation model is used to realize the interaction between multiple depth partitioning queries and multi-scale fusion features to obtain updated depth partitioning queries, where each depth partitioning query corresponds to a depth interval.

[0054] According to an embodiment of the present invention, the feature interactor of the trained monocular depth estimation model is constructed based on Transformer. The feature interactor of the trained monocular depth estimation model includes a multi-head self-attention unit, a multi-head cross-attention unit, and a feedforward network. There are multiple feature interactors of the trained monocular depth estimation model, and each trained feature interactor corresponds to one scale fusion feature in the multi-scale fusion features.

[0055] According to an embodiment of the present invention, the above-mentioned initialization of multiple depth partitioning queries and the use of the feature interactor of the trained monocular depth estimation model to realize the interaction between the multiple depth partitioning queries and the multi-scale fusion features to obtain the updated depth partitioning query includes: using the feature interactor of the trained monocular depth estimation model to interact with the fusion features of each scale in the multi-scale fusion features in order to update the depth partitioning query and obtain the updated depth partitioning query.

[0056] In operation S130, the adaptive relative depth estimation module of the trained monocular depth estimation model is used to process the updated depth partition query to obtain the depth feature representation for generating each depth interval and the depth partition for generating each depth interval.

[0057] The aforementioned adaptive relative depth estimation module includes multiple multilayer perceptrons, each of which is used to obtain different information about the target image from the updated depth partition query.

[0058] According to an embodiment of the present invention, the above-mentioned processing of the updated depth partition query using the adaptive relative depth estimation module of the trained monocular depth estimation model to obtain the depth feature representation for generating each depth interval and the depth partition for generating each depth interval includes: processing the updated depth partition query using the first multilayer perceptron of the adaptive relative depth estimation module of the trained monocular depth estimation model to obtain the depth feature representation for generating each depth interval; and processing the updated depth partition query using the second multilayer perceptron of the adaptive relative depth estimation module of the trained monocular depth estimation model to obtain the depth partition for generating each depth interval.

[0059] In operation S140, the normalization result based on depth partitioning is used to obtain the center value of each depth interval. After flattening the multi-scale fusion feature, similarity calculation is performed with the depth feature representation to obtain the probability of each depth interval.

[0060] According to an embodiment of the present invention, obtaining the center value of each depth interval based on the normalization result of the depth partition, and obtaining the probability of each depth interval by flattening the multi-scale fusion features and performing a similarity calculation with the depth feature representation, includes: normalizing the depth partition corresponding to each depth interval, and obtaining the center value of each depth interval from the normalization result using a predefined center value calculation method; flattening the fusion features of a predefined scale among the multi-scale fusion features, and performing a similarity calculation between the flattened result and the depth feature representation; and normalizing the similarity calculation result along a predefined dimension to obtain the probability of each pixel in the target image belonging to each depth interval.

[0061] In operation S150, the probability of each depth interval is linearly weighted with the center value of each depth interval to obtain the relative depth estimation result of the target image.

[0062] In operation S160, a scale query is initialized, and the interaction between the scale query and the multi-scale fusion features is realized through the feature interactor of the trained monocular depth estimation model to obtain a semantically aware scale query, which has global information of the target image.

[0063] In operation S170, the semantic perception scale prediction module of the trained monocular depth estimation model projects the semantic perception scale query into a scale factor, and calculates the scale factor and the relative depth estimation result to obtain the absolute depth estimation result of the target image, where the scale factor represents the scene depth range of the target image.
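The final combination step can be illustrated with a short sketch. It assumes the scale factor is a single value per image that multiplies the relative depth map element by element, as the training description states. The function name `absolute_depth` and the scale values 10.0 and 80.0 are illustrative only; in the method itself the scale factor is predicted from the semantically aware scale query.

```python
def absolute_depth(relative, scale_factor):
    """Multiply every relative depth value (in [0, 1]) by the scene scale factor."""
    return [scale_factor * r for r in relative]

relative = [0.125, 0.25, 0.5]              # relative depths of three pixels

indoor = absolute_depth(relative, 10.0)    # [1.25, 2.5, 5.0] meters
outdoor = absolute_depth(relative, 80.0)   # [10.0, 20.0, 40.0] meters
```

The same relative depth map yields very different absolute depths depending on the predicted scale, which is why the two sub-tasks can be learned separately.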

[0064] The unified indoor / outdoor monocular depth estimation method based on scale decoupling provided by this invention decouples the absolute depth of the target image into two parts: relative depth estimation and scale prediction. In the scale prediction part, textual information is introduced to guide the scale query to learn semantically aware global features. Through scale decoupling, the difficulty of absolute depth estimation in scenes with large scale differences is effectively reduced, achieving the goal of joint depth estimation in indoor and outdoor scenes using a unified and concise model.

[0065] According to an embodiment of the present invention, a training method for a unified indoor and outdoor monocular depth estimation model based on scale decoupling is provided, which is applied to the aforementioned unified indoor and outdoor monocular depth estimation method based on scale decoupling. The method includes: training the monocular depth estimation model based on image-text similarity constraints, using image samples, a predefined loss function, and a pre-trained CLIP text encoder to obtain a trained monocular depth estimation model.

[0066] According to an embodiment of the present invention, training the monocular depth estimation model based on image-text similarity constraints, using image samples, a predefined loss function, and a pre-trained CLIP text encoder, to obtain the trained monocular depth estimation model includes: obtaining multi-scale fusion features of the image samples using the CLIP image encoder and pixel decoder of the monocular depth estimation model; based on the depth partition query, processing the multi-scale fusion features using the feature interactor of the monocular depth estimation model to obtain an updated depth partition query; processing the updated depth partition query using the adaptive relative depth estimation module of the monocular depth estimation model to obtain the relative depth estimation results of the image samples; based on the scale query, processing the multi-scale fusion features using the feature interactor of the monocular depth estimation model to obtain an updated scale query; constructing image-text similarity constraints by introducing image categories and, based on these constraints, obtaining text features of the image samples using the pre-trained CLIP text encoder; obtaining scene category information of the image samples using the text features of the image samples and the updated scale query; processing the scene category labels and scene category information of the image samples with a predefined cross-entropy loss function to obtain a cross-entropy loss value; projecting the scale query of the image samples into scale factors of the image samples using the multilayer perceptron of the semantically aware scale prediction module of the monocular depth estimation model, and multiplying the scale factors and the relative depth estimation results element-wise to obtain the absolute depth estimation results of the image samples; processing the relative depth estimation results, absolute depth estimation results, and true depth values of the image samples with a predefined scale-invariant loss function to obtain a scale-invariant loss value; and updating the parameters of the monocular depth estimation model using the scale-invariant loss value and the cross-entropy loss value until the preset training conditions are met, resulting in a trained monocular depth estimation model.

[0067] According to an embodiment of the present invention, the above-mentioned method of obtaining scene category information of image samples by using text features and scale queries of image samples includes: processing text features and scale queries of image samples using a predefined image-text similarity calculation method to obtain scene category information of image samples, wherein the scene category information of image samples represents the probability distribution of image samples belonging to various scene categories.
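The predefined image-text similarity calculation is not detailed in the text. A minimal sketch, assuming a CLIP-style cosine similarity followed by a softmax over scene categories, is shown below together with the cross-entropy loss on the scene label. The function names, the temperature of 0.07, and the toy feature vectors are all hypothetical and chosen only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def scene_probabilities(scale_query, text_features, temperature=0.07):
    """Softmax over the similarities between the scale query and each category's text feature."""
    logits = [cosine(scale_query, t) / temperature for t in text_features]
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def cross_entropy(probs, label_index):
    """Negative log-probability assigned to the labeled scene category."""
    return -math.log(probs[label_index])

# Hypothetical text features for two scene categories and one updated scale query.
text_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
scale_query = [0.9, 0.1, 0.0]

probs = scene_probabilities(scale_query, text_features)
loss_if_label_0 = cross_entropy(probs, 0)
loss_if_label_1 = cross_entropy(probs, 1)
```

The loss is small when the scale query aligns with the text feature of the labeled category, which is how the text branch can guide the scale query toward scene-level semantics during training.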

[0068] The unified indoor and outdoor monocular depth estimation method based on scale decoupling provided by the present invention is described in further detail below with reference to specific embodiments and Figure 2.

[0069] Figure 2 is an architecture diagram of a unified indoor and outdoor monocular depth estimation method based on scale decoupling according to an embodiment of the present invention, and a training diagram of the semantically aware scale prediction module.

[0070] As shown in Figure 2, the unified indoor and outdoor monocular depth estimation method based on scale decoupling provided by the present invention mainly includes a pre-trained adaptive relative depth estimation module and a semantically aware scale prediction module.

[0071] (1) Pre-trained adaptive relative depth estimation module

[0072] A classification-regression approach is used to predict the depth distribution of each pixel in an image within a normalized depth range (0-1 meter). First, the normalized depth range is divided into N depth intervals. Based on the similarity between image features and depth interval features, the probability of a pixel belonging to each depth interval is obtained. This probability is then used to linearly weight the center values of the depth intervals to obtain the pixel-by-pixel relative depth prediction result.

[0073] Specifically, given an input image, multi-scale image features are extracted using the trained CLIP image encoder. These features are then further fused across multiple scales using the trained pixel decoder to obtain multi-scale features with both global and local awareness, where l indexes the scale of a feature map and D is the feature dimension. For the sake of brevity, the subscript l will be omitted in the following description of this invention.

[0074] Next, initialize N depth partition queries. Each query corresponds to a depth interval. A trained feature interactor is used to implement the interaction between the depth partitioning query and image features. The trained feature interactor is a standard Transformer architecture, consisting of multi-head self-attention, multi-head cross-attention, and a feedforward network. In the trained feature interactor, the depth partitioning query interacts sequentially with features at each scale, updating itself using image features to obtain an updated depth partitioning query with full image awareness. This is used to subsequently divide the depth range into N depth intervals. Compared to manually dividing depth intervals, using queries to adaptively divide depth intervals enhances the model's adaptability to different scenarios.
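The sequential query update can be sketched as follows. This is a heavily simplified illustration: it uses a single attention head, identity projections instead of learned weights, a plain ReLU in place of the feedforward network, and no layer normalization, and it reuses one weight-free layer for every scale. All function names are hypothetical.

```python
import math

def matmul(A, B):
    """Multiply an n x d matrix by a d x m matrix (lists of rows)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax_rows(A):
    out = []
    for row in A:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        z = sum(exps)
        out.append([e / z for e in exps])
    return out

def attend(queries, keys_values):
    """Scaled dot-product attention with identity projections (a simplification)."""
    d = len(queries[0])
    scores = matmul(queries, transpose(keys_values))
    scores = [[s / math.sqrt(d) for s in row] for row in scores]
    return matmul(softmax_rows(scores), keys_values)

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def feed_forward(X):
    """Stand-in for the feedforward network: a ReLU with no learned weights."""
    return [[max(0.0, x) for x in row] for row in X]

def interact(queries, features):
    """Self-attention, cross-attention, then feedforward, each with a residual connection."""
    queries = add(queries, attend(queries, queries))    # queries exchange information
    queries = add(queries, attend(queries, features))   # queries read image features
    return add(queries, feed_forward(queries))

def update_queries(queries, multi_scale_features):
    for features in multi_scale_features:               # one interaction per scale, in order
        queries = interact(queries, features)
    return queries

queries = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]         # N = 3 queries, D = 2
scales = [
    [[1.0, 0.0], [0.0, 1.0]],                           # coarse scale: 2 flattened positions
    [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0], [2.0, 0.0]],  # finer scale: 4 flattened positions
]
updated = update_queries(queries, scales)
```

The number of queries stays fixed at N while the number of image positions varies per scale, so the same set of queries can absorb information from every resolution.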

[0075] The depth partitioning queries updated by the trained feature interactor are passed through two independent multilayer perceptrons: one generates the feature representation of each depth interval, and the other generates the partition of the depth intervals. The depth partition is normalized as shown in formula (1):

[0076]

[0077] where each normalized value represents the proportion of its depth interval within the overall depth range. For a normalized depth range (0-1 meter), it is also the length of that depth interval. Therefore, the center value of the i-th depth interval can be calculated using formula (2):

[0078]

[0079] where the term with index 0 is defined as 0, and the i-th term represents the length of the i-th depth interval. For each depth interval, its center value is used to represent the depth corresponding to that interval.
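The partition normalization and center-value computation described above can be sketched as follows. This is an illustration under stated assumptions rather than the patented formulas: the sum normalization with an optional epsilon in `normalize_partition` is an assumption (the text only states that the partition is normalized), and both function names are hypothetical.

```python
def normalize_partition(raw, eps=1e-3):
    """Turn raw MLP outputs into interval lengths that sum to 1.

    Assumption: negative values are clipped to zero and a simple sum
    normalization is applied; the source does not say which form
    formula (1) takes.
    """
    shifted = [max(r, 0.0) + eps for r in raw]
    total = sum(shifted)
    return [s / total for s in shifted]


def interval_centers(lengths):
    """Center of interval i = sum of the earlier lengths + half its own length."""
    centers = []
    start = 0.0  # left edge of the first interval in the 0-1 range
    for length in lengths:
        centers.append(start + 0.5 * length)
        start += length
    return centers


lengths = normalize_partition([1.0, 3.0, 4.0], eps=0.0)
centers = interval_centers(lengths)
print(lengths)  # [0.125, 0.375, 0.5]
print(centers)  # [0.0625, 0.3125, 0.75]
```

Because the lengths sum to 1, the intervals tile the normalized depth range exactly, and wider intervals place their centers farther apart.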

[0080] To classify pixels into their corresponding depth intervals, the image features of the last layer are flattened into F, and their similarity with the depth interval features E is computed, as shown in formula (3):

[0081] $\mathrm{Similarity} = E \times F^{T}$ (3),

[0082] The similarity is then normalized along its first dimension to obtain the probability of each pixel belonging to each depth interval, as shown in formula (4):

[0083] $P = \mathrm{Softmax}(\mathrm{Similarity})$ (4),

[0084] where P holds, for each pixel, a probability distribution over the N depth intervals.

[0085] The center values of the depth intervals are linearly weighted by this probability to obtain the relative depth representation of the image, as shown in formula (5):

[0086] $R = \theta^{T} \times P$ (5),

[0087] where θ is the vector of depth interval center values and R is the relative depth estimation result.
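Formulas (3) to (5) can be traced end to end with a small pure-Python sketch. The shapes are assumptions inferred from the formulas (E as N x D interval features, F as M x D flattened pixel features, θ as N center values); the function names and toy values are hypothetical.

```python
import math


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def relative_depth(E, F, theta):
    """Return one relative depth per pixel.

    E: N x D depth interval features, F: M x D flattened pixel features,
    theta: N depth interval center values in the 0-1 range.
    """
    depths = []
    for pixel in F:
        # One column of Similarity = E x F^T (formula 3).
        sims = [sum(e * f for e, f in zip(row, pixel)) for row in E]
        # Softmax over the N intervals, i.e. the first dimension (formula 4).
        probs = softmax(sims)
        # Probability-weighted sum of the interval centers (formula 5).
        depths.append(sum(t * p for t, p in zip(theta, probs)))
    return depths


E = [[1.0, 0.0], [0.0, 1.0]]               # two depth intervals
theta = [0.25, 0.75]                       # their center values
F = [[5.0, 0.0], [0.0, 5.0], [1.0, 1.0]]   # three pixels
depths = relative_depth(E, F, theta)
```

A pixel whose feature matches one interval strongly lands near that interval's center, while an ambiguous pixel (the third one here) receives a blend of the centers, which is what makes the output continuous rather than quantized.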

[0088] (2) Semantically aware scale prediction module

[0089] The depth range differs significantly between scene categories (indoor scenes typically span 0 to 10 meters, outdoor scenes 0 to 80 meters), while scenes of the same category have a largely consistent depth range. The semantically aware scale prediction module therefore uses scene category information to help the model capture global semantic information, thereby achieving more accurate scale prediction.

[0090] Specifically, one scale query is initialized. Through the trained feature interactor, it fully interacts with the image features, yielding a scale query that carries global image information. Considering that scene category information is not always available in practical applications, this module innovatively introduces image categories to construct image-text similarity constraints. This approach encourages the scale query to learn global semantic features from images, enabling the model to generalize to scenes with unknown categories without explicitly relying on scene category information.

[0091] Read all C scene categories, such as kitchen, office, outdoor scenes, etc., and fill the scene category into the preset text prompt template "This is a photo of [scene category]". These text prompts are processed by the frozen pre-trained CLIP text encoder to obtain the text feature representations of the C scene categories. The image-text similarity T between the current image and the i-th category can be calculated according to formula (6):

[0092]

[0093] where cos<·> computes the cosine similarity between two features, the text feature term is the representation of the i-th category produced by the CLIP text encoder, and τ is the temperature coefficient. The calculated image-text similarity represents the probability distribution of the image over the scene categories and can be supervised using scene category information.
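The image-text similarity can be sketched as a temperature-scaled softmax over cosine similarities. This is an illustration only: the three-dimensional vectors stand in for real CLIP text features and for the scale query, the value τ = 0.07 is an assumption (the text does not give it), and the function names are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def image_text_similarity(scale_query, text_features, tau=0.07):
    """Softmax over cosine similarities divided by the temperature tau."""
    logits = [cosine(scale_query, t) / tau for t in text_features]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


categories = ["kitchen", "office", "outdoor scenes"]
prompts = [f"This is a photo of {c}" for c in categories]

# Toy stand-ins: in practice each prompt would be encoded by the frozen
# CLIP text encoder, and the query would come from the feature interactor.
text_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
scale_query = [1.0, 0.2, 0.0]

T = image_text_similarity(scale_query, text_features)
best = categories[T.index(max(T))]
```

A smaller τ sharpens the distribution toward the best-matching category, while a larger τ flattens it.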

[0094] After the semantically aware scale query is obtained, it is projected into a scale factor through a multilayer perceptron. This scale factor represents the depth range of the scene. It is multiplied element-wise with the relative depth to obtain the absolute depth.
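The effect of decoupling can be seen in a small sketch: one relative depth map yields different absolute depths depending only on the scale factor. The scale values 10 and 80 are illustrative, taken from the typical indoor and outdoor ranges mentioned earlier, and the function name is hypothetical. The multilayer perceptron that would predict the scale factor is not modeled here.

```python
def absolute_depth(relative, scale):
    """Multiply every relative depth value (0-1) by the scene scale factor."""
    return [[scale * r for r in row] for row in relative]


relative = [[0.1, 0.5], [0.9, 1.0]]        # toy 2 x 2 relative depth map

indoor = absolute_depth(relative, 10.0)    # assumed indoor scale, in meters
outdoor = absolute_depth(relative, 80.0)   # assumed outdoor scale, in meters
```

The ordering of pixels by depth is identical in both outputs; only the metric range changes, which is why the relative branch can be shared across indoor and outdoor scenes.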

[0095] To effectively supervise the prediction of relative and absolute depth, the scale-invariant loss is modified as follows, as shown in Equation (7):

[0096]

[0097] where the target term denotes the ground-truth depth value, and the two operators denote the variance and the expectation, respectively. The first term in the loss function is scale-invariant and provides effective supervision of relative depth. The second term incorporates the scale, indirectly constraining it. α and λ are scaling coefficients, set to 10 and 0.15, respectively. This scale-invariant loss provides accurate supervision for both relative and absolute depth prediction, enabling the model to learn depth cues from images.
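The roles of the two terms can be checked numerically. The sketch below assumes the common form of the scale-invariant log loss, α·sqrt(Var[g] + λ·E[g]²) with g = log(prediction) − log(ground truth); formula (7) itself is not reproduced in the text, so this form and the function names are assumptions. Only α = 10 and λ = 0.15 come from the description.

```python
import math


def log_error_stats(pred, target):
    """Variance and mean of the per-pixel log error g = log(pred) - log(target)."""
    g = [math.log(p) - math.log(t) for p, t in zip(pred, target)]
    mean = sum(g) / len(g)
    var = sum((x - mean) ** 2 for x in g) / len(g)
    return var, mean


def scale_invariant_loss(pred, target, alpha=10.0, lam=0.15):
    """Assumed form: alpha * sqrt(variance + lam * mean^2)."""
    var, mean = log_error_stats(pred, target)
    return alpha * math.sqrt(var + lam * mean ** 2)


target = [1.0, 2.0, 4.0]
pred = [1.1, 1.9, 4.4]
doubled = [2.0 * p for p in pred]          # same structure, wrong global scale

var_a, mean_a = log_error_stats(pred, target)
var_b, mean_b = log_error_stats(doubled, target)
```

Multiplying the prediction by a constant shifts every log error by the same amount, so the variance term is unchanged while the mean term grows. For the same reason, the variance term takes the same value whether it is evaluated on the relative depth or on the absolute depth (scale factor times relative depth).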

[0098] For scale queries, cross-entropy loss is used for semantic supervision, as shown in Equation (8):

[0099]

[0100] where the label term is the one-hot scene category label and T_i is the image-text similarity with the i-th scene category. Introducing this loss function ensures that the model accurately captures the semantic information in the image, thereby improving the accuracy of scale prediction. The final loss function is shown in Equation (9):

[0101]

[0102] β is set to 0.01.
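A minimal sketch of the semantic supervision and the combined objective follows. The cross-entropy is the standard one-hot form. Using β to weight the cross-entropy term in a simple sum is an assumption, since Equation (9) is not reproduced in the text; the function names and numeric inputs are hypothetical.

```python
import math


def cross_entropy(one_hot, probs, eps=1e-12):
    """Standard cross-entropy between a one-hot label and a probability vector."""
    return -sum(y * math.log(p + eps) for y, p in zip(one_hot, probs))


def total_loss(depth_loss, ce_loss, beta=0.01):
    """Assumed combination: depth loss plus beta times the semantic loss."""
    return depth_loss + beta * ce_loss


label = [0.0, 1.0, 0.0]            # one-hot scene category label
T = [0.2, 0.5, 0.3]                # toy image-text similarity distribution

ce = cross_entropy(label, T)
loss = total_loss(2.0, ce)         # 2.0 is a placeholder depth loss value
```

With β = 0.01 the semantic term acts as a light regularizer on the scale query rather than competing with the depth supervision.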

[0103] During the training of the aforementioned modules, this invention innovatively adds a scale query branch, decoupling absolute depth into relative depth and scale prediction. A corresponding loss function is designed for network training, enabling joint depth estimation for indoor and outdoor scenes through training on only two datasets and effectively improving the model's generalization ability. This invention can be applied to autonomous driving scenarios: installed as software on a front-end device, it provides real-time depth estimation results from images captured by a monocular camera, for subsequent vehicle navigation and obstacle avoidance.

[0104] Figure 3 illustrates an electronic device suitable for implementing a unified indoor/outdoor monocular depth estimation method based on scale decoupling and a training method for a unified indoor/outdoor monocular depth estimation model based on scale decoupling, according to an embodiment of the present invention.

[0105] As shown in Figure 3, an electronic device 300 according to an embodiment of the present invention includes a processor 301, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 302 or a program loaded from a storage section 308 into a random access memory (RAM) 303. The processor 301 may include, for example, a general-purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special-purpose microprocessor (e.g., an application-specific integrated circuit (ASIC)), etc. The processor 301 may also include onboard memory for caching purposes. The processor 301 may include a single processing unit or multiple processing units for performing different actions of the method flow according to an embodiment of the present invention.

[0106] RAM 303 stores various programs and data required for the operation of electronic device 300. Processor 301, ROM 302, and RAM 303 are interconnected via bus 304. Processor 301 executes various operations of the method flow according to embodiments of the present invention by executing programs in ROM 302 and/or RAM 303. It should be noted that programs may also be stored in one or more memories other than ROM 302 and RAM 303. Processor 301 may also execute various operations of the method flow according to embodiments of the present invention by executing programs stored in one or more memories.

[0107] According to an embodiment of the present invention, the electronic device 300 may further include an input/output (I/O) interface 305, which is also connected to the bus 304. The electronic device 300 may also include one or more of the following components connected to the I/O interface 305: an input section 306 including a keyboard, mouse, etc.; an output section 307 including a cathode ray tube (CRT), liquid crystal display (LCD), etc., and a speaker, etc.; a storage section 308 including a hard disk, etc.; and a communication section 309 including a network interface card such as a LAN card, modem, etc. The communication section 309 performs communication processing via a network such as the Internet. A drive 310 is also connected to the I/O interface 305 as needed. A removable medium 311, such as a disk, optical disk, magneto-optical disk, semiconductor memory, etc., is installed on the drive 310 as needed so that computer programs read from it can be installed into the storage section 308 as needed.

[0108] The present invention also provides a computer-readable storage medium, which may be included in the device/apparatus/system described in the above embodiments, or may exist independently without being assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs, which, when executed, implement the method according to the embodiments of the present invention.

[0109] According to embodiments of the present invention, a computer-readable storage medium may be a non-volatile computer-readable storage medium, including, but not limited to: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof. In the present invention, a computer-readable storage medium may be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. For example, according to embodiments of the present invention, a computer-readable storage medium may include ROM 302 and/or RAM 303 and/or one or more memories other than ROM 302 and RAM 303 described above.

[0110] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flowchart, and combinations of blocks in a block diagram or flowchart, may be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.

[0111] The above specific embodiments further illustrate the purpose, technical solution, and beneficial effects of the present invention. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.

Claims

1. A unified indoor and outdoor monocular depth estimation method based on scale decoupling, characterized in that it comprises: using the CLIP image encoder of a trained monocular depth estimation model to obtain multi-scale image features of a target image, and using the pixel decoder of the trained monocular depth estimation model to perform multi-scale fusion of the multi-scale image features to obtain multi-scale fusion features with global and local awareness; initializing multiple depth partitioning queries, and using the feature interactor of the trained monocular depth estimation model to realize the interaction between the multiple depth partitioning queries and the multi-scale fusion features to obtain updated depth partitioning queries, wherein each depth partitioning query corresponds to a depth interval; processing the updated depth partitioning queries using the adaptive relative depth estimation module of the trained monocular depth estimation model to obtain a depth feature representation for each depth interval and a depth partition for generating each depth interval; obtaining the center value of each depth interval based on the normalization result of the depth partition, and flattening the multi-scale fusion features and performing similarity calculation with the depth feature representation to obtain the probability of each depth interval; and linearly weighting the center value of each depth interval by the probability of each depth interval to obtain the relative depth estimation result of the target image.

2. The method according to claim 1, characterized in that it further comprises: initializing a scale query, and realizing the interaction between the scale query and the multi-scale fusion features through the feature interactor of the trained monocular depth estimation model to obtain a semantically aware scale query, wherein the semantically aware scale query carries the global information of the target image; and projecting, by the semantically aware scale prediction module of the trained monocular depth estimation model, the semantically aware scale query into a scale factor, and performing a calculation on the scale factor and the relative depth estimation result to obtain the absolute depth estimation result of the target image, wherein the scale factor represents the scene depth range of the target image.

3. The method according to claim 1, characterized in that the feature interactor of the trained monocular depth estimation model is constructed based on a Transformer and includes a multi-head self-attention unit, a multi-head cross-attention unit, and a feedforward network; and the trained monocular depth estimation model has multiple feature interactors, each of which corresponds to the fusion feature of one scale among the multi-scale fusion features.

4. The method according to claim 1, characterized in that initializing multiple depth partitioning queries and using the feature interactor of the trained monocular depth estimation model to realize the interaction between the multiple depth partitioning queries and the multi-scale fusion features to obtain updated depth partitioning queries comprises: using the feature interactor of the trained monocular depth estimation model to interact in sequence with the fusion features of each scale in the multi-scale fusion features, so as to update the depth partitioning queries and obtain the updated depth partitioning queries.

5. The method according to claim 1, characterized in that processing the updated depth partitioning queries using the adaptive relative depth estimation module of the trained monocular depth estimation model to obtain a depth feature representation for each depth interval and a depth partition for generating each depth interval comprises: processing the updated depth partitioning queries with the first multilayer perceptron of the adaptive relative depth estimation module of the trained monocular depth estimation model to obtain the depth feature representation for each depth interval; and processing the updated depth partitioning queries with the second multilayer perceptron of the adaptive relative depth estimation module of the trained monocular depth estimation model to obtain the depth partition used to generate each depth interval.

6. The method according to claim 1, characterized in that obtaining the center value of each depth interval based on the normalization result of the depth partition, and flattening the multi-scale fusion features and performing similarity calculation with the depth feature representation to obtain the probability of each depth interval comprises: normalizing the depth partition corresponding to each depth interval, and obtaining the center value of each depth interval from the normalization result using a predefined center value calculation method; flattening the fusion feature of a predefined scale among the multi-scale fusion features, and performing similarity calculation between the flattening result and the depth feature representation; and normalizing the similarity calculation result along a predefined dimension to obtain the probability of each pixel of the target image belonging to each depth interval.

7. A training method for a unified indoor/outdoor monocular depth estimation model based on scale decoupling, applied to the method described in any one of claims 1-6, characterized in that it comprises: based on image-text similarity constraints, training the monocular depth estimation model using image samples, a predefined loss function, and a pre-trained CLIP text encoder to obtain a trained monocular depth estimation model.

8. The method according to claim 7, characterized in that training the monocular depth estimation model using image samples, a predefined loss function, and a pre-trained CLIP text encoder based on image-text similarity constraints to obtain the trained monocular depth estimation model comprises: obtaining the multi-scale fusion features of the image samples using the CLIP image encoder and pixel decoder of the monocular depth estimation model; based on the depth partitioning queries, processing the multi-scale fusion features using the feature interactor of the monocular depth estimation model to obtain updated depth partitioning queries; processing the updated depth partitioning queries using the adaptive relative depth estimation module of the monocular depth estimation model to obtain the relative depth estimation result of the image sample; based on the scale query, processing the multi-scale fusion features using the feature interactor of the monocular depth estimation model to obtain an updated scale query; introducing image categories to construct image-text similarity constraints and, based on the image-text similarity constraints, obtaining the text features of the image sample using the pre-trained CLIP text encoder; obtaining the scene category information of the image sample from the text features of the image sample and the updated scale query, and processing the scene category label and the scene category information of the image sample with a predefined cross-entropy loss function to obtain a cross-entropy loss value; projecting, by the multilayer perceptron of the semantically aware scale prediction module of the monocular depth estimation model, the scale query of the image sample into the scale factor of the image sample, and multiplying the scale factor of the image sample and the relative depth estimation result element by element to obtain the absolute depth estimation result of the image sample; processing the relative depth estimation result, the absolute depth estimation result, and the true depth value of the image sample using a predefined scale-invariant loss function to obtain a scale-invariant loss value; and updating the monocular depth estimation model with the scale-invariant loss value and the cross-entropy loss value until the preset training conditions are met, to obtain the trained monocular depth estimation model.

9. An electronic device, comprising: one or more processors; and a storage device for storing one or more programs, wherein, when the one or more programs are executed by the one or more processors, the one or more processors perform the method according to any one of claims 1 to 8.

10. A computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the method according to any one of claims 1 to 8.