Depth completion method, apparatus, and device based on radiometric difference and spatial distance
A multi-scale architecture based on radiometric difference and spatial distance, combining a bilateral propagation network with a multimodal fusion module, resolves the ambiguity of the sparse depth map representation and the spatial invariance of convolution, achieving high-precision depth estimation with preserved detail.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NAT UNIV OF DEFENSE TECH
- Filing Date
- 2023-12-26
- Publication Date
- 2026-06-23
Figure CN117635444B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of image depth completion technology, and in particular to a depth completion method, apparatus and device based on radiometric difference and spatial distance.
Background Technology
[0002] Dense depth sensing is a technique for calculating the distance from each pixel in an image to the camera, playing a crucial role in computer vision, especially in 3D vision tasks. While using LiDAR technology to measure depth is currently the most reliable solution in practical deployments, directly estimating dense depth maps remains extremely difficult due to hardware limitations. An economical approach is to utilize machine vision algorithms for depth estimation, but the accuracy obtained by methods based on binocular, monocular, or multi-view imaging is very limited. Nevertheless, obtaining accurate pixel-level scene depth is a key technology for advancing applications in autonomous driving, robotics, and augmented reality.
[0003] Generally, depth completion methods focus on three issues: how to handle irregular sparse data, how to fuse multimodal data, and how to optimize the completion results. Early methods typically employed single-stage depth completion strategies such as multimodal fusion modules. These methods tend to oversmooth at depth edges, resulting in missing details. Currently, more and more works adopt a two-stage approach, using an additional network for post-processing to alleviate the oversmoothing problem. However, these methods often use 0 to represent unknown pixels in the input sparse depth map, which makes valid and invalid pixel values ambiguous. Furthermore, for irregularly sampled sparse depth points, the spatial invariance inherent in convolution operations degrades the performance of the multimodal fusion module, and even the addition of a post-processing stage cannot effectively solve these problems.
Summary of the Invention
[0004] Therefore, it is necessary to provide a depth completion method based on radiometric difference and spatial distance that can avoid ambiguity when distinguishing between valid and invalid pixel values and improve the performance of the multimodal fusion module, in order to address the above-mentioned technical problems.
[0005] A depth completion method based on radiometric difference and spatial distance, the method comprising:
[0006] Obtain a color image and a sparse depth map paired with the color image;
[0007] A bilateral propagation network is constructed based on a multi-scale architecture. The bilateral propagation network performs dense depth estimation through multiple scales from coarse to fine. At each scale, it includes a preprocessing module, a multimodal fusion module, and a depth optimization module. In the preprocessing module, a hypernetwork is designed. The hypernetwork combines radiometric difference and spatial distance to generate filter weights. The filter weights make the bilateral propagation network favor the nearest values in radiometric difference and spatial neighborhood during depth propagation.
[0008] The preprocessing module extracts features from the color image and the sparse depth map to obtain the image encoding and sparse depth map at each scale. It processes the sparse depth map at each scale to obtain an initial dense depth map and then back-projects the initial dense depth map into the camera space to obtain depth features.
[0009] The image encoding of the same scale and the depth features are input into the multimodal fusion module for fusion to obtain a dense depth map;
[0010] The depth optimization module iteratively updates the depth propagated by the multimodal fusion module based on the sparse depth map, and obtains the completed dense depth map by weighted combination of dense depth maps at each scale.
[0011] A depth completion device based on radiometric difference and spatial distance, the device comprising:
[0012] The data acquisition module is used to acquire a color image and a sparse depth map paired with the color image;
[0013] A network construction module is used to construct a bilateral propagation network based on a multi-scale architecture. The bilateral propagation network performs dense depth estimation through multiple scales from coarse to fine. At each scale, it includes a preprocessing module, a multimodal fusion module, and a depth optimization module. In the preprocessing module, a hypernetwork is designed. The hypernetwork combines radiometric difference and spatial distance to generate filter weights. The filter weights make the bilateral propagation network favor the nearest values in radiometric difference and spatial neighborhood during depth propagation.
[0014] The feature extraction module is used to extract features from the color image and the sparse depth map through the preprocessing module to obtain the image encoding and sparse depth map at each scale, process the sparse depth map at each scale to obtain the initial dense depth map, and back-project the initial dense depth map into the camera space to obtain depth features.
[0015] The feature fusion module is used to input the image encoding and the depth features of the same scale into the multimodal fusion module for fusion to obtain a dense depth map;
[0016] The update output module is used to iteratively update the depth propagated by the multimodal fusion module based on the sparse depth map through the depth optimization module. By weighted combination of dense depth maps at each scale, a complete dense depth map is obtained.
[0017] A computer device includes a memory and a processor, the memory storing a computer program, the processor executing the computer program to implement the steps of the method.
[0018] The aforementioned depth completion method, apparatus, and device based on radiometric difference and spatial distance acquire a color image and a sparse depth map paired with the color image. A bilateral propagation network is constructed based on a multi-scale architecture. This network performs dense depth estimation at multiple scales, from coarse to fine. At each scale, a preprocessing module, a multimodal fusion module, and a depth optimization module are included. Specifically, a hypernetwork is designed in the preprocessing module. This hypernetwork combines radiometric difference and spatial distance to generate filter weights. These weights make the bilateral propagation network favor the nearest values in radiometric difference and spatial neighborhood during depth propagation. The preprocessing module extracts features from the color image and sparse depth map to obtain the corresponding image encoding and sparse depth map at each scale. It processes the sparse depth map at each scale to obtain the initial dense depth map and back-projects the initial dense depth map into the camera space to obtain depth features. The image encoding and depth features at the same scale are input into the multimodal fusion module for fusion to obtain the dense depth map. The depth optimization module iteratively updates the depth propagated by the multimodal fusion module based on the sparse depth map. By weighted combination of the dense depth maps at each scale, the completed dense depth map is obtained.
[0019] This invention designs a hypernetwork in the preprocessing module. This hypernetwork combines radiometric difference and spatial distance to generate filter weights. These filter weights make the bilateral propagation network favor the nearest values in radiometric difference and spatial neighborhood during depth propagation, thus effectively preserving edges. By introducing this nonlinear propagation model, depth is propagated at an early stage as a weighted combination of the valid depth values in the neighborhood. This avoids the ambiguity of the sparse depth map representation and the spatial invariance of convolution operations, both of which are difficult to resolve in the multimodal fusion stage. Furthermore, the method provided by this invention performs well in estimating dense depth maps, broadening the scope for practical applications of depth estimation.
Attached Figure Description
[0020] Figure 1 is a flowchart illustrating a depth completion method based on radiometric difference and spatial distance in one embodiment;
[0021] Figure 2 is a schematic diagram of the overall framework for depth completion based on radiometric difference and spatial distance in one embodiment;
[0022] Figure 3 is a comparison chart of the results of an example in one embodiment;
[0023] Figure 4 is a structural block diagram of a depth completion device based on radiometric difference and spatial distance in one embodiment;
[0024] Figure 5 is an internal structural diagram of a computer device in one embodiment.
Detailed Implementation
[0025] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0026] It should be noted that in this invention, the use of terms such as "first," "second," etc., is for descriptive purposes only and should not be construed as indicating or implying their relative importance or implicitly specifying the number of technical features indicated. Therefore, a feature defined as "first" or "second" may explicitly or implicitly include at least one of that feature. In the description of this invention, "a plurality of" means at least two, such as two, three, etc., unless otherwise explicitly specified.
[0027] The embodiments of the present invention will now be described in detail with reference to the accompanying drawings.
[0028] In one embodiment, as shown in Figure 1, a depth completion method based on radiometric difference and spatial distance is provided, including the following steps:
[0029] Step 202: Obtain the color image and the sparse depth map paired with the color image.
[0030] It is understood that the datasets used to train the model in this invention are the indoor dataset NYUDepthV2 and the outdoor dataset KITTI. The NYUDepthV2 dataset contains 464 scenes collected by Kinect sensors. This invention uses 50K image frames sampled from 249 scenes as training data and evaluates the model on the official test set, which includes 654 samples from 215 scenes. The images are downsampled to 320×240 and then cropped to 304×228 at the center. The sparse depth map corresponding to each frame is generated by randomly sampling 500 points from the ground truth depth. Finally, the image is padded to 320×256 as input and tested within a valid region of size 304×228. The KITTI depth completion dataset was collected from autonomous vehicles, and its ground truth depth was generated by LiDAR scanning, further validated using stereo images. This dataset provides 86,898 frames and 1,000 frames for model training and validation, respectively. This invention crops image frames to 256×1216 for training and directly uses full-resolution frames as test input.
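The sparse-input generation described for NYUDepthV2 (randomly sampling 500 points from the ground-truth depth) might be sketched as follows. This is a minimal illustration only: the function name `sample_sparse_depth`, the use of 0 to mark pixels without a measurement, and the fixed random seed are assumptions, not details taken from the text.

```python
import numpy as np

def sample_sparse_depth(gt_depth, num_points=500, seed=0):
    """Keep `num_points` randomly chosen valid ground-truth pixels; all others become 0.

    Assumption: a pixel is valid when its ground-truth depth is > 0.
    """
    rng = np.random.default_rng(seed)
    valid = np.flatnonzero(gt_depth > 0)              # flat indices of valid pixels
    keep = rng.choice(valid, size=min(num_points, valid.size), replace=False)
    sparse = np.zeros_like(gt_depth)
    sparse.flat[keep] = gt_depth.flat[keep]           # copy only the sampled depths
    return sparse
```

Sampling without replacement guarantees that exactly 500 distinct pixels carry a depth value whenever at least 500 valid pixels exist.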
[0031] This embodiment selects a color image from a depth completion dataset and generates the corresponding sparse depth map of the same resolution by projecting the 3D point cloud onto the image plane using the calibration parameters.
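As a rough sketch of this projection step, assume a pinhole camera with a 3×3 intrinsic matrix `K` and points already expressed in the camera frame. The function name, the nearest-pixel rounding, and the rule that the closest point wins when several points land on one pixel are all assumptions made for illustration.

```python
import numpy as np

def project_points_to_depth(points_cam, K, height, width):
    """Project camera-frame 3D points (N, 3) to a sparse depth map; 0 means no measurement."""
    z = points_cam[:, 2]
    front = z > 0                                   # discard points behind the camera
    pts, z = points_cam[front], z[front]
    uvw = pts @ K.T                                 # homogeneous pixel coordinates
    u = np.round(uvw[:, 0] / z).astype(int)
    v = np.round(uvw[:, 1] / z).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z = u[inside], v[inside], z[inside]
    depth = np.full((height, width), np.inf)
    np.minimum.at(depth, (v, u), z)                 # keep the smallest depth per pixel
    depth[np.isinf(depth)] = 0.0                    # pixels that received no point
    return depth
```

For a LiDAR point cloud, the points would first be transformed from the sensor frame into the camera frame with the extrinsic calibration, which is omitted here.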
[0032] Step 204: Construct a bilateral propagation network based on a multi-scale architecture. The bilateral propagation network performs dense depth estimation at multiple scales from coarse to fine. At each scale, it includes a preprocessing module, a multimodal fusion module, and a depth optimization module. In the preprocessing module, a hypernetwork is designed. The hypernetwork combines radiometric difference and spatial distance to generate filter weights. The filter weights make the bilateral propagation network favor the nearest values in radiometric difference and spatial neighborhood during depth propagation.
[0033] It can be understood that, as shown in Figure 2, a schematic diagram of the overall framework for depth completion based on radiometric difference and spatial distance is provided. The bilateral propagation network is built on a multi-scale architecture comprising 6 scales. In the image feature extraction unit and the sparse depth extraction unit of the preprocessing module, features are extracted from the color image and the sparse depth map from the highest resolution (scale 0) to the lowest resolution (scale 5), with the scale s set to 0, 1, 2, 3, 4, and 5. In the bilateral propagation network, dense depth maps are estimated in a coarse-to-fine manner, from the lowest resolution (scale 5) to the highest resolution (scale 0).
[0034] In addition, a hypernetwork is designed in the preprocessing module. The hypernetwork is a multilayer perceptron (MLP) consisting of four densely connected layers, each followed by a batch normalization layer and a GeLU activation layer. A skip connection links the outputs of the second and last layers. The same MLP is applied to every (i, j) pixel pair but receives a different encoding for each pair, thus generating spatially varying and content-dependent parameters. Through the hypernetwork, the filter weights are generated by combining radiometric difference and spatial distance. This makes depth propagation favor the nearest values in radiometric difference and spatial neighborhood, thus preserving edges well.
[0035] The multimodal fusion module employs a U-Net network, an encoder-decoder structure, to aggregate local and global features across multiple scales. The encoder consists of two ResNet modules that extract features at each scale, using convolutional layers with a stride of 2 to reduce the resolution of the feature maps. The decoder includes deconvolutional layers with a stride of 2 to upsample the feature maps, and skip connections to fuse the upsampled and encoded features at the same resolution.
[0036] Step 206: The preprocessing module extracts features from the color image and sparse depth map to obtain the corresponding image encoding and sparse depth map at each scale. The sparse depth map at each scale is processed to obtain the initial dense depth map. The initial dense depth map is then back-projected into the camera space to obtain depth features.
[0037] It can be understood that the preprocessing module of the bilateral propagation network includes an image feature extraction unit and a sparse depth extraction unit with identical structures, which perform multi-scale encoding of the color image and the sparse depth map, respectively. Specifically, the image feature extraction unit uses stacked ResNet modules to extract multi-scale features from the color image; within this unit, the output of one network scale becomes the input of the next network scale. The sparse depth extraction unit extracts sparse depth maps at the different scales. After being processed with the filter weights, these sparse depth maps serve as the initial dense depth maps for the corresponding scales in the fusion stage.
[0038] For each scale s, an initial dense depth map and an image encoding are obtained. The initial dense depth map is back-projected into the camera space to obtain depth features, which are then concatenated with the image encoding.
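Back-projecting a dense depth map into camera space can be illustrated with a pinhole model. The sketch below is only one plausible reading: the intrinsic matrix `K`, the function name, and the choice of the raw (x, y, z) coordinates as the "depth features" are assumptions; the text does not say how the back-projected points are encoded further.

```python
import numpy as np

def backproject_depth(depth, K):
    """Lift a dense depth map (H, W) to camera-space points (H, W, 3) with a pinhole model."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel column (u) and row (v) indices
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)
```

At a coarser scale the intrinsics would need to be rescaled to match the reduced resolution before calling such a function.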
[0039] Step 208: The image encoding of the same scale and the depth feature are input into the multimodal fusion module and fused to obtain a dense depth map.
[0040] Step 210: The depth optimization module iteratively updates the depth propagated by the multimodal fusion module based on the sparse depth map. By weighted combination of dense depth maps at each scale, the completed dense depth map is obtained.
[0041] It can be understood that, in the bilateral propagation network, the multimodal fusion module follows an early-fusion scheme: it simply concatenates the image features and depth features at the same scale and uses the concatenated features as input to the U-Net network for multimodal fusion. The U-Net network outputs a fused feature map. A convolutional layer is applied to the fused feature map to estimate a residual depth map, which yields the updated dense depth map at each scale. At the same time, in the multimodal fusion module, the dense depth map updated at the previous network scale is used as the input for the next network scale. After the iterations, the final dense depth map is a weighted combination over multiple kernel sizes and multiple iteration steps.
[0042] It is worth noting that the network of this invention was trained on four Nvidia RTX 3090 GPU workstations, using AdamW as the optimizer with a weight decay of 0.05 and gradient clipping at an L2-norm threshold of 0.1. The batch size was set to 8, the maximum learning rate was 0.001, and training was completed in approximately 300K iterations.
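The gradient threshold mentioned above can be read as clipping by global L2 norm. A minimal sketch of that operation, assuming the gradients are available as a list of NumPy arrays (the function name and the small stabilizing constant are illustrative), is:

```python
import numpy as np

def clip_grad_norm(grads, max_norm=0.1):
    """Rescale a list of gradient arrays so that their joint L2 norm is at most `max_norm`."""
    total = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    scale = min(1.0, max_norm / (total + 1e-12))     # leave small gradients untouched
    return [g * scale for g in grads], total
```

Deep learning frameworks provide an equivalent utility, so in practice this would not be hand-written; the sketch only shows what the 0.1 threshold does.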
[0043] The aforementioned depth completion method based on radiometric difference and spatial distance acquires a color image and a sparse depth map paired with it. A bilateral propagation network is constructed based on a multi-scale architecture. This network performs dense depth estimation at multiple scales, from coarse to fine. At each scale, a preprocessing module, a multimodal fusion module, and a depth optimization module are included. Specifically, a hypernetwork is designed in the preprocessing module. This hypernetwork combines radiometric difference and spatial distance to generate filter weights. These weights make the bilateral propagation network favor the nearest values in radiometric difference and spatial neighborhood during depth propagation. The preprocessing module extracts features from the color image and sparse depth map to obtain the corresponding image encoding and sparse depth map at each scale. It processes the sparse depth map at each scale to obtain the initial dense depth map and back-projects the initial dense depth map into the camera space to obtain depth features. The image encoding and depth features at the same scale are input into the multimodal fusion module for fusion to obtain the dense depth map. The depth optimization module iteratively updates the depth propagated by the multimodal fusion module based on the sparse depth map. By weighted combination of the dense depth maps at each scale, the completed dense depth map is obtained.
[0044] This invention designs a hypernetwork in the preprocessing module. This hypernetwork combines radiometric difference and spatial distance to generate filter weights. These filter weights make the bilateral propagation network favor the nearest values in radiometric difference and spatial neighborhood during depth propagation, thus effectively preserving edges. By introducing this nonlinear propagation model, depth is propagated at an early stage as a weighted combination of the valid depth values in the neighborhood. This avoids the ambiguity of the sparse depth map representation and the spatial invariance of convolution operations, both of which are difficult to resolve in the multimodal fusion stage. Furthermore, the method provided by this invention performs well in estimating dense depth maps, broadening the scope for practical applications of depth estimation.
[0045] In one embodiment, a multi-scale loss function is designed to train the bilateral propagation network; the multi-scale loss function is expressed as:
[0046] ;
[0047] In the formula, the terms denote, respectively: the ground-truth depth map; the set of valid pixels in the ground-truth depth map; the bilinear interpolation operation; the hyperparameter that balances the losses at different scales, generally set to …; the scale s; and the dense depth map predicted by the network at scale s.
[0048] It is understandable that a dense depth map is output by training the bilateral propagation network end-to-end using a multi-scale loss function. The multi-scale loss function provides sufficient supervision for the dense depth map estimated at each scale.
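A multi-scale loss of the kind described (each scale's prediction is bilinearly upsampled to the ground-truth resolution, compared on valid pixels only, and weighted per scale) could look like the sketch below. The squared-error penalty, the explicit per-scale weight list `weights`, and the corner-aligned bilinear resize are assumptions for illustration; the exact formula is not reproduced in this text.

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Corner-aligned bilinear resize of a 2D array, done as two 1D interpolations."""
    h, w = img.shape
    cols = np.linspace(0, w - 1, out_w)
    rows = np.linspace(0, h - 1, out_h)
    tmp = np.stack([np.interp(cols, np.arange(w), r) for r in img])            # (h, out_w)
    return np.stack([np.interp(rows, np.arange(h), c) for c in tmp.T], axis=1)  # (out_h, out_w)

def multiscale_loss(preds, gt, weights):
    """Weighted sum over scales of the mean squared error on valid ground-truth pixels.

    preds:   list of predicted depth maps, one per scale (any resolution)
    gt:      ground-truth depth map; pixels with value 0 are treated as invalid
    weights: one hypothetical loss weight per scale
    """
    valid = gt > 0
    total = 0.0
    for pred, wgt in zip(preds, weights):
        up = bilinear_resize(pred, *gt.shape)        # bring prediction to ground-truth size
        total += wgt * np.mean((up[valid] - gt[valid]) ** 2)
    return total
```

Supervising every scale in this way gives the coarse stages a direct training signal instead of relying only on the finest output.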
[0049] In one embodiment, a hypernetwork is designed in the preprocessing module. The hypernetwork combines radiometric difference and spatial distance to generate filter weights, expressed as:
[0050] ;
[0051] In the formula, the terms denote, respectively: the hypernetwork; the three filter weights; the image encodings of the target pixel and the source pixel; the depth encoding of the source pixel, obtained by back-projecting the sparse depth map at the corresponding scale into the camera space; and the encoding of the spatial offset between the pixel coordinates of the two pixels. The radiometric difference term is accounted for implicitly by these encodings.
[0052] It can be understood that two of the filter weights are obtained by direct regression, which is well suited to sparse and irregular distributions. When the target pixel is too far from its nearest neighbors, one of these two weights is likely to learn a value of 0, and the propagation is then determined by the other. For any target pixel, this invention uses an additional softmax layer to obtain the third weight over the N valid pixels in the neighborhood of the target pixel, which ensures that these weights are normalized.
[0053] In one embodiment, the filter weights make the bilateral propagation network favor the nearest values in radiometric difference and spatial neighborhood during depth propagation. The propagation process is expressed as follows:
[0054] ;
[0055] In the formula, the terms denote, respectively: the initial dense depth map; the candidate dense depth map; the sparse depth value at the source pixel; and the set of the N nearest valid pixels to the target pixel in Euclidean space.
[0056] It can be understood that, in the formula, the depth at the target pixel is modeled as a combination of the N nearest valid sparse depth values.
[0057] Furthermore, it also includes the following steps:
[0058] Using Euclidean distance on the image plane, find the sparse depth values corresponding to the N nearest valid pixels of the target pixel.
[0059] First, an affine transformation with two of the coefficients is applied to each sparse depth value to generate a candidate depth. Then the third coefficient is used to linearly combine the candidate depths and generate the target depth.
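The propagation steps above (nearest valid neighbors, affine transformation, normalized linear combination) can be sketched as follows. The names `alpha`, `beta`, and `logits` are hypothetical stand-ins for the three hypernetwork outputs, which are simply passed in as arrays here; the brute-force neighbor search is chosen for clarity, not efficiency.

```python
import numpy as np

def bilateral_propagate(sparse, alpha, beta, logits, n_neighbors=4):
    """Propagate sparse depth to every pixel from its N nearest valid pixels.

    sparse:              (H, W) sparse depth map, 0 marks invalid pixels
    alpha, beta, logits: (H, W, N) per-pair coefficients (stand-ins for hypernetwork outputs)
    Returns the initial dense depth map, shape (H, W).
    """
    h, w = sparse.shape
    vy, vx = np.nonzero(sparse > 0)                       # coordinates of valid source pixels
    src_depth = sparse[vy, vx]
    ty, tx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    d2 = (ty[..., None] - vy) ** 2 + (tx[..., None] - vx) ** 2   # squared distances (H, W, M)
    nearest = np.argsort(d2, axis=-1)[..., :n_neighbors]  # indices of the N nearest sources
    s = src_depth[nearest]                                # neighbor depths, (H, W, N)
    candidates = alpha * s + beta                         # affine transformation per neighbor
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    omega = e / e.sum(axis=-1, keepdims=True)             # softmax: weights sum to 1
    return (omega * candidates).sum(axis=-1)              # linear combination of candidates
```

Because only valid pixels ever enter the combination, the value 0 is never interpreted as a depth, which is how this stage sidesteps the valid/invalid ambiguity discussed in the background.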
[0060] In one embodiment, feature extraction is performed on the color image and sparse depth map to obtain image features and sparse depth maps at each scale, including:
[0061] The preprocessing module includes an image feature extraction unit and a sparse depth extraction unit with identical structures; the network scales of the image feature extraction unit and the sparse depth extraction unit decrease sequentially.
[0062] The color image is input into the image feature extraction unit, which performs multi-scale encoding on the color image to obtain image codes at different scales.
[0063] The sparse depth map is input into the sparse depth extraction unit, which performs multi-scale encoding on the sparse depth map to obtain sparse depth maps at different scales.
[0064] In one embodiment, a color image is input into an image feature extraction unit, which performs multi-scale encoding on the color image to obtain image features at different scales, including:
[0065] The color image is input into the image feature extraction unit to compute the image encoding at scale s, expressed as:
[0066] ;
[0067] In the formula, the terms denote, respectively: the image encoding; the image features; the multimodal fusion features; and the depth features obtained by back-projecting the dense depth map initialized at scale s+1 into the camera space.
[0068] It can be understood that the process first determines whether scale s is the coarsest scale. If so, the image features at that scale are used as the image encoding. If not, the dense depth map obtained at scale s+1, after the iterative propagation-depth update that follows the multimodal fusion stage, is back-projected into the camera space to obtain depth features. The fused feature map output by the multimodal fusion stage at scale s+1 is then concatenated with these depth features and used as the input of a deconvolution operation, which upsamples it to the output at scale s. This output is concatenated with the image features at scale s, and an additional convolution operation generates the image encoding.
[0069] In one embodiment, a sparse depth map is input into a sparse depth extraction unit, which performs multi-scale encoding on the sparse depth map to obtain sparse depth maps at different scales, including:
[0070] The sparse depth map is input into the sparse depth extraction unit to calculate the sparse depth map at scale s. The function expression is:
[0071] ;
[0072] In the formula, the terms denote, respectively: the indicator function of the validity of a depth measurement; a value close to 0; the set of pixel coordinates at the original resolution that correspond to the target pixel at scale s; the weight; and the sparse depth value at the source pixel;
[0073] Then, the sparse depth map at scale s is processed with the filter weights to obtain the initial dense depth map, and the initial dense depth map is back-projected into the camera space as depth features.
[0074] It can be understood that the process first determines whether scale s is the finest scale, that is, the original resolution. If so, the original sparse depth map is used as the sparse depth map at this scale. If not, weighted pooling is applied to the original sparse depth map to obtain the sparse depth map at scale s. The weights in the formula are generated from the image encoding at scale s: the image encoding is rearranged by a periodic shuffling operator, and an exponential transformation ensures that the generated weight map is positive and has the same resolution as the original sparse depth map.
[0075] The pooling neighborhood is the set of pixel coordinates at the original resolution that correspond to the target pixel at scale s. The sparse depth map may be denser at lower resolutions, but the pooled map in the above equation still contains invalid pixels. Therefore, the sparse depth map at scale s is processed with the filter weights to obtain the initial dense depth map, and the initial dense depth map is back-projected into the camera space as depth features.
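One plausible form of the weighted pooling described above is a validity-masked weighted average over each block of original-resolution pixels, with a small constant in the denominator. The sketch below assumes that form; the names `weight_logits`, `factor`, and `eps` are illustrative, and the weight logits are passed in directly instead of being derived from an image encoding.

```python
import numpy as np

def weighted_pool_sparse(sparse, weight_logits, factor, eps=1e-6):
    """Downsample a sparse depth map by `factor`, averaging only valid pixels.

    sparse:        (H, W) sparse depth map at the original resolution, 0 marks invalid
    weight_logits: (H, W) unconstrained values; exp() makes the weights positive
    """
    h, w = sparse.shape
    if h % factor or w % factor:
        raise ValueError("height and width must be divisible by factor")
    weights = np.exp(weight_logits)
    valid = (sparse > 0).astype(float)                    # validity indicator

    def block_sum(x):
        # Sum over each factor x factor block of original-resolution pixels.
        return x.reshape(h // factor, factor, w // factor, factor).sum(axis=(1, 3))

    num = block_sum(weights * valid * sparse)
    den = block_sum(weights * valid)
    return num / (den + eps)                              # blocks with no valid pixel stay 0
```

Blocks that contain no measurement still produce 0, which matches the remark that the pooled map keeps invalid pixels and therefore still needs the propagation step.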
[0076] In one embodiment, the depth optimization module iteratively updates the depth propagated by the multimodal fusion module based on the sparse depth map. By weighted combining the dense depth maps at each scale, a completed dense depth map is obtained, including:
[0077] The update equation at the t-th iteration is constructed as follows:
[0078] ;
[0079] ;
[0080] ;
[0081] In the formula, the terms denote, respectively: the depth propagated by the network, initialized at t=0 to the fused depth map; the output of a convolutional layer applied to the input fusion features; the content-dependent affinity; and the set of neighborhood pixels within a k×k local window. The normalization can be understood as an L1 constraint that guarantees the stability of the propagation process. When t=0, the depth is initialized to the dense depth map obtained in step 208. Unlike the neighborhood used in the bilateral propagation module, the neighborhood used here is the set of pixels within a k×k local window and is independent of the validity of the sparse depth values.
[0082] Update the propagation depth of the multimodal fusion module using a sparse depth map. The update formula is expressed as:
[0083] ;
[0084] In the formula, the terms denote, respectively: the confidence generated by applying a convolutional layer and a sigmoid layer to the fusion features; the sparse depth map given as network input; and the indicator function of the validity of a depth measurement.
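A single refinement iteration of this kind (a k×k spatial propagation followed by a confidence-gated update from the sparse measurements) might be sketched as below. This is a simplified stand-in and not the patent's exact update: all k×k affinities, including the center, are normalized by the sum of their absolute values, and the gate is taken as confidence times validity; both choices are assumptions.

```python
import numpy as np

def propagate_step(depth, affinity, sparse, confidence, k=3):
    """One simplified spatial propagation step followed by a sparse-depth update.

    depth:      (H, W) current depth estimate
    affinity:   (H, W, k*k) raw affinities for the k x k window, row-major order
    sparse:     (H, W) sparse depth map, 0 marks invalid pixels
    confidence: (H, W) values in [0, 1], for example from a sigmoid layer
    """
    h, w = depth.shape
    pad = k // 2
    # L1 normalization keeps each output a bounded combination of its neighbors.
    aff = affinity / (np.abs(affinity).sum(axis=-1, keepdims=True) + 1e-8)
    padded = np.pad(depth, pad, mode="edge")
    out = np.zeros_like(depth)
    idx = 0
    for dy in range(k):
        for dx in range(k):
            out += aff[..., idx] * padded[dy:dy + h, dx:dx + w]   # shifted neighbor view
            idx += 1
    gate = confidence * (sparse > 0)                      # only valid measurements act
    return (1.0 - gate) * out + gate * sparse
```

Calling such a step T times, with the output fed back as `depth`, gives the iterative update; pixels without a measurement are driven purely by their neighbors.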
[0085] By performing T iterations according to the above formula, the completed dense depth map is obtained, represented as:
[0086] ;
[0087] In the formula, the terms denote, respectively: the set of different kernel sizes; the set of different iteration steps; and the confidence maps, generated by convolutional and softmax layers, which are normalized across the different iteration steps and kernel sizes. In this invention, T is set from 2 to 12 in increments of 2, corresponding to the 6 scales from low resolution to high resolution.
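The final weighted combination over kernel sizes and iteration steps can be illustrated with softmax-normalized confidence maps. The nesting used below (first over steps for each kernel size, then over kernel sizes) and the array layout are assumptions, since the formula itself is not reproduced in this text.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def combine_candidates(depths, step_logits, kernel_logits):
    """Blend depth maps produced with different kernel sizes and iteration steps.

    depths, step_logits: (K, T, H, W) for K kernel sizes and T recorded steps
    kernel_logits:       (K, H, W)
    """
    tau = softmax(step_logits, axis=1)          # per-pixel weights over iteration steps
    sigma = softmax(kernel_logits, axis=0)      # per-pixel weights over kernel sizes
    per_kernel = (tau * depths).sum(axis=1)     # (K, H, W)
    return (sigma * per_kernel).sum(axis=0)     # (H, W)
```

Since both sets of weights sum to 1 at every pixel, the result is a convex combination of the candidate depths.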
[0088] As can be understood, this embodiment introduces a simple convolutional spatial propagation module similar to CSPN++ to achieve depth refinement at each scale. The preset propagation kernel size is k, and the update equation at the t-th iteration is constructed.
[0089] Overall, the method proposed in this invention estimates the dense depth map at any scale s of the multi-scale architecture in three consecutive stages: a preprocessing stage, a multimodal fusion stage, and an optimization stage. The preprocessing stage uses an image feature extraction unit and a sparse depth extraction unit with identical structures. A nonlinear propagation model is introduced to compute, from the sparse depth map, the dense depth map that initializes the multimodal fusion stage. The output depth is a combination of the valid depth values in the neighborhood, weighted by parameters learned by the network. The multimodal fusion stage employs a U-Net network to fuse the image encoding and depth features and generate a residual depth map at each network scale, resulting in an updated dense depth map at each scale. Then, in the optimization stage, the convolutional propagation module updates the dense depth map using the sparse depth map. The final dense depth map is obtained as a weighted combination over multiple kernel sizes and multiple iteration steps.
[0090] In one embodiment, to verify the effectiveness of the method provided by the present invention, 13 state-of-the-art methods trained on the NYUv2 and KITTI datasets were selected for comparison. Specific results are shown in Table 1 and Figure 3.
[0091]
[0092] As can be seen from Table 1 and Figure 3, the method proposed in this invention ranks first on the KITTI DC leaderboard, surpassing all other methods on the RMSE metric, and performs comparably on the other metrics. On the NYUv2 dataset, the method also achieves the best results on both the RMSE and delta metrics. In Figure 3, the first row is the color image, the second row is the corresponding sparse depth map, the third to fifth rows are the depth completion results of the GuideNet, NLSPN, and CFormer methods, and the last row is the result of the BPNet proposed in this invention. Visual analysis shows that the method proposed in this invention achieves clearer object boundaries and richer details, while the other methods struggle to estimate accurate depth in these challenging regions.
[0093] It should be understood that, although the steps in the flowchart of Figure 1 are shown sequentially as indicated by the arrows, these steps are not necessarily executed in that order. Unless otherwise specified herein, there is no strict order in which these steps are executed, and they can be performed in other orders. Moreover, at least some of the steps in Figure 1 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same time but can be executed at different times, and they are not necessarily executed sequentially but can be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
[0094] In one embodiment, as shown in Figure 4, a depth completion device based on radiometric difference and spatial distance is provided, including: a data acquisition module 402, a network construction module 404, a feature extraction module 406, a feature fusion module 408, and an update output module 410, wherein:
[0095] The data acquisition module 402 is used to acquire a color image and a sparse depth map paired with the color image.
[0096] The network construction module 404 is used to construct a bilateral propagation network based on a multi-scale architecture. The bilateral propagation network performs dense depth estimation through multiple scales from coarse to fine. At each scale, it includes a preprocessing module, a multimodal fusion module, and a depth optimization module. In the preprocessing module, a hypernetwork is designed. The hypernetwork combines the radiometric difference and the spatial distance to generate filter weights. During depth propagation, the filter weights give the bilateral propagation network a preference for the values that are nearest in radiometric difference and in the spatial neighborhood.
[0097] The feature extraction module 406 is used to extract features from the color image and the sparse depth map through the preprocessing module to obtain the corresponding image encoding and sparse depth map at each scale, process the sparse depth map at each scale to obtain the initial dense depth map, and back-project the initial dense depth map into camera space to obtain the depth features.
[0098] The feature fusion module 408 is used to input the image encoding and the depth features of the same scale into the multimodal fusion module for fusion to obtain a dense depth map.
[0099] The update output module 410 is used to iteratively update the depth propagated by the multimodal fusion module based on the sparse depth map through the depth optimization module. By weighted combination of dense depth maps at each scale, the completed dense depth map is obtained.
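The back-projection performed by the feature extraction module can be illustrated with a short sketch. It assumes a pinhole camera whose intrinsic parameters `fx`, `fy`, `cx`, and `cy` are known; the camera model, the parameter names, and the function name `back_project` are assumptions made for illustration and are not specified in this application.

```python
def back_project(depth, fx, fy, cx, cy):
    """Map every pixel (u, v) with depth z to a camera-space point (X, Y, Z).

    depth is an H x W nested list; fx, fy are focal lengths in pixels and
    (cx, cy) is the principal point (assumed pinhole model).
    """
    points = []
    for v, row in enumerate(depth):
        out_row = []
        for u, z in enumerate(row):
            x_cam = (u - cx) * z / fx
            y_cam = (v - cy) * z / fy
            out_row.append((x_cam, y_cam, z))
        points.append(out_row)
    return points
```

The resulting per-pixel 3D coordinates are one possible form of the depth features that are passed, together with the image encoding of the same scale, to the multimodal fusion module.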
[0100] Specific limitations regarding the depth completion device based on radiometric difference and spatial distance can be found in the limitations of the depth completion method based on radiometric difference and spatial distance mentioned above, and will not be repeated here. Each module in the aforementioned depth completion device based on radiometric difference and spatial distance can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device, or stored in the memory of a computer device as software, so that the processor can call and execute the corresponding operations of each module.
[0101] In one embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in Figure 5. The computer device includes a processor, memory, a network interface, and a database connected via a system bus. The processor provides computational and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system, computer programs, and the database. The internal memory provides an environment for running the operating system and the computer programs stored in the non-volatile storage media. The database stores depth completion data based on radiometric difference and spatial distance. The network interface communicates with external terminals via a network connection. When executed by the processor, the computer program implements a depth completion method based on radiometric difference and spatial distance.
[0102] Those skilled in the art will understand that the structure shown in Figure 5 is merely a block diagram of the portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. A specific computer device may include more or fewer components than those shown in the figure, combine certain components, or have a different arrangement of components.
[0103] In one embodiment, a computer device is provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to perform the following steps:
[0104] Step 202: Obtain the color image and the sparse depth map paired with the color image.
[0105] Step 204: Construct a bilateral propagation network based on a multi-scale architecture. The bilateral propagation network performs dense depth estimation at multiple scales from coarse to fine. At each scale, it includes a preprocessing module, a multimodal fusion module, and a depth optimization module. In the preprocessing module, a hypernetwork is designed. The hypernetwork combines the radiometric difference and the spatial distance to generate filter weights. During depth propagation, the filter weights give the bilateral propagation network a preference for the values that are nearest in radiometric difference and in the spatial neighborhood.
[0106] Step 206: The preprocessing module extracts features from the color image and the sparse depth map to obtain the corresponding image encoding and sparse depth map at each scale. The sparse depth map at each scale is processed to obtain the initial dense depth map. The initial dense depth map is then back-projected into camera space to obtain depth features.
[0107] Step 208: The image encoding of the same scale and the depth features are input into the multimodal fusion module and fused to obtain a dense depth map.
[0108] Step 210: The depth optimization module iteratively updates the depth propagated by the multimodal fusion module based on the sparse depth map. By weighted combination of dense depth maps at each scale, the completed dense depth map is obtained.
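The iterative update of Step 210 can be sketched as follows, under stated assumptions. The learned, content-dependent affinities are replaced by a uniform average over the k×k window, the confidence map is passed in rather than predicted by a network, and the final weighted combination over several kernel sizes and iteration counts is omitted. The function name `refine` and its parameters are hypothetical.

```python
def refine(depth, sparse, confidence, steps=3, k=3):
    """Propagate depth within k x k windows, then re-anchor to measurements.

    depth:      H x W nested list, the fused depth map used as initialisation
    sparse:     H x W nested list; None marks a pixel without a measurement
    confidence: H x W nested list of values in [0, 1]
    """
    h, w = len(depth), len(depth[0])
    r = k // 2
    current = [row[:] for row in depth]
    for _ in range(steps):
        updated = [[0.0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                # Uniform affinity over the in-bounds window: an assumed
                # stand-in for affinities predicted from fused features.
                window = [current[yy][xx]
                          for yy in range(max(0, y - r), min(h, y + r + 1))
                          for xx in range(max(0, x - r), min(w, x + r + 1))]
                value = sum(window) / len(window)
                measured = sparse[y][x]
                if measured is not None:
                    # Blend toward the measurement only where one exists.
                    c = confidence[y][x]
                    value = (1.0 - c) * value + c * measured
                updated[y][x] = value
        current = updated
    return current
```

With a confidence of 1 at a measured pixel, that pixel is reset to its measurement after every iteration, while its neighbors are gradually pulled toward it by the propagation step.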
[0109] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium. When executed, the computer program can include the processes of the embodiments of the above methods. Any references to memory, storage, databases, or other media used in the embodiments provided in this application can include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
[0110] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
[0111] The embodiments described above illustrate only several implementations of this application. Although their descriptions are relatively specific and detailed, they should not be construed as limiting the scope of the patent. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this patent application should be determined by the appended claims.
Claims
1. A depth completion method based on radiometric difference and spatial distance, characterized in that the method comprises: obtaining a color image and a sparse depth map paired with the color image; constructing a bilateral propagation network based on a multi-scale architecture, wherein the bilateral propagation network performs dense depth estimation through multiple scales from coarse to fine and, at each scale, includes a preprocessing module, a multimodal fusion module, and a depth optimization module, and wherein a hypernetwork is designed in the preprocessing module, the hypernetwork combining the radiometric difference and the spatial distance to generate filter weights that, during depth propagation, give the bilateral propagation network a preference for the values that are nearest in radiometric difference and in the spatial neighborhood; extracting, by the preprocessing module, features from the color image and the sparse depth map to obtain the image encoding and the sparse depth map at each scale, processing the sparse depth map at each scale to obtain an initial dense depth map, and back-projecting the initial dense depth map into camera space to obtain depth features; inputting the image encoding and the depth features of the same scale into the multimodal fusion module for fusion to obtain a dense depth map; and iteratively updating, by the depth optimization module, the depth propagated by the multimodal fusion module based on the sparse depth map, and obtaining the completed dense depth map by weighted combination of the dense depth maps at each scale;
wherein, in the preprocessing module, the hypernetwork combining the radiometric difference and the spatial distance generates the filter weights, expressed as: [formula]; where [symbol] denotes the hypernetwork; [symbol], [symbol], and [symbol] denote the filter weights; [symbol] and [symbol] are the image encodings of the target pixel [symbol] and the source pixel [symbol]; [symbol] is the depth encoding of the source pixel [symbol]; and [symbol] is the encoding of the pixel-coordinate spatial offset from [symbol] to [symbol], the radiometric difference term being considered implicitly through [symbol]; the filter weights give the bilateral propagation network, during depth propagation, a preference for the values that are nearest in radiometric difference and in the spatial neighborhood, the propagation process being expressed as: [formula]; where [symbol] denotes the initial dense depth map; [symbol] denotes the candidate dense depth map; [symbol] denotes the sparse depth map at the source pixel; and [symbol] denotes the set of the N points closest to pixel i in Euclidean space;
and wherein iteratively updating, by the depth optimization module, the depth propagated by the multimodal fusion module based on the sparse depth map, and obtaining the completed dense depth map by weighted combination of the dense depth maps at each scale, comprises: constructing the update equations at the t-th iteration as: [formula]; [formula]; [formula]; where [symbol] denotes the depth propagated by the network, initialized at t=0 as the fused depth map [symbol]; [symbol] is the output obtained by passing the input fused features [symbol] through a convolutional layer; [symbol] is the affinity, used to enhance content relevance; and [symbol] is the set of neighborhood pixels within a k×k local window; updating the depth propagated by the multimodal fusion module using the sparse depth map, the update being expressed as: [formula]; where [symbol] is the confidence generated by applying a convolutional layer and a sigmoid layer to the fused features [symbol]; [symbol] denotes the sparse depth map input to the network; and [symbol] denotes the indicator function of depth measurement validity; and performing T iterations according to the above formulas to obtain the completed dense depth map, expressed as: [formula]; where [symbol] denotes the set of different kernel sizes; [symbol] denotes the set of different iteration steps; and [symbol] is the confidence map generated by a convolutional layer and a softmax layer.
2. The depth completion method based on radiometric difference and spatial distance according to claim 1, characterized in that the method further comprises designing a multi-scale loss function to train the bilateral propagation network, the multi-scale loss function being expressed as: [formula]; where [symbol] denotes the ground-truth depth map; [symbol] denotes the set of valid pixels in the ground-truth depth map [symbol]; [symbol] denotes the bilinear interpolation operation; [symbol] denotes the hyperparameter for smoothing the losses at different scales; s denotes the scale; and [symbol] denotes the dense depth map predicted by the network at scale s.
3. The depth completion method based on radiometric difference and spatial distance according to claim 1, characterized in that extracting, by the preprocessing module, features from the color image and the sparse depth map to obtain the image encoding and the sparse depth map at each scale comprises: the preprocessing module includes an image feature extraction unit and a sparse depth extraction unit, the network scales of the image feature extraction unit and the sparse depth extraction unit decreasing sequentially; inputting the color image into the image feature extraction unit, which performs multi-scale encoding on the color image to obtain image encodings at different scales; and inputting the sparse depth map into the sparse depth extraction unit, which performs multi-scale encoding on the sparse depth map to obtain sparse depth maps at different scales.
4. The depth completion method based on radiometric difference and spatial distance according to claim 3, characterized in that inputting the color image into the image feature extraction unit, which performs multi-scale encoding on the color image to obtain image encodings at different scales, comprises: inputting the color image into the image feature extraction unit to calculate the image encoding at scale s, expressed as: [formula]; where [symbol] denotes the image encoding; [symbol] denotes the image features; [symbol] denotes the multimodal fusion features; and [symbol] denotes the depth features obtained by back-projecting the dense depth map initialized at scale s+1 into camera space.
5. The depth completion method based on radiometric difference and spatial distance according to claim 3, characterized in that inputting the sparse depth map into the sparse depth extraction unit, which performs multi-scale encoding on the sparse depth map to obtain sparse depth maps at different scales, comprises: inputting the sparse depth map into the sparse depth extraction unit to calculate the sparse depth map at scale s, expressed as: [formula]; where [symbol] denotes the indicator function of depth measurement validity; [symbol] is a value close to 0; [symbol] denotes the pixel coordinates, at the original resolution [symbol], of the target pixel [symbol] at scale s; [symbol] denotes the weight; and [symbol] denotes the sparse depth map at the source pixel.
6. A depth completion device based on radiometric difference and spatial distance, characterized in that the device employs the depth completion method based on radiometric difference and spatial distance according to any one of claims 1 to 5 and comprises: a data acquisition module, used to acquire a color image and a sparse depth map paired with the color image; a network construction module, used to construct a bilateral propagation network based on a multi-scale architecture, wherein the bilateral propagation network performs dense depth estimation through multiple scales from coarse to fine and, at each scale, includes a preprocessing module, a multimodal fusion module, and a depth optimization module, and wherein a hypernetwork is designed in the preprocessing module, the hypernetwork combining the radiometric difference and the spatial distance to generate filter weights that, during depth propagation, give the bilateral propagation network a preference for the values that are nearest in radiometric difference and in the spatial neighborhood; a feature extraction module, used to extract features from the color image and the sparse depth map through the preprocessing module to obtain the image encoding and the sparse depth map at each scale, process the sparse depth map at each scale to obtain the initial dense depth map, and back-project the initial dense depth map into camera space to obtain depth features; a feature fusion module, used to input the image encoding and the depth features of the same scale into the multimodal fusion module for fusion to obtain a dense depth map; and an update output module, used to iteratively update, through the depth optimization module, the depth propagated by the multimodal fusion module based on the sparse depth map, and to obtain the completed dense depth map by weighted combination of the dense depth maps at each scale.
7. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 5.