Image processing method, device, apparatus and storage medium

By determining a prediction consistency threshold and a depth similarity tolerance for the original depth images used to train a depth estimation model, real scene features are preserved and noise artifacts are identified, thus improving denoising accuracy and stability.

CN122243801A · Pending · Publication Date: 2026-06-19 · UBTECH ROBOTICS CORP LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
UBTECH ROBOTICS CORP LTD
Filing Date
2026-04-27
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies struggle to distinguish between real-world scene features and noise artifacts during image preprocessing for depth estimation models. This leads to the destruction of key spatial geometry during denoising, impacting model training accuracy and generalization ability.

Method used

By acquiring the original depth image and the predicted depth image, the prediction consistency threshold and depth similarity tolerance are determined, connected components are identified, and denoising is performed based on these parameters to ensure the preservation of true features and accurate identification of noise.

Benefits of technology

It improves the accuracy and stability of denoising, protects the integrity of key spatial geometry, and achieves cross-scene adaptive denoising capability.

✦ Generated by Eureka AI based on patent content.

Figure CN122243801A_ABST

Abstract

This application provides an image processing method, apparatus, electronic device, and storage medium, comprising: acquiring an original depth image for training a depth estimation model, and acquiring a color image corresponding to the original depth image and a corresponding predicted depth image, wherein the predicted depth image is predicted based on the color image; determining a prediction consistency threshold based on the original depth image and the predicted depth image, and determining a depth similarity tolerance based on the original depth image; identifying multiple connected components from the original depth image based on the depth similarity tolerance and the prediction consistency threshold; and denoising the original depth image based on the multiple connected components to obtain a target depth image. This improves the accuracy of denoising the original depth image used for training the depth estimation model.

Description

Technical Field

[0001] This application relates to the field of Internet technology, and in particular to an image processing method, apparatus, electronic device, and computer-readable storage medium.

Background Technology

[0002] In related technologies, when preprocessing training images for depth estimation models, a "one-size-fits-all" strategy with fixed parameters is often adopted, making it difficult to distinguish between real scene features and noise artifacts. As a result, the denoising process easily destroys key spatial geometric structures (such as object edges) and cannot adapt to the differences in data from different scenes. Ultimately, this results in poor quality of the cleaned supervision data, which severely restricts the training accuracy and generalization ability of depth estimation models.

Summary of the Invention

[0003] This application provides an image processing method, apparatus, electronic device, and computer-readable storage medium that can improve the accuracy of denoising raw depth images used to train depth estimation models.

[0004] The technical solution of the embodiments of this application is implemented as follows. This application provides an image processing method, including: obtaining an original depth image used to train a depth estimation model, and obtaining a color image corresponding to the original depth image and a corresponding predicted depth image, wherein the predicted depth image is predicted based on the color image; determining a prediction consistency threshold based on the original depth image and the predicted depth image, and determining a depth similarity tolerance based on the original depth image, wherein the depth similarity tolerance is the threshold of the depth difference between any adjacent pixels in the original depth image, and the prediction consistency threshold is the threshold of the relative error of the depth of the same pixel in the original depth image and the predicted depth image; identifying multiple connected components from the original depth image based on the depth similarity tolerance and the prediction consistency threshold; and denoising the original depth image based on the multiple connected components to obtain a target depth image.

[0005] This application provides an image processing apparatus, including: an acquisition module, used to acquire the original depth image for training the depth estimation model, and to acquire the color image corresponding to the original depth image and the corresponding predicted depth image, wherein the predicted depth image is predicted based on the color image; a determination module, used to determine a prediction consistency threshold based on the original depth image and the predicted depth image, and to determine a depth similarity tolerance based on the original depth image, wherein the depth similarity tolerance is a threshold for the depth difference between any adjacent pixels in the original depth image, and the prediction consistency threshold is a threshold for the relative error of the depth of the same pixel in the original depth image and the predicted depth image; an identification module, used to identify multiple connected components from the original depth image based on the depth similarity tolerance and the prediction consistency threshold; and a denoising module, used to denoise the original depth image based on the multiple connected components to obtain the target depth image.

[0006] This application provides an electronic device, including: a memory, used to store computer-executable instructions or computer programs; and a processor that, when executing the computer-executable instructions or computer programs stored in the memory, implements the image processing method provided in the embodiments of this application.

[0007] This application provides a computer-readable storage medium storing computer-executable instructions or computer programs, which, when executed by a processor, implement the image processing method provided in this application.

[0008] This application provides a computer program product, which includes computer-executable instructions or a computer program. When the computer-executable instructions or the computer program are executed by a processor, the processor will execute the image processing method provided in this application.

[0009] The embodiments of this application have the following beneficial effects. A predicted depth image generated from a color image is used as a prior guide, and a prediction consistency threshold is determined. By comparing the relative error between the original depth and the model's predicted depth, semantic understanding is introduced into the denoising process: features in the original image that conform to the prediction pattern are identified as real features, while those that deviate significantly from the prediction are identified as noise. This distinguishes real features from noise and thereby improves the accuracy of denoising. Furthermore, compared with the pixel-level independent processing used in related technologies (such as filtering and smoothing), this application combines the depth similarity tolerance (continuity of adjacent pixels) and the prediction consistency threshold (semantic correctness) to extract connected components. This ensures that a spatially continuous and semantically correct real object surface is preserved as a whole, so that true depth transitions (object edges) are not blurred or destroyed, protecting the integrity of key spatial geometric structures and further improving the accuracy of denoising. In addition, the two core parameters on which the identification of connected components depends (the prediction consistency threshold and the depth similarity tolerance) are not fixed values but are determined dynamically from the original depth image and the predicted depth image. This application can therefore adaptively adjust the denoising intensity for different scenes (such as flat areas or complex edges) according to the depth gradient distribution and prediction error quality of the current input image, ensuring the stability and universality of the denoising effect and providing cross-scene adaptive denoising capability.

Attached Figure Description

[0010] Figure 1 is a schematic diagram of the architecture of the image processing system 100 provided in an embodiment of this application; Figure 2 is a schematic diagram of the structure of the electronic device provided in the embodiments of this application; Figure 3 is a schematic flowchart of the image processing method provided in the embodiments of this application; Figure 4 is a schematic diagram of the original depth image, color image, and predicted depth image provided in the embodiments of this application; Figure 5 is a schematic diagram of the target depth image provided in an embodiment of this application; Figure 6 is a flowchart illustrating the process of denoising the original depth image provided in the embodiments of this application.

Detailed Implementation

[0011] To make the objectives, technical solutions, and advantages of this application clearer, the application will be further described in detail below with reference to the accompanying drawings. The described embodiments should not be regarded as limitations on this application. All other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0012] In the following description, references are made to “some embodiments,” which describe a subset of all possible embodiments. However, it is understood that “some embodiments” may be the same subset or different subsets of all possible embodiments and may be combined with each other without conflict.

[0013] In the following description, the terms "first, second, third" are used merely to distinguish similar objects and do not represent a specific ordering of objects. It is understood that "first, second, third" may be interchanged in a specific order or sequence where permitted, so that the embodiments of this application described herein can be implemented in an order other than that illustrated or described herein.

[0014] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of this application only and is not intended to limit this application.

[0015] Before providing a further detailed description of the embodiments of this application, the nouns and terms involved in the embodiments of this application will be explained, and the nouns and terms involved in the embodiments of this application shall be interpreted as follows.

[0016] 1) Client, also known as user terminal, refers to the program that provides local services to users in contrast to the server. Except for some applications that can only run locally, it is generally installed on the terminal and needs to work with the server. That is, there needs to be a corresponding server and service program on the network to provide the corresponding services. Thus, a specific communication connection needs to be established between the client and the server to ensure the normal operation of the application.

[0017] 2) A depth estimation model refers to a computational model or algorithm system built on a deep neural network architecture, capable of inferring 3D scene geometric information from 2D image data. It typically includes substructures such as a feature extraction network, a cost volume construction module, and a disparity regression module, and fits the nonlinear mapping relationship between the input image and the output depth through iterative training on a large amount of data. In this application, the depth estimation model is trained on data that has undergone noise enhancement (i.e., images obtained by splitting stitched images after adding noise data). During the training phase, the model continuously updates its internal weight parameters through a backpropagation algorithm to minimize the difference between the predicted value and the depth label. For example, but not limited to, the model can be a stereo matching network based on a convolutional neural network, a monocular depth estimation network, or a multi-view stereo vision network architecture.

[0018] See Figure 1, which is a schematic diagram of the architecture of the image processing system 100 provided in the embodiments of this application. The terminal (terminal 400 is shown as an example) is connected to the server 200 through the network 300. The network 300 can be a wide area network or a local area network, or a combination of the two, and data transmission is achieved using wireless or wired links.

[0019] Terminal 400 is used to send a target image processing task to the server 200. The server 200 is configured to, upon receiving the target image processing task, acquire, based on the task, an original depth image for training a depth estimation model, and acquire a color image corresponding to the original depth image, as well as a corresponding predicted depth image, wherein the predicted depth image is predicted based on the color image; determine a prediction consistency threshold based on the original depth image and the predicted depth image, and determine a depth similarity tolerance based on the original depth image; wherein the depth similarity tolerance is a threshold for the depth difference between any adjacent pixels in the original depth image, and the prediction consistency threshold is a threshold for the relative error of the depth of the same pixel in the original depth image and the predicted depth image; identify multiple connected components from the original depth image based on the depth similarity tolerance and the prediction consistency threshold; and denoise the original depth image based on the multiple connected components to obtain the target depth image.

[0020] In some embodiments, server 200 can be a standalone physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDNs), and big data and artificial intelligence platforms. Terminal 400 can be a smartphone, tablet, laptop, desktop computer, set-top box, smart voice interaction device, smart home appliance, virtual reality device, vehicle terminal, aircraft, portable music player, personal digital assistant, dedicated messaging device, portable gaming device, smart speaker, and smartwatch, but is not limited to these. The terminal and server can be directly or indirectly connected via wired or wireless communication, which is not limited in this embodiment.

[0021] The electronic device implementing the image processing method provided in the embodiments of this application will now be described. See Figure 2, which is a schematic diagram of the structure of an electronic device provided in an embodiment of this application. The electronic device can be a server or a terminal. Taking the case where the electronic device is the server shown in Figure 1 as an example, the electronic device shown in Figure 2 includes at least one processor 410, a memory 450, at least one network interface 420, and a user interface 430. The various components in terminal 400 are coupled together via a bus system 440. It is understood that the bus system 440 is used to implement communication between these components. In addition to a data bus, the bus system 440 also includes a power bus, a control bus, and a status signal bus. However, for clarity, all buses are labeled as bus system 440 in Figure 2.

[0022] Processor 410 can be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. Among them, the general-purpose processor can be a microprocessor or any conventional processor, etc.

[0023] User interface 430 includes one or more output devices 431 that enable the display of media content, including one or more speakers and / or one or more visual displays. User interface 430 also includes one or more input devices 432, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.

[0024] The memory 450 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid-state storage, hard disk drives, optical disk drives, etc. The memory 450 may optionally include one or more storage devices physically located away from the processor 410.

[0025] The memory 450 may include volatile memory or non-volatile memory, or both. The non-volatile memory may be read-only memory (ROM), and the volatile memory may be random access memory (RAM). The memory 450 described in this application embodiment is intended to include any suitable type of memory.

[0026] In some embodiments, memory 450 is capable of storing data to support various operations, examples of which include programs, modules, and data structures or subsets or supersets thereof, as illustrated below.

[0027] Operating system 451 includes system programs for handling various basic system services and performing hardware-related tasks, such as the framework layer, core library layer, driver layer, etc., for implementing various basic business functions and handling hardware-based tasks; The network communication module 452 is used to reach other electronic devices via one or more (wired or wireless) network interfaces 420, exemplary network interfaces 420 including: Bluetooth, WiFi, and Universal Serial Bus (USB), etc. Presentation module 453 is configured to enable the display of information (e.g., user interface for operating peripheral devices and displaying content and information) via one or more output devices 431 (e.g., display screen, speaker, etc.) associated with user interface 430. The input processing module 454 is used to detect and translate one or more user inputs or interactions from one or more input devices 432.

[0028] In some embodiments, the apparatus provided in this application can be implemented in software. Figure 2 shows an image processing apparatus 455 stored in memory 450. This apparatus can be software in the form of programs and plugins, and includes the following software modules: an acquisition module 4551, a determination module 4552, an identification module 4553, and a denoising module 4554. These modules are logical, and can therefore be arbitrarily combined or further split according to the functions they implement. The functions of each module will be described below.

[0029] Based on the foregoing description of the image processing system and electronic device provided in the embodiments of this application, the image processing method provided in the embodiments of this application will be described below. In actual implementation, the image processing method provided in the embodiments of this application can be implemented by a terminal or a server alone, or by a terminal and a server working together. The following description takes as an example the server 200 in Figure 1 executing the image processing method provided in this application embodiment alone. See Figure 3, which is a schematic flowchart of the image processing method provided in the embodiments of this application. The steps shown in Figure 3 are explained below.

[0030] Step 101: Obtain the original depth image used to train the depth estimation model, and obtain the color image corresponding to the original depth image and the corresponding predicted depth image. The predicted depth image is obtained based on the color image.

[0031] It should be noted that the original depth image is an image containing initial depth distance information of the scene, acquired through a depth sensor or generated in a simulated environment. It is used as a supervision signal during the training of the depth estimation model, recording the actual physical distance values from each pixel in the 3D scene to the camera. For example, it can be an initial spatial distance image containing discrete noise acquired by a LiDAR. However, due to the physical limitations of the acquisition equipment or environmental interference, the original depth images actually acquired often contain quality defects such as discrete noise, local connected component errors, and discontinuous depth values.

[0032] In some embodiments, the original depth image and its corresponding color image can be obtained in the following ways: the three-dimensional distance information of the real scene is collected by a physical depth sensor (such as a structured light camera, a ToF time-of-flight camera, a lidar, etc.) as the original depth image, and the corresponding color image is collected synchronously by a conventional RGB camera that has been strictly calibrated with the depth sensor's intrinsic and extrinsic parameters and aligned with the pixel-level space; or, in a simulation environment such as virtual autonomous driving or robot navigation, the three-dimensional rendering engine directly and synchronously outputs a perfectly aligned depth ground truth map and a color rendering map.

[0033] It should be noted that the predicted depth image is the depth reference image output by using a pre-trained depth estimation model to perform feature inference on a color image. For example, it is a high-quality depth reference image generated after inputting a color image into a pre-trained depth estimation model.

[0034] For example, see Figure 4, which shows schematic diagrams of the original depth image, color image, and predicted depth image provided in embodiments of this application. In Figure 4, image 401 indicates the original depth image, image 402 indicates the color image, and image 403 indicates the predicted depth image.

[0035] Step 102: Based on the original depth image and the predicted depth image, determine the prediction consistency threshold, and based on the original depth image, determine the depth similarity tolerance; wherein, the depth similarity tolerance is the threshold of the depth difference between any adjacent pixels in the original depth image, and the prediction consistency threshold is the threshold of the relative error of the depth of the same pixel in the original depth image and the predicted depth image.
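The two thresholds defined in Step 102 can be read as two simple per-pixel tests. The following is a minimal sketch under stated assumptions, not the application's implementation: the function names are hypothetical, and the choice of the predicted depth as the denominator of the relative error is an assumption, since the text does not specify it.

```python
def relative_error(observed: float, predicted: float) -> float:
    """Relative error between the observed (original) and predicted depth.

    Assumption: the predicted depth is used as the denominator; a small
    floor avoids division by zero.
    """
    return abs(observed - predicted) / max(abs(predicted), 1e-6)


def is_prediction_consistent(observed: float, predicted: float,
                             consistency_threshold: float) -> bool:
    """Same pixel, two images: compare against the prediction consistency threshold."""
    return relative_error(observed, predicted) <= consistency_threshold


def is_depth_similar(depth_a: float, depth_b: float,
                     similarity_tolerance: float) -> bool:
    """Two adjacent pixels, one image: compare against the depth similarity tolerance."""
    return abs(depth_a - depth_b) <= similarity_tolerance


# Illustrative values only (metres); the thresholds are hypothetical.
print(is_prediction_consistent(2.2, 2.0, consistency_threshold=0.15))
print(is_depth_similar(2.00, 2.50, similarity_tolerance=0.05))
```

The first test compares one pixel across the two images, while the second compares two neighbouring pixels within the original depth image only.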

[0036] In some embodiments, the process of determining the prediction consistency threshold based on the original depth image and the predicted depth image may include: performing pixel alignment on the original depth image and the predicted depth image to obtain an alignment result; determining the depth value difference between the original depth image and the predicted depth image based on the alignment result; determining the mean, standard deviation, and absolute median of the prediction error based on the depth value difference; generating a region quality assessment value based on the mean, standard deviation, and absolute median of the prediction error; and constructing the prediction consistency threshold based on the region quality assessment value.

[0037] It should be noted that pixel alignment refers to a pixel-by-pixel correspondence matching operation between different images in a spatial coordinate system. It ensures that the depth values at the same spatial location in the original depth image and the predicted depth image can be compared accurately. The depth value difference (i.e., depth discrepancy) is the difference between the depth value of a pixel in the original depth image and the depth value of the pixel at the same spatial location in the predicted depth image. It quantifies the magnitude of the error between the original and predicted depth images; for example, it is the absolute deviation obtained by subtracting the predicted depth value of a pixel in the predicted depth image from the observed depth value of the same pixel in the original depth image. Pixel alignment between the original and predicted depth images can be achieved by using bilinear interpolation or nearest-neighbor interpolation to scale the resolution of the predicted depth image to match that of the original depth image. Subsequently, a mask matrix is constructed by extracting the set of pixel coordinates with valid depth from the original depth image. This mask matrix is then applied to the predicted depth image through a bitwise AND operation, ensuring that the valid computational regions of both images completely overlap. The depth value difference can then be determined through element-wise subtraction of the two matrices, taking either the absolute value or a relative ratio.
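The alignment, masking, and subtraction steps described above might be sketched as follows. This is an illustrative sketch only: it uses plain Python lists to stay self-contained, picks nearest-neighbor interpolation from the two options named in the text, and assumes that invalid depth is encoded as 0.0. The function names are hypothetical.

```python
def resize_nearest(image, out_h, out_w):
    """Scale a 2D list to out_h x out_w with nearest-neighbor sampling."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def depth_differences(original, predicted, invalid=0.0):
    """Return (diff, mask): absolute depth differences on valid pixels.

    Assumption: a pixel of `original` is valid when it differs from `invalid`.
    Masked-out positions are reported as None.
    """
    h, w = len(original), len(original[0])
    aligned = resize_nearest(predicted, h, w)  # match the original resolution
    mask = [[original[r][c] != invalid for c in range(w)] for r in range(h)]
    diff = [
        [abs(original[r][c] - aligned[r][c]) if mask[r][c] else None
         for c in range(w)]
        for r in range(h)
    ]
    return diff, mask


original = [
    [1.0, 1.0, 0.0, 2.0],  # 0.0 marks a pixel with no valid depth
    [1.0, 5.0, 2.0, 2.0],  # 5.0 deviates strongly from the prediction
]
predicted = [[1.0, 2.0]]   # lower-resolution prediction (1 x 2)
diff, mask = depth_differences(original, predicted)
print(diff)
```

Dividing each difference by the aligned predicted depth would give the relative-ratio variant mentioned in the text.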

[0038] The mean prediction error refers to the arithmetic mean of all depth value differences within a specific spatial region. For example, it is the average obtained by summing the depth value differences of one hundred pixels within that region and dividing by one hundred. The standard deviation is a quantitative indicator of the dispersion of all depth value differences within a specific spatial region around the mean prediction error. Here, it reflects the degree of fluctuation in depth data quality within that region; for example, it is a fluctuation amplitude calculated from the depth value differences and the mean prediction error. The absolute median deviation (the "absolute median" mentioned above) refers to the median of the absolute deviations of all depth value differences within a specific spatial region. Here, it measures the symmetry of the depth value difference distribution in order to identify extreme outlier noise pixels; for example, it is the value at the middle position after sorting the absolute deviations of the depth value differences from the overall median. Here, absolute deviation refers to the difference between a single measurement and the average of multiple measurements.
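As a small worked example, the three statistics can be computed with Python's standard `statistics` module. This sketch makes two assumptions that the text leaves open: the standard deviation is the population form, and the absolute deviations are taken from the median of the differences (the text mentions both the median and the average as reference points). The function name is hypothetical.

```python
from statistics import fmean, median, pstdev


def error_statistics(differences):
    """Mean, standard deviation, and absolute median deviation of one region."""
    mean_error = fmean(differences)
    std_error = pstdev(differences, mu=mean_error)  # population standard deviation
    center = median(differences)
    # Median of absolute deviations from the median (assumed reference point).
    abs_median_dev = median(abs(d - center) for d in differences)
    return mean_error, std_error, abs_median_dev


# Depth value differences of one local region; 4.0 is an outlier.
region = [0.1, 0.1, 0.2, 0.2, 4.0]
mean_error, std_error, abs_median_dev = error_statistics(region)
print(mean_error, std_error, abs_median_dev)
```

In this example the single outlier pulls the mean and the standard deviation up sharply, while the median-based statistic stays close to the typical difference, which is why it is useful for spotting extreme outlier pixels.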

[0039] The region quality assessment value refers to a comprehensive quantitative score reflecting the consistency between a specific spatial region in the original depth image and the model's prior knowledge. Here, it is used to objectively evaluate the data credibility level of local spatial regions in the original depth image, where the region quality assessment value belongs to the interval [0, 1]. The prediction consistency threshold refers to the maximum threshold of the relative error of the depth of the same pixel in the original depth image and the predicted depth image. Here, it is used to distinguish between true geometric features and noise artifacts in the original depth image.

[0040] In practice, after acquiring the original depth image and the predicted depth image, the spatial coordinate systems of the original depth image and the predicted depth image are uniformly transformed to perform pixel alignment, obtaining the alignment result. Then, based on the alignment result, pixels with identical spatial mapping positions in both the original and predicted depth images are extracted. For each extracted pixel, the observed depth value in the original depth image and the inferred depth value in the predicted depth image are obtained. The difference between the observed and inferred depth values is then calculated, and the absolute value of this difference is defined as the depth value difference between the original and predicted depth images.

[0041] Next, the original depth image is divided into multiple local regions, which are the specific spatial regions mentioned earlier. For each local region, the average of all depth value differences within that region is calculated and taken as the mean prediction error. Then, the standard deviation of all depth value differences within that local region relative to the mean prediction error is calculated. Next, the median of the absolute deviations of all depth value differences within that local region is calculated and taken as the absolute median deviation. Finally, the mean prediction error, standard deviation, and absolute median deviation are input into a preset quality quantification formula to generate a region quality assessment value, i.e.: ... Formula (1); where the symbols in Formula (1) denote, in order: the standard deviation of the depth differences; the mean of the depth differences, which is also the mean prediction error; the absolute median deviation of the depth differences; the range of depth values, specifically the difference between the maximum and minimum depth values; and the region quality assessment value.

[0042] It should be noted that the calculations of the standard deviation, the absolute deviation, the median of the absolute deviations, and the average of all depth value differences are all existing techniques and will not be described in detail in the embodiments of this application.

[0043] Then, the process of constructing the prediction consistency threshold based on the region quality assessment value can be as follows. First, a preset first quality benchmark threshold and a preset second quality benchmark threshold are obtained, where the first quality benchmark threshold is greater than the second quality benchmark threshold. For each local region, if the region quality assessment value of that local region is greater than or equal to the first quality benchmark threshold, the local region is determined to be a high-consistency range, and a first prediction consistency threshold corresponding to that high-consistency range is generated. If the region quality assessment value is less than the first quality benchmark threshold but greater than or equal to the second quality benchmark threshold, the local region is determined to be a medium-consistency range, and a second prediction consistency threshold corresponding to that medium-consistency range is generated. If the region quality assessment value is less than the second quality benchmark threshold, the local region is determined to be a low-consistency range, and a third prediction consistency threshold corresponding to that low-consistency range is generated. The first, second, and third prediction consistency thresholds together form the constructed prediction consistency threshold. The three thresholds can be randomly generated, as long as they respect the required size relationship, namely that the first, second, and third prediction consistency thresholds decrease in that order.

[0044] Specifically, the distribution of the regional quality assessment values of all regions is statistically analyzed and divided into several quality levels (e.g., high-quality regions have a regional quality assessment value greater than 0.8, medium-quality regions have a regional quality assessment value greater than 0.5 and less than or equal to 0.8, and low-quality regions have a regional quality assessment value less than or equal to 0.5). For high-quality regions, since the original depth image and the predicted depth image are highly consistent, the original depth image in this region is generally reliable. Therefore, a relatively lenient prediction consistency threshold is set (i.e., allowing for larger prediction errors) to avoid over-cleaning and losing valuable geometric details. For low-quality regions, the original depth image and the predicted depth image deviate significantly, indicating a high probability of noise in the data in this region. Therefore, a strict prediction consistency threshold is set (i.e., tolerating only very small prediction errors) to ensure that noise is effectively identified and removed. This hierarchical strategy allows the cleaning behavior to adapt to the data quality of each region, avoiding both over-cleaning of high-quality regions and missed noise in low-quality regions.
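The tiered mapping described in the two paragraphs above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name, the benchmark values 0.8 and 0.5 (taken from the example quality levels), and the three threshold values are assumptions chosen only to satisfy the stated ordering.

```python
def prediction_consistency_threshold(quality, q_high=0.8, q_low=0.5,
                                     thresholds=(0.30, 0.15, 0.05)):
    """Map a regional quality assessment value to a consistency tier.

    q_high and q_low play the role of the first and second quality
    benchmark thresholds (q_high > q_low). `thresholds` holds the first,
    second and third prediction consistency thresholds in decreasing
    order. All numeric defaults are illustrative assumptions.
    """
    if q_high <= q_low:
        raise ValueError("first benchmark must exceed second benchmark")
    first, second, third = thresholds
    if not first > second > third:
        raise ValueError("thresholds must decrease in sequence")
    if quality >= q_high:
        return "high", first      # reliable region: lenient threshold
    if quality >= q_low:
        return "medium", second
    return "low", third           # noisy region: strict threshold
```

A call such as `prediction_consistency_threshold(0.6)` returns the medium tier together with the second threshold.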

[0045] In some embodiments, the process of determining the depth similarity tolerance based on the original depth image may involve analyzing the depth gradient distribution of the original depth image and constructing the depth similarity tolerance based on the depth gradient distribution.

[0046] It should be noted that the depth gradient distribution refers to the statistical representation of the rate of change, in the spatial dimension, of depth observation values between adjacent pixels in the original depth image. Here, it is used to quantify the flatness or edge steepness of each spatial region in the original depth image. The depth similarity tolerance refers to the maximum threshold for determining whether the depth difference between any two adjacent pixels in the original depth image belongs to a reasonable continuous physical structure. Here, it serves as a geometric consistency boundary condition for distinguishing between the continuity of a real physical surface and the isolation of discrete noise.

[0047] In actual implementation, after acquiring the original depth image, the following processing is performed for each pixel in the original depth image to obtain the depth change rate of that pixel: determine multiple neighboring pixels of that pixel, and determine the depth value of that pixel and the depth value of each neighboring pixel; for each neighboring pixel, determine the depth change rate between that pixel and that neighboring pixel; from the depth change rates corresponding to the multiple neighboring pixels, select the maximum depth change rate as the depth change rate corresponding to that pixel.
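A minimal sketch of this per-pixel computation is given below. It assumes a depth image stored as a list of rows, a 4-connected neighborhood, and unit pixel spacing, so that the depth change rate between two neighbors reduces to the absolute depth difference. None of these choices is fixed by the text.

```python
def depth_change_rate(depth):
    """Per-pixel depth change rate over a 4-connected neighbourhood.

    For each pixel, the rate is the largest absolute depth difference
    to any in-image neighbour (assumption: unit pixel spacing).
    """
    rows, cols = len(depth), len(depth[0])
    rate = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            best = 0.0
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    best = max(best, abs(depth[r][c] - depth[nr][nc]))
            rate[r][c] = best
    return rate
```

For a small image with a vertical depth step, only the pixels on either side of the step receive a non-zero rate.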

[0048] Next, the depth change rates of all pixels are aggregated; the depth change rate represents the depth jump amplitude of a local region in the image. The frequency of the depth change rates is then counted by numerical range to determine the proportions of regions with gradual depth changes and regions with dramatic depth changes in the original depth image, which yields the depth gradient distribution of the original depth image. Based on the depth gradient distribution, the original depth image is divided into multiple local regions, and a corresponding depth similarity tolerance is generated for each local region. The depth similarity tolerances of different local regions can be the same or different. It should be noted that the local regions mentioned here may be the same as or different from the local regions described above; this application does not limit this. The depth change rates of pixels within the same local region are approximately the same. The local regions include flat regions and abrupt-change regions: in a flat region, the depth change rate of the pixels is less than or equal to a preset threshold, while in an abrupt-change region it is greater than the preset threshold. The depth similarity tolerance of a local region increases with the depth change rate of its pixels; for example, the depth similarity tolerance for flat regions is smaller than that for abrupt-change regions. Adjusting the depth similarity tolerance according to the depth gradient distribution has two effects. In flat regions with small depth gradients (such as walls and floors), the depth values of adjacent pixels should change very gradually, so a smaller depth similarity tolerance is set so that pixels with even slight abrupt changes in depth value are judged as disconnected, thereby accurately identifying noise points in flat regions. At object edges or in transition regions with large depth gradients, the depth values themselves exhibit significant and reasonable changes, so a larger depth similarity tolerance is set to avoid misjudging reasonable depth jumps as noise and to protect the true depth edge structure. The specific tolerance value for each local region can be generated arbitrarily (for example, randomly), as long as the tolerance increases with the depth change rate of the pixels in the local region; that is, the smaller the change rate, the smaller the depth similarity tolerance of the local region.
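The flat versus abrupt-change rule can be illustrated with a small sketch. It assigns a tolerance per pixel rather than per local region, which is a simplification, and the preset threshold and the two tolerance values are hypothetical numbers chosen only so that the flat tolerance is the smaller one.

```python
def depth_similarity_tolerance(rate, flat_threshold=0.05,
                               flat_tol=0.02, abrupt_tol=0.20):
    """Assign a tolerance to every pixel from its depth change rate.

    Pixels whose rate is at or below `flat_threshold` are treated as
    flat and get the smaller tolerance; all others get the larger one.
    All three numbers are illustrative assumptions.
    """
    if flat_tol >= abrupt_tol:
        raise ValueError("flat tolerance must be the smaller one")
    return [[flat_tol if v <= flat_threshold else abrupt_tol for v in row]
            for row in rate]
```

The input is a grid of depth change rates with the same shape as the depth image.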

[0049] Step 103: Based on depth similarity tolerance and prediction consistency threshold, identify multiple connected components from the original depth image.

[0050] It should be noted that a connected component refers to a set of pixels that are adjacent (close to each other) in image space and satisfy specific similarity feature constraints, i.e., a local image region. In this application, a connected component is specifically represented as a physical region in the depth map that is depth-continuous and whose model prediction is reliable. It usually corresponds to the surface of a real independent object or a complete geometric continuous structure in a 3D scene.

[0051] In some embodiments, the process of identifying multiple connected components from the original depth image based on the depth similarity tolerance and the prediction consistency threshold may involve: traversing the pixels in the original depth image; for the i-th pixel traversed, if the connected component to which the i-th pixel belongs has not yet been determined, determining at least one neighboring pixel of the i-th pixel, where i is an integer greater than or equal to 1; for each neighboring pixel, determining, based on the depth similarity tolerance and the prediction consistency threshold, that the neighboring pixel and the i-th pixel belong to the same connected component when the neighboring pixel satisfies a target condition; wherein the target condition includes at least one of the following: the depth difference between the neighboring pixel and the i-th pixel is less than or equal to the depth similarity tolerance; the relative error of the depth of the neighboring pixel in the original depth image and the predicted depth image is less than or equal to the prediction consistency threshold; and, when all the pixels in the original depth image have been traversed, obtaining the multiple connected components in the original depth image.

[0052] It should be noted that adjacent pixels refer to surrounding pixels that are directly adjacent to the target pixel in the image coordinate system or are within a preset adjacency range. For example, they are surrounding pixels in the four orthogonal directions (up, down, left, and right) determined with the target pixel as the center point.

[0053] Meanwhile, the depth similarity tolerance, as mentioned earlier, is the maximum allowable depth difference between adjacent pixels. It is used to determine whether adjacent pixels exhibit structural continuity in physical space. Similarly, the prediction consistency threshold, also as mentioned earlier, is the maximum allowable relative error in depth between the original and predicted depth images for the same pixel. In practice, after acquiring the original depth image and the predicted depth image, the pixels in the original depth image are traversed one by one. For the i-th pixel traversed, it is determined whether the connected component to which the i-th pixel belongs has already been determined. If it has, the i-th pixel is skipped, and the traversal continues with the next pixel in the original depth image.

[0054] If the connected component to which the i-th pixel belongs has not yet been determined, then at least one neighboring pixel of the i-th pixel is determined. For each neighboring pixel, the pre-calculated depth similarity tolerance and prediction consistency threshold are obtained, and it is then determined, based on the depth similarity tolerance and the prediction consistency threshold, whether the neighboring pixel meets the target condition. Specifically, the depth difference between the neighboring pixel and the i-th pixel is calculated, and it is determined whether the depth difference is less than or equal to the depth similarity tolerance; at the same time, the relative error of the depth of the neighboring pixel in the original depth image and the predicted depth image is calculated, and it is determined whether the relative error is less than or equal to the prediction consistency threshold. When the depth difference is less than or equal to the depth similarity tolerance, and/or the relative error is less than or equal to the prediction consistency threshold, the neighboring pixel is determined to meet the target condition.

[0055] It should be noted that, as mentioned above, there are multiple depth similarity tolerances and multiple prediction consistency thresholds. Therefore, before determining whether adjacent pixels meet the target conditions based on the depth similarity tolerances and prediction consistency thresholds, the local region corresponding to the adjacent pixel is first determined, and then the depth similarity tolerances and prediction consistency thresholds corresponding to the local region are determined as the target depth similarity tolerances and target prediction consistency thresholds for the adjacent pixel. Then, based on the target depth similarity tolerances and target prediction consistency thresholds, it is determined whether the adjacent pixels meet the target conditions.

[0056] Meanwhile, as mentioned above, the target condition includes at least one of the following: the depth difference between an adjacent pixel and the i-th pixel is less than or equal to the depth similarity tolerance; the relative error of the depth of the adjacent pixel in the original depth image and the predicted depth image is less than or equal to the prediction consistency threshold. Taking as an example the case in which the target condition includes both of these, whether an adjacent pixel meets the target condition is determined, based on the target depth similarity tolerance and the target prediction consistency threshold, as follows:

$$C(p_i, p_j) = \big( |D(p_i) - D(p_j)| \le \tau_d \big) \land \big( E(p_j) \le \tau_c \big), \qquad E(p_j) = \frac{|D(p_j) - \hat{D}(p_j)|}{\hat{D}(p_j)} \qquad \text{Formula (2)}$$

where $p_i$ refers to the i-th pixel; $p_j$ refers to an adjacent pixel of the i-th pixel; $D(p_i)$ refers to the ground-truth depth of the i-th pixel, i.e., the true depth value at that pixel's location, which is the distance from the 3D scene point corresponding to that pixel to the camera. The ground truth is usually obtained directly by a depth sensor (such as a structured light camera, a ToF camera, or a LiDAR), or output directly by the rendering engine in a simulation environment. $D(p_j)$ refers to the ground-truth depth of the adjacent pixel; $\tau_d$ refers to the depth similarity tolerance; $\tau_c$ refers to the prediction consistency threshold; $C(p_i, p_j)$ is a Boolean value indicating whether the adjacent pixel $p_j$ and the pixel $p_i$ are connected; and $E(p_j)$ refers to the relative error of pixel $p_j$, i.e., the relative deviation of the depth of pixel $p_j$ between the original depth image and the predicted depth image, which is the ratio of the absolute value of the difference between the ground-truth depth and the predicted depth of the adjacent pixel to the predicted depth. The ground-truth depth of the adjacent pixel is its depth value in the original depth image, and the predicted depth $\hat{D}(p_j)$ of the adjacent pixel is its depth value in the predicted depth image.
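The connectivity test described for Formula (2), in the variant where both conditions must hold, can be sketched as a small predicate. The function and parameter names are hypothetical, and returning infinity for a non-positive predicted depth is an assumption made here so that the division is always defined.

```python
def relative_error(measured, predicted):
    """|measured - predicted| / predicted, as described for Formula (2)."""
    if predicted <= 0:
        # Assumption: an unusable prediction counts as inconsistent.
        return float("inf")
    return abs(measured - predicted) / predicted


def is_connected(d_i, d_j, pred_j, depth_tol, consistency_thr):
    """True when neighbour j may join the component of pixel i.

    d_i, d_j: depths of pixel i and neighbour j in the original image.
    pred_j:   depth of neighbour j in the predicted image.
    """
    depth_ok = abs(d_i - d_j) <= depth_tol
    prediction_ok = relative_error(d_j, pred_j) <= consistency_thr
    return depth_ok and prediction_ok
```

Replacing `and` with `or` in the last line gives the other reading permitted by "at least one of the following".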

[0057] Then, when it is determined that an adjacent pixel satisfies the target condition, it is determined that the adjacent pixel and the i-th pixel belong to the same connected component. In this case, the adjacent pixel is assigned the membership label of the corresponding connected component, and its state is updated to indicate that it has been assigned to a connected component. The search then continues from the adjacent pixel as a new starting point, examining the next layer of adjacent pixels against the target condition. Conversely, when it is determined that an adjacent pixel does not satisfy the target condition, it is determined that the adjacent pixel and the i-th pixel do not belong to the same connected component, the adjacent pixel is not added to the connected component, and the search path towards that adjacent pixel is truncated. This traversal continues until all pixels in the original depth image have been traversed, at which point every valid pixel in the original depth image has a membership label. Based on the distribution of the membership labels, the terminal obtains the multiple connected components in the original depth image.
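The traversal described above behaves like a breadth-first region growing. The sketch below is one possible reading under several assumptions: images are lists of rows, a depth of 0 marks an invalid pixel, the neighborhood is 4-connected, both conditions must hold, and a single scalar tolerance and threshold are used even though the text allows per-region values.

```python
from collections import deque


def label_components(depth, pred, depth_tol, consistency_thr):
    """Label connected components by breadth-first region growing.

    Returns (labels, count); label 0 means invalid, 1..count are
    component ids. Scalar tolerance/threshold are a simplification.
    """
    rows, cols = len(depth), len(depth[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if labels[r][c] or depth[r][c] <= 0:
                continue  # already assigned, or invalid depth
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:
                cr, cc = queue.popleft()
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    nr, nc = cr + dr, cc + dc
                    if not (0 <= nr < rows and 0 <= nc < cols):
                        continue
                    if labels[nr][nc] or depth[nr][nc] <= 0 or pred[nr][nc] <= 0:
                        continue
                    close = abs(depth[nr][nc] - depth[cr][cc]) <= depth_tol
                    err = abs(depth[nr][nc] - pred[nr][nc]) / pred[nr][nc]
                    if close and err <= consistency_thr:
                        labels[nr][nc] = count
                        queue.append((nr, nc))
                    # otherwise the search path toward this neighbour stops
    return labels, count
```

A pixel that fails the test is not lost: it later becomes the seed of its own component, which the subsequent denoising step can then classify.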

[0058] Step 104: Based on multiple connected components, denoise the original depth image to obtain the target depth image.

[0059] It should be noted that after determining multiple connected components, the original depth image can be denoised based on these components to obtain the target depth image. For example, referring to Figure 5, which is a schematic diagram of the target depth image provided in the embodiments of this application, image 501 shows the original depth image and image 502 shows the target depth image. After the target depth image is determined, the depth estimation model can be trained based on it.

[0060] In some embodiments, the process of denoising the original depth image based on multiple connected components to obtain the target depth image is described with reference to Figure 6, which is a flowchart of denoising the original depth image provided in an embodiment of this application. Based on Figure 6, the process of denoising the original depth image based on multiple connected components to obtain the target depth image is achieved through the following steps.

[0061] Step 1041: Based on the original depth image, analyze the effective pixel density and scene complexity of the original depth image, and determine the minimum connected component area threshold based on the effective pixel density and scene complexity.

[0062] It should be noted that effective pixel density refers to the proportion of pixels with effective depth detection values in the original depth image to the total number of pixels in the image. Here, it is used to assess the data integrity of the original depth image. An effective depth detection value is a reasonable value representing the true physical distance, successfully measured and recorded by the sensor at a specific pixel. Scene complexity is a comprehensive quantitative indicator reflecting the degree of intersection of physical environment contours and the richness of depth levels in the original depth image. Here, it is used to indicate whether to preserve the true details and structure of the image or to remove isolated noise. The minimum connected component area threshold is a preset threshold for distinguishing isolated noise fragments from real minute physical structures. Here, it serves as a benchmark reference value for dividing connected components into different area categories.

[0063] In practical implementation, after acquiring the original depth image and the multiple connected components contained within it, the number of effective pixels in the original depth image, i.e., pixels whose depth value is neither zero nor infinite, is counted. The effective pixel density is obtained by dividing the number of effective pixels by the total number of pixels in the original depth image. Simultaneously, the total length of the edges where depth transitions occur is extracted from the original depth image, and the ratio of this total edge length to the number of effective pixels is calculated to obtain the scene complexity of the original depth image. The effective pixel density and the scene complexity are then input into a preset threshold calculation function, which generates the minimum connected component area threshold based on the product of the effective pixel density and the scene complexity together with a weighting coefficient.
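The three quantities in this step can be sketched as follows. The text does not specify how edge length is measured or what the threshold calculation function looks like, so this sketch assumes that edge length is the number of adjacent valid pixel pairs whose depths differ by more than a hypothetical `edge_jump`, and that the threshold is simply `weight * density * complexity` with a hypothetical weighting coefficient.

```python
import math


def min_component_area(depth, edge_jump=0.1, weight=50.0):
    """Return (density, complexity, area_threshold) for a depth image.

    edge_jump and weight are illustrative assumptions; the product form
    of the threshold is one possible reading of the description.
    """
    rows, cols = len(depth), len(depth[0])

    def valid(v):
        # Effective pixel: neither zero nor infinite.
        return v > 0 and math.isfinite(v)

    n_valid = sum(1 for row in depth for v in row if valid(v))
    if n_valid == 0:
        return 0.0, 0.0, 0.0
    density = n_valid / (rows * cols)

    edge_length = 0
    for r in range(rows):
        for c in range(cols):
            for nr, nc in ((r + 1, c), (r, c + 1)):  # each pair counted once
                if nr < rows and nc < cols:
                    a, b = depth[r][c], depth[nr][nc]
                    if valid(a) and valid(b) and abs(a - b) > edge_jump:
                        edge_length += 1
    complexity = edge_length / n_valid
    return density, complexity, weight * density * complexity
```

In practice the weighting coefficient would need tuning per sensor and resolution; the value used here has no basis beyond illustration.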

[0064] Step 1042: Extract features from each connected component in the multiple connected components to obtain multidimensional features of the connected components.

[0065] It should be noted that multidimensional features are used to reflect the shape of the external contour of a connected component and the state of its internal data, including geometric features, statistical features, and consistency features.

[0066] In practice, the multiple connected components are traversed. For each connected component, its geometric features (such as area and shape complexity/compactness), its statistical features (such as internal depth variance and depth gradient distribution), and its consistency features (such as average prediction error and error standard deviation) are extracted. The geometric, statistical, and consistency features together form the multidimensional features of the connected component.
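A compact sketch of such a feature extractor is shown below. The dictionary keys are hypothetical names, compactness is approximated by the fill ratio of the bounding box (an assumption, since the text does not define it), and the relative error uses the predicted depth as denominator, which is assumed to be positive.

```python
import math


def component_features(pixels, depth, pred):
    """Geometric, statistical and consistency features of one component.

    pixels: non-empty list of (row, col) coordinates in the component.
    Assumes pred[r][c] > 0 for every listed pixel.
    """
    n = len(pixels)
    rs = [r for r, _ in pixels]
    cs = [c for _, c in pixels]
    bbox_area = (max(rs) - min(rs) + 1) * (max(cs) - min(cs) + 1)

    depths = [depth[r][c] for r, c in pixels]
    errors = [abs(depth[r][c] - pred[r][c]) / pred[r][c] for r, c in pixels]
    mean_d = sum(depths) / n
    mean_e = sum(errors) / n
    return {
        "area": n,                                   # geometric
        "compactness": n / bbox_area,                # geometric (proxy)
        "depth_variance": sum((d - mean_d) ** 2 for d in depths) / n,
        "mean_error": mean_e,                        # consistency
        "error_std": math.sqrt(sum((e - mean_e) ** 2 for e in errors) / n),
    }
```

Only the area is needed for the classification step; the remaining features are available to later stages.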

[0067] Step 1043: Based on the minimum connected component area threshold and the multidimensional features, determine the category of each connected component; wherein the categories of connected components include small-area connected components and large-area connected components, and different categories correspond to different denoising methods.

[0068] It should be noted that a small-area connected component refers to a connected component whose pixel coverage area is smaller than the minimum connected component area threshold. This indicates that the current connected component is very likely to be discrete noise, and a denoising method that focuses on verification against the prediction is required. A large-area connected component refers to a connected component whose pixel coverage area is greater than or equal to the minimum connected component area threshold. This indicates that the current connected component most likely represents the surface of a real physical object, and a denoising method that focuses on protecting the overall structure and correcting local defects is required.

[0069] In practical implementation, the process of determining the category of each connected component based on the minimum connected component area threshold and multidimensional features involves the following steps: When the multidimensional features include geometric features, the geometric features are first extracted, and then the actual area value of the connected component is parsed from these features. The actual area value is compared with the minimum connected component area threshold. If the actual area value is greater than or equal to the minimum connected component area threshold, the connected component is determined to have a sufficient area, i.e., it is classified as a large-area connected component. If the actual area value is less than the minimum connected component area threshold, the connected component is determined to have a small area, i.e., it is classified as a small-area connected component.
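The area comparison can be sketched directly on a label grid. The function name and the `"large"`/`"small"` strings are hypothetical; label 0 is assumed to mark pixels that belong to no component.

```python
from collections import Counter


def classify_components(labels, min_area):
    """Map each component label to 'large' or 'small' by pixel count."""
    areas = Counter(v for row in labels for v in row if v != 0)
    return {label: ("large" if area >= min_area else "small")
            for label, area in areas.items()}
```

A component whose area equals the threshold is classified as large, matching the "greater than or equal to" comparison in the paragraph above.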

[0070] Step 1044: For each connected component, a denoising method corresponding to the category of the connected component is used to denoise the original depth image to obtain the target depth image.

[0071] It should be noted that the target depth image refers to the final depth image output after identifying and removing erroneous noise pixels. This image serves as a high-quality data foundation for training downstream depth estimation models, such as a clean depth image that has removed a large amount of flying point noise and retained clear object edges.

[0072] In practice, as mentioned above, different categories correspond to different denoising methods, so the process of denoising the original depth image to obtain the target depth image differs between categories of connected components. The processes for small-area connected components and for large-area connected components are explained separately below.

[0073] In some embodiments, the process of denoising the original depth image to obtain the target depth image by adopting a denoising method corresponding to the category of the connected component for each connected component may be as follows: when the category of the connected component is a small-area connected component, the following processing is performed for each connected component: determining the region in the predicted depth image corresponding to the connected component as the first region; evaluating the consistency between the first region and the connected component to obtain the evaluation result; if the evaluation result indicates that the first region and the connected component are inconsistent, then the connected component is deleted as noise; if the evaluation result indicates that the first region and the connected component are consistent, then the connected component is retained; and obtaining the target depth image based on the processed connected components.

[0074] It should be noted that, as mentioned earlier, a small-area connected component refers to a locally continuous pixel region, i.e., a local image region, whose total number of pixels is below the minimum connected component area threshold. Here, it identifies tiny regions whose depth remains to be verified because they are easily confused with discrete flying-point noise. The first region refers to the pixel reference matrix extracted from the predicted depth image that completely coincides, in spatial coordinates, with the connected component in the original depth image.

[0075] Consistency refers to the degree of similarity between the original depth observation values in the connected component and the model-predicted depth values in the first region. This is used to distinguish whether a small-area connected component represents real physical detail or random error generated by the detection equipment. Noise refers to erroneous observation data or meaningless interfering pixels in the depth image that significantly deviate from the actual 3D spatial structure of the physical scene.

[0076] In practical implementation, the original depth image, the predicted depth image, and the category of each connected component are acquired. When the category of a connected component is a small-area connected component, the terminal performs the following processing for that connected component. The two-dimensional absolute coordinates of all pixels in the connected component are extracted and mapped to the coordinate system of the predicted depth image; the set of pixels in the predicted depth image with identical two-dimensional absolute coordinates is then found, which determines the region in the predicted depth image corresponding to the connected component, designated as the first region. Next, the depth value in the original depth image of each pixel within the connected component is extracted as the original observed depth value, and the depth value in the predicted depth image of each corresponding pixel within the first region is extracted as the model-predicted depth value; the absolute value of the difference between the original observed depth value and the model-predicted depth value is calculated and divided by the original observed depth value to obtain the relative prediction error of each pixel. The relative prediction errors of all pixels within the connected component are then summed and divided by the number of pixels to obtain the average relative prediction error, which is used to evaluate the consistency between the first region and the connected component. Specifically, a preset average deviation threshold is first obtained. If the average relative prediction error is less than or equal to the average deviation threshold, an evaluation result indicating that the first region is consistent with the connected component is obtained; the connected component is thus determined to be a real small geometric structure supported by the model prior, and its original observed depth values are left unchanged so that the connected component is retained. If the average relative prediction error is greater than the average deviation threshold, an evaluation result indicating that the first region is inconsistent with the connected component is obtained; the connected component is thus determined to be anomalous outlier data lacking the support of a physical structure, and all pixels in the connected component are marked as invalid so that the connected component is deleted as noise.

[0077] In this way, the deletion or retention operation is performed on every connected component in the original depth image that is classified as a small-area connected component. The remaining valid connected components that were not deleted, including the retained small real geometric structures, are then collected. Pixels marked as invalid are filtered out or replaced with a blank state. Finally, the target depth image is obtained based on the processed connected components.
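The keep-or-delete decision for one small-area connected component can be sketched as below. The default average deviation threshold of 0.1 and the use of 0.0 as the invalid marker are assumptions; the relative error is divided by the original observed depth, as described for this step, and the observed depths of component pixels are assumed to be positive.

```python
def filter_small_component(pixels, depth, pred,
                           mean_error_threshold=0.1, invalid=0.0):
    """Keep or delete one small-area component, editing `depth` in place.

    pixels: non-empty list of (row, col) coordinates of the component.
    Returns True if the component is retained, False if deleted.
    """
    errors = [abs(depth[r][c] - pred[r][c]) / depth[r][c] for r, c in pixels]
    mean_error = sum(errors) / len(errors)
    if mean_error <= mean_error_threshold:
        return True              # consistent with the prediction: retain
    for r, c in pixels:
        depth[r][c] = invalid    # inconsistent: mark as invalid (noise)
    return False
```

Because the same coordinates index both images, the "first region" is simply the set of `pred[r][c]` values at the component's pixels.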

[0078] In other embodiments, the process of denoising the original depth image to obtain the target depth image by adopting a denoising method corresponding to the category of the connected component for each connected component may be as follows: when the category of the connected component is a large-area connected component, the boundary pixels in the connected component are determined, and the internal consistency of the connected component and the boundary sharpness corresponding to each boundary pixel are determined; based on the internal consistency and boundary sharpness, the connected component is locally modified to retain the main objects in the connected component and obtain the target depth image.

[0079] It should be noted that, as mentioned earlier, a large-area connected component refers to a locally continuous pixel region, i.e., a local image region, whose total number of effective pixels reaches or exceeds the minimum connected component area threshold. Here, it characterizes local image regions that have a high probability of representing a real physical scene (such as the ground or walls), so that the overall structure is not misclassified as noise. Boundary pixels are pixels located at the outermost edge of a large-area connected component, whose four-neighborhood or eight-neighborhood contains pixels that do not belong to the large-area connected component; for example, edge contour pixels of a large-area connected component that directly adjoin the background. Internal consistency refers to the uniformity of the depth values of the pixels within a large-area connected component and the stability of their relative prediction errors. Here, it is used to identify and separate local abrupt changes or isolated data within a large region.

[0080] Boundary sharpness refers to the degree of gradient change in the depth difference between a boundary pixel and its adjacent pixels that are not part of the large-area connected component. Here, it is used to assess whether detection blur or ghosting occurs at the physical edges of a large-area connected component, in order to decide whether to crop the edges. The principal object refers to the core object in the large-area connected component that occupies the dominant area and has a relatively stable and continuous depth distribution after local correction; for example, the high-precision smooth wall structure that remains after internal flying points and edge ghosting are removed.

[0081] In practice, the process of determining boundary pixels within a connected component can be as follows: If the connected component is classified as a large-area connected component, traverse all pixels within the large-area connected component and, for each currently traversed pixel, determine its neighboring regions. Then, determine if any external pixels outside the large-area connected component exist within these neighboring regions. If such external pixels exist, the current pixel is determined to be located at the boundary of the large-area connected component. If no external pixels exist, the current pixel is determined to be located inside the large-area connected component and is not classified as a boundary pixel.
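The boundary test can be sketched on a label grid as follows. It assumes a 4-connected neighborhood and treats only in-image pixels with a different label as external pixels, so pixels on the image border are not boundary pixels merely because they touch the border.

```python
def boundary_pixels(labels, label):
    """Return the (row, col) pixels of `label` that touch another label."""
    rows, cols = len(labels), len(labels[0])
    result = []
    for r in range(rows):
        for c in range(cols):
            if labels[r][c] != label:
                continue
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and labels[nr][nc] != label:
                    result.append((r, c))  # at least one external neighbour
                    break
    return result
```

Switching to an 8-connected neighborhood only requires adding the four diagonal offsets.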

[0082] Then, the process of determining the internal consistency of a connected component specifically includes dividing the large-area connected component into multiple local regions and performing the following processing for each local region: determining the depth value of each pixel contained in the local region, summing all depth values within the local region, and dividing the sum by the total number of pixels in the local region to obtain the local depth mean; then calculating the square of the difference between each depth value and the local depth mean, and summing and averaging all the squared differences to obtain the depth statistical variance corresponding to the local region, where the depth statistical variance represents the uniformity of the distribution of depth values.

[0083] Simultaneously, the relative prediction error corresponding to each pixel within the local region is extracted. All relative prediction errors within the local region are summed, and the sum is divided by the total number of pixels within the local region to obtain the mean local error. Then, the squared difference between each relative prediction error and the mean local error is calculated, and all squared differences are summed and averaged to obtain the statistical variance of the error corresponding to the local region; where the statistical variance represents the stability of the relative prediction error.

[0084] Finally, the depth statistical variance and error statistical variance corresponding to each local region are combined to generate a variance feature matrix, which serves as the internal consistency of the large-area connected component.
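The variance computation in the preceding paragraphs can be sketched as below. The function names, the dictionary-based image representation, and the layout of the matrix (one row per local region, holding the depth variance and then the error variance) are assumptions made for illustration; the application does not specify how the local regions are chosen or how the matrix is arranged.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


def variance(values):
    """Population variance: mean of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def variance_features(depths, errors, regions):
    """Build a variance feature matrix with one row per local region.

    depths, errors: dicts mapping (row, col) -> depth value / relative
                    prediction error (hypothetical representation).
    regions: list of local regions, each a list of (row, col) pixels.
    Each row is [depth statistical variance, error statistical variance].
    """
    return [
        [
            variance([depths[p] for p in region]),
            variance([errors[p] for p in region]),
        ]
        for region in regions
    ]


depths = {(0, 0): 1.0, (0, 1): 3.0, (1, 0): 2.0, (1, 1): 2.0}
errors = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.0, (1, 1): 0.5}
regions = [[(0, 0), (0, 1)], [(1, 0), (1, 1)]]
features = variance_features(depths, errors, regions)
```

Here the first region has uneven depth but a stable error, while the second has uniform depth but an unstable error, so the two rows of the matrix differ in which column is non-zero.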

[0085] The process of determining the boundary sharpness of each boundary pixel in a connected component specifically includes the following steps for each boundary pixel: First, determine at least one adjacent external pixel from the adjacent region of the boundary pixel. An adjacent external pixel is a pixel located in the adjacent region of the boundary pixel but not in the large-area connected component. Second, obtain the depth value of the boundary pixel and the depth value of each adjacent external pixel. Third, for each adjacent external pixel, determine the depth change magnitude gradient between the boundary pixel and the adjacent external pixel. Fourth, select the largest of the resulting depth change magnitude gradients as the boundary sharpness corresponding to the boundary pixel.
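A minimal sketch of these four steps follows. It assumes that the depth change magnitude gradient between two adjacent pixels is the absolute difference of their depth values, that the adjacent region is a 4-neighborhood, and that positions outside the image are skipped; none of these choices is fixed by the text, and the function name is hypothetical.

```python
FOUR_NEIGHBORS = ((-1, 0), (1, 0), (0, -1), (0, 1))


def boundary_sharpness(pixel, depths, component, neighborhood=FOUR_NEIGHBORS):
    """Return the largest depth change between a boundary pixel and its
    adjacent external pixels, or None if it has no in-image external neighbor.

    depths: dict mapping (row, col) -> depth value for every image pixel.
    component: set of (row, col) pixels in the connected component.
    The gradient is taken as an absolute depth difference (an assumption).
    """
    r, c = pixel
    gradients = [
        abs(depths[(r + dr, c + dc)] - depths[pixel])
        for dr, dc in neighborhood
        if (r + dr, c + dc) in depths and (r + dr, c + dc) not in component
    ]
    return max(gradients) if gradients else None


# One image row: the component is the middle pixel only.
depths = {(0, 0): 1.0, (0, 1): 1.5, (0, 2): 4.0}
sharpness = boundary_sharpness((0, 1), depths, component={(0, 1)})
```

In this example the two external neighbors differ from the boundary pixel by 0.5 and 2.5, so the boundary sharpness is the larger of the two.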

[0086] Based on this, after determining the internal consistency of the connected components and the boundary sharpness corresponding to each boundary pixel, the connected components can be locally corrected based on the internal consistency and boundary sharpness to obtain the target depth image. Next, the process of locally correcting the connected components based on the internal consistency and boundary sharpness to obtain the target depth image will be explained.

[0087] In practice, the process of locally correcting the connected components based on internal consistency and boundary sharpness to obtain the target depth image can be as follows: Based on boundary sharpness, target boundary pixels are selected from the boundary pixels, where the boundary sharpness is lower than a sharpness threshold; based on internal consistency, internal variance anomalous regions are identified from the connected components, where the depth value change is greater than a change threshold; and the internal variance anomalous regions in the connected components and the target boundary pixels are removed as noise to obtain the target depth image.

[0088] It should be noted that a target boundary pixel is a boundary pixel located on the outermost contour of the connected component whose calculated boundary sharpness is lower than the sharpness threshold. It indicates a degraded edge pixel that is blurred, exhibits depth trailing, or is excessively smoothed, and is marked for the subsequent cropping operation. Both the sharpness threshold and the change threshold are preset, and this application does not limit their specific values. An internal variance anomalous region is an isolated pixel region within the connected component where the depth value change is greater than the change threshold and which exhibits an abrupt local depth change.

[0089] In practical implementation, the system acquires the connected components of a large area category, the boundary sharpness of all boundary pixels within the connected components, and the internal consistency of the connected components. It also acquires pre-set sharpness thresholds and variation thresholds. All boundary pixels within the connected components are traversed. For any traversed boundary pixel, if the boundary sharpness is lower than the sharpness threshold, the boundary pixel is determined to be a blurred, trailing edge with an abnormally gentle depth change, and thus identified as a target boundary pixel. If the boundary sharpness is greater than or equal to the sharpness threshold, the boundary pixel is determined to be a true physical contour with a sharp depth transition, and therefore not selected as a target boundary pixel.

[0090] Meanwhile, the process of identifying internal variance anomalous regions from connected components based on internal consistency is as follows. As mentioned above, internal consistency is a variance feature matrix generated from the depth statistical variance and error statistical variance corresponding to each local region. The internal consistency is therefore parsed to obtain the depth statistical variance and error statistical variance corresponding to each local region. For each local region, the depth statistical variance and error statistical variance of that local region are weighted and summed, and the result is taken as the depth value change corresponding to that local region. Here, the depth value change represents the overall degree of fluctuation of the depth values within the local region. A preset change threshold is obtained. For each local region within a large-area connected component, if the depth value change corresponding to that local region is greater than the change threshold, it is determined that the depth value distribution within the local region is extremely uneven and deviates from the model prediction baseline. The local region is then marked as abnormal, that is, it is determined to be an internal variance anomalous region, and it is deleted in the subsequent process.

[0091] If the change in depth value corresponding to the local area is less than or equal to the change threshold, it is determined that the depth distribution inside the local area is uniform and the state is stable. The local area is then marked as a normal smooth state, and the local area marked as a normal smooth state is retained in subsequent processes.

[0092] In practice, the identified internal variance anomalous regions and the selected target boundary pixels are obtained. The values of the pixels contained in the internal variance anomalous regions of the connected components are replaced with an invalid marker or set to zero, so that the internal variance anomalous regions are removed as noise. At the same time, the target boundary pixels are clipped from the outermost spatial contour of the connected components and their values are set to zero, so that the target boundary pixels are removed as noise.
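The local correction described in the preceding paragraphs can be sketched as a single pass over the boundary pixels and the local regions. The function name, the equal default weights for the weighted sum, and the use of 0.0 as the invalid marker are assumptions; the application leaves the weights and threshold values open.

```python
INVALID = 0.0  # assumed invalid marker; the text allows a marker or zero


def locally_correct(depths, sharpness, regions, features,
                    sharpness_threshold, change_threshold,
                    depth_weight=0.5, error_weight=0.5):
    """Return a copy of `depths` with noise pixels set to INVALID.

    sharpness: dict mapping boundary pixel -> boundary sharpness.
    regions, features: parallel lists; features[i] holds the
        [depth variance, error variance] of regions[i].
    A boundary pixel is removed if its sharpness is below the sharpness
    threshold; a local region is removed if its weighted variance sum
    (the depth value change) exceeds the change threshold.
    """
    cleaned = dict(depths)
    for pixel, value in sharpness.items():
        if value < sharpness_threshold:
            cleaned[pixel] = INVALID
    for region, (depth_var, error_var) in zip(regions, features):
        change = depth_weight * depth_var + error_weight * error_var
        if change > change_threshold:
            for pixel in region:
                cleaned[pixel] = INVALID
    return cleaned


depths = {(0, 0): 2.0, (0, 1): 2.5, (0, 2): 9.0, (0, 3): 2.0}
sharpness = {(0, 0): 0.05, (0, 3): 3.0}          # two boundary pixels
regions = [[(0, 1)], [(0, 2)]]                    # two local regions
features = [[0.0, 0.0], [4.0, 2.0]]               # variance feature matrix
cleaned = locally_correct(depths, sharpness, regions, features,
                          sharpness_threshold=0.1, change_threshold=1.0)
```

In this example the blurred boundary pixel and the high-variance region are zeroed, while the sharp boundary pixel and the stable region keep their depth values.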

[0093] In some embodiments, after obtaining the original depth image for training the depth estimation model, scene recognition can be performed on the original depth image. When the recognition result indicates that the scene corresponding to the original depth image is a special scene, a specific denoising method for the corresponding special scene is obtained, and the original depth image is denoised based on the specific denoising method to obtain the target depth image.

[0094] It should be noted that "special scenes" refer to rare environments whose frequency of occurrence in the model training dataset is lower than a preset proportion, or to scenes containing special physical materials or geometric structures that can easily cause depth sensors to malfunction. Examples include, in applications such as navigation, sorting, and handling, depth image samples corresponding to specially shaped objects (transparent objects, reflective surfaces, extremely fine structures, etc.) or to extreme scene conditions (strong backlighting, long distance, etc.); the embodiments of this application do not limit this. A "specific denoising method" refers to a denoising method obtained by modifying the specific numerical parameters of the original denoising method. With this specific denoising method, real rare physical structures acquired in special scenes, even those with certain quality defects, can be protected from being deleted as noise.

[0095] In practice, the process of denoising the original depth image to obtain the target depth image based on this specific denoising method can be as follows: obtain the initial prediction consistency threshold corresponding to the original depth image, and multiply the initial prediction consistency threshold by a preset tolerance coefficient greater than one to generate a relaxed prediction consistency threshold. The relaxed prediction consistency threshold allows for a larger relative error between the original depth image and the predicted depth image, so as to prevent the pre-trained model from mistakenly deleting real rare data due to its own inaccurate prediction in special scenarios.

[0096] Simultaneously, the initial minimum connected component area threshold corresponding to the original depth image is obtained, and the preset number of pixel offsets is subtracted from the initial minimum connected component area threshold to generate the relaxed minimum connected component area threshold. The relaxed minimum connected component area threshold reduces the minimum number of pixels required to retain connected components, so as to protect real physical structures such as extremely thin cables or tiny fragments in special scenarios.

[0097] Then, the initial depth similarity tolerance corresponding to the original depth image is obtained, and a preset depth bias value is added to the initial depth similarity tolerance to generate a relaxed depth similarity tolerance. This relaxed depth similarity tolerance allows for more drastic depth value jumps between adjacent pixels in special scenarios. Based on the relaxed prediction consistency threshold, the relaxed minimum connected component area threshold, and the relaxed depth similarity tolerance, subsequent connected component identification and denoising operations are performed on the original depth image.
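The three relaxations above can be sketched as one function. The class and function names, the numeric values, and the clamping of the relaxed area threshold to at least one pixel are assumptions added for illustration; the text only requires a tolerance coefficient greater than one, a subtracted pixel offset, and an added depth bias.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CleaningParams:
    """Hypothetical container for the three cleaning parameters."""
    consistency_threshold: float   # prediction consistency threshold
    min_component_area: int        # minimum connected component area (pixels)
    depth_tolerance: float         # depth similarity tolerance


def relax_for_special_scene(params, tolerance_coeff, area_offset, depth_bias):
    """Return relaxed parameters for a special scene.

    tolerance_coeff must be greater than one, as stated in the text.
    The lower bound of 1 pixel on the area threshold is an assumption
    that keeps the threshold meaningful after subtraction.
    """
    if tolerance_coeff <= 1:
        raise ValueError("tolerance coefficient must be greater than one")
    return CleaningParams(
        consistency_threshold=params.consistency_threshold * tolerance_coeff,
        min_component_area=max(1, params.min_component_area - area_offset),
        depth_tolerance=params.depth_tolerance + depth_bias,
    )


base = CleaningParams(consistency_threshold=0.25,
                      min_component_area=50,
                      depth_tolerance=0.125)
relaxed = relax_for_special_scene(base, tolerance_coeff=1.5,
                                  area_offset=10, depth_bias=0.125)
```

With these illustrative values, the relaxed parameters accept a larger relative error, a smaller component, and a larger depth jump between adjacent pixels than the base parameters.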

[0098] It should be noted that during the denoising process, the sharpness threshold and the change threshold mentioned above can also be adjusted. For example, the sharpness threshold corresponding to the original depth image of a special scene is less than the sharpness threshold corresponding to the original depth image of a non-special scene, and/or the change threshold corresponding to the original depth image of a special scene is greater than the change threshold corresponding to the original depth image of a non-special scene. This application does not limit the specific adjustments.

[0099] Thus, for the original depth images of special scenes, the cleaning criteria are appropriately relaxed to ensure that the training set can cover diverse scene distributions and improve the generalization ability of the model.

[0100] In some embodiments, after denoising the original depth image based on multiple connected components to obtain the target depth image, a denoising evaluation result of the target depth image can be determined based on the target depth image and the original depth image. The denoising evaluation result is used to quantify the denoising effect of the target depth image. The denoising evaluation result includes at least one of noise removal rate, effective information retention rate, and data consistency improvement degree. If the denoising evaluation result indicates that the denoising effect of the target depth image does not meet the expected standard, the denoising strategy of the original depth image is adjusted until the denoising effect of the re-obtained target depth image meets the expected standard.

[0101] It should be noted that the noise removal rate is used to quantitatively evaluate the performance of the denoising process in removing meaningless interference data. The effective information retention rate is used to quantitatively evaluate the algorithm's ability to avoid accidentally deleting detailed structures of the real scene during the cleaning process. The improvement in data consistency is used to quantitatively characterize the overall improvement in the numerical agreement (i.e., consistency) between the target depth image and the predicted depth image.

[0102] In practical implementation, the process of determining the noise removal rate specifically involves extracting the total number of initial noise pixels and the total number of initial effective pixels in the original depth image, and the total number of target noise pixels in the target depth image. The ratio of the total number of initial noise pixels to the total number of initial effective pixels is taken as the first ratio, and the ratio of the total number of target noise pixels to the total number of initial effective pixels is taken as the second ratio. The difference between the second ratio and the first ratio is obtained as the first difference, and the ratio of the first difference to the first ratio is determined as the noise removal rate, i.e.: $\text{noise removal rate} = (\text{second ratio} - \text{first ratio}) / \text{first ratio}$ ... Formula (3); where the second ratio is the ratio of the total number of target noise pixels to the total number of initial effective pixels, and the first ratio is the ratio of the total number of initial noise pixels to the total number of initial effective pixels.

[0103] Specifically, the process of determining the effective information retention rate involves extracting the initial total number of effective pixels in the original depth image and the target total number of effective pixels in the target depth image. The difference between the target total number of effective pixels and the initial total number of effective pixels is defined as the second difference, and the ratio of the second difference to the initial total number of effective pixels is defined as the effective information retention rate, i.e.: $\text{effective information retention rate} = (\text{target total effective pixels} - \text{initial total effective pixels}) / \text{initial total effective pixels}$ ... Formula (4).

[0104] It should be noted that the effective information retention rate here may be negative, indicating that the effective information has been reduced. In this case, it is necessary to control the reduction within a reasonable range (e.g., within -20%) to avoid excessive cleaning.

[0105] To determine the degree of improvement in data consistency, the original region quality assessment mean of the original depth image and the target region quality assessment mean of the target depth image are obtained. The difference between the target region quality assessment mean and the original region quality assessment mean is determined as the third difference, and the ratio of the third difference to the original region quality assessment mean is determined as the degree of improvement in data consistency, i.e.: $\text{data consistency improvement} = (\text{target region quality assessment mean} - \text{original region quality assessment mean}) / \text{original region quality assessment mean}$ ... Formula (5).

[0106] It should be noted that the original region quality assessment mean of the original depth image can be obtained as follows: obtain the region quality assessment value of each local region in the original depth image, and take the average of the region quality assessment values of the multiple local regions as the original region quality assessment mean of the original depth image. The process of determining the region quality assessment value of each local region in the original depth image is as described above and is not repeated in this embodiment. Correspondingly, the target region quality assessment mean of the target depth image is obtained in a similar way, which is likewise not repeated in this embodiment.
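The three evaluation indicators share the same relative-change form, which the following sketch makes explicit. It reads "the difference between A and B" as A minus B throughout, an assumption that matches the note that the effective information retention rate may be negative; under that reading the noise removal rate is also negative when noise decreases, and a reader who prefers a positive rate would flip its sign. All function names and pixel counts are hypothetical.

```python
def relative_change(after, before):
    """(after - before) / before, the common form of Formulas (3) to (5)."""
    if before == 0:
        raise ValueError("baseline value must be non-zero")
    return (after - before) / before


def noise_removal_rate(initial_noise, target_noise, initial_effective):
    """Formula (3) as described: (second ratio - first ratio) / first ratio."""
    first_ratio = initial_noise / initial_effective
    second_ratio = target_noise / initial_effective
    return relative_change(second_ratio, first_ratio)


def retention_rate(initial_effective, target_effective):
    """Formula (4) as described; negative when effective pixels are lost."""
    return relative_change(target_effective, initial_effective)


def consistency_improvement(original_quality_mean, target_quality_mean):
    """Formula (5) as described, using region quality assessment means."""
    return relative_change(target_quality_mean, original_quality_mean)


# Illustrative counts only; they are not results from the application.
removal = noise_removal_rate(initial_noise=500, target_noise=50,
                             initial_effective=1000)
retention = retention_rate(initial_effective=1000, target_effective=900)
improvement = consistency_improvement(0.5, 0.75)
within_limit = retention >= -0.20  # the example bound from paragraph [0104]
```

With these illustrative counts, the retention rate reflects a 10% loss of effective pixels, which stays inside the example bound of -20%.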

[0107] In practical implementation, the denoising evaluation result of the target depth image is obtained based on the noise removal rate, effective information retention rate, and data consistency improvement degree. Then, the denoising effect of the target depth image is determined based on the denoising evaluation result. Specifically, a pre-set expected standard is obtained, which includes at least one of the following: the noise removal rate is greater than a preset noise removal rate threshold; the effective information retention rate is greater than a preset effective information retention rate threshold; and the data consistency improvement degree is greater than a preset data consistency improvement degree threshold.

[0108] Therefore, if the denoising evaluation result indicates that the denoising effect of the target depth image does not meet the expected standard, it is determined that the currently executed denoising strategy lacks a reasonable leniency / strictness configuration. In this case, the output of the target depth image is paused, and the denoising strategy for the original depth image is iteratively adjusted. Specifically, if the noise removal rate is insufficient (too much residual noise), the prediction consistency threshold or the minimum connected component area threshold is reduced to enhance the denoising effect. If the effective information retention rate is too low (over-cleaning), the prediction consistency threshold or the minimum connected component area threshold is increased to reduce false deletions. If the improvement in data consistency is not significant, the stratification strategy of the region quality evaluation value is checked for rationality, and the boundary of the quality level division is adjusted, i.e., the first, second, and third prediction consistency thresholds mentioned above are adjusted. Finally, the denoising process is re-executed after adjustment, and the effect is evaluated again. This iterative process continues until all indicators simultaneously meet the expected standard. This closed-loop feedback mechanism ensures that the cleaning parameters gradually converge to the optimal configuration.
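The closed-loop adjustment can be sketched as a bounded loop. This is only an outline under several assumptions: the `run_cleaning` callback, the multiplicative `step`, and the round limit are hypothetical; the metrics are assumed to be oriented so that larger values are better; only the prediction consistency threshold is adjusted; and the adjustment of the minimum connected component area threshold and of the quality stratification is omitted.

```python
def tune_parameters(run_cleaning, params, standards, step=0.8, max_rounds=20):
    """Adjust the consistency threshold until every metric meets its standard.

    run_cleaning(params) -> dict of metrics (hypothetical callback).
    params: dict with key "consistency_threshold".
    standards: dict mapping metric name -> threshold that must be exceeded.
    Returns (params, metrics), or (params, None) if the round limit is hit.
    """
    for _ in range(max_rounds):
        metrics = run_cleaning(params)
        if all(metrics[name] > standards[name] for name in standards):
            return params, metrics
        params = dict(params)
        if metrics["noise_removal"] <= standards["noise_removal"]:
            # Too much residual noise: tighten the threshold.
            params["consistency_threshold"] *= step
        elif metrics["retention"] <= standards["retention"]:
            # Over-cleaning: loosen the threshold.
            params["consistency_threshold"] /= step
    return params, None


def toy_cleaning(params):
    """Stand-in for the real denoising pass, for demonstration only."""
    t = params["consistency_threshold"]
    return {"noise_removal": 1.0 - t, "retention": -0.3 * (1.0 - t)}


final_params, final_metrics = tune_parameters(
    toy_cleaning,
    {"consistency_threshold": 0.8},
    {"noise_removal": 0.6, "retention": -0.25},
)
```

The round limit is a design choice of this sketch: it guards against the case where the standards cannot all be met at once, which the text does not address.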

[0109] Applying the embodiments described above, a predicted depth image generated from a color image is used as prior guidance, and a prediction consistency threshold is determined. By comparing the relative error between the original depth and the model's predicted depth, semantic understanding is imparted to the denoising process. Thus, features in the original image that conform to the prediction are identified as real features, while those that deviate significantly from the prediction are identified as noise. This achieves intelligent differentiation between real features and noise, thereby improving the accuracy of denoising. Furthermore, compared to pixel-level independent processing (such as filtering and smoothing) in related technologies, this application combines the depth similarity tolerance (continuity of adjacent pixels) and the prediction consistency threshold (semantic correctness) to extract connected components. This ensures that the spatially continuous and semantically correct surface of a real object is preserved as a whole and that true depth transitions (object edges) are not blurred or destroyed, protecting the integrity of key spatial geometric structures and further improving the accuracy of denoising. In addition, the two core parameters on which connected component identification depends (the prediction consistency threshold and the depth similarity tolerance) are not fixed values but are determined dynamically from the original depth image and the predicted depth image. This application can therefore adaptively adjust the denoising intensity for different scenes (such as flat areas or complex edges) according to the depth distribution gradient and prediction error quality of the current input image, ensuring the stability and universality of the denoising effect and providing cross-scene adaptive denoising capability.

[0110] The following will describe an exemplary application of the embodiments of this application in a real-world application scenario.

[0111] In related technologies, when preprocessing training images for depth estimation models, a "one-size-fits-all" strategy with fixed parameters is often adopted, making it difficult to distinguish between real scene features and noise artifacts. This makes it easy for the denoising process to destroy key spatial geometric structures (such as object edges) and it cannot adapt to the differences in data from different scenes. Ultimately, this results in poor quality of the cleaned supervision data, which severely restricts the training accuracy and generalization ability of depth estimation models.

[0112] Based on this, the image processing method provided in this application systematically cleans and optimizes depth estimation training data by constructing an intelligent data quality assessment system and a refined noise removal mechanism. The core objective of this method is to provide high-quality, clean, and consistent supervision data for the training of depth estimation models, thereby improving the model's learning effect and generalization performance. Specifically, firstly, this invention innovatively utilizes the prediction results of pre-trained depth estimation models as a high-quality reference to guide the cleaning process of the original training data. This method fully leverages the rich scene knowledge contained in the pre-trained model, constructs a prediction consistency verification mechanism, effectively identifies regions in the original data that significantly deviate from the model's predictions, thereby accurately locating and removing noise or erroneous information, and improving the objectivity and reliability of data cleaning. Secondly, traditional methods often rely on pixel-level independent judgment, ignoring the spatial continuity of depth maps. This invention proposes a cleaning strategy based on connected component analysis, combining depth similarity and prediction consistency constraints to identify continuous depth regions, effectively protecting the real geometric structure, while accurately removing discrete noise and unreasonable small regions, achieving refined cleaning that better conforms to the geometric characteristics of the scene. Thirdly, for the training requirements of depth estimation models, this invention constructs an adaptive parameter adjustment system. This mechanism comprehensively analyzes the statistical characteristics of the data and the distribution of predicted quality, automatically optimizes the cleaning parameters, and dynamically adapts to the characteristics of different datasets. 
For different scenarios such as navigation, sorting, and handling, it removes harmful noise while retaining detailed information beneficial to model learning, ensuring that the cleaned data maximizes its value during training.

[0113] In practical implementation, the prediction-guided method for cleaning depth estimation model training data proposed in this application achieves systematic cleaning and optimization of depth estimation training data by constructing an intelligent data quality assessment system and a refined noise removal mechanism. The core objective of this method is to provide high-quality, clean, and consistent supervision data for the training of depth estimation models, thereby improving the model's learning effect and generalization performance. The technical solution of this application will be described in detail below.

[0114] First, the overall framework for training data cleaning adopts a four-stage process of analysis, evaluation, cleaning, and validation to ensure that each step contributes to improving the final data quality. The entire framework is designed to fully consider the specific needs of depth estimation model training, aiming not only to remove harmful noise data but also to retain as much valuable real information as possible for model learning. Specifically: Stage 1, data quality analysis: statistical feature extraction of the original GT depth map (original depth image), acquisition of pre-trained model prediction results, and preliminary assessment of prediction consistency. Stage 2, intelligent quality assessment: prediction error distribution analysis, adaptive cleaning parameter calculation, and connected component segmentation and feature extraction. Stage 3, refined data cleaning: dual-constraint connected component analysis, noise recognition based on prediction consistency, and intelligent filtering decision and execution. Stage 4, validation of cleaning effect: comparison of data quality before and after cleaning, pre-evaluation of training effect, and feedback for cleaning parameter optimization.

[0115] Second, regarding the data quality assessment mechanism based on the pre-trained model, this application introduces a pre-trained depth estimation model as a reference standard for quality assessment, utilizing the model's scene understanding ability to judge the credibility of the original data. The algorithm first uses advanced depth estimation models such as Foundation to run inference on the RGB image (color image) corresponding to the ground truth (GT) depth map (original depth image), obtaining the model's depth prediction result (predicted depth image). As shown in Figure 4, while this prediction result may not be as accurate as the GT depth map, it exhibits good consistency and semantic correctness, especially in scene types where the model is well-trained. Then, referring to formula (1), the algorithm calculates the pixel-level error between the original GT depth map and the predicted depth map and analyzes the error distribution pattern and statistical characteristics, i.e., the region quality assessment value. This quality assessment mechanism considers not only the magnitude of the error but also its distribution characteristics. Regions with high consistency typically indicate that the GT data in that region is reliable, while regions with low consistency may contain noise or erroneous information. This quality assessment based on prediction consistency provides an objective and reliable basis for subsequent data cleaning.

[0116] Third, regarding adaptive parameter calculation for training optimization, this application employs an adaptive parameter calculation mechanism to optimize the training effect of the depth estimation model. By analyzing data characteristics, key parameters are dynamically set: the minimum connected component area (minimum connected component area threshold) is determined based on effective pixel density and scene complexity to balance detail preservation and noise removal; the similarity tolerance (depth similarity tolerance) is adjusted according to the depth gradient distribution to protect boundaries or suppress noise in areas with drastic or gradual changes; and a consistency threshold (prediction consistency threshold) is set based on prediction quality stratification to avoid over-cleaning or residual interference, ensuring that the data can play its maximum training value in various scenarios.

[0117] Fourth, refined data cleaning based on connected component analysis. Connected component analysis is the core technology of this application. By identifying and analyzing spatially continuous depth regions, refined cleaning of training data is achieved. Compared with traditional pixel-level independent processing methods, connected component analysis can better protect the spatial structural integrity of the data and avoid geometric information loss caused by excessive cleaning.

[0118] It should be noted that the algorithm employs an improved flood-fill strategy for connected component identification. As shown in formula (2), it considers two constraints at the same time: depth geometric consistency and prediction semantic consistency. Depth geometric consistency ensures reasonable continuity of depth values within a connected component, while prediction semantic consistency ensures that the data within the connected component remains consistent with the prediction results of the pre-trained model. Only adjacent pixels that satisfy both constraints can belong to the same connected component. After connected component identification, the algorithm performs quality assessment and makes a cleaning decision for each connected component. The decision-making process comprehensively considers the geometric features (e.g., area, shape complexity), statistical features (e.g., internal depth variance, boundary sharpness), and prediction consistency features (e.g., average prediction error, error distribution) of the connected component. Based on the comprehensive analysis of these features, it determines whether each connected component should be retained as valuable training data or deleted as noisy data.
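A dual-constraint flood fill of this kind can be sketched as a breadth-first search. This is a plain flood fill under stated assumptions, not the improved strategy of the application: formula (2) is not reproduced here, so the relative error is assumed to be the absolute depth difference divided by the predicted depth, a 4-neighborhood is used, and non-positive depth values are treated as invalid. The function name is hypothetical.

```python
from collections import deque


def find_components(depth, pred, depth_tol, consistency_thr):
    """Group pixels into connected components under two constraints.

    depth, pred: 2D lists of equal shape (original and predicted depth).
    A pixel is eligible only if its relative error against the prediction
    is within consistency_thr (prediction semantic consistency).
    A neighbor joins the current component only if it is eligible and its
    depth differs from the current pixel by at most depth_tol
    (depth geometric consistency).
    """
    rows, cols = len(depth), len(depth[0])

    def consistent(r, c):
        d, p = depth[r][c], pred[r][c]
        return d > 0 and p > 0 and abs(d - p) / p <= consistency_thr

    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r0 in range(rows):
        for c0 in range(cols):
            if seen[r0][c0] or not consistent(r0, c0):
                continue
            seen[r0][c0] = True
            queue, comp = deque([(r0, c0)]), []
            while queue:
                r, c = queue.popleft()
                comp.append((r, c))
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    nr, nc = r + dr, c + dc
                    if (0 <= nr < rows and 0 <= nc < cols
                            and not seen[nr][nc]
                            and consistent(nr, nc)
                            and abs(depth[nr][nc] - depth[r][c]) <= depth_tol):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            components.append(comp)
    return components


depth = [[1.0, 1.0, 5.0],
         [1.0, 9.0, 5.0]]
pred = [[1.0, 1.0, 5.0],
        [1.0, 1.0, 5.0]]
components = find_components(depth, pred, depth_tol=0.5, consistency_thr=0.25)
```

In this example the pixel with depth 9.0 disagrees with the prediction and joins no component, while the remaining pixels split into two components at the depth jump between 1.0 and 5.0.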

[0119] Fifth, regarding the intelligent cleaning decision strategy for model training, this application optimizes the cleaning decisions according to the training requirements of the depth estimation model. For small connected components, consistency with the prediction results is strictly evaluated: highly consistent regions are considered real details and retained, while those with significant differences are removed as noise. For large connected components, the focus is on internal consistency and boundary sharpness, and the training value of the main objects is preserved through local correction. At the same time, the algorithm takes into account the scarcity and representativeness of the data, and appropriately relaxes the cleaning criteria for special scene samples to maintain the diversity of the training set and the learning effectiveness.

[0120] Sixth, regarding data cleaning effect evaluation and parameter optimization feedback, this application establishes a data cleaning effect evaluation mechanism that assesses the cleaning effect from two aspects, quality improvement and model impact, and provides feedback for parameter optimization. The cleaning effect is quantitatively evaluated using indicators such as noise removal rate, effective information retention rate, and the degree of improvement in data consistency, ensuring that data cleaning effectively improves model training performance. ...Formula (6); where the terms of the formula are the percentage improvement in data quality, the data quality indicator after cleaning, and the data quality indicator before cleaning.

[0121] It should be noted that if the cleaned data retains good diversity and representativeness while maintaining high quality, it is expected to improve the training effect of the model; however, if the cleansing process leads to a significant reduction in data diversity or the loss of certain important features, it may have a negative impact on model training, and the cleansing strategy needs to be adjusted.

[0122] Thus, this application constructs an intelligent data cleaning scheme based on depth estimation model-guided ground truth analysis. This method utilizes prior model knowledge and employs refined connected component analysis and adaptive cleaning strategies to effectively identify and remove noise while preserving crucial geometric details and scarce samples for training. Empirical results demonstrate that this method can significantly reduce the noise ratio in multi-scenario test data from 50% to below 5%. The integrated parameter feedback optimization and batch processing mechanisms further ensure continuous improvement in cleaning performance and efficient processing of large-scale datasets, providing high-quality real-world data for training depth estimation models and possessing significant application value.

[0123] Applying the embodiments described above, a predicted depth image generated from a color image is used as prior guidance, and a prediction consistency threshold is determined. By comparing the relative error between the original depth and the model's predicted depth, semantic understanding is imparted to the denoising process. Thus, features in the original image that conform to the prediction are identified as real features, while those that deviate significantly from the prediction are identified as noise. This achieves intelligent differentiation between real features and noise, thereby improving the accuracy of denoising. Furthermore, compared to pixel-level independent processing (such as filtering and smoothing) in related technologies, this application combines the depth similarity tolerance (continuity of adjacent pixels) and the prediction consistency threshold (semantic correctness) to extract connected components. This ensures that the spatially continuous and semantically correct surface of a real object is preserved as a whole and that true depth transitions (object edges) are not blurred or destroyed, protecting the integrity of key spatial geometric structures and further improving the accuracy of denoising. In addition, the two core parameters on which connected component identification depends (the prediction consistency threshold and the depth similarity tolerance) are not fixed values but are determined dynamically from the original depth image and the predicted depth image. This application can therefore adaptively adjust the denoising intensity for different scenes (such as flat areas or complex edges) according to the depth distribution gradient and prediction error quality of the current input image, ensuring the stability and universality of the denoising effect and providing cross-scene adaptive denoising capability.

[0124] The following continues to describe an exemplary structure in which the image processing apparatus 455 provided in the embodiments of this application is implemented as software modules. In some embodiments, as shown in Figure 2, the software modules of the image processing apparatus 455 stored in the memory 450 may include: an acquisition module 4551, configured to acquire the original depth image for training the depth estimation model, and to acquire the color image corresponding to the original depth image and the corresponding predicted depth image, wherein the predicted depth image is predicted based on the color image; a determining module 4552, configured to determine a prediction consistency threshold based on the original depth image and the predicted depth image, and to determine a depth similarity tolerance based on the original depth image, wherein the depth similarity tolerance is a threshold for the depth difference between any adjacent pixels in the original depth image, and the prediction consistency threshold is a threshold for the relative error of the depth of the same pixel in the original depth image and the predicted depth image; an identification module 4553, configured to identify multiple connected components from the original depth image based on the depth similarity tolerance and the prediction consistency threshold; and a denoising module 4554, configured to denoise the original depth image based on the multiple connected components to obtain the target depth image.

[0125] In some embodiments, the determining module 4552 is further configured to perform pixel alignment on the original depth image and the predicted depth image to obtain an alignment result, and based on the alignment result, determine the depth value difference between the original depth image and the predicted depth image; based on the depth value difference, determine the mean, standard deviation and absolute median of the prediction error, and based on the mean, standard deviation and absolute median of the prediction error, generate a region quality assessment value; and based on the region quality assessment value, construct the prediction consistency threshold.
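
As a non-limiting illustration of paragraph [0125], the following Python sketch computes error statistics from two pixel-aligned depth images and turns them into a threshold. The function name, the `base` and `invalid` parameters, the reading of "absolute median" as the median of the absolute relative errors, and the formula that combines the statistics into a region quality value are all assumptions made for illustration; the application does not specify them.

```python
import statistics


def prediction_consistency_threshold(original, predicted, base=0.1, invalid=0.0):
    """Derive a relative-error threshold from prediction-error statistics.

    `original` and `predicted` are pixel-aligned 2D lists of depth values.
    `base`, `invalid`, and the combining formula are illustrative assumptions.
    """
    errors = []
    for row_o, row_p in zip(original, predicted):
        for d_o, d_p in zip(row_o, row_p):
            # Skip missing measurements and non-positive predictions.
            if d_o != invalid and d_p > 0:
                errors.append((d_o - d_p) / d_p)  # signed relative error
    if not errors:
        return base

    mean = statistics.fmean(errors)
    std = statistics.pstdev(errors)
    abs_median = statistics.median(abs(e) for e in errors)

    # Hypothetical region quality value in (0, 1]; 1.0 means perfect agreement.
    quality = 1.0 / (1.0 + abs(mean) + std + abs_median)
    # Poorer agreement loosens the threshold, so that an unreliable
    # prediction does not cause real structure to be rejected.
    return base / quality
```

In this sketch, identical images return `base` unchanged, and larger or more scattered errors yield a proportionally looser threshold.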

[0126] In some embodiments, the determining module 4552 is further configured to analyze the depth gradient distribution of the original depth image based on the original depth image; and construct the depth similarity tolerance based on the depth gradient distribution.
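
As a non-limiting illustration of paragraph [0126], the sketch below collects absolute depth differences between horizontally and vertically adjacent valid pixels and derives a tolerance from their median. Using the median (so that a few sharp object edges do not inflate the tolerance), as well as the `scale`, `floor`, and `invalid` parameters, are assumptions made for illustration only.

```python
import statistics


def depth_similarity_tolerance(depth, invalid=0.0, scale=3.0, floor=0.01):
    """Derive an adjacent-pixel depth tolerance from the gradient distribution.

    `depth` is a 2D list of depth values; `scale` and `floor` are
    hypothetical tuning parameters, not values from the application.
    """
    h, w = len(depth), len(depth[0])
    grads = []
    for y in range(h):
        for x in range(w):
            d = depth[y][x]
            if d == invalid:
                continue
            # Forward differences to the right and downward neighbours.
            for ny, nx in ((y, x + 1), (y + 1, x)):
                if ny < h and nx < w and depth[ny][nx] != invalid:
                    grads.append(abs(depth[ny][nx] - d))
    if not grads:
        return floor
    # The median reflects typical surface variation and ignores rare edges.
    return max(floor, scale * statistics.median(grads))
```

A perfectly flat image falls back to `floor`, while an image with gentle slopes and one sharp edge yields a tolerance driven by the slopes rather than by the edge.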

[0127] In some embodiments, the identification module 4553 is further configured to traverse the pixels in the original depth image; for the i-th pixel encountered during traversal, when the i-th pixel has not yet been assigned to a connected component, determine at least one neighboring pixel of the i-th pixel, where i is an integer greater than or equal to 1; for each neighboring pixel, when it is determined, based on the depth similarity tolerance and the prediction consistency threshold, that the neighboring pixel meets the target condition, determine that the neighboring pixel and the i-th pixel belong to the same connected component, wherein the target condition includes at least one of the following: the depth difference between the neighboring pixel and the i-th pixel is less than or equal to the depth similarity tolerance; the relative error of the depth of the neighboring pixel in the original depth image and the predicted depth image is less than or equal to the prediction consistency threshold; and, when all pixels in the original depth image have been traversed, obtain the multiple connected components in the original depth image.
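
As a non-limiting illustration of paragraph [0127], the following sketch labels connected components by breadth-first region growing. It assumes 4-connectivity, treats a depth of `invalid` as a missing measurement, and requires both target conditions to hold at once; the application states only that the target condition includes at least one of them, so this combination is one possible choice. All names are hypothetical.

```python
from collections import deque


def label_components(original, predicted, tol, thr, invalid=0.0):
    """Label connected components of `original` (2D list of depths).

    Returns (labels, count); pixels without a valid depth keep label -1.
    """
    h, w = len(original), len(original[0])
    labels = [[-1] * w for _ in range(h)]

    def consistent(y, x):
        # Relative error of the same pixel in the two images.
        p = predicted[y][x]
        return p > 0 and abs(original[y][x] - p) / p <= thr

    count = 0
    for y in range(h):
        for x in range(w):
            if labels[y][x] != -1 or original[y][x] == invalid:
                continue
            # Unassigned pixel: start a new component and grow it.
            labels[y][x] = count
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if not (0 <= ny < h and 0 <= nx < w):
                        continue
                    if labels[ny][nx] != -1 or original[ny][nx] == invalid:
                        continue
                    similar = abs(original[ny][nx] - original[cy][cx]) <= tol
                    if similar and consistent(ny, nx):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
            count += 1
    return labels, count
```

Under these assumptions, a pixel that disagrees with the prediction can never be absorbed into a neighbor's component, so isolated outliers end up as very small components that the later area-based step can examine.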

[0128] In some embodiments, the denoising module 4554 is further configured to: analyze the effective pixel density and scene complexity of the original depth image based on the original depth image; determine a minimum connected component area threshold based on the effective pixel density and the scene complexity; extract features from each of the plurality of connected components to obtain multidimensional features of the connected components; determine the category of each connected component based on the minimum connected component area threshold and the multidimensional features; wherein the categories of the connected components include small-area connected components and large-area connected components; different categories correspond to different denoising methods; and, for each connected component, use a denoising method corresponding to the category of the connected component to denoise the original depth image to obtain a target depth image.
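
As a non-limiting illustration of paragraph [0128], the sketch below derives a minimum area threshold and classifies labeled components. The definition of scene complexity (coefficient of variation of valid depths, clipped to 1), the threshold formula, the `base_area` parameter, and the choice of area and mean depth as the multidimensional features are assumptions made for illustration; the label map is assumed to use -1 for unlabeled pixels.

```python
import statistics


def min_area_threshold(depth, base_area=50, invalid=0.0):
    """Hypothetical minimum component area from density and complexity."""
    valid = [d for row in depth for d in row if d != invalid]
    if not valid:
        return 1
    density = len(valid) / (len(depth) * len(depth[0]))
    mean = statistics.fmean(valid)
    complexity = min(1.0, statistics.pstdev(valid) / mean) if mean > 0 else 0.0
    # Sparse or complex scenes get a smaller threshold so that scarce
    # samples and fine structures are less likely to be treated as small.
    return max(1, round(base_area * density * (1.0 - 0.5 * complexity)))


def classify_components(depth, labels, min_area):
    """Return {label: {"area", "mean_depth", "category"}} per component."""
    stats = {}
    for y, row in enumerate(labels):
        for x, lab in enumerate(row):
            if lab < 0:
                continue
            entry = stats.setdefault(lab, {"area": 0, "depth_sum": 0.0})
            entry["area"] += 1
            entry["depth_sum"] += depth[y][x]
    result = {}
    for lab, entry in stats.items():
        result[lab] = {
            "area": entry["area"],
            "mean_depth": entry["depth_sum"] / entry["area"],
            "category": "small" if entry["area"] < min_area else "large",
        }
    return result
```

The returned category can then be used to dispatch each component to the denoising method for small-area or large-area components.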

[0129] In some embodiments, the denoising module 4554 is further configured to, when the category of the connected component is the small-area connected component, perform the following processing for each connected component: determine a region in the predicted depth image corresponding to the connected component as a first region; evaluate the consistency between the first region and the connected component to obtain an evaluation result; if the evaluation result indicates that the first region and the connected component are inconsistent, then delete the connected component as noise; if the evaluation result indicates that the first region and the connected component are consistent, then retain the connected component; and obtain the target depth image based on the processed connected components.
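
As a non-limiting illustration of paragraph [0129], the sketch below checks each small-area connected component against the corresponding region of the predicted depth image and deletes it when the two are inconsistent. Measuring consistency as the median relative error compared against a threshold `thr` is an assumption; the application does not fix a particular consistency measure. Each component is represented here as a hypothetical list of `(row, column)` coordinates.

```python
import statistics


def filter_small_components(original, predicted, components, thr, invalid=0.0):
    """Return a copy of `original` with inconsistent small components removed."""
    target = [row[:] for row in original]
    for pixels in components:
        # The first region is the same set of coordinates in `predicted`.
        errors = [
            abs(original[y][x] - predicted[y][x]) / predicted[y][x]
            for y, x in pixels
            if predicted[y][x] > 0
        ]
        consistent = bool(errors) and statistics.median(errors) <= thr
        if not consistent:
            # Inconsistent with the prediction: delete the component as noise.
            for y, x in pixels:
                target[y][x] = invalid
    return target
```

Because the decision is made per component rather than per pixel, a small but genuine structure that agrees with the prediction is kept in its entirety.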

[0130] In some embodiments, the denoising module 4554 is further configured to, when the category of the connected component is the large-area connected component, determine the boundary pixels in the connected component, and determine the internal consistency of the connected component and the boundary sharpness corresponding to each of the boundary pixels; based on the internal consistency and the boundary sharpness, perform local correction on the connected component to obtain the target depth image.

[0131] In some embodiments, the denoising module 4554 is further configured to: select target boundary pixels from the boundary pixels based on the boundary sharpness, wherein the target boundary pixels are boundary pixels with boundary sharpness lower than a sharpness threshold; identify internal variance abnormal regions from the connected component based on the internal consistency, wherein the internal variance abnormal regions are local regions in the connected component where the depth value change is greater than a change threshold; and delete the internal variance abnormal regions in the connected component and the target boundary pixels as noise to obtain the target depth image.
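
As a non-limiting illustration of paragraphs [0130] and [0131], the sketch below applies a local correction to one large-area connected component. It assumes 4-connectivity, measures boundary sharpness as the largest depth step from a boundary pixel to a valid neighbor outside the component, and flags an internal anomaly when a pixel differs from the median of its in-component neighbors by more than `change_thr`. These definitions and all names are assumptions made for illustration.

```python
import statistics


def correct_large_component(depth, labels, lab, sharp_thr, change_thr, invalid=0.0):
    """Return a copy of `depth` with noisy pixels of component `lab` removed."""
    h, w = len(depth), len(depth[0])
    remove = set()
    for y in range(h):
        for x in range(w):
            if labels[y][x] != lab:
                continue
            inside_vals, outside_steps = [], []
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if not (0 <= ny < h and 0 <= nx < w) or depth[ny][nx] == invalid:
                    continue
                if labels[ny][nx] == lab:
                    inside_vals.append(depth[ny][nx])
                else:
                    outside_steps.append(abs(depth[ny][nx] - depth[y][x]))
            # Boundary pixel whose sharpness is below the sharpness threshold.
            if outside_steps and max(outside_steps) < sharp_thr:
                remove.add((y, x))
            # Interior anomaly: large deviation from the local median depth.
            if inside_vals and abs(depth[y][x] - statistics.median(inside_vals)) > change_thr:
                remove.add((y, x))
    target = [row[:] for row in depth]
    for y, x in remove:
        target[y][x] = invalid
    return target
```

Only the flagged pixels are deleted; the remainder of the component, including sharp object edges, is left untouched.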

[0132] In some embodiments, the apparatus further includes an adjustment module, configured to determine a denoising evaluation result of the target depth image based on the target depth image and the original depth image; wherein the denoising evaluation result is used to quantify the denoising effect of the target depth image; the denoising evaluation result includes at least one of noise removal rate, effective information retention rate, and data consistency improvement degree; if the denoising evaluation result indicates that the denoising effect of the target depth image does not meet the expected standard, the denoising strategy of the original depth image is adjusted until the denoising effect of the re-obtained target depth image meets the expected standard.
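
As a non-limiting illustration of paragraph [0132], the sketch below evaluates a denoising result and adjusts the strategy until an expected standard is met. It uses only one of the listed metrics (the effective information retention rate, taken here as the fraction of originally valid pixels that remain valid), and it adjusts the strategy by multiplying a single threshold by `step`. The `denoise` callable, `min_retention`, `step`, and `max_rounds` are hypothetical.

```python
def retention_rate(original, target, invalid=0.0):
    """Fraction of valid pixels in `original` that are still valid in `target`."""
    valid = kept = 0
    for row_o, row_t in zip(original, target):
        for d_o, d_t in zip(row_o, row_t):
            if d_o != invalid:
                valid += 1
                if d_t != invalid:
                    kept += 1
    return kept / valid if valid else 1.0


def denoise_with_feedback(original, denoise, thr, min_retention=0.8,
                          step=1.5, max_rounds=5):
    """Re-run `denoise(original, thr)` with a looser `thr` until enough
    valid pixels are retained or `max_rounds` attempts have been made."""
    target = denoise(original, thr)
    for _ in range(max_rounds - 1):
        if retention_rate(original, target) >= min_retention:
            break
        thr *= step  # adjust the denoising strategy
        target = denoise(original, thr)
    return target, thr
```

A practical system would likely combine several metrics and adjust more than one parameter; the loop structure stays the same.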

[0133] This application provides a computer program product, which includes computer-executable instructions or a computer program. When the computer-executable instructions or the computer program are executed by a processor, the processor will execute the image processing method provided in this application.

[0134] This application provides a computer-readable storage medium storing computer-executable instructions or a computer program. When the computer-executable instructions or the computer program are executed by a processor, the processor will execute the image processing method provided in this application, for example, the image processing method shown in Figure 3.

[0135] In some embodiments, the computer-readable storage medium may be a read-only memory (ROM), random access memory (RAM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory, magnetic surface memory, optical disk, or CD-ROM, etc.; or it may be a device that includes one or any combination of the above-mentioned memories.

[0136] In some embodiments, computer-executable instructions may take the form of programs, software, software modules, scripts, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.

[0137] As an example, computer-executable instructions may, but do not necessarily, correspond to files in a file system. They may be stored as part of a file that holds other programs or data, for example, in one or more scripts in a Hyper Text Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple co-located files (e.g., files that store one or more modules, subroutines, or code sections).

[0138] As an example, computer-executable instructions can be deployed to execute on a single electronic device, or on multiple electronic devices located at one location, or on multiple electronic devices distributed across multiple locations and interconnected via a communication network.

[0139] It should be noted that the embodiments of this application involve data such as images. When the embodiments of this application are applied to a specific product or technology, user permission or consent is required, and the collection, use and processing of the related data must comply with the relevant laws, regulations and standards of the relevant countries and regions.

[0140] The above description is merely an embodiment of this application and is not intended to limit the scope of protection of this application. Any modifications, equivalent substitutions, and improvements made within the spirit and scope of this application are included within the scope of protection of this application.

Claims

1. An image processing method, characterized in that the method includes: obtaining an original depth image used to train a depth estimation model, and obtaining a color image corresponding to the original depth image and a corresponding predicted depth image, wherein the predicted depth image is predicted based on the color image; determining a prediction consistency threshold based on the original depth image and the predicted depth image, and determining a depth similarity tolerance based on the original depth image, wherein the depth similarity tolerance is a threshold for the depth difference between any adjacent pixels in the original depth image, and the prediction consistency threshold is a threshold for the relative error of the depth of the same pixel in the original depth image and the predicted depth image; identifying multiple connected components from the original depth image based on the depth similarity tolerance and the prediction consistency threshold; and denoising the original depth image based on the multiple connected components to obtain a target depth image.

2. The method according to claim 1, characterized in that determining the prediction consistency threshold based on the original depth image and the predicted depth image includes: performing pixel alignment on the original depth image and the predicted depth image to obtain an alignment result, and determining the depth value difference between the original depth image and the predicted depth image based on the alignment result; determining the mean, standard deviation and absolute median of the prediction error based on the depth value difference, and generating a region quality assessment value based on the mean, standard deviation and absolute median of the prediction error; and constructing the prediction consistency threshold based on the region quality assessment value.

3. The method according to claim 1, characterized in that determining the depth similarity tolerance based on the original depth image includes: analyzing the depth gradient distribution of the original depth image based on the original depth image; and constructing the depth similarity tolerance based on the depth gradient distribution.

4. The method according to claim 1, characterized in that identifying multiple connected components from the original depth image based on the depth similarity tolerance and the prediction consistency threshold includes: traversing the pixels in the original depth image; for the i-th pixel encountered during traversal, when the i-th pixel has not yet been assigned to a connected component, determining at least one adjacent pixel of the i-th pixel, where i is an integer greater than or equal to 1; for each adjacent pixel, when it is determined, based on the depth similarity tolerance and the prediction consistency threshold, that the adjacent pixel meets a target condition, determining that the adjacent pixel and the i-th pixel belong to the same connected component, wherein the target condition includes at least one of the following: the depth difference between the adjacent pixel and the i-th pixel is less than or equal to the depth similarity tolerance; the relative error of the depth of the adjacent pixel in the original depth image and the predicted depth image is less than or equal to the prediction consistency threshold; and, when all pixels in the original depth image have been traversed, obtaining the multiple connected components in the original depth image.

5. The method according to claim 1, characterized in that denoising the original depth image based on the multiple connected components to obtain the target depth image includes: analyzing the effective pixel density and scene complexity of the original depth image based on the original depth image, and determining a minimum connected component area threshold based on the effective pixel density and the scene complexity; performing feature extraction on each of the multiple connected components to obtain multidimensional features of the connected components; determining the category of each connected component based on the minimum connected component area threshold and the multidimensional features, wherein the categories of the connected components include small-area connected components and large-area connected components, and different categories correspond to different denoising methods; and, for each connected component, denoising the original depth image using the denoising method corresponding to the category of the connected component to obtain the target depth image.

6. The method according to claim 5, characterized in that denoising the original depth image using the denoising method corresponding to the category of the connected component to obtain the target depth image includes: when the category of the connected component is the small-area connected component, performing the following processing for each connected component: determining the region in the predicted depth image corresponding to the connected component as a first region; evaluating the consistency between the first region and the connected component to obtain an evaluation result; if the evaluation result indicates that the first region and the connected component are inconsistent, deleting the connected component as noise; if the evaluation result indicates that the first region and the connected component are consistent, retaining the connected component; and obtaining the target depth image based on the processed connected components.

7. The method according to claim 5, characterized in that denoising the original depth image using the denoising method corresponding to the category of the connected component to obtain the target depth image includes: when the category of the connected component is the large-area connected component, determining the boundary pixels in the connected component, and determining the internal consistency of the connected component and the boundary sharpness corresponding to each boundary pixel; and performing local correction on the connected component based on the internal consistency and the boundary sharpness to obtain the target depth image.

8. The method according to claim 7, characterized in that performing local correction on the connected component based on the internal consistency and the boundary sharpness to obtain the target depth image includes: selecting target boundary pixels from the boundary pixels based on the boundary sharpness, wherein the target boundary pixels are boundary pixels whose boundary sharpness is lower than a sharpness threshold; identifying internal variance abnormal regions from the connected component based on the internal consistency, wherein the internal variance abnormal regions are local regions in the connected component where the depth value change is greater than a change threshold; and deleting the internal variance abnormal regions in the connected component and the target boundary pixels as noise to obtain the target depth image.

9. The method according to claim 1, characterized in that, after denoising the original depth image based on the multiple connected components to obtain the target depth image, the method further includes: determining a denoising evaluation result of the target depth image based on the target depth image and the original depth image, wherein the denoising evaluation result is used to quantify the denoising effect of the target depth image, and the denoising evaluation result includes at least one of the following: noise removal rate, effective information retention rate, and data consistency improvement degree; and, if the denoising evaluation result indicates that the denoising effect of the target depth image does not meet the expected standard, adjusting the denoising strategy of the original depth image until the denoising effect of the re-obtained target depth image meets the expected standard.

10. An image processing apparatus, characterized in that the apparatus includes: an acquisition module, configured to acquire an original depth image for training a depth estimation model, and to acquire a color image corresponding to the original depth image and a corresponding predicted depth image, wherein the predicted depth image is predicted based on the color image; a determining module, configured to determine a prediction consistency threshold based on the original depth image and the predicted depth image, and to determine a depth similarity tolerance based on the original depth image, wherein the depth similarity tolerance is a threshold for the depth difference between any adjacent pixels in the original depth image, and the prediction consistency threshold is a threshold for the relative error of the depth of the same pixel in the original depth image and the predicted depth image; an identification module, configured to identify multiple connected components from the original depth image based on the depth similarity tolerance and the prediction consistency threshold; and a denoising module, configured to denoise the original depth image based on the multiple connected components to obtain a target depth image.

11. An electronic device, characterized in that the electronic device includes: a memory, configured to store computer-executable instructions or a computer program; and a processor, configured to implement the image processing method according to any one of claims 1 to 9 when executing the computer-executable instructions or the computer program stored in the memory.

12. A computer-readable storage medium, characterized in that the storage medium stores computer-executable instructions or a computer program which, when executed by a processor, implement the image processing method according to any one of claims 1 to 9.