An image processing method, apparatus and device

By fusing disparity images using deep learning algorithms and target binocular algorithms, high-precision, noise-free target disparity images are generated, solving the problem of insufficient ranging accuracy in textureless or weakly textured scenes and achieving accurate ranging under conditions of multiple reflections and high resolution.

CN116843749B · Active · Publication Date: 2026-06-23 · HANGZHOU HIKVISION DIGITAL TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
HANGZHOU HIKVISION DIGITAL TECHNOLOGY CO LTD
Filing Date
2022-03-25
Publication Date
2026-06-23


Abstract

The application provides an image processing method, device and equipment, the method comprising: obtaining a first original image and a second original image corresponding to a target object; obtaining a first disparity image by using a deep learning algorithm based on the first original image and the second original image; obtaining a second disparity image by using a target binocular algorithm based on the first original image and the second original image; fusing the first disparity image and the second disparity image to obtain a target disparity image; and generating a depth image corresponding to the target object based on the target disparity image. Through the technical scheme of the application, the disparity image of the deep learning algorithm and the disparity image of the target binocular algorithm are fused to obtain a high-precision noise-free target disparity image for distance measurement, and the advantages of the deep learning algorithm and the target binocular algorithm in binocular disparity estimation are fully utilized.

Description

Technical Field

[0001] This application relates to the field of binocular vision, and more particularly to an image processing method, apparatus, and device.

Background Technology

[0002] Among existing ways of achieving the ranging function, the SGM (Semi-Global Matching) algorithm cannot achieve binocular ranging in scenes with no texture or weak texture; the TOF (Time of Flight) algorithm is easily affected by multiple reflections, which reduces its accuracy; and the structured light algorithm has difficulty handling high-resolution images and has a limited measurement distance.

[0003] In conclusion, there is currently no better distance measurement method in the industry.

Summary of the Invention

[0004] This application provides an image processing method, the method comprising:

[0005] Obtain the first and second original images corresponding to the target object;

[0006] Based on the first original image and the second original image, a deep learning algorithm is used to obtain the first disparity image;

[0007] Based on the first and second original images, a second disparity image is obtained using a target binocular algorithm;

[0008] The first disparity image and the second disparity image are fused to obtain the target disparity image;

[0009] Generate a depth image corresponding to the target object based on the target disparity image.

[0010] This application provides an image processing apparatus, the apparatus comprising:

[0011] The acquisition module is used to acquire the first and second original images corresponding to the target object;

[0012] The processing module is configured to obtain a first disparity image based on the first original image and the second original image using a deep learning algorithm, and to obtain a second disparity image based on the first original image and the second original image using a target binocular algorithm; and to fuse the first disparity image and the second disparity image to obtain a target disparity image corresponding to the target object.

[0013] The generation module is used to generate a depth image corresponding to the target object based on the target disparity image.

[0014] This application provides an autonomous driving device, including a first camera and a second camera, wherein:

[0015] The first camera captures a first original image of the target object, and the second camera captures a second original image of the target object.

[0016] Based on the first original image and the second original image, a first disparity image is obtained by using a deep learning algorithm. Based on the first original image and the second original image, a second disparity image is obtained by using a target binocular algorithm. The first disparity image and the second disparity image are then fused to obtain a target disparity image.

[0017] A depth image corresponding to the target object is generated based on the target disparity image, and the distance between the target object and the mobile robot is determined based on the depth image.

[0018] As can be seen from the above technical solutions, in this embodiment, a first disparity image can be obtained using a deep learning algorithm, a second disparity image can be obtained using a target binocular algorithm, and the two disparity images can be fused to obtain a target disparity image. A depth image is generated based on the target disparity image, and the distance to the target object is determined based on the depth image, thus achieving binocular ranging. Fusing the disparity image from the deep learning algorithm with the disparity image from the target binocular algorithm yields a high-precision, noise-free target disparity image, so the distance to the target object can be measured accurately. The method can therefore be applied to scenarios with high requirements for disparity maps or depth maps, such as obstacle avoidance for unmanned vehicles, passenger flow statistics, and liveness detection schemes based on depth maps. Because of the fusion, a high-precision, noise-free target disparity image can be obtained even in textureless or weakly textured scenes, and the method is unaffected by multiple reflections or image resolution, so the distance to the target object can still be measured accurately under those conditions. The distance can also be measured when the target object is far from the camera, which significantly increases the measurable distance.

Attached Figure Description

[0019] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments of this application or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments recorded in this application. For those skilled in the art, other drawings can be obtained based on these drawings of the embodiments of this application.

[0020] Figure 1 is a flowchart illustrating an image processing method according to one embodiment of this application;

[0021] Figure 2 is a flowchart illustrating a binocular ranging method in one embodiment of this application;

[0022] Figure 3 is a flowchart illustrating a binocular ranging method in one embodiment of this application;

[0023] Figure 4 is a schematic diagram of a confidence map-based fusion method in one embodiment of this application;

[0024] Figure 5 is a schematic diagram of a target feature generation network in one embodiment of this application;

[0025] Figure 6 is a schematic diagram of a network model-based fusion method in one embodiment of this application;

[0026] Figure 7 is a schematic diagram of a binocular ranging device based on multi-algorithm fusion in one embodiment of this application;

[0027] Figure 8 is a schematic diagram of the structure of an image processing apparatus according to one embodiment of this application;

[0028] Figure 9 is a hardware structure diagram of an image processing device according to one embodiment of this application.

Detailed Implementation

[0029] The terminology used in the embodiments of this application is for the purpose of describing particular embodiments only and is not intended to limit the application. The singular forms “a,” “an,” and “the” as used in this application and claims are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term “and/or” as used herein refers to any and all possible combinations comprising one or more of the associated listed items.

[0030] It should be understood that although the terms first, second, third, etc., may be used to describe various information in embodiments of this application, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of this application, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" may also be interpreted as "when," "upon," or "in response to a determination."

[0031] This application proposes an image processing method that can be applied to a device that supports binocular ranging. The device may include at least two cameras, i.e., the ranging function is achieved through at least two cameras. Taking two cameras as an example, these two cameras are referred to as the first camera and the second camera.

[0032] Referring to Figure 1, which is a flowchart of the image processing method, the method may include:

[0033] Step 101: Obtain the first original image and the second original image corresponding to the target object.

[0034] For example, a first original image corresponding to the target object can be captured by a first camera, and a second original image corresponding to the target object can be captured by a second camera. For instance, when a target object exists in the target scene, when capturing an image of the target scene using the first camera, the image may include the target object, and this image captured by the first camera is recorded as the first original image corresponding to the target object. When capturing an image of the target scene using the second camera, the image may include the target object, and this image captured by the second camera is recorded as the second original image corresponding to the target object.

[0035] Step 102: Based on the first original image and the second original image, obtain the first disparity image using a deep learning algorithm. For example, the first original image and the second original image can be input into a deep learning network (i.e., a network trained using a deep learning algorithm) to obtain the first disparity image.

[0036] Step 103: Based on the first and second original images, obtain the second disparity image using a target binocular algorithm. For example, the target binocular algorithm can also be called a traditional binocular algorithm, such as the SGM algorithm. Therefore, based on the first and second original images, the SGM algorithm can be used to obtain the second disparity image. The target binocular algorithm can be any algorithm capable of using disparity images to achieve ranging functionality.

[0037] Step 104: Fuse the first disparity image and the second disparity image to obtain the target disparity image.

[0038] In one possible implementation, based on a first disparity image and a second disparity image, multiple candidate pixels can be selected from all pixels in the first disparity image, and a target disparity image can be generated based on the first disparity value corresponding to the multiple candidate pixels in the first disparity image.

[0039] For example, based on a first disparity image and a second disparity image, multiple candidate pixels can be selected from all pixels in the first disparity image, including but not limited to the following methods:

[0040] Method 1: For each pixel in the first disparity image, determine the variable threshold corresponding to the pixel based on the second disparity value corresponding to the pixel in the second disparity image (i.e., each pixel in the first disparity image corresponds to a variable threshold individually); determine whether the pixel is a candidate pixel or not based on the first disparity value corresponding to the pixel in the first disparity image, the second disparity value corresponding to the pixel in the second disparity image, and the variable threshold corresponding to the pixel.

[0041] When determining whether a pixel is a candidate pixel or not, the following method can be used: if the absolute value of the difference between the first disparity value and the second disparity value is greater than the variable threshold, then the pixel is determined not to be a candidate pixel; if the absolute value of the difference between the first disparity value and the second disparity value is not greater than the variable threshold, then the pixel is determined to be a candidate pixel.

[0042] Method 2: Obtain the first feature corresponding to the first disparity image, the second feature corresponding to the second disparity image, and the third feature corresponding to the reference image. The reference image can be either the first original image or the second original image. Fuse the first feature, the second feature, and the third feature to obtain the target feature. Input the target feature into the trained first network model to obtain the confidence map corresponding to the target feature. Based on the confidence map, select multiple candidate pixels from all pixels in the first disparity image.

[0043] The confidence map includes the confidence value corresponding to each pixel in the first disparity image. When selecting multiple candidate pixels from all pixels in the first disparity image based on the confidence map, the following method can be used: For each pixel in the first disparity image, if the confidence value corresponding to the pixel in the confidence map is greater than a preset threshold, then the pixel is determined to be a candidate pixel; if the confidence value corresponding to the pixel in the confidence map is not greater than the preset threshold, then the pixel is determined not to be a candidate pixel.
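The selection rule in paragraph [0043] can be sketched as follows. This is a minimal illustration, assuming the confidence map is already available as an array with the same shape as the first disparity image. The function name, the default threshold of 0.5, and the use of 0 to mark rejected pixels are assumptions made for the example and are not taken from the application.

```python
import numpy as np

def select_candidates_by_confidence(first_disparity, confidence_map,
                                    preset_threshold=0.5, invalid=0.0):
    """Keep first-disparity values only where confidence exceeds the threshold."""
    first_disparity = np.asarray(first_disparity, dtype=np.float64)
    confidence_map = np.asarray(confidence_map, dtype=np.float64)
    # Strictly greater than the threshold: candidate pixel.
    # Not greater than the threshold: not a candidate pixel.
    candidate_mask = confidence_map > preset_threshold
    # Rejected pixels are marked with a hypothetical "invalid" value.
    target_disparity = np.where(candidate_mask, first_disparity, invalid)
    return target_disparity, candidate_mask
```

A pixel whose confidence equals the threshold is rejected, matching the "not greater than" wording of the paragraph.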

[0044] In another possible implementation, a first feature corresponding to a first disparity image, a second feature corresponding to a second disparity image, and a third feature corresponding to a reference image can be obtained. The reference image can be either a first original image or a second original image. The first feature, the second feature, and the third feature are fused to obtain a target feature, and the target feature is input into a trained second network model to obtain a target disparity image corresponding to the target feature. The target disparity image can include disparity values of multiple pixels.

[0045] Step 105: Generate a depth image corresponding to the target object based on the target disparity image. After obtaining the depth image, the distance to the target object can be determined based on the depth image, that is, binocular ranging is realized.

[0046] For example, the distance between the target object and the first camera can be determined based on the depth image; or, the distance between the target object and the second camera can be determined based on the depth image.
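Step 105 does not spell out how the depth image is computed. A common choice for a rectified stereo pair is the relation Z = f * B / d, sketched below. The focal length in pixels, the baseline in meters, the use of 0 for invalid depth, and the median-over-mask distance estimate are all assumptions made for this example rather than details from the application.

```python
import numpy as np

def disparity_to_depth(disparity, focal_length_px, baseline_m, invalid_depth=0.0):
    """Convert a disparity map (pixels) to a depth map (meters) via Z = f * B / d."""
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full(disparity.shape, invalid_depth, dtype=np.float64)
    valid = disparity > 0  # zero or negative disparity carries no depth information
    depth[valid] = focal_length_px * baseline_m / disparity[valid]
    return depth

def object_distance(depth, object_mask):
    """Distance to the target object, taken here as the median valid depth in a mask."""
    values = depth[np.asarray(object_mask, dtype=bool) & (depth > 0)]
    return float(np.median(values)) if values.size else float("nan")
```

The resulting depth is measured from the camera whose image the disparity map is aligned to, which corresponds to the choice between the first and second camera described above.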

[0047] As can be seen from the above technical solutions, in this embodiment, a first disparity image can be obtained using a deep learning algorithm, a second disparity image can be obtained using a target binocular algorithm, and the two disparity images can be fused to obtain a target disparity image. A depth image is generated based on the target disparity image, and the distance to the target object is determined based on the depth image, thus achieving binocular ranging. Fusing the disparity image from the deep learning algorithm with the disparity image from the target binocular algorithm yields a high-precision, noise-free target disparity image, so the distance to the target object can be measured accurately. The method can therefore be applied to scenarios with high requirements for disparity maps or depth maps, such as obstacle avoidance for unmanned vehicles, passenger flow statistics, and liveness detection schemes based on depth maps. Because of the fusion, a high-precision, noise-free target disparity image can be obtained even in textureless or weakly textured scenes, and the method is unaffected by multiple reflections or image resolution, so the distance to the target object can still be measured accurately under those conditions. The distance can also be measured when the target object is far from the camera, which significantly increases the measurable distance.

[0048] The technical solutions described above in the embodiments of this application will be explained below in conjunction with specific application scenarios.

[0049] This application proposes a binocular ranging method, applicable to the field of binocular vision, which estimates distance using binocular vision. Figure 2 shows the flowchart of this method. A first original image (also called the left image) corresponding to the target object can be acquired by a first camera (assuming the first camera is the left camera in a binocular setup, then the first camera can also be called the left camera), and a second original image (also called the right image) corresponding to the target object can be acquired by a second camera (assuming the second camera is the right camera in a binocular setup, then the second camera can also be called the right camera).

[0050] Based on the first and second original images, a first disparity image can be obtained using a deep learning algorithm. Based on the first and second original images, a second disparity image can be obtained using a target binocular algorithm; for example, the target binocular algorithm can also be called a traditional binocular algorithm.

[0051] After obtaining the first disparity image and the second disparity image, the first disparity image and the second disparity image can be fused to obtain a high-precision, noise-free target disparity image.

[0052] The embodiments of this application propose a binocular ranging method based on multi-algorithm fusion. Referring to Figure 3, which is a flowchart of the binocular ranging method, the method may include the following steps:

[0053] Step 301: Acquire a first original image corresponding to the target object using a first camera, and acquire a second original image corresponding to the target object using a second camera. For example, the acquisition time of the first original image and the acquisition time of the second original image can be the same, i.e., two original images are acquired at the same time.

[0054] Step 302: Based on the first original image and the second original image, a first disparity image is obtained using a deep learning algorithm. For example, the first original image and the second original image are input into a deep learning network, which processes the first original image and the second original image to obtain the first disparity image.

[0055] Obtaining the first disparity image with a deep learning algorithm involves a training process and a detection (inference) process for the deep learning network. During training, a deep learning network is trained and deployed; this process is performed before step 302. During detection, the first and second original images are input into the deep learning network, which outputs the first disparity image; this process is step 302.

[0056] The training process for the deep learning network involves acquiring sample data and an initial network model to be trained. This initial network model is used to output disparity images, and its structure is not restricted; it can be any deep learning network. The sample data can include multiple datasets. Each dataset can contain two sample images (i.e., two images captured by two cameras) and the corresponding sample disparity image (i.e., the ground-truth label for the two sample images).

[0057] Building upon this, for each dataset, two sample images from that dataset can be input into the initial network model. The initial network model processes these two sample images to obtain the corresponding output disparity image. Then, based on the difference between this output disparity image and the sample disparity images in the dataset, the network parameters of the initial network model are adjusted. This adjustment process is not restricted; the goal is to make the difference between the output disparity image and the sample disparity images increasingly smaller, i.e., to make them closer together. Of course, the above is just an example of network parameter adjustment and is not a limitation.

[0058] After adjusting the network parameters of the initial network model, the adjusted network model is used as the initial network model. The operation of "inputting two sample images from the dataset into the initial network model" is then performed. This process is repeated to continuously adjust the network parameters of the initial network model until the adjusted network model meets the convergence requirements. The adjusted network model is then used as the deep learning network, which is the trained deep learning network. This completes the training process of the deep learning network.

[0059] Of course, the above method is just an example of training a deep learning network. There are no restrictions on the training process, as long as a deep learning network can be trained and used to output disparity images.
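The iterative adjustment described in paragraphs [0056] to [0058] can be sketched with a deliberately tiny stand-in model. The application places no restriction on the network structure, so the per-pixel linear model, the mean-squared-error loss, the learning rate, and the convergence tolerance below are all assumptions chosen only to keep the loop runnable. A real deep learning network would replace `features` and the weight vector.

```python
import numpy as np

def features(left, right):
    """Per-pixel feature rows [left, right, 1] for the toy linear model."""
    return np.stack([left.ravel(), right.ravel(), np.ones(left.size)], axis=1)

def train_toy_disparity_model(datasets, lr=0.5, tol=1e-10, max_epochs=5000):
    """datasets: list of (left, right, sample_disparity) arrays of equal shape."""
    w = np.zeros(3)  # "network parameters" of the initial network model
    for _ in range(max_epochs):
        epoch_loss = 0.0
        for left, right, sample_disparity in datasets:
            x = features(left, right)
            # Difference between the output disparity and the sample disparity.
            residual = x @ w - sample_disparity.ravel()
            epoch_loss += np.mean(residual ** 2)
            # Adjust the parameters so that the difference shrinks (gradient step).
            w -= lr * 2.0 * x.T @ residual / residual.size
        # Hypothetical convergence requirement: mean loss below a tolerance.
        if epoch_loss / len(datasets) < tol:
            break
    return w
```

Each pass feeds the two sample images of every dataset through the current model, compares the output with the sample disparity image, and updates the parameters, repeating until the convergence requirement is met.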

[0060] Regarding the detection process of the deep learning network: In step 302, after obtaining the first original image and the second original image, the first original image and the second original image can be input into the deep learning network, which processes the first original image and the second original image. This processing procedure is not limited. After the deep learning network processes the first original image and the second original image, a disparity image is obtained and output. This disparity image is denoted as the first disparity image.

[0061] In summary, the first disparity image can be obtained using deep learning algorithms (i.e., deep learning networks).

[0062] For example, a deep learning network can be a network model based on deep learning. Deep learning is a type of machine learning, and its concept originated from research on artificial neural networks. Deep learning networks typically consist of multiple neural network layers. For instance, a deep learning network can be a neural network (NN), which is an artificial neural network (ANN). An ANN is an algorithmic mathematical model that mimics the behavioral characteristics of animal neural networks, performing distributed parallel information processing. This type of network relies on the complexity of the system, adjusting the interconnections between a large number of internal nodes to achieve information processing. As another example, a deep learning network can be a convolutional neural network (CNN). A CNN is a feedforward neural network whose artificial neurons can respond to a portion of the surrounding units within their coverage area.

[0063] Step 303: Based on the first original image and the second original image, obtain the second disparity image using a target binocular algorithm. For example, the target binocular algorithm is also called a traditional binocular algorithm, such as the SGM algorithm or another stereo matching algorithm. That is, the second disparity image can be obtained using the SGM algorithm.

[0064] For example, after obtaining the first and second original images, the second disparity image can be obtained using the SGM algorithm based on the first and second original images. There are no restrictions on the implementation process of the SGM algorithm. The SGM algorithm is a matching algorithm that lies between local matching and global matching, combining the strengths of both and achieving a good balance between accuracy and efficiency.

[0065] Of course, the SGM algorithm mentioned above is just an example of a target binocular algorithm. There are no restrictions on the type of target binocular algorithm. Any algorithm other than a deep learning algorithm that can achieve binocular ranging can be used as the target binocular algorithm, provided that it supports using disparity images to achieve binocular ranging.

[0066] For example, the target binocular algorithm can also be the SGBM (Semi-Global Block Matching) algorithm, the BM (Block Matching) algorithm, etc.
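To make the idea of a traditional binocular algorithm concrete, the following is a naive block-matching (BM) sketch using a sum of absolute differences over a square window. It is not SGM: it has no path-wise cost aggregation, and it is far slower than a production implementation. The function name, window size, and disparity search range are assumptions for illustration, and the inputs are assumed to be rectified grayscale images of equal size.

```python
import numpy as np

def block_matching_disparity(left, right, max_disparity=16, block_size=5):
    """Naive SAD block matching; returns a disparity map aligned to the left image."""
    left = np.asarray(left, dtype=np.float32)
    right = np.asarray(right, dtype=np.float32)
    h, w = left.shape
    half = block_size // 2
    # cost[d, y, x]: SAD between the left block at x and the right block at x - d.
    cost = np.full((max_disparity + 1, h, w), np.inf, dtype=np.float32)
    for d in range(max_disparity + 1):
        diff = np.abs(left[:, d:] - right[:, :w - d])  # shape (h, w - d)
        padded = np.pad(diff, half, mode="edge")
        sad = np.zeros_like(diff)
        # Sum the absolute differences over the block window.
        for dy in range(block_size):
            for dx in range(block_size):
                sad += padded[dy:dy + h, dx:dx + diff.shape[1]]
        cost[d, :, d:] = sad
    # Winner-takes-all: pick the disparity with the lowest cost per pixel.
    return np.argmin(cost, axis=0).astype(np.float32)
```

On uniform regions every candidate disparity has nearly the same cost, which is the textureless failure mode that motivates fusing this kind of output with a deep learning result.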

[0067] Step 304: Fuse the first disparity image and the second disparity image to obtain the target disparity image.

[0068] Considering the limitations of deep learning algorithms in generalization performance, the first disparity image is not directly used as the target disparity image. Similarly, considering that the SGM algorithm cannot achieve binocular ranging in textureless or weakly textured scenes, the second disparity image is not directly used as the target disparity image. In this embodiment, the first and second disparity images are fused to obtain the target disparity image. This involves fusing the first disparity image from the deep learning algorithm with the second disparity image from the target binocular algorithm, fully leveraging the advantages of both algorithms in binocular disparity estimation, and using a fusion method to obtain a high-precision, noise-free target disparity image.

[0069] In this embodiment, the first disparity image is the disparity image corresponding to the first original image, the second disparity image is the disparity image corresponding to the first original image, and the target disparity image is the disparity image corresponding to the first original image. Alternatively, the first disparity image is the disparity image corresponding to the second original image, the second disparity image is the disparity image corresponding to the second original image, and the target disparity image is the disparity image corresponding to the second original image.

[0070] For example, suppose pixel a1 in the first original image corresponds to pixel b1 in the second original image. Based on the pixel values of pixel a1 and pixel b1, the disparity value of pixel c1 in the disparity image can be obtained. If pixel c1 matches pixel a1, meaning the position of pixel c1 in the disparity image is the same as the position of pixel a1 in the first original image, then the first disparity image is the disparity image corresponding to the first original image, the second disparity image is the disparity image corresponding to the first original image, and the target disparity image is the disparity image corresponding to the first original image. Alternatively, suppose pixel c1 matches pixel b1, meaning the position of pixel c1 in the disparity image is the same as the position of pixel b1 in the second original image. Then, the first disparity image is the disparity image corresponding to the second original image, the second disparity image is the disparity image corresponding to the second original image, and the target disparity image is the disparity image corresponding to the second original image.

[0071] In one possible implementation, when fusing the first disparity image and the second disparity image to obtain the target disparity image, the following methods can be used. Of course, the following methods are only examples, and the fusion method is not limited, as long as it can fuse the first disparity image and the second disparity image.

[0072] Method 1: A fusion method based on a variable threshold, which may include the following steps:

[0073] Step S11: For each pixel in the first disparity image, based on the second disparity value corresponding to that pixel in the second disparity image, determine the variable threshold corresponding to that pixel. That is, the variable threshold is determined separately for each pixel in the first disparity image. In this way, the variable thresholds corresponding to different pixels may be the same, or the variable thresholds corresponding to different pixels may be different.

[0074] For example, the variable threshold corresponding to a pixel can be proportional to the second disparity value corresponding to that pixel. That is, the larger the second disparity value corresponding to the pixel, the larger the variable threshold corresponding to the pixel; the smaller the second disparity value corresponding to the pixel, the smaller the variable threshold corresponding to the pixel.

[0075] For example, a function of d1 can be used as the variable threshold corresponding to a pixel, where d1 represents the second disparity value corresponding to that pixel in the second disparity image. The variable threshold corresponding to that pixel can thus be obtained from the second disparity value. Of course, the above is only an example of determining the variable threshold and is not a limitation.

[0076] In summary, for each pixel in the first disparity image, the second disparity value corresponding to that pixel in the second disparity image can be queried, and the variable threshold corresponding to the pixel can be determined based on the second disparity value.

[0077] Step S12: Based on the first disparity image and the second disparity image, select multiple candidate pixels from all pixels in the first disparity image. Specifically, for each pixel in the first disparity image, based on the first disparity value corresponding to the pixel in the first disparity image, the second disparity value corresponding to the pixel in the second disparity image, and the variable threshold corresponding to the pixel, it can be determined whether the pixel is a candidate pixel or not. For example, if the absolute value of the difference between the first disparity value and the second disparity value is greater than the variable threshold, then the pixel is determined not to be a candidate pixel; if the absolute value of the difference between the first disparity value and the second disparity value is not greater than the variable threshold, then the pixel is determined to be a candidate pixel.

[0078] For example, in the disparity estimation task, the ranging error caused by a given disparity estimation error is proportional to the square of the distance. A one-pixel deviation in disparity estimation for a distant point is therefore less acceptable than a one-pixel deviation for a nearby point. Therefore, the variable threshold function given in this embodiment is as shown in formula (1):

[0079]

[0080] In formula (1), d1 represents the second disparity value of the pixel in the second disparity image, and d2 represents the first disparity value of the pixel in the first disparity image. The remaining term in formula (1) represents the variable threshold corresponding to the pixel. This variable threshold function can be used to remove disparity values that do not conform to the above relationship and retain the remaining disparity values. Thus, a disparity map with higher accuracy and less noise can be obtained based on the remaining disparity values.

[0081] As can be seen from formula (1), for each pixel in the first disparity image, if the absolute value of the difference between the first disparity value d2 corresponding to the pixel in the first disparity image and the second disparity value d1 corresponding to the pixel in the second disparity image (i.e., |d1-d2|) is greater than the variable threshold, then the pixel is determined not to be a candidate pixel, that is, the first disparity value of the pixel needs to be removed from the first disparity image.

[0082] If the absolute value of the difference between the first disparity value d2 corresponding to the pixel in the first disparity image and the second disparity value d1 corresponding to the pixel in the second disparity image is not greater than the variable threshold, then the pixel is determined to be a candidate pixel, that is, the first disparity value of the pixel needs to be retained in the first disparity image.

[0083] Of course, the above formula (1) is just an example and is not a limitation. Any approach is acceptable as long as it can determine whether a pixel is a candidate pixel based on the first disparity value, the second disparity value, and the variable threshold.
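The motivation in [0078] can be checked numerically. The sketch below assumes the standard stereo triangulation relation depth = b * f / disparity; the baseline and focal length values are hypothetical and chosen only to make the numbers easy to read.

```python
def depth_from_disparity(disparity_px, baseline_m, focal_px):
    """Standard stereo triangulation: depth = baseline * focal length / disparity."""
    return baseline_m * focal_px / disparity_px


def depth_error_for_one_pixel(distance_m, baseline_m, focal_px):
    """Depth error when the true disparity is underestimated by one pixel."""
    true_disparity = baseline_m * focal_px / distance_m
    return depth_from_disparity(true_disparity - 1.0, baseline_m, focal_px) - distance_m


BASELINE_M = 0.1   # hypothetical baseline, metres
FOCAL_PX = 1000.0  # hypothetical focal length, pixels

near_error = depth_error_for_one_pixel(2.0, BASELINE_M, FOCAL_PX)
far_error = depth_error_for_one_pixel(10.0, BASELINE_M, FOCAL_PX)
print(f"one-pixel error at 2 m:  {near_error:.3f} m")
print(f"one-pixel error at 10 m: {far_error:.3f} m")
```

Under these assumptions the same one-pixel deviation costs far more range accuracy at 10 m than at 2 m, which is why a threshold that shrinks with the disparity value is stricter for distant points.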

[0084] Step S13: Generate a target disparity image based on the first disparity values corresponding to multiple candidate pixels in the first disparity image. For example, if the size of the target disparity image is the same as the size of the first disparity image, for each candidate pixel in the first disparity image, find a target pixel in the target disparity image that matches the candidate pixel (i.e., the position of the target pixel in the target disparity image is the same as the position of the candidate pixel in the first disparity image), and use the first disparity value of the candidate pixel as the disparity value of the target pixel. In this way, the disparity values of all target pixels constitute the target disparity image.
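Steps S12 and S13 can be sketched together as follows. This is only an illustration: the threshold is assumed to be k * d1 rather than the embodiment's formula (1), and the `INVALID` marker used for removed pixels, the factor `k`, and the function name are hypothetical.

```python
INVALID = 0.0  # hypothetical marker for pixels whose disparity value was removed


def fuse_with_variable_threshold(first_disparity, second_disparity, k=0.125):
    """Keep a first disparity value only where |d1 - d2| is not greater than the threshold."""
    target = []
    for first_row, second_row in zip(first_disparity, second_disparity):
        target_row = []
        for d2, d1 in zip(first_row, second_row):  # d2: first value, d1: second value
            threshold = k * d1                      # assumed proportional threshold
            is_candidate = abs(d1 - d2) <= threshold
            target_row.append(d2 if is_candidate else INVALID)
        target.append(target_row)
    return target


first_disparity = [[10.5, 30.0], [40.0, 70.0]]   # from the deep learning algorithm
second_disparity = [[10.0, 20.0], [40.0, 80.0]]  # from the target binocular algorithm
target_disparity = fuse_with_variable_threshold(first_disparity, second_disparity)
```

In this example the pixel at row 0, column 1 is removed because the two estimates differ by 10 while its threshold is 2.5. The pixel at row 1, column 1 is kept because its difference of 10 is not greater than its threshold of 10.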

[0085] In summary, it can be seen that when using a fusion method based on a variable threshold, a high-precision, noise-free target disparity image can be obtained based on the first disparity image and the second disparity image.

[0086] Method 2: Fusion method based on a confidence map. Figure 4 is a schematic diagram of the fusion method based on a confidence map. A confidence map is obtained based on a first network model. Multiple candidate pixels are selected from all pixels in the first disparity image based on the confidence map. A target disparity image is generated based on the first disparity values corresponding to these candidate pixels in the first disparity image. This fusion method may include the following steps:

[0087] Step S21: Obtain the first feature corresponding to the first disparity image, the second feature corresponding to the second disparity image, and the third feature corresponding to the reference image, wherein the reference image is either the first original image or the second original image.

[0088] For example, the first disparity image is input into the feature extraction network (a network used to extract image features, the type of which is not limited), and the feature extraction network outputs the first feature (such as local texture features, the type of which is not limited) corresponding to the first disparity image. The second disparity image is input into the feature extraction network, and the feature extraction network outputs the second feature corresponding to the second disparity image. The reference image is input into the feature extraction network, and the feature extraction network outputs the third feature corresponding to the reference image.

[0089] For example, an image feature extraction algorithm (the type of algorithm used to extract image features is not limited) is used to extract the first feature (such as local texture features, etc., the type of feature is not limited) corresponding to the first disparity image, and the image feature extraction algorithm is used to extract the second feature corresponding to the second disparity image, and the image feature extraction algorithm is used to extract the third feature corresponding to the reference image.

[0090] Of course, the above method is just an example and is not a limitation. As long as the first feature corresponding to the first disparity image, the second feature corresponding to the second disparity image, and the third feature corresponding to the reference image can be obtained, it is acceptable.

[0091] In the above embodiments, the reference image can be an RGB image or other types of images, such as a first original image or a second original image. For example, if the first disparity image is the disparity image corresponding to the first original image, and the second disparity image is the disparity image corresponding to the first original image, then the reference image can be the first original image. If the first disparity image is the disparity image corresponding to the second original image, and the second disparity image is the disparity image corresponding to the second original image, then the reference image can be the second original image.

[0092] Step S22: Fuse the first feature, the second feature, and the third feature to obtain the target feature.

[0093] For example, the first, second, and third features can be weighted and fused to obtain the target feature. Of course, other fusion methods can also be used to obtain the target feature; there are no restrictions on this. When weighting and fusing the first, second, and third features, the following formula can be used:

[0094] F = F1*W1 + F2*W2 + F3*W3

[0095] In the above formula, F represents the target feature, F1 represents the first feature, F2 represents the second feature, F3 represents the third feature, W1 represents the weight coefficient of the first feature, W2 represents the weight coefficient of the second feature, and W3 represents the weight coefficient of the third feature. The values of W1, W2, and W3 can be configured empirically, and there are no restrictions on their values. W1 and W2 can be the same or different, W1 and W3 can be the same or different, and W2 and W3 can be the same or different.

[0096] In one possible implementation, the sum of W1, W2, and W3 can be 1.
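The weighted fusion F = F1*W1 + F2*W2 + F3*W3 can be illustrated with a small element-wise sketch. The feature maps are represented as plain nested lists, and the weight values are hypothetical, chosen so that they sum to 1.

```python
def weighted_fuse(f1, f2, f3, w1, w2, w3):
    """Element-wise weighted sum of three equally sized 2D feature maps."""
    return [
        [w1 * a + w2 * b + w3 * c for a, b, c in zip(row1, row2, row3)]
        for row1, row2, row3 in zip(f1, f2, f3)
    ]


first_feature = [[4.0, 8.0]]
second_feature = [[2.0, 4.0]]
third_feature = [[0.0, 4.0]]

# Hypothetical weights with W1 + W2 + W3 = 1.
target_feature = weighted_fuse(first_feature, second_feature, third_feature,
                               0.5, 0.25, 0.25)
```

With weights that sum to 1, the target feature stays in the same value range as the inputs, which can make downstream thresholds easier to configure.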

[0097] In one possible implementation, a target feature generation network (i.e., a shared weight network) can be constructed. The input data of this network consists of a first disparity image, a second disparity image, and a reference image. The output data of this network is the target feature (i.e., the target feature map). This target feature generation network may include N+1 two-dimensional convolutional layers and N stacked layers, where N is a positive integer. Figure 5 is a schematic diagram of a target feature generation network. Figure 5 takes N=4 as an example; in practical applications, N can be larger or smaller. As shown in Figure 5, the input data of the two-dimensional convolutional layer 1 are a first disparity image, a second disparity image, and a reference image. The two-dimensional convolutional layer 1 processes the first disparity image to obtain a feature map a1 of scale 1 corresponding to the first disparity image. The two-dimensional convolutional layer 1 processes the second disparity image to obtain a feature map b1 of scale 1 corresponding to the second disparity image. The two-dimensional convolutional layer 1 processes the reference image to obtain a feature map c1 of scale 1 corresponding to the reference image.

[0098] The input data of the two-dimensional convolutional layer 2 are feature map a1, feature map b1 and feature map c1. The two-dimensional convolutional layer 2 processes feature map a1 to obtain feature map a2 at scale 2 corresponding to feature map a1. The two-dimensional convolutional layer 2 processes feature map b1 to obtain feature map b2 at scale 2 corresponding to feature map b1. The two-dimensional convolutional layer 2 processes feature map c1 to obtain feature map c2 at scale 2 corresponding to feature map c1.

[0099] The input data of the two-dimensional convolutional layer 3 are feature map a2, feature map b2 and feature map c2. The two-dimensional convolutional layer 3 processes feature map a2 to obtain feature map a3 at scale 3 corresponding to feature map a2. The two-dimensional convolutional layer 3 processes feature map b2 to obtain feature map b3 at scale 3 corresponding to feature map b2. The two-dimensional convolutional layer 3 processes feature map c2 to obtain feature map c3 at scale 3 corresponding to feature map c2.

[0100] The input data of the two-dimensional convolutional layer 4 are feature map a3, feature map b3 and feature map c3. The two-dimensional convolutional layer 4 processes feature map a3 to obtain feature map a4 at scale 4. The two-dimensional convolutional layer 4 processes feature map b3 to obtain feature map b4 at scale 4. The two-dimensional convolutional layer 4 processes feature map c3 to obtain feature map c4 at scale 4.

[0101] In the above example, feature maps a1, a2, a3, and a4 can be understood as the first features corresponding to the first disparity image, and feature maps b1, b2, b3, and b4 can be understood as the second features corresponding to the second disparity image, and feature maps c1, c2, c3, and c4 can be understood as the third features corresponding to the reference image.

[0102] As shown in Figure 5, the input data for stacked layer 1 are feature maps a1, b1, and c1. Stacked layer 1 can stack feature maps a1, b1, and c1 to obtain feature map d1 at scale 1. The input data for stacked layer 2 are feature maps a2, b2, and c2. Stacked layer 2 can stack feature maps a2, b2, and c2 to obtain feature map d2 at scale 2. The input data for stacked layer 3 are feature maps a3, b3, and c3. Stacked layer 3 can stack feature maps a3, b3, and c3 to obtain feature map d3 at scale 3. The input data for stacked layer 4 are feature maps a4, b4, and c4. Stacked layer 4 can stack feature maps a4, b4, and c4 to obtain feature map d4 at scale 4.

[0103] As shown in Figure 5, the input data of the two-dimensional convolutional layer 5 are feature maps d1, d2, d3, and d4. The two-dimensional convolutional layer 5 processes feature maps d1, d2, d3, and d4 (for example, by convolution) to obtain the final feature map. The final feature map can be used as the target feature. Thus, steps S21 and S22 are completed, and the target feature is obtained by stacking multiple scales.
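The shared-weight, multi-scale structure of layers 1 to 4 and the stacked layers can be sketched as below. Several details are assumptions made only for illustration: each convolutional layer is modelled as a 2x2 kernel with stride 2 (so each scale halves the resolution), the kernels are fixed averaging kernels rather than learned weights, and stacking is modelled as grouping the three maps as channels. The final two-dimensional convolutional layer 5 is omitted because the text does not state how the four scales are combined.

```python
def conv2d_stride2(image, kernel):
    """Apply a 2x2 kernel with stride 2 (no padding) to a 2D list of numbers."""
    rows, cols = len(image), len(image[0])
    out = []
    for y in range(0, rows - 1, 2):
        out_row = []
        for x in range(0, cols - 1, 2):
            out_row.append(
                kernel[0][0] * image[y][x] + kernel[0][1] * image[y][x + 1]
                + kernel[1][0] * image[y + 1][x] + kernel[1][1] * image[y + 1][x + 1]
            )
        out.append(out_row)
    return out


def multi_scale_stacks(first_disp, second_disp, reference, kernels):
    """Return [d1, ..., dN]; each d_i groups (a_i, b_i, c_i) as three channels."""
    a, b, c = first_disp, second_disp, reference
    stacks = []
    for kernel in kernels:           # one kernel per layer, shared by all three inputs
        a = conv2d_stride2(a, kernel)
        b = conv2d_stride2(b, kernel)
        c = conv2d_stride2(c, kernel)
        stacks.append([a, b, c])     # stacked layer i
    return stacks


def constant_image(size, value):
    return [[value] * size for _ in range(size)]


mean_kernel = [[0.25, 0.25], [0.25, 0.25]]  # hypothetical fixed weights
stacks = multi_scale_stacks(
    constant_image(16, 3.0), constant_image(16, 5.0), constant_image(16, 1.0),
    kernels=[mean_kernel] * 4,               # N = 4, as in Figure 5
)
scale_sizes = [len(stack[0]) for stack in stacks]
```

For 16x16 inputs this produces stacked maps of size 8, 4, 2 and 1. In a real network the kernels would be learned, and a deep learning framework would replace the hand-written convolution.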

[0104] Step S23: Input the target feature into the trained first network model to obtain the confidence map corresponding to the target feature. The confidence map may include the confidence value corresponding to each pixel in the first disparity image. For example, the target feature can be input into the first network model, and the first network model processes the target feature (the processing method is not restricted) to obtain the confidence map corresponding to the target feature.

[0105] Obtaining the confidence map using the first network model involves a training process and a detection process for the first network model. In the training process, the first network model is trained and deployed; this process occurs before step S23. In the detection process, target features are input into the first network model, which processes these features to obtain and output the confidence map; this process is step S23.

[0106] The training process for the first network model involves obtaining sample data and an initial network model to be trained. The initial network model is used to output a confidence map. There are no restrictions on the structure of the initial network model; it can be a deep learning network model or other types of machine learning network models.

[0107] The sample data may include multiple datasets. Each dataset may include two sample disparity images (corresponding to the first disparity image and the second disparity image in the above embodiment), one sample original image (i.e., an RGB image, corresponding to the reference image in the above embodiment), and one sample confidence image (i.e., the calibration information corresponding to the two sample disparity images). The pixel value of each pixel in the sample confidence image can be a first value or a second value. The first value can be 1, which indicates that the disparity value at the position of that pixel in the sample disparity image is a reliable disparity value. The second value can be 0, which indicates that the disparity value at the position of that pixel in the sample disparity image is an unreliable disparity value.

[0108] Based on this, for each dataset, target features can be generated using the two sample disparity images and the sample original image from that dataset. The generation method is described in steps S21 and S22, and will not be repeated here. After obtaining the target features, these features are input into the initial network model, which processes them to obtain the corresponding output confidence image. Then, based on the difference between the output confidence image and the sample confidence image in the dataset, the network parameters of the initial network model are adjusted. The adjustment process is not limited; the goal is to make the output confidence image closer and closer to the sample confidence image, i.e., to make the difference between them smaller and smaller. Of course, the above is just an example of network parameter adjustment, and the adjustment method is not limited.

[0109] After adjusting the network parameters of the initial network model, the adjusted network model can be used as the initial network model, and the operation of "inputting the target features into the initial network model" can be performed again. This process can be repeated to continuously adjust the network parameters of the initial network model until the adjusted network model meets the convergence requirements. The adjusted network model is then used as the first network model, which is the trained first network model. This completes the training process of the first network model.
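The iterate-until-convergence procedure in [0108] and [0109] can be sketched with a deliberately small stand-in model. The text leaves the network structure open, so this example swaps in a per-pixel logistic regression trained by gradient descent; the feature vectors, learning rate, tolerance, and all function names are hypothetical.

```python
import math


def predict_confidence(weights, bias, feature):
    """Map one per-pixel target feature vector to a confidence value in (0, 1)."""
    z = bias + sum(w * x for w, x in zip(weights, feature))
    return 1.0 / (1.0 + math.exp(-z))


def mean_loss(weights, bias, features, labels):
    """Binary cross-entropy between output confidences and sample confidences."""
    eps = 1e-12
    total = 0.0
    for feature, label in zip(features, labels):
        p = predict_confidence(weights, bias, feature)
        total -= label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps)
    return total / len(features)


def train(features, labels, lr=0.5, tol=1e-6, max_iters=5000):
    """Adjust parameters repeatedly until the loss change falls below `tol`."""
    weights, bias = [0.0] * len(features[0]), 0.0
    prev_loss = mean_loss(weights, bias, features, labels)
    n = len(features)
    for _ in range(max_iters):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for feature, label in zip(features, labels):
            err = predict_confidence(weights, bias, feature) - label
            for i, x in enumerate(feature):
                grad_w[i] += err * x
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
        loss = mean_loss(weights, bias, features, labels)
        if abs(prev_loss - loss) < tol:   # convergence requirement (assumed form)
            break
        prev_loss = loss
    return weights, bias


# Hypothetical per-pixel features and sample confidence labels (1 = reliable).
sample_features = [[0.1, 0.9], [0.2, 0.8], [2.0, 0.1], [3.0, 0.2]]
sample_labels = [1, 1, 0, 0]

initial_loss = mean_loss([0.0, 0.0], 0.0, sample_features, sample_labels)
weights, bias = train(sample_features, sample_labels)
final_loss = mean_loss(weights, bias, sample_features, sample_labels)
```

The loop mirrors the described procedure: compute an output, compare it with the sample confidence, adjust the parameters, and repeat until a convergence requirement is met. A real first network model would typically be trained with a deep learning framework instead.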

[0110] Of course, the above method is just an example of training the first network model. There are no restrictions on the training process, as long as the first network model can be trained and used to output the confidence map.

[0111] Regarding the detection process of the first network model: In step S23, after obtaining the target features, the target features can be input into the first network model, which processes the target features. This processing procedure is not limited. After the first network model processes the target features, a confidence map can be obtained and output. This confidence map can include the confidence value corresponding to each pixel in the first disparity image. The confidence value can be a value between 0 and 1. A larger confidence value indicates that the disparity value corresponding to the pixel in the first disparity image is more reliable; a smaller confidence value indicates that the disparity value corresponding to the pixel in the first disparity image is less reliable.

[0112] For example, the first network model can be a deep learning-based network model, a machine learning-based network model, or a neural network-based network model, without any limitation.

[0113] Step S24: Select multiple candidate pixels from all pixels in the first disparity image based on the confidence map. For example, for each pixel in the first disparity image, if the confidence value corresponding to the pixel in the confidence map is greater than a preset threshold, then the pixel is determined to be a candidate pixel; if the confidence value corresponding to the pixel in the confidence map is not greater than the preset threshold, then the pixel is determined not to be a candidate pixel.

[0114] For example, since the confidence map includes the confidence value corresponding to each pixel in the first disparity image, the confidence value corresponding to that pixel in the confidence map can be determined for each pixel in the first disparity image. Furthermore, since the confidence value is a value between 0 and 1, a higher confidence value indicates a more reliable disparity value in the first disparity image, while a lower confidence value indicates a less reliable disparity value. Therefore, if the confidence value corresponding to a pixel in the confidence map is greater than a preset threshold (which can be configured empirically and is a value between 0 and 1, such as 0.8, 0.9, etc.), it indicates that the disparity value corresponding to that pixel in the first disparity image is reliable, i.e., the pixel is determined to be a candidate pixel. If the confidence value corresponding to a pixel in the confidence map is not greater than the preset threshold, it indicates that the disparity value corresponding to that pixel in the first disparity image is unreliable, i.e., the pixel is determined not to be a candidate pixel.

[0115] In summary, it can be seen that unreliable disparity values (i.e., erroneous disparity values) in the first disparity image can be removed based on the confidence map, thereby obtaining a high-precision, noise-free disparity map, i.e., the target disparity image.

[0116] Step S25: Generate a target disparity image based on the first disparity values corresponding to multiple candidate pixels in the first disparity image. For example, if the size of the target disparity image is the same as the size of the first disparity image, for each candidate pixel in the first disparity image, find a target pixel in the target disparity image that matches the candidate pixel (i.e., the position of the target pixel in the target disparity image is the same as the position of the candidate pixel in the first disparity image), and use the first disparity value of the candidate pixel as the disparity value of the target pixel. In this way, the disparity values of all target pixels constitute the target disparity image.
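Steps S24 and S25 amount to a per-pixel mask on the first disparity image. The sketch below assumes the confidence map has already been produced by the first network model; the `INVALID` marker for removed pixels and the function name are hypothetical, and 0.8 is one of the example preset thresholds mentioned in [0114].

```python
INVALID = 0.0  # hypothetical marker for pixels that are not candidate pixels


def fuse_with_confidence(first_disparity, confidence_map, preset_threshold=0.8):
    """Keep a first disparity value only where confidence is greater than the threshold."""
    return [
        [d if conf > preset_threshold else INVALID for d, conf in zip(d_row, c_row)]
        for d_row, c_row in zip(first_disparity, confidence_map)
    ]


first_disparity = [[12.0, 35.0], [8.0, 50.0]]
confidence_map = [[0.95, 0.40], [0.80, 0.81]]
target_disparity = fuse_with_confidence(first_disparity, confidence_map)
```

The pixel with confidence exactly 0.80 is removed, because a candidate pixel requires a confidence value greater than the preset threshold, not equal to it.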

[0117] In summary, it can be seen that when using the confidence map-based fusion method, a high-precision, noise-free target disparity image can be obtained based on the first disparity image and the second disparity image.

[0118] Method 3: Fusion method based on a network model. Figure 6 is a schematic diagram of the fusion method based on a network model, which directly obtains the target disparity image based on the network model. This fusion method may include:

[0119] Step S31: Obtain the first feature corresponding to the first disparity image, the second feature corresponding to the second disparity image, and the third feature corresponding to the reference image, wherein the reference image is either the first original image or the second original image.

[0120] Step S32: Fuse the first feature, the second feature, and the third feature to obtain the target feature.

[0121] For example, steps S31-S32 are similar to steps S21-S22, and will not be described again here.

[0122] Step S33: Input the target feature into the trained second network model to obtain the target disparity image corresponding to the target feature. The target disparity image may include the disparity values of multiple pixels. For example, the target feature can be input into the second network model, which processes the target feature (the processing method is not restricted) to obtain the target disparity image corresponding to the target feature.

[0123] To obtain the target disparity image using the second network model, the process involves training and detection of the second network model. During training, the second network model is trained and deployed; this process is performed before step S33. During detection, target features are input into the second network model, which processes these features to obtain and output the target disparity image; this process is step S33.

[0124] The training process for the second network model involves obtaining sample data and an initial network model to be trained. The initial network model is used to output disparity images. There are no restrictions on the structure of this initial network model; it can be a deep learning network model or other types of machine learning network models.

[0125] The sample data may include multiple datasets. For each dataset, it may include two sample disparity images (corresponding to the first disparity image and the second disparity image in the above embodiment), one sample original image (i.e., an RGB image, corresponding to the reference image in the above embodiment), and one sample calibration disparity image (i.e., the calibration information corresponding to the two sample disparity images, where the pixel value of each pixel in the sample calibration disparity image is a reliable disparity value, that is, the disparity value in the sample calibration disparity image is real and reliable).

[0126] Based on this, for each dataset, target features can be generated using the two sample disparity images and the sample original image from that dataset. The generation method is described in steps S21 and S22, and will not be repeated here. After obtaining the target features, these features are input into the initial network model, which processes them to obtain the corresponding output disparity image. Then, based on the difference between the output disparity image and the sample calibration disparity image in the dataset, the network parameters of the initial network model are adjusted. The adjustment process is not limited; the goal is to make the output disparity image closer and closer to the sample calibration disparity image, i.e., to make the difference between them smaller and smaller. Of course, the above is just an example of network parameter adjustment, and the adjustment method is not limited.

[0127] After adjusting the network parameters of the initial network model, the adjusted network model can be used as the initial network model, and the operation of "inputting the target features into the initial network model" can be performed again. This process can be repeated to continuously adjust the network parameters of the initial network model until the adjusted network model meets the convergence requirements. The adjusted network model is then used as the second network model, which is the trained second network model. This completes the training process of the second network model.

[0128] Of course, the above method is just an example of training a second network model. There are no restrictions on the training process, as long as a second network model can be trained and used to output disparity images.

[0129] Regarding the detection process of the second network model: In step S33, after obtaining the target features, the target features can be input into the second network model, which processes the target features. This processing procedure is not restricted. After the second network model processes the target features, a disparity image is obtained and output. For ease of distinction, this disparity image can be denoted as the target disparity image, which is a high-precision, noise-free target disparity image.

[0130] For example, the second network model can be a deep learning-based network model, a machine learning-based network model, or a neural network-based network model, without any limitation.

[0131] In summary, it can be seen that when using a network model-based fusion method, a high-precision, noise-free target disparity image can be obtained based on the first disparity image and the second disparity image.

[0132] Step 305: Generate a depth image corresponding to the target object based on the target disparity image.

[0133] Step 306: Determine the distance to the target object based on the depth image, that is, realize binocular ranging.

[0134] For example, after obtaining the target disparity image, the depth value corresponding to each pixel in the target disparity image can be calculated using formula (2), and the depth values corresponding to all pixels in the target disparity image constitute the depth image. When performing binocular ranging on the target object, the depth value corresponding to each pixel in the target disparity image is the depth value corresponding to the target object, that is, the depth image is the depth image corresponding to the target object. Because the depth image represents the distance between the target object and the camera, the distance to the target object can be determined from it, thereby realizing binocular ranging.

[0135] depth = (b * f) / disparity    (2)

[0136] For each pixel in the target disparity image, depth represents the depth value corresponding to that pixel in the depth image, b represents the baseline distance between the two cameras, i.e., the baseline distance between the first camera and the second camera, f represents the focal length, i.e., the focal length of the first camera (the focal length of the second camera is the same as that of the first camera), and disparity represents the disparity value corresponding to that pixel in the target disparity image.
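The conversion from the target disparity image to a depth image can be sketched as follows, assuming the standard triangulation relation depth = b * f / disparity, which is consistent with the symbols defined above. Treating non-positive disparity values as invalid, the `invalid` marker, and the camera parameters are assumptions made for illustration.

```python
def disparity_to_depth(target_disparity, baseline, focal_length, invalid=0.0):
    """Convert each valid disparity value to depth; leave invalid pixels marked."""
    depth_image = []
    for row in target_disparity:
        depth_image.append(
            [baseline * focal_length / d if d > 0 else invalid for d in row]
        )
    return depth_image


# Hypothetical camera parameters: 0.1 m baseline, 1000 px focal length.
target_disparity = [[50.0, 0.0], [10.0, 100.0]]
depth_image = disparity_to_depth(target_disparity, baseline=0.1, focal_length=1000.0)
```

With the baseline in metres and the focal length and disparity in pixels, the resulting depth values are in metres. Pixels removed during fusion carry no disparity value and therefore receive no depth value in this sketch.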

[0137] In one possible implementation, the above fusion scheme can be ported to an embedded device (i.e., any device that needs to adopt the technical solution of this application); that is, a binocular ranging device based on multi-algorithm fusion is proposed. Figure 7 is a structural schematic of the device. The CPU, ROM, RAM, I/O interface, driver, and removable media are inherent components of the embedded device, and their functions are not limited. In addition, the device may further include: a neural network inference module, an algorithmic disparity estimation module, an image input module, and a disparity map output module. The image input module is used to acquire a first original image and a second original image corresponding to the target object; the neural network inference module is used to obtain a first disparity image based on the first and second original images using a deep learning algorithm; the algorithmic disparity estimation module is used to obtain a second disparity image based on the first and second original images using a target binocular algorithm; the disparity map output module is used to fuse the first and second disparity images to obtain a target disparity image, and generate a depth image corresponding to the target object based on the target disparity image.

[0138] Of course, in addition to the aforementioned neural network inference module, algorithm disparity estimation module, image input module, and disparity map output module, the device may also include: a storage module and a communication module. The storage module is used to store the first original image and the second original image into a specified storage medium, such as ROM or RAM, so that the neural network inference module and the algorithm disparity estimation module can read the first original image and the second original image from the specified storage medium. The communication module is used to provide depth images to external applications.

[0139] Of course, the above are just examples of the functions of each module, and this embodiment does not limit the functions of each module.
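The data flow between the modules described above can be sketched as a small pipeline. The class name, field names, and the stub functions standing in for the deep learning algorithm, the target binocular algorithm, the fusion step, and the depth conversion are all hypothetical; the sketch only shows how the outputs are wired together.

```python
from dataclasses import dataclass
from typing import Callable, List

Image = List[List[float]]


@dataclass
class BinocularRangingPipeline:
    neural_inference: Callable[[Image, Image], Image]     # first disparity image
    algorithm_disparity: Callable[[Image, Image], Image]  # second disparity image
    fuse: Callable[[Image, Image], Image]                 # target disparity image
    to_depth: Callable[[Image], Image]                    # depth image

    def run(self, first_original: Image, second_original: Image) -> Image:
        first_disparity = self.neural_inference(first_original, second_original)
        second_disparity = self.algorithm_disparity(first_original, second_original)
        target_disparity = self.fuse(first_disparity, second_disparity)
        return self.to_depth(target_disparity)


def stub_fuse(first: Image, second: Image) -> Image:
    """Stand-in fusion: keep first values that agree with the second within 1 pixel."""
    return [[a if abs(a - b) <= 1.0 else 0.0 for a, b in zip(r1, r2)]
            for r1, r2 in zip(first, second)]


def stub_to_depth(disparity: Image) -> Image:
    """Stand-in depth conversion with b * f fixed at 100 (hypothetical value)."""
    return [[100.0 / d if d > 0 else 0.0 for d in row] for row in disparity]


pipeline = BinocularRangingPipeline(
    neural_inference=lambda left, right: [[20.0, 50.0]],     # stub output
    algorithm_disparity=lambda left, right: [[20.0, 10.0]],  # stub output
    fuse=stub_fuse,
    to_depth=stub_to_depth,
)
depth_image = pipeline.run([[0.0, 0.0]], [[0.0, 0.0]])
```

Passing each stage in as a callable keeps the sketch independent of any particular network or binocular algorithm, in line with the statement that the functions of each module are not limited.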

[0140] As can be seen from the above technical solutions, in this embodiment, a first disparity image can be obtained using a deep learning algorithm, a second disparity image can be obtained using a target binocular algorithm, and the first and second disparity images can be fused to obtain a target disparity image. A depth image is generated based on the target disparity image, and the distance to the target object is determined based on the depth image, thus achieving binocular ranging. Fusing the disparity image from the deep learning algorithm with the disparity image from the target binocular algorithm yields a high-precision, noise-free target disparity image, so the distance to the target object can be measured accurately based on it. The above method can therefore be applied to scenarios with high requirements for disparity maps or depth maps, such as obstacle avoidance for unmanned vehicles, passenger flow statistics, and liveness detection schemes based on depth maps.

[0141] Based on the same concept as the above method, this application proposes an image processing apparatus. Figure 8 is a structural schematic of the image processing apparatus, which may include:

[0142] The acquisition module 81 is used to acquire the first original image and the second original image corresponding to the target object;

[0143] The processing module 82 is configured to obtain a first disparity image based on the first original image and the second original image using a deep learning algorithm, and to obtain a second disparity image based on the first original image and the second original image using a target binocular algorithm; and to fuse the first disparity image and the second disparity image to obtain a target disparity image corresponding to the target object.

[0144] The generation module 83 is used to generate a depth image corresponding to the target object based on the target disparity image.

[0145] In one possible implementation, when the processing module 82 fuses the first disparity image and the second disparity image to obtain a target disparity image corresponding to the target object, it specifically performs the following steps: based on the first disparity image and the second disparity image, it selects multiple candidate pixels from all pixels in the first disparity image; and on this basis, it generates the target disparity image based on the first disparity value corresponding to the multiple candidate pixels in the first disparity image.

[0146] For example, when the processing module 82 selects multiple candidate pixels from all pixels in the first disparity image based on the first disparity image and the second disparity image, it specifically performs the following steps: for each pixel in the first disparity image, it determines a variable threshold corresponding to the pixel based on the second disparity value corresponding to the pixel in the second disparity image; and it determines whether the pixel is a candidate pixel or not based on the first disparity value corresponding to the pixel in the first disparity image, the second disparity value corresponding to the pixel in the second disparity image, and the variable threshold corresponding to the pixel.

[0147] For example, when the processing module 82 determines whether a pixel is a candidate pixel or not based on the first disparity value corresponding to the pixel in the first disparity image, the second disparity value corresponding to the pixel in the second disparity image, and the variable threshold corresponding to the pixel, it specifically performs the following: if the absolute value of the difference between the first disparity value and the second disparity value is greater than the variable threshold, then the pixel is determined not to be a candidate pixel; or, if the absolute value of the difference between the first disparity value and the second disparity value is not greater than the variable threshold, then the pixel is determined to be a candidate pixel.
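The candidate test in paragraphs [0146] and [0147] can be illustrated with a minimal sketch. This passage does not say how the variable threshold is derived from the second disparity value, so the linear form `base + ratio * |d2|` and its default values are assumptions made only for illustration. The function names are hypothetical, plain Python lists stand in for disparity images, and marking non-candidate pixels as `None` in the output is likewise an assumption.

```python
from typing import List, Optional

DisparityImage = List[List[float]]


def variable_threshold(d2: float, base: float = 1.0, ratio: float = 0.05) -> float:
    """Hypothetical per-pixel threshold that grows with the second disparity value."""
    return base + ratio * abs(d2)


def select_candidates(first: DisparityImage, second: DisparityImage) -> List[List[bool]]:
    """A pixel is a candidate when |d1 - d2| is not greater than its variable threshold."""
    mask = []
    for row1, row2 in zip(first, second):
        mask.append([abs(d1 - d2) <= variable_threshold(d2)
                     for d1, d2 in zip(row1, row2)])
    return mask


def build_target(first: DisparityImage,
                 mask: List[List[bool]]) -> List[List[Optional[float]]]:
    """Keep the first disparity value at candidate pixels; mark the rest as None (assumption)."""
    return [[d1 if keep else None for d1, keep in zip(row1, row_mask)]
            for row1, row_mask in zip(first, mask)]


if __name__ == "__main__":
    first = [[10.0, 20.0], [30.0, 5.0]]    # e.g. deep learning output
    second = [[10.5, 25.0], [30.0, 8.0]]   # e.g. target binocular algorithm output
    mask = select_candidates(first, second)
    print(mask)
    print(build_target(first, mask))
```

In this sketch the threshold is looser for large disparities (near objects) and tighter for small ones; whether the application uses that direction is not stated in this passage.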

[0148] For example, when the processing module 82 selects multiple candidate pixels from all pixels in the first disparity image based on the first disparity image and the second disparity image, it specifically performs the following steps: obtaining a first feature corresponding to the first disparity image, a second feature corresponding to the second disparity image, and a third feature corresponding to a reference image, wherein the reference image is a first original image or a second original image; fusing the first feature, the second feature, and the third feature to obtain a target feature; inputting the target feature into a first network model to obtain a confidence map corresponding to the target feature, and selecting multiple candidate pixels from all pixels in the first disparity image based on the confidence map.

[0149] For example, the confidence map includes a confidence value corresponding to each pixel in the first disparity image. When the processing module 82 selects multiple candidate pixels from all pixels in the first disparity image based on the confidence map, it specifically performs the following: if the confidence value of the pixel in the confidence map is greater than a preset threshold, then the pixel is determined to be a candidate pixel; if the confidence value of the pixel in the confidence map is not greater than the preset threshold, then the pixel is determined not to be a candidate pixel.
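The confidence-based selection in paragraph [0149] reduces to a per-pixel comparison, sketched below. The first network model that produces the confidence map is not modeled here; the map is simply given as a nested list. The function name and the preset threshold value of 0.5 are assumptions for illustration.

```python
from typing import List


def candidates_from_confidence(confidence: List[List[float]],
                               preset_threshold: float = 0.5) -> List[List[bool]]:
    """A pixel is a candidate only when its confidence value is greater than the preset threshold."""
    return [[value > preset_threshold for value in row] for row in confidence]


if __name__ == "__main__":
    confidence_map = [[0.9, 0.5], [0.2, 0.51]]
    print(candidates_from_confidence(confidence_map))
```

Note that a confidence value exactly equal to the threshold is "not greater than" it, so that pixel is not a candidate.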

[0150] For example, when the processing module 82 fuses the first disparity image and the second disparity image to obtain a target disparity image corresponding to the target object, it specifically performs the following steps: obtaining a first feature corresponding to the first disparity image, a second feature corresponding to the second disparity image, and a third feature corresponding to a reference image, wherein the reference image is the first original image or the second original image; fusing the first feature, the second feature, and the third feature to obtain a target feature; and inputting the target feature into a trained second network model to obtain a target disparity image corresponding to the target feature, wherein the target disparity image includes disparity values of multiple pixels.

[0151] Based on the same concept as the above method, this application proposes an image processing device. As shown in Figure 9, the image processing device may include a processor 91 and a machine-readable storage medium 92, the machine-readable storage medium 92 storing machine-executable instructions that can be executed by the processor 91; the processor 91 is used to execute the machine-executable instructions to implement the image processing method disclosed in the above examples of this application.

[0152] Based on the same concept as the above method, this application also provides a machine-readable storage medium storing a plurality of computer instructions, which, when executed by a processor, can implement the image processing method disclosed in the above examples of this application.

[0153] The aforementioned machine-readable storage medium can be any electronic, magnetic, optical, or other physical storage device that can contain or store information, such as executable instructions, data, etc. For example, machine-readable storage media can be: RAM (Random Access Memory), volatile memory, non-volatile memory, flash memory, storage drives (such as hard disk drives), solid-state drives, any type of storage disk (such as optical discs, DVDs, etc.), or similar storage media, or combinations thereof.

[0154] Based on the same concept as the above method, this application also provides an autonomous driving device, which can be a mobile robot, an autonomous vehicle, a ground cleaning device, or an unmanned obstacle avoidance vehicle, etc. The autonomous driving device can include a first camera and a second camera, wherein:

[0155] The first camera captures a first original image of the target object, and the second camera captures a second original image of the target object.

[0156] Based on the first original image and the second original image, a first disparity image is obtained by using a deep learning algorithm. Based on the first original image and the second original image, a second disparity image is obtained by using a target binocular algorithm. The first disparity image and the second disparity image are then fused to obtain a target disparity image.

[0157] A depth image corresponding to the target object is generated based on the target disparity image, and the distance between the target object and the autonomous driving device is determined based on the depth image, so that obstacle avoidance can be performed.
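The conversion from a disparity image to a depth image and a distance can be sketched as follows. This passage does not give the conversion formula; the sketch assumes the standard relation for a rectified binocular pair, depth = focal length × baseline / disparity. The function names, the parameters `focal_px` and `baseline_m`, and the use of `None` for pixels without a valid disparity are all assumptions for illustration.

```python
from typing import List, Optional

OptionalImage = List[List[Optional[float]]]


def disparity_to_depth(disparity: OptionalImage,
                       focal_px: float,
                       baseline_m: float) -> OptionalImage:
    """Convert disparity (pixels) to depth (meters) with Z = f * B / d (assumed relation)."""
    depth = []
    for row in disparity:
        depth.append([focal_px * baseline_m / d if d is not None and d > 0 else None
                      for d in row])
    return depth


def nearest_distance(depth: OptionalImage) -> Optional[float]:
    """Smallest valid depth in the image, or None when no pixel is valid."""
    valid = [z for row in depth for z in row if z is not None]
    return min(valid) if valid else None


if __name__ == "__main__":
    target_disparity = [[50.0, None], [0.0, 25.0]]
    depth_image = disparity_to_depth(target_disparity, focal_px=500.0, baseline_m=0.1)
    print(depth_image)
    print(nearest_distance(depth_image))
```

An obstacle avoidance routine could compare the nearest distance with a safety margin; how the application selects the pixels belonging to the target object is not described in this passage.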

[0158] The systems, devices, modules, or units described in the above embodiments can be implemented by a computer or entity, or by a product with a certain function. A typical implementation device is a computer, which can be a personal computer, laptop computer, cellular phone, camera phone, smartphone, personal digital assistant, media player, navigation device, email sending and receiving device, game console, tablet computer, wearable device, or any combination of these devices.

[0159] For ease of description, the above devices are described separately by function as various units. Of course, in implementing this application, the functions of each unit can be implemented in one or more pieces of software and/or hardware.

[0160] Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, embodiments of this application can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0161] This application is described with reference to flowcharts and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It will be understood that each process and/or block of the flowcharts and/or block diagrams, and combinations of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, produce a device for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

[0162] Furthermore, these computer program instructions can also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means, which implement the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

[0163] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

[0164] The above description is merely an embodiment of this application and is not intended to limit the scope of this application. Various modifications and variations can be made to this application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the scope of the claims of this application.

Claims

1. An image processing method, characterized in that the method comprises: obtaining a first original image and a second original image corresponding to a target object; obtaining a first disparity image by using a deep learning algorithm based on the first original image and the second original image; obtaining a second disparity image by using a target binocular algorithm based on the first original image and the second original image; fusing the first disparity image and the second disparity image to obtain a target disparity image; and generating a depth image corresponding to the target object based on the target disparity image; wherein fusing the first disparity image and the second disparity image to obtain the target disparity image comprises: selecting multiple candidate pixels from all pixels in the first disparity image based on the first disparity image and the second disparity image; wherein, for each pixel in the first disparity image, a variable threshold corresponding to the pixel is determined based on the second disparity value corresponding to the pixel in the second disparity image, and the pixel is determined to be a candidate pixel or not a candidate pixel based on the first disparity value corresponding to the pixel in the first disparity image, the second disparity value corresponding to the pixel in the second disparity image, and the variable threshold corresponding to the pixel; or, a first feature corresponding to the first disparity image, a second feature corresponding to the second disparity image, and a third feature corresponding to a reference image are obtained, wherein the reference image is the first original image or the second original image; the first feature, the second feature, and the third feature are fused to obtain a target feature; the target feature is input into a first network model to obtain a confidence map corresponding to the target feature, and multiple candidate pixels are selected from all pixels in the first disparity image based on the confidence map; and generating the target disparity image based on the first disparity values corresponding to the multiple candidate pixels in the first disparity image.

2. The method according to claim 1, characterized in that the step of determining whether a pixel is a candidate pixel or not based on the first disparity value corresponding to the pixel in the first disparity image, the second disparity value corresponding to the pixel in the second disparity image, and the variable threshold corresponding to the pixel comprises: if the absolute value of the difference between the first disparity value and the second disparity value is greater than the variable threshold, determining that the pixel is not a candidate pixel; or, if the absolute value of the difference between the first disparity value and the second disparity value is not greater than the variable threshold, determining that the pixel is a candidate pixel.

3. The method according to claim 1, characterized in that the confidence map includes a confidence value corresponding to each pixel in the first disparity image, and the step of selecting multiple candidate pixels from all pixels in the first disparity image based on the confidence map comprises: for each pixel in the first disparity image, if the confidence value corresponding to the pixel in the confidence map is greater than a preset threshold, determining that the pixel is a candidate pixel; if the confidence value corresponding to the pixel in the confidence map is not greater than the preset threshold, determining that the pixel is not a candidate pixel.

4. The method according to any one of claims 1 to 3, characterized in that the method is applied to a device that supports binocular ranging, the device including a first camera and a second camera; and obtaining the first original image and the second original image corresponding to the target object comprises: acquiring the first original image corresponding to the target object by the first camera; and acquiring the second original image corresponding to the target object by the second camera.

5. The method according to claim 4, characterized in that, after generating the depth image corresponding to the target object based on the target disparity image, the method further comprises: determining the distance between the target object and the first camera based on the depth image; or, determining the distance between the target object and the second camera based on the depth image.

6. An image processing apparatus, characterized in that the apparatus comprises: an acquisition module, configured to acquire a first original image and a second original image corresponding to a target object; a processing module, configured to obtain a first disparity image based on the first original image and the second original image using a deep learning algorithm, to obtain a second disparity image based on the first original image and the second original image using a target binocular algorithm, and to fuse the first disparity image and the second disparity image to obtain a target disparity image corresponding to the target object; and a generation module, configured to generate a depth image corresponding to the target object based on the target disparity image; wherein, when fusing the first disparity image and the second disparity image to obtain the target disparity image corresponding to the target object, the processing module is specifically configured to: select multiple candidate pixels from all pixels in the first disparity image based on the first disparity image and the second disparity image; wherein, for each pixel in the first disparity image, a variable threshold corresponding to the pixel is determined based on the second disparity value corresponding to the pixel in the second disparity image, and the pixel is determined to be a candidate pixel or not a candidate pixel based on the first disparity value corresponding to the pixel in the first disparity image, the second disparity value corresponding to the pixel in the second disparity image, and the variable threshold corresponding to the pixel; or, a first feature corresponding to the first disparity image, a second feature corresponding to the second disparity image, and a third feature corresponding to a reference image are obtained, wherein the reference image is the first original image or the second original image; the first feature, the second feature, and the third feature are fused to obtain a target feature; the target feature is input into a first network model to obtain a confidence map corresponding to the target feature, and multiple candidate pixels are selected from all pixels in the first disparity image based on the confidence map; and generate the target disparity image based on the first disparity values corresponding to the multiple candidate pixels in the first disparity image.

7. An autonomous driving device, characterized by comprising a first camera and a second camera, wherein: the first camera captures a first original image of a target object, and the second camera captures a second original image of the target object; based on the first original image and the second original image, a first disparity image is obtained by using a deep learning algorithm; based on the first original image and the second original image, a second disparity image is obtained by using a target binocular algorithm; the first disparity image and the second disparity image are fused to obtain a target disparity image; a depth image corresponding to the target object is generated based on the target disparity image, and the distance between the target object and the autonomous driving device is determined based on the depth image; wherein fusing the first disparity image and the second disparity image to obtain the target disparity image comprises: selecting multiple candidate pixels from all pixels in the first disparity image based on the first disparity image and the second disparity image; wherein, for each pixel in the first disparity image, a variable threshold corresponding to the pixel is determined based on the second disparity value corresponding to the pixel in the second disparity image, and the pixel is determined to be a candidate pixel or not a candidate pixel based on the first disparity value corresponding to the pixel in the first disparity image, the second disparity value corresponding to the pixel in the second disparity image, and the variable threshold corresponding to the pixel; or, a first feature corresponding to the first disparity image, a second feature corresponding to the second disparity image, and a third feature corresponding to a reference image are obtained, wherein the reference image is the first original image or the second original image; the first feature, the second feature, and the third feature are fused to obtain a target feature; the target feature is input into a first network model to obtain a confidence map corresponding to the target feature, and multiple candidate pixels are selected from all pixels in the first disparity image based on the confidence map; and generating the target disparity image based on the first disparity values corresponding to the multiple candidate pixels in the first disparity image.

8. The autonomous driving device according to claim 7, characterized in that the autonomous driving device is a mobile robot, an autonomous vehicle, or a ground cleaning device.