Depth image acquisition method, device and storage medium
By constructing a binocular depth model using 2D convolution operators and combining data conversion and post-processing algorithms, the problem of electronic devices not being able to support 3D convolution models is solved, enabling 3D depth feature extraction and physically reasonable depth distribution on low-cost hardware.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SPREADTRUM COMMUNICATION (SHANGHAI) CO LTD
- Filing Date
- 2022-12-30
- Publication Date
- 2026-06-16
AI Technical Summary
Some electronic devices cannot support 3D convolution models, making it difficult to achieve 3D feature extraction. Furthermore, pseudo-3D convolution models still rely on 3D convolution operators, which are costly.
A binocular depth model is constructed using 2D convolution operators. The binocular images are converted from five-dimensional data to four-dimensional data through a data conversion module. 3D depth features are extracted by combining 2D and 1D convolution modules. Post-processing algorithms are used to correct the target instance.
Achieving 3D depth feature extraction without relying on high-performance hardware reduces hardware costs, and ensures that the depth distribution conforms to physical reality through target instance correction.
Figure CN116012431B_ABST
Description
Technical Field
[0001] This invention relates to the field of image processing technology, and in particular to a method, device and storage medium for acquiring depth images.
Background Technology
[0002] With the rapid development of Artificial Intelligence (AI) and related hardware, AI technology is increasingly being applied to electronic devices. Especially in computer vision, such as binocular vision and medical imaging, AI technology has wide applications. Because 3D convolution operators have good feature extraction performance, 3D convolution models can be used to extract features in computer vision. However, due to hardware limitations, some electronic devices may not support 3D convolution models. Therefore, how to achieve 3D feature extraction when electronic device hardware cannot support 3D convolution models has become a problem that needs to be solved.
Summary of the Invention
[0003] In view of this, embodiments of the present invention provide a depth image acquisition method, device and storage medium. This solution achieves the effect of acquiring 3D depth images by constructing a binocular depth model using 2D convolution operators, thereby reducing the requirements for electronic device hardware.
[0004] In a first aspect, embodiments of the present invention provide a depth image acquisition method, including:
[0005] Acquire binocular images;
[0006] The binocular image is input into a binocular depth model, which is constructed based on a 2D convolution operator. The binocular depth model is used to extract 3D depth features of the binocular image based on the 2D convolution operator.
[0007] Obtain the target instance from the binocular image;
[0008] The 3D depth features output by the binocular depth model are corrected based on the target instance to obtain a depth image.
[0009] Optionally, the binocular depth model includes:
[0010] The first data conversion module is used to compress the depth dimension of the stereo image to convert the stereo image from five-dimensional data to four-dimensional data and input it into the 2D convolution module;
[0011] The 2D convolution module is used to extract image features of the input data from the height and width dimensions to obtain four-dimensional output features.
[0012] The second data conversion module is used to convert the four-dimensional output features into feature data containing depth dimensions and input them into the 1D convolution module.
[0013] The 1D convolution module is used to extract image features from the input data from the depth dimension to obtain 3D depth features.
[0014] Optionally, the first data conversion module is used to compress the depth dimension of the stereo image to convert the stereo image from five-dimensional data to four-dimensional data, including:
[0015] The depth dimension of the stereo image is compressed to a batch dimension to convert the stereo image from five-dimensional data (N, C, D, H, W) to four-dimensional data (N*D, C, H, W); or,
[0016] The depth dimension of the stereo image is compressed to the channel dimension to convert the stereo image from five-dimensional data (N, C, D, H, W) to four-dimensional data (N, C*D, H, W);
[0017] Where N is the batch dimension, C is the channel dimension, D is the depth dimension, H is the height dimension, and W is the width dimension.
[0018] Optionally, if the four-dimensional data converted by the first data conversion module is (N*D, C, H, W), then the data dimensions of the input and output of the 2D convolution module are both (N*D, C, H, W), and the second data conversion module is used to convert the data dimensions (N*D, C, H, W) into (N, C, D, H*W).
[0019] If the four-dimensional data converted by the first data conversion module is (N, C*D, H, W), then the data dimension input to the 2D convolution module is (N, C_in*D, H, W), the data dimension output by the 2D convolution module is (N, C_out1*D, H, W), and the second data conversion module is used to convert (N, C_out1*D, H, W) into (N, C_out1, D, H*W).
[0020] Optionally, a 2D batch normalization function and an activation function are connected in series between the 2D convolution module and the second data conversion module.
[0021] Optionally, the 1D convolution module is followed by a 1D batch normalization function and an activation function.
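The shape bookkeeping of the two optional conversion paths above can be traced with plain tuples. The following is only an illustrative sketch: the function names `trace_batch_variant` and `trace_channel_variant` and the example sizes are assumptions introduced here, and the 2D convolution is assumed to preserve H and W, as the stated dimensions imply.

```python
def trace_batch_variant(n, c, d, h, w):
    """Shapes when the depth dimension is compressed into the batch dimension."""
    return {
        "input_5d": (n, c, d, h, w),
        "after_first_conversion": (n * d, c, h, w),
        # The text states that the 2D module input and output shapes are identical.
        "after_2d_conv": (n * d, c, h, w),
        "after_second_conversion": (n, c, d, h * w),
    }


def trace_channel_variant(n, c_in, c_out1, d, h, w):
    """Shapes when the depth dimension is compressed into the channel dimension."""
    return {
        "input_5d": (n, c_in, d, h, w),
        "after_first_conversion": (n, c_in * d, h, w),
        "after_2d_conv": (n, c_out1 * d, h, w),
        "after_second_conversion": (n, c_out1, d, h * w),
    }


if __name__ == "__main__":
    for stage, shape in trace_batch_variant(2, 3, 4, 8, 8).items():
        print(stage, shape)
    for stage, shape in trace_channel_variant(2, 3, 5, 4, 8, 8).items():
        print(stage, shape)
```

In both paths the depth dimension D reappears as its own axis only after the second conversion, which is what allows the 1D convolution module to operate along it.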
[0022] Optionally, the binocular image includes a left image and a right image, and before inputting the binocular image into the binocular depth model, the method further includes:
[0023] Distortion correction and stereo correction are performed on the left and right images.
[0024] Optionally, the binocular image includes a left image and a right image, and obtaining the target instance in the binocular image includes:
[0025] The left image is segmented to obtain the target instance.
[0026] In a second aspect, embodiments of the present invention provide a depth image acquisition device, comprising:
[0027] An acquisition module, used to acquire binocular images;
[0028] A depth feature extraction module, used to input the binocular image into a binocular depth model, wherein the binocular depth model is constructed based on a 2D convolution operator and is used to extract 3D depth features of the binocular image based on the 2D convolution operator;
[0029] An instance segmentation module, used to obtain target instances in the binocular image;
[0030] The post-processing module is used to correct the 3D depth features output by the binocular depth model according to the target instance to obtain a depth image.
[0031] In a third aspect, embodiments of the present invention provide an electronic device, including: at least one processor; and at least one memory communicatively connected to the processor, wherein the memory stores program instructions executable by the processor, and the processor invokes the program instructions to perform the method described in the first aspect or any optional implementation of the first aspect.
[0032] In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium comprising a stored program, wherein, when the program is executed, it controls the device where the computer-readable storage medium is located to perform the method described in the first aspect or any optional implementation of the first aspect.
[0033] The present invention provides a solution for constructing a binocular depth model using 2D convolution operators. This binocular depth model can extract 3D depth features from binocular images based on 2D convolution operators, without relying on high-performance hardware or 3D convolution operators, and supports deployment on various electronic devices. Furthermore, the present invention enhances the processing of depth at target instances. By analyzing the distribution of target instances, the 3D depth features can be corrected to avoid inconsistencies between the depth distribution of target instances and the depth distribution of actual objects, thus making the depth distribution at target instances more physically consistent.
Attached Figure Description
[0034] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0035] Figure 1 is a flowchart of a depth image acquisition method provided in an embodiment of the present invention;
[0036] Figure 2 is a flowchart of another depth image acquisition method provided in an embodiment of the present invention;
[0037] Figure 3 is a schematic diagram of the structure of a binocular depth-of-field model provided in an embodiment of the present invention;
[0038] Figure 4 is a schematic diagram of a depth image acquisition device provided in an embodiment of the present invention;
[0039] Figure 5 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present invention.
Detailed Implementation
[0040] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0041] 1D convolution (one-dimensional convolution) refers to convolution operations in one dimension.
[0042] 2D convolution (two-dimensional convolution) refers to convolution operations in two dimensions.
[0043] 3D convolution (three-dimensional convolution) refers to convolution operations in three dimensions.
[0044] Depth of field refers to the range of distances in front of and behind the focal point within which, after focusing, the lens of a camera or other imaging device captures a sharp image. Aperture, lens, and the distance from the focal plane to the subject are important factors affecting depth of field.
[0045] Binocular depth of field refers to the determination of depth features using binocular images captured by a binocular camera.
[0046] The binocular depth-of-field systems in related technologies include: (1) Vehicle-mounted binocular depth-of-field systems. These are installed in vehicle-mounted devices and are not limited by hardware size, so they can directly use high-performance GPU hardware to run 3D convolution operators. Although such a system supports 3D convolution operators, its hardware cost is high and its size is large, making it unsuitable for electronic devices such as mobile phones and smartwatches. (2) Binocular depth-of-field systems based on pseudo-3D convolution models. 3D convolution models extract depth features well, but they have high hardware performance requirements, and some electronic devices cannot support them. To achieve the feature extraction effect of 3D convolution models on such devices, pseudo-3D convolution models have been proposed in related technologies. Specifically, a 3D convolution model usually uses a 3D convolution operator of size f_d x f_h x f_w to extract depth features; for example, a 3x3x3 3D convolution operator. A pseudo-3D convolution model splits the f_d x f_h x f_w 3D convolution into a 1 x f_h x f_w 2D convolution and an f_d x 1 x 1 1D convolution, realizing the 2D and 1D computations by changing the dimensions of the 3D convolution operator. For example, a 3x3x3 3D convolution operator can be split into a 1x3x3 2D convolution and a 3x1x1 1D convolution. Although the pseudo-3D convolution model formally splits the 3D convolution operator into 2D and 1D convolutions, the 1x3x3 2D convolution and the 3x1x1 1D convolution are essentially still 3D convolution operators. Therefore, the pseudo-3D convolution model still relies on 3D convolution operators.
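The pseudo-3D split described above can be made concrete by writing out the kernel shapes. The sketch below is illustrative only: the function names are assumptions introduced here, and the weight counts are for a single input/output channel pair with no bias.

```python
def pseudo3d_kernel_shapes(f_d, f_h, f_w):
    """Kernel shapes produced by splitting an f_d x f_h x f_w 3D convolution."""
    spatial_kernel = (1, f_h, f_w)   # the "2D" part, still stored with three kernel axes
    depth_kernel = (f_d, 1, 1)       # the "1D" part, still stored with three kernel axes
    return spatial_kernel, depth_kernel


def weight_count(shape):
    """Number of weights in one kernel of the given shape."""
    count = 1
    for size in shape:
        count *= size
    return count


spatial, depth = pseudo3d_kernel_shapes(3, 3, 3)
full_weights = weight_count((3, 3, 3))
split_weights = weight_count(spatial) + weight_count(depth)
print(spatial, depth)               # prints "(1, 3, 3) (3, 1, 1)"
print(full_weights, split_weights)  # prints "27 12"
```

Both kernels of the split form keep three kernel axes, which is why executing them still requires a 3D convolution operator even though fewer weights are involved.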
[0047] In view of the aforementioned problems of binocular depth models in related technologies, embodiments of the present invention provide a solution for depth image acquisition based on software algorithms. This solution constructs a binocular depth model based on 2D convolution operators, achieving the effect of 3D depth feature extraction without relying on 3D convolution. Furthermore, this solution eliminates the need for expensive additional hardware, saving costs. In addition, this solution enhances the depth processing at target instances (such as vehicles, people, trees, etc. in the image) to make the depth distribution at target instances more physically consistent.
[0048] Figure 1 is a flowchart of a depth image acquisition method provided in an embodiment of the present invention. As shown in Figure 1, the processing steps of this method include:
[0049] 101. Acquire binocular images. Optionally, the binocular images, including a left image and a right image, can be acquired using a binocular camera.
[0050] 102. Input the binocular image into a binocular depth model, which is constructed based on a 2D convolution operator. The binocular depth model is used to extract 3D depth features from the binocular image based on the 2D convolution operator. Optionally, this may include: the binocular depth model performs convolution operations on the three dimensions of the binocular image based on the 2D convolution operator to extract the 3D depth features.
[0051] 103. Obtain the target instance in the binocular image. Optionally, an instance segmentation algorithm can be used to obtain the target instance. Optionally, the instance segmentation algorithm can be applied to one side of the binocular image; for example, instance segmentation can be performed on either the left image or the right image. Optionally, the target instance can be an instance contained in, or of interest in, the binocular image, such as a person, vehicle, animal, or plant.
[0052] 104. Based on the target instance, the 3D depth features output by the binocular depth model are corrected to obtain a depth image. This step uses an image post-processing algorithm to optimize and correct the 3D depth features using the segmented target instance, avoiding inconsistencies between the depth within the same instance and the actual distribution. For example, if the segmented target instance includes a bicycle, the 3D depth features of the bicycle can have missing information filled in, or be adjusted for depth consistency, based on the identified bicycle instance, so that the depth distribution of the bicycle more closely matches the actual physical distribution.
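The description does not specify the correction algorithm used in step 104. As one possible reading of "filling missing information" within an instance, the sketch below fills invalid depth pixels inside an instance mask with the median of the valid depths of that instance. The function name `fill_instance_depth`, the convention that 0 marks a missing depth value, and the choice of the median are all assumptions made for illustration.

```python
import numpy as np


def fill_instance_depth(depth, instance_mask, invalid_value=0.0):
    """Fill invalid depth pixels inside one instance with the instance's median valid depth.

    depth: (H, W) float array. instance_mask: (H, W) boolean array.
    The invalid-value convention and the median fill are illustrative assumptions.
    """
    corrected = depth.copy()
    valid = instance_mask & (depth != invalid_value)
    holes = instance_mask & (depth == invalid_value)
    if valid.any():
        corrected[holes] = np.median(depth[valid])
    return corrected


# Toy example: the instance covers the two left columns and has two missing pixels.
depth = np.array([[2.0, 2.0, 9.0],
                  [0.0, 2.2, 9.0],
                  [2.1, 0.0, 9.0]])
mask = np.array([[True, True, False],
                 [True, True, False],
                 [True, True, False]])
corrected = fill_instance_depth(depth, mask)
print(corrected)
```

Pixels outside the mask are left untouched, so the correction stays local to the segmented instance.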
[0053] In this embodiment of the invention, a binocular depth model is constructed using 2D convolution operators. This binocular depth model can extract 3D depth features from binocular images based on 2D convolution operators, without relying on high-performance hardware or 3D convolution operators, and supports deployment on various electronic devices. Furthermore, this embodiment of the invention enhances the processing of depth at target instances. By analyzing the distribution of target instances, the 3D depth features can be corrected to avoid inconsistencies between the depth at the same instance and the actual depth distribution of objects, thereby making the depth distribution at target instances more physically consistent.
[0054] Figure 2 is a flowchart of another depth image acquisition method provided by an embodiment of the present invention. As shown in Figure 2, the method includes:
[0055] 201. Obtain the stereo image, which includes the left image and the right image.
[0056] 202. Perform instance segmentation on the left image to obtain the target instance.
[0057] 203. Perform distortion correction and stereo correction on the left and right images. Since images captured by a camera may have some distortion, and the imaging planes of the left and right cameras may not lie in the same plane, this embodiment of the invention performs distortion correction and stereo correction on the left image captured by the left camera and the right image captured by the right camera. This corrects the distortion caused by the camera lens itself and aligns the imaging planes of the left and right images in the same plane, facilitating subsequent depth feature extraction.
[0058] 204. Input the left and right images, after distortion correction and stereo correction, into the binocular depth model. This binocular depth model is constructed based on 2D convolution operators and extracts 3D depth features from the input images using the 2D convolution operators.
[0059] 205. Perform image post-processing on the target instance output in step 202 and the 3D depth features output in step 204 to output a depth image of the binocular image. Image post-processing refers to rendering techniques used to improve the quality of image presentation. Algorithms that can be used for image post-processing include, for example, image color filtering algorithms and blurring algorithms. In this embodiment of the invention, the target instance and depth features are combined through a post-processing algorithm. The post-processing algorithm can correct the depth at the target instance, avoiding inconsistencies between the depth distribution at the target instance and the actual depth distribution of the object. This embodiment of the invention uses a post-processing algorithm to combine the target instance and depth features, which can also reduce model maintenance costs. Specifically, if problems occur in instance segmentation, the binocular depth model, or the post-processing algorithm, the corresponding modules can be optimized specifically according to the problems, without needing to optimize the entire system, thereby saving model maintenance costs.
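The maintenance argument above rests on the stages being independent of one another. A minimal sketch of that structure, assuming each stage is passed in as a plain callable, is shown below. The function `acquire_depth_image` and all parameter names are hypothetical, and the stand-in stages do no real image processing.

```python
from typing import Any, Callable, Tuple


def acquire_depth_image(
    left: Any,
    right: Any,
    segment: Callable[[Any], Any],
    rectify: Callable[[Any, Any], Tuple[Any, Any]],
    depth_model: Callable[[Any, Any], Any],
    post_process: Callable[[Any, Any], Any],
) -> Any:
    """Run steps 202 to 205 with swappable stage functions (all names hypothetical)."""
    instances = segment(left)                       # step 202: instance segmentation
    left_rect, right_rect = rectify(left, right)    # step 203: distortion and stereo correction
    features = depth_model(left_rect, right_rect)   # step 204: 3D depth feature extraction
    return post_process(instances, features)        # step 205: instance-based correction


# Stand-in stages that only record the call order.
calls = []


def fake_segment(image):
    calls.append("segment")
    return ["instance"]


def fake_rectify(left_image, right_image):
    calls.append("rectify")
    return left_image, right_image


def fake_depth_model(left_image, right_image):
    calls.append("depth_model")
    return "features"


def fake_post_process(instances, features):
    calls.append("post_process")
    return instances, features


result = acquire_depth_image("L", "R", fake_segment, fake_rectify,
                             fake_depth_model, fake_post_process)
print(calls)
```

Because each stage is only reached through its callable, replacing the segmentation or post-processing stage does not require touching the depth model.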
[0060] Figure 3 is a schematic diagram of the structure of a binocular depth-of-field model provided in an embodiment of the present invention. The binocular depth-of-field model of this embodiment is constructed using a 2D convolution operator and has the function of extracting 3D depth features. As shown in Figure 3, the binocular depth-of-field model includes: a first data conversion module, a 2D convolution module, a second data conversion module, and a 1D convolution module. Optionally, a 2D batch normalization function and an activation function are connected in series between the 2D convolution module and the second data conversion module. A 1D batch normalization function and an activation function are then connected in series after the 1D convolution module.
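For readers unfamiliar with the two functions placed after the 2D convolution module, the sketch below shows a 2D batch normalization followed by a leaky ReLU in NumPy. It is a simplified illustration under stated assumptions: statistics are computed from the current batch, the learnable scale and shift are omitted, and the negative slope of 0.01 is an assumed value that the description does not give.

```python
import numpy as np


def batch_norm_2d(x, eps=1e-5):
    """Normalize each channel of an (N, C, H, W) array over the N, H and W axes.

    Learnable scale and shift are omitted (equivalent to scale 1 and shift 0).
    """
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def leaky_relu(x, negative_slope=0.01):
    """Keep non-negative values and scale negative values by negative_slope (assumed 0.01)."""
    return np.where(x >= 0, x, negative_slope * x)


rng = np.random.default_rng(0)
features = rng.normal(size=(8, 3, 4, 4))   # for example (N*D, C, H, W) with N*D = 8
activated = leaky_relu(batch_norm_2d(features))
print(activated.shape)
```

Neither function changes the data dimensions, so the output of this stage can be passed to the second data conversion module unchanged in shape.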
[0061] In some embodiments, the image data suitable for 3D convolution operators is typically five-dimensional data. Optionally, the five-dimensional data is typically represented as (N, C, D, H, W), where N is the batch dimension, C is the channel dimension, D is the depth dimension, H is the height dimension, and W is the width dimension. The binocular depth model of this embodiment is constructed based on a 2D convolution operator, and the image data suitable for the 2D convolution operator is four-dimensional data. Therefore, the binocular depth model of this embodiment includes a first data conversion module. This first data conversion module is used to convert the binocular image from five-dimensional data to four-dimensional data. Specifically, the first data conversion module of this embodiment is used to compress the depth dimension of the binocular image, thereby converting the binocular image from five-dimensional data to four-dimensional data.
[0062] In some embodiments, the first data conversion module compresses the depth dimension of the stereo image by:
[0063] Method 1 involves compressing the depth dimension of the stereo image to the batch dimension, thereby converting the stereo image from five-dimensional data (N, C, D, H, W) to four-dimensional data (N*D, C, H, W).
[0064] Method 2 involves compressing the depth dimension of the stereo image to the channel dimension, thereby converting the stereo image from five-dimensional data (N, C, D, H, W) to four-dimensional data (N, C*D, H, W).
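The two compression methods can be sketched as array reshapes. The description only specifies the resulting shapes, so the ordering of elements inside the merged axis (sample-major for Method 1, channel-major for Method 2) and the function names `depth_to_batch` and `depth_to_channel` are assumptions made for this example.

```python
import numpy as np


def depth_to_batch(x):
    """Method 1: (N, C, D, H, W) -> (N*D, C, H, W); each depth slice becomes a batch item."""
    n, c, d, h, w = x.shape
    # Move D next to N before merging, so that item n*D + k holds depth slice k of sample n.
    return x.transpose(0, 2, 1, 3, 4).reshape(n * d, c, h, w)


def depth_to_channel(x):
    """Method 2: (N, C, D, H, W) -> (N, C*D, H, W); depth slices are stacked as channels."""
    n, c, d, h, w = x.shape
    # C and D are already adjacent, so channel c*D + k holds depth slice k of channel c.
    return x.reshape(n, c * d, h, w)


x = np.arange(2 * 3 * 4 * 5 * 6).reshape(2, 3, 4, 5, 6)
as_batch = depth_to_batch(x)
as_channel = depth_to_channel(x)
print(as_batch.shape, as_channel.shape)   # prints "(8, 3, 5, 6) (2, 12, 5, 6)"
```

Method 1 needs a transpose before the reshape because N and D are not adjacent in (N, C, D, H, W); a plain reshape to (N*D, C, H, W) would interleave channel and depth data.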
[0065] In Method 1 above, the batch dimension data does not participate in the convolution operation. Therefore, compressing the depth dimension of the stereo image to the batch dimension will not affect the convolution result. The first data conversion module converts the stereo image from five-dimensional data (N, C, D, H, W) to four-dimensional data (N*D, C, H, W), and then inputs the four-dimensional data (N*D, C, H, W) into the 2D convolution module. The 2D convolution module contains several 2D convolution operators used to extract image features from the input data along the height and width dimensions to obtain four-dimensional output features. Optionally, the 2D convolution operator can be represented as f_h x f_w. Optionally, if the binocular depth-of-field model adopted the above pseudo-3D convolution model, the 3D convolution operator used would be expressed as 1 x f_h x f_w. The size of the 2D convolution operator in this embodiment of the invention can be determined based on the height and width of that 3D convolution operator in the pseudo-3D convolution model.
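The statement that the batch dimension does not participate in the convolution can be checked numerically. The sketch below uses a naive NumPy 2D convolution (cross-correlation with zero padding, an assumption chosen so that H and W are preserved) and compares one pass over the folded (N*D, C, H, W) data with a slice-by-slice reference. The helper `conv2d` and all sizes are assumptions for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv2d(x, weight):
    """Zero-padded 2D cross-correlation that preserves H and W.

    x: (B, C_in, H, W). weight: (C_out, C_in, f_h, f_w) with odd f_h and f_w.
    """
    f_h, f_w = weight.shape[2:]
    padded = np.pad(x, ((0, 0), (0, 0), (f_h // 2, f_h // 2), (f_w // 2, f_w // 2)))
    # windows has shape (B, C_in, H, W, f_h, f_w).
    windows = sliding_window_view(padded, (f_h, f_w), axis=(2, 3))
    return np.einsum("bchwij,ocij->bohw", windows, weight)


rng = np.random.default_rng(0)
n, c, d, h, w = 2, 3, 4, 5, 6
x = rng.normal(size=(n, c, d, h, w))
weight = rng.normal(size=(c, c, 3, 3))   # f_h x f_w = 3x3, C input and C output channels

# Fold depth into the batch dimension and run a single 2D convolution.
folded = x.transpose(0, 2, 1, 3, 4).reshape(n * d, c, h, w)
out_folded = conv2d(folded, weight)

# Reference: convolve every depth slice of every sample separately.
out_per_slice = np.stack(
    [np.stack([conv2d(x[i:i + 1, :, k], weight)[0] for k in range(d)]) for i in range(n)]
).reshape(n * d, c, h, w)

print(np.allclose(out_folded, out_per_slice))
```

The two results agree because each batch item is convolved independently, so treating depth slices as extra batch items applies the same 2D kernel to every slice.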
[0066] As shown in Figure 3, the 2D convolution module is followed by a 2D batch normalization function and a leaky ReLU activation function to output the features extracted by the 2D convolution module, referred to here as the four-dimensional output features. These four-dimensional output features still have the data dimensions (N*D, C, H, W). As shown in Figure 3, the 1D convolution module is used to extract image features along the depth direction, but the four-dimensional output features of the 2D convolution module do not meet the usage conditions of the 1D convolution module. Therefore, the binocular depth model of this embodiment also includes a second data conversion module. The second data conversion module is used to convert the above-mentioned four-dimensional output features into feature data containing the depth dimension. Specifically, the second data conversion module is used to convert the data dimension (N*D, C, H, W) into (N, C, D, H*W). The second data conversion module inputs the converted (N, C, D, H*W) data into the 1D convolution module. The 1D convolution module contains several 1D convolution operators. The 1D convolution operator can be represented as f_d x 1. Optionally, if the binocular depth model adopted the above pseudo-3D convolution model, the depth-related 3D convolution operator used would be expressed as f_d x 1 x 1. The size of the 1D convolution operator in this embodiment of the invention can be determined based on the depth of the f_d x 1 x 1 operator in the pseudo-3D convolution model.
[0067] The 1D convolution module is used to extract image features from the input data along the depth dimension. As shown in Figure 3, the 1D convolution module is followed by a 1D batch normalization function and an activation function (leaky ReLU) to output the features extracted by the 1D convolution module, thus obtaining 3D depth features. In summary, the 2D convolution module is used to extract image features along the height and width dimensions, while the 1D convolution module is used to extract features along the depth dimension. The combination of the 2D and 1D convolution modules achieves the effect of extracting 3D depth features.
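The second data conversion and the depth-wise 1D convolution of Method 1 can be sketched as follows. This is an illustration under assumptions, not the patented implementation: the batch items are assumed to be ordered sample-major (item n*D + k is depth slice k of sample n), the f_d x 1 operator is realized as a zero-padded cross-correlation along D, and the helper names `batch_to_depth` and `conv1d_depth` are hypothetical.

```python
import numpy as np


def batch_to_depth(y, n):
    """(N*D, C, H, W) -> (N, C, D, H*W), assuming sample-major ordering of batch items."""
    nd, c, h, w = y.shape
    d = nd // n
    return y.reshape(n, d, c, h, w).transpose(0, 2, 1, 3, 4).reshape(n, c, d, h * w)


def conv1d_depth(z, weight):
    """Zero-padded cross-correlation along D only, i.e. an f_d x 1 kernel.

    z: (N, C_in, D, H*W). weight: (C_out, C_in, f_d) with odd f_d.
    """
    f_d = weight.shape[2]
    pad = f_d // 2
    d = z.shape[2]
    padded = np.pad(z, ((0, 0), (0, 0), (pad, pad), (0, 0)))
    out = np.zeros((z.shape[0], weight.shape[0], d, z.shape[3]))
    for k in range(f_d):
        # Each kernel tap mixes channels and reads the input shifted by k along depth.
        out += np.einsum("ncds,oc->nods", padded[:, :, k:k + d], weight[:, :, k])
    return out


rng = np.random.default_rng(0)
n, d, c, h, w = 2, 4, 3, 5, 6
y = rng.normal(size=(n * d, c, h, w))      # stand-in for the 2D module output
z = batch_to_depth(y, n)                   # shape (2, 3, 4, 30)
weight = rng.normal(size=(c, c, 3))        # f_d = 3
depth_features = conv1d_depth(z, weight)   # shape (2, 3, 4, 30)
print(z.shape, depth_features.shape)
```

Flattening H and W into one axis is harmless here because the f_d x 1 kernel has width 1 along that axis, so every spatial position is processed independently along depth.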
[0068] In Method 2 above, the stereo image is converted from five-dimensional data (N, C, D, H, W) to four-dimensional data (N, C*D, H, W), that is, the depth dimension of the stereo image is compressed to the channel dimension. It is important to note that the size of the channel dimension is related to the configuration of the convolutional layer. In some embodiments, the channel dimension of the input data to the 2D convolution module becomes C_in*D, and the channel dimension of the output data becomes C_out1*D. Therefore, the input and output channels of the 2D convolution operator in the 2D convolution module need to be set to C_in*D and C_out1*D, respectively. The data dimension input to the 2D convolution module can then be represented as (N, C_in*D, H, W), and the data dimension output by the 2D convolution module as (N, C_out1*D, H, W). As shown in Figure 3, the 2D convolution module is followed by a 2D batch normalization function and a leaky ReLU activation function to output the features extracted by the 2D convolution module, referred to here as the four-dimensional output features. The data dimensions of these four-dimensional output features are (N, C_out1*D, H, W).
[0069] As shown in Figure 3, the 1D convolution module is used to extract image features along the depth direction, but the four-dimensional output features of the 2D convolution module do not meet the usage conditions of the 1D convolution module. Therefore, the binocular depth model of this embodiment also includes a second data conversion module. The second data conversion module is used to convert the above-mentioned four-dimensional output features (N, C_out1*D, H, W) into (N, C_out1, D, H*W). Because the second data conversion module changes the number of data channels, the numbers of input and output channels of the 1D convolution module in this method are C_out1 and C_out2, respectively. The 1D convolution module contains several 1D convolution operators. The 1D convolution operator can be represented as f_d x 1. Optionally, if the binocular depth model adopted the above pseudo-3D convolution model, the depth-related 3D convolution operator used would be expressed as f_d x 1 x 1. The size of the 1D convolution operator in this embodiment of the invention can be determined based on the depth of the f_d x 1 x 1 operator in the pseudo-3D convolution model.
[0070] The 1D convolution module is used to extract image features from the input data along the depth dimension. As shown in Figure 3, the 1D convolution module is followed by a 1D batch normalization function and a leaky ReLU activation function, which output the features extracted by the 1D convolution module to obtain 3D depth features. In summary, the 2D convolution module is used to extract image features along the height and width dimensions, while the 1D convolution module is used to extract features along the depth dimension. The combination of the 2D and 1D convolution modules achieves the effect of extracting 3D depth features.
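Because Method 2 ties the channel settings of the 2D convolution to D, the layer configuration can be checked with simple arithmetic. The sketch below counts the weights of the 2D convolution under each method, assuming ordinary dense (non-grouped) convolutions without bias; the function name and the example sizes are assumptions chosen for illustration.

```python
def conv2d_weight_count(c_in, c_out, f_h, f_w):
    """Number of weights in a dense 2D convolution layer (bias excluded)."""
    return c_in * c_out * f_h * f_w


# Example sizes (assumed): 8 channels per depth slice, D = 4, 3x3 kernels.
c, d, f_h, f_w = 8, 4, 3, 3

# Method 1: depth lives in the batch dimension, so the layer sees C input and C output channels.
method1_weights = conv2d_weight_count(c, c, f_h, f_w)

# Method 2: the layer must be configured with C_in*D input and C_out1*D output channels.
method2_weights = conv2d_weight_count(c * d, c * d, f_h, f_w)

print(method1_weights, method2_weights)  # prints "576 9216"
```

With equal per-slice channel counts, the Method 2 layer holds D squared times as many weights as the Method 1 layer under these assumptions, which is one reason the channel configuration has to be chosen together with D.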
[0071] Figure 4 is a schematic diagram of a depth image acquisition device provided in an embodiment of the present invention. The device can be deployed in various electronic devices. These electronic devices may include, for example, mobile phones, tablets, laptops, handheld computers, mobile internet devices (MIDs), wearable devices, virtual reality (VR) devices, augmented reality (AR) devices, electronic devices in industrial control, electronic devices in self-driving vehicles, and electronic devices in smart homes. As shown in Figure 4, the depth image acquisition device includes:
[0072] Acquisition module 401 is used to acquire binocular images;
[0073] The depth feature extraction module 402 is used to input the binocular image into a binocular depth model, wherein the binocular depth model is constructed based on a 2D convolution operator and is used to extract the 3D depth features of the binocular image based on the 2D convolution operator;
[0074] Instance segmentation module 403 is used to acquire target instances in the binocular image;
[0075] The post-processing module 404 is used to correct the 3D depth features output by the binocular depth model according to the target instance to obtain a depth image.
[0076] Optionally, the aforementioned binocular depth-of-field model includes:
[0077] The first data conversion module is used to compress the depth dimension of the stereo image to convert the stereo image from five-dimensional data to four-dimensional data and input it into the 2D convolution module;
[0078] The 2D convolution module is used to extract image features of the input data from the height and width dimensions to obtain four-dimensional output features.
[0079] The second data conversion module is used to convert the four-dimensional output features into feature data containing depth dimensions and input them into the 1D convolution module.
[0080] The 1D convolution module is used to extract image features from the input data from the depth dimension to obtain 3D depth features.
[0081] Optionally, the first data conversion module is used to compress the depth dimension of the stereo image to convert the stereo image from five-dimensional data to four-dimensional data, including:
[0082] The depth dimension of the stereo image is compressed to a batch dimension to convert the stereo image from five-dimensional data (N, C, D, H, W) to four-dimensional data (N*D, C, H, W); or,
[0083] The depth dimension of the stereo image is compressed to the channel dimension to convert the stereo image from five-dimensional data (N, C, D, H, W) to four-dimensional data (N, C*D, H, W);
[0084] Where N is the batch dimension, C is the channel dimension, D is the depth dimension, H is the height dimension, and W is the width dimension.
[0085] Optionally, if the four-dimensional data converted by the first data conversion module is (N*D, C, H, W), then the data dimensions of the input and output of the 2D convolution module are both (N*D, C, H, W), and the second data conversion module is used to convert the data dimensions (N*D, C, H, W) into (N, C, D, H*W).
[0086] If the four-dimensional data converted by the first data conversion module is (N, C*D, H, W), then the data dimension input to the 2D convolution module is (N, C_in*D, H, W), the data dimension output by the 2D convolution module is (N, C_out1*D, H, W), and the second data conversion module is used to convert (N, C_out1*D, H, W) into (N, C_out1, D, H*W).
[0087] The depth image acquisition device of this invention can be implemented using the method described in the embodiments above. For parts not described in detail in this embodiment, please refer to the relevant descriptions in the method embodiments. The execution process and technical effects of this technical solution are described in the method embodiments, and will not be repeated here.
[0088] It should be understood that the division of the various modules in the depth image acquisition device shown in Figure 4 is merely a logical functional division. In actual implementation, they can be fully or partially integrated into a single physical entity, or they can be physically separated. Furthermore, these modules can be implemented entirely in software invoked by processing elements; they can be implemented entirely in hardware; or some modules can be implemented in software invoked by processing elements, while others are implemented in hardware. For example, the depth feature extraction module 402, instance segmentation module 403, and post-processing module 404 can be separate processing elements, or some or all of these modules can be integrated into a single chip in the electronic device. The implementation of other modules is similar. In addition, these modules can be fully or partially integrated together, or implemented independently. During implementation, each step of the above method or each of the above modules can be completed through integrated logic circuits in the hardware of the processor element or through software instructions.
[0089] For example, these modules can be one or more integrated circuits configured to implement the above methods, such as one or more Application Specific Integrated Circuits (ASICs), one or more Digital Signal Processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs). Alternatively, these modules can be integrated together as a System-On-a-Chip (SOC).
[0090] Figure 5 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present invention. As shown in Figure 5, the electronic device is presented in the form of a general-purpose computing device. The components of the electronic device may include, but are not limited to: one or more processors 510, a communication interface 520, a memory 530, and a communication bus 540 connecting different system components (including the memory 530, the communication interface 520, and the processor 510).
[0091] The communication bus 540 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor bus, or a local bus using any of a variety of bus architectures. Examples of these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
[0092] Electronic devices typically include a variety of computer-readable media. These media can be any available media that can be accessed by the electronic device, including volatile and non-volatile media, and removable and non-removable media.
[0093] Memory 530 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) and / or cache memory. The electronic device may further include other removable / non-removable, volatile / non-volatile computer system storage media. Memory 530 may include at least one program product having a set (e.g., at least one) of program modules configured to perform the depth image acquisition method of embodiments of the present invention.
[0094] A program / utility having a set (at least one) of program modules may be stored in memory 530. Such program modules include—but are not limited to—an operating system, one or more application programs, other program modules, and program data. Each or some combination of these examples may include an implementation of a network environment. The program modules typically perform the functions and / or methods described in the embodiments of this specification.
[0095] The processor 510 executes various functional applications and data processing by running programs stored in the memory 530, such as implementing the depth image acquisition method provided in the embodiments of the invention.
[0096] In a specific implementation, embodiments of the present invention also provide a computer storage medium, wherein the computer storage medium may store a program, which, when executed, may implement some or all of the steps included in the embodiments provided in this application. The storage medium may be a magnetic disk, optical disk, read-only memory (ROM), or random access memory (RAM), etc.
[0097] In a specific implementation, this embodiment of the invention also provides a chip, including: a processor, which is used to execute computer program instructions stored in a memory, wherein when the computer program instructions are executed by the processor, the chip is triggered to execute the depth image acquisition method of this embodiment of the invention.
[0098] In a specific implementation, the present invention also provides a computer program product, which includes executable instructions that, when executed on a computer, cause the computer to perform some or all of the steps in the above method embodiments.
[0099] In this embodiment of the invention, "at least one" refers to one or more, and "more than one" refers to two or more. "And / or" describes the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent the existence of A alone, the simultaneous existence of A and B, or the existence of B alone. A and B can be singular or plural. The character " / " generally indicates that the preceding and following related objects are in an "or" relationship. "At least one of the following" and similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one of a, b, and c can represent: a, b, c, ab, ac, bc, or abc, where a, b, and c can be single or multiple.
[0100] Those skilled in the art will recognize that the units and algorithm steps described in the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of electronic hardware and software. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this invention.
[0101] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the systems, devices, and units described above are described in the corresponding processes of the foregoing method embodiments, and will not be repeated here.
[0102] In several embodiments provided by this invention, any function, if implemented as a software functional unit and sold or used as an independent product, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this invention, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0103] The above description is merely a specific embodiment of the present invention. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this invention should be included within the protection scope of this invention. The protection scope of this invention should be determined by the scope of the claims.
Claims
1. A method for acquiring depth images, characterized in that it comprises: acquiring a binocular image; inputting the binocular image into a binocular depth model, wherein the binocular depth model is constructed based on a 2D convolution operator and is used to extract 3D depth features of the binocular image based on the 2D convolution operator; obtaining a target instance in the binocular image; and correcting the 3D depth features output by the binocular depth model based on the target instance to obtain a depth image; wherein the binocular depth model comprises: a first data conversion module, used to compress the depth dimension of the binocular image so as to convert the binocular image from five-dimensional data to four-dimensional data and input the four-dimensional data into a 2D convolution module; the 2D convolution module, used to extract image features of the input data along the height and width dimensions to obtain four-dimensional output features; a second data conversion module, used to convert the four-dimensional output features into feature data containing the depth dimension and input the feature data into a 1D convolution module; and the 1D convolution module, used to extract image features of the input data along the depth dimension to obtain the 3D depth features.
2. The method according to claim 1, characterized in that the first data conversion module being used to compress the depth dimension of the binocular image so as to convert the binocular image from five-dimensional data to four-dimensional data comprises: compressing the depth dimension of the binocular image into the batch dimension to convert the binocular image from five-dimensional data (N, C, D, H, W) into four-dimensional data (N*D, C, H, W); or compressing the depth dimension of the binocular image into the channel dimension to convert the binocular image from five-dimensional data (N, C, D, H, W) into four-dimensional data (N, C*D, H, W); where N is the batch dimension, C is the channel dimension, D is the depth dimension, H is the height dimension, and W is the width dimension.
3. The method according to claim 2, characterized in that, if the four-dimensional data converted by the first data conversion module is (N*D, C, H, W), then the data dimensions of the input and output of the 2D convolution module are both (N*D, C, H, W), and the second data conversion module is used to convert the data dimensions (N*D, C, H, W) into (N, C, D, H*W); if the four-dimensional data converted by the first data conversion module is (N, C*D, H, W), then the data dimension input to the 2D convolution module is (N, C_in*D, H, W), the data dimension output by the 2D convolution module is (N, C_out1*D, H, W), and the second data conversion module converts (N, C_out1*D, H, W) into (N, C_out1, D, H*W).
4. The method according to claim 1, characterized in that a 2D batch normalization function and an activation function are connected in series between the 2D convolution module and the second data conversion module.
5. The method according to claim 1, characterized in that the 1D convolution module is followed by a 1D batch normalization function and an activation function.
6. The method according to claim 1, characterized in that the binocular image includes a left image and a right image, and before inputting the binocular image into the binocular depth model, the method further includes: performing distortion correction and stereo correction on the left image and the right image.
7. The method according to claim 1, characterized in that the binocular image includes a left image and a right image, and obtaining the target instance in the binocular image includes: segmenting the left image to obtain the target instance.
8. A depth image acquisition device, characterized in that it comprises: an acquisition module, used to acquire a binocular image; a depth feature extraction module, used to input the binocular image into a binocular depth model, wherein the binocular depth model is constructed based on a 2D convolution operator and is used to extract 3D depth features of the binocular image based on the 2D convolution operator; an instance segmentation module, used to obtain a target instance in the binocular image; and a post-processing module, used to correct the 3D depth features output by the binocular depth model according to the target instance to obtain a depth image; wherein the binocular depth model comprises: a first data conversion module, used to compress the depth dimension of the binocular image so as to convert the binocular image from five-dimensional data to four-dimensional data and input the four-dimensional data into a 2D convolution module; the 2D convolution module, used to extract image features of the input data along the height and width dimensions to obtain four-dimensional output features; a second data conversion module, used to convert the four-dimensional output features into feature data containing the depth dimension and input the feature data into a 1D convolution module; and the 1D convolution module, used to extract image features of the input data along the depth dimension to obtain the 3D depth features.
9. An electronic device, characterized in that it comprises: at least one processor; and at least one memory communicatively connected to the processor, wherein the memory stores program instructions executable by the processor, and the processor invokes the program instructions to perform the method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium includes a stored program, wherein, when the program is executed, it controls the device on which the computer-readable storage medium is located to perform the method according to any one of claims 1 to 7.