Training method, apparatus and device for an instance depth estimation model
By decoupling the calculation of visual depth and attribute depth information in monocular 3D object detection, the accuracy and comprehensiveness of instance depth estimation are improved, and the problem of inaccurate instance depth estimation in monocular 3D object detection is solved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HANGZHOU FABU TECH CO LTD
- Filing Date
- 2022-07-15
- Publication Date
- 2026-06-23
Figure CN115222789B_ABST
Abstract
Description
Technical Field
[0001] This application relates to three-dimensional target detection technology, and more particularly to a training method, apparatus and device for an instance depth estimation model.
Background Technology
[0002] Object detection is a traditional task in computer vision. Unlike image recognition, object detection not only needs to identify objects in an image and assign their corresponding categories, but also needs to provide the object's location using a minimum bounding box. Depending on the output, object detection is divided into two-dimensional (2D) object detection and three-dimensional object detection. Generally, using RGB images to perform object detection and outputting the object category and its minimum bounding box in the image is called 2D object detection. Using RGB images, RGB-D depth images, and laser point clouds to output the object category and its length, width, height, and rotation angle in three-dimensional space is called three-dimensional (3D) object detection. 3D object detection is widely used in fields such as autonomous driving and robot navigation.
[0003] Monocular 3D object detection using a monocular camera has garnered significant attention in recent years due to its numerous advantages. However, in monocular 3D object detection, depth information of the target object is lost during camera projection, necessitating the estimation of instance depth. Traditional methods typically employ neural networks for instance depth estimation. This approach fails to consider the inherent coupling of instance depth (which is related to the relative position of the target object and the camera), resulting in inaccurate estimations.
[0004] Improving the accuracy of instance depth estimation remains a challenge in monocular 3D object detection.
Summary of the Invention
[0005] This application provides a training method, apparatus, and device for an instance depth estimation model, which addresses the problem of improving the accuracy of instance depth estimation in monocular 3D object detection.
[0006] In a first aspect, this application provides a method for training an instance depth estimation model, including:
[0007] An initial instance depth estimation model is obtained, comprising at least an image 2D information extraction network and an image 3D information extraction network. The image 2D information extraction network is used to predict the 2D information of a target object in the image, and the image 3D information extraction network is used to predict the 3D information of the target object based on its 2D information. The 3D information of the target object includes at least the instance depth of the target object, which is determined based on its visual depth information and attribute depth information. The visual depth information and attribute depth information of the target object are output by a portion of the network in the image 3D information extraction network.
[0008] Acquire multiple training images and laser point cloud data of the training images, wherein the training images include at least one target object;
[0009] The initial instance depth estimation model is trained based on multiple training images and laser point cloud data of multiple training images to obtain the three-dimensional information of the target object in each training image.
[0010] Training ends when the termination condition is met, and the target instance depth estimation model is obtained.
[0011] In one embodiment, the image two-dimensional information extraction network includes a deep feature extraction network;
[0012] The step of training the initial instance depth estimation model based on multiple training images and laser point cloud data from multiple training images includes:
[0013] Multiple training images are input into the deep feature extraction network to obtain the deep features of each training image;
[0014] The target area in each training image is labeled based on the laser point cloud data of multiple training images;
[0015] The image 3D information extraction network is trained based on the deep features of each training image and the target area image in each training image to obtain the visual depth information, attribute depth information and instance depth of the target in each training image.
[0016] In one embodiment, the image 3D information extraction network includes a target feature information extraction network and a computation network, wherein the target feature information extraction network includes a visual depth information extraction sub-network and an attribute depth information extraction sub-network;
[0017] The step of training the image 3D information extraction network based on the deep features of each training image and the target object region image in each training image to obtain the visual depth information, attribute depth information, and instance depth of the target object in each training image includes:
[0018] The visual depth information extraction subnetwork and the attribute depth information extraction subnetwork are trained based on the deep features of each training image and the target object region image in each training image, respectively, to obtain the visual depth information of the target object in each training image predicted by the visual depth information extraction subnetwork and the attribute depth information of the target object in each training image predicted by the attribute depth information extraction subnetwork; the visual depth information includes visual depth and visual depth uncertainty value, and the attribute depth information includes attribute depth and attribute depth uncertainty value;
[0019] The visual depth, visual depth uncertainty, attribute depth, and attribute depth uncertainty of the target object in each training image are input into the computational network to obtain the instance depth of the target object in each training image.
[0020] In one embodiment, the step of training the visual depth information extraction subnetwork and the attribute depth information extraction subnetwork based on the deep features of each training image and the target object region image in each training image, respectively, to obtain the visual depth information of the target object in each training image predicted by the visual depth information extraction subnetwork and the attribute depth information of the target object in each training image predicted by the attribute depth information extraction subnetwork, includes:
[0021] The target region image in each training image is divided into multiple sub-region images to obtain a set of sub-region images;
[0022] The sub-region image set is input into the visual depth information extraction sub-network and the attribute depth information extraction sub-network respectively to obtain the visual depth, visual depth uncertainty value, attribute depth and attribute depth uncertainty value of each sub-region image in each target object region image;
[0023] The step of inputting the visual depth, visual depth uncertainty, attribute depth, and attribute depth uncertainty of the target object in each training image into the computational network to obtain the instance depth of the target object in each training image includes:
[0024] The visual depth, visual depth uncertainty, attribute depth, and attribute depth uncertainty of each sub-region image are input into the computational network to obtain the instance depth of the target object in each training image.
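The sub-region division described in this embodiment can be illustrated with a minimal sketch. The source does not say how the target region is divided; the uniform rows-by-columns grid, the function name `split_into_subregions`, and the box layout `(x_min, y_min, x_max, y_max)` are all assumptions made for illustration only.

```python
from typing import List, Tuple

# Hypothetical box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def split_into_subregions(box: Box, rows: int, cols: int) -> List[Box]:
    """Divide a target-region box into rows x cols equal sub-region boxes.

    ASSUMPTION: a uniform grid; the source only says the region is
    divided into multiple sub-region images.
    """
    if rows <= 0 or cols <= 0:
        raise ValueError("rows and cols must be positive")
    x_min, y_min, x_max, y_max = box
    cell_w = (x_max - x_min) / cols
    cell_h = (y_max - y_min) / rows
    cells: List[Box] = []
    for r in range(rows):          # top to bottom
        for c in range(cols):      # left to right
            cells.append((
                x_min + c * cell_w,
                y_min + r * cell_h,
                x_min + (c + 1) * cell_w,
                y_min + (r + 1) * cell_h,
            ))
    return cells
```

Each returned cell would then receive its own visual depth, attribute depth, and the two uncertainty values from the respective sub-networks.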
[0025] In one embodiment, the computing network is used for:
[0026] The instance depth of any sub-region image is determined based on the visual depth and attribute depth of that sub-region image, wherein the sub-region image belongs to any target object region image in any training image.
[0027] The instance depth uncertainty value of the sub-region image is determined based on the visual depth uncertainty value and the attribute depth uncertainty value of that sub-region image.
[0028] The instance depth of the target object in the training image is determined based on the instance depth of each sub-region image and the instance depth uncertainty value of each sub-region image.
[0029] In one embodiment, when the computational network determines the instance depth of the target object in each training image based on the instance depth of each sub-region image and the instance depth uncertainty value of each sub-region image, it is specifically used for:
[0030] Convert the instance depth uncertainty value of any sub-region image into the instance depth confidence value;
[0031] The instance depth of the target object in any training image is determined based on the instance depth of any sub-region image and the confidence level of the instance depth of any sub-region image.
[0032] In one embodiment, converting the instance depth uncertainty value of any sub-region image into an instance depth confidence value includes:
[0033] According to the formula $P_{ins} = \exp(-u_{ins})$, the instance depth uncertainty value of any sub-region image is converted into the instance depth confidence value;
[0034] where $u_{ins}$ represents the instance depth uncertainty value of the sub-region image, and $P_{ins}$ represents the instance depth confidence value of the sub-region image.
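As a minimal sketch of the conversion P_ins = exp(-u_ins) and its use, the following code turns per-sub-region uncertainties into confidences and combines the sub-region instance depths into one value for the target object. The source only states that the object's instance depth is determined from the sub-region depths and confidences; the confidence-weighted average used here, and both function names, are assumptions.

```python
import math
from typing import Sequence


def uncertainty_to_confidence(u_ins: float) -> float:
    """Convert an instance depth uncertainty into a confidence, P = exp(-u)."""
    return math.exp(-u_ins)


def aggregate_instance_depth(depths: Sequence[float],
                             uncertainties: Sequence[float]) -> float:
    """Combine sub-region instance depths into one object-level depth.

    ASSUMPTION: a confidence-weighted average. The source does not
    specify the exact aggregation rule.
    """
    if len(depths) != len(uncertainties) or len(depths) == 0:
        raise ValueError("depths and uncertainties must be non-empty and equal in length")
    confidences = [uncertainty_to_confidence(u) for u in uncertainties]
    total = sum(confidences)
    return sum(d * p for d, p in zip(depths, confidences)) / total
```

Under this assumed rule, sub-regions with lower uncertainty pull the final depth more strongly toward their own estimate.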
[0035] In one embodiment, determining the instance depth of any sub-region image based on its visual depth and attribute depth includes:
[0036] The sum of the visual depth and attribute depth of any given sub-region image is determined as the instance depth of that given sub-region image.
[0037] The step of determining the instance depth uncertainty value of any given sub-region image based on the visual depth uncertainty value and the attribute depth uncertainty value of any given sub-region image includes:
[0038] Calculate the square of the visual depth uncertainty of any sub-region image to obtain a first value, and calculate the square of the attribute depth uncertainty of any sub-region image to obtain a second value;
[0039] The square root of the sum of the first value and the second value is determined to be the instance depth uncertainty value of any sub-region image.
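The two rules above (instance depth as the sum of visual and attribute depth, and instance depth uncertainty as the square root of the sum of the squared uncertainties) can be sketched for a single sub-region as follows. The function name `combine_depth` and the argument names are hypothetical.

```python
import math
from typing import Tuple


def combine_depth(d_vis: float, u_vis: float,
                  d_att: float, u_att: float) -> Tuple[float, float]:
    """Return (instance depth, instance depth uncertainty) for one sub-region."""
    d_ins = d_vis + d_att                      # sum of visual and attribute depth
    u_ins = math.sqrt(u_vis ** 2 + u_att ** 2)  # root of the sum of squares
    return d_ins, u_ins
```

For example, a visual depth of 20.0 with uncertainty 0.3 and an attribute depth of 2.0 with uncertainty 0.4 give an instance depth of 22.0 with an uncertainty of about 0.5.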
[0040] In one embodiment, the visual depth and visual depth uncertainty of each sub-region image follow a Laplace distribution;
[0041] The attribute depth and attribute depth uncertainty of each sub-region image follow a Laplace distribution.
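The source does not state which training loss accompanies the Laplace assumption. One common choice, shown here purely as an illustrative sketch, is the negative log-likelihood of a Laplace distribution whose location is the predicted depth and whose scale is the predicted uncertainty. Treating the uncertainty value directly as the Laplace scale parameter is an assumption, as is the function name `laplace_nll`.

```python
import math


def laplace_nll(pred_depth: float, scale: float, target_depth: float) -> float:
    """Negative log-likelihood of target_depth under Laplace(pred_depth, scale).

    Density: p(x) = exp(-|x - mu| / b) / (2b), so
    NLL = |x - mu| / b + log(2b).
    ASSUMPTION: the predicted uncertainty is used as the scale b.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    return abs(target_depth - pred_depth) / scale + math.log(2.0 * scale)
```

With such a loss, a large predicted uncertainty reduces the penalty on the depth error but is itself penalized by the logarithmic term.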
[0042] In one embodiment, each training image has the same size, and the size of any training image is the original size or a scaled-down version of the original size. The scaled-down version of the original size is obtained by scaling down the size with affine transformation properties in the original size.
[0043] At least one different training image is derived from the same initial training image.
[0044] In one embodiment, the termination condition includes any one or more of the following: the training duration reaches a preset duration, the number of training iterations reaches a preset number, and the loss of the initial instance depth estimation model is less than a preset loss.
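A check for the termination condition could look like the sketch below. The function name `should_stop`, its parameters, and the choice to stop as soon as any configured condition holds are assumptions; the source only lists the three possible conditions.

```python
from typing import Optional


def should_stop(elapsed_seconds: float, iteration: int, loss: float,
                max_seconds: Optional[float] = None,
                max_iterations: Optional[int] = None,
                loss_threshold: Optional[float] = None) -> bool:
    """Return True when any configured termination condition is met.

    A condition left as None is treated as not configured.
    """
    if max_seconds is not None and elapsed_seconds >= max_seconds:
        return True   # training duration reached the preset duration
    if max_iterations is not None and iteration >= max_iterations:
        return True   # iteration count reached the preset number
    if loss_threshold is not None and loss < loss_threshold:
        return True   # model loss fell below the preset loss
    return False
```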
[0045] In a second aspect, this application provides a method for estimating instance depth in 3D detection, including:
[0046] Acquire the image to be detected captured by the camera;
[0047] The image to be detected is input into a target instance depth estimation model trained by the instance depth estimation model training method as described in the first aspect, to obtain the instance depth of at least one target object in the image to be detected.
[0048] In another aspect, this application provides a training apparatus for an instance depth estimation model, comprising:
[0049] An acquisition module is used to acquire an initial instance depth estimation model. The initial instance depth estimation model includes at least an image 2D information extraction network and an image 3D information extraction network. The image 2D information extraction network is used to predict the 2D information of a target object in the image, and the image 3D information extraction network is used to predict the 3D information of the target object in the image based on the 2D information of the target object. The 3D information of the target object in the image includes at least the instance depth of the target object. The instance depth of the target object is determined based on the visual depth information and attribute depth information of the target object. The visual depth information and attribute depth information of the target object are output by a portion of the network in the image 3D information extraction network.
[0050] The acquisition module is also used to acquire multiple training images and laser point cloud data of the multiple training images, wherein the training images include at least one target object;
[0051] The training module is used to train the initial instance depth estimation model based on multiple training images and laser point cloud data of multiple training images to obtain the three-dimensional information of the target object in each training image.
[0052] The training module is also used to end the training when the termination condition is met, and obtain the target instance depth estimation model.
[0053] In another aspect, this application also provides an instance depth estimation device for 3D detection, comprising:
[0054] The acquisition module is used to acquire the image to be detected captured by the camera;
[0055] The processing module is used to input the image to be detected into a target instance depth estimation model trained by the instance depth estimation model training method as described in the first aspect, so as to obtain the instance depth of at least one target object in the image to be detected.
[0056] In another aspect, this application also provides an electronic device, including: a processor, and a memory communicatively connected to the processor;
[0057] The memory stores computer-executable instructions;
[0058] The processor executes the computer-executable instructions stored in the memory to implement the training method for the instance depth estimation model as described in the first aspect, and / or to implement the instance depth estimation method in 3D detection as described in the second aspect.
[0059] In another aspect, this application also provides a computer-readable storage medium storing computer-executable instructions, which, when executed, cause a computer to perform the training method for an instance depth estimation model as described in the first aspect, and / or the instance depth estimation method in 3D detection as described in the second aspect.
[0060] In another aspect, this application also provides a computer program product, including a computer program that, when executed by a processor, implements the training method for an instance depth estimation model as described in the first aspect, and / or the instance depth estimation method in 3D detection as described in the second aspect.
[0061] This application provides a training method for an instance depth estimation model. An initial instance depth estimation model is provided, comprising an image 2D information extraction network and an image 3D information extraction network. The image 2D information extraction network is used to predict the 2D information of a target object in an image, and the image 3D information extraction network is used to predict the 3D information of the target object based on its 2D information. The 3D information of the target object includes at least the instance depth of the target object, which is determined based on the target object's visual depth information and attribute depth information. The visual depth information and attribute depth information of the target object are output by a portion of the image 3D information extraction network.
[0062] This initial instance depth estimation model decouples the instance depth calculation process, determining it using both visual depth information and attribute depth information. Visual depth information is related to the relative position of the target object and the camera, while attribute depth information is related to the target object's inherent attributes. Calculating both visual and attribute depth information separately provides a more comprehensive and accurate understanding of the target object's instance depth. The target instance depth estimation model trained based on this initial model also decouples the instance depth calculation process during application, thereby improving the accuracy and comprehensiveness of instance depth calculation.
[0063] In summary, the training method for the instance depth estimation model provided in the embodiments of this application can solve the problem of how to improve the accuracy of instance depth estimation in monocular 3D object detection.
Attached Figure Description
[0064] The accompanying drawings, which are incorporated in and form a part of this specification, illustrate embodiments consistent with this disclosure and, together with the description, serve to explain the principles of this disclosure.
[0065] Figure 1 is a schematic diagram illustrating the instance depth provided in this application;
[0066] Figure 2 is a schematic diagram illustrating an application scenario of the training method for the instance depth estimation model provided in this application;
[0067] Figure 3 is a flowchart of a training method for an instance depth estimation model provided in one embodiment of this application;
[0068] Figure 4 is a schematic diagram illustrating the acquisition of training images provided in one embodiment of this application;
[0069] Figure 5 is a schematic diagram illustrating the acquisition of training images provided in another embodiment of this application;
[0070] Figure 6 is a schematic diagram of the network structure and training process of an initial instance depth estimation model provided in one embodiment of this application;
[0071] Figure 7 is a flowchart of an instance depth estimation method in 3D detection provided in one embodiment of this application;
[0072] Figure 8 is a schematic diagram of a training apparatus for an instance depth estimation model provided in one embodiment of this application;
[0073] Figure 9 is a schematic diagram of an instance depth estimation device for three-dimensional detection provided in one embodiment of this application;
[0074] Figure 10 is a schematic diagram of an electronic device provided in one embodiment of this application.
[0075] The accompanying drawings have illustrated specific embodiments of this disclosure, which will be described in more detail below. These drawings and descriptions are not intended to limit the scope of the concept in any way, but rather to illustrate the concepts of this disclosure to those skilled in the art through reference to particular embodiments.
Detailed Implementation
[0076] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numerals in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this disclosure. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this disclosure as detailed in the appended claims.
[0077] In the description of this application, it should be understood that the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Therefore, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of this application, "multiple" means two or more, unless otherwise explicitly specified.
[0078] Object detection is a traditional task in computer vision. Unlike image recognition, object detection not only needs to identify objects in an image and assign their corresponding categories, but also needs to provide the object's location using a minimum bounding box. Depending on the output, object detection is divided into two-dimensional (2D) object detection and three-dimensional object detection. Generally, using RGB images to detect objects and outputting the object category and its minimum bounding box is called 2D object detection. Using RGB images, RGB-D depth images, and laser point clouds to output the object category and its length, width, height, and rotation angle in three-dimensional space is called three-dimensional (3D) object detection. 3D object detection is widely used in fields such as autonomous driving and robot navigation.
[0079] In 3D object detection, monocular 3D object detection using a monocular camera is quite common. In monocular 3D object detection, the depth information of the object is lost during the camera projection process, so it is necessary to estimate the instance depth of the object.
[0080] The instance depth of a target vehicle is explained with reference to Figure 1. Using the camera position as a reference, the instance depth of target vehicle 1 is equal to the sum of the depth from the visible surface of target vehicle 1 to the camera (visual depth $D_{vis}$) and the attribute depth $D_{att}$ of target vehicle 1. Similarly, the instance depth of target vehicle 2 is equal to the sum of the depth from the visible surface of target vehicle 2 to the camera (visual depth $D_{vis}$) and the attribute depth $D_{att}$ of target vehicle 2.
[0081] Traditional methods for instance depth estimation typically use neural networks directly. However, this approach fails to consider the inherent coupling of instance depth (which is related to the intrinsic properties of the target object and its relative position to the camera), resulting in inaccurate estimates. Improving the accuracy of instance depth estimation remains a challenge in monocular 3D object detection.
[0082] Based on this, this application provides a training method, apparatus, and device for an instance depth estimation model. The training method provides an initial instance depth estimation model that decouples the process of calculating instance depth when calculating the instance depth of a target object. Specifically, the instance depth is determined using both visual depth information and attribute depth information. Visual depth information is related to the relative position of the target object and the camera, while attribute depth information is related to the attributes of the target object itself. Calculating visual depth information and attribute depth information separately provides a more comprehensive and accurate understanding of the instance depth of the target object. The target instance depth estimation model trained based on this initial instance depth estimation model also decouples the process of calculating instance depth during application, thereby improving the accuracy and comprehensiveness of instance depth calculation.
[0083] The training method for the instance depth estimation model provided in this application is applied to electronic devices, such as computers and servers used in laboratories. Figure 2 illustrates an application scenario of the training method for the instance depth estimation model provided in this application. The electronic device provides an initial instance depth estimation model, which includes an image 2D information extraction network and an image 3D information extraction network. The image 2D information extraction network predicts the 2D information of a target object in the image, and the image 3D information extraction network predicts the 3D information of the target object based on its 2D information. The 3D information of the target object includes at least the instance depth of the target object. It should be noted that the instance depth of the target object is determined based on its visual depth information and attribute depth information, which are output by a portion of the image 3D information extraction network. Multiple training images and laser point cloud data from these training images are acquired to train the initial instance depth estimation model to obtain a target instance depth estimation model.
[0084] Referring to Figure 3, one embodiment of this application provides a method for training an instance depth estimation model, comprising:
[0085] S310, Obtain an initial instance depth estimation model. The initial instance depth estimation model includes at least an image 2D information extraction network and an image 3D information extraction network. The image 2D information extraction network is used to predict the 2D information of the target object in the image, and the image 3D information extraction network is used to predict the 3D information of the target object in the image based on the 2D information of the target object. The 3D information of the target object in the image includes at least the instance depth of the target object. The instance depth of the target object is determined based on the visual depth information and attribute depth information of the target object. The visual depth information and attribute depth information of the target object are output by a portion of the network in the image 3D information extraction network.
[0086] This image 2D information extraction network is used to predict the 2D information of objects in an image, including the 2D heatmap of the object, the 2D center point deviation of the object, and the 2D size of the object. The network also includes a deep feature extraction network for predicting deep features of the image. This deep feature extraction network is a convolutional layer network, and the deep features of the image include, for example, the 2D heatmap of the object, the 2D center point deviation of the object, and the 2D size of the object as described above.
[0087] Two-dimensional information of the target object can be used to predict the two-dimensional box of the target object. By combining the laser point cloud data of the target object with the deep features of the predicted image, the two-dimensional bounding box of the target object can be estimated.
[0088] This image 3D information extraction network is used to predict the 3D information of objects in an image based on their 2D information. The 3D information of the objects includes at least the instance depth. First, after estimating the 2D bounding boxes of the objects, the RoI Align algorithm is used to extract features of individual objects from the deep features of the image. These individual object features are then input into the image 3D information extraction network to predict the intermediate 3D information of the objects. The instance depth of the objects is then calculated using some information from this intermediate 3D information. This intermediate 3D information includes, for example, the object's 3D dimensions, 3D center point offset, orientation, visual depth information, and attribute depth information. The instance depth of the objects can be determined using the visual depth information and attribute depth information from this intermediate 3D information. Finally, the 3D bounding box prediction of the objects can also be achieved using this intermediate 3D information.
[0089] This image 3D information extraction network determines the instance depth of a target object based on its visual depth information and attribute depth information. In an optional embodiment, the image 3D information extraction network includes a target object feature information extraction network and a computation network. The target object feature information extraction network includes a visual depth information extraction sub-network and an attribute depth information extraction sub-network. The visual depth information extraction sub-network is used to predict the visual depth information of the target object, and the attribute depth information extraction sub-network is used to predict the attribute depth information of the target object. That is, when predicting the instance depth of a target object, the image 3D information extraction network predicts the visual depth information and attribute depth information of the target object separately, achieving decoupling of instance depth prediction and making the predicted instance depth more comprehensive and accurate.
[0090] The network structure in the initial instance depth estimation model can also be different from the network structure provided in this embodiment, as long as it can predict the visual depth information and attribute depth information of the target object separately, and then determine the instance depth of the target object based on the separately predicted visual depth information and attribute depth information of the target object.
[0091] S320: Acquire multiple training images and laser point cloud data of the multiple training images, wherein the training images include at least one target object.
[0092] As shown in Figure 4, the training image includes at least one target object, such as another vehicle photographed by a camera mounted on a moving vehicle.
[0093] In an optional embodiment, affine transformations can be used to augment the training images to increase their quantity and diversity. For example, as shown in Figure 4, an initial training image is randomly selected and cropped into multiple training images. After cropping, an affine transformation method is used to unify the size of the selected cropped training images, that is, to convert the selected cropped training images to the same size.
[0094] In monocular imaging, visual depth is a crucial characteristic. For monocular systems, visual depth is highly dependent on the 2D box size of the target object (distant objects appear smaller in the image, while nearby objects appear larger) and the target object's position within the image. If an affine transformation is performed on the image, the visual depth needs to be transformed accordingly, that is, the depth values need to be scaled. When the training image is resized, the visual depth of the target object must be scaled according to the resizing factor.
[0095] Attribute depth refers to the depth offset from the visible surface of an object to its 3D center, and is more closely related to the object's intrinsic properties. For example, when a car's orientation is parallel to the z-axis (depth direction) in 3D space, the attribute depth of the car's rear is half its length. Conversely, if the car's orientation is parallel to the x-axis, the attribute depth is half its width. Because attribute depth depends on the object's intrinsic properties, it is, unlike visual depth, invariant to any affine transformation: the attribute depth of the object does not change when the size of the training image is transformed.
[0096] For example, the scaling factor for resizing a training image is $(S_x, S_y)$, where $S_y$ represents the scaling factor in the depth direction. Figure 5(a) shows the training image before the size change, and Figure 5(b) shows the training image after the size change. The attribute depths in Figure 5(a) and Figure 5(b) are equal ($D_{att1} = D_{att2}$). The relationship between the visual depth in Figure 5(a) ($D_{vis1}$) and the visual depth in Figure 5(b) ($D_{vis2}$) is: $D_{vis2} = D_{vis1} / S_y$.
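The effect of resizing on the two depth components can be sketched as follows: the visual depth is divided by the scaling factor S_y, while the attribute depth is left unchanged. The function name `rescale_depths` and its argument names are hypothetical.

```python
from typing import Tuple


def rescale_depths(d_vis: float, d_att: float, s_y: float) -> Tuple[float, float]:
    """Return (visual depth, attribute depth) after resizing by factor s_y.

    Visual depth follows D_vis2 = D_vis1 / S_y; attribute depth is
    invariant to the affine transformation, so D_att2 = D_att1.
    """
    if s_y <= 0:
        raise ValueError("s_y must be positive")
    return d_vis / s_y, d_att
```

For instance, with S_y = 2.0, a visual depth of 30.0 becomes 15.0 while an attribute depth of 2.0 stays 2.0, so the instance depth label for the resized image is 17.0 rather than 32.0.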
[0097] In this step, each of the multiple training images acquired has the same size. The size of any training image is either the original size or a scaled-down version of the original size. The scaled-down version is obtained by scaling the dimensions with affine transformation properties within the original size. Furthermore, at least one different training image originates from the same initial training image. Thus, scaling the dimensions with affine transformation properties expands the number and diversity of training images, enriching the training data. In an optional embodiment, an upper or lower limit can be set for the number of training images to control the duration, quality, etc., of each training session.
[0098] The laser point cloud data of the training image is used to select and mark the target objects in the training image, that is, to achieve the two-dimensional bounding box estimation of the target objects as described in step S310.
[0099] S330, the initial instance depth estimation model is trained based on multiple training images and laser point cloud data of multiple training images to obtain the three-dimensional information of the target object in each training image.
[0100] Figure 6 shows the network structure and training process of the initial instance depth estimation model. This initial instance depth estimation model includes at least an image two-dimensional information extraction network and an image three-dimensional information extraction network.
[0101] This image 2D information extraction network includes a deep feature extraction network. First, multiple training images are input into this deep feature extraction network to obtain the deep features of each training image. Then, the target region image in each training image is labeled based on the laser point cloud data of the multiple training images, thus achieving 2D bounding box estimation of the target object. Next, the image 3D information extraction network is trained using the deep features and target region images of each training image to obtain the visual depth information, attribute depth information, and instance depth of the target object in each training image.
[0102] The image 3D information extraction network includes a target feature information extraction network and a computational network. The target feature information extraction network further includes a visual depth information extraction sub-network and an attribute depth information extraction sub-network. During training, the visual depth information extraction sub-network and the attribute depth information extraction sub-network are trained independently. That is, the visual depth information extraction sub-network and the attribute depth information extraction sub-network in the target feature information extraction network are trained separately based on the deep features of each training image and the target region image in each training image.
[0103] A visual depth information extraction subnetwork is trained based on the deep features of each training image and the target region image in each training image to obtain the visual depth information of the target in each training image predicted by the visual depth information extraction subnetwork. Similarly, an attribute depth information extraction subnetwork is trained based on the deep features of each training image and the target region image in each training image to obtain the attribute depth information of the target in each training image predicted by the attribute depth information extraction subnetwork.
[0104] The visual depth information includes visual depth and visual depth uncertainty, while the attribute depth information includes attribute depth and attribute depth uncertainty. Visual depth uncertainty and attribute depth uncertainty are essentially different expressions of confidence level; the higher the visual depth uncertainty, the lower the visual depth confidence. Similarly, the higher the attribute depth uncertainty, the lower the attribute depth confidence.
[0105] 3D object detection is challenging, and 2D object detection results cannot fully represent the confidence level of 3D object detection. Previous approaches typically used instance depth confidence or 3D IoU loss integrated with 2D detection confidence to represent the final 3D detection confidence. This embodiment decouples instance depth into visual depth and attribute depth, allowing for further decoupling of instance depth confidence. Instance depth only has high confidence when both visual depth and attribute depth confidence are high. It is assumed that the visual depth and visual depth uncertainty of each sub-region image follow a Laplace distribution $L(D_{vis}, u_{vis})$, where $D_{vis}$ represents the visual depth and $u_{vis}$ represents the visual depth uncertainty. The attribute depth and attribute depth uncertainty of each sub-region image follow a Laplace distribution $L(D_{att}, u_{att})$, where $D_{att}$ represents the attribute depth and $u_{att}$ represents the attribute depth uncertainty.
[0106] The instance depth distribution derived from the associated visual depth and attribute depth is $L(D_{ins}, u_{ins})$, where $D_{ins} = D_{vis} + D_{att}$ and $u_{ins} = \sqrt{u_{vis}^2 + u_{att}^2}$. Here $D_{ins}$ represents the instance depth and $u_{ins}$ represents the instance depth uncertainty.
[0107] The visual depth, visual depth uncertainty, attribute depth, and attribute depth uncertainty of the target object in each training image are then input into the computational network to obtain the instance depth of the target object in each training image.
[0108] In an optional embodiment, when training the visual depth information extraction subnetwork and the attribute depth information extraction subnetwork based on the deep features of each training image and the target region image in each training image, the target region image in each training image is divided into multiple sub-region images to obtain a set of sub-region images.
[0109] This set of sub-region images is input into the visual depth information extraction sub-network and the attribute depth information extraction sub-network, respectively, to obtain the visual depth, visual depth uncertainty, attribute depth, and attribute depth uncertainty of each sub-region image in each target region image. Specifically, the visual depth information extraction sub-network predicts the visual depth information of the target object in each training image, while the attribute depth information extraction sub-network predicts the attribute depth information of the target object in each training image.
[0110] For example, a target region image can be divided into an m*n grid of sub-region images (e.g., a 7*7 grid, resulting in 49 sub-region images). A visual depth and an attribute depth are assigned to each sub-region image. The visual depth information extraction sub-network predicts the visual depth information of each sub-region image, and the attribute depth information extraction sub-network predicts the attribute depth information of each sub-region image.
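The grid division described above can be sketched as follows. This is a minimal illustration only: the function name `split_into_grid`, the box format (x_min, y_min, x_max, y_max), and the choice of equally sized cells are assumptions made for the example and are not specified in this application.

```python
from typing import List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def split_into_grid(box: Box, m: int, n: int) -> List[Box]:
    """Split a target region into m rows by n columns of equal sub-regions."""
    if m <= 0 or n <= 0:
        raise ValueError("m and n must be positive")
    x_min, y_min, x_max, y_max = box
    cell_w = (x_max - x_min) / n
    cell_h = (y_max - y_min) / m
    cells: List[Box] = []
    for row in range(m):          # top to bottom
        for col in range(n):      # left to right
            cells.append((
                x_min + col * cell_w,
                y_min + row * cell_h,
                x_min + (col + 1) * cell_w,
                y_min + (row + 1) * cell_h,
            ))
    return cells


# Example: a 7*7 grid yields 49 sub-region boxes.
cells = split_into_grid((0.0, 0.0, 70.0, 35.0), m=7, n=7)
```

Each returned box would then receive its own visual depth and attribute depth prediction from the two sub-networks.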
[0111] In an alternative embodiment, sub-region images in the sub-region image set can be filtered to enhance the model training effect, for example, by filtering out some unclear sub-region images or sub-region images with too few objects.
[0112] After obtaining the visual depth, visual depth uncertainty, attribute depth, and attribute depth uncertainty of each sub-region image, the visual depth, visual depth uncertainty, attribute depth, and attribute depth uncertainty of each sub-region image are input into the computational network to obtain the instance depth of the target object in each training image.
[0113] In an optional embodiment, the computing network is used to perform the methods described in the following three points:
[0114] 1. Determine the instance depth of any sub-region image based on its visual depth and attribute depth, wherein the sub-region image belongs to any target region image in any training image.
[0115] In an optional embodiment, the sum of the visual depth and attribute depth of any sub-region image is determined as the instance depth of the any sub-region image.
[0116] As described above, $D_{ins} = D_{vis} + D_{att}$, where $D_{vis}$ represents the visual depth of any sub-region image, $D_{att}$ represents the attribute depth of that sub-region image, and $D_{ins}$ represents the instance depth of that sub-region image.
[0117] Based on the visual depth and attribute depth of any sub-region image, the formula $D_{ins} = D_{vis} + D_{att}$ determines the instance depth of that sub-region image.
[0118] 2. Determine the instance depth uncertainty of any given sub-region image based on the visual depth uncertainty and attribute depth uncertainty of that given sub-region image.
[0119] In an optional embodiment, the square of the visual depth uncertainty of any given sub-region image is calculated to obtain a first value, and the square of the attribute depth uncertainty of any given sub-region image is calculated to obtain a second value. The square root of the sum of the first value and the second value is determined to be the instance depth uncertainty of any given sub-region image.
[0120] As described above, $u_{ins} = \sqrt{u_{vis}^2 + u_{att}^2}$, where $u_{vis}$ represents the visual depth uncertainty of any sub-region image and $u_{vis}^2$ is the first value; $u_{att}$ represents the attribute depth uncertainty of that sub-region image and $u_{att}^2$ is the second value; and $u_{ins}$ represents the instance depth uncertainty of that sub-region image.
[0121] Based on the visual depth uncertainty and attribute depth uncertainty of any sub-region image, the formula $u_{ins} = \sqrt{u_{vis}^2 + u_{att}^2}$ determines the instance depth uncertainty of that sub-region image.
[0122] 3. Determine the instance depth of the target object in any training image based on the instance depth of any sub-region image and the instance depth uncertainty value of any sub-region image.
[0123] First, according to the formula $P_{ins} = \exp(-u_{ins})$, the instance depth uncertainty of each sub-region image in any target region image of any training image is converted into an instance depth confidence, where $u_{ins}$ represents the instance depth uncertainty of the sub-region image and $P_{ins}$ represents the instance depth confidence of the sub-region image. Alternatively, other methods or formulas can be used to convert the instance depth uncertainty of a sub-region image into an instance depth confidence; this embodiment does not impose any limitation on this.
[0124] Next, the instance depth of the target object in any training image is determined based on the instance depth and the instance depth confidence of each sub-region image. That is, for the sub-region image set of a target object, a formula combines the instance depths and instance depth confidences of the sub-region images into the instance depth of the target object.
[0125] Correspondingly, an instance depth confidence value $p_{ins}$ is obtained for the target object. The final confidence of 3D object detection is $p = p_{2d} \cdot p_{ins}$, where $p_{2d}$ is the confidence of 2D object detection.
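The three points performed by the computing network can be sketched in Python as below. The per-sub-region steps follow the relations stated in this description (sum of depths, square root of the sum of squared uncertainties, and the exponential conversion to confidence). The aggregation over sub-regions is an assumption: a confidence-weighted average is used here purely for illustration, and the aggregation formula of this application may differ. All function names are hypothetical.

```python
import math
from typing import List, Tuple

# One entry per sub-region image: (D_vis, u_vis, D_att, u_att)
SubRegion = Tuple[float, float, float, float]


def combine_sub_region(d_vis: float, u_vis: float,
                       d_att: float, u_att: float) -> Tuple[float, float, float]:
    """Return (D_ins, u_ins, P_ins) for one sub-region image."""
    d_ins = d_vis + d_att                        # point 1: sum of the two depths
    u_ins = math.sqrt(u_vis ** 2 + u_att ** 2)   # point 2: root of summed squares
    p_ins = math.exp(-u_ins)                     # uncertainty -> confidence
    return d_ins, u_ins, p_ins


def aggregate_instance_depth(sub_regions: List[SubRegion]) -> Tuple[float, float]:
    """Return (instance depth, instance depth confidence) of one target.

    ASSUMPTION: both values are confidence-weighted averages over the
    sub-regions; this is an illustrative choice, not a formula taken
    from this application.
    """
    if not sub_regions:
        raise ValueError("at least one sub-region is required")
    combined = [combine_sub_region(*s) for s in sub_regions]
    total_p = sum(p for _, _, p in combined)
    depth = sum(d * p for d, _, p in combined) / total_p
    confidence = sum(p * p for _, _, p in combined) / total_p
    return depth, confidence


def final_3d_confidence(p_2d: float, p_ins: float) -> float:
    """Final 3D detection confidence p = p_2d * p_ins."""
    return p_2d * p_ins


# Example with two sub-regions of one target object.
depth, conf = aggregate_instance_depth([(20.0, 0.3, 2.0, 0.4),
                                        (20.5, 0.6, 1.8, 0.8)])
score = final_3d_confidence(0.9, conf)
```

With a weighting of this kind, sub-regions with low uncertainty dominate the result, which matches the stated idea that instance depth is only trusted where both visual and attribute depth are confident.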
[0126] S340, training ends when the termination condition is met, and the target instance depth estimation model is obtained.
[0127] The termination condition includes any one or more of the following: the training duration reaches a preset duration, the number of training iterations reaches a preset number, or the loss of the initial instance depth estimation model is less than a preset loss.
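The termination check can be sketched as a small Python helper. The function name `should_stop_training` and its parameters are hypothetical; the sketch only assumes that any configured condition (duration, iteration count, or loss below a preset value) is sufficient to end training, as stated above.

```python
from typing import Optional


def should_stop_training(
    elapsed_seconds: float,
    iterations_done: int,
    current_loss: float,
    max_seconds: Optional[float] = None,
    max_iterations: Optional[int] = None,
    loss_threshold: Optional[float] = None,
) -> bool:
    """Return True when any configured termination condition is met."""
    # Training duration reaches the preset duration.
    if max_seconds is not None and elapsed_seconds >= max_seconds:
        return True
    # Number of training iterations reaches the preset number.
    if max_iterations is not None and iterations_done >= max_iterations:
        return True
    # Loss of the model is less than the preset loss.
    if loss_threshold is not None and current_loss < loss_threshold:
        return True
    return False
```

A condition left as `None` is simply not checked, so one, two, or all three conditions can be enabled together.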
[0128] The loss of the initial instance depth estimation model is determined based on the loss function of the initial instance depth estimation model, which is described below.
[0129] For the image 2D information extraction network part:
[0130] Following the design in CenterNet, the 2D heatmap $H$ indicates the coarse object centers on the image. The 2D offset $O_{2d}$ represents the residual toward the coarse 2D center, and the 2D size $S_{2d}$ represents the height and width of the 2D box. Accordingly, there is a loss function $L_H$ for the heatmap, together with corresponding loss terms for the 2D offset and the 2D size.
[0131] For image 3D information extraction networks:
[0132] For the dimensions of a three-dimensional object, the typical size transformation loss applies. For the orientation loss, the network predicts the corresponding observation angle and uses the multi-bin loss $L_\theta$. Meanwhile, the 3D position of the object is recovered using the 3D center projection on the image plane and the instance depth. The 3D center projection is obtained by predicting the offset between the 3D projection and the 2D center, with a corresponding loss function in which an asterisk (*) denotes the corresponding label. As described above, instance depth is decoupled into visual depth and attribute depth. The visual depth label is obtained by projecting LiDAR points onto the image, and the attribute depth label is obtained by subtracting the visual depth label from the instance depth label. The visual depth loss incorporates uncertainty, where $u_{vis}$ represents the visual depth uncertainty. Similarly, there are an attribute depth loss and an instance depth loss. The weight of all loss terms is set to 1.0.
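The depth labels and an uncertainty-aware depth loss can be sketched in Python as follows. The label relation (attribute depth label equals instance depth label minus visual depth label) and the unit weights come from this description. The loss form itself is an assumption: the sketch uses a commonly used Laplacian negative log-likelihood style loss, $\frac{\sqrt{2}}{u}\,|D - D^*| + \log u$, which may differ from the loss formula of this application. All function names are hypothetical.

```python
import math
from typing import Dict


def attribute_depth_label(instance_depth_label: float,
                          visual_depth_label: float) -> float:
    """Attribute depth label = instance depth label - visual depth label."""
    return instance_depth_label - visual_depth_label


def laplace_uncertainty_loss(pred: float, label: float, u: float) -> float:
    """ASSUMED loss form: sqrt(2)/u * |pred - label| + log(u), with u > 0.

    A large u down-weights the absolute error but is penalized by log(u).
    """
    if u <= 0:
        raise ValueError("uncertainty u must be positive")
    return math.sqrt(2.0) / u * abs(pred - label) + math.log(u)


def total_loss(terms: Dict[str, float]) -> float:
    """Sum all loss terms with a weight of 1.0 each."""
    return sum(1.0 * value for value in terms.values())


# Example: depth losses for one sub-region (illustrative numbers only).
d_att_label = attribute_depth_label(instance_depth_label=22.0,
                                    visual_depth_label=20.0)
losses = {
    "visual_depth": laplace_uncertainty_loss(19.5, 20.0, u=0.5),
    "attribute_depth": laplace_uncertainty_loss(2.2, d_att_label, u=0.4),
}
overall = total_loss(losses)
```

The same loss function is reused for each depth term here only to keep the sketch short; the other loss terms (heatmap, 2D offset, 2D size, 3D size, orientation) would be added to the dictionary in the same way.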
[0133] In summary, the overall loss of this initial instance depth estimation model is the sum of the above loss terms, each with a weight of 1.0.
[0134] In summary, this embodiment provides a training method for an instance depth estimation model. An initial instance depth estimation model is provided, comprising an image 2D information extraction network and an image 3D information extraction network. The image 2D information extraction network is used to predict the 2D information of objects in the image, and the image 3D information extraction network is used to predict the 3D information of objects in the image based on their 2D information. The 3D information of the objects in the image includes at least the instance depth of the objects, which is determined based on the visual depth information and attribute depth information of the objects. The visual depth information and attribute depth information of the objects are output by a portion of the network in the image 3D information extraction network.
[0135] This initial instance depth estimation model decouples the instance depth calculation process, determining it using both visual depth information and attribute depth information. Visual depth information is related to the relative position of the target object and the camera, while attribute depth information is related to the target object's inherent attributes. Calculating both visual and attribute depth information separately provides a more comprehensive and accurate understanding of the target object's instance depth. The target instance depth estimation model trained based on this initial model also decouples the instance depth calculation process during application, thereby improving the accuracy and comprehensiveness of instance depth calculation.
[0136] In addition, the training method for the instance depth estimation model provided in this embodiment allows the network to extract different types of features for different depths, facilitating model learning. Thanks to depth decoupling, the method provided in this embodiment can also effectively augment training data based on affine transformations.
[0137] Referring to Figure 7, an embodiment of this application also provides a method for estimating instance depth in 3D detection, comprising:
[0138] S710, the image to be detected, captured by a camera, is acquired.
[0139] The image to be detected is, for example, an image captured in real time by a camera while a car is moving, or an image captured in real time while a robot is moving.
[0140] S720, the image to be detected is input into the target instance depth estimation model trained by the instance depth estimation model training method provided in any of the above embodiments, and the instance depth of at least one target in the image to be detected is obtained.
[0141] It should be noted that laser point cloud data is required to train the initial instance depth estimation model during the training process. However, when using the instance depth estimation model, only the image to be detected is needed to output the instance depth of at least one target in the image to be detected.
[0142] The image to be detected is input into the target instance depth estimation model. The target instance depth estimation model processes the image to be detected based on the decoupled instance depth and outputs the instance depth of at least one target in the image to be detected.
[0143] In summary, this embodiment provides a method for instance depth estimation in 3D detection. After acquiring the image to be detected, the image is input into a target instance depth estimation model trained using the instance depth estimation model training method provided in any of the preceding embodiments. When calculating the instance depth of the target object, the process of calculating instance depth is decoupled; that is, the instance depth is determined using visual depth information and attribute depth information. Visual depth information is related to the relative position of the target object and the camera, while attribute depth information is related to the attributes of the target object itself. Calculating visual depth information and attribute depth information separately can provide a more comprehensive and accurate understanding of the instance depth of the target object. The target instance depth estimation model trained based on this initial instance depth estimation model also decouples the process of calculating instance depth during application, thereby improving the accuracy and comprehensiveness of instance depth calculation.
[0144] Referring to Figure 8, an embodiment of this application also provides a training apparatus 10 for an instance depth estimation model, comprising:
[0145] The acquisition module 11 is used to acquire an initial instance depth estimation model. The initial instance depth estimation model includes at least an image two-dimensional information extraction network and an image three-dimensional information extraction network. The image two-dimensional information extraction network is used to predict the two-dimensional information of the target object in the image, and the image three-dimensional information extraction network is used to predict the three-dimensional information of the target object in the image based on the two-dimensional information of the target object. The three-dimensional information of the target object includes at least the instance depth of the target object. The instance depth of the target object is determined according to the visual depth information and attribute depth information of the target object. The visual depth information and attribute depth information of the target object are output by a part of the network in the image three-dimensional information extraction network.
[0146] The acquisition module 11 is also used to acquire multiple training images and laser point cloud data of the multiple training images, wherein the training images include at least one target object.
[0147] Training module 12 is used to train the initial instance depth estimation model based on multiple training images and laser point cloud data of multiple training images to obtain the three-dimensional information of the target object in each training image.
[0148] The training module 12 is also used to terminate training when a termination condition is met, thus obtaining the target instance depth estimation model. The termination condition includes any one or more of the following: the training duration reaches a preset duration, the number of training iterations reaches a preset number, or the loss of the initial instance depth estimation model is less than a preset loss.
[0149] The image 2D information extraction network includes a deep feature extraction network. The training module 12 is specifically used for: inputting multiple training images into the deep feature extraction network to obtain the deep features of each training image; labeling the target area image in each training image based on the laser point cloud data of the multiple training images; and training the image 3D information extraction network based on the deep features of each training image and the target area image in each training image to obtain the visual depth information, attribute depth information, and instance depth of the target in each training image.
[0150] The image 3D information extraction network includes a target feature information extraction network and a computational network. The target feature information extraction network includes a visual depth information extraction sub-network and an attribute depth information extraction sub-network. The training module 12 is specifically used to: train the visual depth information extraction sub-network and the attribute depth information extraction sub-network based on the deep features of each training image and the target region image in each training image, respectively, to obtain the visual depth information of the target in each training image predicted by the visual depth information extraction sub-network and the attribute depth information of the target in each training image predicted by the attribute depth information extraction sub-network; the visual depth information includes visual depth and visual depth uncertainty, and the attribute depth information includes attribute depth and attribute depth uncertainty; the visual depth, visual depth uncertainty, attribute depth, and attribute depth uncertainty of the target in each training image are input into the computational network to obtain the instance depth of the target in each training image.
[0151] The training module 12 is specifically used for: dividing the target region image in each training image into multiple sub-region images to obtain a set of sub-region images; inputting the set of sub-region images into a visual depth information extraction sub-network and an attribute depth information extraction sub-network respectively to obtain the visual depth, visual depth uncertainty value, attribute depth, and attribute depth uncertainty value of each sub-region image in each target region image; inputting the visual depth, visual depth uncertainty value, attribute depth, and attribute depth uncertainty value of the target in each training image into the computational network to obtain the instance depth of the target in each training image includes: inputting the visual depth, visual depth uncertainty value, attribute depth, and attribute depth uncertainty value of each sub-region image into the computational network to obtain the instance depth of the target in each training image.
[0152] The computational network is used to: determine the instance depth of any given sub-region image based on its visual depth and attribute depth, wherein the given sub-region image belongs to any object region image in any training image; determine the instance depth uncertainty of any given sub-region image based on its visual depth uncertainty and attribute depth uncertainty; and determine the instance depth of the object in any training image based on the instance depth of the given sub-region image and its instance depth uncertainty.
[0153] The computational network is specifically used to: convert the instance depth uncertainty of any sub-region image into the instance depth confidence; and determine the instance depth of the target object in any training image based on the instance depth of any sub-region image and the instance depth confidence of any sub-region image.
[0154] Specifically, this computational network is used to determine the sum of the visual depth and attribute depth of any given sub-region image as the instance depth of that given sub-region image.
[0155] The computational network is specifically used to: calculate the square of the visual depth uncertainty of any sub-region image to obtain a first value; calculate the square of the attribute depth uncertainty of any sub-region image to obtain a second value; and determine the square root of the sum of the first value and the second value as the instance depth uncertainty of any sub-region image.
[0156] The visual depth and visual depth uncertainty of each sub-region image follow a Laplace distribution, as do the attribute depth and attribute depth uncertainty of each sub-region image.
[0157] Each training image has the same size, and the size of any training image is either the original size or a scaled-down version of the original size. The scaled-down version is obtained by scaling the dimensions of the original size that have affine transformation properties. At least one different training image originates from the same initial training image.
[0158] Referring to Figure 9, an embodiment of this application also provides an instance depth estimation device 20 for three-dimensional detection, comprising:
[0159] The acquisition module 21 is used to acquire the image to be detected captured by the camera.
[0160] Processing module 22 is used to input the image to be detected into a target instance depth estimation model trained by the instance depth estimation model training method provided in any of the above embodiments, so as to obtain the instance depth of at least one target in the image to be detected.
[0161] Referring to Figure 10, one embodiment of this application also provides an electronic device 30, including: a processor 31, and a memory 32 communicatively connected to the processor. The memory 32 stores computer-executable instructions, and the processor 31 executes the computer-executable instructions stored in the memory 32 to implement the training method of the instance depth estimation model provided in any of the preceding embodiments, and / or to implement the instance depth estimation method in 3D detection provided in any of the preceding embodiments.
[0162] One embodiment of this application also provides a computer-readable storage medium storing computer-executable instructions that, when executed, cause a computer to perform a training method for an instance depth estimation model as provided in any of the preceding embodiments, and / or to implement an instance depth estimation method in 3D detection as provided in any of the preceding embodiments.
[0163] One embodiment of this application also provides a computer program product, including a computer program that, when executed by a processor, implements a training method for an instance depth estimation model as provided in any of the preceding embodiments, and / or implements an instance depth estimation method in 3D detection as provided in any of the preceding embodiments.
[0164] It should be noted that the aforementioned computer-readable storage media can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), ferroelectric random access memory (FRAM), flash memory, magnetic surface memory, optical disc, or compact disc read-only memory (CD-ROM), etc. It can also be various electronic devices that include one or any combination of the above-mentioned memories, such as mobile phones, computers, tablet devices, personal digital assistants, etc.
[0165] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element.
[0166] The sequence numbers of the embodiments in this application are for descriptive purposes only and do not represent the superiority or inferiority of the embodiments.
[0167] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium (such as ROM / RAM, magnetic disk, optical disk) and includes several instructions to cause a terminal device (which may be a mobile phone, computer, server, air conditioner, or network device, etc.) to execute the methods described in the various embodiments of this application.
[0168] This application is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It will be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, produce a device for implementing the functions specified in one or more processes of the flowchart and / or one or more boxes of the block diagram.
[0169] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including an instruction device that implements the functions specified in one or more processes of the flowchart and / or one or more boxes of the block diagram.
[0170] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and / or one or more boxes of the block diagram.
[0171] The above are merely preferred embodiments of this application and do not limit the patent scope of this application. Any equivalent structural or procedural transformations made using the content of this application's specification and drawings, or direct or indirect applications in other related technical fields, are similarly included within the patent protection scope of this application.
Claims
1. A training method for an instance depth estimation model, characterized in that the method includes: An initial instance depth estimation model is obtained, comprising at least an image 2D information extraction network and an image 3D information extraction network. The image 2D information extraction network is used to predict the 2D information of a target object in the image, and the image 3D information extraction network is used to predict the 3D information of the target object based on its 2D information. The 3D information of the target object includes at least the instance depth of the target object, which is determined based on its visual depth information and attribute depth information. The visual depth information and attribute depth information of the target object are output by a portion of the network in the image 3D information extraction network; Multiple training images and laser point cloud data of the training images are acquired, wherein the training images include at least one target object; The initial instance depth estimation model is trained based on multiple training images and laser point cloud data of multiple training images to obtain the three-dimensional information of the target object in each training image; Training ends when the termination condition is met, and the target instance depth estimation model is obtained; The image two-dimensional information extraction network includes a deep feature extraction network; The image 3D information extraction network includes a target feature information extraction network and a computational network, and the target feature information extraction network includes a visual depth information extraction subnetwork and an attribute depth information extraction subnetwork;
Training the image 3D information extraction network based on the deep features of each training image and the target object region image in each training image, to obtain the visual depth information, attribute depth information, and instance depth of the target object in each training image, includes: training the visual depth information extraction subnetwork and the attribute depth information extraction subnetwork, respectively, based on the deep features of each training image and the target object region image in each training image, to obtain the visual depth information of the target object in each training image predicted by the visual depth information extraction subnetwork and the attribute depth information of the target object in each training image predicted by the attribute depth information extraction subnetwork, wherein the visual depth information includes a visual depth and a visual depth uncertainty value, and the attribute depth information includes an attribute depth and an attribute depth uncertainty value; and inputting the visual depth, visual depth uncertainty value, attribute depth, and attribute depth uncertainty value of the target object in each training image into the computational network to obtain the instance depth of the target object in each training image, which includes: determining the instance depth of any sub-region image based on the visual depth and attribute depth of that sub-region image, wherein the sub-region image belongs to any target object region image in any training image; determining the instance depth uncertainty value of any sub-region image based on the visual depth uncertainty value and the attribute depth uncertainty value of that sub-region image; and determining the instance depth of the target object in any training image based on the instance depth of any sub-region image and the instance depth uncertainty value of any sub-region image.
2. The method according to claim 1, characterized in that the step of training the initial instance depth estimation model based on the multiple training images and the laser point cloud data of the multiple training images includes: inputting the multiple training images into the deep feature extraction network to obtain the deep features of each training image; labeling the target object region in each training image based on the laser point cloud data of the multiple training images; and training the image 3D information extraction network based on the deep features of each training image and the target object region image in each training image, to obtain the visual depth information, attribute depth information, and instance depth of the target object in each training image.
3. The method according to claim 1, characterized in that the step of training the visual depth information extraction subnetwork and the attribute depth information extraction subnetwork, respectively, based on the deep features of each training image and the target object region image in each training image, to obtain the visual depth information of the target object in each training image predicted by the visual depth information extraction subnetwork and the attribute depth information of the target object in each training image predicted by the attribute depth information extraction subnetwork, includes: dividing the target object region image in each training image into multiple sub-region images to obtain a set of sub-region images; and inputting the set of sub-region images into the visual depth information extraction subnetwork and the attribute depth information extraction subnetwork, respectively, to obtain the visual depth, visual depth uncertainty value, attribute depth, and attribute depth uncertainty value of each sub-region image in each target object region image; and the step of inputting the visual depth, visual depth uncertainty value, attribute depth, and attribute depth uncertainty value of the target object in each training image into the computational network to obtain the instance depth of the target object in each training image includes: inputting the visual depth, visual depth uncertainty value, attribute depth, and attribute depth uncertainty value of each sub-region image into the computational network to obtain the instance depth of the target object in each training image.
4. The method according to claim 3, characterized in that the computational network is used to: determine the instance depth of any sub-region image based on the visual depth and attribute depth of that sub-region image, wherein the sub-region image belongs to any target object region image in any training image; determine the instance depth uncertainty value of any sub-region image based on the visual depth uncertainty value and the attribute depth uncertainty value of that sub-region image; and determine the instance depth of the target object in any training image based on the instance depth of any sub-region image and the instance depth uncertainty value of any sub-region image.
5. The method according to claim 4, characterized in that, when the computational network is used to determine the instance depth of the target object in each training image based on the instance depth of each sub-region image and the instance depth uncertainty value of each sub-region image, it is specifically used to: convert the instance depth uncertainty value of any sub-region image into an instance depth confidence value; and determine the instance depth of the target object in any training image based on the instance depth of any sub-region image and the instance depth confidence value of any sub-region image.
6. The method according to claim 1, characterized in that each training image has the same size; the size of any training image is either the original size or a scaled-down version of the original size; the scaled-down version of the original size is obtained by scaling down the dimensions that have affine transformation properties in the original size; and at least one different training image is derived from the same initial training image.
7. A method for estimating instance depth in 3D detection, characterized in that it comprises: acquiring an image to be detected captured by a camera; and inputting the image to be detected into the target instance depth estimation model trained by the training method for an instance depth estimation model according to any one of claims 1 to 6, so as to obtain the instance depth of at least one target object in the image to be detected.
8. An electronic device, characterized in that it comprises: a processor, and a memory communicatively connected to the processor; wherein the memory stores computer-executable instructions; and the processor executes the computer-executable instructions stored in the memory to implement the training method for an instance depth estimation model according to any one of claims 1 to 6, and/or to implement the method for estimating instance depth in 3D detection according to claim 7.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-executable instructions which, when executed, cause a computer to perform the training method for an instance depth estimation model according to any one of claims 1 to 6, and/or to perform the method for estimating instance depth in 3D detection according to claim 7.
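Illustrative sketch (not part of the claims). The computation assigned to the computational network in claims 1, 4 and 5 can be pictured roughly as follows. The claims state only that each quantity is determined "based on" its inputs, so every concrete formula here is an assumption made for illustration: the additive combination of visual depth and attribute depth, the root-sum-of-squares combination of the two uncertainty values, the exp(-u) mapping from uncertainty to confidence, and the confidence-weighted average over sub-regions. All names (SubRegionPrediction, aggregate_instance_depth, and so on) are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SubRegionPrediction:
    """Outputs of the two subnetworks for one sub-region image (hypothetical container)."""
    visual_depth: float
    visual_uncertainty: float
    attribute_depth: float
    attribute_uncertainty: float


def sub_region_instance_depth(p: SubRegionPrediction) -> float:
    # ASSUMPTION: additive combination; the claims say only "based on".
    return p.visual_depth + p.attribute_depth


def sub_region_instance_uncertainty(p: SubRegionPrediction) -> float:
    # ASSUMPTION: root-sum-of-squares of the two uncertainty values.
    return math.hypot(p.visual_uncertainty, p.attribute_uncertainty)


def uncertainty_to_confidence(uncertainty: float) -> float:
    # ASSUMPTION: exp(-u) maps an uncertainty in [0, inf) to a confidence in (0, 1].
    return math.exp(-uncertainty)


def aggregate_instance_depth(preds: Sequence[SubRegionPrediction]) -> float:
    """Confidence-weighted mean of the sub-region instance depths (assumed aggregation)."""
    if not preds:
        raise ValueError("at least one sub-region prediction is required")
    depths = [sub_region_instance_depth(p) for p in preds]
    confidences = [
        uncertainty_to_confidence(sub_region_instance_uncertainty(p)) for p in preds
    ]
    weighted_sum = sum(d * c for d, c in zip(depths, confidences))
    return weighted_sum / sum(confidences)


if __name__ == "__main__":
    # Made-up values for two sub-regions of one target object region image.
    predictions = [
        SubRegionPrediction(10.0, 0.1, 0.5, 0.2),
        SubRegionPrediction(12.0, 0.8, 0.4, 0.6),
    ]
    print(aggregate_instance_depth(predictions))
```

Under these assumptions, a sub-region with a large combined uncertainty contributes little to the object's instance depth, which is one plausible reading of why claim 5 converts the uncertainty value into a confidence value before aggregating.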