A target detection method based on millimeter wave radar and camera and related equipment
By combining feature extraction networks for millimeter-wave radar and camera data with fusion networks, the method generates accurate target detection results for autonomous driving, addressing inaccurate detection caused by data acquisition errors and improving target detection accuracy.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHONGQING CHANGAN TECH CO LTD
- Filing Date
- 2023-08-21
- Publication Date
- 2026-06-26
Figure CN117036657B_ABST
Description
Technical Field
[0001] This application relates to the field of target detection in autonomous driving vehicles, and in particular to a target detection method, system, vehicle, and computer-readable storage medium based on millimeter-wave radar and camera.
Background Technology
[0002] Currently, various types of sensors are used for target detection in autonomous driving vehicles to detect targets around the vehicle. Cameras are inexpensive to deploy and can acquire color and texture information, but lack depth information; LiDAR can acquire accurate and relatively dense depth and 3D (3-dimensional) position information, but is greatly affected by weather and has high deployment costs; millimeter-wave radar is inexpensive to deploy and can acquire depth and 3D position information, but lacks color and texture information, and the acquired data is sparse and contains many noisy points. Therefore, fusing data from millimeter-wave radar and cameras can achieve complementary advantages.
[0003] However, when fusing data from millimeter-wave radar and cameras, data acquisition errors can lead to inaccurate target detection results, which in turn can affect the detection of surrounding targets during autonomous driving.
[0004] Therefore, existing technologies still need to be improved and developed.
Summary of the Invention
[0005] The main objective of this application is to provide a target detection method, system, vehicle, and computer-readable storage medium based on millimeter-wave radar and camera, aiming to solve the problem in the prior art where data acquisition errors lead to inaccurate target detection results when using data fusion from millimeter-wave radar and camera, thus affecting the detection of surrounding targets during autonomous driving.
[0006] The first aspect of this application provides a target detection method based on millimeter-wave radar and a camera, comprising the following steps: acquiring target point cloud data collected by millimeter-wave radar and target images captured by a camera; extracting millimeter-wave features from the target point cloud data using a trained millimeter-wave feature extraction network; extracting image features from the target image using a trained image feature extraction network; inputting the millimeter-wave features and the image features into a trained region candidate network to generate and output candidate boxes; inputting the millimeter-wave features, the image features, and the candidate boxes into a trained detection head network to generate and output target boxes; and obtaining target detection results based on the target boxes.
[0007] Based on the aforementioned technical means, this application embodiment combines the low deployment cost and ability of millimeter-wave radar to acquire object depth and 3D position information with the ability of a camera to acquire object color and texture information. The information acquired by the millimeter-wave radar and camera is fused to obtain corresponding candidate boxes. After selecting the features from the millimeter-wave radar and camera using these candidate boxes, the corresponding target boxes are finally output through the detection head network. Targets selected by these target boxes yield the target detection result. This process combines the advantages of millimeter-wave radar and camera, and through fusion output, provides a global judgment and target detection result. This avoids inaccurate target detection results due to data acquisition errors, which could affect the detection of surrounding targets during autonomous driving.
[0008] Optionally, in one embodiment of this application, the step of extracting millimeter-wave features from the target point cloud data using a trained millimeter-wave feature extraction network specifically includes: performing position compensation on the target point cloud data to obtain compensated target point cloud data; inputting the compensated target point cloud data into a sparse embedding convolutional detection network in the trained millimeter-wave feature extraction network to obtain preliminary millimeter-wave features; and inputting the preliminary millimeter-wave features into a feature pyramid network in the trained millimeter-wave feature extraction network to obtain the millimeter-wave features.
[0009] Based on the above technical means, the embodiments of this application can perform position compensation on the target point cloud data after obtaining it, thereby eliminating the problem of low azimuth resolution and accuracy of the radar due to the wavelength and working principle of millimeter-wave radar, the presence of a lot of noise, and the sparse number of point clouds. Furthermore, after obtaining the corresponding compensated target point cloud data, millimeter-wave features are obtained through a sparse embedding convolutional detection network and a feature pyramid network. Since a sparse embedding convolutional detection network is used, the problem of a large number of invalid calculations can be greatly reduced. At the same time, the feature pyramid can be used to obtain millimeter-wave features that integrate multi-scale features.
[0010] Optionally, in one embodiment of this application, the step of extracting image features from the target image through a trained image feature extraction network specifically includes: inputting the target image into the residual network of the trained image feature extraction network to obtain preliminary image features; and inputting the preliminary image features into the feature pyramid network of the trained image feature extraction network to obtain the image features.
[0011] Based on the above technical means, the embodiments of this application can process multiple target images through two networks to obtain image features. Specifically, a residual network containing multiple convolutions is used to process the target image to obtain preliminary image features. The preliminary image features are then processed by a feature pyramid network, thereby effectively obtaining the semantic features contained in the image and obtaining more accurate image features.
[0012] Optionally, in one embodiment of this application, the trained region candidate network includes a first feature weight prediction module and a convolutional neural network.
[0013] Based on the above technical means, the embodiments of this application combine the first feature weight prediction module and the convolutional neural network into a trained region candidate network. The first feature weight prediction module fuses the millimeter-wave features and the image features, and the convolutional neural network then generates the corresponding candidate boxes from the fused features. Because feature fusion is used, the resulting candidate boxes can select more accurate features.
[0014] Optionally, in one embodiment of this application, the step of inputting the millimeter-wave features and the image features into a trained region candidate network to generate and output candidate boxes specifically includes: generating anchor boxes based on the millimeter-wave features and the image features; projecting the anchor boxes onto the millimeter-wave features and the image features to obtain first region of interest features and second region of interest features; inputting the first region of interest features and the second region of interest features into the first feature weight prediction module to obtain a first fused feature; and inputting the first fused feature into the convolutional neural network to generate the candidate boxes.
[0015] Based on the above technical means, this embodiment of the application generates anchor boxes from the millimeter-wave features and the image features and projects them onto both sets of features, which ties each box to the features it covers. The projected features are then processed by neural networks to refine the anchor boxes into candidate boxes. In particular, because the anchor boxes are generated from the millimeter-wave radar and image features and serve as preliminary boxes, no separate initial box has to be generated, which simplifies candidate box generation. The millimeter-wave features and image features selected by the anchor boxes are then processed by multiple neural networks to obtain accurate candidate boxes.
[0016] Optionally, in one embodiment of this application, the trained detection head network includes a second feature weight prediction module and a multilayer perceptron network.
[0017] Based on the above technical means, after the candidate boxes are generated, the embodiments of this application process them with a detection head composed of the second feature weight prediction module and the multilayer perceptron network, thereby obtaining accurate target boxes and achieving final target detection through the target boxes.
[0018] Optionally, in one embodiment of this application, the step of inputting the millimeter-wave features, the image features, and the candidate boxes into a trained detection head network to generate and output target boxes specifically includes: projecting the candidate boxes onto the millimeter-wave features and the image features to obtain third region of interest features and fourth region of interest features; inputting the third region of interest features and the fourth region of interest features into a second feature weight prediction module to obtain second fused features; and inputting the second fused features into the multilayer perceptron network to generate the target boxes.
[0019] According to the above technical means, when generating the target boxes, the third and fourth region of interest features obtained by projecting the candidate boxes are input into the second feature weight prediction module to obtain the second fused feature. The second fused feature is then input into the multilayer perceptron network to obtain accurate target boxes and to detect vehicles, pedestrians, and other objects around the vehicle. The multilayer perceptron has better generalization ability, which yields more accurate target boxes.
[0020] A second aspect of this application provides a target detection system based on millimeter-wave radar and a camera. The system includes: a feature generation module for acquiring target point cloud data collected by the millimeter-wave radar and target images captured by the camera; extracting millimeter-wave features from the target point cloud data using a trained millimeter-wave feature extraction network; and extracting image features from the target images using a trained image feature extraction network; a candidate box generation module for inputting the millimeter-wave features and the image features into a trained region candidate network to generate and output candidate boxes; a target box generation module for inputting the millimeter-wave features, the image features, and the candidate boxes into a trained detection head network to generate and output target boxes; and a result output module for obtaining target detection results based on the target boxes.
[0021] Optionally, in one embodiment of this application, the feature generation module includes: a position compensation unit, used to perform position compensation on the target point cloud data to obtain compensated target point cloud data; a preliminary millimeter-wave feature generation unit, used to input the compensated target point cloud data into a sparse embedding convolutional detection network in the trained millimeter-wave feature extraction network to obtain preliminary millimeter-wave features; and a millimeter-wave feature generation unit, used to input the preliminary millimeter-wave features into a feature pyramid network in the trained millimeter-wave feature extraction network to obtain the millimeter-wave features.
[0022] Optionally, in one embodiment of this application, the feature generation module includes: a preliminary image feature generation unit, used to input the target image into the residual network in the trained image feature extraction network to obtain preliminary image features; and an image feature generation unit, used to input the preliminary image features into the feature pyramid network in the trained image feature extraction network to obtain the image features.
[0023] Optionally, in one embodiment of this application, the trained region candidate network includes a first feature weight prediction module and a convolutional neural network.
[0024] Optionally, in one embodiment of this application, the candidate box generation module includes: a candidate region of interest generation unit, configured to generate anchor boxes based on the millimeter-wave features and the image features, and project the anchor boxes onto the millimeter-wave features and the image features to obtain a first region of interest feature and a second region of interest feature; a candidate fusion unit, configured to input the first region of interest feature and the second region of interest feature into the first feature weight prediction module to obtain a first fused feature; and a candidate box generation unit, configured to input the first fused feature into the convolutional neural network to generate the candidate boxes.
[0025] Optionally, in one embodiment of this application, the trained detection head network includes a second feature weight prediction module and a multilayer perceptron network.
[0026] Optionally, in one embodiment of this application, the target box generation module includes: a target region of interest generation unit, used to project the candidate boxes onto the millimeter-wave features and the image features to obtain a third region of interest feature and a fourth region of interest feature; a target fusion unit, used to input the third region of interest feature and the fourth region of interest feature into the second feature weight prediction module to obtain a second fused feature; and a target box generation unit, used to input the second fused feature into the multilayer perceptron network to generate the target boxes.
[0027] A third aspect of this application provides a vehicle, the vehicle including: a memory, a processor, and a target detection program based on millimeter-wave radar and camera stored in the memory and executable on the processor, wherein when the target detection program based on millimeter-wave radar and camera is executed by the processor, it implements the steps of the target detection method based on millimeter-wave radar and camera as described in the above embodiments.
[0028] A fourth aspect of this application provides a computer-readable storage medium storing a target detection program based on millimeter-wave radar and camera. When executed by a processor, the target detection program based on millimeter-wave radar and camera implements the steps of the target detection method based on millimeter-wave radar and camera as described in the above embodiments.
[0029] The beneficial effects of this application are:
[0030] (1) This embodiment combines the low deployment cost and the ability of millimeter-wave radar to acquire object depth and 3D position information with the ability of a camera to acquire object color and texture information. The information acquired by the millimeter-wave radar and camera is fused to obtain corresponding candidate boxes. After selecting the features of the millimeter-wave radar and camera using these candidate boxes, the corresponding target boxes are finally output through the detection head network. The target selected by these target boxes yields the target detection result. This process combines the advantages of millimeter-wave radar and camera, and by fusing the output results, a global judgment is made to provide the target detection result. This avoids the problem of inaccurate target detection results due to data acquisition errors, which could affect the detection of surrounding targets during autonomous driving.
[0031] (2) The embodiments of this application perform position compensation on the point cloud data acquired by the millimeter-wave radar and converge multiple frames of point cloud data, thereby eliminating the problem of sparse point clouds caused by the wavelength and working principle of millimeter-wave radar.
[0032] (3) The embodiments of this application use the front view features of the image and the bird’s-eye view features of the millimeter wave features to predict different targets. This effectively utilizes the differences in the features in terms of view to compensate for each other, while avoiding the huge amount of computation required to project the image features in the front view to the bird’s-eye view and the errors that may be introduced therein.
[0033] (4) The embodiments of this application address feature mismatch between multiple sensors by predicting weights. When the features of different sensors conflict with each other, or are misaligned because of differences in viewing angle, acquisition time, or similar factors, the weights allow the more reliable features to be used for subsequent prediction, thereby realizing feature fusion efficiently and improving detection accuracy.
[0034] Additional aspects and advantages of this application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of this application.
Attached Figure Description
[0035] To more clearly illustrate the technical solutions in the embodiments of this application, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0036] Figure 1 is a flowchart of a preferred embodiment of the target detection method based on millimeter-wave radar and camera in this application;
[0037] Figure 2 is a schematic diagram of the network flow of the target detection method based on millimeter-wave radar and camera in this application;
[0038] Figure 3 is a schematic diagram of the region candidate network in the target detection method based on millimeter-wave radar and camera in this application;
[0039] Figure 4 is a schematic diagram of the feature weight prediction module in the target detection method based on millimeter-wave radar and camera in this application;
[0040] Figure 5 is a schematic diagram of a preferred embodiment of the target detection system based on millimeter-wave radar and camera in this application;
[0041] Figure 6 is a structural schematic diagram of a preferred embodiment of the vehicle described in this application.
[0042] Reference numerals: 10, target detection system based on millimeter-wave radar and camera; 100, feature generation module; 200, candidate box generation module; 300, target box generation module; 400, result output module; 501, memory; 502, processor; 503, communication interface.
Detailed Implementation
[0043] The embodiments of this application are described in detail below. Examples of these embodiments are shown in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and intended to explain this application, and should not be construed as limiting this application.
[0044] The following describes a target detection method and related equipment based on millimeter-wave radar and camera according to embodiments of this application, with reference to the accompanying drawings. Addressing the problem mentioned in the background art where data acquisition errors lead to inaccurate target detection results when fusing data from millimeter-wave radar and cameras, thus affecting the detection of surrounding targets during autonomous driving, this application provides a target detection method based on millimeter-wave radar and camera. In this method, target point cloud data acquired by millimeter-wave radar and target images captured by a camera are acquired. Millimeter-wave features are extracted from the target point cloud data using a trained millimeter-wave feature extraction network, and image features are extracted from the target image using a trained image feature extraction network. The millimeter-wave features and image features are input into a trained region candidate network to generate and output candidate boxes. The millimeter-wave features, image features, and candidate boxes are input into a trained detection head network to generate and output target boxes. Based on the target boxes, a target detection result is obtained. This solves the technical problem in the related art where data acquisition errors lead to inaccurate target detection results when fusing data from millimeter-wave radar and cameras, thus affecting the detection of surrounding targets during autonomous driving.
[0045] In this application, target point cloud data acquired by millimeter-wave radar and target images captured by a camera are obtained. Millimeter-wave features are extracted from the target point cloud data using a trained millimeter-wave feature extraction network, and image features are extracted from the target images using a trained image feature extraction network. The millimeter-wave features and the image features are input into a trained region candidate network to generate and output candidate boxes. The millimeter-wave features, the image features, and the candidate boxes are input into a trained detection head network to generate and output target boxes. Based on the target boxes, target detection results are obtained.
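The data flow described above can be sketched as a small skeleton. This is only an illustration under stated assumptions: the names `DetectionNetworks` and `detect_targets` are hypothetical and do not appear in the application, and each network is represented by a plain callable standing in for a trained model.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class DetectionNetworks:
    """Hypothetical container for the four trained networks (stand-in callables)."""
    radar_backbone: Callable[[Any], Any]            # point cloud -> millimeter-wave features
    image_backbone: Callable[[Any], Any]            # image -> image features
    region_candidate_network: Callable[[Any, Any], Any]
    detection_head: Callable[[Any, Any, Any], Any]


def detect_targets(point_cloud, image, nets):
    """Run the four stages in the order given in the text."""
    # Extract features from each sensor separately.
    radar_features = nets.radar_backbone(point_cloud)
    image_features = nets.image_backbone(image)
    # Both feature sets go into the region candidate network to get candidate boxes.
    candidate_boxes = nets.region_candidate_network(radar_features, image_features)
    # Features plus candidate boxes go into the detection head to get target boxes.
    target_boxes = nets.detection_head(radar_features, image_features, candidate_boxes)
    # The detection result is derived from the target boxes.
    return target_boxes
```

Note that both feature sets are passed twice: once to propose candidate boxes and again, together with those boxes, to the detection head.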
[0046] Specifically, Figure 1 is a schematic flowchart of a target detection method based on millimeter-wave radar and camera provided in an embodiment of this application.
[0047] As shown in Figure 1, the target detection method based on millimeter-wave radar and camera includes the following steps:
[0048] In step S101, target point cloud data acquired by millimeter-wave radar and target images captured by camera are obtained. Millimeter-wave features are extracted from the target point cloud data through a trained millimeter-wave feature extraction network, and image features are extracted from the target images through a trained image feature extraction network.
[0049] Specifically, millimeter-wave radar operates in the millimeter-wave band. Millimeter waves typically refer to the 30–300 GHz frequency band. Since the wavelength of millimeter waves falls between centimeter waves and light waves, they combine the advantages of microwave guidance and photoelectric guidance. Compared to centimeter-wave seekers, millimeter-wave seekers are smaller, lighter, and have higher spatial resolution. Compared to infrared, laser, and television optical seekers, millimeter-wave seekers have a stronger ability to penetrate fog, smoke, and dust, and are suitable for all weather conditions (except heavy rain) and all-day operation.
[0050] Understandably, millimeter-wave radar, mounted on the vehicle, acquires target environment information to obtain target point cloud data, with the radar extracting data from a bird's-eye view. Similarly, a camera, mounted on the vehicle, captures images of the target environment from a forward-looking perspective. The target environment is the setting from which the user needs information, and its range is pre-defined by the user. A trained millimeter-wave feature extraction network extracts millimeter-wave features from the target point cloud data; a trained image feature extraction network extracts image features from the target images. The millimeter-wave feature extraction network is pre-trained using a labeled dataset.
[0051] Specifically, the trained millimeter-wave feature extraction network includes a sparse embedding convolutional detection network and a feature pyramid network; the trained image feature extraction network includes a residual network and a feature pyramid network.
[0052] Among them, the Sparsely Embedded Convolutional Detection (SECOND) network of the trained millimeter-wave feature extraction network is used to extract features from the target, and the Feature Pyramid Network (FPN) of the trained millimeter-wave feature extraction network is used to obtain millimeter-wave feature maps that integrate multi-scale features, i.e., millimeter-wave features; the residual network is used to obtain image feature maps, and the Feature Pyramid Network of the trained image feature extraction network is used to obtain image feature maps that integrate multi-scale features, i.e., image features.
[0053] Furthermore, the extraction of millimeter-wave features from the target point cloud data using a trained millimeter-wave feature extraction network specifically includes:
[0054] Position compensation is performed on the target point cloud data to obtain compensated target point cloud data; the compensated target point cloud data is input into the sparse embedding convolutional detection network in the trained millimeter-wave feature extraction network to obtain preliminary millimeter-wave features; the preliminary millimeter-wave features are input into the feature pyramid network in the trained millimeter-wave feature extraction network to obtain the millimeter-wave features.
[0055] Specifically, this embodiment uses a common 3D millimeter-wave radar, which can acquire the radial distance, radial velocity, and horizontal azimuth of each reflection point. However, because of the wavelength and working principle of millimeter-wave radar, its azimuth resolution and accuracy are low, its returns contain a lot of noise, and its point cloud is sparse. This embodiment therefore performs position compensation on the target point cloud data. A total of 6 frames of point cloud data are extracted and converged. Based on the radial velocity and vehicle trajectory provided by the millimeter-wave radar, and on the time difference from the current frame, the radial movement distance of the data points in each frame is calculated and their positions are compensated. In other words, the position that each point from the previous 6 frames would occupy in a given frame is obtained from the time difference, the radial velocity, and the vehicle trajectory, and the point cloud data of those 6 frames is then fused into a single frame to obtain the compensated point cloud data. The compensated point cloud data is then voxelized. Voxelization converts the geometric representation of an object into the voxel representation that most closely approximates it, producing a volumetric dataset that contains both the surface information of the model and its internal properties. During voxelization, the point cloud data is rasterized according to a preset voxel length, width, and height and assigned to the corresponding voxels, which reduces data dimensionality and improves feature extraction efficiency. The preset voxel length, width, and height are set according to the actual situation.
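A minimal sketch of the compensation and voxelization steps follows, written with the Python standard library only. It rests on several assumptions that the text does not state: each point is given as (radial distance, azimuth, radial velocity), the vehicle trajectory is available as a 2D pose (x, y, yaw) per frame in a shared world frame, and voxelization is reduced to a 2D grid because this radar type reports no elevation. All function names and the frame dictionary layout are hypothetical.

```python
import math
from collections import defaultdict


def compensate_point(r, azimuth, v_r, dt):
    """Shift one reflection along its radial direction by v_r * dt; return (x, y)."""
    r_now = r + v_r * dt  # radial movement distance over the time difference
    return r_now * math.cos(azimuth), r_now * math.sin(azimuth)


def to_current_frame(x, y, past_pose, current_pose):
    """Map a point from a past vehicle frame into the current vehicle frame.

    Poses are (x, y, yaw) in a shared world frame (assumed format).
    """
    px, py, pyaw = past_pose
    cx, cy, cyaw = current_pose
    # Past vehicle frame -> world frame.
    wx = px + x * math.cos(pyaw) - y * math.sin(pyaw)
    wy = py + x * math.sin(pyaw) + y * math.cos(pyaw)
    # World frame -> current vehicle frame.
    dx, dy = wx - cx, wy - cy
    return (dx * math.cos(cyaw) + dy * math.sin(cyaw),
            -dx * math.sin(cyaw) + dy * math.cos(cyaw))


def accumulate_frames(frames, current_time, current_pose):
    """Fuse several past frames into one compensated point list."""
    merged = []
    for frame in frames:
        dt = current_time - frame["time"]
        for r, azimuth, v_r in frame["points"]:
            x, y = compensate_point(r, azimuth, v_r, dt)
            merged.append(to_current_frame(x, y, frame["pose"], current_pose))
    return merged


def voxelize(points, voxel_size, origin=(0.0, 0.0)):
    """Assign 2D points to grid cells; returns {(ix, iy): [points]}."""
    cells = defaultdict(list)
    for x, y in points:
        ix = math.floor((x - origin[0]) / voxel_size[0])
        iy = math.floor((y - origin[1]) / voxel_size[1])
        cells[(ix, iy)].append((x, y))
    return dict(cells)
```

In practice the list passed as `frames` would hold the 6 frames mentioned above, and the number of points in each cell could serve as a simple occupancy measure.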
[0056] The voxelized, compensated target point cloud data, i.e., the volumetric dataset, is input into the sparse embedding convolutional detection network in the trained millimeter-wave feature extraction network to obtain preliminary millimeter-wave features. In addition to the voxelized target point cloud data, the trained millimeter-wave feature extraction network also takes the reflection intensity of the point cloud, the voxel occupancy rate, and the velocity relative to the ground as inputs for feature extraction. The velocity relative to the ground refers to the velocity of the vehicle equipped with the millimeter-wave radar relative to the ground. The preliminary millimeter-wave features are then input into the feature pyramid network in the millimeter-wave feature extraction network. The feature pyramid network consists of four convolutional blocks containing 2, 2, 3, and 3 convolutional layers respectively. Batch Normalization (BN) and a ReLU (Rectified Linear Unit) activation function are applied after each convolutional layer to achieve faster convergence. A max-pooling layer with a stride of 2 is added at the end of each convolutional block, so that each block halves the size of its input while doubling the feature dimension. After the last convolutional block, a 1x1 convolution and bilinear upsampling double the feature map size and halve the feature dimension, and the result is added element-wise to the output of the third convolutional block to obtain the first added feature. The first added feature is again passed through a 1x1 convolution and bilinear upsampling, which double its size and halve its feature dimension, and is added to the output of the second convolutional block to obtain the second added feature. The second added feature is passed through a 1x1 convolution and bilinear upsampling once more, doubling its size and halving its feature dimension, to obtain the final feature map, which is half the size of the original input.
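The size and channel arithmetic of this pyramid can be checked with a short trace. The sketch below does not build a network; it only follows the spatial size and feature dimension through the four blocks and three upsampling steps. The function name `trace_pyramid` and the example input size and channel count are hypothetical, and the sketch assumes that each block doubles the feature dimension of its own input.

```python
def trace_pyramid(input_size, input_channels, layers_per_block=(2, 2, 3, 3)):
    """List (stage name, spatial size, channels) for each stage of the pyramid."""
    stages = []
    block_outputs = []
    size, channels = input_size, input_channels

    # Bottom-up path: conv layers (each with BN + ReLU), then stride-2 max pooling.
    for index, n_layers in enumerate(layers_per_block, start=1):
        size, channels = size // 2, channels * 2
        block_outputs.append((size, channels))
        stages.append((f"block {index} ({n_layers} conv layers)", size, channels))

    # Top-down path: a 1x1 conv halves the channels, bilinear upsampling doubles the size.
    # The first two results are added to the outputs of blocks 3 and 2 respectively.
    for step, lateral_block in enumerate((3, 2, None), start=1):
        size, channels = size * 2, channels // 2
        if lateral_block is not None:
            lateral = block_outputs[lateral_block - 1]
            if lateral != (size, channels):
                raise ValueError(f"shape mismatch with block {lateral_block}: {lateral}")
        stages.append((f"upsample {step}", size, channels))
    return stages
```

For a 256-wide input with 16 channels, the trace ends at size 128, which matches the statement that the final feature map is half the size of the original input. Element-wise addition is only possible because each upsampled map has the same size and feature dimension as the block output it is added to.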
[0057] It is understood that, after obtaining the target point cloud data, the embodiments of this application can perform position compensation on the target point cloud data, thereby eliminating the problem of low azimuth resolution and accuracy of the radar due to the wavelength and working principle of millimeter-wave radar, the presence of a lot of noise, and the sparse number of point clouds. Furthermore, after obtaining the corresponding compensated target point cloud data, millimeter-wave features are obtained through a sparse embedding convolutional detection network and a feature pyramid network. Since a sparse embedding convolutional detection network is used, the problem of a large number of invalid calculations can be greatly reduced. At the same time, the feature pyramid can obtain millimeter-wave features that integrate multi-scale features.
[0058] Furthermore, the step of extracting image features from the target image using a trained image feature extraction network specifically includes: inputting the target image into the residual network of the trained image feature extraction network to obtain preliminary image features; and inputting the preliminary image features into the feature pyramid network of the trained image feature extraction network to obtain the image features.
[0059] Specifically, in this embodiment, the residual network is a ResNet50 (Residual Neural Network 50) neural network. The target image is input into the ResNet50 neural network to obtain preliminary image features. These preliminary image features are then input into the feature pyramid network of the trained image feature extraction network to obtain the image features. The feature pyramid network in the trained image feature extraction network has the same structure as the feature pyramid network in the trained millimeter-wave feature extraction network. The image feature extraction network is pre-trained using a labeled dataset.
[0060] It is understood that the embodiments of this application can process multiple target images through two networks to obtain image features. Specifically, a residual network containing multiple convolutions is used to process the target image to obtain preliminary image features. The preliminary image features are then processed by a feature pyramid network, thereby effectively obtaining the semantic features contained in the image and obtaining more accurate image features.
[0061] In step S102, the millimeter-wave features and the image features are input into a trained region candidate network to generate and output candidate boxes.
[0062] Specifically, as shown in Figure 3, the trained region candidate network includes a first feature weight prediction module and a convolutional neural network (CNN). The first feature weight prediction module consists of a 3*3 convolutional layer with an output feature dimension of 1 and a max pooling layer whose kernel is the same size as the input feature, and it uses the sigmoid function as the activation function. The region candidate network is pre-trained on a labeled dataset.
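The structure of the feature weight prediction module (3*3 convolution to one channel, max pooling over the whole map, sigmoid) can be sketched in plain Python. The function names, the zero padding, and the kernel values are assumptions for illustration; the source does not state the padding, nor how separate weights for the two modalities are read out, so this sketch returns a single scalar weight for one input feature.

```python
import math


def conv3x3_single_output(feature, kernels, bias=0.0):
    """3x3 convolution from C input channels to 1 output channel (zero padding assumed).

    feature: C x H x W nested lists; kernels: C x 3 x 3 nested lists.
    """
    c, h, w = len(feature), len(feature[0]), len(feature[0][0])
    out = [[bias] * w for _ in range(h)]
    for ch in range(c):
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < h and 0 <= jj < w:
                            acc += feature[ch][ii][jj] * kernels[ch][di + 1][dj + 1]
                out[i][j] += acc
    return out


def predict_weight(feature, kernels, bias=0.0):
    """Return a scalar weight in (0, 1) for the given feature."""
    response = conv3x3_single_output(feature, kernels, bias)
    # Max pooling with a kernel as large as the map keeps only the peak response.
    peak = max(max(row) for row in response)
    return 1.0 / (1.0 + math.exp(-peak))


# Usage: one 3x3 channel of ones with an all-ones kernel.
ones = [[[1.0] * 3 for _ in range(3)]]
weight = predict_weight(ones, kernels=ones)
```

Because the sigmoid is monotonically increasing, applying it before or after the max pooling gives the same value, so the order of those two steps does not affect the result.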
[0063] Furthermore, the step of inputting the millimeter-wave features and the image features into a trained region candidate network to generate and output candidate boxes specifically includes: generating anchor boxes based on the millimeter-wave features and the image features; projecting the anchor boxes onto the millimeter-wave features and the image features to obtain first region of interest features and second region of interest features; inputting the first region of interest features and the second region of interest features into the first feature weight prediction module to obtain a first fused feature; and inputting the first fused feature into the convolutional neural network to generate the candidate boxes.
[0064] Since image features are extracted from the front view and millimeter-wave features from the bird's-eye view, when 3D anchor boxes and 3D candidate boxes are generated, the length, width, and height of the boxes can be obtained from the image features, and the specific 3D position of the boxes can be obtained from the millimeter-wave features. Because the heights of road objects vary widely, multiple layers of 3D anchor boxes are stacked to match actual road conditions. The generated 3D anchor boxes are projected onto the image features in the front view and the millimeter-wave features in the bird's-eye view, respectively, to obtain the first region of interest (ROI) features and the second ROI features. Since the ROI features have different sizes in different views, they are resized to a uniform size using bilinear interpolation through RoIAlign (a region feature aggregation method). The resized first and second ROI features are concatenated and input into the first feature weight prediction module, which obtains the weights of the image features and the millimeter-wave features. Based on these weights, the first and second ROI features are weighted and summed to obtain the first fused feature. During training, the first feature weight prediction module updates the weights of the different features as the overall network weights are updated. As shown in Figure 4, the feature weighted fusion module processes the input millimeter-wave region of interest features and visual region of interest features to obtain the first fused feature. Based on the first fused feature, a convolutional neural network with two convolutional layers of output dimension 256 predicts the final size and position of the candidate boxes, thereby obtaining the candidate boxes.
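The resize-then-fuse step can be illustrated with a simplified, single-channel RoIAlign followed by a weighted sum. This is a sketch under assumptions rather than the embodiment's implementation: the function names are hypothetical, each output bin is sampled once at its centre (practical RoIAlign implementations often average several samples per bin), pixel values are treated as located at integer coordinates, and the two fusion weights are passed in as plain numbers instead of being predicted.

```python
def bilinear_sample(fm, y, x):
    """Bilinearly interpolate a 2D feature map at a continuous (y, x) location."""
    h, w = len(fm), len(fm[0])
    y = min(max(y, 0.0), h - 1)
    x = min(max(x, 0.0), w - 1)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = fm[y0][x0] * (1 - dx) + fm[y0][x1] * dx
    bottom = fm[y1][x0] * (1 - dx) + fm[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(fm, roi, out_size):
    """Resize the region roi = (y_min, x_min, y_max, x_max) to out_size x out_size."""
    y_min, x_min, y_max, x_max = roi
    bin_h = (y_max - y_min) / out_size
    bin_w = (x_max - x_min) / out_size
    return [
        [
            bilinear_sample(fm, y_min + (i + 0.5) * bin_h, x_min + (j + 0.5) * bin_w)
            for j in range(out_size)
        ]
        for i in range(out_size)
    ]


def weighted_fusion(roi_image, roi_mmw, w_image, w_mmw):
    """Element-wise weighted sum of two ROI features of identical size."""
    return [
        [w_image * a + w_mmw * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(roi_image, roi_mmw)
    ]


# Usage: ROIs of different sizes in the two views are brought to 2x2 and fused.
image_fm = [[0.0, 1.0], [2.0, 3.0]]
mmw_fm = [[4.0] * 3 for _ in range(3)]
roi_image = roi_align(image_fm, (0.0, 0.0, 1.0, 1.0), out_size=2)
roi_mmw = roi_align(mmw_fm, (0.0, 0.0, 2.0, 2.0), out_size=2)
fused = weighted_fusion(roi_image, roi_mmw, w_image=0.25, w_mmw=0.75)
```

Sampling at continuous coordinates avoids the rounding of box edges to whole cells, which is the property that distinguishes RoIAlign from plain ROI pooling.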
[0065] It is understood that the embodiments of this application can generate anchor boxes simply from the millimeter-wave features and the image features. Projecting the anchor boxes onto the millimeter-wave features and the image features introduces boxes into the features. The features are then processed by a neural network to refine the anchor boxes into candidate boxes. In this embodiment, generating the anchor boxes from the millimeter-wave and image features introduces preliminary boxes rather than generating initial boxes, which simplifies candidate box generation. Subsequently, multiple neural networks process the millimeter-wave features and image features onto which the anchor boxes are projected to obtain accurate candidate boxes.
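Generating stacked 3D anchor boxes and projecting them into the two views can be sketched as follows. The example assumes axis-aligned anchors expressed directly in a camera frame (X right, Y down, Z forward), a pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, and no extrinsic rotation between radar and camera; none of these details are specified in the source, and all function names are illustrative.

```python
from itertools import product


def generate_anchors(x_centers, z_centers, y_layers, size):
    """Place anchors on a ground grid and stack them at several heights.

    Each anchor is (x, y, z, width, height, length); size = (width, height, length).
    """
    return [(x, y, z, *size) for x, z, y in product(x_centers, z_centers, y_layers)]


def corners(anchor):
    """Return the eight corners of an axis-aligned 3D box."""
    x, y, z, w, h, l = anchor
    return [
        (x + sx * w / 2, y + sy * h / 2, z + sz * l / 2)
        for sx, sy, sz in product((-1, 1), repeat=3)
    ]


def project_to_bev(anchor):
    """Bird's-eye view footprint: drop the height axis, keep (x_min, z_min, x_max, z_max)."""
    x, _, z, w, _, l = anchor
    return (x - w / 2, z - l / 2, x + w / 2, z + l / 2)


def project_to_image(anchor, fx, fy, cx, cy):
    """Front-view 2D box enclosing the pinhole projection of the corners in front of the camera."""
    pts = [(fx * X / Z + cx, fy * Y / Z + cy) for X, Y, Z in corners(anchor) if Z > 0]
    if not pts:
        return None  # the whole anchor lies behind the camera
    us, vs = zip(*pts)
    return (min(us), min(vs), max(us), max(vs))


# Usage: one ground position, two stacked height layers.
anchors = generate_anchors([0.0], [10.0], [0.0, 1.0], size=(2.0, 2.0, 4.0))
bev_box = project_to_bev(anchors[0])
image_box = project_to_image(anchors[0], fx=100.0, fy=100.0, cx=0.0, cy=0.0)
```

The resulting 2D boxes would then be scaled by the stride of each feature map before region of interest features are cropped.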
[0066] In step S103, the millimeter-wave features, the image features, and the candidate bounding boxes are input into the trained detection head network to generate and output target bounding boxes.
[0067] Specifically, the trained detection head network includes a second feature weight prediction module and a multilayer perceptron network. The second feature weight prediction module has the same structure and data processing method as the first feature weight prediction module, and the multilayer perceptron network of the trained detection head network is a three-layer multilayer perceptron (MLP) network. In this embodiment, the detection head network is pre-trained using a labeled dataset.
[0068] Further, the candidate box is projected onto the millimeter-wave feature and the image feature to obtain the third region of interest feature and the fourth region of interest feature; the third region of interest feature and the fourth region of interest feature are input into the second feature weight prediction module to obtain the second fused feature; the second fused feature is input into the multilayer perceptron network to generate the target box.
[0069] Specifically, the candidate boxes output from the region candidate network are projected onto the image features in the front view and the millimeter-wave features in the bird's-eye view, respectively. The second feature weight prediction module then obtains the fused second feature map, i.e., the second fused feature. The second fused feature is then input into a three-layer MLP network to obtain the target box.
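A three-layer MLP head of the kind described can be sketched in plain Python. The layer widths, the ReLU activation between layers, the random initialisation, and the six-value output are assumptions chosen for illustration; the source specifies only that the network has three layers and outputs the target box.

```python
import random


def linear(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialisation only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def mlp_head(fused_feature, layers):
    """Apply the layers in order, with ReLU after every layer except the last."""
    x = list(fused_feature)
    for index, (weights, bias) in enumerate(layers):
        x = linear(x, weights, bias)
        if index < len(layers) - 1:
            x = relu(x)
    return x


# Usage: a flattened 8-value fused feature mapped to 6 hypothetical box parameters.
rng = random.Random(0)
layers = [init_layer(8, 16, rng), init_layer(16, 16, rng), init_layer(16, 6, rng)]
box_params = mlp_head([0.5] * 8, layers)
```

In practice the second fused feature would be flattened before being passed to the first layer, and the weights would come from training rather than random initialisation.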
[0070] It is understood that, in the embodiments of this application, when generating the target box, the third region of interest features and the fourth region of interest features obtained by projecting the candidate box are input into the second feature weight prediction module to obtain the second fused feature. The second fused feature is then input into the corresponding multilayer perceptron to obtain an accurate target box and to detect vehicles, pedestrians, and objects around the vehicle. Because the multilayer perceptron has better generalization ability, a more accurate target box is obtained.
[0071] In step S104, the target detection result is obtained based on the target bounding box.
[0072] Understandably, once obtained, the target bounding box encloses the vehicles, pedestrians, and objects around the vehicle. Therefore, the target detection result, that is, the detection of vehicles, pedestrians, and objects around the vehicle, can be obtained from the target bounding box.
[0073] Furthermore, Figure 2 further illustrates the network flow of the target detection method based on millimeter-wave radar and camera in the embodiments of this application. Specifically, the target point cloud data and the target image obtained from the millimeter-wave radar and the camera are processed to obtain millimeter-wave features and image features. Anchor boxes are generated from the millimeter-wave features and image features, projected onto the millimeter-wave features and image features, and then input into the trained region candidate network. After the candidate boxes are output, they are projected onto the millimeter-wave features and image features and input into the trained detection head network, finally yielding the target box.
[0074] In summary, this application embodiment acquires target point cloud data collected by millimeter-wave radar and target images captured by a camera. A trained millimeter-wave feature extraction network extracts millimeter-wave features from the target point cloud data, and a trained image feature extraction network extracts image features from the target images. The millimeter-wave features and the image features are input into a trained region candidate network to generate and output candidate boxes. The millimeter-wave features, the image features, and the candidate boxes are input into a trained detection head network to generate and output target boxes. Based on the target boxes, the target detection result is obtained.
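The data flow summarized above can be expressed as a short composition of four stages. This is only a structural sketch: `detect` and its keyword arguments are hypothetical names, and each stage is passed in as a callable so that the example makes no claim about how the trained networks are implemented.

```python
def detect(point_cloud, image, *, mmw_extractor, image_extractor,
           region_candidate_net, detection_head, to_result):
    """Run the four steps in order and return the detection result."""
    # Feature extraction from each sensor.
    mmw_features = mmw_extractor(point_cloud)
    image_features = image_extractor(image)
    # S102: candidate boxes from both feature sets.
    candidate_boxes = region_candidate_net(mmw_features, image_features)
    # S103: target boxes from both feature sets plus the candidates.
    target_boxes = detection_head(mmw_features, image_features, candidate_boxes)
    # S104: detection result derived from the target boxes.
    return to_result(target_boxes)


# Usage with trivial stand-in stages that only record what they received.
result = detect(
    "point-cloud", "image",
    mmw_extractor=lambda pc: ("mmw", pc),
    image_extractor=lambda img: ("img", img),
    region_candidate_net=lambda m, i: [("candidate", m[0], i[0])],
    detection_head=lambda m, i, c: [("target", len(c))],
    to_result=lambda boxes: {"boxes": boxes},
)
```

Note that both feature sets are reused by the detection head, so they are computed once and passed to two stages.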
[0075] Next, referring to the accompanying drawings, a target detection system based on millimeter-wave radar and camera according to an embodiment of this application is described.
[0076] Figure 5 is a block diagram of a target detection system based on millimeter-wave radar and camera according to an embodiment of this application.
[0077] As shown in Figure 5, the target detection system 10 based on millimeter-wave radar and camera includes: a feature generation module 100, a candidate box generation module 200, a target box generation module 300, and a result output module 400.
[0078] Specifically, the feature generation module 100 is used to acquire target point cloud data collected by millimeter-wave radar and target images captured by camera, extract millimeter-wave features from the target point cloud data through a trained millimeter-wave feature extraction network, and extract image features from the target image through a trained image feature extraction network.
[0079] The candidate box generation module 200 is used to input the millimeter-wave features and the image features into a trained region candidate network to generate and output candidate boxes.
[0080] The target bounding box generation module 300 is used to input the millimeter-wave features, the image features, and the candidate boxes into a trained detection head network to generate and output target bounding boxes.
[0081] The result output module 400 is used to obtain the target detection result based on the target bounding box.
[0082] Optionally, in one embodiment of this application, the feature generation module 100 includes: a position compensation unit, a preliminary millimeter-wave feature generation unit, and a millimeter-wave feature generation unit.
[0083] The position compensation unit is used to perform position compensation on the target point cloud data to obtain compensated target point cloud data.
[0084] The preliminary millimeter-wave feature generation unit is used to input the compensated target point cloud data into the sparse embedding convolutional detection network in the trained millimeter-wave feature extraction network to obtain preliminary millimeter-wave features.
[0085] The millimeter-wave feature generation unit is used to input the preliminary millimeter-wave features into the feature pyramid network in the trained millimeter-wave feature extraction network to obtain the millimeter-wave features.
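According to the claims, the compensated target point cloud data is voxelized before it enters the sparse embedding convolutional detection network. A minimal sketch of voxelization is given below; the function name, the use of a dictionary keyed by voxel index, and the choice of the mean point as the voxel feature are assumptions for illustration.

```python
from collections import defaultdict


def voxelize(points, voxel_size):
    """Group (x, y, z) points into voxels and return {voxel_index: mean point}.

    Only occupied voxels are stored, which is the property sparse convolution exploits.
    """
    buckets = defaultdict(list)
    for p in points:
        # Floor division handles negative coordinates correctly.
        index = tuple(int(c // s) for c, s in zip(p, voxel_size))
        buckets[index].append(p)
    return {
        index: tuple(sum(axis) / len(pts) for axis in zip(*pts))
        for index, pts in buckets.items()
    }


# Usage: three points fall into two occupied voxels of 1 m side length.
points = [(0.1, 0.1, 0.1), (0.3, 0.3, 0.3), (2.5, 0.0, 0.0)]
voxels = voxelize(points, voxel_size=(1.0, 1.0, 1.0))
```

Because radar point clouds are sparse, the number of occupied voxels is typically far smaller than the number of cells in the full grid, so operating only on the stored entries avoids computation on empty space.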
[0086] Optionally, in one embodiment of this application, the feature generation module 100 includes: a preliminary image feature generation unit and an image feature generation module.
[0087] The preliminary image feature generation unit is used to input the target image into the residual network of the trained image feature extraction network to obtain preliminary image features.
[0088] The image feature generation module is used to input the preliminary image features into the feature pyramid network in the trained image feature extraction network to obtain the image features.
[0089] Optionally, in one embodiment of this application, the trained region candidate network includes a first feature weight prediction module and a convolutional neural network.
[0090] Optionally, in one embodiment of this application, the candidate box generation module 200 includes: a candidate region of interest generation unit, a candidate fusion unit, and a candidate box generation unit.
[0091] The candidate region of interest generation unit is used to generate anchor boxes based on the millimeter-wave features and the image features, and project the anchor boxes onto the millimeter-wave features and the image features to obtain the first region of interest features and the second region of interest features.
[0092] The candidate fusion unit is used to input the first region of interest features and the second region of interest features into the first feature weight prediction module to obtain the first fused feature.
[0093] The candidate box generation unit is used to input the first fused feature into the convolutional neural network to generate the candidate box.
[0094] Optionally, in one embodiment of this application, the trained detection head network includes a second feature weight prediction module and a multilayer perceptron network.
[0095] Optionally, in one embodiment of this application, the target bounding box generation module 300 includes: a target region of interest generation unit, a target fusion unit, and a target bounding box generation unit.
[0096] The target region of interest generation unit is used to project the candidate box onto the millimeter-wave feature and the image feature to obtain the third region of interest feature and the fourth region of interest feature.
[0097] The target fusion unit is used to input the third region of interest features and the fourth region of interest features into the second feature weight prediction module to obtain the second fused features.
[0098] The target box generation unit is used to input the second fused feature into the multilayer perceptron network to generate the target box.
[0099] It should be noted that the foregoing explanation of the target detection method embodiment based on millimeter-wave radar and camera also applies to the target detection system based on millimeter-wave radar and camera in this embodiment, and will not be repeated here.
[0100] The target detection system based on millimeter-wave radar and camera proposed in this application can acquire target point cloud data collected by millimeter-wave radar and target images captured by camera. Millimeter-wave features are extracted from the target point cloud data using a trained millimeter-wave feature extraction network, and image features are extracted from the target images using a trained image feature extraction network. The millimeter-wave features and the image features are input into a trained region candidate network to generate and output candidate boxes. The millimeter-wave features, the image features, and the candidate boxes are input into a trained detection head network to generate and output target boxes. Based on the target boxes, the target detection result is obtained.
[0101] This solves the problem in related technologies where data acquisition errors can lead to inaccurate target detection results, thus affecting the detection of surrounding targets during autonomous driving.
[0102] Figure 6 is a schematic diagram of the structure of a vehicle provided in an embodiment of this application. The vehicle may include:
[0103] A memory 501, a processor 502, and a computer program stored on the memory 501 and capable of running on the processor 502.
[0104] When the processor 502 executes the program, it implements the target detection method based on millimeter-wave radar and camera provided in the above embodiments.
[0105] Furthermore, the vehicle also includes:
[0106] Communication interface 503 is used for communication between memory 501 and processor 502.
[0107] The memory 501 is used to store computer programs that can run on the processor 502.
[0108] The memory 501 may include high-speed RAM memory, and may also include non-volatile memory, such as at least one disk storage device.
[0109] If the memory 501, processor 502, and communication interface 503 are implemented independently, then the communication interface 503, memory 501, and processor 502 can be interconnected via a bus to communicate with one another. The bus can be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, etc. Buses can be categorized as address buses, data buses, control buses, etc. For ease of representation, the bus is represented in Figure 6 by a single thick line, but this does not mean that there is only one bus or one type of bus.
[0110] Optionally, in a specific implementation, if the memory 501, processor 502, and communication interface 503 are integrated on a single chip, then the memory 501, processor 502, and communication interface 503 can communicate with each other through an internal interface.
[0111] Processor 502 may be a central processing unit (CPU), an application specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of this application.
[0112] This embodiment also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the above-described target detection method based on millimeter-wave radar and camera.
[0113] In the description of this specification, the references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of this application. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples. Moreover, without contradiction, those skilled in the art can combine and integrate the different embodiments or examples described in this specification, as well as the features of different embodiments or examples.
[0114] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one of that feature. In the description of this application, "N" means at least two, such as two, three, etc., unless otherwise explicitly specified.
[0115] Any process or method described in the flowchart or otherwise herein can be understood as representing a module, segment, or portion of code comprising one or N executable instructions for implementing custom logic functions or processes, and the scope of the preferred embodiments of this application includes additional implementations in which functions may be performed not in the order shown or discussed, including substantially simultaneously or in reverse order depending on the functions involved, as should be understood by those skilled in the art to which embodiments of this application pertain.
[0116] The logic and/or steps represented in the flowchart or otherwise described herein, for example, can be considered as an ordered list of executable instructions for implementing logical functions, and can be embodied in any computer-readable storage medium for use by, or in conjunction with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, "computer-readable storage medium" can be any means that can contain, store, communicate, propagate, or transmit programs for use by, or in conjunction with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection having one or more wires (electronic device), a portable computer disk drive (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), fiber optic devices, and portable compact disc read-only memory (CD-ROM). Alternatively, the computer-readable storage medium could be paper or another suitable medium on which the program can be printed, since the program can be obtained electronically by optically scanning the paper or other medium, followed by editing, interpreting, or otherwise processing as necessary, and then stored in a computer memory.
[0117] It should be understood that the various parts of this application can be implemented using hardware, software, firmware, or a combination thereof. In the above embodiments, the N steps or methods can be implemented using software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware as in another embodiment, it can be implemented using any one or a combination of the following techniques known in the art: discrete logic circuits having logic gates for implementing logical functions on data signals, application-specific integrated circuits (ASICs) having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), etc.
[0118] Those skilled in the art will understand that all or part of the steps of the methods in the above embodiments can be implemented by a program instructing related hardware. The program can be stored in a computer-readable storage medium, and when executed, the program includes one or a combination of the steps of the method embodiments.
[0119] Furthermore, the functional units in the various embodiments of this application can be integrated into a processing module, or each unit can exist physically separately, or two or more units can be integrated into a module. The integrated module can be implemented in hardware or as a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it can also be stored in a computer-readable storage medium.
[0120] The storage medium mentioned above can be a read-only memory, a disk, or an optical disk, etc. Although embodiments of this application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting this application. Those skilled in the art can make changes, modifications, substitutions, and variations to the above embodiments within the scope of this application.
[0121] It should be understood that the application of this application is not limited to the examples above. Those skilled in the art can make improvements or modifications based on the above description, and all such improvements and modifications should fall within the protection scope of the appended claims.
Claims
1. A target detection method based on millimeter-wave radar and camera, characterized in that, The target detection method based on millimeter-wave radar and camera includes: The system acquires target point cloud data collected by millimeter-wave radar and target images captured by a camera, and extracts millimeter-wave features from the target point cloud data using a trained millimeter-wave feature extraction network. The time difference of a preset number of multi-frame point cloud data is calculated. Based on the time difference, the radial velocity collected by the millimeter-wave radar and the vehicle trajectory, the position of each target point cloud data in the corresponding frame is calculated. After fusing the point cloud data corresponding to multiple frames into the same frame, the compensated target point cloud data is obtained. Image features are extracted from the target image using a trained image feature extraction network; After the compensation target point cloud data is voxelized, it is input into the sparse embedding convolutional detection network in the trained millimeter-wave feature extraction network to obtain preliminary millimeter-wave features. Then, the preliminary millimeter-wave features are input into the feature pyramid network in the millimeter-wave feature extraction network to output millimeter-wave features. 
The millimeter-wave features and the image features are input into a trained region candidate network to generate and output candidate boxes, specifically including: An anchor box is generated based on the millimeter-wave features and the image features, and the anchor box is projected onto the millimeter-wave features and the image features to obtain the first region of interest features and the second region of interest features; The first region of interest features and the second region of interest features are input into the first feature weight prediction module to obtain the first fused feature; The first fused feature is input into a convolutional neural network to generate the candidate box; During the training process of the first feature weight prediction module, the overall weight of the network is updated. The millimeter-wave features, the image features, and the candidate boxes are input into a trained detection head network to generate and output target boxes. Based on the target bounding box, the target detection result is obtained.
2. The target detection method based on millimeter-wave radar and camera according to claim 1, characterized in that, The extraction of millimeter-wave features from the target point cloud data using a trained millimeter-wave feature extraction network specifically includes: Position compensation is performed on the target point cloud data to obtain compensated target point cloud data; The compensated target point cloud data is input into the sparse embedding convolutional detection network in the trained millimeter-wave feature extraction network to obtain preliminary millimeter-wave features; The preliminary millimeter-wave features are input into the feature pyramid network in the trained millimeter-wave feature extraction network to obtain the millimeter-wave features.
3. The target detection method based on millimeter-wave radar and camera according to claim 1, characterized in that, The step of extracting image features from the target image using a trained image feature extraction network specifically includes: The target image is input into the residual network of the trained image feature extraction network to obtain preliminary image features; The preliminary image features are input into the feature pyramid network in the trained image feature extraction network to obtain the image features.
4. The target detection method based on millimeter-wave radar and camera according to claim 1, characterized in that, The trained region candidate network includes a first feature weight prediction module and a convolutional neural network.
5. The target detection method based on millimeter-wave radar and camera according to claim 1, characterized in that, The trained detection head network includes a second feature weight prediction module and a multilayer perceptron network.
6. The target detection method based on millimeter-wave radar and camera according to claim 5, characterized in that, The step of inputting the millimeter-wave features, the image features, and the candidate bounding boxes into a trained detection head network to generate and output target bounding boxes specifically includes: The candidate box is projected onto the millimeter-wave feature and the image feature to obtain the third region of interest feature and the fourth region of interest feature; The third region of interest features and the fourth region of interest features are input into the second feature weight prediction module to obtain the second fused feature; The second fused feature is input into the multilayer perceptron network to generate the target bounding box.
7. A target detection system based on millimeter-wave radar and camera, characterized in that, The target detection system based on millimeter-wave radar and camera is used to implement the target detection method based on millimeter-wave radar and camera as described in any one of claims 1-6, including: The feature generation module is used to acquire target point cloud data collected by millimeter-wave radar and target images captured by camera, extract millimeter-wave features from the target point cloud data through a trained millimeter-wave feature extraction network, and extract image features from the target images through a trained image feature extraction network. The candidate box generation module is used to input the millimeter-wave features and the image features into a trained region candidate network to generate and output candidate boxes; The target bounding box generation module is used to input the millimeter-wave features, the image features, and the candidate boxes into a trained detection head network to generate and output target bounding boxes; The result output module is used to obtain the target detection result based on the target bounding box.
8. A vehicle, characterized in that, The vehicle includes: a memory, a processor, and a target detection program based on millimeter-wave radar and camera stored in the memory and executable on the processor. When the target detection program based on millimeter-wave radar and camera is executed by the processor, it implements the steps of the target detection method based on millimeter-wave radar and camera as described in any one of claims 1-6.
9. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a target detection program based on millimeter-wave radar and camera, which, when executed by a processor, implements the steps of the target detection method based on millimeter-wave radar and camera as described in any one of claims 1-6.