Target detection methods, systems, electronic devices and computer-readable storage media

By integrating depth completion modules from vision and millimeter-wave radar, along with a cross-attention network, into the target detection method, the problems of low detection accuracy and poor fusion are solved, achieving more efficient target detection.

CN116030270B · Active · Publication Date: 2026-06-30 · AXERA TECH (BEIJING) CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
AXERA TECH (BEIJING) CO LTD
Filing Date
2023-02-08
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing target detection methods that fuse vision and millimeter-wave radar suffer from reduced detection accuracy and poor fusion because the millimeter-wave radar data is introduced directly, making it difficult to achieve effective detection.

Method used

By combining the input image with the depth information completed by the millimeter-wave radar output, the sparse information is made denser. The dense predicted depth features and RGB features are then fused and projected into the BEV domain. Graph networks and cross-attention networks are used for correlation and weighted fusion to improve detection accuracy and effectiveness.

Benefits of technology

It improves the density of millimeter-wave radar information and the effectiveness of visual feature correlation, reduces clutter interference on the model, and enhances detection capabilities.



Abstract

This application provides a target detection method, system, electronic device, and computer-readable storage medium, relating to the field of target detection. The target detection method includes: obtaining RGB semantic features based on RGB image information of the target to be detected; obtaining dense depth information based on the RGB semantic features and point cloud data information of the target to be detected; fusing the RGB semantic features and dense depth information to obtain dense fused features; obtaining BEV fused features based on the dense fused features, a graph network, and / or a cross-attention network; and detecting the target to be detected based on the BEV fused features. Using the target detection method provided in this application, the problems of sparse millimeter-wave radar echo data and reduced detection accuracy due to clutter can be solved.

Description

Technical Field

[0001] This application relates to the field of target detection, and more specifically, to a target detection method, system, electronic device, and computer-readable storage medium.

Background Technology

[0002] Existing target detection methods that fuse vision and millimeter-wave radar mainly detect targets by directly projecting each point of the millimeter-wave radar point cloud onto the camera coordinate system or the bird's-eye view and fusing it with visual features extracted by a CNN (convolutional neural network). A detection head then predicts 3D target boxes.

[0003] However, because the millimeter-wave radar data is introduced directly, this approach reduces target detection accuracy and fuses the two modalities poorly, making effective target detection difficult.

Summary of the Invention

[0004] The purpose of this application is to provide a target detection method, system, electronic device, and computer-readable storage medium. Taking an image and millimeter-wave radar data as input, the method outputs completed depth information, making the sparse information denser. It then projects the features obtained by fusing the dense predicted depth features with the RGB features, together with the millimeter-wave radar velocity features, into the BEV domain, and associates the two in the BEV domain based on a graph network and a cross-attention network. Finally, it performs weighting based on the predictions of the graph network and cross-attention network to obtain the final fused RGB and millimeter-wave features. The target detection method provided in this application has two benefits. On the one hand, it increases the density of millimeter-wave radar information, thereby improving the model's depth prediction performance. On the other hand, it associates millimeter-wave radar with visual features more effectively, reduces clutter interference on the model, and improves the model's fusion of millimeter-wave radar information, thus improving the model's detection capability.

[0005] In a first aspect, embodiments of this application provide a target detection method, the method comprising: obtaining RGB semantic features based on RGB image information of the target to be detected; obtaining dense depth information based on the RGB semantic features and point cloud data information of the target to be detected; fusing the RGB semantic features and dense depth information to obtain dense fusion features; obtaining BEV fusion features based on the dense fusion features, graph networks and / or cross-attention networks; and detecting the target to be detected based on the BEV fusion features.

[0006] In the above implementation process, the target detection method provided in this application acquires RGB semantic features and dense depth information; it then removes interference from the completed dense fusion features to obtain BEV fusion features; finally, it detects the target to be detected based on the BEV fusion features. The depth completion module fuses the target depth information from the millimeter-wave radar with the image semantic features, thereby improving the depth prediction accuracy of the detection model; the feature fusion module performs weighted fusion of the dense fusion features or other features, thereby reducing the impact of noise introduced by millimeter-wave radar clutter on the detection network and improving the effectiveness of multimodal fusion.

[0007] Optionally, in this embodiment of the application, obtaining dense depth information based on the RGB semantic features and point cloud data information of the target to be detected includes: inputting an RGB image including the target to be detected into a convolutional neural network to obtain RGB semantic features; and obtaining dense depth information based on the point cloud data including the target to be detected and the RGB semantic features.

[0008] In the above implementation process, in order to obtain dense depth information, RGB semantic features are first obtained from the RGB image; dense depth information is then obtained from the RGB semantic features and the point cloud data. It should be understood that detection accuracy is limited by the sparse depth information of millimeter-wave radar. Therefore, the RGB semantic features and the point cloud data are fused to obtain denser depth information, which addresses the sparsity of millimeter-wave radar depth information.

[0009] Optionally, in this embodiment of the application, obtaining dense depth information based on point cloud data including the target to be detected and RGB semantic features includes: projecting the point cloud data onto the camera coordinate system to obtain a sparse depth map; inputting the sparse depth map into a convolutional neural network to obtain semi-dense depth information; and obtaining dense depth information based on the semi-dense depth information and RGB semantic features.

[0010] In the above implementation process, in order to obtain dense depth information in this application embodiment, the point cloud data of a radio detection device, such as a millimeter-wave radar, can be projected onto the camera coordinate system to obtain a sparse depth map. After obtaining the sparse depth map, dense depth information can be obtained based on the sparse depth map and RGB semantic features, thereby increasing the amount of detection data and improving the accuracy of target detection.

[0011] Optionally, in this embodiment of the application, the BEV fusion feature includes a first BEV fusion feature; obtaining the BEV fusion feature based on the dense fusion feature, graph network and / or cross-attention network includes: projecting the dense fusion feature onto the BEV coordinate system; in the BEV coordinate system, inputting the dense fusion feature into the cross-attention network to obtain the first BEV fusion feature; wherein, the cross-attention network is used to extract the relation weights between features and to weight and fuse the features.

[0012] In the above implementation process, in order to obtain the first BEV fusion feature in this application embodiment, firstly, the dense fusion feature is projected onto the BEV coordinate system, and then the dense fusion feature is input into the cross-attention network in the BEV coordinate system to extract the relationship weights between the features; further, the features are weighted and fused according to the relationship weights, thereby reducing the impact of noise introduced by millimeter-wave radar clutter on the detection network and improving the effectiveness of multi-mode fusion.

[0013] Optionally, in this embodiment, the BEV fusion feature further includes a second BEV fusion feature; obtaining the BEV fusion feature based on the dense fusion feature, graph network, and / or cross-attention network further includes: inputting point cloud data into a graph network to obtain radio detection device velocity features; wherein, the graph network is used to learn the correlation between points; projecting the dense fusion feature and the radio detection device velocity features onto the BEV coordinate system; in the BEV coordinate system, performing weighted fusion of the dense fusion feature and the radio detection device velocity features based on a cross-attention network to obtain the second BEV fusion feature; wherein, the second BEV fusion feature represents the feature of the motion velocity of the target to be detected.

[0014] In the above implementation process, to obtain the second BEV fusion feature, point cloud data is input into a graph network to obtain velocity features. Further, the velocity features and dense fusion features are projected onto the BEV coordinate system, and then weighted and fused in the BEV coordinate system to obtain the second BEV fusion feature representing the velocity of the target to be detected. Due to the use of the graph network and cross-attention network, the influence of clutter in millimeter-wave radar on the detection results is reduced, improving the detection accuracy.

[0015] Optionally, in this embodiment of the application, the weighted fusion of dense fusion features and radio detection device velocity features based on a cross-attention network to obtain a second BEV fusion feature includes: inputting the radio detection device velocity features and dense fusion features into the cross-attention network respectively to obtain weighted radio detection device velocity features and weighted dense fusion features respectively; and summing and fusing the weighted radio detection device velocity features and weighted dense fusion features to obtain the second BEV fusion feature.

[0016] In the above implementation process, in order to obtain the second BEV fusion feature, the radio detection device velocity feature and dense fusion feature are first weighted and fused based on a cross-attention network; further, the two weighted and fused features are summed and fused; finally, the second BEV fusion feature with reduced clutter effect is obtained.
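The weighting and summing described in [0015] and [0016] can be illustrated with a minimal sketch. It assumes the cross-attention network has already produced one scalar weight per BEV cell for each modality; the names `weight_features` and `sum_fuse` and all values below are hypothetical and are not taken from the patent.

```python
def weight_features(features, weights):
    """Scale each BEV cell's feature vector by its (assumed) attention weight."""
    return [[w * f for f in cell] for cell, w in zip(features, weights)]


def sum_fuse(a, b):
    """Element-wise sum of two feature maps with identical shape."""
    return [[x + y for x, y in zip(cell_a, cell_b)] for cell_a, cell_b in zip(a, b)]


# Two BEV cells, two channels each (illustrative values only).
dense_fused = [[1.0, 2.0], [3.0, 4.0]]
radar_velocity = [[0.5, 0.0], [8.0, 8.0]]

# Hypothetical attention weights: the radar feature in the second cell is
# treated as clutter and suppressed with weight 0.0.
w_dense = [1.0, 0.5]
w_radar = [1.0, 0.0]

second_bev_feature = sum_fuse(
    weight_features(dense_fused, w_dense),
    weight_features(radar_velocity, w_radar),
)
```

With a zero weight the clutter-like radar cell contributes nothing to the fused output, which is the effect the paragraph attributes to the weighted fusion.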

[0017] Optionally, in this embodiment of the application, detecting the target to be detected based on the BEV fusion features includes: inputting the BEV fusion features into the detection head, determining whether the output of the detection head includes multiple detection boxes corresponding to the target to be detected; if it is determined that the output of the detection head includes multiple detection boxes corresponding to the target to be detected, then obtaining the confidence of all detection boxes; and taking the detection box with the highest confidence among all detection boxes as the detection result of the target to be detected.

[0018] In the above implementation process, when detecting the target based on the BEV fusion features, the BEV fusion features are input into the detection head, which may result in one or more detection boxes; if there are multiple overlapping detection boxes, the detection box with the highest confidence is selected as the detection result.
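The selection rule in [0017] and [0018] can be sketched as follows. The `Box` fields and the function name `select_detection` are assumptions made for illustration; the patent does not specify the box parameterization.

```python
from dataclasses import dataclass


@dataclass
class Box:
    # Hypothetical BEV box: centre (x, y), size (w, l) and a confidence score.
    x: float
    y: float
    w: float
    l: float
    score: float


def select_detection(boxes):
    """Return the box with the highest confidence, or None if there are no boxes."""
    if not boxes:
        return None
    return max(boxes, key=lambda b: b.score)


# Three candidate boxes that the detection head produced for the same target.
candidates = [
    Box(10.0, 2.0, 1.8, 4.5, 0.62),
    Box(10.2, 2.1, 1.9, 4.4, 0.91),
    Box(9.9, 1.9, 1.7, 4.6, 0.48),
]
best = select_detection(candidates)
```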

[0019] In a second aspect, embodiments of this application provide a target detection system, which includes: a depth completion module, a BEV fusion feature acquisition module, and a detection module; the depth completion module is used to acquire RGB semantic features based on the RGB image information of the target to be detected; the depth completion module is also used to acquire dense depth information based on the RGB semantic features and point cloud data information of the target to be detected; the depth completion module is also used to fuse the RGB semantic features and dense depth information to obtain dense fusion features; the BEV fusion feature acquisition module is used to obtain BEV fusion features based on dense fusion features, graph networks, and / or cross-attention networks; the detection module is used to detect the target to be detected based on the BEV fusion features.

[0020] In a third aspect, embodiments of this application provide an electronic device, which includes a memory and a processor. The memory stores program instructions, and when the processor reads and runs the program instructions, it executes the steps in any implementation of the target detection method provided in the first aspect of this application.

[0021] In a fourth aspect, embodiments of this application also provide a computer-readable storage medium storing computer program instructions, which, when read and executed by a processor, perform steps in any implementation of the target detection method provided in the first aspect of this application.

Description of the Drawings

[0022] To more clearly illustrate the technical solutions of the embodiments of this application, the accompanying drawings used in the embodiments of this application will be briefly introduced below. It should be understood that the following drawings only show some embodiments of this application and should not be regarded as a limitation of the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.

[0023] Figure 1 is a first flowchart of the target detection method provided in an embodiment of this application;

[0024] Figure 2 is a first flowchart for obtaining dense depth information provided in an embodiment of this application;

[0025] Figure 3 is a second flowchart for obtaining dense depth information provided in an embodiment of this application;

[0026] Figure 4 is a flowchart for obtaining the first BEV fusion feature provided in an embodiment of this application;

[0027] Figure 5 is a first flowchart for obtaining the second BEV fusion feature provided in an embodiment of this application;

[0028] Figure 6 is a second flowchart for obtaining the second BEV fusion feature provided in an embodiment of this application;

[0029] Figure 7 is a schematic diagram of BEV fusion feature acquisition provided in an embodiment of this application;

[0030] Figure 8 is a comparison image of features extracted before and after reducing clutter interference, provided in an embodiment of this application;

[0031] Figure 9 is a second flowchart of the target detection method provided in an embodiment of this application;

[0032] Figure 10 is a schematic diagram of the modules of the target detection system provided in an embodiment of this application;

[0033] Figure 11 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0034] The technical solutions of the embodiments of this application will now be described with reference to the accompanying drawings. For example, the flowcharts and block diagrams in the drawings illustrate the architecture, functions, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagram may represent a module, program segment, or part of code, which contains one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions marked in the blocks may occur in a different order than those marked in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagram and / or flowchart, and combinations of blocks in the block diagram and / or flowchart, can be implemented using a dedicated hardware-based system that performs the specified function or action, or can be implemented using a combination of dedicated hardware and computer instructions. In addition, the functional modules in the various embodiments of the present invention may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.

[0035] During research, the applicant found that, in the field of object detection, 3D object detection models that take only visual input predict the depth of objects with relatively low accuracy. Existing research improves the accuracy with which models predict depth and geometric information by fusing LiDAR information; however, visual and LiDAR sensors are sensitive to weather conditions and are prone to significant loss of obstacle information in severe weather such as heavy fog, heavy rain, and heavy snow. By contrast, the wavelength of millimeter-wave radar signals is much larger than the tiny particles that form fog, rain, and snow, so the signals readily penetrate or diffract around them. Therefore, many methods fusing vision and millimeter-wave radar have been proposed. Current research mainly involves directly projecting each point of the millimeter-wave radar point cloud onto the camera coordinate system or bird's-eye view, fusing it with visual features extracted by a CNN, and then using a detection head to predict 3D target boxes.

[0036] However, this approach of directly projecting each point of the millimeter-wave radar point cloud onto the camera coordinate system or bird's-eye view, fusing it with visual features extracted by a CNN, and then predicting 3D target boxes through a detection head has two problems. First, the millimeter-wave radar echo data is very sparse, resulting in low accuracy when predicting the depth of the target object. Second, the millimeter-wave radar echo data contains clutter, and the way existing methods associate information before fusing millimeter-wave radar and RGB features is relatively crude: they simply aggregate multimodal features directly over the neighborhood of radar points or over candidate proposal regions, without fine-grained optimization of the associated multimodal features. This results in poor fusion and makes effective target detection difficult.

[0037] Based on this, embodiments of this application provide a target detection method, system, electronic device, and computer-readable storage medium. The target detection method adds to the target detection model a depth completion module based on millimeter-wave radar and a feature fusion module based on graph networks and cross-attention networks. The depth completion module fuses the target depth information from the millimeter-wave radar with image semantic features, thereby improving the depth prediction accuracy of the detection model. The feature fusion module performs weighted fusion of the dense fusion features or other features, thereby reducing the impact of noise introduced by millimeter-wave radar clutter on the detection network and improving the effectiveness of multimodal fusion, thus achieving effective target detection.

[0038] Please refer to Figure 1, which is a first flowchart of the target detection method provided in an embodiment of this application. The method includes the following steps:

[0039] Step S100: Obtain RGB semantic features based on the RGB image information of the target to be detected.

[0040] In step S100 above, RGB semantic features are obtained based on the RGB image information of the target to be detected. Those skilled in the art will understand that an RGB image refers to an image displayed using the RGB color mode. The RGB color mode is a color standard in the industry, which obtains various colors by changing the three color channels of red (R), green (G), and blue (B) and superimposing them with each other. RGB represents the colors of the three channels of red, green, and blue.

[0041] In the embodiments of this application, RGB image information can be obtained through various image acquisition devices or directly from a gallery; the method of obtaining RGB image information should not be a limitation on the scope of protection of the target detection method in the embodiments of this application.

[0042] To better understand this scheme, a brief introduction to image semantic features is provided here. Typically, to identify or determine whether an image or video contains target semantics (such as a specified person or object), one checks whether the target semantics are present in the video frames or images. When doing so, candidate targets in the image are usually identified by their features, and whether the image contains the target semantics is then determined from the characteristics of the target object. In other words, the semantic features of an image can serve as one of the criteria for identifying or determining target objects.

[0043] Step S101: Obtain dense depth information based on the RGB semantic features and point cloud data information of the target to be detected.

[0044] In step S101 above, after obtaining the RGB semantic features of the image, dense depth information is further obtained based on the RGB semantic features and point cloud data. This can be understood as follows: relying solely on point cloud data results in low detection accuracy due to the sparse depth information of millimeter-wave radar. Therefore, to obtain denser depth information, the RGB image semantic features and point cloud data are fused to obtain dense depth information, thus solving the problem of sparse depth information in millimeter-wave radar.

[0045] As will be understood by those skilled in the art, point cloud data refers to a massive collection of points representing the surface characteristics of a target; it is generally obtained through laser measurement or photogrammetry and can reflect the true condition of the object under test with high accuracy, such as ground conditions, ground object reflection characteristics, etc. The point cloud data in the embodiments of this application can be acquired by radio detection equipment, such as millimeter-wave radar.

[0046] Step S102: Fuse the RGB semantic features and dense depth information to obtain dense fused features.

[0047] In step S102 above, after obtaining the RGB semantic features and dense depth information, the RGB semantic features and dense depth information are fused to obtain dense fused features. It should be understood that the obtained dense fused features solve the problem of sparse depth information in point cloud data.

[0048] Step S103: Obtain BEV fusion features based on dense fusion features, graph networks and / or cross-attention networks.

[0049] In step S103 above, BEV fusion features are further obtained based on dense fusion features, graph networks, and / or cross-attention networks. The cross-attention network configuration makes the BEV fusion features effective in addressing the clutter problem present in millimeter-wave radar.

[0050] Step S104: Detect the target to be detected based on the BEV fusion features.

[0051] As can be seen from Figure 1, the target detection method provided in this application acquires RGB semantic features and dense depth information; it then removes interference from the completed dense fusion features to obtain BEV fusion features; finally, it detects the target to be detected based on the BEV fusion features. The depth completion module fuses the target depth information from the millimeter-wave radar with the image semantic features, thereby improving the depth prediction accuracy of the detection model; the feature fusion module performs weighted fusion of the dense fusion features or other features, thereby reducing the impact of noise introduced by millimeter-wave radar clutter on the detection network and improving the effectiveness of multimodal fusion.

[0052] Please refer to Figure 2, which is a first flowchart for obtaining dense depth information provided in an embodiment of this application. In an optional embodiment, obtaining dense depth information based on the RGB semantic features and point cloud data information of the target to be detected may include the following steps:

[0053] Step S200: Input the RGB image including the target to be detected into the convolutional neural network to obtain RGB semantic features.

[0054] Step S201: Obtain dense depth information based on point cloud data including the target to be detected and RGB semantic features.

[0055] In steps S200-S201 above, to obtain dense depth information, the RGB image including the target to be detected is first input into a convolutional neural network to obtain RGB semantic features. After obtaining the RGB semantic features, dense depth information is further obtained based on these semantic features and point cloud data including the target to be detected.

[0056] As can be seen from Figure 2, in order to obtain dense depth information in this embodiment, RGB semantic features are first obtained from the RGB image; dense depth information is then obtained from the RGB semantic features and the point cloud data. It should be understood that detection accuracy is limited by the sparse depth information of millimeter-wave radar. Therefore, the semantic features of the RGB image and the point cloud data are fused to obtain denser depth information, which addresses the sparsity of millimeter-wave radar depth information.

[0057] Please refer to Figure 3, which is a second flowchart for obtaining dense depth information provided in an embodiment of this application. In an optional embodiment, obtaining dense depth information based on point cloud data including the target to be detected and RGB semantic features can be achieved through the following steps:

[0058] Step S300: Project the point cloud data onto the camera coordinate system to obtain a sparse depth map.

[0059] In step S300 above, the point cloud data is projected onto the camera coordinate system to obtain a sparse depth map. In other words, the main purpose of this step is to transform the point cloud coordinates measured in the millimeter-wave radar coordinate system into the camera coordinate system to obtain the corresponding pixel coordinates, and from these the sparse depth map. Those skilled in the art will understand that existing mature techniques can be used to transform the radar coordinate system into the camera coordinate system, so this is not elaborated here.
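As a minimal sketch of step S300, assuming a pinhole camera model with known intrinsics and a known rigid radar-to-camera transform (the patent specifies neither), the projection could look like the following. The function name `project_to_sparse_depth`, its parameters, and the convention that 0.0 marks a pixel without a radar return are illustrative assumptions.

```python
def project_to_sparse_depth(points_radar, R, t, fx, fy, cx, cy, width, height):
    """Project radar points (x, y, z) into the image; keep the nearest depth per pixel."""
    depth = [[0.0] * width for _ in range(height)]  # 0.0 means "no measurement"
    for p in points_radar:
        # Rigid transform from the radar frame to the camera frame: p_cam = R p + t.
        xc, yc, zc = (sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
        if zc <= 0.0:
            continue  # the point lies behind the camera
        # Pinhole projection to pixel coordinates (u: column, v: row).
        u = int(round(fx * xc / zc + cx))
        v = int(round(fy * yc / zc + cy))
        if 0 <= u < width and 0 <= v < height:
            if depth[v][u] == 0.0 or zc < depth[v][u]:
                depth[v][u] = zc
    return depth


# Illustrative setup: radar and camera frames coincide (identity extrinsics).
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]
points = [
    (0.0, 0.0, 10.0),   # lands on the principal point
    (0.4, 0.2, 20.0),   # lands two columns right, one row down
    (0.0, 0.0, -5.0),   # behind the camera, dropped
    (10.0, 0.0, 10.0),  # outside the image, dropped
]
sparse = project_to_sparse_depth(points, R, t, fx=100.0, fy=100.0,
                                 cx=4.0, cy=3.0, width=8, height=6)
```

Only two of the four points produce a depth value, which shows why the resulting map is sparse and why the later completion steps are needed.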

[0060] Step S301: Input the sparse depth map into the convolutional neural network to obtain semi-dense depth information.

[0061] In step S301 above, after the sparse depth map is obtained, it is input into the convolutional neural network to obtain semi-dense depth information.

[0062] Step S302: Obtain dense depth information based on semi-dense depth information and RGB semantic features.

[0063] In step S302 above, the semi-dense depth information and RGB semantic features are combined to obtain dense depth information. For example, the semi-dense depth information and RGB semantic features are summed and fused, and then the features are fused through ASPP (Atrous Spatial Pyramid Pooling). Finally, a depth prediction layer based on a single convolution is used to determine whether the pixels around the millimeter-wave radar projection pixel have the same depth as the pixel. Pixels with the same predicted depth are assigned the depth of the millimeter-wave radar projection pixel, thereby obtaining dense depth information.
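The final assignment in step S302 can be sketched under strong simplifying assumptions. Here the output of the depth prediction layer is replaced by a ready-made boolean map `same_depth_mask` that says whether a pixel shares the depth of a nearby radar pixel; the real layer is learned, and the mask, the `radius` parameter, and the function name `densify_depth` are hypothetical.

```python
def densify_depth(sparse, same_depth_mask, radius=1):
    """Copy each radar pixel's depth to neighbours flagged as having the same depth."""
    height, width = len(sparse), len(sparse[0])
    dense = [row[:] for row in sparse]
    for v in range(height):
        for u in range(width):
            d = sparse[v][u]
            if d == 0.0:
                continue  # not a radar projection pixel
            for dv in range(-radius, radius + 1):
                for du in range(-radius, radius + 1):
                    nv, nu = v + dv, u + du
                    inside = 0 <= nv < height and 0 <= nu < width
                    if inside and same_depth_mask[nv][nu] and dense[nv][nu] == 0.0:
                        dense[nv][nu] = d
    return dense


# One radar pixel at row 1, column 1 with a depth of 12.0 (illustrative values).
sparse = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 12.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
# Stand-in for the prediction layer's output.
same_depth_mask = [
    [False, True, False, False],
    [False, True, True, True],
    [False, False, False, False],
]
dense = densify_depth(sparse, same_depth_mask, radius=1)
```

The pixel at row 1, column 3 is flagged in the mask but lies outside the one-pixel neighbourhood, so it stays empty; the neighbourhood size is a design choice the patent leaves open.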

[0064] As can be seen from Figure 3, in order to obtain dense depth information in this embodiment, the point cloud data of a radio detection device, such as a millimeter-wave radar, can be projected onto the camera coordinate system to obtain a sparse depth map. After obtaining the sparse depth map, dense depth information can be obtained based on the sparse depth map and RGB semantic features, thereby increasing the amount of detection data and improving the accuracy of target detection.

[0065] Please refer to Figure 4, which is a flowchart for obtaining the first BEV fusion feature provided in an embodiment of this application. In an optional embodiment, the BEV fusion feature includes the first BEV fusion feature, and obtaining the BEV fusion feature based on dense fusion features, graph networks and / or cross-attention networks can be achieved through the following steps:

[0066] Step S400: Project the densely fused features onto the BEV coordinate system.

[0067] In step S400 above, in order to obtain the first BEV fusion feature, the obtained dense fusion feature is first projected onto the BEV coordinate system.

[0068] To better understand the target detection method provided in this application, a brief introduction to BEV coordinates is given here. A point cloud BEV (Bird's Eye View) refers to the projection of a point cloud onto a plane perpendicular to its height. Typically, before obtaining the BEV view, the space is divided into voxels, the point cloud is downsampled using voxels, and then each voxel is projected as a point. The pixel coordinates of the BEV view can be obtained during voxel projection.
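The voxel-then-project procedure described in [0068] can be sketched as follows. The grid origin, the voxel size, and the function names `voxelize` and `voxels_to_bev` are assumptions chosen for illustration.

```python
import math


def voxelize(points, origin, voxel_size):
    """Downsample a point cloud by keeping one entry per occupied voxel index."""
    voxels = set()
    for x, y, z in points:
        voxels.add((
            math.floor((x - origin[0]) / voxel_size),
            math.floor((y - origin[1]) / voxel_size),
            math.floor((z - origin[2]) / voxel_size),
        ))
    return voxels


def voxels_to_bev(voxels):
    """Project voxels onto the ground plane by dropping the height index."""
    return {(ix, iy) for ix, iy, _ in voxels}


points = [
    (0.1, 0.1, 0.2),
    (0.3, 0.4, 0.1),  # same voxel as the first point
    (0.2, 0.2, 1.7),  # same column, one voxel higher
    (2.6, 1.1, 0.5),
]
voxels = voxelize(points, origin=(0.0, 0.0, 0.0), voxel_size=1.0)
bev_cells = voxels_to_bev(voxels)
```

Four points collapse into three voxels and then into two BEV cells, because voxels stacked along the height axis share the same BEV pixel coordinates.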

[0069] Step S401: In the BEV coordinate system, input the dense fusion feature into the cross-attention network to obtain the first BEV fusion feature.

[0070] In step S401 above, after acquiring the densely fused features, the densely fused features are input into the cross-attention network in the BEV coordinate system to obtain the first BEV fused features. The cross-attention network is used to extract the relational weights between features and to weight and fuse the various features.

[0071] Those skilled in the art will understand that Cross Attention Net can obtain contextual information of neighboring pixels on the cross path of a pixel through a new cross attention module; through further recursive operations, each pixel can eventually obtain the long-range dependencies of all pixels. In other words, Cross Attention Net uses the cross attention of features and coordinates to learn features, and then uses a forward propagation method to fuse point cloud features of different scales when combining contextual information.

[0072] As can be seen from Figure 4, in order to obtain the first BEV fusion feature in this application embodiment, the dense fusion feature is first projected onto the BEV coordinate system and then input into the cross-attention network in the BEV coordinate system to extract the relationship weights between the features; the features are then weighted and fused according to the relationship weights, thereby reducing the impact of noise introduced by millimeter-wave radar clutter on the detection network and improving the effectiveness of multi-modal fusion.

[0073] Please refer to Figure 5, which is the first flowchart for obtaining the second BEV fusion feature provided in an embodiment of this application. In an optional embodiment, obtaining the BEV fusion feature based on dense fusion features, graph networks and / or cross-attention networks can also be achieved through the following steps:

[0074] Step S500: Input the point cloud data into the graph network to obtain the velocity characteristics of the radio detection device.

[0075] In step S500 above, in order to obtain the second BEV fusion feature, the point cloud data first needs to be input into the graph network. A distance-based dynamic graph network learns the correlation between points in the original point cloud data, thereby obtaining the velocity features of the radio detection device, such as the velocity features of millimeter-wave radar. Those skilled in the art will understand that the obtained millimeter-wave radar velocity features can reflect the motion speed of the target under test.

[0076] It should be noted that Dynamic Graph is an algorithm that can extract millimeter-wave radar BEV features from millimeter-wave radar point cloud features; that is, the target detection method provided in this application embodiment can obtain the velocity features of the radio detection device through a Dynamic Graph. In other words, the second BEV fusion feature representation in this application embodiment includes features of the target's motion velocity.
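
A distance-based graph over radar points can be sketched as follows. This is a hand-written stand-in for a learned graph network: the k-nearest-neighbour construction, the value of k, and the mean aggregation of neighbour velocities are all assumptions made for illustration and are not taken from the application.

```python
import math


def knn_graph(positions, k):
    """Return, for each point, the indices of its k nearest other points."""
    neighbours = []
    for i, p in enumerate(positions):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(positions) if j != i
        )
        neighbours.append([j for _, j in dists[:k]])
    return neighbours


def aggregate_velocity(velocities, neighbours):
    """Average each point's velocity with the velocities of its neighbours."""
    out = []
    for i, nbrs in enumerate(neighbours):
        group = [velocities[i]] + [velocities[j] for j in nbrs]
        out.append(sum(group) / len(group))
    return out


# Hypothetical radar returns: three close points and one distant outlier.
positions = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (20.0, 20.0)]
velocities = [3.0, 3.0, 3.0, -9.0]
neighbours = knn_graph(positions, k=2)
smoothed = aggregate_velocity(velocities, neighbours)
```

In a learned graph network the aggregation would use trained weights rather than a plain mean; the sketch only shows how edges are formed from distances and how per-point velocity features are gathered along them.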

[0077] Step S501: Project the dense fusion features and the radio detection device velocity features onto the BEV coordinate system.

[0078] In step S501 above, similarly, after obtaining the dense fusion features and the radio detection device velocity features, both are projected onto the BEV coordinate system.

[0079] Step S502: In the BEV coordinate system, the dense fusion feature and the radio detection device velocity feature are weighted and fused based on a cross-attention network to obtain the second BEV fusion feature.

[0080] In step S502 above, in the BEV coordinate system, the dense fusion feature and the radio detection device velocity feature are weighted and fused based on a cross-attention network to obtain the second BEV fusion feature. For example, the dense fusion feature and the radio detection device velocity feature are input into the cross-attention network, which extracts relation weights and performs weighted fusion.

[0081] As can be seen from Figure 5, to obtain the second BEV fusion feature, point cloud data is first input into a graph network to obtain velocity features. The velocity features and dense fusion features are then projected onto the BEV coordinate system and weighted and fused there to obtain the second BEV fusion feature, which characterizes the velocity of the target to be detected. Because the graph network and cross-attention network are used, the influence of millimeter-wave radar clutter on the detection results is reduced, improving detection accuracy.

[0082] Please refer to Figure 6, which is the second flowchart for obtaining the second BEV fusion feature provided in an embodiment of this application. In an optional embodiment, the weighted fusion of the dense fusion feature and the radio detection device velocity feature based on a cross-attention network to obtain the second BEV fusion feature can be achieved through the following steps:

[0083] Step S600: Input the radio detection device velocity features and dense fusion features into the cross-attention network to obtain the weighted radio detection device velocity features and weighted dense fusion features respectively.

[0084] In step S600 above, the velocity characteristics of the radio detection device, such as the velocity characteristics of millimeter-wave radar, are input into the cross-attention network to obtain weighted velocity characteristics of the radio detection device; simultaneously, the dense fusion features are input into the cross-attention network to obtain weighted dense fusion features. Those skilled in the art will understand that the weighted fusion is performed based on relational weights.

[0085] Step S601: The weighted radio detection device velocity characteristics and the weighted dense fusion characteristics are summed and fused to obtain the second BEV fusion characteristics.

[0086] In step S601 above, the weighted radio detection device velocity characteristics, such as the weighted millimeter-wave radar velocity, and the weighted dense fusion characteristics are summed and fused to obtain the second BEV fusion characteristics.

[0087] As can be seen from Figure 6, in order to obtain the second BEV fusion feature, the radio detection device velocity feature and the dense fusion feature are first each weighted based on the cross-attention network; the two weighted features are then summed and fused, finally yielding the second BEV fusion feature with reduced clutter effect.
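
Steps S600 and S601 can be illustrated with a small sketch in which each BEV feature map is scaled cell by cell and the two results are summed. The weight maps here are hand-picked placeholders; in the method described above they would come from the cross-attention network. All names and values are hypothetical.

```python
def weight_map(feature_map, weights):
    """Scale every cell of a 2D BEV feature map by its relation weight."""
    return [
        [f * w for f, w in zip(f_row, w_row)]
        for f_row, w_row in zip(feature_map, weights)
    ]


def sum_fuse(map_a, map_b):
    """Element-wise sum of two BEV feature maps of the same shape."""
    return [
        [a + b for a, b in zip(a_row, b_row)]
        for a_row, b_row in zip(map_a, map_b)
    ]


# Hypothetical 2x2 BEV maps (one channel each).
dense_fusion = [[1.0, 2.0], [3.0, 4.0]]
radar_velocity = [[0.5, 0.0], [0.0, 2.0]]

# Placeholder relation weights; a zero weight suppresses a cell entirely.
w_dense = [[0.5, 1.0], [1.0, 0.5]]
w_radar = [[0.5, 0.0], [0.0, 0.5]]

second_bev_fusion = sum_fuse(
    weight_map(dense_fusion, w_dense),
    weight_map(radar_velocity, w_radar),
)
```

Cells where the radar weight is low contribute little to the sum, which is the mechanism by which clutter-dominated radar cells would be de-emphasised.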

[0088] In an alternative embodiment, please refer to Figure 7 and Figure 8. Figure 7 is a schematic diagram of BEV fusion feature acquisition provided in an embodiment of this application; Figure 8 is a comparison of features extracted before and after reducing clutter interference, provided in an embodiment of this application. In Figure 7, the inputs are RGB BEV features and RADAR point cloud features; the RGB BEV features are the dense fusion features described above, and the RADAR point cloud features are obtained from point cloud data acquired by millimeter-wave radar. The features extracted before and after processing by the graph network and the cross-attention network are shown in Figure 8, where the left side shows the features extracted before reducing clutter interference and the right side shows the features extracted after reducing clutter interference. Comparing the two sides, the features extracted after reducing clutter interference (right side) contain noticeably less useless and interfering information than those extracted before (left side), and the features highly relevant to the target are more clearly identified and presented.

[0089] In Figure 7, Dynamic Graph is a graph network, and Cross Attention is a cross-attention network. RADAR point cloud features are processed through Dynamic Graph to obtain RADAR BEV features. The k, q, and v in the diagram can be understood as different input features. The RADAR BEV features and RGB BEV features are input into Cross Attention respectively to obtain weighted RADAR BEV features and weighted RGB BEV features. Finally, the weighted RADAR BEV features and weighted RGB BEV features are summed and fused to obtain the fused BEV features.

[0090] As can be seen from Figure 7, existing RGB and millimeter-wave radar multi-modal fusion networks directly extract neighborhood features for fusion without considering the noise introduced by millimeter-wave radar clutter, which reduces detection accuracy. The target detection method provided in this application instead extracts association weights between millimeter-wave radar features and semantic features at the BEV feature level using a graph network and a cross-attention network, and then performs weighted fusion of the two features based on these association weights. Specifically, the original RADAR point cloud first learns the relationships between points through a distance-based dynamic graph network and, after weighted fusion in 3D space, is projected onto the BEV feature map. Subsequently, the RGB-projected BEV feature map and the RADAR BEV feature map are each weighted using the cross-attention network. The weighted RGB BEV features and weighted RADAR BEV features are then summed and fused to obtain the final fused BEV features. In other words, with the target detection method provided in this application, the graph network and the cross-attention network improve the effectiveness of radar feature extraction and reduce clutter interference.

[0091] Please refer to Figure 9, which is a second flowchart of the target detection method provided in the embodiments of this application. Detecting the target to be detected based on BEV fusion features may include the following steps:

[0092] Step S700: Input the BEV fusion feature into the detection head and determine whether the output of the detection head includes detection boxes corresponding to multiple targets to be detected.

[0093] Step S701: If the output of the detection head is determined to include multiple detection boxes corresponding to the targets to be detected, then obtain the confidence scores of all detection boxes.

[0094] In steps S700-S701 above, the BEV fusion feature is input into the detection head, and it is determined whether the output of the detection head includes multiple detection boxes corresponding to the targets to be detected; that is, the output may have only one box or may contain multiple boxes; if it is determined that the output of the detection head includes multiple detection boxes corresponding to the targets to be detected, then the confidence of all detection boxes is obtained.

[0095] Step S702: Take the detection box with the highest confidence among all detection boxes as the detection result of the target to be detected.

[0096] In step S702 above, the detection head predicts the detection box result based on the fused feature input. The detection head is a convolutional network with an output dimension of (x,y,z,l,w,h,yaw,velocity,class,score), which outputs the confidence score of each pixel in the BEV coordinate system. In the case of multiple overlapping boxes, non-maximum suppression is used to filter out the boxes with lower confidence and obtain the box with the highest confidence as the final detection result.
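
The non-maximum suppression step mentioned above can be sketched as a greedy filter. As a simplifying assumption, boxes are treated as axis-aligned rectangles in the BEV plane given by (x, y, l, w), ignoring z, h and yaw; the IoU threshold of 0.5 and all function names are also assumptions.

```python
def bev_iou(a, b):
    """IoU of two axis-aligned BEV boxes given as (cx, cy, length, width)."""
    ax1, ax2 = a[0] - a[2] / 2, a[0] + a[2] / 2
    ay1, ay2 = a[1] - a[3] / 2, a[1] + a[3] / 2
    bx1, bx2 = b[0] - b[2] / 2, b[0] + b[2] / 2
    by1, by2 = b[1] - b[3] / 2, b[1] + b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Keep the highest-confidence box within each group of overlapping boxes."""
    kept = []
    ranked = sorted(detections, key=lambda d: d[1], reverse=True)
    for box, score in ranked:
        # A box survives only if it does not overlap a higher-scoring kept box.
        if all(bev_iou(box, k_box) <= iou_threshold for k_box, _ in kept):
            kept.append((box, score))
    return kept


# Hypothetical detections as (box, confidence).
detections = [
    ((10.0, 5.0, 4.0, 2.0), 0.90),
    ((10.2, 5.1, 4.0, 2.0), 0.75),   # overlaps the first box
    ((30.0, -3.0, 4.0, 2.0), 0.60),  # a separate target
]
kept = nms(detections)
```

Here the second box is suppressed because it overlaps the higher-confidence first box, while the third box is kept because it overlaps nothing. A production implementation would typically use rotated-box IoU to account for yaw.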

[0097] As can be seen from Figure 9, in the embodiments of this application, when detecting the target to be detected based on the BEV fusion features, the BEV fusion features are input into the detection head, which may output one or more detection boxes; if there are multiple overlapping detection boxes, the detection box with the highest confidence is taken as the detection result.

[0098] Please refer to Figure 10, which is a schematic diagram of the target detection system provided in an embodiment of this application. The target detection system 100 includes a depth completion module 110, a BEV fusion feature acquisition module 120, and a detection module 130.

[0099] The depth completion module 110 is used to obtain RGB semantic features based on the RGB image information of the target to be detected; the depth completion module 110 is also used to obtain dense depth information based on the RGB semantic features and point cloud data information of the target to be detected; the depth completion module 110 is also used to fuse the RGB semantic features and dense depth information to obtain dense fused features.

[0100] BEV fusion feature acquisition module 120 is used to obtain BEV fusion features based on dense fusion features, graph networks and / or cross-attention networks.

[0101] The detection module 130 is used to detect the target to be detected based on the BEV fusion features.

[0102] In an optional embodiment, the depth completion module 110 includes an RGB semantic feature acquisition submodule 111 and a dense depth information acquisition submodule 112; the depth completion module 110 acquires dense depth information based on the RGB semantic features of the target to be detected and point cloud data information, including: the RGB semantic feature acquisition submodule 111 inputs the RGB image including the target to be detected into a convolutional neural network to obtain RGB semantic features; the dense depth information acquisition submodule 112 acquires dense depth information based on the point cloud data including the target to be detected and the RGB semantic features.

[0103] In an optional embodiment, the dense depth information acquisition submodule 112 acquires dense depth information based on point cloud data including the target to be detected and RGB semantic features, including: the dense depth information acquisition submodule 112 projects the point cloud data onto the camera coordinate system to obtain a sparse depth map; the dense depth information acquisition submodule 112 inputs the sparse depth map into a convolutional neural network to obtain semi-dense depth information; and the dense depth information acquisition submodule 112 acquires dense depth information based on the semi-dense depth information and RGB semantic features.
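
The first operation of submodule 112, projecting point cloud data to obtain a sparse depth map, can be sketched with a pinhole camera model. The sketch assumes the points are already expressed in the camera frame (z pointing forward) and that the intrinsics fx, fy, cx, cy are known; these parameters, the zero-means-empty convention, and the nearest-depth rule are assumptions not stated in the application.

```python
def sparse_depth_map(points_cam, fx, fy, cx, cy, width, height):
    """Project camera-frame points to pixels, keeping the nearest depth per pixel.

    Pixels that receive no point keep the value 0.0 (treated as "no depth").
    """
    depth = [[0.0] * width for _ in range(height)]
    for x, y, z in points_cam:
        if z <= 0:
            continue  # point is behind the camera
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if 0 <= u < width and 0 <= v < height:
            if depth[v][u] == 0.0 or z < depth[v][u]:
                depth[v][u] = z
    return depth


# Hypothetical radar points in the camera frame (metres) and toy intrinsics.
points_cam = [(0.0, 0.0, 10.0), (2.5, 0.0, 5.0), (0.0, 0.0, -2.0)]
depth = sparse_depth_map(
    points_cam, fx=2.0, fy=2.0, cx=2.0, cy=2.0, width=5, height=5
)
filled = sum(1 for row in depth for d in row if d > 0)
```

Only two of the 25 pixels receive a depth value in this toy example, which illustrates why the sparse map is subsequently passed through a convolutional network and combined with RGB semantic features to obtain dense depth information.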

[0104] In an optional embodiment, the BEV fusion feature includes a first BEV fusion feature. The BEV fusion feature acquisition module 120 obtains the BEV fusion feature based on dense fusion features, a graph network, and / or a cross-attention network by: projecting the dense fusion feature onto the BEV coordinate system; inputting the dense fusion feature into the cross-attention network in the BEV coordinate system to obtain the first BEV fusion feature; wherein the cross-attention network is used to extract the relational weights between features and to weight and fuse the various features.

[0105] In an optional embodiment, the BEV fusion feature further includes a second BEV fusion feature, and the BEV fusion feature acquisition module 120 includes a radio detection device velocity feature acquisition submodule 121. The BEV fusion feature acquisition module 120 obtains the BEV fusion feature based on dense fusion features, a graph network, and / or a cross-attention network, further comprising: the radio detection device velocity feature acquisition submodule 121 inputting point cloud data into a graph network to obtain radio detection device velocity features; wherein, the graph network is used to learn the correlation between points. The BEV fusion feature acquisition module 120 projects the dense fusion feature and the radio detection device velocity feature onto a BEV coordinate system; in the BEV coordinate system, the BEV fusion feature acquisition module 120 performs weighted fusion of the dense fusion feature and the radio detection device velocity feature based on a cross-attention network to obtain a second BEV fusion feature; wherein, the second BEV fusion feature characterizes features including the motion velocity of the target to be detected.

[0106] In an optional embodiment, the BEV fusion feature acquisition module 120 further includes a weighted fusion submodule 122. The BEV fusion feature acquisition module 120 performs weighted fusion of dense fusion features and radio detection device velocity features based on a cross-attention network to obtain a second BEV fusion feature. This includes: the weighted fusion submodule 122 inputting the radio detection device velocity features and dense fusion features into the cross-attention network respectively to obtain weighted radio detection device velocity features and weighted dense fusion features; and the BEV fusion feature acquisition module 120 summing and fusing the weighted radio detection device velocity features and weighted dense fusion features to obtain the second BEV fusion feature.

[0107] In an optional embodiment, the detection module 130 detects the target to be detected based on the BEV fusion features, including: the detection module 130 inputs the BEV fusion features into the detection head, and determines whether the output of the detection head includes multiple detection boxes corresponding to the target to be detected; if it is determined that the output of the detection head includes multiple detection boxes corresponding to the target to be detected, the detection module 130 obtains the confidence scores of all detection boxes; the detection module 130 takes the detection box with the highest confidence score among all detection boxes as the detection result of the target to be detected.

[0108] Please see Figure 11, which is a schematic diagram of the structure of an electronic device provided in an embodiment of this application. An electronic device 300 provided in this application includes a processor 301 and a memory 302. The memory 302 stores machine-readable instructions executable by the processor 301. When the machine-readable instructions are executed by the processor 301, the method described above is performed.

[0109] Based on the same inventive concept, embodiments of this application also provide a computer-readable storage medium storing computer program instructions, which, when read and executed by a processor, perform the steps in any of the above implementations.

[0110] The computer-readable storage medium can be any medium capable of storing program code, such as Random Access Memory (RAM), Read Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM). The storage medium stores the program, and the processor executes the program after receiving an execution instruction. The method executed by the electronic terminal as defined in any embodiment of this invention can be applied to the processor or implemented by the processor.

[0111] In the embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. Furthermore, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Additionally, the displayed or discussed mutual couplings, direct couplings, or communication connections may be through some communication interfaces; indirect couplings or communication connections between devices or units may be electrical, mechanical, or other forms.

[0112] Furthermore, the units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0113] Furthermore, the functional modules in the various embodiments of this application can be integrated together to form an independent part, or each module can exist independently, or two or more modules can be integrated to form an independent part.

[0114] Alternatively, the embodiments can be implemented, wholly or partially, through software, hardware, firmware, or any combination thereof. When implemented using software, they can be implemented, wholly or partially, in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of the present invention are generated.

[0115] The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, digital subscriber line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) means.

[0116] In this document, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, without necessarily requiring or implying any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that includes said element.

[0117] The above description is merely an embodiment of this application and is not intended to limit the scope of protection of this application. Various modifications and variations can be made to this application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the scope of protection of this application.

Claims

1. A target detection method, characterized in that the method includes: obtaining RGB semantic features based on the RGB image information of the target to be detected; obtaining dense depth information based on the RGB semantic features and point cloud data of the target to be detected; fusing the RGB semantic features and the dense depth information to obtain dense fused features; obtaining BEV fusion features based on the dense fusion features, a graph network, and a cross-attention network; and detecting the target to be detected based on the BEV fusion features; wherein the BEV fusion feature further includes a second BEV fusion feature, and obtaining the BEV fusion feature based on the dense fusion feature, graph network, and cross-attention network includes: inputting the point cloud data into the graph network to obtain the velocity features of the radio detection device, wherein the graph network is used to learn the correlation between points; projecting the dense fusion feature and the radio detection device velocity feature onto the BEV coordinate system; and, in the BEV coordinate system, performing weighted fusion of the dense fusion feature and the radio detection device velocity feature based on the cross-attention network to obtain the second BEV fusion feature, wherein the second BEV fusion feature characterizes features including the motion velocity of the target to be detected; and wherein the weighted fusion of the dense fusion feature and the radio detection device velocity feature based on the cross-attention network to obtain the second BEV fusion feature includes: inputting the radio detection device velocity features and the dense fusion features into the cross-attention network respectively to obtain the weighted radio detection device velocity features and the weighted dense fusion features; and summing and fusing the weighted radio detection device velocity feature and the weighted dense fusion feature to obtain the second BEV fusion feature.

2. The method according to claim 1, characterized in that the step of obtaining dense depth information based on the RGB semantic features and point cloud data information of the target to be detected includes: inputting the RGB image including the target to be detected into a convolutional neural network to obtain the RGB semantic features; and obtaining dense depth information based on point cloud data including the target to be detected and the RGB semantic features.

3. The method according to claim 2, characterized in that the step of obtaining dense depth information based on point cloud data including the target to be detected and the RGB semantic features includes: projecting the point cloud data onto the camera coordinate system to obtain a sparse depth map; inputting the sparse depth map into a convolutional neural network to obtain semi-dense depth information; and obtaining the dense depth information based on the semi-dense depth information and the RGB semantic features.

4. The method according to claim 1, characterized in that the BEV fusion feature includes a first BEV fusion feature, and obtaining the BEV fusion feature based on the dense fusion feature, graph network, and / or cross-attention network includes: projecting the densely fused features onto the BEV coordinate system; and, in the BEV coordinate system, inputting the densely fused features into a cross-attention network to obtain the first BEV fused features; wherein the cross-attention network is used to extract the relational weights between features and to weight and fuse the features.

5. The method according to claim 1, characterized in that the step of detecting the target based on the BEV fusion features includes: inputting the BEV fusion feature into the detection head and determining whether the output of the detection head includes multiple detection boxes corresponding to the targets to be detected; if it is determined that the output of the detection head includes multiple detection boxes corresponding to the targets to be detected, obtaining the confidence scores of all detection boxes; and taking the detection box with the highest confidence among all detection boxes as the detection result of the target to be detected.

6. A target detection system, characterized in that the target detection system includes: a depth completion module, a BEV fusion feature acquisition module, and a detection module; the depth completion module is used to obtain RGB semantic features based on the RGB image information of the target to be detected; the depth completion module is also used to obtain dense depth information based on the RGB semantic features and point cloud data information of the target to be detected; the depth completion module is also used to fuse the RGB semantic features and the dense depth information to obtain dense fused features; the BEV fusion feature acquisition module is used to obtain BEV fusion features based on the dense fusion features, graph network, and cross-attention network; the detection module is used to detect the target to be detected based on the BEV fusion features; the BEV fusion feature acquisition module is specifically used to: input the point cloud data into the graph network to obtain the radio detection device velocity features, wherein the graph network is used to learn the correlation between points; project the dense fusion features and the radio detection device velocity features onto the BEV coordinate system; in the BEV coordinate system, perform weighted fusion of the dense fusion features and the radio detection device velocity features based on the cross-attention network to obtain a second BEV fusion feature, wherein the second BEV fusion feature represents features including the motion velocity of the target to be detected; input the radio detection device velocity features and the dense fusion features into the cross-attention network respectively to obtain weighted radio detection device velocity features and weighted dense fusion features respectively; and sum and fuse the weighted radio detection device velocity features and the weighted dense fusion features to obtain the second BEV fusion feature.

7. An electronic device, characterized in that the electronic device includes a memory and a processor; the memory stores program instructions, and when the processor executes the program instructions, it performs the steps of the method according to any one of claims 1-5.

8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer program instructions that, when executed by a processor, perform the steps of the method according to any one of claims 1-5.