Target detection method and device, electronic equipment and machine readable storage medium
By voxelizing and clustering point cloud data, and generating target detection boxes using a cluster-based Transformer structure, the problem of sparse feature destruction in 3D target detection by Transformer is solved, thereby improving detection efficiency and accuracy.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HANGZHOU HIKVISION DIGITAL TECHNOLOGY CO LTD
- Filing Date
- 2023-05-22
- Publication Date
- 2026-06-23
Figure CN116665200B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of data processing technology, and in particular to a target detection method, apparatus, electronic device, and machine-readable storage medium.
Background Technology
[0002] The Transformer (a network architecture) has achieved excellent performance in 2D detection tasks and has surpassed many CNN (Convolutional Neural Network)-based detection algorithms. Compared to 2D tasks, due to the irregular, unstructured, and sparse characteristics of LiDAR point cloud data, the Transformer architecture can show greater potential in 3D detection tasks.
Summary of the Invention
[0003] In view of this, this application provides a target detection method, apparatus, electronic device, and machine-readable storage medium to optimize target detection performance.
[0004] Specifically, this application is implemented through the following technical solution:
[0005] According to a first aspect of the embodiments of this application, a target detection method is provided, comprising:
[0006] The input point cloud data is voxelized, and voxelized features are extracted to obtain the voxel features of the input point cloud data.
[0007] Based on the voxel features, the non-empty voxels corresponding to the input point cloud data are classified as foreground and background, and the offset of each foreground voxel relative to its target center point is determined.
[0008] Based on the offset of each foreground voxel relative to its target center point, the foreground voxels are clustered to obtain the clustered target clusters;
[0009] Based on the target cluster, a target detection bounding box is generated using a cluster-based target detection structure.
[0010] According to a second aspect of the embodiments of this application, a target detection device is provided, comprising:
[0011] The voxel feature extraction unit is used to voxelize the input point cloud data and extract voxel features to obtain the voxel features of the input point cloud data.
[0012] A classification unit is used to classify the non-empty voxels corresponding to the input point cloud data into foreground and background based on the voxel features.
[0013] The determination unit is used to determine the offset of each foreground voxel relative to its target center point;
[0014] Clustering units are used to cluster foreground voxels based on the offset of each foreground voxel relative to its target center point, so as to obtain the clustered target clusters.
[0015] The target detection unit is used to generate a target detection box based on the target cluster and using a cluster-based target detection structure.
[0016] According to a third aspect of the embodiments of this application, an electronic device is provided, including a processor and a memory, the memory storing machine-executable instructions executable by the processor, the processor being configured to execute the machine-executable instructions to implement the method provided in the first aspect.
[0017] According to a fourth aspect of the embodiments of this application, a machine-readable storage medium is provided, wherein machine-executable instructions are stored therein, and when the machine-executable instructions are executed by a processor, the method provided in the first aspect is implemented.
[0018] The technical solution provided in this application can bring at least the following beneficial effects:
[0019] By voxelizing the input point cloud data and extracting voxel features, voxel features of the input point cloud data are obtained. Based on the obtained voxel features, the non-empty voxels corresponding to the input point cloud data are classified as foreground and background, and the offset of each foreground voxel relative to its target center point is determined. Based on the offset of each foreground voxel relative to its target center point, the foreground voxels are clustered to obtain clustered target clusters. Then, based on the target clusters, target detection boxes are generated using a cluster-based target detection structure. This avoids BEV space transformation of features, thereby reducing the introduction of empty features, effectively avoiding the destruction of feature sparsity, improving target detection efficiency, and improving target detection performance.
Attached Figure Description
[0020] Figure 1 is a flowchart illustrating an exemplary embodiment of the target detection method of this application;
[0021] Figure 2 is a schematic diagram illustrating a process for generating a target detection box, as shown in an exemplary embodiment of this application;
[0022] Figure 3 is a schematic diagram of the main framework of a cluster-based Transformer 3D target detector shown in an exemplary embodiment of this application;
[0023] Figure 4 is a schematic diagram illustrating a decoder structure according to an exemplary embodiment of this application;
[0024] Figure 5 is a schematic diagram of the structure of a target detection device shown in an exemplary embodiment of this application;
[0025] Figure 6 is a schematic diagram of the hardware structure of an electronic device illustrated in an exemplary embodiment of this application.
Detailed Implementation
[0026] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.
[0027] The terminology used in this application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The singular forms “a,” “said,” and “the” used in this application and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise.
[0028] To enable those skilled in the art to better understand the technical solutions provided in the embodiments of this application, and to make the above-mentioned objectives, features and advantages of the embodiments of this application more apparent and understandable, the technical solutions in the embodiments of this application will be further described in detail below with reference to the accompanying drawings.
[0029] It should be noted that the sequence number of each step in the embodiments of this application does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.
[0030] Please refer to Figure 1, which is a flowchart illustrating a target detection method provided in an embodiment of this application. As shown in Figure 1, the target detection method may include the following steps:
[0031] Step S100: Voxelize the input point cloud data and extract voxelized features to obtain the voxel features of the input point cloud data.
[0032] For example, the input point cloud data may include radar point cloud data, which may include, but is not limited to, point cloud data acquired by radars (such as lidar) deployed in fields such as autonomous driving or intelligent transportation.
[0033] For example, in autonomous driving scenarios, the input point cloud data can be the point cloud data obtained by the vehicle-mounted LiDAR.
[0034] Here, voxels are a regularized representation of point clouds.
[0035] For example, a 3D sparse convolutional backbone with a symmetrical structure can be used to extract voxel features from point cloud data.
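As a rough illustration of the voxelization in step S100, the sketch below groups points by integer grid index and averages each group. The function name `voxelize`, the voxel size, and the mean pooling of point features are assumptions made for illustration; the 3D sparse convolutional backbone that the application uses to extract voxel features is not reproduced here.

```python
import numpy as np

def voxelize(points, voxel_size):
    """Group points (n, c) into non-empty voxels; the first 3 columns are x, y, z.

    Returns integer voxel coordinates (n', 3) and mean-pooled features (n', c).
    Mean pooling is an illustrative stand-in for a learned feature extractor.
    """
    coords = np.floor(points[:, :3] / voxel_size).astype(np.int64)
    # Unique rows are the non-empty voxels; inverse maps each point to its voxel.
    voxel_coords, inverse = np.unique(coords, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    sums = np.zeros((len(voxel_coords), points.shape[1]))
    np.add.at(sums, inverse, points)  # accumulate point features per voxel
    counts = np.bincount(inverse, minlength=len(voxel_coords))
    return voxel_coords, sums / counts[:, None]

# Three points with one extra feature channel; the first two share a voxel.
points = np.array([
    [0.1, 0.1, 0.1, 1.0],
    [0.3, 0.2, 0.4, 3.0],
    [1.2, 0.1, 0.1, 5.0],
])
voxel_coords, voxel_feats = voxelize(points, voxel_size=0.5)
```

Only non-empty voxels are materialized, which is what keeps the representation sparse.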
[0036] Step S110: Based on the voxel features, classify the non-empty voxels corresponding to the input point cloud data into foreground and background, and determine the offset of each foreground voxel relative to its target center point.
[0037] For example, the offset of a foreground voxel relative to its target center point (i.e., the center point of the target to which the foreground voxel belongs) can be predicted using a pre-trained neural network model.
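A minimal sketch of step S110 follows, assuming single linear layers with hypothetical weights (`W_cls`, `b_cls`, `W_off`, `b_off`) in place of the pre-trained network, and a sigmoid score thresholded at 0.5. None of these specifics come from the application; they only show the data flow from voxel features to a foreground mask and per-voxel offsets.

```python
import numpy as np

def classify_and_predict_offsets(voxel_feats, W_cls, b_cls, W_off, b_off, threshold=0.5):
    """Return a foreground mask (n',) and 3D center offsets for the foreground voxels."""
    logits = voxel_feats @ W_cls + b_cls       # one foreground logit per voxel
    prob = 1.0 / (1.0 + np.exp(-logits))       # sigmoid score
    foreground = prob > threshold
    # Offsets are only needed for voxels classified as foreground.
    offsets = voxel_feats[foreground] @ W_off + b_off
    return foreground, offsets

# Toy features and hand-picked weights (purely illustrative).
voxel_feats = np.array([[2.0, 0.0], [-2.0, 0.0], [1.0, 1.0]])
W_cls, b_cls = np.array([1.0, 0.0]), 0.0
W_off, b_off = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]), np.zeros(3)
foreground, offsets = classify_and_predict_offsets(voxel_feats, W_cls, b_cls, W_off, b_off)
```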
[0038] Step S120: Based on the offset of each foreground voxel relative to its target center point, cluster the foreground voxels to obtain the clustered target cluster.
[0039] In this embodiment of the application, after determining the offset of each foreground voxel relative to its target center point, the foreground voxels can be clustered according to the offset of each foreground voxel relative to its target center point to obtain the clustered target cluster, and foreground voxels that may belong to the same target can be clustered into one target cluster.
[0040] Step S130: Based on the target cluster, generate the target detection box using the cluster-based target detection structure.
[0041] In this embodiment of the application, after the clustered target clusters are obtained in the manner described in the above embodiments, the target detection box can be generated using the cluster-based target detection structure. Thus, the target detection box can be generated without transforming the features into BEV (Bird's Eye View) space.
[0042] It can be seen that, in the method illustrated in Figure 1, the input point cloud data is voxelized and voxelized features are extracted to obtain voxel features. Based on the obtained voxel features, the non-empty voxels corresponding to the input point cloud data are classified as foreground and background, and the offset of each foreground voxel relative to its target center point is determined. Based on the offset of each foreground voxel relative to its target center point, the foreground voxels are clustered to obtain clustered target clusters. Then, based on the target clusters, target detection boxes are generated using a cluster-based target detection structure. This avoids BEV space transformation of features, thereby reducing the introduction of empty features, effectively avoiding the destruction of feature sparsity, improving target detection efficiency, and enhancing target detection performance.
[0043] In some embodiments, the clustering of foreground voxels based on the offset of each foreground voxel relative to its target center point may include:
[0044] For any foreground voxel, based on the offset of the foreground voxel relative to the target center point, the foreground voxel is moved to a position closer to its target center point;
[0045] The moved foreground voxels are converted to a BEV map, and the target center point is determined based on the number of foreground voxels in each grid of the BEV map.
[0046] For any moved foreground voxel, based on the distance between the moved foreground voxel and each target center point, the moved foreground voxel and the nearest target center point are assigned to the same target cluster.
[0047] For example, given that the offset of each foreground voxel relative to its target center point is determined, for any given foreground voxel, the foreground voxel can be moved to a position closer to its target center point based on the offset of the foreground voxel relative to its target center point.
[0048] For example, the offset can be a three-dimensional vector, and moving a foreground voxel closer to its target center point based on the offset can be achieved by adding the offset to the three-dimensional coordinates of the foreground voxel.
[0049] The moved foreground voxels can be converted to a BEV map (coordinate projection), and the target center point can be determined based on the number of foreground voxels in each grid (pixel, or grid) in the BEV map.
[0050] For example, the grid with the most foreground voxels in a local area can be determined as the target center point based on the number of foreground voxels in each grid.
[0051] For example, we can traverse each grid. For any grid (which can be called the target grid), we determine the number of foreground voxels in each grid within a 3*3 region centered on the target grid. If the number of foreground voxels in the target grid is the largest, we determine the target grid as the target center point; otherwise, we continue to traverse the grid.
[0052] Once the target center point is determined, for any moved foreground voxel, based on the distance between the moved foreground voxel and each target center point, the moved foreground voxel is assigned to the same target cluster as the nearest target center point.
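The clustering steps above (shift by offset, count per BEV grid cell, pick local maxima in a 3×3 window, assign to the nearest center) can be sketched as follows. The function name, the grid size, the BEV extent starting at the origin, and the use of the grid-cell midpoint as the center coordinate are assumptions for illustration. Tied neighboring cells would each be reported as a center in this simplified version.

```python
import numpy as np

def cluster_by_center_voting(voxel_xyz, offsets, grid_size, bev_shape):
    """Cluster foreground voxels via shifted positions and a BEV count map.

    voxel_xyz, offsets: (n, 3) arrays; bev_shape: (H, W) in grid cells.
    Assumes every shifted x, y lies inside a BEV extent that starts at 0.
    """
    moved = voxel_xyz + offsets  # shift each voxel toward its target center
    ix = np.floor(moved[:, 0] / grid_size).astype(int)
    iy = np.floor(moved[:, 1] / grid_size).astype(int)
    counts = np.zeros(bev_shape, dtype=int)
    np.add.at(counts, (ix, iy), 1)  # number of moved voxels per grid cell

    # A cell is a center if it is non-empty and the maximum of its 3x3 window.
    padded = np.pad(counts, 1)
    centers = []
    for i, j in zip(*np.nonzero(counts)):
        window = padded[i:i + 3, j:j + 3]
        if counts[i, j] == window.max():
            centers.append(((i + 0.5) * grid_size, (j + 0.5) * grid_size))
    centers = np.array(centers)

    # Assign every moved voxel to the nearest center in the BEV plane.
    dists = np.linalg.norm(moved[:, None, :2] - centers[None], axis=2)
    return moved, centers, dists.argmin(axis=1)

voxel_xyz = np.array([[0.5, 1.0, 0.0], [1.5, 1.0, 0.0], [1.0, 0.5, 0.0],
                      [5.5, 5.0, 0.0], [4.5, 5.0, 0.0]])
offsets = np.array([[0.5, 0.0, 0.0], [-0.5, 0.0, 0.0], [0.0, 0.5, 0.0],
                    [-0.5, 0.0, 0.0], [0.5, 0.0, 0.0]])
moved, centers, labels = cluster_by_center_voting(voxel_xyz, offsets, 1.0, (8, 8))
```

In this toy input the first three voxels vote for one cell and the last two for another, so two clusters result.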
[0053] It should be noted that, in the embodiments of this application, the clustering of foreground voxels is not limited to the above-described method. For example, based on the distance between the moved foreground voxels, two moved foreground voxels with a distance less than a preset distance threshold can be added to the same connected component, and foreground voxels belonging to the same connected component can be assigned to the same target cluster.
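The connected-component alternative mentioned above can be sketched with a small union-find structure. The function name and the brute-force pairwise distance matrix are illustrative assumptions; a practical implementation would likely use a spatial index to avoid the quadratic cost.

```python
import numpy as np

def cluster_by_connected_components(moved_xyz, dist_threshold):
    """Label moved voxels so that pairs closer than dist_threshold share a label."""
    n = len(moved_xyz)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    dists = np.linalg.norm(moved_xyz[:, None] - moved_xyz[None], axis=2)
    # Merge the components of every pair (i < j) within the threshold.
    for i, j in zip(*np.nonzero(np.triu(dists < dist_threshold, k=1))):
        parent[find(i)] = find(j)

    roots = np.array([find(i) for i in range(n)])
    # Relabel roots as consecutive cluster IDs 0..k-1.
    return np.unique(roots, return_inverse=True)[1]

moved_xyz = np.array([[0.0, 0.0, 0.0], [0.3, 0.0, 0.0],
                      [0.6, 0.0, 0.0], [5.0, 5.0, 0.0]])
labels = cluster_by_connected_components(moved_xyz, dist_threshold=0.5)
```

Note that the first and third voxels are farther apart than the threshold but still end up in one cluster, because they are linked through the second voxel.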
[0054] In some embodiments, the target detection structure described above is a Transformer structure.
[0055] As shown in Figure 2, in step S130, generating a target detection box based on the target cluster using a cluster-based target detection structure may include:
[0056] Step S131: Encode the center coordinates of the target cluster to obtain the initial Query;
[0057] Step S132: Decode the initial Query using a cluster-based Transformer decoder structure to obtain the target detection box.
[0058] For example, the target detection structure described above is a Transformer structure.
[0059] Once the target clusters are obtained after clustering, the initial query (query vector) can be obtained by encoding the center coordinates of the target clusters.
[0060] For example, for any target cluster, the center coordinates of the target cluster can be the average of the coordinates of each moved foreground voxel in the target cluster.
[0061] For example, the center coordinates of the target cluster can be encoded using a fully connected layer to obtain the initial query.
[0062] For example, a specific encoding function can be two fully connected layers, represented as follows:
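[0063] $Q_i^0 = F_2(F_1(P_i)), \quad i = 1, \ldots, N$
[0064] Where $F_1$ and $F_2$ represent two fully connected layers, $P_i$ is the center coordinates of the i-th target cluster, $Q_i^0 \in \mathbb{R}^{C}$ represents the i-th initial Query, N is the number of target clusters, and C is the number of feature channels of the Query.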
[0065] For example, C can be set to 128.
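The Query initialization can be sketched as below. The weights are random placeholders for the trained layers F1 and F2, and the ReLU between the two layers is an assumption: the text only states that two fully connected layers are used. The function name `init_queries` is likewise hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
C = 128  # number of Query feature channels, as in the example above

# Hypothetical, randomly initialized weights standing in for trained F1 and F2.
W1, b1 = rng.normal(size=(3, C)) * 0.1, np.zeros(C)
W2, b2 = rng.normal(size=(C, C)) * 0.1, np.zeros(C)

def init_queries(moved_xyz, labels):
    """Encode each cluster's mean coordinate into an initial Query of size C."""
    num_clusters = labels.max() + 1
    # Cluster center = average of the moved foreground voxels in the cluster.
    centers = np.stack([moved_xyz[labels == k].mean(axis=0)
                        for k in range(num_clusters)])
    hidden = np.maximum(centers @ W1 + b1, 0.0)  # F1 followed by an assumed ReLU
    return centers, hidden @ W2 + b2             # F2

moved_xyz = np.array([[1.0, 1.0, 0.0], [1.2, 0.8, 0.2], [5.0, 5.0, 1.0]])
labels = np.array([0, 0, 1])
centers, Q0 = init_queries(moved_xyz, labels)
```

One Query is produced per target cluster, so the number of Queries adapts to the scene rather than being a fixed hyperparameter.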
[0066] Given the initial query corresponding to each target cluster, the cluster-based Transformer decoder structure can be used to decode the initial query to obtain the target detection box.
[0067] As can be seen, in this embodiment of the application, the initialization of the Query is obtained by encoding the center coordinates of the target cluster. Since the target cluster is obtained by clustering foreground voxels that may be the same target, a target cluster can better represent a possible target. The initial Query obtained in this way has explicit physical meaning, which is beneficial to the convergence of the model.
[0068] In one example, each decoder in the cluster-based Transformer architecture includes a mutual attention layer and a self-attention layer;
[0069] The above-mentioned decoding of the initial query using a cluster-based Transformer decoder structure includes:
[0070] For any decoder, the input query is updated based on the key and value features of the target cluster to which it belongs; where the key and value features are the voxel features of the foreground voxels; in the cluster-based Transformer structure, the input of the first decoder is the initial query, and the input of non-first decoders is the updated query output by the previous decoder.
[0071] Based on the updated Query output by the last decoder in the cluster-based Transformer structure, a target detection box is generated.
[0072] For example, in a cluster-based Transformer architecture, each decoder includes a mutual attention layer and a self-attention layer.
[0073] To give the query more information about the target detection box, the initial query can be updated using mutual attention and self-attention mechanisms.
[0074] Because the initial Query usually correlates only weakly with the actual target, it carries little information about the target detection box, and updating it through the self-attention mechanism alone yields too little useful information. Therefore, in any decoder, the Query can first be updated through the mutual attention mechanism, using the Key and Value features of the target cluster to which the input Query belongs, in order to obtain more information about the target detection box. The self-attention mechanism can then be used to update the Query further.
[0075] For example, for the mutual attention mechanism, the voxel features of the foreground voxels can be used as the key features and value features (the key features and value features are the same), and the query can be used to interact with the key features and value features within the same target cluster to obtain the information required for the target detection box.
[0076] Since the initial Query is determined based on the center coordinates of the target cluster, for the Query corresponding to any target cluster, the Key features and Value features of each foreground voxel in that target cluster belong to the same target cluster as that Query.
[0077] It should be noted that the specific implementation of updating the Query using the self-attention mechanism can be found in the relevant descriptions of traditional self-attention mechanism implementation schemes, and will not be elaborated upon in this application embodiment.
[0078] As an example, updating the input query based on the key and value features belonging to the same target cluster as the input query can include:
[0079] The input query is updated according to the following formula:
[0080] $Q^{l} = \mathrm{softmax}\left(\frac{Q^{l-1} K^{\top}}{D} + A\right) V$
[0081] Where D is the normalization constant, $Q^{l}$ is the updated Query, $Q^{l-1}$ is the Query before the update, K is the Key feature, V is the Value feature, and A is the attention mask. When the Query and Key features belong to the same target cluster, A is 0. When the Query and Key features belong to different target clusters, A is negative infinity.
[0082] For example, to improve the efficiency of query updates, attention masks can be used to update the query, limiting the interaction between the query and the key and value features to the same target cluster.
[0083] For example, when the Query and Key features belong to the same target cluster, the attention mask A is 0. When the Query and Key features belong to different target clusters, the attention mask A is negative infinity (which can be denoted as -inf). Therefore, based on the above formula, the Query corresponding to each target cluster can be updated in parallel, and the interaction range between the Query and Key features and Value features can be effectively limited to the same target cluster.
[0084] It should be noted that, in the embodiments of this application, the aforementioned introduction of an attention mask can enable parallel updates of the queries corresponding to each target cluster, improving the efficiency of attention computation while ensuring that the interaction range between the query and the key and value features is limited to the same target cluster. However, in practical applications, for any query, it is also possible to first traverse and determine the key and value features belonging to the same target cluster as the query, and then update the query based on the key and value features belonging to the same target cluster. In this implementation, the aforementioned attention mask may not be necessary.
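A minimal sketch of the masked mutual attention update follows. It assumes an additive mask inside a softmax, a single attention head with no learned projections, and `D = sqrt(C)` as the normalization constant; the text only says that D is a normalization constant, so that choice is an assumption. Every Query is assumed to have at least one Key in its cluster.

```python
import numpy as np

def masked_mutual_attention(Q, K, V, query_ids, key_ids, D):
    """Update Queries (N, C) from Keys/Values (m, C) of the same cluster only."""
    # A[i, j] = 0 when Query i and Key j share a cluster ID, otherwise -inf.
    A = np.where(query_ids[:, None] == key_ids[None, :], 0.0, -np.inf)
    logits = Q @ K.T / D + A
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)                     # masked entries become exactly 0
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = V = rng.normal(size=(5, 4))        # Key and Value features are the same
query_ids = np.array([0, 1])
key_ids = np.array([0, 0, 0, 0, 1])    # the last voxel is alone in cluster 1
Q_new = masked_mutual_attention(Q, K, V, query_ids, key_ids, D=np.sqrt(4))
```

Because all Queries are processed in one matrix product, the mask gives the parallel update described above while keeping each Query blind to voxels of other clusters.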
[0085] In one example, a cluster-based Transformer decoder structure includes multiple decoders;
[0086] The target detection method provided in this application embodiment may further include:
[0087] For any decoder other than the last decoder, a coarse target detection box is generated based on the updated Query output by that decoder.
[0088] Based on the center point of the coarsely selected target detection box, the foreground voxels are re-clustered to obtain the updated target cluster.
[0089] For any decoder, updating the input query based on the key and value features belonging to the same target cluster as the input query can include:
[0090] For decoders other than the first one, the input Query is updated based on the Key and Value features that belong to the same updated target cluster as the input Query.
[0091] For example, in order to improve the accuracy of cluster division and thus the accuracy of object detection, for any decoder other than the last decoder, an object detection box (which can be called a coarse object detection box) can be generated based on the updated query output by that decoder. Then, based on the center point of the coarse object detection box, the foreground voxels are re-clustered to obtain the updated object cluster.
[0092] For example, the moved foreground voxel can be assigned to the nearest center point (i.e., belong to the same target cluster as the center point) based on the distance between the moved foreground voxel and the center point of each coarsely selected target detection box.
[0093] Accordingly, during the process of updating the Query based on the Key features and Value features, for decoders other than the first decoder, the input Query can be updated based on the Key features and Value features of the same updated target cluster as the input Query.
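The re-clustering step can be sketched as a nearest-center assignment. The function name is hypothetical, and the sketch assumes that row k of `coarse_box_centers` is the center predicted from the k-th Query, so the returned index doubles as the updated cluster ID.

```python
import numpy as np

def recluster(moved_xyz, coarse_box_centers):
    """Assign each moved foreground voxel (m, 3) to the nearest coarse box center (N, 3)."""
    dists = np.linalg.norm(
        moved_xyz[:, None, :] - coarse_box_centers[None, :, :], axis=2)
    return dists.argmin(axis=1)  # updated cluster ID per voxel

moved_xyz = np.array([[1.0, 1.0, 0.0], [1.2, 0.9, 0.0], [5.0, 5.1, 0.0]])
coarse_box_centers = np.array([[5.0, 5.0, 0.0], [1.0, 1.0, 0.0]])
updated_ids = recluster(moved_xyz, coarse_box_centers)
```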
[0094] In one example, a cluster-based Transformer decoder structure includes multiple decoders.
[0095] The target detection method provided in this application embodiment may further include:
[0096] For any decoder other than the last decoder, update the key and value features within the same target cluster based on the updated query output by that decoder;
[0097] For any decoder, updating the input query based on the key and value features belonging to the same target cluster as the input query can include:
[0098] For decoders other than the first one, the input query is updated based on the updated key and value features that belong to the same target cluster as the input query.
[0099] For example, considering that the information contained in the Query feature includes information learned through interaction with the Key and Value features in the mutual attention mechanism, the expressive power of the Key and Value features determines whether the Query feature can learn enough information to predict the detection box.
[0100] Therefore, in order to enhance the expressive power of Key and Value features in the mutual attention layer and further improve the detection performance of the model, the Key and Value features of the mutual attention mechanism no longer remain unchanged, but can be updated according to the Query.
[0101] Correspondingly, for any decoder other than the last one, the key features and value features within the same target cluster are updated based on the updated query output by that decoder. Thus, the key features and value features of the mutual attention mechanism can also be iteratively updated in different decoders, effectively enhancing the expressive power of the key features and value features.
[0102] Furthermore, for decoders other than the first one, the input query can be updated based on the updated key and value features that belong to the same target cluster as the input query.
[0103] As an example, updating the key and value features within the same target cluster based on the updated query output by the decoder can include:
[0104] For Key features that belong to the same target cluster as the updated Query, the updated Query and the current Key features are concatenated, and the concatenated features are fine-tuned using a fully connected layer to obtain the updated Key features; wherein, the updated Value features are the same as the updated Key features.
[0105] For example, for any Key feature, the updated Key feature can be obtained by concatenating the Query within the target cluster to which the Key feature belongs with the Key feature, and then fine-tuning the concatenated feature using a fully connected layer.
[0106] Since the key and value features are the same, the value feature is also updated accordingly.
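The Key update described above can be sketched as a concatenation followed by one fully connected layer. The function name `query2key_update` and the weights `W` and `b` are hypothetical; the toy weights below are chosen so that the result is simply Query plus Key, which makes the behavior easy to check by hand.

```python
import numpy as np

def query2key_update(K, Q, key_ids, W, b):
    """Refresh Key features (m, C) using the Query of their own cluster.

    Q: (N, C) updated Queries, row k belonging to cluster k.
    key_ids: (m,) cluster ID of each Key.
    W: (2C, C), b: (C,) hypothetical weights of the fully connected layer.
    """
    fused = np.concatenate([Q[key_ids], K], axis=1)  # (m, 2C) concatenated features
    K_new = fused @ W + b
    return K_new, K_new  # the Value features stay identical to the Key features

C = 4
rng = np.random.default_rng(0)
K = rng.normal(size=(5, C))
Q = rng.normal(size=(2, C))
key_ids = np.array([0, 0, 0, 1, 1])
W = np.vstack([np.eye(C), np.eye(C)])  # toy weights: output = Query + Key
b = np.zeros(C)
K_new, V_new = query2key_update(K, Q, key_ids, W, b)
```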
[0107] To enable those skilled in the art to better understand the technical solutions provided in the embodiments of this application, the technical solutions provided in the embodiments of this application are described below with reference to specific examples.
[0108] In this embodiment, a query-based detection algorithm is used to achieve target detection, that is, a set of query vectors are initialized, and the Transformer structure is used to decode the query to obtain the final detection result.
[0109] To address the problems existing in traditional query-based object detection algorithms, this embodiment proposes a cluster-based Transformer structure for 3D object detection.
[0110] As shown in Figure 3, the main framework of the cluster-based Transformer 3D object detector can include two parts: cluster generation (left half of the figure) and cluster-based Transformer structure (right half of the figure).
[0111] To generate clusters, the input point cloud data can be voxelized first, and a 3D sparse convolution backbone with a symmetric structure can be used to extract voxel features. Then, foreground and background classification is performed on all non-empty voxels to obtain foreground voxels. The target center point offset is predicted for all foreground voxels. A pseudo heatmap of the center point is generated using these predicted offsets, and the foreground voxels are clustered to obtain different target clusters.
[0112] Having obtained the target cluster in the manner described above, the proposed cluster-based Transformer structure is used to generate detection boxes.
[0113] For example, the obtained center coordinates of the target cluster can be used to initialize the query, and several decoder layers can be used to decode the query to obtain the detection results. Furthermore, the Query2Key strategy can be used to enhance the expressive power of the key features to improve network performance.
[0114] The implementation details of the above implementation principles will be explained below.
[0115] 1. Cluster generation
[0116] The target can be viewed as multiple different target clusters (also called instance clusters), and the final detection result can be obtained based on these target clusters. Therefore, the initial target clusters can be obtained first.
[0117] An example of obtaining clusters is as follows. The input point cloud is denoted $p \in \mathbb{R}^{n \times c}$, where n is the number of points and c is the number of feature channels of each point.
[0118] The point cloud is voxelized, and a backbone consisting of 3D sparse convolutions is used to extract sparse voxel features $V \in \mathbb{R}^{n' \times c'}$, where n′ and c′ represent the number of voxels and the number of voxel feature channels, respectively.
[0119] Based on the obtained voxel features V, a multilayer perceptron is used to classify non-empty voxels as foreground and background. Furthermore, the offset of each foreground voxel relative to its target center point is predicted. Using the predicted offset, the foreground voxels are moved closer to the target center point. Finally, the moved foreground voxels are projected onto the BEV view to obtain $I \in \mathbb{R}^{H \times W \times C}$, where H, W, and C represent the height, width, and number of categories of the BEV map, respectively.
[0120] In the BEV map I, the value of each grid represents the number of foreground voxels that fall within that grid after being offset. The larger the number within a grid, the more likely the target's center point is to fall within that grid. Therefore, I can be considered a pseudo-heatmap of the center points. Points with higher responses in the map (such as the grid with the largest number of foreground voxels in a local area, like a 3×3 region) represent potential target center points. Based on the distances between the multiple target center points obtained in I and each moved foreground voxel, each moved foreground voxel can be assigned to the nearest center point, thus obtaining different target clusters $C = \{c_1, c_2, \ldots, c_j, \ldots, c_k\}$, and a unique ID is assigned to each cluster for differentiation.
[0121] 2. Cluster-based Transformer Decoder Structure
[0122] When multiple target clusters are obtained as described above, a cluster-based Transformer structure can be used to generate detection results, avoiding the need for BEV space transformation of features and maintaining feature sparsity. The designed Transformer structure specifically includes a cluster-based Query initialization and a cluster-based Transformer Decoder structure.
[0123] 2.1 Cluster-based Query Initialization
[0124] Based on the obtained multiple target clusters, the coordinates of the cluster center $P_i = (x_i, y_i, z_i)$ are obtained by averaging the coordinates of the moved foreground voxels in the same target cluster. The coordinates of the cluster center can be approximated as the coordinates of the target center point.
[0125] The obtained cluster center coordinates can be encoded to obtain the initial query. This initialization method can ensure that the initial query has explicit physical meaning, which is beneficial to the convergence of the model.
[0126] For example, the specific encoding function used can be two fully connected layers, as shown below:
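[0127] $Q_i^0 = F_2(F_1(P_i)), \quad i = 1, \ldots, N$
[0128] Where $F_1$ and $F_2$ represent two fully connected layers, $P_i$ is the center coordinates of the i-th target cluster, $Q_i^0 \in \mathbb{R}^{C}$ represents the i-th initial Query, N is the number of target clusters, and C is the number of feature channels of the Query.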
[0129] For example, C can be set to 128.
[0130] 2.2 Cluster-based Transformer decoder structure
[0131] Once a meaningful initial Query has been obtained, the cluster-based Transformer decoder structure can be used to decode the initial Query to obtain the target detection box.
[0132] The cluster-based Transformer decoder structure can be as shown in Figure 4. Each decoder can include a masked mutual attention layer and a self-attention layer.
[0133] Considering that the initial Query usually correlates only weakly with the actual target, it usually carries little information about the actual target detection box. Therefore, unlike the standard decoder structure, the decoder structure shown in Figure 4 swaps the order of the self-attention layer and the mutual attention layer. First, the Query is updated using the Key and Value features that belong to the same target cluster as the input Query, to obtain more of the information needed for the target detection box. Then, the self-attention mechanism is used to further update the Query.
[0134] For the mutual attention mechanism, the extracted foreground voxel features V∈R can be... n′×c′ As key and value features, the query feature obtains the necessary information for the target detection box through interaction with the key and value features. Furthermore, to limit the interaction between the query and the key and value features to the same target cluster (i.e., the query interacts with the key and value features within the same target cluster), a dynamic attention mask A is introduced. l ∈R n×m Where l represents the number of decoder layers, defined as follows:
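[0135] $A_l(i, j) = \begin{cases} 0, & \text{if } \mathrm{id}(Q_i) = \mathrm{id}(K_j) \\ -\infty, & \text{otherwise} \end{cases}$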
[0136] Where id() represents the ID number of the target cluster to which the Query and Key belong. The above formula means that if a Query and Key feature belong to the same target cluster, the value of the attention mask is set to 0; otherwise, it is set to negative infinity.
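As a minimal sketch of this rule, the mask can be built directly from the cluster IDs of the queries and keys. The function name and the example IDs are hypothetical.

```python
NEG_INF = float("-inf")

def build_attention_mask(query_cluster_ids, key_cluster_ids):
    """Return a mask with 0 where query and key share a cluster ID, -inf elsewhere."""
    return [
        [0.0 if q_id == k_id else NEG_INF for k_id in key_cluster_ids]
        for q_id in query_cluster_ids
    ]

# Two queries (clusters 0 and 1) and five foreground-voxel keys.
mask = build_attention_mask([0, 1], [0, 0, 1, 1, 1])
```

Here the first query may attend only to the first two keys and the second query only to the last three.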
[0137] Based on the above attention mask, the proposed masked mutual attention mechanism can be represented as follows:
[0138] $Q_l = \mathrm{softmax}\left(\frac{Q_{l-1} K^{\top}}{D} + A_l\right) V$
[0139] Where D represents a normalization constant. Unlike standard mutual attention, the Key and Value features of the mutual attention layer in this embodiment are also updated from one decoder layer to the next. The update is based on the Query2Key strategy provided in this embodiment of the application, and the specific implementation is described below.
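The effect of the mask can be illustrated with a small sketch. It assumes a softmax-based attention in which the mask is added to the scores after division by D; the function name, the choice of D, and the toy features are hypothetical, and each query is assumed to have at least one key in its own cluster.

```python
import math

NEG_INF = float("-inf")

def masked_mutual_attention(queries, keys, values, mask, d):
    """Update each query from the keys/values allowed by its mask row."""
    updated = []
    for query, mask_row in zip(queries, mask):
        # Scaled dot-product score plus mask; masked keys get a score of -inf.
        scores = [
            sum(a * b for a, b in zip(query, key)) / d + m
            for key, m in zip(keys, mask_row)
        ]
        top = max(scores)  # finite as long as one key shares the query's cluster
        weights = [math.exp(s - top) for s in scores]  # exp(-inf) == 0.0
        total = sum(weights)
        weights = [w / total for w in weights]
        dim = len(values[0])
        updated.append(
            [sum(w * v[c] for w, v in zip(weights, values)) for c in range(dim)]
        )
    return updated

queries = [[1.0, 0.0], [0.0, 1.0]]
voxel_features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]  # used as both keys and values
mask = [[0.0, 0.0, NEG_INF], [NEG_INF, NEG_INF, 0.0]]
d = math.sqrt(2)  # assumed normalization constant for this toy example
out = masked_mutual_attention(queries, voxel_features, voxel_features, mask, d)
```

Because a masked key receives zero weight, the second query is updated solely from the single voxel feature in its own cluster.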
[0140] To achieve better detection performance, a cascaded structure of multiple Transformer decoders can be used to iteratively update the Query features. For the Query features output by each decoder, a two-layer fully connected network can be used to generate target detection boxes. The target detection boxes of each decoder are refined on the basis of the results of the previous decoder. Furthermore, the target detection boxes generated by each decoder can be used to adjust the attention mask so as to redefine the mapping between Key and Query features (i.e., whether a Key feature and a Query feature belong to the same target cluster), ensuring that the interaction range of mutual attention stays within the same target cluster (the Query interacts with the Key and Value features within the same target cluster).
[0141] For example, the specific adjustment method for the attention mask can be as follows: using the target center point (the center point of the target detection box) predicted by the updated Query output by the decoder, the foreground voxels are re-divided into clusters to obtain the updated target clusters, and then the attention mask is updated according to the updated target clusters.
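A minimal sketch of this adjustment is given below. It assumes that re-division assigns each foreground voxel to its nearest predicted center and that query i corresponds to cluster i; both the function names and the sample coordinates are hypothetical.

```python
import math

NEG_INF = float("-inf")

def reassign_clusters(voxel_positions, predicted_centers):
    """Assign each foreground voxel to the index of its nearest predicted center."""
    cluster_ids = []
    for position in voxel_positions:
        distances = [math.dist(position, center) for center in predicted_centers]
        cluster_ids.append(distances.index(min(distances)))
    return cluster_ids

def rebuild_mask(num_queries, key_cluster_ids):
    """Row i allows only keys whose updated cluster ID equals i."""
    return [
        [0.0 if q == k else NEG_INF for k in key_cluster_ids]
        for q in range(num_queries)
    ]

voxel_positions = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [9.0, 0.0, 0.0]]
predicted_centers = [[0.5, 0.0, 0.0], [10.0, 0.0, 0.0]]  # box centers from the decoder
cluster_ids = reassign_clusters(voxel_positions, predicted_centers)
mask = rebuild_mask(len(predicted_centers), cluster_ids)
```

The rebuilt mask is then used by the mutual attention layer of the next decoder.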
[0142] 2.3 Query2Key Strategy
[0143] Considering that the information contained in the Query feature includes information learned through interaction with the Key and Value features in the mutual attention mechanism, the expressive power of the Key and Value features determines whether the Query feature can learn enough information to predict the detection box.
[0144] Therefore, in order to enhance the expressive power of Key and Value features in the mutual attention layer and further improve the detection performance of the model, the Key and Value features of the mutual attention mechanism no longer remain unchanged, but can be updated according to the Query.
[0145] For example, the Query2Key strategy can be used to enhance the expressive power of Key and Value features in the mutual attention layer, thereby further improving the detection performance of the model.
[0146] One implementation method is as follows:
[0147] After a decoder at a certain layer (other than the last decoder) produces the updated Query feature $Q_l$, $Q_l$ is concatenated with the pre-update Key feature $K_{l-1}$ according to the corresponding cluster ID (i.e., the updated Query is concatenated with each Key feature of the same target cluster to update each Key feature separately), and a fully connected layer is used to fine-tune the concatenated features to obtain the updated Key feature $K_l$. The updated Key feature $K_l$ then serves as the Key and Value features for the mutual attention mechanism in the next decoder. In this way, the Key and Value features of the mutual attention mechanism can be iteratively updated across the decoders.
[0148] When concatenating the updated Query with the Key features, the updated Query can be copied as many times as there are Key features in the target cluster to which it belongs, and each copy is then concatenated with one Key feature along the feature dimension.
[0149] For example, suppose the target cluster to which the updated Query belongs contains 100 Key features. Before concatenation, the Query can be copied 100 times, and each copy is then concatenated with one of the Key features along the feature dimension.
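The concatenation and fine-tuning steps can be sketched as follows. The function names, the concatenation order (Key first, then Query), the assumption that query i corresponds to cluster i, and the hand-picked weights are all hypothetical; in practice the fully connected layer would be learned.

```python
def linear(x, weight, bias):
    """Fully connected layer y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def query2key_update(queries, keys, key_cluster_ids, weight, bias):
    """Concatenate each key with the query of its cluster, then apply one FC layer."""
    updated_keys = []
    for key, cluster_id in zip(keys, key_cluster_ids):
        # Using queries[cluster_id] per key is equivalent to copying the query
        # once for every key in the cluster before concatenation.
        concatenated = key + queries[cluster_id]
        updated_keys.append(linear(concatenated, weight, bias))
    return updated_keys

queries = [[10.0, 20.0], [100.0, 200.0]]       # updated queries, one per cluster
keys = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]    # key features before the update
key_cluster_ids = [0, 0, 1]
# Toy FC layer mapping 4 -> 2 channels; here it simply adds the query to the key.
weight = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
bias = [0.0, 0.0]
new_keys = query2key_update(queries, keys, key_cluster_ids, weight, bias)
new_values = new_keys  # the updated keys also serve as the updated values
```

The output keeps the original key dimension, so the updated features can be fed directly to the next decoder.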
[0150] As can be seen, by clustering the foreground voxels corresponding to the input point cloud data into multiple target clusters and using a cluster-based Transformer structure to generate target detection boxes, the features need not be transformed into BEV space. This avoids introducing a large number of empty features and preserves feature sparsity (which can be called fully sparse point cloud target detection), thereby improving the efficiency of target detection. In generating cluster-based target detection boxes, the initial Query is obtained by encoding the center coordinates of the target clusters, so that the Query initialization has explicit physical meaning. When the Query is updated using the mutual attention mechanism, its interaction is restricted to the Key and Value features of the same target cluster, which improves the convergence speed and detection accuracy of the model. Finally, the Query2Key strategy enhances the expressive power of the Key features, further improving the detection performance of the Transformer model.
[0151] The method provided in this application has been described above. The apparatus provided in this application is described below:
[0152] Please see Figure 5, which is a schematic diagram of the structure of a target detection device provided in an embodiment of this application. As shown in Figure 5, the target detection device may include:
[0153] The voxel feature extraction unit 510 is used to voxelize the input point cloud data and extract voxel features to obtain the voxel features of the input point cloud data.
[0154] The classification unit 520 is used to classify the foreground and background of the non-empty voxels corresponding to the input point cloud data based on the voxel features.
[0155] Determining unit 530 is used to determine the offset of each foreground voxel relative to its target center point;
[0156] Clustering unit 540 is used to cluster foreground voxels based on the offset of each foreground voxel relative to its target center point to obtain the clustered target cluster;
[0157] The target detection unit 550 is used to generate a target detection box based on the target cluster and using a cluster-based target detection structure.
[0158] In some embodiments, the clustering unit 540 clusters foreground voxels based on the offset of each foreground voxel relative to its target center point, including:
[0159] For any foreground voxel, based on the offset of the foreground voxel relative to the target center point, the foreground voxel is moved to a position closer to its target center point;
[0160] The moved foreground voxels are converted to a BEV map, and the target center point is determined based on the number of foreground voxels in each grid of the BEV map.
[0161] For any moved foreground voxel, based on the distance between the moved foreground voxel and each target center point, the moved foreground voxel and the nearest target center point are assigned to the same target cluster.
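The three clustering steps above can be sketched as follows. How target center points are derived from the per-grid counts is not detailed in this passage, so the sketch assumes that any BEV grid cell holding at least `min_count` moved voxels yields a center at the middle of that cell; the function name, `cell_size`, `min_count`, and the sample data are hypothetical.

```python
import math
from collections import Counter

def cluster_foreground_voxels(positions, offsets, cell_size=1.0, min_count=2):
    """Shift voxels by their offsets, find BEV centers, assign nearest center."""
    # Step 1: move each foreground voxel toward its target center point.
    shifted = [[p + o for p, o in zip(pos, off)] for pos, off in zip(positions, offsets)]
    # Step 2: count moved voxels per BEV grid cell (x and y only).
    counts = Counter(
        (math.floor(s[0] / cell_size), math.floor(s[1] / cell_size)) for s in shifted
    )
    # Assumed rule: sufficiently populated cells become target center points.
    centers = [
        ((ix + 0.5) * cell_size, (iy + 0.5) * cell_size)
        for (ix, iy), n in sorted(counts.items())
        if n >= min_count
    ]
    if not centers:
        return [], [-1] * len(shifted)
    # Step 3: assign each moved voxel to its nearest center in the BEV plane.
    cluster_ids = []
    for s in shifted:
        distances = [math.dist(s[:2], center) for center in centers]
        cluster_ids.append(distances.index(min(distances)))
    return centers, cluster_ids

positions = [[0.0, 0.0, 0.0], [1.0, 1.0, 0.0], [10.0, 10.0, 0.0], [11.0, 11.0, 0.0]]
offsets = [[0.5, 0.5, 0.0], [-0.5, -0.5, 0.0], [0.5, 0.5, 0.0], [-0.5, -0.5, 0.0]]
centers, cluster_ids = cluster_foreground_voxels(positions, offsets)
```

Only the occupied cells are counted, so no dense BEV feature map has to be materialized for this step.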
[0162] In some embodiments, the target detection structure is a Transformer structure;
[0163] The target detection unit 550 generates a target detection box based on the target cluster and using a cluster-based target detection structure, including:
[0164] The center coordinates of the target cluster are encoded to obtain the initial query vector Query;
[0165] The initial query is decoded using a cluster-based Transformer decoding layer structure to obtain the target detection box.
[0166] In some embodiments, each decoder in the cluster-based Transformer architecture includes a mutual attention layer and a self-attention layer;
[0167] The target detection unit 550 uses a cluster-based Transformer decoder structure to decode the initial query, including:
[0168] For any decoder, the input query is updated based on the key and value features of the target cluster to which it belongs; where the key and value features are the voxel features of the foreground voxels; in the cluster-based Transformer structure, the input of the first decoder is the initial query, and the input of non-first decoders is the updated query output by the previous decoder.
[0169] Based on the updated Query output by the last decoder in the cluster-based Transformer structure, a target detection box is generated.
[0170] In some embodiments, the target detection unit updates the input query based on the key and value features of the target cluster to which the input query belongs, including:
[0171] The input query is updated according to the following formula:
[0172] $Q_l = \mathrm{softmax}\left(\frac{Q_{l-1} K^{\top}}{D} + A\right) V$
[0173] Where D is the normalization constant, $Q_l$ is the updated Query, $Q_{l-1}$ is the Query before the update, K is the Key feature, V is the Value feature, and A is the attention mask. When the Query and Key features belong to the same target cluster, A is 0. When the Query and Key features belong to different target clusters, A is negative infinity.
[0174] In some embodiments, the cluster-based Transformer decoder structure includes multiple decoders;
[0175] The target detection unit 550 is also used to generate a coarse target detection box for any decoder other than the last decoder, based on the updated Query output by that decoder.
[0176] The clustering unit 540 is also used to re-cluster the foreground voxels based on the center point of the coarsely selected target detection box to obtain an updated target cluster;
[0177] For any decoder, the target detection unit 550 updates the input query based on the key and value features of the target cluster to which the input query belongs, including:
[0178] For decoders other than the first one, the input query is updated based on the key and value features that belong to the same updated target cluster as the input query.
[0179] In some embodiments, the cluster-based Transformer decoder structure includes multiple decoders;
[0180] The target detection unit 550 is also used to update the Key features and Value features within the same target cluster based on the updated Query output by any decoder other than the last decoder.
[0181] For any decoder, the target detection unit 550 updates the input query based on the key and value features of the target cluster to which the input query belongs, including:
[0182] For decoders other than the first one, the input query is updated based on the updated key and value features that belong to the same target cluster as the input query.
[0183] In some embodiments, the target detection unit 550 updates the Key and Value features within the same target cluster based on the updated Query output by the decoder, including:
[0184] For Key features that belong to the same target cluster as the updated Query, the updated Query and the current Key features are concatenated, and the concatenated features are fine-tuned using a fully connected layer to obtain the updated Key features; wherein, the updated Value features are the same as the updated Key features.
[0185] This application provides an electronic device including a processor and a memory, wherein the memory stores machine-executable instructions that can be executed by the processor, and the processor executes the machine-executable instructions to implement the target detection method described above.
[0186] Please see Figure 6, which is a schematic diagram of the hardware structure of an electronic device provided in an embodiment of this application. The electronic device may include a processor 601 and a memory 602 storing machine-executable instructions. The processor 601 and the memory 602 can communicate via a system bus 603. Furthermore, by reading and executing the machine-executable instructions corresponding to the target detection logic in the memory 602, the processor 601 can execute the target detection method described above.
[0187] The memory 602 mentioned in this document can be any electronic, magnetic, optical, or other physical storage device that can contain or store information such as executable instructions, data, etc. For example, machine-readable storage media can be: RAM (Random Access Memory), volatile memory, non-volatile memory, flash memory, storage drives (such as hard disk drives), solid-state drives, any type of storage disk (such as optical discs, DVDs, etc.), or similar storage media, or combinations thereof.
[0188] In some embodiments, a machine-readable storage medium is also provided, such as the memory 602 in Figure 6. The machine-readable storage medium stores machine-executable instructions which, when executed by a processor, implement the target detection method described above. For example, the storage medium may be ROM, RAM, CD-ROM, magnetic tape, floppy disk, or an optical data storage device.
[0189] It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.
[0190] The above description is merely a preferred embodiment of this application and is not intended to limit this application. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the scope of protection of this application.
Claims
1. A target detection method, characterized in that it comprises: voxelizing input point cloud data and extracting voxel features to obtain the voxel features of the input point cloud data; classifying, based on the voxel features, the non-empty voxels corresponding to the input point cloud data as foreground or background, and determining the offset of each foreground voxel relative to its target center point; clustering the foreground voxels based on the offset of each foreground voxel relative to its target center point to obtain clustered target clusters; and generating, based on the target clusters, a target detection box using a cluster-based target detection structure; wherein the target detection structure is a Transformer structure; and generating the target detection box based on the target clusters using the cluster-based target detection structure comprises: encoding the center coordinates of the target clusters to obtain an initial query vector Query; and decoding the initial Query using a cluster-based Transformer decoder structure to obtain the target detection box.
2. The method according to claim 1, characterized in that clustering the foreground voxels based on the offset of each foreground voxel relative to its target center point comprises: for any foreground voxel, moving the foreground voxel to a position closer to its target center point based on the offset of the foreground voxel relative to the target center point; converting the moved foreground voxels to a bird's-eye view (BEV) image, and determining the target center points based on the number of foreground voxels in each pixel grid of the BEV image; and, for any moved foreground voxel, assigning the moved foreground voxel and its nearest target center point to the same target cluster based on the distance between the moved foreground voxel and each target center point.
3. The method according to claim 1, characterized in that, in the cluster-based Transformer structure, each decoder comprises a mutual attention layer and a self-attention layer; and decoding the initial Query using the cluster-based Transformer decoder structure comprises: for any decoder, updating the input Query based on the Key and Value features that belong to the same target cluster as the input Query, wherein the Key and Value features are the voxel features of the foreground voxels, the input of the first decoder in the cluster-based Transformer structure is the initial Query, and the input of each non-first decoder is the updated Query output by the previous decoder; and generating the target detection box based on the updated Query output by the last decoder in the cluster-based Transformer structure.
4. The method according to claim 3, characterized in that updating the input Query based on the Key and Value features that belong to the same target cluster as the input Query comprises: updating the input Query according to the following formula: $Q_l = \mathrm{softmax}\left(\frac{Q_{l-1} K^{\top}}{D} + A\right) V$, where D is the normalization constant, $Q_l$ is the updated Query, $Q_{l-1}$ is the Query before the update, K is the Key feature, V is the Value feature, and A is the attention mask; when the Query and Key features belong to the same target cluster, A is 0; when the Query and Key features belong to different target clusters, A is negative infinity.
5. The method according to claim 3, characterized in that the cluster-based Transformer decoder structure comprises multiple decoders; the method further comprises: for any decoder other than the last decoder, generating a coarse target detection box based on the updated Query output by that decoder; and re-clustering the foreground voxels based on the center point of the coarse target detection box to obtain updated target clusters; and, for any decoder, updating the input Query based on the Key and Value features that belong to the same target cluster as the input Query comprises: for any decoder other than the first decoder, updating the input Query based on the Key and Value features that belong to the same updated target cluster as the input Query.
6. The method according to claim 3, characterized in that the cluster-based Transformer decoder structure comprises multiple decoders; the method further comprises: for any decoder other than the last decoder, updating the Key and Value features within the same target cluster based on the updated Query output by that decoder; and, for any decoder, updating the input Query based on the Key and Value features that belong to the same target cluster as the input Query comprises: for any decoder other than the first decoder, updating the input Query based on the updated Key and Value features that belong to the same target cluster as the input Query.
7. The method according to claim 6, characterized in that updating the Key and Value features within the same target cluster based on the updated Query output by the decoder comprises: for the Key features that belong to the same target cluster as the updated Query, concatenating the updated Query with the current Key features, and fine-tuning the concatenated features using a fully connected layer to obtain the updated Key features; wherein the updated Value features are the same as the updated Key features.
8. A target detection device, characterized in that it comprises: a voxel feature extraction unit, used to voxelize input point cloud data and extract voxel features to obtain the voxel features of the input point cloud data; a classification unit, used to classify the non-empty voxels corresponding to the input point cloud data as foreground or background based on the voxel features; a determination unit, used to determine the offset of each foreground voxel relative to its target center point; a clustering unit, used to cluster the foreground voxels based on the offset of each foreground voxel relative to its target center point to obtain clustered target clusters; and a target detection unit, used to generate a target detection box based on the target clusters using a cluster-based target detection structure; wherein the target detection structure is a Transformer structure; and the target detection unit generating the target detection box based on the target clusters using the cluster-based target detection structure comprises: encoding the center coordinates of the target clusters to obtain an initial query vector Query; and decoding the initial Query using a cluster-based Transformer decoder structure to obtain the target detection box.
9. The apparatus according to claim 8, characterized in that the clustering unit clustering the foreground voxels based on the offset of each foreground voxel relative to its target center point comprises: for any foreground voxel, moving the foreground voxel to a position closer to its target center point based on the offset of the foreground voxel relative to the target center point; converting the moved foreground voxels to a bird's-eye view (BEV) image, and determining the target center points based on the number of foreground voxels in each pixel grid of the BEV image; and, for any moved foreground voxel, assigning the moved foreground voxel and its nearest target center point to the same target cluster based on the distance between the moved foreground voxel and each target center point; and/or, in the cluster-based Transformer structure, each decoder comprises a mutual attention layer and a self-attention layer; the target detection unit decoding the initial Query using the cluster-based Transformer decoder structure comprises: for any decoder, updating the input Query based on the Key and Value features that belong to the same target cluster as the input Query, wherein the Key and Value features are the voxel features of the foreground voxels, the input of the first decoder in the cluster-based Transformer structure is the initial Query, and the input of each non-first decoder is the updated Query output by the previous decoder; and generating the target detection box based on the updated Query output by the last decoder in the cluster-based Transformer structure; wherein the target detection unit updating the input Query based on the Key and Value features that belong to the same target cluster as the input Query comprises: updating the input Query according to the following formula: $Q_l = \mathrm{softmax}\left(\frac{Q_{l-1} K^{\top}}{D} + A\right) V$, where D is the normalization constant, $Q_l$ is the updated Query, $Q_{l-1}$ is the Query before the update, K is the Key feature, V is the Value feature, and A is the attention mask; when the Query and Key features belong to the same target cluster, A is 0; when the Query and Key features belong to different target clusters, A is negative infinity; wherein the cluster-based Transformer decoder structure comprises multiple decoders; the target detection unit is further used to generate, for any decoder other than the last decoder, a coarse target detection box based on the updated Query output by that decoder; the clustering unit is further used to re-cluster the foreground voxels based on the center point of the coarse target detection box to obtain updated target clusters; and, for any decoder, the target detection unit updating the input Query based on the Key and Value features that belong to the same target cluster as the input Query comprises: for any decoder other than the first decoder, updating the input Query based on the Key and Value features that belong to the same updated target cluster as the input Query; wherein the cluster-based Transformer decoder structure comprises multiple decoders; the target detection unit is further used to update, for any decoder other than the last decoder, the Key and Value features within the same target cluster based on the updated Query output by that decoder; and, for any decoder, the target detection unit updating the input Query based on the Key and Value features that belong to the same target cluster as the input Query comprises: for any decoder other than the first decoder, updating the input Query based on the updated Key and Value features that belong to the same target cluster as the input Query; wherein the target detection unit updating the Key and Value features within the same target cluster based on the updated Query output by the decoder comprises: for the Key features that belong to the same target cluster as the updated Query, concatenating the updated Query with the current Key features, and fine-tuning the concatenated features using a fully connected layer to obtain the updated Key features; wherein the updated Value features are the same as the updated Key features.
10. An electronic device, characterized in that it comprises a processor and a memory, the memory storing machine-executable instructions that can be executed by the processor, and the processor executing the machine-executable instructions to implement the method according to any one of claims 1-7.
11. A machine-readable storage medium, characterized in that the machine-readable storage medium stores machine-executable instructions which, when executed by a processor, implement the method according to any one of claims 1-7.