3D object detection method based on column sequence attention and dilated convolution
By combining column sequence attention and dilated convolution, the problems of coarse column feature encoding and insufficient receptive field in 3D object detection are solved, achieving efficient and accurate real-time object detection.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- YANSHAN UNIV
- Filing Date
- 2024-04-29
- Publication Date
- 2026-06-26
AI Technical Summary
Existing 3D target detection methods suffer from coarse and limited information in the column feature encoding stage, and insufficient receptive field in the backbone network feature extraction stage, making it difficult to meet real-time and accuracy requirements.
We employ a column sequence attention feature encoder and a dilated convolutional network. The column sequence attention module enhances the inter-point correlation information, and the sparse and dense dilated convolutional modules increase the receptive field. By combining sparse and dense dilated convolutional blocks, we construct an efficient 3D object detection model.
It improves detection accuracy and robustness, enables stable real-time target detection in complex scenes, and enhances the model's representational ability and detection performance.
Figure CN118429655B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of target detection technology, specifically relating to a 3D target detection method based on column sequence attention and dilated convolution.
Background Technology
[0002] 3D object detection based on LiDAR point clouds has a wide range of applications, such as autonomous driving perception and robotics, and has therefore attracted much attention. Similar to 2D object detection, 3D object detection requires accurately locating the target position in a specified three-dimensional space and determining the target category. However, there are also differences between the two detection tasks. In 2D object detection, the scene to be detected is usually dense and regular RGB image data, which can be directly processed using CNN. In contrast, the scene to be detected in 3D object detection consists of sparse and irregular point cloud data, which makes 3D object detection more challenging.
[0003] Numerous studies have achieved significant success in 3D object detection using point cloud data. These methods can be categorized into raw point-based methods and methods based on regularized representations. Previous research generally suggests that raw point-based methods can fully utilize the complete raw point cloud information to achieve better performance. However, due to the complexity of processing raw point cloud data, these models have lower detection speeds and are difficult to apply to large-scale point cloud scenes. Regularized methods, on the other hand, convert point cloud data into voxel or column representations. Although this conversion loses some original information, the regular grid data can be processed with convolutional modules, which improves the efficiency of feature extraction and increases the model's detection speed. Considering the real-time requirements of autonomous driving perception modules, column-based detection methods are more suitable.
[0004] Column-based detection methods typically include the following basic steps: column feature encoding, backbone network feature extraction, neck network feature aggregation, and detection head execution. However, in the column feature encoding stage, the average or maximum / minimum method is usually used to extract the original point cloud features inside the column as column features, without mining and learning the original point cloud features, resulting in relatively coarse column features. In the backbone network feature extraction stage, previous methods typically used sparse convolution for multi-level feature extraction, but due to the limitations of column size and sparse convolution kernel size, the receptive field of such a backbone network is difficult to meet the requirements of accurate detection.
[0005] To address the issues of coarse encoding methods and limited encoded information in the column feature encoding stage, a self-attention mechanism can be employed to establish the correlation information between points within the scene. However, directly applying self-attention to the global scene presents two problems: (1) high computational cost, and (2) a large number of redundant attention features. To solve these problems, attention modeling can be performed within a local region to obtain the attention feature information between points within the column. However, this operation faces the challenge of parallelizing attention modeling for variable-length sequences.
[0006] To address the issue of insufficient receptive field during the feature extraction stage of the backbone network, VoxelNeXt compensates by increasing the number of downsampling layers. However, this approach leads to excessively low feature map resolution, which is detrimental to prediction results, and the additional downsampling layers also increase computational overhead. Alternatively, VoTr and SST use Transformers as the backbone network to increase the model's receptive field. However, this approach requires designing complex structures to query a large number of non-empty samples and performing attention modeling on these samples, significantly increasing the model size and complexity.
[0007] Therefore, there is a need for a 3D object detection method based on column sequence attention and dilated convolution that can solve the problems of coarse encoding methods and limited encoding information, effectively increase the receptive field of the model, and maintain the simplicity and efficiency of the column model.
Summary of the Invention
[0008] The purpose of this invention is to provide a 3D target detection method based on column sequence attention and dilated convolution. The column sequence attention feature encoder solves the problems of coarse encoding and limited encoding information in the column feature encoding stage, and the dilated convolutional network solves the problem of insufficient receptive field in the backbone network feature extraction stage. The combination of the two can obtain a detection model with higher detection accuracy and stronger robustness while meeting the real-time requirements.
[0009] To achieve the above objectives, the technical solution adopted by the present invention is as follows:
[0010] A 3D object detection method based on column sequence attention and dilated convolution includes the following steps:
[0011] Step S1: Data preprocessing: Crop the point cloud scenes of all folders in the training dataset to a fixed size. Set the detection ranges in the X, Y and Z axes of the point cloud scene to [0, 70.4]m, [-40, 40]m and [-3, 1]m respectively. Then generate the corresponding bin file for the cropped dataset.
[0012] Step S2: Build and train the network model: First, prepare the point cloud data for training, including point cloud scene data and target label data; then build the network model and set the hyperparameters of each module; then feed the training data into the network model for training. During the entire training process, the model is optimized by reducing the network's loss function to obtain the best-performing network model weights; finally, save the network model weights.
[0013] Step S3: Model Testing: Verify the model's effectiveness using test set data. Test the target detection performance by loading the weights of the best-performing network model from Step S2 to achieve accurate and real-time 3D target detection.
[0014] A further improvement to the technical solution of the present invention is that, in step S1, data preprocessing includes the following steps:
[0015] Step S101: Crop the point cloud scene data in the KITTI dataset required for training;
[0016] Step S102: Set the detection ranges in the X, Y and Z axis directions to [0, 70.4]m, [-40, 40]m and [-3, 1]m respectively, set the column size to (0.05m, 0.05m), and set the maximum number of sampling points inside each column to 5, so as to obtain a BEV feature map with a resolution of [1600, 1408].
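The BEV grid dimensions follow from the detection range and the column size. The following is a minimal sketch, assuming the grid is obtained by dividing each horizontal extent by the column edge length; the function name `bev_grid_shape` is illustrative and does not come from the patent.

```python
def bev_grid_shape(x_range, y_range, column_size):
    """Return (cells along Y, cells along X) for a bird's-eye-view grid."""
    # Number of columns along each axis = extent / column edge length.
    # round() guards against floating-point noise such as 1408.0000000000002.
    nx = round((x_range[1] - x_range[0]) / column_size[0])
    ny = round((y_range[1] - y_range[0]) / column_size[1])
    return ny, nx


# Detection ranges and column size taken from step S102.
shape = bev_grid_shape((0.0, 70.4), (-40.0, 40.0), (0.05, 0.05))
print(shape)
```

With these settings, the 80 m Y extent yields 1600 cells and the 70.4 m X extent yields 1408 cells.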
[0017] A further improvement to the technical solution of the present invention is that: in step S2, constructing the network model and training the model includes the following steps:
[0018] Step S201: Construct a network model, which includes a column sequence attention feature encoder for column feature encoding, a dilated convolutional network for feature extraction in a large receptive field region, a neck network for feature fusion, and a classification and regression sub-network for target classification and regression; the column sequence attention feature encoder includes a column sequence attention module and a column feature soft aggregation module; the dilated convolutional network includes a sparse dilated convolutional module and a dense dilated convolutional module;
[0019] Step S202: The column sequence attention module achieves self-attention through a two-level cross-attention approach, using the attention information between points inside the column to enhance the features of the original point cloud data;
[0020] Step S203: The point cloud features output by the column sequence attention module are refined by the column feature soft aggregation module. The weight information of each point inside the column is obtained by using scattering SoftMax, and the point cloud features are weighted accordingly to finally obtain more representative column feature information.
[0021] Step S204: Use multiple SDEC layers composed of sparsely dilated convolutional modules to process sparse 2D feature data and extract features from sparse feature maps within a large receptive field.
[0022] Step S205: After the sparse feature map is densified, the DDEC layer, which is composed of dense dilated convolutional blocks, is used to extract features from the dense feature map in a large receptive field, and finally a level 2 dense feature map is obtained.
[0023] Step S206: Input the level 2 dense feature map into the neck network for feature fusion. Adjust the number of channels and size of the feature map through transposed convolution and ordinary convolution operations. Finally, stack the processed feature maps to obtain a feature map with strong representation capabilities for use by the detection head. The detection head obtains the loss of category, position and orientation between the predicted result and the real result, thereby optimizing the model parameters.
[0024] Step S207: Input the processed point cloud dataset and train the network model.
[0025] A further improvement of the technical solution of the present invention is that: in step S202, the point cloud data first passes through the column sequence attention module, and the cropped training set point cloud data is sent into the column sequence attention module of the column sequence attention feature encoder. The column sequence attention module contains two levels of cross attention, wherein the first level of cross attention is the encoder and the second level of cross attention is the decoder.
[0026] In the encoder, the input point cloud feature information $F_{in}$ is first obtained. $F_{in}$ consists of point cloud feature information and point cloud position information relative to the grid. The point cloud feature information is obtained by initial MLP encoding of the original point cloud feature information $F = (x, y, z, r)$. The point cloud position information relative to the grid is obtained by applying the position encoding PE to the offset $R = (x', y', z')$ of each point relative to the upper left corner of its grid cell. The position encoding operation PE used here is similar to the position encoding operation used in DETR. The operation process is as follows:
[0027] $F_{in} = \mathrm{MLP}(F) + \mathrm{PE}(R)$
[0028] $R = (P_i - (C_i \times P_{size})) / P_{size}$
[0029] In the formula, $P_i$ represents the global offset distance of the i-th point relative to the detection scene, $C_i$ represents the grid index of the i-th point, and $P_{size}$ represents the size of the column grid;
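The offset formula can be illustrated for a single point. The sketch below is an assumption-laden illustration: it handles only the two horizontal axes, takes the grid index as the floor of the global offset divided by the column size, and uses the hypothetical helper name `column_relative_offset`.

```python
import math


def column_relative_offset(point_xy, range_min_xy, column_size):
    """Return (grid index C_i, normalized in-cell offset R) for one point."""
    # P_i: offset of the point from the origin of the detection range.
    p = [c - m for c, m in zip(point_xy, range_min_xy)]
    # C_i: integer index of the column that contains the point (assumed floor).
    idx = [math.floor(v / s) for v, s in zip(p, column_size)]
    # R = (P_i - C_i * P_size) / P_size, which lies in [0, 1).
    rel = [(v - i * s) / s for v, i, s in zip(p, idx, column_size)]
    return idx, rel


idx, rel = column_relative_offset((10.03, -39.98), (0.0, -40.0), (0.05, 0.05))
print(idx, rel)
```

Because R is normalized by the column size, it does not depend on where the column sits in the scene, only on where the point sits inside its column.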
[0030] The point cloud features $F_{in}$ are passed through an MLP to obtain the K and V vectors. The learnable, fixed-length virtual point set features $F_v$ are used as the Q vector for cross-attention modeling to obtain the attention feature information between the virtual point set and the points inside the column. The virtual point set features are produced by a single linear layer and can be regarded as a fixed number of virtual points whose features have a fixed dimension. The cross-attention uses the scatter_softmax function to calculate the weight of each point. This function has a CUDA implementation and can perform scatter operations efficiently in parallel.
[0031] Using the column index of each point as an index, a softmax operation is used to restrict the attention weights of each virtual point to the original point within each column, resulting in attention feature information with the same feature dimension, called Cross-Attention Feature (CAF). The operation process is as follows:
[0032] $Q = F_v,\quad K = \mathrm{MLP}(F_{in}),\quad V = \mathrm{MLP}(F_{in})$
[0033] $\mathrm{CAF} = \mathrm{CrossAttention}(Q, K, V)$
[0034] $\mathrm{CrossAttention}(Q, K, V) = \mathrm{SoftMax}_{scatter}(QK^{T})\,V$
[0035] In the formula, Q, K, and V represent the query vector, key vector, and value vector used for attention modeling, respectively; $F_v$ represents the virtual point set features; $F_{in}$ represents the original point cloud features; MLP represents a linear layer; CrossAttention represents the cross-attention operation; CAF represents the obtained cross-attention features; $\mathrm{SoftMax}_{scatter}$ represents the scatter SoftMax operation;
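The grouping behavior of a scatter softmax can be shown with a small pure-Python sketch. This is an illustration of the idea only, not the library kernel referred to in the text: scores that share a column index are normalized together, so columns with different numbers of points can be processed in one flat list.

```python
import math
from collections import defaultdict


def scatter_softmax(scores, index):
    """Softmax over `scores`, normalized separately within each group in `index`."""
    # Per-group maximum, subtracted for numerical stability.
    group_max = defaultdict(lambda: -math.inf)
    for s, g in zip(scores, index):
        group_max[g] = max(group_max[g], s)
    exps = [math.exp(s - group_max[g]) for s, g in zip(scores, index)]
    # Per-group normalizer.
    group_sum = defaultdict(float)
    for e, g in zip(exps, index):
        group_sum[g] += e
    return [e / group_sum[g] for e, g in zip(exps, index)]


# Scores of one virtual point against five points: three in column 0, two in column 1.
scores = [1.0, 2.0, 3.0, 0.5, 0.5]
column_index = [0, 0, 0, 1, 1]
weights = scatter_softmax(scores, column_index)
print(weights)
```

The weights of each column sum to one independently, which is how the attention of a virtual point is restricted to the points of a single column.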
[0036] The column sequence attention module decoder uses the cross-attention feature (CAF) as the K and V vectors, and the original point cloud feature information as the Q vector, to perform cross-attention modeling again, thereby indirectly obtaining the self-attention feature information (SF) between points within each column of the original point cloud. The operation process is as follows:
[0037] $Q = \mathrm{MLP}(F_{in}),\quad K = \mathrm{CAF},\quad V = \mathrm{CAF}$
[0038] $\mathrm{SF} = \mathrm{CrossAttention}(Q, K, V)$
[0039]
[0040] In the formula, Q, K, and V represent the query vector, key vector, and value vector used for attention modeling, respectively; MLP represents a linear layer; CAF represents cross-attention features with the same feature dimension obtained by the encoder; $F_{in}$ represents the original point cloud features; SF represents the self-attention features obtained by the decoder; CrossAttention represents the cross-attention operation; d represents the feature sequence length; and SoftMax represents the ordinary SoftMax operation.
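The two-level scheme can be sketched end to end on toy data. The sketch below rests on several assumptions that are not in the source: the MLP projections are replaced by the identity, the decoder uses the common scaled dot-product form with division by the square root of d, and all names (`column_sequence_attention`, `virtual_points`, and so on) are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(dim)]


def column_sequence_attention(points, column_index, virtual_points):
    d = len(virtual_points[0])
    # Group point rows by the column they fall into.
    columns = {}
    for i, c in enumerate(column_index):
        columns.setdefault(c, []).append(i)
    # Encoder: each virtual point (Q) attends to the points of one column (K, V).
    # Every column ends up with len(virtual_points) rows, whatever its point count.
    caf = {}
    for c, members in columns.items():
        feats = [points[i] for i in members]
        caf[c] = [weighted_sum(softmax([dot(q, k) for k in feats]), feats)
                  for q in virtual_points]
    # Decoder: each original point (Q) attends to the CAF rows of its own column.
    sf = []
    for i, c in enumerate(column_index):
        scores = [dot(points[i], k) / math.sqrt(d) for k in caf[c]]  # assumed scaling
        sf.append(weighted_sum(softmax(scores), caf[c]))
    return caf, sf


points = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy point features
column_index = [0, 0, 1]                        # two points in column 0, one in column 1
virtual_points = [[0.5, 0.5], [1.0, -1.0], [0.0, 0.0]]
caf, sf = column_sequence_attention(points, column_index, virtual_points)
print(len(caf[0]), len(caf[1]), len(sf))
```

In this toy run both columns produce three CAF rows even though they hold different numbers of points, and the decoder returns one enhanced feature per original point with the feature dimension unchanged.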
[0041] A further improvement of the technical solution of the present invention is that: in step S203, the point feature information with strong representation ability obtained in step S202 is sent to the column feature soft aggregation module of the column sequence attention feature encoder.
[0042] The point cloud features output by the column sequence attention module are used as input to the column feature soft aggregation module. The weight information of each point inside the corresponding column is obtained by using the scatter SoftMax. Then, the point cloud features are multiplied by the weight information of each point to complete the weighting operation. The operation process is as follows:
[0043] $W_P = \mathrm{SoftMax}_{scatter}(F_{PSA})$
[0044] $F_{PFSA} = F_{PSA} * W_P$
[0045] $PF_{PFSA} = \mathrm{Add}_{scatter}(F_{PFSA})$
[0046] In the formula, $F_{PSA}$ represents the point cloud features output by the column sequence attention module; $W_P$ represents the weight information of each point within its respective column; $\mathrm{SoftMax}_{scatter}$ represents the scatter SoftMax operation; $F_{PFSA}$ represents the weighted feature information of each point; $\mathrm{Add}_{scatter}$ represents the scatter addition operation; and $PF_{PFSA}$ represents the final column feature.
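The three aggregation steps can be sketched as follows. This is a minimal illustration, assuming the scatter SoftMax is applied per feature channel across the points of a column; the function name `soft_aggregate` is hypothetical.

```python
import math


def soft_aggregate(point_features, column_index):
    """Aggregate per-point features into one feature vector per column."""
    dim = len(point_features[0])
    columns = {}
    for i, c in enumerate(column_index):
        columns.setdefault(c, []).append(i)
    out = {}
    for c, members in columns.items():
        agg = []
        for j in range(dim):
            vals = [point_features[i][j] for i in members]
            # W_P: softmax over the points of this column (per channel).
            m = max(vals)
            exps = [math.exp(v - m) for v in vals]
            total = sum(exps)
            # F_PFSA = F_PSA * W_P, then scatter-add over the column.
            agg.append(sum(v * e / total for v, e in zip(vals, exps)))
        out[c] = agg
    return out


features = [[1.0, 0.0], [1.0, 2.0], [3.0, 3.0]]
column_index = [0, 0, 1]
columns = soft_aggregate(features, column_index)
print(columns)
```

Under this assumption the aggregated value of each channel lies between the mean and the maximum of the point values in the column, so the module behaves like a learned compromise between average pooling and max pooling.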
[0047] A further improvement of the technical solution of the present invention is that: in step S204, the column feature information obtained in step S203 is sent to the sparse dilated convolution module of the dilated convolution network to extract sparse features in a large receptive field area. The sparse dilated convolution mainly solves the problem of insufficient receptive field in the backbone network during the feature extraction of sparse feature maps.
[0048] A structure similar to Inception is used to combine the convolution results of convolution kernels with different dilation rates. This preserves the receptive field of the original model to cover small targets, while also making full use of large dilation rates to expand the receptive field. This structure is the dilated convolution block, and its receptive field and actual kernel size are given below:
[0049] $RF_{l+1} = RF_l + (K - 1) \times S_i$
[0050]
[0051] In the formula, $RF_{l+1}$ represents the receptive field of the current layer, $RF_l$ represents the receptive field of the previous layer, K represents the actual kernel size, n represents the number of convolutional blocks with different dilation rates that are combined, and $S_i$ represents the product of the strides of all previous layers;
[0052] The actual kernel size increases with the increase of the dilation rate d of the dilated convolution. The dilated convolution combines n convolution blocks with different dilation rates to further increase the actual kernel size.
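The receptive field recurrence can be evaluated numerically. The sketch below makes two assumptions: it uses the standard effective kernel size of a single dilated convolution, k + (k - 1)(d - 1), and it takes the largest branch as the actual kernel size of a block with parallel dilation rates. The dilation rates (1, 2, 3) and layer layout are hypothetical examples, not values from the patent.

```python
def effective_kernel(kernel_size, dilation):
    """Effective kernel size of one dilated convolution."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def receptive_field(layers):
    """layers: list of (kernel_size, dilations, stride) tuples, applied in order."""
    rf, stride_product = 1, 1
    for kernel_size, dilations, stride in layers:
        # Assumed: the widest parallel branch determines the block's kernel size K.
        k = max(effective_kernel(kernel_size, d) for d in dilations)
        # RF_{l+1} = RF_l + (K - 1) * S_i, with S_i the product of earlier strides.
        rf += (k - 1) * stride_product
        stride_product *= stride
    return rf


# One stride-2 downsampling convolution followed by two 3x3 blocks.
plain = receptive_field([(3, (1,), 2), (3, (1,), 1), (3, (1,), 1)])
dilated = receptive_field([(3, (1,), 2), (3, (1, 2, 3), 1), (3, (1, 2, 3), 1)])
print(plain, dilated)
```

In this example the blocks with parallel dilation rates reach a noticeably larger receptive field than plain 3x3 blocks at the same depth and stride, while the rate-1 branch keeps the small-scale view.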
[0053] SDEC processes sparse 2D feature data and consists of four sparse dilated convolutional layers. Each layer includes one sparse convolutional block and two sparse dilated convolutional blocks. The sparse convolutional block consists of a regular 2D sparse convolution with a stride of 2, batch normalization, and a ReLU activation function. It is mainly used for downsampling sparse 2D feature maps. The sparse dilated convolutional block consists of two layers of multiple parallel sub-manifold sparse convolutions.
[0054] The input sparse feature map is first subjected to sparse convolution operations in different receptive field ranges. Then, the multi-size receptive field convolution feature map is combined with the original input feature map to obtain the dilated feature map. This operation is the dilated convolution processing. Next, the dilated feature map is subjected to dilated convolution processing again to further extract sparse features.
[0055] A further improvement of the technical solution of the present invention is that: in step S205, the sparse feature map obtained in step S204 is made denser, and then fed into the dense dilated convolution of the dilated convolutional network to extract features from the dense feature map within a large receptive field.
[0056] DDEC, with a structure similar to SDEC, is used to process dense 2D feature data. DDEC blocks are composed of traditional 2D convolutions. In DDEC, a 2D convolution block with a stride of 2 is first used to downsample the ordinary 2D feature map. Then, two dense dilated convolution blocks are used to extract local and large-scale receptive field features from the ordinary 2D feature map. The dense dilated convolution block adopts the same structure as the sparse dilated convolution block, with the submanifold sparse convolutions replaced by ordinary convolutions, and performs 2D convolution operations in different receptive field ranges.
[0057] A further improvement of the technical solution of the present invention is that: in step S206, the level 2 dense feature map generated in step S205 is sent into the neck network for feature fusion to obtain dense feature maps of the backbone network with downsampling step sizes of 8 and 16, and their tensor dimensions are [batch_size,256,w / 8,h / 8] and [batch_size,256,w / 16,h / 16], respectively;
[0058] For a feature map with a downsampling stride of 8, a transposed convolution is used to reduce its channel count to 128 dimensions without changing the feature map size. After processing, the feature map tensor dimension is [batch_size, 128, w / 8, h / 8]. For a feature map with a downsampling stride of 16, a regular convolution is first used for feature update without changing the number of channels or the feature size. Then, a transposed convolution is used to halve the feature channel dimension and double the feature map size. After processing, the feature map tensor dimension is also [batch_size, 128, w / 8, h / 8].
[0059] Finally, the two processed feature maps are stacked along the channel dimension to obtain a feature map with tensor dimensions of [batch_size, 256, w / 8, h / 8].
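The tensor shapes in the neck can be traced with simple arithmetic. The sketch below only tracks shapes and performs no convolution; the function name `neck_output_shape` and the example BEV size (chosen to be divisible by 16) are assumptions for illustration.

```python
def neck_output_shape(batch_size, w, h):
    """Trace tensor shapes through the neck: two branches, then channel stacking."""
    s8 = (batch_size, 256, w // 8, h // 8)      # backbone output at stride 8
    s16 = (batch_size, 256, w // 16, h // 16)   # backbone output at stride 16
    # Stride-8 branch: channels reduced to 128, spatial size unchanged.
    b8 = (s8[0], 128, s8[2], s8[3])
    # Stride-16 branch: channels halved and spatial size doubled by the
    # transposed convolution (the preceding ordinary convolution keeps the shape).
    b16 = (s16[0], s16[1] // 2, s16[2] * 2, s16[3] * 2)
    if b8[2:] != b16[2:]:
        raise ValueError("branch sizes differ; w and h should be divisible by 16")
    # Stack along the channel dimension.
    return (batch_size, b8[1] + b16[1], b8[2], b8[3])


out_shape = neck_output_shape(batch_size=2, w=1600, h=1408)
print(out_shape)
```

The check inside the function makes explicit that both branches must arrive at the same spatial size before they can be stacked.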
[0060] A further improvement to the technical solution of the present invention is that, in step S207, the model training content is as follows:
[0061] The detection head and loss function employ a classic anchor-box-based detection head; the loss function includes a classification loss $L_{cls}$ based on focal loss, a regression loss $L_{reg}$ based on Smooth L1 loss, and a direction loss $L_{dir}$ based on cross-entropy loss;
[0062] Among them, the regression loss $L_{reg}$ measures the deviation between the predicted target center point, predicted bounding box dimensions, and predicted offset angle and the true label, i.e., between $[x_a, y_a, z_a, l_a, w_a, h_a, \theta_a]$ and $[x_g, y_g, z_g, l_g, w_g, h_g, \theta_g]$. The regression values are defined by the following formulas:
[0063]
[0064]
[0065]
[0066]
[0067]
[0068]
[0069] $\Delta\theta = \sin(\theta_g - \theta_a)$
[0070] The regression loss formula is defined as follows:
[0071]
[0072] The classification loss formula is defined as follows:
[0073] $L_{cls} = -\alpha_t (1 - P_t)^{\gamma} \log(P_t)$
[0074] In the formula, $P_t$ represents the confidence level obtained by the model for the predicted box. α and γ represent hyperparameters, set to α = 0.25 and γ = 2, respectively.
[0075] The overall training loss function is shown below:
[0076] $L = \beta_{cls} L_{cls} + \beta_{reg} L_{reg} + \beta_{dir} L_{dir}$
[0077] The weights of the three losses in the total loss are set to $\beta_{cls} = 1$, $\beta_{reg} = 2$ and $\beta_{dir} = 0.2$.
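The classification loss, the angle residual, and the weighted total can be written out for scalar inputs. This is a minimal sketch for single values rather than batched tensors; the function names are illustrative, and the hyperparameter defaults are the ones stated in the text.

```python
import math


def focal_loss(p_t, alpha_t=0.25, gamma=2.0):
    """L_cls = -alpha_t * (1 - P_t)^gamma * log(P_t) for one prediction."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def angle_residual(theta_g, theta_a):
    """Delta-theta = sin(theta_g - theta_a)."""
    return math.sin(theta_g - theta_a)


def total_loss(l_cls, l_reg, l_dir, b_cls=1.0, b_reg=2.0, b_dir=0.2):
    """L = b_cls * L_cls + b_reg * L_reg + b_dir * L_dir."""
    return b_cls * l_cls + b_reg * l_reg + b_dir * l_dir


easy = focal_loss(0.9)      # confident, correct prediction
hard = focal_loss(0.5)      # uncertain prediction
flipped = angle_residual(math.pi, 0.0)
combined = total_loss(1.0, 1.0, 1.0)
print(easy, hard, flipped, combined)
```

The focal term down-weights confident predictions relative to uncertain ones. The sine residual is close to zero for a box rotated by 180 degrees, which is one plausible reason a separate direction loss is included alongside the regression loss.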
[0078] The technological advancements achieved by this invention due to the adoption of the above technical solutions are as follows:
[0079] This invention's 3D target detection method utilizes a column sequence attention feature encoder in the column feature encoding stage. On one hand, the column sequence attention module enriches the features of each original point by leveraging inter-point attention feature information, overcoming the limitation of traditional column encoders in terms of encoding information. On the other hand, a column feature soft aggregation module encodes the point feature information inside the column in a more refined manner, resulting in more powerful column feature information and addressing the coarse encoding problem of traditional column encoders. In the backbone network feature extraction stage, a dilated convolutional network is used to aggregate feature information within a larger receptive field in sparse and dense feature maps through sparse and dense dilated convolutions, compensating for the insufficient receptive field of traditional convolutional backbone networks.
[0080] This invention combines the deep overall abstract feature information output by the backbone network with the shallow local fine information, providing the detection head with more powerful feature information. In comparative testing, this invention has a more accurate detection effect compared with other methods, and can accurately and stably detect targets in many complex scenarios.
[0081] The column sequence attention feature encoder used in this invention consists of two modules: a column sequence attention module and a column feature soft aggregation module. The column sequence attention module is composed of an encoder and a decoder. The encoder completes the task of unifying the sequence size, and the decoder indirectly obtains the self-attention features between points inside the column. The column feature soft aggregation module makes full use of the attention feature information between points to aggregate the features of points inside the column into column features in a more refined way. It is mainly used to solve the problems of coarse encoding method and limited encoding information in the column feature encoding stage.
[0082] The dilated convolutional network used in this invention mainly consists of two types of convolutional modules: sparse dilated convolution and dense dilated convolution. The sparse dilated convolution and dense dilated convolution adopt a structure similar to Inception, using multiple parallel 2D convolutions with different dilation rates to aggregate large-scale receptive field feature information. The overall network structure is based on ResNet-18 but improved. Specifically, the first four basic convolutional blocks in ResNet-18 are replaced with sparse dilated convolutions, and the last basic convolutional block is replaced with a dense dilated convolution block. This is mainly used to solve the problem of insufficient receptive field in the feature extraction stage of the backbone network.
[0083] The attention weights in the cross-attention calculation of the column sequence attention feature encoder used in this invention are obtained with the scatter_softmax scatter operation. This function has a CUDA implementation and can perform scatter operations efficiently in parallel. Through the scatter softmax operation, the attention weights can be restricted to each column.
Attached Figure Description
[0084] Figure 1 is a flowchart of the 3D target detection system of the present invention;
[0085] Figure 2 is the overall structure diagram of the network model of the present invention;
[0086] Figure 3 is a flowchart illustrating the structure of the column sequence attention module of the present invention;
[0087] Figure 4 is a flowchart illustrating the structure of the column feature soft aggregation module of the present invention;
[0088] Figure 5 is a schematic diagram of the receptive field of the dilated convolution of the present invention;
[0089] Figure 6 is a structural diagram of the dilated convolution block of the present invention.
Detailed Implementation
[0090] The present invention will be further described in detail below with reference to embodiments:
[0091] As shown in Figure 1 and Figure 2, a 3D object detection method based on attention and dilated convolution includes the following steps:
[0092] Step S1: Data preprocessing: Crop the point cloud scenes of all folders in the training dataset to a fixed size. Set the detection ranges in the X, Y and Z axes of the point cloud scene to [0, 70.4]m, [-40, 40]m and [-3, 1]m respectively. Then generate the corresponding bin file for the cropped dataset.
[0093] Step S2: Build and train the network model: First, prepare the point cloud data for training, including point cloud scene data and target label data; then build the network model and set the hyperparameters of each module; then feed the training data into the network model for training. During the entire training process, the model is optimized by reducing the network's loss function to obtain the best-performing network model weights; finally, save the network model weights.
[0094] Step S3: Model Testing: Verify the model's effectiveness using test set data. Test the target detection performance by loading the weights of the best-performing network model from Step S2 to achieve accurate and real-time 3D target detection.
[0095] Specifically, the operation steps of step S1 are as follows:
[0096] Step S101: Crop the point cloud scene data in the KITTI dataset required for training;
[0097] Step S102: Set the detection ranges in the X, Y and Z axis directions to [0, 70.4]m, [-40, 40]m and [-3, 1]m respectively, set the column size to (0.05m, 0.05m), and set the maximum number of sampling points inside each column to 5, so as to obtain a BEV feature map with a resolution of [1600, 1408].
[0098] The steps for step S2 are as follows:
[0099] Step S201: Construct the network model. The overall network structure is shown in Figure 2. The network model consists of four parts: a column sequence attention feature encoder for column feature encoding, a dilated convolutional network for feature extraction in a large receptive field region, a neck network for feature fusion, and a classification and regression sub-network for target classification and regression; as shown in Figure 3 and Figure 4, the column sequence attention feature encoder includes a column sequence attention module and a column feature soft aggregation module; the receptive field of the dilated convolution is shown in Figure 5; as shown in Figure 6, the dilated convolutional network includes sparse dilated convolutional modules and dense dilated convolutional modules;
[0100] Step S202: The column sequence attention module achieves self-attention through a two-level cross-attention approach, using the attention information between points inside the column to enhance the features of the original point cloud data;
[0101] Specifically, the point cloud data first passes through the column sequence attention module. The cropped training set point cloud data is then fed into the column sequence attention module of the column sequence attention feature encoder. This column sequence attention module mainly addresses the problem of limited encoding information during column feature encoding. Previously, the column encoder had the problem of limited encoding information, which mainly manifested in that the encoder only encoded the original inherent information of the point cloud without mining and expanding the point cloud information. Considering that the self-attention mechanism can obtain the attention information between samples, the self-attention mechanism is used to mine the attention feature information between the original point clouds in the local area to enrich and enhance the original point cloud features.
[0102] The column sequence attention module contains two levels of cross attention, where the first level of cross attention is the encoder and the second level of cross attention is the decoder. The original point cloud feature data obtains cross attention features through the first level of cross attention in the encoder, completing the task of unifying the column sequence length. The cross attention features indirectly obtain the self-attention feature information between points through the second level of cross attention in the decoder. The column sequence attention module can enhance the point features without changing the feature dimension.
[0103] In the encoder, the input point cloud feature information $F_{in}$ is first obtained. $F_{in}$ consists of point cloud feature information and point cloud position information relative to the grid. The former is obtained by initial encoding of the original point cloud feature information $F = (x, y, z, r)$ through an MLP, and the latter is obtained by applying the position encoding PE to the offset $R = (x', y', z')$ of each point relative to the top left corner of its grid cell. The position encoding operation PE used here is similar to the position encoding operation used in DETR, and the operation process is as follows:
[0104] $F_{in} = \mathrm{MLP}(F) + \mathrm{PE}(R)$
[0105] $R = (P_i - (C_i \times P_{size})) / P_{size}$
[0106] In the formula, $P_i$ represents the global offset distance of the i-th point relative to the detection scene, $C_i$ represents the grid index of the i-th point, and $P_{size}$ represents the size of the column grid;
[0107] Next, the point cloud features $F_{in}$ are passed through MLPs to obtain the K and V vectors. The learnable, fixed-length virtual point set features $F_v$ are used as the Q vector for cross-attention modeling, which yields the attention feature information between the virtual point set features and the points inside each column. The virtual point set features are produced by a single linear layer and can be regarded as a fixed number of virtual points with a fixed feature dimension. The cross-attention uses the scatter_softmax function to calculate the weight of each point; this function has a CUDA implementation and can carry out scatter-related operations efficiently and in parallel.
[0108] Using the column index of each point as the scatter index, the scatter SoftMax operation restricts the attention weights of each virtual point to the original points within each column, producing attention features with the same feature dimension for every column, called cross-attention features (CAF). These CAF are then used as the column sequence features, with the cross-attention feature dimension serving as the sequence length. This solves the problem of inconsistent sequence lengths in attention modeling that arises when the original point cloud features are used directly as the column sequence features. The operation process is as follows:
[0109] $Q = F_v,\quad K = \mathrm{MLP}(F_{in}),\quad V = \mathrm{MLP}(F_{in})$
[0110] $\mathrm{CAF} = \mathrm{CrossAttention}(Q, K, V)$
[0111] $\mathrm{CrossAttention}(Q, K, V) = \mathrm{SoftMax}_{scatter}(QK^{T})\,V$
[0112] In the formula, Q, K, and V represent the query vector, key vector, and value vector used for attention modeling, respectively; $F_v$ represents the virtual point set features; $F_{in}$ represents the original point cloud features; MLP represents a linear layer; CrossAttention represents the cross-attention operation; CAF represents the obtained cross-attention features; $\mathrm{SoftMax}_{scatter}$ represents the scatter SoftMax operation;
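A minimal NumPy sketch of the encoder step is given below. The `scatter_softmax` helper is a local stand-in written for this example, not the library function the text refers to; the random weight matrices stand in for the MLPs, and all sizes are hypothetical.

```python
import numpy as np

def scatter_softmax(logits, index, num_groups):
    """Softmax over axis 0, computed separately for each group in `index`."""
    group_max = np.full((num_groups,) + logits.shape[1:], -np.inf)
    np.maximum.at(group_max, index, logits)      # per-group max for stability
    exp = np.exp(logits - group_max[index])
    group_sum = np.zeros_like(group_max)
    np.add.at(group_sum, index, exp)             # per-group normaliser
    return exp / group_sum[index]

rng = np.random.default_rng(0)
N, M, L, C = 7, 3, 4, 8                  # points, columns, virtual points, channels
col = np.array([0, 0, 0, 1, 1, 2, 2])    # column index of each point
f_in = rng.normal(size=(N, C))           # stands in for F_in
f_v = rng.normal(size=(L, C))            # stands in for the learnable F_v
w_k, w_v = rng.normal(size=(C, C)), rng.normal(size=(C, C))

q, k, v = f_v, f_in @ w_k, f_in @ w_v
logits = k @ q.T                         # [N, L]: one score per point and virtual point
attn = scatter_softmax(logits, col, M)   # normalised over the points of each column
caf = np.zeros((M, L, C))
np.add.at(caf, col, attn[:, :, None] * v[:, None, :])   # [M, L, C]
```

Every column ends up with exactly L feature vectors regardless of how many points it contains, which is what gives the column sequence a fixed length.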
[0113] The column sequence attention module decoder uses cross-attention features (CAF) as K and V vectors, and the original point cloud feature information as Q vector, to perform cross-attention modeling again, thereby indirectly obtaining the self-attention feature information (SF) between points within each column of the original point cloud. The operation process is as follows:
[0114] $Q = \mathrm{MLP}(F_{in}),\quad K = \mathrm{CAF},\quad V = \mathrm{CAF}$
[0115] $\mathrm{SF} = \mathrm{CrossAttention}(Q, K, V)$
[0116] $\mathrm{CrossAttention}(Q, K, V) = \mathrm{SoftMax}(QK^{T} / \sqrt{d})\,V$
[0117] In the formula, Q, K, and V represent the query vector, key vector, and value vector used for attention modeling, respectively; MLP represents a linear layer; CAF represents the cross-attention features with the same feature dimension obtained by the encoder; $F_{in}$ represents the original point cloud features; SF represents the self-attention features obtained by the decoder; CrossAttention represents the cross-attention operation; d represents the feature sequence length; SoftMax represents the ordinary SoftMax operation.
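The decoder step can be sketched as follows, assuming standard scaled dot-product attention in which each point attends to the CAF sequence of its own column. Here d is taken as the channel dimension of the keys, which is an assumption of this sketch; the random arrays are placeholders for learned features.

```python
import numpy as np

def softmax(x, axis=-1):
    """Ordinary, numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N, M, L, C = 7, 3, 4, 8                  # points, columns, virtual points, channels
col = np.array([0, 0, 0, 1, 1, 2, 2])    # column index of each point
f_in = rng.normal(size=(N, C))           # stands in for F_in
caf = rng.normal(size=(M, L, C))         # placeholder for the encoder output
w_q = rng.normal(size=(C, C))            # stands in for the query MLP

q = f_in @ w_q                           # [N, C]
k = v = caf[col]                         # [N, L, C]: CAF of each point's column
d = q.shape[-1]                          # assumed scaling dimension
scores = np.einsum("nc,nlc->nl", q, k) / np.sqrt(d)
attn = softmax(scores, axis=-1)          # ordinary softmax over the L entries
sf = np.einsum("nl,nlc->nc", attn, v)    # [N, C], same shape as F_in
```

The output has the same shape as the input point features, which is what allows the module to be stacked.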
[0118] On the one hand, the decoder further enhances the cross-attention features to indirectly obtain the self-attention features between points within the column sequence; on the other hand, it restores the dimension of the self-attention features to the dimension of the original input point cloud features, and updates the original input point cloud features while keeping the original input dimension unchanged. This operation enables the column sequence attention module to be stacked multiple times without changing the dimension of the input data features, just like the Transformer, thereby continuously enhancing the data representation capability.
[0119] Step S203: The point cloud features output by the column sequence attention module are refined by the column feature soft aggregation module. The weight information of each point inside the column is obtained by using scattering SoftMax, and the point cloud features are weighted accordingly to finally obtain more representative column feature information.
[0120] Specifically, the point feature information with strong representational capabilities obtained in step S202 is sent to the column feature soft aggregation module of the column sequence attention feature encoder. This module mainly solves the problem of coarse encoding in the column feature encoding stage: it makes full use of the attention feature information between points to aggregate the point features inside each column into column features in a more refined way.
[0121] The point cloud features output by the column sequence attention module are used as input to the column feature soft aggregation module. The weight of each point inside its column is obtained with the scatter SoftMax operation. The point cloud features are then multiplied by these weights to complete the weighting operation. The operation process is as follows:
[0122] $W_P = \mathrm{SoftMax}_{scatter}(F_{PSA})$
[0123] $F_{PFSA} = F_{PSA} * W_P$
[0124] $PF_{PFSA} = \mathrm{Add}_{scatter}(F_{PFSA})$
[0125] In the formula, $F_{PSA}$ represents the point cloud features output by the column sequence attention module; $W_P$ represents the weight of each point within its column; $\mathrm{SoftMax}_{scatter}$ represents the scatter SoftMax operation; $F_{PFSA}$ represents the weighted feature of each point; $\mathrm{Add}_{scatter}$ represents the scatter addition operation; $PF_{PFSA}$ represents the final column feature;
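The three formulas can be sketched in NumPy as below. The helper names `scatter_softmax` and `soft_aggregate` and the toy features are hypothetical; the softmax is applied per channel over the points of each column, which is an assumed reading of the scatter SoftMax.

```python
import numpy as np

def scatter_softmax(logits, index, num_groups):
    """Softmax over axis 0, computed separately for each group in `index`."""
    group_max = np.full((num_groups,) + logits.shape[1:], -np.inf)
    np.maximum.at(group_max, index, logits)
    exp = np.exp(logits - group_max[index])
    group_sum = np.zeros_like(group_max)
    np.add.at(group_sum, index, exp)
    return exp / group_sum[index]

def soft_aggregate(f_psa, col, num_cols):
    """Weight each point feature by its in-column softmax, then scatter-add."""
    w_p = scatter_softmax(f_psa, col, num_cols)      # W_P
    f_pfsa = f_psa * w_p                             # weighted point features
    pf = np.zeros((num_cols, f_psa.shape[1]))
    np.add.at(pf, col, f_pfsa)                       # one feature per column
    return pf

f_psa = np.array([[1.0, 0.0], [3.0, 0.0], [2.0, 5.0]])
col = np.array([0, 0, 1])                # two points in column 0, one in column 1
column_features = soft_aggregate(f_psa, col, 2)
```

In column 0 the stronger response (3.0) receives the larger weight, so the aggregated value lies closer to 3.0 than a plain mean would; a column with a single point returns that point's feature unchanged.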
[0126] The point cloud features of the column sequence attention module represent the degree of attention between points inside the column, with the foreground points having higher feature strength. This weighting method reduces the contribution of background points to the column features and increases the contribution of foreground points to the column features. Finally, the column feature soft aggregation module aggregates the information of all points inside the column in a scattering and addition manner to obtain the column feature information.
[0127] The column feature soft aggregation module performs refined processing on the inter-point attention information provided by the column sequence attention module. Through a filtering mechanism, it further refines and integrates the point cloud feature information, making full use of the expressive point cloud feature information provided by the column sequence attention module to obtain column feature information with strong representation capabilities, effectively solving the problem of coarse aggregation method.
[0128] Step S204: Use multiple SDEC layers composed of sparsely dilated convolutional modules to process sparse 2D feature data and extract features from sparse feature maps within a large receptive field.
[0129] Specifically, the column feature information obtained in step S203 is fed into the sparse dilated convolution module of the dilated convolutional network to extract sparse features in a large receptive field area. The sparse dilated convolution mainly solves the problem of insufficient receptive field in the backbone network during the feature extraction of sparse feature maps.
[0130] Previously, in semantic segmentation and object detection models such as DeepLabv3 and YOLOF, dilated convolution was widely used to expand the receptive field of the model and capture a wider range of contextual information. This type of convolution inserts gaps between the elements of a standard convolution kernel (controlled by the dilation rate), which expands the receptive field without adding parameters and keeps the same computational complexity as standard convolution. However, as the dilation rate increases, the convolution module ignores the features in the central region of the convolution kernel. Previous studies have shown that, to effectively improve detection performance, the size of the receptive field needs to match the size of the target. Therefore, the convolution module should not only have a sufficient receptive field, but also preserve the convolution features in the central region.
[0131] Since convolutions with a dilation rate of 1 can process local information in the center, while convolutions with a dilation rate greater than 1 can increase the receptive field, an Inception-like structure is used to combine the convolution results of kernels with different dilation rates. This preserves the original model's receptive field to cover smaller targets while fully utilizing the expanded receptive field of convolutions with larger dilation rates. This structure is called dilated convolution, and its receptive field is shown in Figure 5. The receptive field and the actual kernel size of the dilated convolution block are as follows:
[0132] $RF_{l+1} = RF_l + (K - 1) \times S_i$
[0133]
[0134] In the formula, $RF_{l+1}$ represents the receptive field of the current layer; $RF_l$ represents the receptive field of the previous layer; K represents the actual kernel size; n represents the number of convolutional blocks with different dilation rates; $S_i$ represents the product of the strides of all previous layers;
[0135] The actual kernel size increases with the increase of the dilation rate d of the dilated convolution. The dilated convolution combines n convolution blocks with different dilation rates to further increase the actual kernel size.
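The receptive field recurrence can be checked with a short sketch. It uses the standard effective kernel size of a single dilated convolution, k + (k - 1)(d - 1), as K; how the n parallel branches are combined into the actual kernel size is not reproduced here, and the layer lists are hypothetical.

```python
def effective_kernel(k, dilation):
    """Effective size of a k x k kernel with the given dilation rate."""
    return k + (k - 1) * (dilation - 1)

def receptive_field(layers):
    """layers: list of (kernel, stride, dilation), applied in order."""
    rf, stride_product = 1, 1
    for kernel, stride, dilation in layers:
        # RF_{l+1} = RF_l + (K - 1) * S_i, with S_i the product of earlier strides
        rf += (effective_kernel(kernel, dilation) - 1) * stride_product
        stride_product *= stride
    return rf

plain = receptive_field([(3, 1, 1)] * 3)                      # three 3x3 convs
dilated = receptive_field([(3, 1, 1), (3, 1, 2), (3, 1, 3)])  # dilation 1, 2, 3
strided = receptive_field([(3, 2, 1), (3, 1, 1)])             # stride-2 conv first
```

With the same number of parameters, the stack with dilation rates 1, 2 and 3 covers 13 input positions per axis, compared with 7 for three plain 3x3 convolutions.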
[0136] SDEC processes sparse 2D feature data and consists of four sparse dilated convolutional layers. Each layer includes one sparse convolutional block and two sparse dilated convolutional blocks. The sparse convolutional block consists of a regular 2D sparse convolution with a stride of 2, batch normalization, and a ReLU activation function. It is mainly used for downsampling sparse 2D feature maps. The sparse dilated convolutional block consists of two layers of multiple parallel sub-manifold sparse convolutions.
[0137] The input sparse feature map is first subjected to sparse convolution operations in different receptive field ranges. Then, the multi-size receptive field convolution feature map is combined with the original input feature map to obtain the dilated feature map. This operation is dilated convolution processing. Next, the dilated feature map is subjected to dilated convolution processing again to further extract sparse features.
[0138] Regarding the width of the convolutional block, dilated convolutions with different dilation rates are used to form an expanded structure, aggregating local and large-scale receptive field feature information. Regarding the depth of the convolutional block, multiple multi-size sparse feature extractions are performed to further learn the features within the large-scale receptive field.
[0139] Step S205: After the sparse feature map is densified, the DDEC layer, which is composed of dense dilated convolutional blocks, is used to extract features from the dense feature map in a large receptive field, and finally a level 2 dense feature map is obtained.
[0140] Specifically, the sparse feature map obtained in step S204 is made denser and then fed into the dense dilated convolution of the dilated convolutional network to extract features from the dense feature map within a large receptive field.
[0141] The DDEC structure is similar to SDEC. Since it processes dense 2D feature data, the DDEC block is composed of traditional 2D convolutions. In DDEC, a 2D convolution block with a stride of 2 is first used to downsample the ordinary 2D feature map. Then, two dense dilated convolution blocks are used to extract local and large-scale receptive field features from the ordinary 2D feature map. The dense dilated convolution block adopts the same structure as the sparse dilated convolution block, replacing the sub-manifold sparse convolution with ordinary convolution and performing 2D convolution operations in different receptive field ranges.
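A single-channel NumPy sketch of one dense dilated convolution stage is shown below. The dilation rates (1, 2, 3), the ReLU on each branch, and the use of addition to combine the branches with the input are all assumptions made for illustration; the text only states that the multi-size outputs are combined with the original input.

```python
import numpy as np

def conv2d_same(x, kernel, dilation=1):
    """Single-channel 2D cross-correlation with zero padding ('same' output size)."""
    k = kernel.shape[0]
    pad = dilation * (k // 2)
    xp = np.pad(x, pad)
    h, w = x.shape
    out = np.zeros_like(x, dtype=float)
    for i in range(k):
        for j in range(k):
            # Each kernel tap reads the input shifted by a multiple of the dilation.
            out += kernel[i, j] * xp[i * dilation:i * dilation + h,
                                     j * dilation:j * dilation + w]
    return out

def dilated_block(x, kernels, dilations):
    """Parallel branches with different dilation rates, combined with the input."""
    branches = [np.maximum(conv2d_same(x, k, d), 0.0)      # assumed ReLU per branch
                for k, d in zip(kernels, dilations)]
    return x + sum(branches)                               # assumed additive fusion

impulse = np.zeros((9, 9))
impulse[4, 4] = 1.0
out = dilated_block(impulse, [np.ones((3, 3))] * 3, (1, 2, 3))
rows, cols = np.nonzero(out)
```

Feeding a single impulse shows the footprint of the block: the rate-1 branch keeps the dense 3x3 centre, while the rate-3 branch extends the response to a 7x7 extent.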
[0142] Step S206: Input the level 2 dense feature map into the neck network for feature fusion. Adjust the number of channels and size of the feature map through transposed convolution and ordinary convolution operations. Finally, stack the processed feature maps to obtain a feature map with strong representation capabilities for use by the detection head. The detection head obtains the loss of category, position and orientation between the predicted result and the real result, thereby optimizing the model parameters.
[0143] Specifically, the level 2 dense feature map generated in step S205 is fed into the neck network for feature fusion to obtain dense feature maps of the backbone network with downsampling strides of 8 and 16, and their tensor dimensions are [batch_size,256,w / 8,h / 8] and [batch_size,256,w / 16,h / 16], respectively.
[0144] For a feature map with a downsampling stride of 8, a transposed convolution is used to reduce its channel count to 128 dimensions without changing the feature map size. After processing, the feature map tensor dimension is [batch_size, 128, w / 8, h / 8]. For a feature map with a downsampling stride of 16, a regular convolution is first used for feature update without changing the number of channels or the feature size. Then, a transposed convolution is used to halve the feature channel dimension and double the feature map size. After processing, the feature map tensor dimension is also [batch_size, 128, w / 8, h / 8].
[0145] Finally, the two processed feature maps are stacked in the channel dimension to obtain a feature map with tensor dimension [batch_size,256,w / 8,h / 8].
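The tensor shapes in the neck can be traced with the standard output-size formulas for convolution and transposed convolution. The kernel sizes, padding, and the 512 x 512 input used here are hypothetical; only the channel counts and strides come from the description.

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Output length of an ordinary convolution along one axis."""
    return (size + 2 * pad - kernel) // stride + 1

def deconv_out(size, kernel, stride=1, pad=0):
    """Output length of a transposed convolution (no output padding)."""
    return (size - 1) * stride - 2 * pad + kernel

def neck_shapes(batch, w, h):
    s8 = (batch, 256, w // 8, h // 8)
    s16 = (batch, 256, w // 16, h // 16)
    # Stride-8 branch: 1x1 transposed conv with stride 1, 256 -> 128 channels.
    up8 = (batch, 128, deconv_out(s8[2], 1, 1), deconv_out(s8[3], 1, 1))
    # Stride-16 branch: 3x3 conv (pad 1) keeps the shape, then a 2x2
    # transposed conv with stride 2 halves the channels and doubles the size.
    mid = (batch, 256, conv_out(s16[2], 3, 1, 1), conv_out(s16[3], 3, 1, 1))
    up16 = (batch, 128, deconv_out(mid[2], 2, 2), deconv_out(mid[3], 2, 2))
    # Stacking along the channel dimension requires equal spatial sizes.
    assert up8[2:] == up16[2:]
    fused = (batch, up8[1] + up16[1], up8[2], up8[3])
    return up8, up16, fused

up8, up16, fused = neck_shapes(batch=2, w=512, h=512)
```

Both branches arrive at [batch_size, 128, w / 8, h / 8], so the channel-wise stack yields [batch_size, 256, w / 8, h / 8].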
[0146] The neck network aggregates deep, overall abstract feature information generated by the backbone network as well as shallow, local, and fine information, providing the detection head with more powerful feature information.
[0147] Step S207: Input the processed point cloud dataset and train the network model;
[0148] Specifically, the detection head is a classic anchor-box-based detection head; the loss function includes a classification loss $L_{cls}$ based on Focal loss, a regression loss $L_{reg}$ based on smooth L1 loss, and a direction loss $L_{dir}$ based on cross-entropy loss;
[0149] Among them, the regression loss $L_{reg}$ measures the deviation of the predicted target center point, bounding box dimensions, and offset angle from the ground-truth label. The regression values between $[x_a, y_a, z_a, l_a, w_a, h_a, \theta_a]$ and $[x_g, y_g, z_g, l_g, w_g, h_g, \theta_g]$ are defined by the following formulas:
[0150]
[0151]
[0152]
[0153]
[0154]
[0155]
[0156] $\Delta\theta = \sin(\theta_g - \theta_a)$
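For orientation, anchor-based heads of this kind (for example SECOND and PointPillars) commonly encode the box residuals as in the sketch below. Only the angle term is taken from the text; that the remaining six terms take exactly this form is an assumption, and the diagonal d_a, the function name `encode_box`, and the sample boxes are introduced here for illustration.

```python
import numpy as np

def encode_box(gt, anchor):
    """Residuals between a ground-truth box and an anchor, both (x, y, z, l, w, h, theta)."""
    xg, yg, zg, lg, wg, hg, tg = gt
    xa, ya, za, la, wa, ha, ta = anchor
    d_a = np.sqrt(la ** 2 + wa ** 2)      # anchor diagonal in the ground plane (assumed)
    return np.array([
        (xg - xa) / d_a,                  # assumed form
        (yg - ya) / d_a,                  # assumed form
        (zg - za) / ha,                   # assumed form
        np.log(lg / la),                  # assumed form
        np.log(wg / wa),                  # assumed form
        np.log(hg / ha),                  # assumed form
        np.sin(tg - ta),                  # angle residual as given in the text
    ])

anchor = np.array([0.0, 0.0, -1.0, 3.0, 4.0, 1.5, 0.0])
gt = np.array([5.0, 0.0, 0.5, 3.0, 4.0, 1.5, np.pi / 2])
residual = encode_box(gt, anchor)
```

Normalising the position terms by the anchor size and taking logarithms of the size ratios keeps all regression targets on a comparable scale.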
[0157] The regression loss formula is defined as follows:
[0158]
[0159] The classification loss formula is defined as follows:
[0160] $L_{cls} = -\alpha_t (1 - P_t)^{\gamma} \log(P_t)$
[0161] In the formula, $P_t$ represents the confidence obtained by the model for the predicted box; $\alpha$ and $\gamma$ are hyperparameters, set to $\alpha = 0.25$ and $\gamma = 2$, respectively.
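A minimal NumPy sketch of this classification loss follows. It assumes the usual binary convention in which $P_t$ is the predicted probability of the true class and $\alpha_t$ equals $\alpha$ for positives and $1 - \alpha$ for negatives; the small `eps` term is added only for numerical safety.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Per-sample focal loss for predicted probabilities p and binary labels y."""
    p_t = np.where(y == 1, p, 1.0 - p)               # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)   # assumed class-balancing convention
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + eps)

p = np.array([0.9, 0.1])      # an easy and a hard positive example
y = np.array([1, 1])
loss = focal_loss(p, y)
```

The factor $(1 - P_t)^{\gamma}$ shrinks the loss of the confidently correct example far more than that of the misclassified one, so training concentrates on hard samples.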
[0162] The overall training loss function is shown below:
[0163] $L = \beta_{cls} L_{cls} + \beta_{reg} L_{reg} + \beta_{dir} L_{dir}$
[0164] The weights of the three losses in the total loss are set to $\beta_{cls} = 1$, $\beta_{reg} = 2$ and $\beta_{dir} = 0.2$.
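The weighted sum can be sketched as below, together with a common form of the smooth L1 function used for the regression term. The `beta` threshold of the smooth L1 function and the sample loss values are assumptions; only the three weights come from the text.

```python
import numpy as np

def smooth_l1(x, beta=1.0):
    """Quadratic near zero, linear elsewhere (beta is an assumed threshold)."""
    ax = np.abs(x)
    return np.where(ax < beta, 0.5 * ax ** 2 / beta, ax - 0.5 * beta)

def total_loss(l_cls, l_reg, l_dir, b_cls=1.0, b_reg=2.0, b_dir=0.2):
    """Weighted sum of the classification, regression, and direction losses."""
    return b_cls * l_cls + b_reg * l_reg + b_dir * l_dir

reg_terms = smooth_l1(np.array([0.5, 2.0]))
loss = total_loss(l_cls=0.5, l_reg=0.25, l_dir=1.0)
```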
[0165] In summary, the column-based 3D object detector proposed in this application can complete scene perception tasks efficiently and quickly, meeting the basic requirements of real-time detection in autonomous driving perception modules. Due to the inherent characteristics of the column model, its detection accuracy is not as high as that of voxel-based and point-based 3D object detection models. Therefore, a column sequence attention encoder and a dilated convolutional network are proposed for the column model. The former solves the problems of rigid encoding methods and limited encoding information in the column encoding stage, while the latter solves the problem of insufficient receptive field in the backbone network. Specifically, the column sequence attention encoder uses the column sequence attention module (PSA) to obtain the attention information between points in the local region of each column, and uses the column feature soft aggregation module (PFSA) to refine and aggregate the point information inside each column; the dilated convolutional network uses dilated convolution to obtain sparse and dense feature information in a large receptive field.
[0166] Experiments on the KITTI dataset validated the model's performance and the effectiveness of its modules. The average accuracy in the car category was 81.48%, which is 3.12% higher than the baseline model, while the inference time only increased by about 10ms.
[0167] Specifically, the dataset provided by the KITTI official website was used for training and testing. The training effect of the method was tested according to the evaluation tool of the KITTI dataset. As shown in Table 1, the results of different algorithms on the KITTI test dataset are compared. As can be seen from the comparison results in Table 1, the 3D object detection algorithm PAEN proposed in this application has achieved the best performance in the BEV index for vehicle categories in the simple, medium and hard levels of detection. The improvement in model performance also confirms the effectiveness of the components proposed in this application.
[0168] Table 1. PAEN's Target Detection Results for Vehicle Category BEV Indicators in the KITTI Test Set
[0169]
[0170] Table 2 compares the inference time of the model of this application with that of previous real-time detection models. Experiments were conducted on an NVIDIA RTX 4090. Compared to the PointPillars and PillarNet column models, the model achieved better accuracy, while the inference time only increased by about 10 ms. Compared to the SECOND voxel model, the model of this application achieved both better accuracy and faster inference.
[0171] Table 2. Comparison of PAEN's Inference Speed on the KITTI Test Set
[0172]
[0173] In the field of 3D object detection, previous processing-based models only employed simple aggregation strategies in the regularized feature encoding module. For example, they used the mean value of features within a grid as the grid feature, or a method similar to max pooling used the maximum feature value of each dimension of features within a grid as the feature of each dimension of the grid. These methods are called hard feature aggregation. While they can accomplish the task of regularized feature encoding, the obtained features are relatively coarse because they suffer from two problems: rigid aggregation and limited aggregation information. The rigidity of the aggregation method is reflected in the fact that the aggregated information is entirely derived linearly from the original point information, without any mining or learning, such as using gate units to filter information or using nonlinear operations to learn features. The limitation of the aggregated information is reflected in the fact that the aggregated information is inherent to the original points, without expansion or extension, such as information on the number of points, point density, and attention information between points. Furthermore, in the subsequent feature extraction module, the model directly processes the encoded grid features, while the implicit information of the original points is permanently ignored in the subsequent feature extraction process. This invention proposes a column sequence attention feature encoder. On the one hand, it obtains the attention feature information between points inside the column through the column sequence attention module of the column sequence attention feature encoder, which solves the problem of limited encoding information in traditional column encoders. 
On the other hand, it uses a column feature soft aggregation module to make full use of the attention feature information between points for column feature encoding in a more refined form, which solves the rigidity problem of the encoding method of traditional column encoders.
[0174] To address the issue of insufficient receptive field in the feature extraction stage of the backbone network, previous studies have shown that the size of the receptive field needs to match the target size to effectively improve the model's detection performance. Convolutions with a dilation rate of 1 can process central local information, while convolutions with a dilation rate greater than 1 can increase the receptive field. Therefore, the proposed method uses an Inception-like structure to combine the convolution results of kernels with different dilation rates. This preserves the original model's receptive field to cover smaller targets while fully utilizing the large dilation rates to expand the receptive field. This structure is called dilated convolution. The dilated convolutional network consists of four sparse dilated convolutional layers and one dense dilated convolutional layer, which extract feature information within a large receptive field from the sparse and dense feature maps, respectively. Stacking multiple layers of dilated convolutions in this way effectively solves the problem of insufficient receptive field in the backbone network.
[0175] It is understood that the present invention has been described through some embodiments, and those skilled in the art will recognize that various changes or equivalent substitutions can be made to these features and embodiments without departing from the spirit and scope of the invention. Furthermore, under the teachings of the present invention, these features and embodiments can be modified to adapt to specific situations and materials without departing from the spirit and scope of the invention. Therefore, the present invention is not limited to the specific embodiments disclosed herein, and all embodiments falling within the scope of the claims of this application are within the protection scope of the present invention.
Claims
1. A 3D object detection method based on column sequence attention and dilated convolution, characterized in that it includes the following steps:
Step S1: Data preprocessing: Crop the point cloud scenes of all folders in the training dataset to a fixed size. Set the detection ranges in the X, Y and Z axes of the point cloud scene to [0, 70.4]m, [-40, 40]m and [-3, 1]m respectively. Then generate the corresponding bin file for the cropped dataset.
Step S2: Construct the network model and train it.
Step S201: Construct a network model, which includes a column sequence attention feature encoder for column feature encoding, a dilated convolutional network for feature extraction in a large receptive field region, a neck network for feature fusion, and a classification and regression sub-network for target classification and regression; the column sequence attention feature encoder includes a column sequence attention module and a column feature soft aggregation module; the dilated convolutional network includes a sparse dilated convolutional module and a dense dilated convolutional module;
Step S202: The column sequence attention module achieves self-attention through a two-level cross-attention approach, using the attention information between points inside the column to enhance the features of the original point cloud data;
Step S203: The point cloud features output by the column sequence attention module are refined by the column feature soft aggregation module. The weight information of each point inside the column is obtained by using scatter SoftMax, and the point cloud features are weighted accordingly to finally obtain more representative column feature information.
Step S204: Use multiple SDEC layers composed of sparse dilated convolutional modules to process sparse 2D feature data and extract features from sparse feature maps within a large receptive field.
Step S205: After the sparse feature map is densified, the DDEC layer, which is composed of dense dilated convolutional blocks, is used to extract features from the dense feature map in a large receptive field, and finally a level 2 dense feature map is obtained.
Step S206: Input the level 2 dense feature map into the neck network for feature fusion. Adjust the number of channels and size of the feature map through transposed convolution and ordinary convolution operations. Finally, stack the processed feature maps to obtain a feature map with strong representation capabilities for use by the detection head. The detection head obtains the loss of category, position and orientation between the predicted result and the real result, thereby optimizing the model parameters.
Step S207: Input the processed point cloud dataset and train the network model;
Step S3: Model Testing: Verify the model's effectiveness using test set data. Test the target detection performance by loading the weights of the best-performing network model from Step S2 to achieve accurate and real-time 3D target detection.
2. The 3D target detection method based on column sequence attention and dilated convolution according to claim 1, characterized in that: In step S1, the data preprocessing includes the following steps: Step S101: Crop the point cloud scene data in the KITTI dataset required for training; Step S102: Set the detection ranges in the X, Y and Z axis directions to [0, 70.4]m, [-40, 40]m and [-3, 1]m respectively, set the column size to (0.05m, 0.05m), and set the maximum number of sampling points inside each column to 5, so as to obtain a BEV feature map with a resolution of [1600, 1408].
3. The 3D target detection method based on column sequence attention and dilated convolution according to claim 2, characterized in that: In step S202, the cropped training set point cloud data is first sent to the column sequence attention module of the column sequence attention feature encoder. The column sequence attention module contains two levels of cross attention, where the first level of cross attention is the encoder and the second level of cross attention is the decoder. In the encoder, the input point cloud feature information $F_{in}$ is first obtained. $F_{in}$ consists of point cloud feature information and point cloud position information relative to the grid. The point cloud feature information is obtained by initializing and encoding the original point cloud feature information F = (x, y, z, r) through MLP. The point cloud position information relative to the grid is obtained by position encoding, through PE, the offset information R = (x', y', z') of each point relative to the upper left corner of its grid. The operation process is as follows:
$F_{in} = \mathrm{MLP}(F) + \mathrm{PE}(R)$
$R = (P_i - C_i \times P_{size}) / P_{size}$
In the formula, $P_i$ represents the global offset distance of the i-th point relative to the detection scene; $C_i$ represents the grid index of the i-th point; $P_{size}$ represents the size of the column grid;
The point cloud features $F_{in}$ are passed through MLP to obtain the K and V vectors, and the learnable fixed-length virtual point set features $F_v$ are used as the Q vector for cross-attention modeling, to obtain the attention feature information between the virtual point set features and each point inside the column. The virtual point set features are produced by a single linear layer and are regarded as a fixed number of virtual points with a fixed feature dimension. Cross-attention uses the scatter_softmax function to calculate the weight of each point. This function has a CUDA implementation and can carry out scatter-related operations efficiently and in parallel. Using the column index of each point as the scatter index, the scatter SoftMax operation restricts the attention weights of each virtual point to the original points within each column, resulting in attention feature information with the same feature dimension, called Cross-Attention Feature (CAF). The operation process is as follows:
$Q = F_v,\quad K = \mathrm{MLP}(F_{in}),\quad V = \mathrm{MLP}(F_{in})$
$\mathrm{CAF} = \mathrm{CrossAttention}(Q, K, V)$
$\mathrm{CrossAttention}(Q, K, V) = \mathrm{SoftMax}_{scatter}(QK^{T})\,V$
In the formula, Q, K, and V represent the query vector, key vector, and value vector used for attention modeling, respectively; $F_v$ represents the virtual point set features; $F_{in}$ represents the original point cloud features; MLP represents a linear layer; CrossAttention represents a cross-attention operation; CAF represents the obtained cross-attention features; $\mathrm{SoftMax}_{scatter}$ represents the scatter SoftMax operation;
The column sequence attention module decoder uses the cross-attention feature (CAF) as the K and V vectors, and the original point cloud feature information as the Q vector, to perform cross-attention modeling again, thereby indirectly obtaining the self-attention feature information (SF) between points within each column of the original point cloud. The operation process is as follows:
$Q = \mathrm{MLP}(F_{in}),\quad K = \mathrm{CAF},\quad V = \mathrm{CAF}$
$\mathrm{SF} = \mathrm{CrossAttention}(Q, K, V)$
In the formula, SF represents the self-attention feature obtained by the decoder, d represents the feature sequence length, and SoftMax represents the ordinary SoftMax operation.
4. The 3D target detection method based on column sequence attention and dilated convolution according to claim 3, characterized in that: In step S203, the point feature information with strong representational capabilities obtained in step S202 is sent to the column feature soft aggregation module of the column sequence attention feature encoder. The point cloud features of the column sequence attention module are used as input to the column feature soft aggregation module. The weight information of each point inside the corresponding column is obtained by using scatter SoftMax. Then, the point cloud features of the column sequence attention module are multiplied with the weight information of each point to complete the weighting operation. The operation process is as follows:
$W_P = \mathrm{SoftMax}_{scatter}(F_{PSA})$
$F_{PFSA} = F_{PSA} * W_P$
$PF_{PFSA} = \mathrm{Add}_{scatter}(F_{PFSA})$
In the formula, $F_{PSA}$ represents the point cloud features of the column sequence attention module; $W_P$ represents the weight information of each point within its respective column; $\mathrm{SoftMax}_{scatter}$ represents the scatter SoftMax operation; $F_{PFSA}$ represents the weighted feature information of each point; $\mathrm{Add}_{scatter}$ represents the scatter addition operation; $PF_{PFSA}$ represents the final column feature.
5. The 3D target detection method based on column sequence attention and dilated convolution according to claim 4, characterized in that: in step S204, the column feature information obtained in step S203 is fed into the sparse dilated convolution module (SDEC) of the dilated convolutional network to extract sparse features over a large receptive field. An Inception structure combines the outputs of convolution kernels with different dilation rates; this structure constitutes the dilated convolution block. In the formulas for the receptive field and actual kernel size of the dilated convolution block, the terms denote the receptive field of the current layer, the receptive field of the previous layer, the actual kernel size K, the number n of convolutional blocks with different dilation rates, and the product of the strides of all previous layers. The actual kernel size depends on the dilation rate: combining n convolutional blocks with different dilation rates enlarges the actual kernel size as the dilation rate increases. SDEC processes sparse 2D feature data and consists of four sparse dilated convolutional layers, each comprising one sparse convolutional block and two sparse dilated convolutional blocks. The sparse convolutional block consists of a regular 2D sparse convolution with a stride of 2, batch normalization, and a ReLU activation function, and is mainly used to downsample sparse 2D feature maps. The sparse dilated convolutional block consists of two layers, each made up of multiple parallel submanifold sparse convolutions. The input sparse feature map first undergoes sparse convolution over different receptive field ranges; the resulting multi-scale receptive field feature maps are then combined with the original input feature map to obtain the dilated feature map. This operation constitutes one dilated convolution pass. The dilated feature map then undergoes a second dilated convolution pass to extract sparse features.
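The relationship between dilation rate, actual kernel size, and receptive field described in claim 5 can be checked with a short calculation. The sketch below assumes the standard textbook relations (actual kernel size k + (k - 1)(d - 1), and receptive field growing by (K - 1) times the product of earlier strides); the claim's own formulas, including how n enters them, are not reproduced here, and the dilation rates and layer list are hypothetical.

```python
def effective_kernel(kernel_size, dilation):
    """Extent covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

def receptive_field(layers):
    """layers: (kernel_size, dilation, stride) tuples, applied in order."""
    rf, jump = 1, 1  # jump is the product of the strides of all previous layers
    for kernel_size, dilation, stride in layers:
        rf += (effective_kernel(kernel_size, dilation) - 1) * jump
        jump *= stride
    return rf

# Hypothetical dilation rates for the parallel branches of one dilated block.
dilations = (1, 2, 3)
branch_kernels = [effective_kernel(3, d) for d in dilations]

# A stride-2 downsampling convolution followed by the widest branch (dilation 3).
rf = receptive_field([(3, 1, 2), (3, 3, 1)])
```

With 3x3 kernels the three branches cover 3, 5, and 7 pixels, and the two-layer stack reaches a receptive field of 15 input pixels, which illustrates why parallel dilated branches widen coverage without adding downsampling stages.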
6. The 3D target detection method based on column sequence attention and dilated convolution according to claim 5, characterized in that: in step S205, the sparse feature map obtained in step S204 is densified and then fed into the dense dilated convolution module (DDEC) of the dilated convolutional network to extract features from the dense feature map over a large receptive field. DDEC has the same structure as SDEC but processes dense 2D feature data, and its blocks are composed of conventional 2D convolutions. In DDEC, a 2D convolution block with a stride of 2 first downsamples the dense 2D feature map; two dense dilated convolution blocks then extract local and large-receptive-field features from it. The dense dilated convolution block adopts the same structure as the sparse dilated convolution block, with the submanifold sparse convolutions replaced by ordinary convolutions, and performs 2D convolution over different receptive field ranges.
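A dense dilated block of the kind described in claim 6 can be illustrated on a single-channel map. The sketch below is an assumption-laden toy: the helper names, the choice to sum the parallel branches with the input (the claim says only that they are "combined"), the shared all-ones kernel, and the dilation rates are hypothetical, and batch normalization and activations are omitted.

```python
def dilated_conv2d(image, kernel, dilation):
    """Zero-padded, same-size 2D cross-correlation with a dilated square kernel."""
    h, w = len(image), len(image[0])
    k = len(kernel)
    half = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    yy = y + (i - half) * dilation
                    xx = x + (j - half) * dilation
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out

def dilated_block(image, kernel, dilations):
    """Sum parallel dilated branches with the input (combination rule assumed)."""
    out = [row[:] for row in image]
    for d in dilations:
        branch = dilated_conv2d(image, kernel, d)
        for y, row in enumerate(branch):
            for x, value in enumerate(row):
                out[y][x] += value
    return out

# A single active pixel shows which positions each branch reaches.
ones = [[1.0] * 3 for _ in range(3)]
impulse = [[0.0] * 7 for _ in range(7)]
impulse[3][3] = 1.0
result = dilated_block(impulse, ones, dilations=(1, 2))
```

The dilation-1 branch spreads the impulse to its immediate 3x3 neighborhood, while the dilation-2 branch reaches positions two pixels away, so the combined map mixes local and wider context in one block.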
7. The 3D target detection method based on column sequence attention and dilated convolution according to claim 6, characterized in that: in step S206, the two levels of dense feature maps generated in step S205 are fed into the neck network for feature fusion. These are the dense feature maps of the backbone network with downsampling strides of 8 and 16, whose tensor dimensions are [batch_size, 256, w / 8, h / 8] and [batch_size, 256, w / 16, h / 16], respectively. For the feature map with a downsampling stride of 8, a transposed convolution reduces the channel count to 128 without changing the feature map size, giving a tensor dimension of [batch_size, 128, w / 8, h / 8]. For the feature map with a downsampling stride of 16, a regular convolution first updates the features without changing the channel count or the feature map size; a transposed convolution then halves the channel count and doubles the feature map size, so the resulting tensor dimension is also [batch_size, 128, w / 8, h / 8]. Finally, the two processed feature maps are concatenated along the channel dimension to obtain a feature map with tensor dimensions of [batch_size, 256, w / 8, h / 8].
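The tensor dimensions in claim 7 can be traced with simple shape arithmetic. In the sketch below, the transposed-convolution kernel sizes and strides (1 and 1 for the stride-8 branch, 2 and 2 for the stride-16 branch) and the 512 x 512 input are assumptions chosen to reproduce the stated shapes; the claim does not specify them.

```python
def conv_transpose_out(size, kernel, stride, padding=0):
    """Output length of a transposed convolution (dilation 1, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel

def neck_shapes(batch_size, w, h):
    s8 = (batch_size, 256, w // 8, h // 8)
    s16 = (batch_size, 256, w // 16, h // 16)
    # Stride-8 branch: transposed conv (assumed kernel 1, stride 1), 256 -> 128 channels.
    up8 = (batch_size, 128,
           conv_transpose_out(s8[2], 1, 1), conv_transpose_out(s8[3], 1, 1))
    # Stride-16 branch: a regular conv keeps the shape, then a transposed conv
    # (assumed kernel 2, stride 2) halves the channels and doubles the size.
    up16 = (batch_size, 128,
            conv_transpose_out(s16[2], 2, 2), conv_transpose_out(s16[3], 2, 2))
    if up8[2:] != up16[2:]:
        raise ValueError("spatial sizes must match before concatenation")
    fused = (batch_size, up8[1] + up16[1]) + up8[2:]
    return up8, up16, fused

# Hypothetical input resolution, divisible by 16.
up8, up16, fused = neck_shapes(batch_size=2, w=512, h=512)
```

Both branches end at 128 channels on the stride-8 grid, so concatenating them along the channel axis restores 256 channels without any further resampling.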
8. The 3D target detection method based on column sequence attention and dilated convolution according to claim 7, characterized in that: in step S207, the model is trained as follows: a classic anchor-based detection head is used, and the loss function comprises a classification loss based on Focal loss, a regression loss based on Smooth L1 loss, and a direction loss based on cross-entropy loss. The regression loss measures the deviation of the predicted target center point, the predicted bounding box dimensions, and the predicted offset angle from the ground-truth labels; the regression values between them, the regression loss, and the classification loss are each defined by a formula. In the classification loss formula, one term denotes the confidence the model assigns to the predicted bounding box, and two further terms are hyperparameters that are set separately. In the overall training loss function, three coefficients set the proportion of each of the three losses in the total loss.
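The three loss terms named in claim 8 can be sketched with their commonly used forms. Everything numeric below is illustrative: the claim's own hyperparameter settings and loss weights are not reproduced in this text, the total loss may include a normalization term that is not shown, and the function names are hypothetical.

```python
import math

def focal_loss(p, alpha, gamma):
    """Focal loss for confidence p on the true class: -alpha * (1 - p)^gamma * log(p)."""
    return -alpha * (1.0 - p) ** gamma * math.log(p)

def smooth_l1(residual, beta=1.0):
    """Smooth L1: quadratic for |r| < beta, linear beyond it."""
    r = abs(residual)
    return 0.5 * r * r / beta if r < beta else r - 0.5 * beta

def total_loss(l_cls, l_reg, l_dir, w_cls, w_reg, w_dir):
    """Weighted sum of classification, regression, and direction losses."""
    return w_cls * l_cls + w_reg * l_reg + w_dir * l_dir

# Illustrative values only, not the claim's settings.
easy = focal_loss(0.9, alpha=0.25, gamma=2.0)
hard = focal_loss(0.5, alpha=0.25, gamma=2.0)
reg = smooth_l1(0.5) + smooth_l1(2.0)
combined = total_loss(hard, reg, 0.3, w_cls=1.0, w_reg=2.0, w_dir=0.5)
```

The focal term shrinks rapidly for confident predictions, which is why it is a common choice when most anchors are easy background examples, and the Smooth L1 term limits the influence of large box residuals.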