An attention-based method for enhancing point cloud features of distant small targets
By enhancing the point cloud features of distant small targets using an attention-based mechanism, the problem of insufficient accuracy and speed in the detection of distant small targets in autonomous driving scenarios is solved, and more efficient 3D target detection is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NORTHEAST PETROLEUM UNIV
- Filing Date
- 2023-07-14
- Publication Date
- 2026-06-30
AI Technical Summary
Existing 3D target detection algorithms lack the accuracy and speed for detecting distant and small targets in autonomous driving scenarios, especially when point cloud data is disordered, sparse, or occluded, making accurate identification difficult.
An attention-based approach is adopted, which enhances the point cloud features of distant small targets by using techniques such as voxelization preprocessing, sparse convolutional networks, and dual-channel attention modules. It also integrates point cloud, voxel, and bird's-eye view features to improve detection accuracy and speed.
It effectively improves the accuracy and speed of distant small target detection, reduces computational resource consumption, and enhances the overall performance of 3D target detection.
Description
Technical Field
[0001] This disclosure relates to the field of three-dimensional target detection, and more specifically, to a method for enhancing the features of point cloud data of distant small targets.
Background Technology
[0002] In recent years, 3D object detection algorithms have achieved significant progress in the field of computer vision, and have been widely applied in various fields such as autonomous driving, augmented reality, and intelligent robotics. In autonomous driving, systems need to perceive their surroundings; 3D object detection algorithms can accurately detect pedestrians and vehicles on the road, facilitating informed decision-making and preventing traffic accidents. In augmented reality, 3D object detection algorithms acquire the positional information of objects in the surrounding environment and place virtual objects based on their contextual relationships, thereby enhancing the user's visual experience. Therefore, 3D object detection algorithms possess significant practical value and broad application prospects.
[0003] As a fundamental function of autonomous driving systems, 3D object detection is a crucial component of environmental perception, primarily detecting vehicles, pedestrians, and other traffic participants in a scene. Currently, 3D object detection algorithms for autonomous driving scenarios can be broadly categorized into two types based on input modality: depth image-based and point cloud-based methods. Depth image-based methods rely on depth cameras to acquire 3D information about the target by combining the distance between the object and the camera with RGB images. However, depth cameras often produce inaccurate depth data at long distances or in outdoor conditions, failing to meet our detection requirements. Point cloud data, obtained from LiDAR reflections, contains information such as object position and reflection intensity, providing precise spatial location information. It is also more stable and unaffected by weather or lighting conditions. Therefore, point cloud data is more suitable for 3D object detection. This function provides spatial obstacle distribution information for subsequent planning and control of autonomous vehicles, directly impacting driving safety; errors could have very serious consequences.
[0004] Researchers in China and abroad have studied point cloud-based 3D target detection algorithms. Point cloud-based methods rely primarily on LiDAR to capture point cloud data for localization and identification. LiDAR uses laser beams to detect targets and acquires point clouds containing rich information such as 3D coordinates and reflection intensity. This yields depth information for the detected target and copes well with changes in lighting and adverse weather conditions. However, the acquired point cloud data exhibits characteristics such as disorder, sparseness, and rotation invariance, which pose challenges for detection. Furthermore, because LiDAR data is acquired over long distances, the collected point cloud data is often incomplete and frequently suffers from object occlusion, small distant targets, and a low number of collected points. These factors all reduce the accuracy and speed of 3D target detection. Existing 3D target detection algorithms are relatively mature for near-range targets. To further improve the accuracy of 3D target detection, greater emphasis must be placed on the difficulty of detecting small, distant targets in autonomous driving scenarios.
[0005] See [1] Tian Feng, Jiang Wenwen, Liu Fang, et al. 3D target detection method based on hybrid voxels and original point clouds [J]. Journal of Chongqing University of Technology (Natural Science), 2022, 36(11): 108-117; and [2] Zhao Shixiang. Research on 3D target detection algorithm based on attention mechanism [D]. Xi'an University of Electronic Science and Technology, 2022. DOI: 10.27389/d.cnki.gxadu.2022.002951. 3D target detection technology mainly includes traditional methods based on hand-crafted features and learning methods based on deep learning. Methods based on hand-crafted features are often applicable only to specific scenes, and their detection accuracy is low in scenes with complex terrain. Deep learning algorithms also have shortcomings. Methods that use only convolutional neural networks are prone to missed detections and false detections of distant small targets. Because these algorithms consider only the local features of the point cloud, they cannot obtain complete target features, and detection performance suffers. Furthermore, targets far from the point cloud acquisition device often yield too few points to provide sufficient features, and algorithms that consider only the target's own features do not learn the neighborhood information of such targets, so such targets are detected poorly.
[0006] In summary, accurate and rapid 3D target detection is essential for autonomous driving. However, the inherent disorder and sparsity of point cloud data present challenges. Furthermore, the continuous movement of radar data acquisition equipment in autonomous driving scenarios inevitably leads to issues such as object occlusion, small distant targets, and low-quality data acquisition, all of which negatively impact the accuracy and speed of 3D target detection. Therefore, a method is needed to enhance the features of point cloud data for small distant targets, thereby improving the accuracy and speed of 3D target detection in autonomous driving scenarios.
Summary of the Invention
[0007] This disclosure proposes a method, electronic device, and storage medium for feature enhancement of distant small target point cloud data based on an attention mechanism, which can solve the prior art problems pointed out in the background art.
[0008] Basic Solution 1:
[0009] A method for feature enhancement of point cloud data of distant small targets based on an attention mechanism, the method comprising:
[0010] Raw point cloud data is collected using the LiDAR of autonomous vehicles;
[0011] The collected raw point cloud data is preprocessed into voxels to obtain preprocessed voxels;
[0012] The preprocessed voxels are extracted by the voxel feature extractor and then input into the sparse convolutional network to obtain multi-scale semantic voxel features.
[0013] The multi-scale semantic voxel features after sparse convolution are converted into a feature bird's-eye view map, which is then input into the region candidate network to generate the initial target classification and candidate regions.
[0014] The raw point cloud data collected by the lidar is divided into equally proportioned far and small target regions, and then a parallel random farthest point sampling algorithm is used to obtain the point cloud set of the far and small target regions.
[0015] Linear projection and topological feature extraction are performed on the point cloud set of the far-small target region to obtain a local feature sequence containing the neighborhood geometric information of each key point in the far-small target point cloud region. Then, the local feature sequence containing the neighborhood spatial information of the far-small target point cloud set is input into the dual-channel attention module to obtain the overall spatial structure information.
[0016] The extracted local feature sequences are recovered by global pooling. The recovered point cloud sequences are then input into a dual-channel attention module, and attention cross-computation is used to perform feature enhancement operations on distant and small target point clouds to obtain enhanced distant and small target point cloud features.
[0017] Multi-scale semantic voxel features and enhanced distant small target point cloud features are fused to obtain the final fused features, which are used to refine the initial target classification and candidate region anchor boxes to obtain detection results.
[0018] The process of performing voxelization preprocessing on the collected raw point cloud data to obtain preprocessed voxels follows the specific path below:
[0019] Based on the actual distribution of the original point cloud data, the scene space is divided into three-dimensional voxels;
[0020] For point cloud data, the vehicle's forward direction is taken as the X-axis, the left and right directions as the Y-axis, and the direction perpendicular to the XY plane as the Z-axis. Let the range of the target scene on the three axes be L, in meters. Calculate the difference between the maximum and minimum values of the point cloud data coordinates in the X, Y, and Z directions respectively. Then, determine the length, width, and height of the initial voxel based on the three differences. After the calculation is completed, the initial voxel of the target scene is obtained.
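The voxel division described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the implementation of this disclosure: the function name `voxelize` and the `voxel_size` argument are hypothetical, and the scene extent is taken from the per-axis minimum and maximum of the point coordinates, as the paragraph describes.

```python
import numpy as np

def voxelize(points, voxel_size):
    """Assign each point to a voxel of an axis-aligned grid.

    points:     (N, 3+) array; the first three columns are x, y, z in meters.
    voxel_size: three voxel edge lengths along x, y, z (hypothetical argument).
    Returns integer voxel indices of shape (N, 3) and the grid shape (3,).
    """
    xyz = points[:, :3]
    lo = xyz.min(axis=0)                      # minimum coordinate per axis
    hi = xyz.max(axis=0)                      # maximum coordinate per axis
    extent = hi - lo                          # per-axis max-min difference
    size = np.asarray(voxel_size, dtype=float)
    grid_shape = np.maximum(np.ceil(extent / size).astype(int), 1)
    idx = np.floor((xyz - lo) / size).astype(int)
    # Points lying exactly on the upper boundary go into the last voxel.
    idx = np.minimum(idx, grid_shape - 1)
    return idx, grid_shape

# Illustrative input: random points in a scene roughly 70 m x 80 m x 4 m.
rng = np.random.default_rng(0)
pts = rng.uniform([0.0, -40.0, -3.0], [70.4, 40.0, 1.0], size=(1000, 3))
voxel_idx, grid_shape = voxelize(pts, voxel_size=[0.2, 0.2, 0.4])
```

Points that share a row of `voxel_idx` belong to the same voxel; only voxels that contain at least one point need to be stored, which is what makes the later sparse convolution worthwhile.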
[0021] The process involves extracting preprocessed voxels using a voxel feature extractor and then inputting them into a sparse convolutional network to obtain multi-scale semantic voxel features. The specific path is as follows:
[0022] First, the voxel feature extractor is used to directly calculate the average value of the point-by-point features within the voxel. Then, the element-wise max pooling operation is used to obtain the local clustered features of each voxel. The obtained features are expanded and then the expanded features are connected with the point-by-point features. The obtained voxel features are input into a three-dimensional sparse convolutional network to obtain voxel features. The specific three-dimensional sparse convolution operation is shown in the following formula (1):
[0023] Formula (1)
[0024] Where the left-hand side represents the output of the 3D sparse convolution operation, j represents the output index, m represents the output channel, l represents the input channel, and k represents the kernel offset. The remaining terms are the filter elements, the matrix representing the set of sparse data, and the rule matrix.
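The idea behind sparse convolution, computing outputs only from non-empty voxels, can be illustrated with a small sketch. It is an assumption-laden illustration rather than the network of this disclosure: the function name `sparse_conv3d` is hypothetical, stride is 1, there is no bias, and a Python dictionary replaces the rule matrix that an optimized implementation would build to gather inputs and scatter outputs per kernel offset.

```python
import numpy as np
from itertools import product

def sparse_conv3d(coords, feats, weight):
    """Stride-1 sparse 3D convolution over active voxels only.

    coords: (N, 3) integer coordinates of non-empty voxels.
    feats:  (N, C_in) features of those voxels.
    weight: (K, K, K, C_in, C_out) filter, K odd.
    Returns output coordinates (M, 3) and output features (M, C_out).
    """
    K = weight.shape[0]
    r = K // 2
    out = {}
    for k in product(range(K), repeat=3):      # loop over kernel offsets k
        offset = np.array(k) - r
        w_k = weight[k]                        # (C_in, C_out) slice for offset k
        for c, f in zip(coords, feats):
            j = tuple(c - offset)              # output site that sees input c at offset k
            out[j] = out.get(j, 0.0) + f @ w_k # sum over input channels l
    out_coords = np.array(sorted(out))
    out_feats = np.stack([out[tuple(c)] for c in out_coords])
    return out_coords, out_feats
```

Each (input voxel, kernel offset, output voxel) triple visited by the inner loop corresponds to one entry that a rule matrix would record, so the amount of work scales with the number of active voxels rather than with the full grid.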
[0025] The process of converting sparsely convolved multi-scale semantic voxel features into a feature bird's-eye view map, inputting it into a region candidate network, and generating initial target classifications and candidate regions follows the specific path below:
[0026] The multi-scale semantic voxel feature data after sparse convolution is downsampled on the Z-axis, thereby converting the sparse data into a dense feature map, that is, the three-dimensional data is reshaped into an image similar to two-dimensional data; the RPN detection framework is used to generate the initial target classification and candidate region anchor boxes, with one three-dimensional anchor box for each class, using the average three-dimensional size of the target in that class, as shown in the specific regression target calculation formula (2):
[0027] Formula (2)
[0028] Where x, y, and z are the coordinates of the center point; w, l, and h are the width, length, and height of the box; t denotes the encoded value, g denotes the ground truth, and a denotes the anchor box.
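A regression target of this kind can be sketched as below. The sketch assumes the residual encoding used by VoxelNet-style anchor-based detectors (center offsets normalized by the anchor's diagonal or height, log size ratios, and a heading difference); formula (2) of this disclosure may differ in detail, for example in how the heading angle is handled. The function names and the numeric box values are hypothetical.

```python
import numpy as np

def encode_box(gt, anchor):
    """Encode a ground-truth box relative to an anchor box.

    Both boxes are (x, y, z, w, l, h, theta). Assumed encoding:
    x, y offsets divided by the anchor's ground-plane diagonal,
    z offset divided by the anchor height, log ratios for sizes.
    """
    xg, yg, zg, wg, lg, hg, tg = gt
    xa, ya, za, wa, la, ha, ta = anchor
    da = np.sqrt(la ** 2 + wa ** 2)            # anchor diagonal in the ground plane
    return np.array([
        (xg - xa) / da, (yg - ya) / da, (zg - za) / ha,
        np.log(wg / wa), np.log(lg / la), np.log(hg / ha),
        tg - ta,
    ])

def decode_box(t, anchor):
    """Invert encode_box: recover the box from encoded values t."""
    xt, yt, zt, wt, lt, ht, tt = t
    xa, ya, za, wa, la, ha, ta = anchor
    da = np.sqrt(la ** 2 + wa ** 2)
    return np.array([
        xt * da + xa, yt * da + ya, zt * ha + za,
        np.exp(wt) * wa, np.exp(lt) * la, np.exp(ht) * ha,
        tt + ta,
    ])

# Illustrative values only: one anchor and one nearby ground-truth box.
anchor = np.array([10.0, 2.0, -1.0, 1.6, 3.9, 1.56, 0.0])
gt = np.array([10.5, 1.8, -0.9, 1.7, 4.2, 1.5, 0.1])
targets = encode_box(gt, anchor)
```

Normalizing by the anchor size keeps the targets on a similar scale for every class, which is the usual motivation for using one anchor with the class-average dimensions.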
[0029] The raw point cloud data collected by the lidar is divided into equally proportioned far-small target regions, and a parallel random farthest point sampling algorithm is then applied to obtain the point cloud set of the far-small target regions. The specific path is as follows:
[0030] For the input raw point cloud set, a fixed number of points are selected as key points in the following steps. First, one point is selected at random as the starting point and written into the key point set. Next, the distance from each of the remaining points to the starting point is calculated, and the farthest point is written into the key point set. Then, for each remaining point, the distances to all points already in the key point set are calculated, and the shortest of these is taken as that point's distance to the key point set; the point with the largest such distance is written into the key point set. If the key point set now contains the required number of points, the selection is complete; otherwise, the preceding step is repeated until the required number of key points has been selected. In this way, the key points are sampled from the raw point cloud. Based on the point cloud distribution of autonomous driving scenarios, the number of key points is set to 2048, and the key points are used to represent the entire 3D scene, as shown in formula (3):
[0031] Formula (3)
[0032] Where h represents the multilayer perceptron feature extraction layer, max( ) represents the symmetric max pooling operation, and γ represents the higher-level feature extraction.
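The key point selection procedure can be sketched as follows. This shows plain serial farthest point sampling with a random starting point; it does not reproduce the parallel, region-wise variant named in this disclosure, whose details are not given here. The function name and the `seed` argument are hypothetical.

```python
import numpy as np

def farthest_point_sampling(points, m, seed=0):
    """Select m key point indices from an (N, 3) point array.

    Starts from a random point, then repeatedly adds the point whose
    shortest distance to the already selected key points is largest.
    """
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    selected = np.empty(m, dtype=int)
    selected[0] = rng.integers(n)              # random starting point
    min_dist = np.full(n, np.inf)              # shortest distance to the key point set
    for i in range(1, m):
        d = np.linalg.norm(points - points[selected[i - 1]], axis=1)
        min_dist = np.minimum(min_dist, d)     # update with the newest key point
        selected[i] = int(np.argmax(min_dist)) # farthest remaining point
    return selected

# Illustrative use: sample 2048 key points from a random cloud of 20000 points.
cloud = np.random.default_rng(1).random((20000, 3))
key_idx = farthest_point_sampling(cloud, 2048)
key_points = cloud[key_idx]
```

Keeping a running `min_dist` array means each iteration costs one pass over the points, so the whole selection is O(N·M) rather than recomputing all pairwise distances.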
[0033] The linear projection and topological feature extraction operations on the point cloud set of the far-small target region are performed to obtain a local feature sequence containing the neighborhood geometric information of each key point in the far-small target point cloud region. Then, the local feature sequence containing the neighborhood spatial information of the far-small target point cloud set is input into a dual-channel attention module to obtain the overall spatial structure information. The specific path is as follows:
[0034] The KNN (K-Nearest Neighbors) algorithm is used to extract the topological features of each key point and its neighbors, yielding the feature sequence of the key points in the far-small target region, from which the structural information of each key point within its local neighborhood is learned. The key point sequence is then linearly projected into high-dimensional vectors and embedded, yielding a local feature sequence that contains the neighborhood geometric information of each key point in the far-small target point cloud region. This local feature sequence is input into the dual-channel attention module, which calculates the correlation among the local neighborhood features of the key points in the known area of the far-small target, thereby obtaining the overall spatial structure information in the form of channel attention and spatial attention features, as shown in formulas (4) and (5):
[0035] Formula (4)
[0036] Formula (5)
[0037] Here, the two operators in formulas (4) and (5) denote the channel attention computation and the spatial attention computation, respectively, and the two results denote the feature vectors calculated by channel attention and by spatial attention, respectively. From these, the output features describing the overall spatial structure correlation of the far-small targets are obtained; they include the refined structural features of the far-small target point cloud region and the structural correlation information of the missing point cloud set.
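The KNN grouping and linear projection that feed the attention module can be sketched as below. The neighborhood descriptor used here (key point coordinates concatenated with neighbor offsets) is an assumption borrowed from common graph-based point networks, not a detail stated in this disclosure, and the function names, `k`, and the projection matrix are hypothetical. The attention computation itself is not sketched.

```python
import numpy as np

def knn_local_features(keypoints, k):
    """Build a neighborhood descriptor for each key point.

    keypoints: (M, 3) array, with k < M.
    Returns features of shape (M, k, 6) holding, per neighbor,
    [key point xyz, neighbor xyz - key point xyz], and the
    neighbor indices of shape (M, k).
    """
    diff = keypoints[:, None, :] - keypoints[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)           # squared pairwise distances
    np.fill_diagonal(dist2, np.inf)            # a point is not its own neighbor
    nn_idx = np.argsort(dist2, axis=1)[:, :k]  # k nearest neighbors per key point
    neighbors = keypoints[nn_idx]              # (M, k, 3)
    centers = np.repeat(keypoints[:, None, :], k, axis=1)
    feats = np.concatenate([centers, neighbors - centers], axis=-1)
    return feats, nn_idx

def linear_projection(feats, weight):
    """Project each 6-D neighborhood descriptor to a higher dimension."""
    return feats @ weight                      # (M, k, 6) @ (6, D) -> (M, k, D)

# Illustrative use with 128 key points, 16 neighbors, 64-D embedding.
rng = np.random.default_rng(0)
kp = rng.random((128, 3))
local_feats, nn_idx = knn_local_features(kp, k=16)
embedded = linear_projection(local_feats, rng.normal(size=(6, 64)))
```

The resulting `embedded` array plays the role of the local feature sequence: one entry per key point, each carrying geometric information about its neighborhood.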
[0038] The extracted local feature sequences are then used to reconstruct the point cloud sequence through global pooling. The reconstructed point cloud sequence is then input into a dual-channel attention module, where attention cross-calculation is used to enhance the point cloud features of distant and small targets, resulting in an enhanced point cloud feature sequence. The specific path is as follows:
[0039] First, the point cloud sequence is recovered through global pooling. Dual-channel attention is then calculated on the input features to obtain the neighborhood structure features of the key points of the far-small targets. Next, attention cross-computation is performed between these features and the output features of the feature extraction part to obtain the structural correlation between the coarse neighborhood structure features of the far-small target key points and the refined structural features of the far-small target regions. Finally, the local structural details of the far-small target key points are fused to obtain the refined global structural features of the far-small targets. The calculation is shown in formula (6):
[0040] Formula (6)
[0041] The attention cross-calculation is shown in formula (7):
[0042] Formula (7)
[0043] Where w is the projection matrix output by the dual-channel attention module, and H represents the number of attention subspaces, H=2.
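To make the attention cross-computation with H = 2 subspaces concrete, the sketch below uses standard scaled dot-product multi-head cross-attention. This is an assumption: formulas (6) and (7) of this disclosure are not reproduced here and may differ. The names `cross_attention`, `wq`, `wk`, `wv`, and `wo` are hypothetical; `wo` stands in for an output projection matrix.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feats, kv_feats, wq, wk, wv, wo, num_heads=2):
    """Multi-head cross-attention between two feature sequences.

    q_feats:  (M, d) query features, e.g. coarse neighborhood features.
    kv_feats: (N, d) key/value features, e.g. refined region features.
    wq, wk, wv, wo: (d, d) projection matrices; d divisible by num_heads.
    Returns fused features (M, d) and attention weights (H, M, N).
    """
    m, d = q_feats.shape
    dh = d // num_heads
    # Project, then split the channel dimension into H subspaces.
    q = (q_feats @ wq).reshape(m, num_heads, dh).transpose(1, 0, 2)
    k = (kv_feats @ wk).reshape(-1, num_heads, dh).transpose(1, 0, 2)
    v = (kv_feats @ wv).reshape(-1, num_heads, dh).transpose(1, 0, 2)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))  # (H, M, N)
    heads = attn @ v                                         # (H, M, dh)
    # Concatenate the H subspaces and apply the output projection.
    out = heads.transpose(1, 0, 2).reshape(m, d) @ wo
    return out, attn

# Illustrative shapes only.
rng = np.random.default_rng(0)
d = 8
coarse = rng.normal(size=(5, d))
refined = rng.normal(size=(7, d))
mats = [rng.normal(size=(d, d)) for _ in range(4)]
fused, weights = cross_attention(coarse, refined, *mats, num_heads=2)
```

Each row of `weights` sums to 1, so every query position forms a weighted combination of the refined features; this is one way to express the structural correlation between the two feature sets.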
[0044] The process involves fusing multi-scale semantic voxel features and enhanced distant small target point cloud features to obtain the final fused features, which are then used to refine the initial target classification and candidate region anchor boxes to obtain detection results. The specific path is as follows:
[0045] To fuse the multi-scale semantic voxel features and the enhanced far-small target point cloud features, both the multi-scale semantic voxel feature map and the enhanced far-small target point cloud features are first input into a convolutional layer. They are then passed to a top-down path, where deconvolution upsamples the feature map and the result is concatenated with the feature map from the convolutional layer. Next, the feature maps from the top-down path are transformed to the same size and merged by stacking to obtain the fused features. Finally, two 1×1 convolutional layers generate the detection result, as shown in Figure 7.
[0046] This disclosure also has two other applications:
[0047] An electronic device includes at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to implement the method described in Basic Solution 1.
[0048] A computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, implement the method described in Basic Solution 1.
[0049] The above-described at least one technical solution adopted in one or more embodiments of this specification can achieve the following beneficial effects:
[0050] First, the method disclosed herein is a feature enhancement approach for point cloud data of distant small targets. On one hand, it integrates features from point clouds, voxels, and bird's-eye views to improve proposal refinement. On the other hand, it uses a dual-channel attention mechanism to enhance the features of point cloud data of distant small targets, obtaining more effective features of these targets and thereby improving the accuracy of 3D target detection. In addition, by using a parallel random farthest point sampling algorithm, it reduces the consumption of computational resources and thereby improves the speed of 3D target detection.
[0051] In summary, the method presented in this disclosure enhances the target features by employing a dual-channel attention mechanism to obtain the correlation features between distant and small target point clouds and the global attention features of the point clouds. Then, it connects the neighborhood structure features, refined structure features, and local structural details of key points of the distant and small target point clouds together to obtain the enhanced global structure features of distant and small targets, which can effectively improve the accuracy of the 3D target detection algorithm.
[0052] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and are not intended to limit this disclosure.
[0053] Other features and aspects of this disclosure will become clear from the following detailed description of exemplary embodiments with reference to the accompanying drawings.
Description of the Drawings
[0054] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this disclosure and, together with the specification, serve to illustrate the technical solutions of this disclosure.
[0055] Figure 1 is a flowchart of feature enhancement for a distant small target point cloud according to a specific embodiment of this disclosure;
[0056] Figure 2 is a diagram of the attention-based 3D target detection network architecture for enhancing the features of point cloud data of distant small targets according to a specific embodiment of this disclosure;
[0057] Figure 3 is a diagram of the voxel feature extraction network structure according to a specific embodiment of this disclosure;
[0058] Figure 4 is a point cloud feature extraction diagram according to a specific embodiment of this disclosure;
[0059] Figure 5 is a channel attention network structure diagram according to a specific embodiment of this disclosure;
[0060] Figure 6 is a spatial attention network structure diagram according to a specific embodiment of this disclosure;
[0061] Figure 7 is a detection example diagram according to a specific embodiment of this disclosure;
[0062] Figure 8 is a block diagram of an electronic device 800 according to an exemplary embodiment;
[0063] Figure 9 is a block diagram of an electronic device 1900 according to an exemplary embodiment.
Detailed Implementation
[0064] Various exemplary embodiments, features, and aspects of this disclosure will now be described in detail with reference to the accompanying drawings. The same reference numerals in the drawings denote elements that have the same or similar functions. Although various aspects of the embodiments are shown in the drawings, they are not necessarily drawn to scale unless specifically indicated otherwise.
[0065] Furthermore, to better illustrate this disclosure, numerous specific details are set forth in the following detailed description. Those skilled in the art should understand that this disclosure can be practiced without certain specific details. In some instances, methods, means, components, and circuits well known to those skilled in the art have not been described in detail in order to highlight the main points of this disclosure. A specific embodiment is given below, intended to further describe the technical solutions provided in this disclosure in conjunction with the accompanying drawings and embodiments. It should be noted that the following embodiments are intended to facilitate understanding of this disclosure and do not constitute any limitation thereof.
[0066] Figure 1 shows a flowchart of a method for three-dimensional target detection according to embodiments of the present disclosure. As shown in Figure 1, the method includes:
Step S101: collecting raw point cloud data using the LiDAR of an autonomous vehicle, and performing voxelization preprocessing on the collected raw point cloud data to obtain preprocessed voxels;
Step S102: extracting the preprocessed voxels with a voxel feature extractor and inputting them into a sparse convolutional network to obtain multi-scale semantic voxel features;
Step S103: converting the multi-scale semantic voxel features after sparse convolution into a feature bird's-eye view map, and inputting it into a region candidate network to generate an initial target classification and candidate regions;
Step S104: dividing the raw point cloud data collected by the LiDAR into equally proportioned far-small target regions, and then using a parallel random farthest point sampling algorithm to obtain the point cloud set of the far-small target regions;
Step S105: processing the point cloud set of the far-small target regions...;
Step S106: performing linear projection and topological feature extraction to obtain local feature sequences containing the neighborhood geometric information of the key points in the far-small target point cloud region, and inputting these local feature sequences, which contain the neighborhood spatial information of the far-small target point cloud set, into a dual-channel attention module to obtain overall spatial structure information;
Step S107: recovering the point cloud sequence from the extracted local feature sequences through global pooling, inputting the recovered point cloud sequence into the dual-channel attention module, and using attention cross-computation to enhance the far-small target point cloud features, obtaining enhanced far-small target point cloud features;
Step S108: fusing the multi-scale semantic voxel features and the enhanced far-small target point cloud features to obtain the final fused features, which are used to refine the initial target classification and candidate region anchor boxes to obtain detection results.
This method enhances the feature information of far-small targets in autonomous driving scenarios, improves proposal refinement, reduces the consumption of computing resources, and shortens detection time, thereby improving the overall speed and accuracy of 3D target detection.
[0067] Step S101: Collect raw point cloud data using the LiDAR of the autonomous vehicle, perform voxelization preprocessing on the collected raw point cloud data, and obtain preprocessed voxels.
[0068] In the embodiments of this disclosure and other possible embodiments, devices such as lidar, color camera, grayscale camera, GPS navigation system and optical lens are used to acquire video images to be processed. The video images to be processed include video frames at multiple times, corresponding point cloud data and related parameter files.
[0069] In the embodiments of this disclosure and other possible embodiments, the collected raw point cloud data is voxelized as follows. Based on the actual distribution of the raw point cloud data, the vehicle's forward direction is taken as the X-axis, the left-right direction as the Y-axis, and the direction perpendicular to the XY plane as the Z-axis, and the range of the detected target scene on the three axes is denoted L (in meters). First, according to the actual scene dataset and the targets of interest, the point cloud is cropped to the range [-3, 1] × [-40, 40] × [0, 70.4] m along the Z × X × Y axes. The cropped space is then divided into voxels of size H × L × W = 0.4 × 0.2 × 0.2 m.
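As a worked example of what these settings imply, the number of voxels along each axis is the crop extent divided by the voxel edge length. The sketch assumes the crop intervals are signed, that is, [-3, 1] m, [-40, 40] m, and [0, 70.4] m, paired in that order with the 0.4 m, 0.2 m, and 0.2 m voxel edges; the variable names are hypothetical.

```python
import numpy as np

# Assumed crop range per axis (lower, upper) in meters, and voxel edge lengths.
crop_range = np.array([[-3.0, 1.0], [-40.0, 40.0], [0.0, 70.4]])
voxel_size = np.array([0.4, 0.2, 0.2])

extent = crop_range[:, 1] - crop_range[:, 0]            # 4.0, 80.0, 70.4 m
grid_shape = np.round(extent / voxel_size).astype(int)  # voxels per axis
total_voxels = int(grid_shape.prod())

print(grid_shape.tolist())   # [10, 400, 352]
print(total_voxels)          # 1408000
```

A grid of about 1.4 million cells, most of them empty in a LiDAR scan, is the reason only non-empty voxels are passed on to the sparse convolutional network.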
[0070] Step S102: The preprocessed voxels are extracted by the voxel feature extractor and input into the sparse convolutional network to obtain multi-scale semantic voxel features;
[0071] In embodiments of this disclosure and other possible embodiments, the preprocessed voxels are extracted by the voxel feature extractor and input into a sparse convolutional network to obtain multi-scale semantic voxel features, as shown in Figure 3. The process includes: using the voxel feature extractor to directly calculate the average of the point-wise features within each non-empty voxel; using element-wise max pooling to obtain the locally aggregated feature of each voxel and expanding the obtained feature; concatenating the expanded feature with the point-wise features; and inputting the resulting voxel features into a three-dimensional sparse convolutional network to obtain multi-scale semantic voxel features.
[0072] In embodiments of this disclosure and other possible embodiments, the voxel feature extractor directly calculates the average of the point-wise features within each non-empty voxel. The voxel feature extractor comprises two VFE (Voxel Feature Encoding) layers and one FCN (Fully Connected Network) layer. The VFE layers take the point cloud data within the same voxel as input and extract features. The FCN layer, consisting of a linear layer, a BatchNorm layer, and a ReLU layer, then extracts the point cloud features, and the average of the point-wise features within each non-empty voxel is calculated directly.
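The per-voxel operations described above (a point-wise linear layer with ReLU, element-wise max pooling, expansion, concatenation, and the mean of the point-wise features) can be sketched for a single non-empty voxel. This is a simplified illustration: BatchNorm is omitted, the weights are random, and the function names are hypothetical.

```python
import numpy as np

def vfe_layer(voxel_points, weight):
    """One simplified voxel feature encoding step for a single voxel.

    voxel_points: (T, C_in) features of the T points inside the voxel.
    weight:       (C_in, C_out) linear layer weights (hypothetical values).
    Returns (T, 2 * C_out): point-wise features concatenated with the
    voxel's element-wise max-pooled feature, repeated for every point.
    """
    pointwise = np.maximum(voxel_points @ weight, 0.0)    # linear + ReLU
    aggregated = pointwise.max(axis=0, keepdims=True)     # element-wise max pooling
    expanded = np.repeat(aggregated, voxel_points.shape[0], axis=0)
    return np.concatenate([pointwise, expanded], axis=1)

def mean_voxel_feature(voxel_points):
    """Average of the point-wise features inside one non-empty voxel."""
    return voxel_points.mean(axis=0)

# Illustrative voxel with 5 points, each with (x, y, z, intensity).
rng = np.random.default_rng(0)
voxel = rng.normal(size=(5, 4))
encoded = vfe_layer(voxel, rng.normal(size=(4, 16)))
voxel_mean = mean_voxel_feature(voxel)
```

Because max pooling and averaging are both independent of point order, the voxel feature does not change when the points inside the voxel are shuffled.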
[0073] In the embodiments of this disclosure and other possible embodiments, the step of inputting the acquired voxel features into a three-dimensional sparse convolutional network to obtain multi-scale semantic voxel features includes: inputting the acquired voxel features into a three-dimensional sparse convolutional network to obtain voxel features, and the specific three-dimensional sparse convolution operation is shown in formula (8):
[0074] Formula (8)
[0075] Where the left-hand side represents the output of the 3D sparse convolution operation, j represents the output index, m represents the output channel, l represents the input channel, and k represents the kernel offset. The remaining terms are the filter elements, the matrix representing the set of sparse data, and the rule matrix.
[0076] Step S103: Convert the multi-scale semantic voxel features after sparse convolution into a feature bird's-eye view map, input it into the region candidate network, and generate the initial target classification and candidate regions.
[0077] In embodiments of this disclosure and other possible embodiments, the step of converting sparsely convolved multi-scale semantic voxel features into a feature bird's-eye view includes: downsampling the feature data on the Z-axis to convert the sparse data into a dense feature map, i.e., the three-dimensional data is reshaped into an image similar to two-dimensional data.
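One common way to realize this conversion is to scatter the sparse voxel features into a dense tensor and fold the (already downsampled) height axis into the channel axis. The sketch below assumes that approach; the disclosure states only that the data is downsampled along Z and reshaped into an image-like map, so the channel folding and the function names are assumptions.

```python
import numpy as np

def sparse_to_dense(coords, feats, grid_shape):
    """Scatter sparse voxel features into a dense (C, D, H, W) tensor.

    coords:     (N, 3) integer voxel indices ordered (z, y, x).
    feats:      (N, C) voxel features.
    grid_shape: (D, H, W) size of the downsampled voxel grid.
    """
    d, h, w = grid_shape
    dense = np.zeros((feats.shape[1], d, h, w), dtype=feats.dtype)
    dense[:, coords[:, 0], coords[:, 1], coords[:, 2]] = feats.T
    return dense

def to_bev(dense):
    """Fold the height axis into channels: (C, D, H, W) -> (C * D, H, W)."""
    c, d, h, w = dense.shape
    return dense.reshape(c * d, h, w)

# Illustrative input: two active voxels with 2-channel features on a 2x3x4 grid.
coords = np.array([[0, 1, 2], [1, 0, 3]])
feats = np.array([[1.0, 2.0], [3.0, 4.0]])
bev = to_bev(sparse_to_dense(coords, feats, (2, 3, 4)))
```

The result is an ordinary 2D feature map, so a standard 2D convolutional region proposal network can process it directly.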
[0078] In embodiments of this disclosure and other possible embodiments, the region candidate network generates the initial target classifications and candidate regions as follows: the feature bird's-eye view map is passed to the RPN detection framework, which generates the initial target classifications and candidate region anchor boxes. Each class has one three-dimensional anchor box, whose dimensions are the average three-dimensional size of the targets in that class. The regression targets are calculated as shown in formula (9):
[0079] Formula (9)
[0080] Where x, y, and z are the coordinates of the center point; w, l, and h are the width, length, and height of the box; t denotes the encoded value, g denotes the ground truth, and a denotes the anchor box.
[0081] Step S104: Divide the raw point cloud data collected by the lidar into distant small-target regions of equal proportion, and then apply the parallel random farthest point sampling algorithm to obtain the point cloud set of the distant small-target regions.
[0082] In the embodiments of this disclosure and other possible embodiments, the step of dividing the raw point cloud data collected by the lidar into distant small-target regions of equal proportion and then applying a parallel random farthest point sampling algorithm to obtain the point cloud set of those regions proceeds as follows. From the input raw point cloud set, a fixed number of points is selected as key points for the next stage. First, one point is selected at random as the starting point and written into the key point set. Next, the distance from each remaining point to the starting point is calculated, and the farthest point is written into the key point set. Then, for each remaining point, the distances to every point in the key point set are calculated, and the shortest of these is taken as that point's distance to the key point set; the point with the largest such distance is written into the key point set. If the key point set now contains the required number of points, the selection is complete; otherwise, the preceding step is repeated until the required number has been selected. In this way the key points are sampled from the point cloud. Based on the point cloud distribution of autonomous driving scenarios, the number of key points is set to 2048, and the key points are used to represent the entire 3D scene. The specific calculation is shown in formula (10):
[0083] Formula (10)
[0084] Where h represents the multilayer perceptron feature extraction layer, max(·) represents the symmetric max pooling operation, and γ represents the higher-layer feature extraction. The network structure of the point cloud feature extraction part is shown in Figure 4.
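The key point selection of step S104 can be sketched as follows. This is a sequential, single-region illustration of farthest point sampling with a random starting point; the parallel, per-region variant described in the disclosure is not reproduced, and the function names are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def farthest_point_sampling(points, num_keypoints, seed=None):
    """Return the indices of the selected key points.

    A random point starts the key point set. Each further step adds the point
    whose shortest distance to the current key point set is largest.
    """
    if num_keypoints <= 0:
        return []
    if num_keypoints >= len(points):
        return list(range(len(points)))
    rng = random.Random(seed)
    start = rng.randrange(len(points))
    selected = [start]
    # nearest[i] is the shortest squared distance from point i to the key point set.
    nearest = [squared_distance(p, points[start]) for p in points]
    while len(selected) < num_keypoints:
        farthest = max(range(len(points)), key=lambda i: nearest[i])
        selected.append(farthest)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], squared_distance(p, points[farthest]))
    return selected


line_points = [(0.0,), (1.0,), (2.0,), (10.0,)]
picked = farthest_point_sampling(line_points, 2, seed=0)
```

Keeping the running shortest distance per point means each iteration costs one pass over the cloud rather than a full comparison against every key point selected so far.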
[0085] Step S105: Perform linear projection and topological feature extraction operations on the point cloud set of the distant small-target region to obtain a local feature sequence containing the neighborhood geometric information of each key point in the distant small-target point cloud region. Then, input this local feature sequence into the dual-channel attention module to obtain the overall spatial structure information.
[0086] In embodiments of this disclosure and other possible embodiments, the step of performing linear projection and topological feature extraction operations on the point cloud set of the distant small-target region to obtain a local feature sequence containing the neighborhood geometric information of each key point includes: using KNN to extract the topological features between key points and their neighboring points, which gives the feature sequence of the key points in the distant small-target region. This sequence is used to learn the structural information of each key point within its local spatial neighborhood. The key point sequence is then linearly projected into high-dimensional vectors and embedded, yielding a local feature sequence containing the neighborhood geometric information of each key point in the distant small-target point cloud region. The specific calculation is shown in formula (11):
[0087] Formula (11)
[0088] Here, L(·) denotes the linear projection calculation, and the accompanying matrix is its weight matrix.
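A minimal sketch of the neighborhood step follows. It assumes that the topological feature of a key point is the list of coordinate offsets to its k nearest neighbors, and that the linear projection is a plain matrix product; both are illustrative assumptions, and all names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def knn_indices(points, center_id, k):
    """Indices of the k nearest neighbors of a key point, excluding itself."""
    others = [i for i in range(len(points)) if i != center_id]
    others.sort(key=lambda i: squared_distance(points[i], points[center_id]))
    return others[:k]


def local_feature(points, center_id, k):
    """Concatenate the neighbor offsets relative to the key point (assumed feature)."""
    center = points[center_id]
    feature = []
    for i in knn_indices(points, center_id, k):
        feature.extend(pi - ci for pi, ci in zip(points[i], center))
    return feature


def linear_projection(vector, weight):
    """Project a vector with a weight matrix given as weight[input][output]."""
    out_dim = len(weight[0])
    return [
        sum(vector[r] * weight[r][c] for r in range(len(vector)))
        for c in range(out_dim)
    ]


cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (5.0, 5.0, 5.0)]
feature = local_feature(cloud, 0, 2)
# Illustrative 6 x 2 weight matrix; a trained network would learn these values.
weight = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
embedded = linear_projection(feature, weight)
```

Using offsets rather than absolute coordinates makes the feature depend only on the local shape around each key point.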
[0089] In embodiments of this disclosure and other possible embodiments, the step of inputting a local feature sequence containing neighborhood spatial information of the distant small-target point cloud set into a dual-channel attention module to obtain overall spatial structure information includes the following. Channel attention looks for the "important" parts of the input feature vector. To make the channel attention computation efficient, the spatial dimension of the input vector is compressed: max pooling and average pooling aggregate the spatial information of the feature vector, the two pooled results are passed through a shared network layer composed of multilayer perceptrons, and the outputs are merged by element-wise addition to give the merged feature vector. Spatial attention uses the internal spatial relationships of the feature vectors processed by channel attention; it focuses on the specific location of the information and complements channel attention. The feature vectors processed by channel attention are max pooled and average pooled in turn to generate two three-dimensional feature vectors, which are then input into a convolutional layer to generate the spatial attention features. The network structures of channel attention and spatial attention are shown in Figure 5 and Figure 6, respectively.
[0090] Channel attention and spatial attention features are calculated as shown in formulas (12) and (13):
[0091] Formula (12)
[0092] Formula (13)
[0093] In formulas (12) and (13), the two attention operators denote the channel attention computation and the spatial attention computation, respectively, and the two resulting terms denote the feature vectors computed by channel attention and spatial attention, respectively. The result is the output feature describing the overall spatial structure correlation of the distant small targets, which includes the refined structural features of the distant small-target point cloud region and the structural correlation information of the missing point cloud set.
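The two attention stages can be sketched on a small C×N feature map (C channels, N points). The sigmoid gating and the one-dimensional convolution over point positions are assumptions borrowed from common channel-plus-spatial attention designs, not taken from formulas (12) and (13); the weights are placeholders and all names are hypothetical.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def shared_mlp(vector, w1, w2):
    """Two-layer perceptron with ReLU; the same weights serve both pooled inputs."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vector))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def channel_attention(feat, w1, w2):
    """Pool over positions, pass through the shared MLP, add, gate each channel."""
    avg = [sum(ch) / len(ch) for ch in feat]
    mx = [max(ch) for ch in feat]
    merged = [a + b for a, b in zip(shared_mlp(avg, w1, w2), shared_mlp(mx, w1, w2))]
    gates = [sigmoid(m) for m in merged]          # assumed gating function
    return [[g * v for v in ch] for g, ch in zip(gates, feat)]


def spatial_attention(feat, kernel):
    """Pool over channels, convolve the two pooled rows, gate each position.

    kernel[0] weights the max-pooled row and kernel[1] the average-pooled row;
    the convolution is zero-padded so the length N is preserved.
    """
    c, n = len(feat), len(feat[0])
    mx = [max(feat[ch][i] for ch in range(c)) for i in range(n)]
    avg = [sum(feat[ch][i] for ch in range(c)) / c for i in range(n)]
    half = len(kernel[0]) // 2
    gates = []
    for i in range(n):
        total = 0.0
        for j in range(len(kernel[0])):
            pos = i + j - half
            if 0 <= pos < n:
                total += kernel[0][j] * mx[pos] + kernel[1][j] * avg[pos]
        gates.append(sigmoid(total))              # assumed gating function
    return [[gates[i] * feat[ch][i] for i in range(n)] for ch in range(c)]


features = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]
zero_w1 = [[0.0, 0.0]]          # 1 hidden unit x 2 channels (placeholder weights)
zero_w2 = [[0.0], [0.0]]        # 2 channels x 1 hidden unit (placeholder weights)
after_channel = channel_attention(features, zero_w1, zero_w2)
after_spatial = spatial_attention(after_channel, [[0.0], [0.0]])
```

With all-zero placeholder weights every gate equals 0.5, which makes the effect of each stage easy to check by hand; trained weights would produce gates that differ per channel and per position.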
[0094] Step S106: The extracted local feature sequence is recovered into a point cloud sequence through global pooling. The recovered point cloud sequence is then input into a dual-channel attention module, and attention cross-computation is used to perform a feature enhancement operation on the distant small-target point cloud to obtain the enhanced distant small-target point cloud features.
[0095] In the embodiments of this disclosure and other possible embodiments, the step of recovering the point cloud sequence from the extracted local feature sequence through global pooling includes: performing max pooling on the feature sequence extracted in the previous part to extract global information, and reshaping it into a sequence of distant small-target key point centers containing the missing point cloud set and a feature sequence containing correlations with the global feature structure. The two are then combined by embedding to form the input feature sequence of the distant small-target feature enhancement part, which is fed into the subsequent network for further feature enhancement.
[0096] In the embodiments of this disclosure and other possible embodiments, the step of inputting the recovered point cloud sequence into a dual-channel attention module and using attention cross-computation to perform a feature enhancement operation on the distant small-target point cloud to obtain enhanced distant small-target point cloud features includes: performing dual-channel attention calculation on the input features, with a calculation process similar to formulas (12) and (13), to obtain coarse neighborhood structure features of the distant small-target key points. Attention cross-computation is then performed between these features and the output features of the feature extraction part to obtain the structural correlation between the coarse neighborhood structural features of the distant small-target key points and the refined structural features of the distant small-target regions. The local structural details of the distant small-target key points are then fused to obtain the refined global structural features of the distant small targets. The specific calculation is shown in formula (14):
[0097] Formula (14)
[0098] The attention cross-calculation is represented by the formula (15):
[0099] Formula (15)
[0100] Where w is the projection matrix output by the dual-channel attention module, and H represents the number of attention subspaces, H=2.
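As an illustration of attention cross-computation over H=2 subspaces, the sketch below assumes standard scaled dot-product attention, with the coarse features as queries and the refined features as keys and values. Formulas (14) and (15) are not reproduced here, and the projection matrix w is omitted (treated as identity), so this is a sketch under stated assumptions with hypothetical names.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values, heads=2):
    """Each query attends to all keys, independently in each attention subspace.

    The feature dimension is split evenly across the heads; the outputs of the
    heads are concatenated back into one vector per query.
    """
    dim = len(queries[0])
    if dim % heads != 0:
        raise ValueError("feature dimension must be divisible by the number of heads")
    sub = dim // heads
    outputs = []
    for q in queries:
        row = []
        for h in range(heads):
            lo, hi = h * sub, (h + 1) * sub
            scores = [
                sum(a * b for a, b in zip(q[lo:hi], k[lo:hi])) / math.sqrt(sub)
                for k in keys
            ]
            weights = softmax(scores)
            for d in range(lo, hi):
                row.append(sum(w * v[d] for w, v in zip(weights, values)))
        outputs.append(row)
    return outputs


coarse = [[1.0, 0.0, 0.0, 1.0]]                         # queries (illustrative)
refined_keys = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
refined_values = [[0.0, 0.0, 0.0, 0.0], [2.0, 2.0, 2.0, 2.0]]
fused = cross_attention(coarse, refined_keys, refined_values, heads=2)
```

Because the two keys are identical, each head weights both values equally, so the output is their mean.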
[0101] Step S107: Fuse the multi-scale semantic voxel features and the enhanced distant small-target point cloud features to obtain the final fused features, which are used to refine the initial target classification and candidate region anchor boxes to obtain detection results.
[0102] In the embodiments of this disclosure and other possible embodiments, the fusion of multi-scale semantic voxel features and enhanced distant small-target point cloud features first inputs the multi-scale semantic voxel feature map and the enhanced distant small-target point cloud features into a convolutional layer. These features are then passed to a top-down path, where deconvolution upsamples the feature map and concatenates it with the feature map from the convolutional layer. Next, the feature maps from the top-down path are transformed to the same size and merged by stacking to obtain the fused features. Finally, two 1×1 convolutional layers generate the detection result, as shown in Figure 7. In Figure 7(a) and (b), the targets are relatively sparse and far from the lidar point cloud data acquisition device, which indicates that this disclosure also detects small targets at a relatively long distance well. Therefore, the algorithm has good detection capability in autonomous driving scenarios.
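The upsample, concatenate, and 1×1 convolution pattern of the fusion step can be sketched on nested lists. Nearest-neighbor upsampling stands in here for the learned deconvolution, which is a deliberate simplification; the weights are placeholders and the names are hypothetical.

```python
def upsample_nearest(fmap, factor):
    """Upsample a C x H x W map by repeating rows and columns (stand-in for deconvolution)."""
    return [
        [
            [row[x // factor] for x in range(len(row) * factor)]
            for row in channel
            for _ in range(factor)
        ]
        for channel in fmap
    ]


def concat_channels(fmap_a, fmap_b):
    """Stack two maps of equal spatial size along the channel axis."""
    if len(fmap_a[0]) != len(fmap_b[0]) or len(fmap_a[0][0]) != len(fmap_b[0][0]):
        raise ValueError("feature maps must share the same height and width")
    return fmap_a + fmap_b


def conv1x1(fmap, weight):
    """1x1 convolution: weight[out_channel][in_channel] mixes channels per pixel."""
    height, width = len(fmap[0]), len(fmap[0][0])
    return [
        [
            [sum(w * fmap[c][y][x] for c, w in enumerate(row)) for x in range(width)]
            for y in range(height)
        ]
        for row in weight
    ]


coarse_map = [[[1.0, 2.0]]]                              # 1 x 1 x 2
fine_map = [[[1.0] * 4 for _ in range(2)]]               # 1 x 2 x 4
upsampled = upsample_nearest(coarse_map, 2)              # now 1 x 2 x 4
stacked = concat_channels(upsampled, fine_map)           # 2 x 2 x 4
fused_map = conv1x1(stacked, [[1.0, 1.0]])               # placeholder weights
```

Bringing both maps to the same spatial size before stacking is what lets a 1×1 convolution combine them pixel by pixel.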
[0103] Those skilled in the art will understand that, in the above-described method of the specific implementation, the order in which each step is written does not imply a strict execution order and does not constitute any limitation on the implementation process. The specific execution order of each step should be determined by its function and possible internal logic.
[0104] Experiments on this disclosure were conducted on a 64-bit Ubuntu 18.04 operating system. All models were trained and tested using four GeForce RTX 2080 Ti GPUs and an Intel i7 CPU. The model parameter configurations are shown in Table 1. When dividing the point cloud region, the range was [0, 70.4] m on the X-axis, [-40, 40] m on the Y-axis, and [-3, 1] m on the Z-axis. The model was trained end to end with the adam_onecycle optimizer, which adds a one-cycle learning rate adjustment strategy to the Adam optimizer. The batch size was 16, the initial learning rate was 0.01, the learning rate decay followed a cosine annealing strategy, and training ran for a total of 100 iterations. All experiments used the PyTorch 1.7 deep learning framework with Python 3.6.
[0105]
[0106] Table 1 Model Parameter Configuration Table
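The point cloud range stated for the experiments can be applied as a simple crop before voxelization. The sketch below is illustrative; the function name `crop_to_range` and the tuple layout of the points are assumptions, while the numeric bounds are the ones given in the text.

```python
# Axis bounds in meters, in X, Y, Z order, as stated for the experiments.
POINT_CLOUD_RANGE = ((0.0, 70.4), (-40.0, 40.0), (-3.0, 1.0))


def crop_to_range(points, bounds=POINT_CLOUD_RANGE):
    """Keep only points whose x, y, z fall inside the bounds (inclusive).

    Points may carry extra fields such as intensity after the coordinates;
    only the first three values are checked.
    """
    return [
        p for p in points
        if all(lo <= value <= hi for value, (lo, hi) in zip(p, bounds))
    ]


raw_points = [
    (10.0, 0.0, 0.0, 0.3),    # inside, with an intensity field
    (80.0, 0.0, 0.0, 0.1),    # beyond the X range
    (5.0, -41.0, 0.0, 0.2),   # beyond the Y range
    (5.0, 5.0, 2.0, 0.9),     # above the Z range
    (70.4, 40.0, 1.0, 0.5),   # exactly on the boundary
]
kept = crop_to_range(raw_points)
```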
[0107] Performance evaluation criteria: to determine whether a detection box is correct, its confidence level and its IoU, i.e., the volume overlap between the detection box and the target box, are evaluated. The specific calculation method is shown in formula (16):
[0108] Formula (16)
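A volume-based IoU can be illustrated with axis-aligned boxes. Real detection boxes are usually rotated around the vertical axis, which this sketch ignores; it assumes each box is given as center plus size, and the function name is hypothetical.

```python
def iou_3d_axis_aligned(box_a, box_b):
    """Intersection volume divided by union volume for two axis-aligned boxes.

    Each box is (cx, cy, cz, dx, dy, dz): center coordinates and full extents.
    Assumption: no rotation, so the overlap is a product of per-axis overlaps.
    """
    intersection = 1.0
    for axis in range(3):
        lo = max(box_a[axis] - box_a[axis + 3] / 2, box_b[axis] - box_b[axis + 3] / 2)
        hi = min(box_a[axis] + box_a[axis + 3] / 2, box_b[axis] + box_b[axis + 3] / 2)
        if hi <= lo:
            return 0.0  # no overlap along this axis means no overlap at all
        intersection *= hi - lo
    volume_a = box_a[3] * box_a[4] * box_a[5]
    volume_b = box_b[3] * box_b[4] * box_b[5]
    return intersection / (volume_a + volume_b - intersection)


detection = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
target = (1.0, 0.0, 0.0, 2.0, 2.0, 2.0)
overlap = iou_3d_axis_aligned(detection, target)
```

In this example the boxes share a 1×2×2 region, so the intersection is 4 and the union is 8 + 8 - 4 = 12.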
[0109] During detection, comparing a detection result with the ground-truth box gives four possible cases. TP (True Positive) indicates that the prediction is consistent with the label; TN (True Negative) indicates that a location is correctly predicted as background; FP (False Positive) indicates that the prediction is inconsistent with the label; and FN (False Negative) indicates that a real target exists at the location but the detection model did not predict it.
[0110] By counting the four cases above, the precision and recall of 3D object detection can be calculated. Precision is the proportion of true positive samples among all samples identified as positive, and recall is the proportion of correctly identified positive samples among all actual positive samples. The specific calculation methods are shown in formula (17) and formula (18):
[0111] $\text{Precision} = \dfrac{TP}{TP + FP}$ Formula (17)
[0112] $\text{Recall} = \dfrac{TP}{TP + FN}$ Formula (18)
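The two ratios defined above can be computed directly from the counted cases. The sketch below is a minimal illustration; the function name and the choice to return 0.0 when a denominator is zero are assumptions.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true positive, false positive, false negative counts.

    Precision is TP / (TP + FP): the share of predicted positives that are correct.
    Recall is TP / (TP + FN): the share of real targets that were detected.
    A zero denominator yields 0.0 by convention here (an assumption).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative counts only, not results from the experiments.
precision, recall = precision_recall(tp=8, fp=2, fn=8)
```

True negatives do not appear in either ratio, which is why background regions that are correctly ignored do not affect these metrics.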
[0113] Table 2 compares the improved network model of this disclosure with the baseline network. The baseline network is PV-RCNN, which simply uses point cloud data to supplement feature information for 3D object detection. The improved network adds an attention-based point cloud feature enhancement module for distant small targets to the baseline network. Table 2 shows that this module yields an average improvement of 1.40% over the baseline network in the vehicle and bicycle categories, so the improved network outperforms the baseline.
[0114]
[0115] Table 3 shows the comparative experiments of the improved network model with other relevant classic algorithms on the KITTI dataset, with the detection categories including vehicles and bicycles. These include the SECOND and PointPillars algorithms, which primarily use voxels for detection; the PointRCNN and STD algorithms, which directly use the original point cloud for object detection; the 3DSSD algorithm, which modifies the point cloud sampling method; the Point-GNN algorithm, which encodes the point cloud scene into a pillar graph structure for computation; and the PV-RCNN network, the basis of this disclosure.
[0116]
[0117] Table 3
[0118] As shown in Table 3, the improved network model, which incorporates the distant small-target point cloud feature enhancement module, achieves higher accuracy than the other 3D target detection algorithms in the experiments. When detecting vehicles and bicycles (i.e., distant small targets), the model's detection accuracies are 78.46% and 59.17%, respectively. Compared with the 3DSSD algorithm, this is an improvement of 3.91% and 2.27% for vehicles and bicycles, respectively; compared with the PV-RCNN algorithm, it is an improvement of 1.64% and 1.52%, respectively. These results demonstrate the effectiveness of the improved model.
[0119] The actual detection results of the improved network model are shown in Figure 7. The upper half of each figure is a camera image of the real-world scene, and the lower half is the detection result produced by the algorithm of this disclosure. Green bounding boxes represent detected vehicles, yellow bounding boxes represent detected bicycles, and blue bounding boxes represent detected pedestrians. The side of the bounding box cube marked with intersecting lines indicates the target's direction of travel. In Figure 7(a) and Figure 7(b), the targets are relatively sparse and far from the lidar point cloud data acquisition device, yet the algorithm still detects these distant small targets well. The algorithm therefore has good detection capability in autonomous driving scenarios.
[0120] This disclosure also proposes a computer-readable storage medium storing computer program instructions that, when executed by a processor, implement the aforementioned feature enhancement method for distant small-target point cloud data. The computer-readable storage medium may be a non-volatile computer-readable storage medium.
[0121] This disclosure also proposes an electronic device, including: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to perform the aforementioned feature enhancement method for distant small-target point cloud data. The electronic device may be provided as a terminal, a server, or another type of device.
[0122] Figure 8 is a block diagram illustrating an electronic device 800 according to an exemplary embodiment. For example, the electronic device 800 may be a mobile phone, computer, digital broadcasting terminal, messaging device, game console, tablet device, medical device, fitness equipment, personal digital assistant, or other terminal.
[0123] Referring to Figure 8, the electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power supply component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
[0124] Processing component 802 typically controls the overall operation of electronic device 800, such as operations associated with display, telephone calls, data communication, camera operation, and recording operations. Processing component 802 may include one or more processors 820 to execute instructions to complete all or part of the steps of the methods described above. Furthermore, processing component 802 may include one or more modules to facilitate interaction between processing component 802 and other components. For example, processing component 802 may include a multimedia module to facilitate interaction between multimedia component 808 and processing component 802.
[0125] Memory 804 is configured to store various types of data to support the operation of electronic device 800. Examples of such data include instructions for any application or method operating on electronic device 800, contact data, phonebook data, messages, pictures, videos, etc. Memory 804 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk.
[0126] Power supply component 806 provides power to various components of electronic device 800. Power supply component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to electronic device 800.
[0127] Multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touchscreen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundaries of the touch or swipe action but also the duration and pressure associated with the touch or swipe operation. In some embodiments, multimedia component 808 includes a front-facing camera and / or a rear-facing camera. When the electronic device 800 is in an operating mode, such as a shooting mode or a video mode, the front-facing camera and / or the rear-facing camera may receive external multimedia data. Each front-facing camera and rear-facing camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
[0128] Audio component 810 is configured to output and / or input audio signals. For example, audio component 810 includes a microphone (MIC) configured to receive external audio signals when electronic device 800 is in an operating mode, such as call mode, recording mode, and voice recognition mode. The received audio signals may be further stored in memory 804 or transmitted via communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
[0129] I / O interface 812 provides an interface between processing component 802 and peripheral interface modules, such as keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to, home buttons, volume buttons, power buttons, and lock buttons.
[0130] Sensor assembly 814 includes one or more sensors for providing state assessments of various aspects of electronic device 800. For example, sensor assembly 814 can detect the on / off state of electronic device 800, the relative positioning of components such as the display and keypad of electronic device 800, changes in position of electronic device 800 or a component of electronic device 800, the presence or absence of user contact with electronic device 800, orientation or acceleration / deceleration of electronic device 800, and temperature changes of electronic device 800. Sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. Sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, sensor assembly 814 may also include an accelerometer, gyroscope, magnetometer, pressure sensor, or temperature sensor.
[0131] Communication component 816 is configured to facilitate wired or wireless communication between electronic device 800 and other devices. Electronic device 800 can access wireless networks based on communication standards, such as WiFi, 2G, or 3G, or combinations thereof. In one exemplary embodiment, communication component 816 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, communication component 816 also includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
[0132] In an exemplary embodiment, the electronic device 800 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components to perform the methods described above.
[0133] In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, such as a memory 804 including computer program instructions that can be executed by a processor 820 of an electronic device 800 to perform the above-described method.
[0134] Figure 9 is a block diagram illustrating an electronic device 1900 according to an exemplary embodiment. For example, the electronic device 1900 may be provided as a server. Referring to Figure 9, the electronic device 1900 includes a processing component 1922, which further includes one or more processors, and memory resources represented by memory 1932 for storing instructions, such as application programs, that can be executed by the processing component 1922. The application programs stored in memory 1932 may include one or more modules, each corresponding to a set of instructions. Furthermore, the processing component 1922 is configured to execute instructions to perform the methods described above.
[0135] Electronic device 1900 may also include a power supply component 1926 configured to perform power management of electronic device 1900, a wired or wireless network interface 1950 configured to connect electronic device 1900 to a network, and an input / output (I / O) interface 1958. Electronic device 1900 can operate on an operating system stored in memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or similar.
[0136] In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, such as a memory 1932 including computer program instructions that can be executed by a processing component 1922 of an electronic device 1900 to perform the above-described method.
[0137] This disclosure can be a system, method, and / or computer program product. A computer program product may include a computer-readable storage medium having computer-readable program instructions loaded thereon for causing a processor to implement various aspects of this disclosure.
[0138] A computer-readable storage medium can be a tangible device capable of holding and storing instructions for use by an instruction execution device. A computer-readable storage medium can be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of computer-readable storage media include: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile disc (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove with instructions stored thereon, and any suitable combination of the foregoing. The computer-readable storage media used herein are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., light pulses through fiber optic cables), or electrical signals transmitted through wires.
[0139] The computer-readable program instructions described herein can be downloaded from computer-readable storage media to various computing / processing devices, or downloaded via a network, such as the Internet, local area network, wide area network, and / or wireless network, to an external computer or external storage device. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and / or edge servers. A network adapter card or network interface in each computing / processing device receives the computer-readable program instructions from the network and forwards them to the computer-readable storage media in the respective computing / processing device.
[0140] Computer program instructions used to perform the operations of this disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, status setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk, C++, etc., and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer via any type of network—including a local area network (LAN) or a wide area network (WAN)—or may be connected to an external computer (e.g., via the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), is personalized by utilizing the status information of the computer-readable program instructions to implement various aspects of this disclosure.
[0141] Various aspects of this disclosure are described herein with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this disclosure. It should be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer-readable program instructions.
[0142] These computer-readable program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine such that, when executed by the processor of the computer or other programmable data processing apparatus, they create means for implementing the functions / actions specified in one or more blocks of the flowchart and / or block diagram. These computer-readable program instructions can also be stored in a computer-readable storage medium that causes a computer, programmable data processing apparatus, and / or other device to operate in a particular manner; thus, the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions for implementing aspects of the functions / actions specified in one or more blocks of the flowchart and / or block diagram.
[0143] Computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, thereby causing the instructions executed on the computer, other programmable data processing apparatus, or other device to perform the functions / actions specified in one or more boxes of a flowchart and / or block diagram.
[0144] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of an instruction containing one or more executable instructions for implementing a specified logical function. In some alternative implementations, the functions marked in the blocks may occur in a different order than those shown in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, may be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions.
[0145] The embodiments described above are exemplary and not exhaustive, and are not limited to the disclosed embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles of the embodiments, their practical application, or improvements to technology in the market, or to enable others skilled in the art to understand the embodiments disclosed herein.
Claims
1. A method for feature enhancement of point cloud data of distant small targets based on an attention mechanism, the method comprising:
Raw point cloud data is collected using the LiDAR of an autonomous vehicle;
Voxelization preprocessing is performed on the collected raw point cloud data to obtain preprocessed voxels;
Features are extracted from the preprocessed voxels by a voxel feature extractor and then input into a sparse convolutional network to obtain multi-scale semantic voxel features;
The multi-scale semantic voxel features after sparse convolution are converted into a bird's-eye-view feature map, which is then input into a region proposal network to generate an initial target classification and candidate regions;
The raw point cloud data collected by the LiDAR is divided proportionally into distant small target regions, and a parallel random farthest point sampling algorithm is then applied to obtain the point cloud set of the distant small target regions. The specific path is as follows: for the input raw point cloud set, a specified number of points is to be selected as key points for the next step; a point is randomly selected as the starting point and written into the key point set; the distance from each of the remaining points to the starting point is then calculated, and the farthest point is selected and written into the key point set; next, for each of the remaining points, the distance to every point in the key point set is calculated, the shortest of these distances is taken as that point's distance to the key point set, and the point with the greatest such distance is selected and written into the key point set; if the key point set now contains the specified number of points, the selection is complete; otherwise, the above steps are repeated until the specified number of points has been selected; the key points are thus sampled from the point cloud; based on the point cloud distribution in autonomous driving scenarios, the number of key points is set to 2048, and the key points are used to represent the entire 3D scene, as shown in formula (3), in which h represents the multilayer perceptron feature extraction layer, max( ) represents the max pooling operation, which is a symmetric function, and γ represents the higher-layer feature extraction;
Linear projection and topological feature extraction operations are performed on the point cloud set of the distant small target region to obtain a local feature sequence containing the neighborhood geometric information of each key point in the distant small target point cloud region; the local feature sequence containing the neighborhood spatial information of the distant small target point cloud set is then input into a dual-channel attention module to obtain overall spatial structure information;
The point cloud sequence is recovered from the extracted local feature sequence by global pooling; the recovered point cloud sequence is then input into the dual-channel attention module, and attention cross-computation is used to perform feature enhancement on the distant small target point cloud to obtain enhanced distant small target point cloud features;
The multi-scale semantic voxel features and the enhanced distant small target point cloud features are fused to obtain final fused features, which are used to refine the initial target classification and the candidate region anchor boxes to obtain detection results.
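For illustration only, the farthest point sampling procedure recited in claim 1 can be sketched as below. This is a minimal sequential NumPy version, not the parallel variant named in the claim; the function name `farthest_point_sampling` and the parameters `num_keypoints` and `seed` are hypothetical names chosen for this sketch. The claim fixes the number of key points at 2048; the sketch leaves it as an argument.

```python
import numpy as np

def farthest_point_sampling(points, num_keypoints, seed=None):
    """Return indices of key points chosen by farthest point sampling.

    points: (N, 3) array of xyz coordinates.
    num_keypoints: number of key points to select (2048 in the claim).
    """
    points = np.asarray(points, dtype=np.float64)
    n = points.shape[0]
    if not 0 < num_keypoints <= n:
        raise ValueError("num_keypoints must be in the range 1..N")

    rng = np.random.default_rng(seed)
    selected = np.empty(num_keypoints, dtype=np.int64)
    # Step 1: a randomly chosen point starts the key point set.
    selected[0] = rng.integers(n)

    # min_dist[i] is the shortest distance from point i to the key point set.
    min_dist = np.full(n, np.inf)
    for i in range(1, num_keypoints):
        newest = points[selected[i - 1]]
        dist_to_newest = np.linalg.norm(points - newest, axis=1)
        # Only the newest key point can shorten a point's distance to the set.
        min_dist = np.minimum(min_dist, dist_to_newest)
        # Step 2: the point farthest from the set becomes the next key point.
        selected[i] = int(np.argmax(min_dist))
    return selected
```

Keeping a running `min_dist` array means each iteration costs one pass over the points, instead of recomputing the distance from every point to every key point already chosen.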
2. The method for feature enhancement of distant small target point cloud data based on an attention mechanism according to claim 1, characterized in that the voxelization preprocessing performed on the collected raw point cloud data to obtain preprocessed voxels follows the specific path below: based on the actual distribution of the raw point cloud data, the scene space is divided into three-dimensional voxels; for the point cloud data, the vehicle's forward direction is taken as the X-axis, the left-right direction as the Y-axis, and the direction perpendicular to the XY plane as the Z-axis; the range of the target scene on the three axes is denoted L, in meters; the difference between the maximum and minimum coordinate values of the point cloud data is calculated in each of the X, Y, and Z directions; the length, width, and height of the initial voxel are then determined from the three differences; once the calculation is complete, the initial voxel of the target scene is obtained.
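As a rough illustration of the voxelization step in claim 2, the sketch below computes the per-axis extent of a point cloud (maximum minus minimum) and assigns each point to a voxel index. The claim does not state how the voxel dimensions are derived from the three differences, so the `voxel_size` argument is an assumption of this sketch, as are the function names.

```python
import numpy as np

def scene_extent(points):
    """Return the per-axis minimum and the (max - min) extent of the cloud."""
    xyz = np.asarray(points, dtype=np.float64)[:, :3]
    mins = xyz.min(axis=0)
    return mins, xyz.max(axis=0) - mins

def voxelize(points, voxel_size):
    """Map each point to an integer voxel index on a regular grid.

    voxel_size: (3,) edge lengths in meters along X, Y, Z (assumed input).
    Returns (indices, grid_shape).
    """
    xyz = np.asarray(points, dtype=np.float64)[:, :3]
    voxel_size = np.asarray(voxel_size, dtype=np.float64)
    mins, extent = scene_extent(xyz)
    # Number of voxels per axis, at least one even for a flat scene.
    grid_shape = np.maximum(np.ceil(extent / voxel_size), 1).astype(np.int64)
    indices = np.floor((xyz - mins) / voxel_size).astype(np.int64)
    # Points lying exactly on the upper boundary fall into the last voxel.
    indices = np.minimum(indices, grid_shape - 1)
    return indices, grid_shape
```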
3. The method for feature enhancement of distant small target point cloud data based on an attention mechanism according to claim 2, characterized in that the extraction of features from the preprocessed voxels by the voxel feature extractor and their input into the sparse convolutional network to obtain multi-scale semantic voxel features follows the specific path below: first, the voxel feature extractor directly calculates the average of the point-wise features within each voxel; an element-wise max pooling operation is then used to obtain the locally aggregated feature of each voxel; the obtained feature is expanded, and the expanded feature is concatenated with the point-wise features; the resulting voxel features are input into a three-dimensional sparse convolutional network to obtain the voxel features; the three-dimensional sparse convolution operation is shown in formula (1), in which the left-hand side represents the output of the three-dimensional sparse convolution operation, j represents the output index, m represents the output channel, l represents the input channel, k represents the kernel offset, and the remaining symbols represent the filter elements, the matrix holding the set of sparse data, and the rule matrix, respectively.
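The voxel feature extraction steps of claim 3 (averaging, element-wise max pooling, expansion, concatenation) can be sketched for a single voxel as follows. This is a simplified illustration under stated assumptions: any learned layers that the actual extractor may apply before pooling are omitted, and the function name `voxel_feature_encode` is hypothetical.

```python
import numpy as np

def voxel_feature_encode(point_feats):
    """Encode the points of one voxel.

    point_feats: (T, C) array, one row of C features per point in the voxel.
    Returns (mean_feat, concatenated):
      mean_feat:    (C,)    average of the point-wise features
      concatenated: (T, 2C) point-wise features joined with the pooled feature
    """
    point_feats = np.asarray(point_feats, dtype=np.float64)
    num_points = point_feats.shape[0]
    # Average of the point-wise features within the voxel.
    mean_feat = point_feats.mean(axis=0)
    # Element-wise max pooling gives the locally aggregated voxel feature.
    pooled = point_feats.max(axis=0)
    # Expand the pooled feature so every point receives a copy.
    expanded = np.tile(pooled, (num_points, 1))
    # Concatenate the expanded feature with the point-wise features.
    concatenated = np.concatenate([point_feats, expanded], axis=1)
    return mean_feat, concatenated
```

Because max pooling is symmetric, the pooled feature does not depend on the order of the points inside the voxel.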
4. The method for feature enhancement of distant small target point cloud data based on an attention mechanism according to claim 3, characterized in that the conversion of the sparsely convolved multi-scale semantic voxel features into a bird's-eye-view feature map, its input into the region proposal network, and the generation of the initial target classification and candidate regions follow the specific path below: the multi-scale semantic voxel feature data after sparse convolution is downsampled along the Z-axis, thereby converting the sparse data into a dense feature map, that is, the three-dimensional data is reshaped into image-like two-dimensional data; the RPN detection framework is used to generate the initial target classification and candidate region anchor boxes, with one three-dimensional anchor box for each class, sized to the average three-dimensional dimensions of the targets in that class; the regression targets are calculated as shown in formula (2), in which x, y, z are the coordinates of the center point; w, l, h are the width, length, and height of the anchor box; t represents the encoded value, g represents the ground-truth value, and a represents the anchor box.
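Formula (2) is not reproduced in this text. Purely as an illustration, the sketch below uses the anchor-relative encoding that is common in voxel-based 3D detectors, where center offsets are normalized by the anchor's diagonal or height and sizes are encoded as log ratios. Whether formula (2) takes exactly this form is an assumption, and the function names are hypothetical.

```python
import math

def encode_box(gt, anchor):
    """Encode a ground-truth box g relative to an anchor a.

    Both boxes are (x, y, z, w, l, h) tuples. Returns the encoded values t.
    """
    xg, yg, zg, wg, lg, hg = gt
    xa, ya, za, wa, la, ha = anchor
    da = math.hypot(la, wa)  # diagonal of the anchor's base
    return (
        (xg - xa) / da,
        (yg - ya) / da,
        (zg - za) / ha,
        math.log(wg / wa),
        math.log(lg / la),
        math.log(hg / ha),
    )

def decode_box(t, anchor):
    """Invert encode_box: recover (x, y, z, w, l, h) from encoded values."""
    xt, yt, zt, wt, lt, ht = t
    xa, ya, za, wa, la, ha = anchor
    da = math.hypot(la, wa)
    return (
        xt * da + xa,
        yt * da + ya,
        zt * ha + za,
        math.exp(wt) * wa,
        math.exp(lt) * la,
        math.exp(ht) * ha,
    )
```

Encoding a box against itself yields all zeros, so a well-placed anchor leaves the network only small residuals to regress.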
5. The method for feature enhancement of distant small target point cloud data based on an attention mechanism according to claim 4, characterized in that the linear projection and topological feature extraction operations performed on the point cloud set of the distant small target region to obtain the local feature sequence containing the neighborhood geometric information of each key point, and the input of that local feature sequence into the dual-channel attention module to obtain the overall spatial structure information, follow the specific path below: using the KNN (K-nearest neighbors) algorithm, the topological features of the key points and their neighbors are extracted to obtain the feature sequence of the key points in the distant small target region, which is used to learn the structural information of each key point within its local neighborhood in the point cloud of the distant small target region; the key point sequence is then linearly projected into high-dimensional embedding vectors, which yields a local feature sequence containing the neighborhood geometric information of each key point in the distant small target point cloud region; the local feature sequence containing the neighborhood spatial information of the distant small target point cloud set is input into the dual-channel attention module, which calculates the correlation between the local neighborhood features of the key points in the known area of the distant small target, thereby obtaining the overall spatial structure information together with the channel attention and spatial attention features, as shown in formulas (4) and (5), in which the two operators represent the channel attention computation and the spatial attention computation, respectively, and the two results represent the feature vector calculated by channel attention and the feature vector calculated by spatial attention, respectively.
The output features describing the overall spatial structure correlation of the distant small target are thereby obtained; they include the refined structural features of the distant small target point cloud region and the structural correlation information of the missing point cloud set.
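The KNN neighborhood extraction recited in claim 5 can be illustrated with the brute-force sketch below. The claim does not specify which topological features are computed for each neighbor, so the choice made here, the key point's coordinates joined with each neighbor's relative offset, is an assumption, and `knn_group` is a hypothetical name.

```python
import numpy as np

def knn_group(points, keypoints, k):
    """Gather the k nearest neighbors of each key point.

    points:    (N, 3) point cloud of the distant small target region.
    keypoints: (M, 3) sampled key points.
    Returns (nn_idx, feats):
      nn_idx: (M, k)    indices into points, nearest first
      feats:  (M, k, 6) key point xyz joined with neighbor offset (assumed)
    """
    points = np.asarray(points, dtype=np.float64)
    keypoints = np.asarray(keypoints, dtype=np.float64)
    if not 0 < k <= points.shape[0]:
        raise ValueError("k must be in the range 1..N")
    # Pairwise squared distances between key points and all points: (M, N).
    diff = keypoints[:, None, :] - points[None, :, :]
    sq_dist = np.einsum("mnc,mnc->mn", diff, diff)
    nn_idx = np.argsort(sq_dist, axis=1)[:, :k]
    neighbors = points[nn_idx]                      # (M, k, 3)
    offsets = neighbors - keypoints[:, None, :]     # neighbor relative to center
    centers = np.broadcast_to(keypoints[:, None, :], offsets.shape)
    feats = np.concatenate([centers, offsets], axis=-1)
    return nn_idx, feats
```

The brute-force distance matrix needs memory proportional to M times N, which is acceptable for a sketch; a spatial index would be the usual choice for large clouds.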
6. The method for feature enhancement of distant small target point cloud data based on an attention mechanism according to claim 5, characterized in that the recovery of the point cloud sequence from the extracted local feature sequence through global pooling, the input of the recovered point cloud sequence into the dual-channel attention module, and the use of attention cross-computation to enhance the distant small target point cloud features and obtain the enhanced point cloud feature sequence follow the specific path below: first, the point cloud sequence is recovered through global pooling; dual-channel attention is then calculated on the input features to obtain the neighborhood structure features of the key points of the distant small target; attention cross-computation is then performed between these features and the output features of the feature extraction part to obtain the structural correlation between the coarse neighborhood structure features of the distant small target key points and the refined structural features of the distant small target region; the local structural details of the distant small target key points are then fused to obtain the refined global structural features of the distant small target; the calculation is shown in formula (6), and the attention cross-computation is shown in formula (7), in which w is the projection matrix output by the dual-channel attention module, and H represents the number of attention subspaces, H=2.
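Formulas (6) and (7) are not reproduced in this text. As a hedged illustration of attention cross-computation over H=2 attention subspaces, the sketch below implements standard multi-head scaled dot-product cross-attention in NumPy. Treating the coarse key point features as queries and the refined region features as keys and values is an assumption, as are all function and parameter names; only the output projection matrix and H=2 come from the claim.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, context_feats, w_q, w_k, w_v, w_o, num_heads=2):
    """Multi-head cross-attention.

    query_feats:   (Nq, D) e.g. coarse key point features (assumed role).
    context_feats: (Nc, D) e.g. refined region features (assumed role).
    w_q, w_k, w_v, w_o: (D, D) projection matrices; w_o is the output projection.
    Returns (output, attn) with shapes (Nq, D) and (H, Nq, Nc).
    """
    n_q, dim = query_feats.shape
    n_c = context_feats.shape[0]
    if dim % num_heads != 0:
        raise ValueError("feature dimension must be divisible by num_heads")
    d_head = dim // num_heads

    def split_heads(x, n):
        # (n, D) -> (H, n, d_head)
        return x.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(query_feats @ w_q, n_q)
    k = split_heads(context_feats @ w_k, n_c)
    v = split_heads(context_feats @ w_v, n_c)

    # Each query attends over all context rows within its own subspace.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (H, Nq, Nc)
    attn = softmax(scores, axis=-1)
    heads = attn @ v                                      # (H, Nq, d_head)

    # Concatenate the H subspaces and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(n_q, dim)
    return concat @ w_o, attn
```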
7. The method for feature enhancement of distant small target point cloud data based on an attention mechanism according to claim 6, characterized in that the fusion of the multi-scale semantic voxel features and the enhanced distant small target point cloud features to obtain the final fused features, and the use of those features to refine the initial target classification and the candidate region anchor boxes to obtain detection results, follow the specific path below: the multi-scale semantic voxel feature map and the enhanced distant small target point cloud features are first input into a convolutional layer and then passed to a top-down path; the feature map is upsampled by deconvolution and concatenated with the feature map from the convolutional layer; the feature maps from the top-down path are then transformed to the same size and merged by stacking to obtain the fused features; finally, the detection results are generated using two 1×1 convolutional layers.
8. A computer device, the device comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to implement the method according to any one of claims 1 to 7.
9. A computer-readable storage medium storing computer-executable instructions, characterized in that the computer-executable instructions, when executed by a processor, implement the method according to any one of claims 1 to 7.