Roadbed settlement monitoring method based on unmanned aerial vehicle and laser radar data fusion
By using multimodal data fusion technology with drones equipped with lidar and cameras, the problems of long time consumption and low accuracy in ground settlement monitoring during railway construction have been solved. This has enabled high-precision settlement monitoring and risk warning, which is applicable to complex scenarios and reduces the risk of collapse.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- XIAN UNIV OF TECH
- Filing Date
- 2023-05-24
- Publication Date
- 2026-06-26
AI Technical Summary
Existing technologies for monitoring ground settlement in railway construction rely on manual methods, which are time-consuming, have low accuracy, are difficult to adapt to complex construction scenarios, and pose a risk of collapse.
Using a drone equipped with LiDAR and a camera, point cloud and image features are extracted through point-based networks and SE-ResNet-50 networks. Combined with the AFF-Attention Feature Fusion Module and AF-SSD Target Detection Network, multimodal data fusion is achieved to generate a settlement monitoring map.
It improves the accuracy of ground subsidence monitoring, reduces labor costs, is applicable to complex scenarios, shortens detection time, reduces the risk of collapse, and enhances the level of informatization.
Figure CN116630267B_ABST
Description
Technical Field
[0001] This invention belongs to the field of multimodal data fusion technology and relates to a method for monitoring roadbed settlement based on the fusion of UAV and lidar data.
Background Technology
[0002] In recent years, multimodal fusion technology has developed rapidly, especially in fields such as autonomous driving and healthcare. The fusion of cameras and LiDAR sensors significantly improves the reliability and usability of data, as research on autonomous driving has shown. China's railway construction sector has numerous construction quality control needs. Currently, these needs are met solely with single sensors and manual methods, which are time-consuming, labor-intensive, and inaccurate. Therefore, there is an urgent need for a dynamic perception technology adapted to the complex construction scenarios of railway infrastructure development that provides regular ground settlement monitoring, ensures construction safety, improves the informatization level of the railway construction industry, and increases production efficiency.
Summary of the Invention
[0003] The purpose of this invention is to provide a roadbed settlement monitoring method based on the fusion of UAV and lidar data. This method can effectively reduce the risk of landslides and collapses caused by ground settlement during the construction of large railways, simplify the workflow of ground settlement monitoring, and improve monitoring accuracy.
[0004] The technical solution adopted in this invention is a roadbed settlement monitoring method based on the fusion of UAV and lidar data, which specifically includes the following steps:
[0005] Step 1: Debug and calibrate the parameters of the optical camera and lidar;
[0006] Step 2: Use drones equipped with lidar and optical cameras to collect and store image frame data and point cloud data along the railway construction line;
[0007] Step 3: Perform feature extraction on the point cloud data and image frame data collected in Step 2;
[0008] Step 4: Input the image features extracted in Step 3 into the neck network to achieve feature fusion at different levels;
[0009] Step 5: Fuse the point cloud features extracted in Step 3 with the image features processed in Step 4;
[0010] Step 6: Input the features processed in step 5 into the target detection network to perform target detection.
[0011] The invention is further characterized by:
[0012] The specific process of step 3 is as follows:
[0013] The process of extracting point cloud features from point cloud data using a point-based network is as follows:
[0014] The sampling point is selected using the farthest point sampling method. A spherical neighborhood is defined for each sampling point. Grouped point cloud data is obtained based on the spherical neighborhood. The grouped point cloud data is sent to the feature extraction layer for feature extraction. The local feature dimension of each neighborhood is unified through the max pooling operation. The local features are concatenated into global features, and the global feature vector is output.
[0015] The process of feature extraction from image data using the SE-ResNet-50 network is as follows:
[0016] Image data collected by the drone is input into the SE-ResNet-50 network for image feature extraction. After the image data is convolved by each convolution module of SE-ResNet-50, residual processing and pooling processing are performed to obtain multi-level features of different dimensions.
[0017] The specific process of step 4 is as follows:
[0018] Step 4.1: In the neck network, the image features of each layer are fused from top to bottom.
[0019] Step 4.2: Element-wise addition and fusion of the feature maps of the same level output in Step 4.1 and Step 3.2.
[0020] The specific process of step 5 is as follows:
[0021] Step 5.1: The input to the attention feature fusion module is the features of the point cloud data sampling points and the corresponding image features, with dimensions of N×C1 and N×C2 respectively. First, the two features are input into the fully connected layers FC1 and FC2 respectively, and the dimensions of the two features are adjusted to be unified to N×C3.
[0022] Step 5.2: Add the two dimension-unified features obtained in Step 5.1 element by element to obtain the comprehensive features;
[0023] Step 5.3: Input the comprehensive features obtained in step 5.2 into the third fully connected layer FC3 for matching and output attention scores;
[0024] Step 5.4: Pass the attention scores through the sigmoid function of formula (1) to output an attention factor of dimension N×1. Multiply the attention factor by the image features to obtain the attention-weighted image features:
[0025] $$\sigma(x) = \frac{1}{1 + e^{-x}} \quad (1)$$
[0026] Step 5.5: After the weighted image features are concatenated with the point cloud features, the fused features of the point cloud and the image are obtained as the output.
[0027] The specific process of step 6 is as follows:
[0028] Step 6.1: Generate candidate center points;
[0029] Step 6.2: In the detection head, bounding box regression is performed on each candidate center point to predict the displacement deviation of the candidate center point relative to the true center, the target category, the size of the bounding box, the orientation angle of the bounding box, and the positions of the eight corner points of the bounding box, thus obtaining the predicted bounding box vector $(x_p, y_p, z_p, l_p, w_p, h_p, \theta_p)$,
[0030] where $(x_p, y_p, z_p)$ represents the three-dimensional coordinates of the prediction box in the lidar coordinate system, $(l_p, w_p, h_p)$ represents the length, width, and height of the prediction box, and $\theta_p$ is the rotation angle of the prediction box relative to the z-axis; the loss between the predicted bounding box and the ground truth is calculated, and the loss is optimized to train the network;
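[0031] Step 6.3: Calculate the Euclidean distance $l_1$ between each candidate center point and the true center point, and
[0032] filter the candidate center points with the distance threshold $l_{mask1}$, retaining those whose distance is less than the threshold. The calculation process is shown in equation (2), where $(x, y, z)$ and $(x_{gt}, y_{gt}, z_{gt})$ are the coordinates of the candidate center point and the true center point, respectively.
[0033] $$l_1 = \sqrt{(x - x_{gt})^2 + (y - y_{gt})^2 + (z - z_{gt})^2} \quad (2)$$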
[0034] Calculate the centrality $l_2$ between the center points retained by the first filtering step and the truth label boxes, and filter them with the centrality threshold $l_{mask2}$. A centrality greater than the threshold indicates that the current candidate center point is closer to the center of the truth label box, while a centrality less than the threshold indicates that the candidate center point is further away from the center of the truth label box. The calculation process is shown in equation (3).
[0035]
[0036] In the formula, f, b, l, r, t, and d are the distances between the candidate center point and the six faces of the truth label box (front, back, left, right, top, and bottom), respectively. Candidate center points with a centrality greater than the threshold are associated with the corresponding labels, thereby allowing the calculation of the loss.
[0037] The beneficial effects of this invention are as follows. This invention uses a drone equipped with a camera and lidar to collect image and point cloud data along the construction site. Point cloud features are extracted with a point-based network, and image features are extracted with an SE-ResNet-50 network. The multi-level image features are processed through a neck layer and then input, together with the point cloud features of the corresponding level, into the AFF-attention feature fusion module. The fused features are passed to a detection head, where an anchor-free AF-SSD target detection network predicts bounding boxes for target detection. A settlement amplitude visualization map is generated by comparing multi-period point cloud data of the target elements, and this map is fused with the real-world image to generate the final ground settlement analysis map of the construction scene. The method can reduce labor costs in quality inspection during the construction of large-scale railway infrastructure and is applicable to complex scenarios where manual monitoring is impossible, regardless of environmental or weather conditions. It can improve the accuracy of ground settlement monitoring and accurately locate places where ground settlement is likely to exceed the ground settlement threshold, reducing the risk of collapse at construction sites. Applied to quality inspection and risk early warning during the construction of large-scale railway infrastructure, it can reduce time and labor costs, improve detection accuracy and risk prediction rates, and enhance the informatization level of quality inspection.
Attached Figure Description
[0038] Figure 1 is the overall flowchart of the roadbed settlement monitoring method based on the fusion of UAV and lidar data of this invention;
[0039] Figure 2 is a flowchart of point-based network feature extraction in the roadbed settlement monitoring method based on UAV and lidar data fusion of this invention;
[0040] Figure 3 is a flowchart of extracting image features with the SE-ResNet-50 network and fusing feature maps of different sizes with the neck network in the roadbed settlement monitoring method based on UAV and lidar data fusion of this invention;
[0041] Figure 4 is a flowchart of the AFF-Attention Feature Fusion Module fusing image and point cloud features in the roadbed settlement monitoring method based on UAV and lidar data fusion of this invention;
[0042] Figure 5 shows the correspondence between settlement values and the color scale in the roadbed settlement monitoring method based on the fusion of UAV and lidar data of this invention;
[0043] Figure 6 shows the settlement amplitude visualization fused with the real-world image in the roadbed settlement monitoring method based on the fusion of UAV and lidar data of this invention.
Detailed Implementation
[0044] The present invention will now be described in detail with reference to the accompanying drawings and specific embodiments.
[0045] This invention relates to a roadbed settlement monitoring method based on the fusion of UAV and LiDAR data. First, the camera and LiDAR sensors mounted on the UAV platform are calibrated to eliminate errors caused by distortion and temporal / spatial asynchrony. Then, a flight path is defined. Next, the UAV flies along the construction line, collecting data. Feature extraction is performed on the collected point cloud data and image data separately; the point cloud data uses a point-based network, and the image data uses an SE-ResNet-50 network. Then, an AFF-attention feature fusion module is used for feature fusion. The fused data is input into an anchor-free AF-SSD target detection network to generate bounding boxes. The image data after target recognition is stitched together to obtain a real-world image of the railway construction site. Then, the point cloud data from multiple periods are stitched together, and the difference is calculated to obtain settlement data. The data is then color-coded according to the numerical values to obtain a visualization of the ground settlement, which is then layer-fused with the real-world image.
[0046] Example 1
[0047] This invention relates to a roadbed settlement monitoring method based on the fusion of UAV and lidar data. The process is shown in Figure 1, and the specific steps include:
[0048] Step 1: This invention uses two types of sensors: an optical camera and a lidar. Before use, both sensors need to be debugged and their parameters calibrated. The specific calibration steps are as follows:
[0049] Step 1.1: Due to its imaging principle, the camera cannot avoid distortion. This invention adopts the Zhang Zhengyou calibration method, which uses the camera to take pictures of the black and white checkerboard from different directions to obtain calibration data, thereby calibrating the camera's intrinsic parameters.
[0050] Step 1.2: In order to ensure that the image frame data and point cloud data correspond one-to-one and avoid redundancy, it is necessary to synchronize the time of the camera and the LiDAR, and adopt a data soft synchronization method based on timestamp data neighbor matching.
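The soft synchronization in Step 1.2 can be illustrated with a minimal sketch. It assumes both sensors report timestamps in the same clock and unit; the function name `match_nearest` and the `max_gap` tolerance are hypothetical and not taken from the patent.

```python
from bisect import bisect_left

def match_nearest(lidar_ts, camera_ts, max_gap):
    """Pair each LiDAR timestamp with the nearest camera timestamp.

    camera_ts must be sorted ascending. A pair is dropped when its time gap
    exceeds max_gap or when the camera frame is already paired, which keeps
    the correspondence one-to-one.
    Returns a list of (lidar_index, camera_index) tuples.
    """
    pairs = []
    used = set()
    for i, t in enumerate(lidar_ts):
        pos = bisect_left(camera_ts, t)
        # Candidates: the camera frame just before and just after t.
        candidates = [j for j in (pos - 1, pos) if 0 <= j < len(camera_ts)]
        if not candidates:
            continue
        j = min(candidates, key=lambda k: abs(camera_ts[k] - t))
        if abs(camera_ts[j] - t) <= max_gap and j not in used:
            used.add(j)
            pairs.append((i, j))
    return pairs

# Hypothetical timestamps in seconds.
lidar_ts = [0.00, 0.10, 0.20, 0.30]
camera_ts = [0.01, 0.04, 0.12, 0.33, 0.50]
pairs = match_nearest(lidar_ts, camera_ts, max_gap=0.025)
```

Frames without a close enough partner are discarded rather than interpolated, which is one simple way to avoid redundant or mismatched pairs.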
[0051] Step 1.3: There are certain spatial differences between the camera and the LiDAR. To avoid errors, joint calibration of the two is required. First, the calibration features of the LiDAR coordinate system are extracted; second, the calibration features of the camera coordinate system are extracted; and finally, the optimized parameters are solved to complete the joint calibration.
[0052] Step 1.4: In an open scene, a point cloud image is obtained by scanning the calibration board using a lidar scanner. This point cloud image is then cropped to obtain the region of interest (ROI) where the calibration board is located. Within this region, the M-estimator sample consensus algorithm is used to perform planar fitting on the point cloud within the region, resulting in the planar fitting equation for the calibration board's point cloud, and thus the normal vector of the calibration board plane. The interior points obtained by the fitting algorithm are projected onto the fitted plane. Based on the Y-axis coordinate values, extreme points are identified for each scanning line beam. The left and right boundary points of the calibration board are obtained from these extreme points, and the boundary lines are fitted using straight lines to obtain the equations for the boundary lines. At this point, the two boundary lines for the upper half of the calibration board have been obtained. The same method is used to obtain the two boundary lines for the lower half. The intersection points of the four line equations are used to obtain the four corner points of the calibration board in the radar coordinate system, and the center point of the calibration board is obtained from these corner points.
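The plane fitting and projection in Step 1.4 might look like the following sketch. It is a plain NumPy illustration of M-estimator sample consensus (MSAC) under assumed values; the threshold, iteration count, and synthetic calibration board are hypothetical, and the patent does not state which implementation it uses.

```python
import numpy as np

def msac_plane(points, threshold, iterations=200, seed=0):
    """Fit a plane n.x + d = 0 to an (N, 3) array with MSAC.

    MSAC scores a hypothesis by the sum of min(residual^2, threshold^2),
    so inliers contribute their squared error instead of a flat count.
    Returns (unit normal, d, boolean inlier mask).
    """
    rng = np.random.default_rng(seed)
    best_cost, best = np.inf, None
    for _ in range(iterations):
        sample = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-12:  # degenerate (collinear) sample
            continue
        normal = normal / norm
        d = -normal @ sample[0]
        residual = points @ normal + d
        cost = np.minimum(residual**2, threshold**2).sum()
        if cost < best_cost:
            best_cost, best = cost, (normal, d, np.abs(residual) < threshold)
    return best

def project_to_plane(points, normal, d):
    """Project points orthogonally onto the plane n.x + d = 0."""
    return points - np.outer(points @ normal + d, normal)

# Synthetic board: noisy points near the plane z = 2, plus scattered outliers.
rng = np.random.default_rng(1)
board = np.column_stack([rng.uniform(-0.5, 0.5, (200, 2)), np.full(200, 2.0)])
board[:, 2] += rng.normal(0.0, 0.002, 200)
outliers = rng.uniform(-1.0, 3.0, (20, 3))
cloud = np.vstack([board, outliers])

normal, d, inliers = msac_plane(cloud, threshold=0.01)
projected = project_to_plane(cloud[inliers], normal, d)
```

The projected inliers lie exactly on the fitted plane, which is the starting point for the boundary-line and corner extraction described in the step.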
[0053] Step 1.5: In an open scene, use a camera to capture an image of the calibration board. In the image data matched with the point cloud data, use the corner detection function in the OpenCV library to identify the pixel coordinates of the corners. Then, using the pixel coordinates of the corners and the camera intrinsic parameters obtained in Step 1.1 as input, use the Perspective-n-Point (PnP) algorithm to solve for the rotation and translation parameters of the calibration board relative to the camera coordinate system. Then, use the rotation and translation parameters to find the coordinates of the center point and the plane normal vector of the calibration board in the camera coordinate system, and further obtain the coordinates of the four corner points of the calibration board.
[0054] Step 1.6: Repeat steps 1.4-1.5 to obtain multiple sets of corner point and normal vector features (at least six sets) from the LiDAR and camera, and solve for the 11 unknown parameters in the rotation and translation matrix. To reduce errors, collect multiple sets of data for iterative optimization until the target loss function is less than a set threshold.
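The patent solves the extrinsic parameters by iterative optimization and does not give the solver. As a stand-in illustration only, the sketch below uses the closed-form Kabsch (SVD) alignment, a different and simpler technique that could provide an initial estimate when matched 3D points (such as board corners and centers) are available in both coordinate systems. All names and values are hypothetical.

```python
import numpy as np

def rigid_transform(src, dst):
    """Least-squares R, t such that dst ≈ src @ R.T + t (Kabsch algorithm)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)      # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Force a proper rotation (det = +1) rather than a reflection.
    sign = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, sign]) @ U.T
    t = dst_c - R @ src_c
    return R, t

# Hypothetical ground truth: 30 degree rotation about z plus a small offset.
angle = np.deg2rad(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.10, -0.05, 0.20])

lidar_pts = np.random.default_rng(0).uniform(-1.0, 1.0, (8, 3))
camera_pts = lidar_pts @ R_true.T + t_true   # same points seen from the camera

R_est, t_est = rigid_transform(lidar_pts, camera_pts)
```

With noisy real measurements the estimate would only be approximate, and an iterative refinement such as the one the patent describes would follow.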
[0055] Step 2: Use a drone equipped with lidar and camera to fly along the railway construction line, collect and store image frame data and point cloud data along the railway construction line.
[0056] Step 3: Use a point-based network to extract point cloud features from the point cloud data, and use an SE-ResNet-50 network to extract features from the image data.
[0057] The specific processing steps are as follows:
[0058] Step 3.1: Input the point cloud data collected by the UAV into a point-based network for point cloud feature extraction. The process is shown in Figure 2. The point-based network selects sampling points with the farthest point sampling method. To cover as many foreground targets as possible without ignoring background points, this invention sets up three modules in the point-based network for sampling point selection and feature extraction. All three modules use farthest point sampling and differ only in the distance metric. The sampling layer of the first module uses the Euclidean distance (formula (1)) as the distance metric. The sampling layer of the second module uses a fusion (formula (3)) of the Euclidean distance and the feature distance (formula (2)) as the distance metric. The sampling layer of the third module uses a hybrid distance metric in which the Euclidean distance and the feature distance each account for half.
[0059] $$L_d(A, B) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2} \quad (1);$$
[0060] $$L_f(A, B) = \lVert f_a - f_b \rVert_2 \quad (2);$$
[0061] $$D(A, B) = \lambda L_f(A, B) + \beta L_d(A, B) \quad (3);$$
[0062] Here A and B are two points in the point cloud data, $L_d$ is the Euclidean distance between points A and B, $L_f$ is the feature distance between points A and B, $(x_1, y_1, z_1)$ and $(x_2, y_2, z_2)$ are the coordinates of points A and B, respectively, and $f_a$ and $f_b$ are the feature vectors of points A and B. In this invention, the x, y, and z coordinates of a point are used as its feature vector. In formula (3), $D(A, B)$ represents the fused distance between A and B, and $\lambda$ and $\beta$ represent the weights of the feature distance and Euclidean distance metrics, respectively.
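To make the three distance metrics concrete, here is a minimal NumPy sketch of farthest point sampling driven by the fused distance of formula (3). The function names, the starting index, and the toy point set are assumptions for illustration; setting `lam=0, beta=1` gives the first module's Euclidean metric, and `lam=beta=0.5` corresponds to the half-and-half mix of the third module.

```python
import numpy as np

def fused_distance(xyz, feats, idx, lam, beta):
    """D(A, B) = lam * L_f(A, B) + beta * L_d(A, B) from point idx to all points."""
    l_d = np.linalg.norm(xyz - xyz[idx], axis=1)      # Euclidean distance, formula (1)
    l_f = np.linalg.norm(feats - feats[idx], axis=1)  # feature distance, formula (2)
    return lam * l_f + beta * l_d                     # fused distance, formula (3)

def farthest_point_sampling(xyz, feats, n_samples, lam=0.0, beta=1.0, start=0):
    """Greedy FPS: repeatedly pick the point farthest from the chosen set."""
    chosen = [start]
    min_dist = fused_distance(xyz, feats, start, lam, beta)
    for _ in range(n_samples - 1):
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        # Track each point's distance to its nearest already-chosen point.
        min_dist = np.minimum(min_dist, fused_distance(xyz, feats, nxt, lam, beta))
    return chosen

xyz = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
                [5.0, 5.0, 0.0], [0.1, 0.1, 0.0]])
feats = xyz.copy()  # the text uses the x, y, z coordinates as the feature vector
sampled = farthest_point_sampling(xyz, feats, n_samples=3)
```

Here the isolated point at (5, 5, 0) is picked second, which shows how the sampling spreads over the whole cloud instead of clustering near the start point.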
[0063] All three modules employ the same feature extraction method: a spherical neighborhood is defined for each sampling point; grouped point cloud data is obtained based on the spherical neighborhood; the grouped point cloud data is then fed into the feature extraction layer for feature extraction; max pooling is used to unify the local feature dimensions of each neighborhood; the local features are concatenated into global features, and a global feature vector is output. This feature vector represents the shape and structural change trends of the point cloud image. The final output point cloud feature channel dimensions are (64, 128, 256).
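The grouping and pooling stages can be sketched as follows. This is a simplified illustration that skips the learned feature extraction layer and applies max pooling directly to the input features; the radius, feature values, and function name are hypothetical.

```python
import numpy as np

def group_and_pool(xyz, feats, centers_idx, radius):
    """Ball-query grouping followed by per-neighborhood max pooling.

    Returns an (M, C) array: one pooled feature vector per sampling point.
    """
    pooled = []
    for c in centers_idx:
        in_ball = np.linalg.norm(xyz - xyz[c], axis=1) <= radius
        # Neighborhoods differ in size; max pooling maps each one to length C.
        pooled.append(feats[in_ball].max(axis=0))
    return np.stack(pooled)

xyz = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
                [5.0, 5.0, 0.0], [0.1, 0.1, 0.0]])
feats = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [9.0, 9.0], [2.0, 5.0]])

pooled = group_and_pool(xyz, feats, centers_idx=[0, 3], radius=1.0)
global_feature = pooled.reshape(-1)  # concatenate local features into one vector
```

The first neighborhood contains four points and the second only one, yet both yield a vector of the same length, which is the unification role that max pooling plays in the description above.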
[0064] Step 3.2: Input the image data collected by the UAV into the SE-ResNet-50 network for image feature extraction. The process is shown in Figure 3. The image data is convolved by the SE-ResNet-50 convolutional modules and then passed through residual processing and pooling to obtain three feature maps corresponding to different depths. The feature levels range from shallow to deep: shallower levels may focus more on low-level features such as edges and textures, and as the network deepens, the feature extraction module gradually learns higher-level features such as shapes and object parts. Their channel dimensions are 256, 512, and 1024, respectively.
[0065] Step 4: Input the feature map output by the SE-ResNet-50 image feature extraction network into the neck network (Neck layer, located between the feature extraction network layer and the object detection network layer) to fuse feature maps from different levels and enrich the semantic information in the features.
[0066] The specific steps are as follows:
[0067] Step 4.1, Top-Down: In the neck network, the feature maps containing rich information extracted in Step 3.2 are upsampled and passed down to fuse the higher-level, more abstract (higher dimension, containing richer semantic information) feature maps with the lower-level, higher-resolution (lower dimension, containing less semantic information) feature maps.
[0068] Step 4.2, Lateral Connection, involves element-wise addition and fusion of the feature maps at the same level output from Step 4.1 and Step 3.2. The specific operation is as follows:
[0069] First, the high-dimensional, low-sized feature map generated in step 3.2 is subjected to 1×1 convolution for dimensionality reduction, adjusting the dimension to the same level as the upsampled feature dimension. Then, the dimension-adjusted feature map is added element-wise with the upsampled feature map. Finally, the fused feature map is processed by 3×3 convolution to obtain the final output. In the neck network, the feature dimension output of all layers is fixed at 256.
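A minimal sketch of the top-down pathway with lateral connections is given below. It assumes nearest-neighbor 2x upsampling and random 1×1 weights, and it omits the final 3×3 convolution for brevity; the spatial sizes are hypothetical, while the channel counts (256, 512, 1024 in, 256 out) follow the text.

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution on a (C_in, H, W) map with a (C_out, C_in) weight."""
    return np.einsum("oc,chw->ohw", weight, x)

def upsample2x(x):
    """Nearest-neighbor 2x upsampling of a (C, H, W) map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def top_down_fuse(levels, weights):
    """levels: shallow-to-deep list of (C_i, H_i, W_i) maps, each half the
    spatial size of the previous one. Returns fused maps, shallow to deep."""
    laterals = [conv1x1(x, w) for x, w in zip(levels, weights)]
    fused = [laterals[-1]]  # the deepest level starts the top-down pathway
    for lateral in reversed(laterals[:-1]):
        # Lateral connection: element-wise add with the upsampled deeper map.
        fused.insert(0, lateral + upsample2x(fused[0]))
    return fused

rng = np.random.default_rng(0)
levels = [rng.normal(size=(256, 8, 8)),
          rng.normal(size=(512, 4, 4)),
          rng.normal(size=(1024, 2, 2))]
weights = [rng.normal(size=(256, c)) * 0.01 for c in (256, 512, 1024)]

fused = top_down_fuse(levels, weights)
```

Every output level ends up with 256 channels, matching the fixed neck output dimension stated in the paragraph above.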
[0070] Step 5: The AFF-Attention Feature Fusion module adaptively fuses the features of the sampling points in the point cloud data (Step 3.1) with the corresponding image features (Step 4), as shown in Figure 4. The specific steps are as follows.
[0071] Step 5.1: The input to the AFF-Attention Feature Fusion module is the features of the sampling points and the corresponding image features (steps 3.1 and 3.2), with dimensions N×C1 and N×C2 respectively. (N is the number of sampling points, C1 is the point cloud feature dimension, and C2 is the image feature dimension.) First, the two types of features are input into fully connected layer I and fully connected layer II respectively, and the dimensions of the two types of features are adjusted to be unified to N×C3, where C3 is the fusion feature dimension.
[0072] Step 5.2: Add the two dimension-unified features element by element to obtain the comprehensive features;
[0073] Step 5.3: The integrated features are fed into the fully connected layer III for matching and output of attention scores. An attention factor of dimension N is output through the sigmoid function (as in formula (4)). The attention factor is multiplied by the image features to obtain the attention-weighted image features. The weighted image features are then concatenated with the point cloud features to obtain the fused features of the point cloud and the image as the output.
[0074] $$\sigma(x) = \frac{1}{1 + e^{-x}} \quad (4)$$
[0075] The input $x$ to the sigmoid function is the attention score.
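Steps 5.1 to 5.3 can be sketched in NumPy as follows. The weights are random placeholders, biases are omitted, and the sketch assumes the attention factor multiplies the original N×C2 image features (the text does not say whether the original or the dimension-unified image features are weighted).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_fuse(point_feats, image_feats, w1, w2, w3):
    """point_feats: (N, C1), image_feats: (N, C2).
    w1: (C1, C3), w2: (C2, C3), w3: (C3, 1); biases omitted for brevity."""
    combined = point_feats @ w1 + image_feats @ w2  # FC I, FC II, element-wise add
    score = combined @ w3                           # FC III: (N, 1) attention score
    factor = sigmoid(score)                         # (N, 1) attention factor
    weighted_image = factor * image_feats           # broadcast over the C2 channels
    return np.concatenate([point_feats, weighted_image], axis=1)

rng = np.random.default_rng(0)
n, c1, c2, c3 = 4, 64, 256, 128  # hypothetical sizes
point_feats = rng.normal(size=(n, c1))
image_feats = rng.normal(size=(n, c2))
w1 = rng.normal(size=(c1, c3)) * 0.1
w2 = rng.normal(size=(c2, c3)) * 0.1
w3 = rng.normal(size=(c3, 1)) * 0.1

fused = attention_fuse(point_feats, image_feats, w1, w2, w3)
```

Because the attention factor lies strictly between 0 and 1, each sampling point can down-weight image features that are unreliable for it while its point cloud features pass through unchanged.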
[0076] Step 6: Input the point cloud and image feature fusion data obtained after feature extraction and fusion (steps 3, 4, and 5) into the AF-SSD object detection network for object detection. The labels for object detection are shown in Table 1. The training of the AF-SSD object detection network consists of the following three steps:
[0077] Table 1 Target Detection Labels
[0078] Serial Number | Label Name | Meaning
1 | railway station | Station, station building
2 | track | Track
3 | elevated-track | Elevated roads and bridges
4 | construction site | Construction site
5 | tunnel | Tunnel
[0079] Example 2
[0080] Based on Example 1, the specific process of step 6 is as follows:
[0081] Step 6.1, in the center candidate point generation module of the AF-SSD network:
[0082] Center points are selected from the sampling points obtained in step 3.1. Among the two types of sampling points obtained based on Euclidean distance and feature distance, only the sampling points based on feature distance are used for processing. The sampling points are shifted towards the center position using the Hough voting mechanism to make them as close as possible to the true center of the object, thus generating candidate center points.
[0083] Step 6.2: During the training process of the AF-SSD object detection network, the label information of each object in the training data is encoded into a seven-dimensional vector $(x_g, y_g, z_g, l_g, w_g, h_g, \theta_g)$, where $(x_g, y_g, z_g)$ represents the three-dimensional coordinates of the ground-truth box in the lidar coordinate system, $(l_g, w_g, h_g)$ represents the length, width, and height of the ground-truth box, and $\theta_g$ is the rotation angle of the ground-truth box relative to the z-axis. In the detection head of the AF-SSD object detection network, bounding box regression is performed on candidate center points to predict the displacement deviation of the candidate center point relative to the ground truth center, the object category, the size of the bounding box, the orientation of the bounding box, and the positions of the eight corner points of the bounding box, thus obtaining the predicted bounding box vector $(x_p, y_p, z_p, l_p, w_p, h_p, \theta_p)$, where $(x_p, y_p, z_p)$ represents the three-dimensional coordinates of the prediction box in the lidar coordinate system, $(l_p, w_p, h_p)$ represents the length, width, and height of the prediction box, and $\theta_p$ is the rotation angle of the prediction box relative to the z-axis. The loss between the predicted boxes and the ground truth is then calculated and optimized to train the network.
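The relationship between the seven-dimensional box vector and the eight corner points can be illustrated with a short sketch. It assumes that (x, y, z) is the geometric center of the box, that l, w, and h run along the box's local x, y, and z axes, and that θ is a counter-clockwise rotation about the z-axis; the patent does not spell out these conventions.

```python
import numpy as np

def box_corners(box):
    """Return the (8, 3) corners of a box (x, y, z, l, w, h, theta).

    Assumes a center-based box whose l/w/h lie along its local x/y/z axes
    and whose heading theta is a rotation about the z-axis.
    """
    x, y, z, l, w, h, theta = box
    signs = np.array([[sx, sy, sz]
                      for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    local = signs * np.array([l, w, h]) / 2.0   # corners in the box frame
    c, s = np.cos(theta), np.sin(theta)
    rot_z = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return local @ rot_z.T + np.array([x, y, z])

# Hypothetical box: 4 m long, 2 m wide, 1.5 m high, rotated by 90 degrees.
corners = box_corners((10.0, 5.0, 1.0, 4.0, 2.0, 1.5, np.pi / 2))
extent = corners.max(axis=0) - corners.min(axis=0)
```

After the 90 degree rotation the box length lies along the world y-axis, so the axis-aligned extent of the corners is (2, 4, 1.5).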
[0084] The loss function is calculated as follows:
[0085]
[0086] Formula (5) is used to calculate the position offset regression loss, where $G_{pi}$ is the true center point coordinates of the i-th target, $P_{pi}$ is the predicted center point coordinates of the i-th target, and $N_p$ is the number of predicted center points that belong to foreground targets.
[0087]
[0088] Formula (6) is used to calculate the center point classification loss, where $G_{ci}$ is the category label of the i-th candidate center point and $N_c$ is the number of candidate center points.
[0089]
[0090] Formula (7) is used for orientation angle regression and classification loss calculation, where $P_{\theta i}$ is the predicted angle value, $G_{\theta i}$ is the true angle value, $P_{\theta b\_i}$ is the predicted orientation, and $G_{\theta b\_i}$ is the orientation of the truth box.
[0091]
[0092] Formula (8) is used to calculate the regression loss of the corner locations, where $P_{cp\_j}$ is the predicted coordinates of the j-th corner point of the i-th bounding box and $G_{cp\_j}$ is the true coordinates of the j-th corner point of the i-th bounding box.
[0093] $$L = \lambda_1 L_{class} + \lambda_2 \left[ L_{position} + L_{size} + L_{center} + L_{angle} + L_{corner} \right] + \lambda_3 L_{shift} \quad (9);$$
[0094] Formula (9) is used to calculate the total loss of the AF-SSD target detection network, where $L_{shift}$ is the center point offset loss used for the center candidate point generation module, and $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the weighting coefficients of each part of the loss.
[0095] Step 6.3: During training, after the network model predicts the position of the center point, it needs to associate the predicted center point with the ground truth center point to calculate the loss. This invention uses the following two steps to associate the center point with the ground truth center point:
[0096] Step 6.3.1: The center points are filtered with the preset distance threshold $l_{mask1}$, as shown in equation (10). When $l_1$ is less than the distance threshold $l_{mask1}$, the center point proceeds to the calculation of formula (11) in the next step.
[0097] $$l_1 = \sqrt{(x - x_{gt})^2 + (y - y_{gt})^2 + (z - z_{gt})^2} \quad (10)$$
[0098] In the formula, $l_1$ is the Euclidean distance between the candidate center point and the true center point, $(x, y, z)$ are the coordinates of the candidate center point, and $(x_{gt}, y_{gt}, z_{gt})$ are the coordinates of the true center point.
[0099] Step 6.3.2: Calculate the centrality $l_2$ between the center points retained in Step 6.3.1 and the truth label boxes, and filter them with the centrality threshold $l_{mask2}$. When the centrality $l_2$ is greater than the centrality threshold $l_{mask2}$, the current candidate center point is close to the center of the truth label box; otherwise, it is far from the center. The calculation process is shown in equation (11).
[0100]
[0101] In the formula, f, b, l, r, t, and d represent the distances between the candidate center point and the six faces (front, back, left, right, top, and bottom) of the ground truth label box, respectively. Candidate center points with a centrality greater than a threshold are associated with their corresponding labels.
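The two-step association can be sketched as below. Equation (11) is not reproduced in this text, so the sketch assumes the centerness form commonly used in anchor-free 3D detectors such as 3DSSD, $l_2 = \sqrt[3]{\frac{\min(f,b)}{\max(f,b)} \cdot \frac{\min(l,r)}{\max(l,r)} \cdot \frac{\min(t,d)}{\max(t,d)}}$; the patent's actual formula may differ. For brevity the truth box is axis-aligned, and all thresholds and coordinates are hypothetical.

```python
import numpy as np

def centerness(point, box_min, box_max):
    """Centerness of a point inside an axis-aligned box (assumed 3DSSD-style form)."""
    lo = point - box_min   # distances to the three lower faces
    hi = box_max - point   # distances to the three opposite faces
    if np.any(lo <= 0) or np.any(hi <= 0):
        return 0.0         # on or outside the box
    ratios = np.minimum(lo, hi) / np.maximum(lo, hi)
    return float(np.cbrt(ratios.prod()))

def assign(candidates, gt_center, box_min, box_max, l_mask1, l_mask2):
    """Keep candidates that pass the distance test, then the centerness test."""
    kept = []
    for i, p in enumerate(candidates):
        l1 = np.linalg.norm(p - gt_center)  # Euclidean distance to the true center
        if l1 < l_mask1 and centerness(p, box_min, box_max) > l_mask2:
            kept.append(i)
    return kept

box_min, box_max = np.array([0.0, 0.0, 0.0]), np.array([4.0, 2.0, 2.0])
gt_center = (box_min + box_max) / 2.0
candidates = np.array([[2.0, 1.0, 1.0],    # exactly at the center
                       [3.0, 1.0, 1.0],    # inside, moderately off-center
                       [3.9, 1.9, 1.9],    # inside, but near a corner
                       [6.0, 1.0, 1.0]])   # outside the box and too far away

kept = assign(candidates, gt_center, box_min, box_max, l_mask1=2.5, l_mask2=0.5)
```

The third candidate passes the distance test but is rejected by the centerness test, which illustrates why the second filtering step is applied after the first.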
[0102] Step 6.4: In practical applications, the point cloud and image feature fusion data, after feature extraction and fusion (steps 3, 4, and 5), are input into the trained AF-SSD object detection network for object detection. Finally, the object detection results are post-processed and decoded, and mapped back to the original image space. This yields visualized image data with object detection labels.
[0103] Example 3
[0104] Based on Example 2, the following steps will continue to be performed:
[0105] Step 7: Use the SIFT method to stitch together the image data with target detection labels obtained after target detection in Step 6, register the point cloud data, and generate a complete real-scene image of the construction site, along with the point cloud image.
[0106] Step 8: Collect point cloud data at fixed intervals to obtain point cloud images of the construction site at different times. Subtract the obtained point cloud images to obtain a point cloud image representing the settlement amplitude.
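One way the subtraction in Step 8 might be carried out is to rasterize each registered point cloud onto a common XY grid and difference the mean heights per cell. This is a sketch under that assumption; the patent does not state how corresponding points between periods are matched, and the grid size and sample points are hypothetical.

```python
import numpy as np

def grid_height(points, origin, cell, shape):
    """Mean z per XY grid cell; cells with no points are NaN."""
    ij = np.floor((points[:, :2] - origin) / cell).astype(int)
    ok = (ij >= 0).all(axis=1) & (ij < np.array(shape)).all(axis=1)
    flat = ij[ok, 0] * shape[1] + ij[ok, 1]
    n_cells = shape[0] * shape[1]
    total = np.bincount(flat, weights=points[ok, 2], minlength=n_cells)
    count = np.bincount(flat, minlength=n_cells)
    with np.errstate(invalid="ignore", divide="ignore"):
        return (total / count).reshape(shape)

def settlement(epoch_a, epoch_b, origin, cell, shape):
    """Height change (later minus earlier); negative values mean the surface sank."""
    return (grid_height(epoch_b, origin, cell, shape)
            - grid_height(epoch_a, origin, cell, shape))

origin = np.array([0.0, 0.0])
epoch_a = np.array([[0.5, 0.5, 10.00], [1.5, 0.5, 10.00]])  # earlier survey
epoch_b = np.array([[0.4, 0.6, 9.98], [1.6, 0.4, 10.00]])   # later survey

diff = settlement(epoch_a, epoch_b, origin, cell=1.0, shape=(2, 1))
```

In this toy case the first cell sinks by 2 cm and the second is unchanged. Both clouds must already be registered in the same coordinate frame for the difference to be meaningful.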
[0107] Step 9: Use the Open3D method to colorize the point cloud image representing the settlement magnitude based on the ground settlement values. The settlement value-color mapping is shown in Figure 5.
[0108] Step 10: Use the layer blending algorithm in OpenCV (PS) to fuse the settlement visualization image obtained in Step 9 with the real-world image obtained in Step 7. The final result, shown in Figure 6, is an image that blends the settlement amplitude visualization with the real-world image and is used for further analysis.
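Steps 9 and 10 can be illustrated with a NumPy-only sketch. The green-to-red ramp is a hypothetical stand-in for the mapping in Figure 5, which is not reproduced here, and the weighted sum is the same arithmetic that a weighted layer blend such as OpenCV's `cv2.addWeighted` performs; the blend mode actually used in the patent is not specified.

```python
import numpy as np

def colorize(settle_mm, max_mm):
    """Map settlement magnitude to RGB: green at 0, red at max_mm or more."""
    ratio = np.clip(np.abs(settle_mm) / max_mm, 0.0, 1.0)
    rgb = np.stack([ratio, 1.0 - ratio, np.zeros_like(ratio)], axis=-1)
    return (rgb * 255).astype(np.uint8)

def blend(scene, overlay, alpha):
    """Weighted layer blend: alpha * overlay + (1 - alpha) * scene."""
    mixed = alpha * overlay.astype(float) + (1.0 - alpha) * scene.astype(float)
    return np.clip(np.rint(mixed), 0, 255).astype(np.uint8)

settle_mm = np.array([[0.0, 10.0], [20.0, 40.0]])  # hypothetical settlement grid
overlay = colorize(settle_mm, max_mm=20.0)
scene = np.full((2, 2, 3), 100, dtype=np.uint8)    # stand-in for the real-scene image
fused = blend(scene, overlay, alpha=0.4)
```

Values beyond `max_mm` saturate at red, so cells that exceed the chosen scale remain clearly visible in the blended image.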
[0109] Step 11: Calculate the ground settlement threshold based on the environment of the construction site, and comprehensively analyze the ground settlement situation of the construction site using the fused images from Step 10.
[0110] According to Appendix B of the "TB10621-2014 High-Speed Railway Design Code" regarding the calculation of settlement of soft soil foundations, the thickness of the compressible layer is determined by the condition that the additional stress equals 0.1 times the self-weight pressure. The total settlement $S$ of the foundation can generally be calculated as the sum of the instantaneous settlement $S_d$ and the primary consolidation settlement $S_c$. For peat soils, organic-rich clay, or highly plastic clay strata, the secondary consolidation settlement $S_s$ may also be included in the calculation, depending on the specific circumstances.
Claims
1. A method for monitoring roadbed settlement based on the fusion of UAV and lidar data, characterized in that it comprises the following steps: Step 1: Debug and calibrate the parameters of the optical camera and the lidar; Step 2: Use a UAV equipped with the lidar and the optical camera to collect and store image frame data and point cloud data along the railway construction line; Step 3: Perform feature extraction on the point cloud data and image frame data collected in Step 2; the specific process of Step 3 is as follows: Step 3.1: A point-based network extracts point cloud features from the point cloud data as follows: sampling points are selected by farthest point sampling, a spherical neighborhood is defined for each sampling point, the point cloud data are grouped according to the spherical neighborhoods, the grouped point cloud data are sent to the feature extraction layer for feature extraction, the local feature dimension of each neighborhood is unified by max pooling, the local features are concatenated into a global feature, and the global feature vector is output; Step 3.2: The SE-ResNet-50 network extracts features from the image data as follows: the image data collected by the UAV are input into the SE-ResNet-50 network; after the image data are convolved by each convolution module of SE-ResNet-50, residual processing and pooling yield multi-level features of different dimensions; Step 4: Input the image features extracted in Step 3 into the neck network to fuse features of different levels; the specific process of Step 4 is as follows: Step 4.1: In the neck network, fuse the image features of each layer from top to bottom; Step 4.2: Fuse, by element-wise addition, the feature maps of the same level output from Step 4.1 and Step 3.
Step 5: Fuse the point cloud features extracted in Step 3 with the image features processed in Step 4; the specific process of Step 5 is as follows: Step 5.1: The inputs to the attention feature fusion module are the features of the point cloud sampling points and the corresponding image features, with dimensions $N \times C_P$ and $N \times C_I$ respectively. The two types of features are first input into the first and second fully connected layers, respectively, which adjust and unify their dimensions to $N \times C_F$, where $N$ is the number of sampling points, $C_P$ is the point cloud feature dimension, $C_I$ is the image feature dimension, and $C_F$ is the fused feature dimension; Step 5.2: Add the two dimension-unified features obtained in Step 5.1 element by element to obtain the combined feature; Step 5.3: Feed the combined feature obtained in Step 5.2 into the third fully connected layer, which outputs the attention scores; Step 5.4: According to formula (1), a sigmoid function outputs an attention factor of dimension $N \times 1$, and the attention factor is multiplied by the image features to obtain the attention-weighted image features; Step 5.5: Concatenate the weighted image features with the point cloud features to obtain the fused point cloud and image features as the output. Step 6: Input the features processed in Step 5 into the target detection network for target detection; the specific process of Step 6 is as follows: Step 6.1: Generate candidate center points; Step 6.2: In the detection head, perform bounding box regression on each candidate center point to predict the displacement of the candidate center point relative to the true center, the target category, the size of the bounding box, the orientation angle of the bounding box, and the positions of the eight corner points of the bounding box, thus obtaining the predicted bounding box vector $(x, y, z, l, w, h, \theta)$, where $x$, $y$, $z$ are the three-dimensional coordinates of the predicted box in the lidar coordinate system, $l$, $w$, $h$ are the length, width, and height of the predicted box, and $\theta$ is the orientation of the predicted box, expressed as a rotation about the z-axis; the loss between the predicted bounding box and the ground truth is calculated, and the network is trained by optimizing this loss; Step 6.3: Calculate the Euclidean distance $D$ between each candidate center point $(x_c, y_c, z_c)$ and the true center point $(x_{gt}, y_{gt}, z_{gt})$, and filter with the distance threshold $T_D$; the center points that pass the distance-threshold filter proceed to the next step; the calculation is shown in equation (2): $D = \sqrt{(x_c - x_{gt})^2 + (y_c - y_{gt})^2 + (z_c - z_{gt})^2}$ (2). Calculate the centrality $c$ between each center point that passed the first filtering step and the ground-truth label box, and filter with the centrality threshold $T_c$; a larger centrality indicates that the candidate center point is closer to the center of the ground-truth label box, and a smaller centrality indicates that it is farther from that center; the calculation is shown in equation (3): $c = \sqrt[3]{\frac{\min(d_f, d_b)}{\max(d_f, d_b)} \cdot \frac{\min(d_l, d_r)}{\max(d_l, d_r)} \cdot \frac{\min(d_t, d_d)}{\max(d_t, d_d)}}$ (3), where $d_f$, $d_b$, $d_l$, $d_r$, $d_t$, $d_d$ are the distances between the candidate center point and the front, back, left, right, top, and bottom faces of the ground-truth label box, respectively; the candidate center points whose centrality is greater than the centrality threshold $T_c$ are associated with their corresponding labels, so that the loss can be calculated.
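The two-stage screening of Step 6.3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the ground-truth box is treated as axis-aligned (its orientation angle is ignored), a candidate is assumed to pass the first stage when its distance is below the distance threshold, and the centrality is assumed to be the cube root of the product of the three min/max ratios of opposite face distances. All function and parameter names are hypothetical.

```python
import math


def euclidean_distance(p, q):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def face_distances(point, center, size):
    """Distances from a point to the six faces of an axis-aligned box.

    Returned in the order front, back, left, right, top, bottom, where the
    faces are taken along +x, -x, +y, -y, +z, -z (an assumed convention).
    """
    (x, y, z), (cx, cy, cz), (length, width, height) = point, center, size
    return (
        cx + length / 2 - x, x - (cx - length / 2),
        cy + width / 2 - y, y - (cy - width / 2),
        cz + height / 2 - z, z - (cz - height / 2),
    )


def centrality(f, b, l, r, t, d):
    """Cube root of the product of min/max ratios of opposite face distances."""
    product = (min(f, b) / max(f, b)) * (min(l, r) / max(l, r)) * (min(t, d) / max(t, d))
    return product ** (1.0 / 3.0)


def keep_candidate(point, center, size, dist_thr, ctr_thr):
    """Apply the distance filter first, then the centrality filter."""
    if euclidean_distance(point, center) >= dist_thr:
        return False  # fails the first stage (assumed direction of the test)
    dists = face_distances(point, center, size)
    if min(dists) <= 0:
        return False  # on or outside the box, so centrality is not defined
    return centrality(*dists) > ctr_thr
```

A candidate at the exact center of the box has centrality 1, and the value falls toward 0 as the candidate approaches any face, which matches the interpretation that a larger centrality means a candidate closer to the center of the ground-truth label box.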