A multimodal fusion method for inspecting high-voltage transmission lines in complex restricted scenes

A multimodal fusion-based high-voltage transmission line inspection method that uses depth completion, voxel screening, and noise-resistant processing addresses the low quality of, and the noise in, point clouds generated from images, enabling efficient and accurate high-voltage transmission line inspection.

CN118097350B · Active · Publication Date: 2026-06-19 · HUNAN UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: HUNAN UNIV
Filing Date: 2024-01-25
Publication Date: 2026-06-19


Abstract

This application relates to a multimodal fusion-based method for inspecting high-voltage transmission lines in complex and constrained environments. The method includes: acquiring two-dimensional images and three-dimensional point cloud data of the high-voltage transmission line; constructing a high-voltage transmission line detection dataset based on the two-dimensional images and the three-dimensional point cloud data, and dividing the dataset proportionally into a training set and a validation set; constructing a multimodal fusion perception high-voltage transmission line inspection model, which includes a depth completion module, a voxel screening module, a noise-resistant convolution module, and a multi-task detection head prediction module; training the high-voltage transmission line inspection model based on the training set; and inputting the validation set into the trained high-voltage transmission line inspection model and, after filtering by a post-processing module, outputting the target detection results. This method not only enables real-time evaluation in complex and constrained environments but also eliminates the cumbersome operation of manual visual retrieval and reduces the occurrence of misjudgments.

Description

Technical Field

[0001] This application relates to the field of high-voltage transmission line inspection technology, and in particular to a multimodal fusion method for inspecting high-voltage transmission lines in complex and constrained scenarios.

Background Technology

[0002] In recent years, the rise of deep learning has drawn increasing attention to multimodal data fusion for 3D object detection. Applying multimodal fusion-based 3D object detection to high-voltage transmission line inspection is more challenging than detection from 2D images alone or from laser point clouds alone. Current implementations face several difficulties: 1) Camera images and LiDAR point clouds use different representations, which makes fusion difficult. When the LiDAR point cloud is enhanced at the early input stage, only early fusion of the original points is considered, and the generated point cloud is of low quality, so image features are not fully utilized; 2) Depth completion is performed to increase the number of pixels that have corresponding points in the point cloud. The virtual points generated from the image are usually very dense, resulting in a heavy computational burden and serious efficiency problems; 3) Inaccurate depth completion introduces a large amount of noise into the virtual points. This noise is difficult to distinguish from the background in 3D space, which significantly reduces the localization accuracy of 3D detection. Furthermore, the noise points follow a non-Gaussian distribution, so traditional denoising algorithms cannot filter them effectively.

Summary of the Invention

[0003] In view of this, it is necessary to address the low quality of point clouds generated from images, the computational burden caused by excessively dense virtual points, and the large amount of noise introduced into the virtual points during depth completion. To this end, a multimodal fusion method for inspecting high-voltage transmission lines in complex and constrained scenarios is provided.

[0004] This invention provides a method for inspecting high-voltage transmission lines in complex and constrained scenarios using multimodal fusion. The method includes:

[0005] S1: Collect two-dimensional images and three-dimensional point cloud data of high-voltage transmission lines;

[0006] S2: Based on the two-dimensional image and the three-dimensional point cloud data, construct a high-voltage transmission line detection dataset, and divide the dataset into a training set and a validation set according to the proportion;

[0007] S3: Construct a high-voltage transmission line inspection model with multimodal fusion perception; the high-voltage transmission line inspection model includes a depth completion module, a voxel screening module, a noise-resistant convolution module, and a multi-task detection head prediction module;

[0008] The depth completion module is used to perform depth completion on the three-dimensional point cloud data based on the two-dimensional image to obtain fused data;

[0009] The voxel screening module is used to perform point cloud voxelization on the fused data;

[0010] The noise-resistant convolution module is used to perform noise-resistant processing on the voxelized point cloud data to obtain a fused 3D feature map;

[0011] The multi-task detection head prediction module is used to classify the fused 3D feature map to obtain the category;

[0012] S4: Train the high-voltage transmission line inspection model based on the training set;

[0013] S5: Input the validation set into the trained high-voltage transmission line inspection model, and after filtering by the post-processing module, output the target detection results.

[0014] Preferably, in S2, constructing the high-voltage transmission line detection dataset based on the two-dimensional image and the three-dimensional point cloud data includes:

[0015] 60% of the collected two-dimensional images and three-dimensional point cloud data are labeled to construct the training set;

[0016] The remaining two-dimensional images and three-dimensional point cloud data are used as the validation set.

[0017] Preferably, the workflow of the depth completion module includes:

[0018] Step 1: Segment the two-dimensional image into multiple instances, and project the laser points onto the two-dimensional image;

[0019] Step 2: Sample the pixels within each instance and perform nearest neighbor association with the pixels projected by the laser point. Calculate the dynamic weight factor for each associated point using the following formula:

[0020] ;

[0021] where the symbols in the formula denote, in order: the dynamic weight factor of the v-th associated point; Sigmoid(·), the sigmoid activation function; the two-dimensional image features in three-dimensional space; the depth information of the v-th associated point; Linear(·), a linear transformation of its input; $C_d$, the extraction of two-dimensional image features in three-dimensional space; $m_v$, the abscissa of the two-dimensional image feature in three-dimensional space; and $n_v$, the ordinate of the two-dimensional image feature in three-dimensional space;

[0022] Step 3: Adjust the semantic features according to the dynamic weighting factor. The semantic features are used to modify virtual points in three-dimensional space. The calculation formula for adjusting the semantic features is as follows:

[0023] ;

[0024] where the defined term denotes the semantic features of the v-th associated point;

[0025] Step 4: Take the LiDAR as the origin of the coordinate system, with geographic east of the LiDAR as the x-axis, geographic south of the LiDAR as the y-axis, and the upward direction of the LiDAR as the z-axis; fuse and align the three-dimensional point cloud data and the three-dimensional virtual points in this coordinate system to obtain the fused data.

[0026] Preferably, the workflow of the voxel screening module includes:

[0027] Step 1: Perform point cloud voxelization on the fused data;

[0028] Step 2: During the point cloud voxelization process, sample the voxels of the virtual points in the three-dimensional space to obtain the voxels of the sampled virtual points in the three-dimensional space;

[0029] Step 3: Discard voxels within the range of ±5% of the total number of voxels to obtain the screened voxels;

[0030] Step 4: Integrate the selected voxels and the voxelized 3D point cloud data to obtain the fused voxelized point cloud data.

[0031] Preferably, the workflow of the noise-resistant convolution module includes:

[0032] Step 1: For three-dimensional space, based on the 3D index of the point cloud in the three-dimensional space after the point cloud voxelization, calculate the geometric features of non-empty voxels in the 3×3×3 neighborhood, encode all geometric features in the three-dimensional space, and obtain the geometric features of non-empty voxels in the 3×3 neighborhood.

[0033] Step 2: For two-dimensional space, transform the grid points of the voxelized fused data back to the original coordinate system, and project the grid points onto the two-dimensional image plane based on the LiDAR-camera calibration parameters; the calculation formula is:

[0034] ;

[0035] where the symbols denote, in order: the two-dimensional plane index vector; $p(\cdot)$, the projection; the inverse transformation of the data augmentation; the inverse transformation of the grid points to the original three-dimensional coordinate system; and $H$, the three-dimensional index vector;

[0036] Step 3: Based on the 2D index of the point cloud in the two-dimensional space of the fused data after voxelization of the point cloud, calculate the noise perception features of non-empty voxels in the 3×3 neighborhood, encode all the noise perception features in the two-dimensional space, and obtain the noise perception features of non-empty voxels in the 3×3 neighborhood.

[0037] Step 4: Concatenate the non-empty voxel geometric features in the 3×3 neighborhood with the non-empty voxel noise perception features in the 3×3 neighborhood to obtain a noise feature vector;

[0038] Step 5: Filter or remove the part corresponding to the noise feature vector in the fused data after voxelization of the point cloud to obtain the fused 3D feature map.

[0039] Preferably, the workflow of the multi-task detection head prediction module includes:

[0040] Step 1: The multi-task detection head prediction module is equipped with three convolutional blocks; the first layer of each convolutional block downsamples the fused 3D feature map through a convolution with a stride of 2, and extracts features from the downsampled fused 3D feature map through multiple convolutional sequences with a stride of 1 in each convolutional block to obtain the feature map;

[0041] Step 2: Upsample the feature map extracted from each convolutional block and stitch them together to obtain a multi-scale high-resolution feature map;

[0042] Step 3: Use the detection head to classify the multi-scale high-resolution feature map to obtain the category.

[0043] Preferably, training the high-voltage transmission line inspection model based on the training set includes:

[0044] Step 1: Predicted bounding boxes and ground-truth bounding boxes are set for the training set; the training set is input into the high-voltage transmission line inspection model, and the loss computed from the output predicted bounding boxes by the loss function is backpropagated through the model to update its parameters;

[0045] The loss function is expressed as:

[0046] ;

[0047] where $L$ denotes the loss function; $N_s$ denotes the number of detection boxes sampled during the training phase; $\beta_1$, $\beta_2$, and $\beta_3$ denote the first, second, and third loss coefficients; $L_{cls}$ denotes the classification loss function for the predicted bounding boxes; $L_{reg}$ denotes the bounding box regression loss function; and $L_{dir}$ denotes the direction classification loss function;

[0048] The expression for the bounding box classification loss function is:

[0049] ;

[0050] where $N$ denotes the number of categories; $y_a$ denotes the ground-truth label of the a-th category; and $p_a$ denotes the predicted probability of the a-th category;

[0051] The expression for the bounding box regression loss function is:

[0052] ;

[0053] where $N_{pos}$ denotes the number of anchor boxes; SmoothL1(·) denotes the SmoothL1 loss function; $dx_b$, $dy_b$, $dw_b$, and $dh_b$ denote the horizontal displacement, vertical displacement, width scaling, and height scaling correction values of the bounding box coordinates relative to the b-th anchor box; and $t_x$, $t_y$, $t_w$, and $t_h$ denote the horizontal displacement, vertical displacement, width scaling, and height scaling corrections of the ground-truth bounding box coordinates;

[0054] The expression for the orientation classification loss function is:

[0055] ;

[0056] where $L(l, l_{gt})$ denotes the angle regression loss function; $l$ denotes the predicted angle value; and $l_{gt}$ denotes the annotated ground-truth angle value.

[0057] Step 2: Repeat Step 1 until convergence or the maximum number of iterations is reached to obtain the trained high-voltage transmission line inspection model.

[0058] Preferably, the workflow of the post-processing module is as follows:

[0059] Step 1: For each of the output categories, sort the detection boxes from high to low according to the confidence level corresponding to the category of power equipment;

[0060] Step 2: Select the detection box with the highest confidence and add it to the final detection results list;

[0061] Step 3: Traverse the remaining detection boxes and discard those whose intersection over union (IoU) with the selected detection box is greater than the threshold;

[0062] Step 4: Repeat steps 1-3 until all detection boxes have been traversed to obtain the target detection result.

[0063] Preferably, voxels of virtual points in three-dimensional space are sampled, and the sampling formula is as follows:

[0064] ;

[0065] The voxels within the defined range are discarded according to the following formula:

[0066] ;

[0067] where $g$ denotes the total number of voxels of the sampled three-dimensional virtual points; the three count terms denote the numbers of voxels within 0-30 meters, within 30-60 meters, and beyond 60 meters, respectively; the symbol or denotes the discarded voxels; $i$ denotes the proportion of voxels discarded; and the remaining two terms denote the numbers of voxels discarded within 0-30 meters and within 30-60 meters, respectively.

[0068] Preferably, the concatenation is calculated as follows:

[0069] ;

[0070] ;

[0071] ;

[0072] ;

[0073] ;

[0074] ;

[0075] where the symbols denote, in order: the noise feature vector within the neighborhood of [i, b]; T, the transpose; the geometric features of a non-empty voxel within the 3×3 neighborhood; the noise perception features of a non-empty voxel within the 3×3 neighborhood; the encoding of all geometric features in three-dimensional space; $O(i+y, j+u, t)$, the value of the voxelized fused data at position $(i+y, j+u)$ in channel $t$; $K_h$, the kernel height; $K_w$, the kernel width; $Q$, the number of channels; $F(y, u, t, l)$, the weight of the $l$-th filter $F$ at position $(y, u)$ for channel $t$; $v$, the number of valid input features; $O_i$, the voxel features generated from the two-dimensional plane index vector; $\mathbb{R}$, the set of real numbers; the max pooling operation; and $S_{合}$, the feature obtained by fusing multiple adjacent voxel features generated from the two-dimensional plane index vector.

[0076] Beneficial effects: This method not only enables real-time evaluation in complex and constrained environments, but also eliminates the tedious operation of manual visual retrieval and reduces the occurrence of misjudgments.

Description of the Drawings

[0077] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0078] Figure 1 is a flowchart of the multimodal fusion method for inspecting high-voltage transmission lines in complex and constrained scenarios described in this application.

Detailed Implementation

[0079] To make the above-mentioned objectives, features, and advantages of this application more apparent and understandable, the specific embodiments of this application are described in detail below with reference to the accompanying drawings. Many specific details are set forth in the following description to provide a thorough understanding of this application. However, this application can be implemented in many other ways different from those described herein, and those skilled in the art can make similar modifications without departing from the spirit of this application. Therefore, this application is not limited to the specific embodiments disclosed below.

[0080] As shown in Figure 1, this embodiment provides a multimodal fusion method for inspecting high-voltage transmission lines in complex and constrained scenarios. The method includes:

[0081] S1: Collect two-dimensional images and three-dimensional point cloud data of high-voltage transmission lines.

[0082] S2: Based on the two-dimensional images and the three-dimensional point cloud data, construct a high-voltage transmission line detection dataset, and divide the dataset proportionally into a training set and a validation set.

[0083] Specifically, the process of constructing a high-voltage transmission line detection dataset includes:

[0084] 60% of the collected two-dimensional images and three-dimensional point cloud data are labeled to construct the training set;

[0085] The remaining two-dimensional images and three-dimensional point cloud data are used as the validation set.
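
The 60/40 split described above can be sketched as follows. This is a minimal illustration, assuming each sample is a pair of an image file and a point cloud file; the function name `split_dataset`, the file names, the fixed random seed, and the shuffling step are assumptions made here and are not details given in the text.

```python
import random

def split_dataset(samples, train_ratio=0.6, seed=0):
    """Split paired (image, point cloud) samples into training and validation sets.

    The 60% training share follows the text; shuffling with a fixed seed
    is an assumption added here for reproducibility.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(round(len(shuffled) * train_ratio))
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical file names, for illustration only.
pairs = [(f"img_{k:03d}.png", f"cloud_{k:03d}.bin") for k in range(10)]
train_set, val_set = split_dataset(pairs)
print(len(train_set), len(val_set))
```

Keeping each image together with its point cloud in one tuple ensures that the two modalities of a scene never end up in different subsets.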

[0086] S3: Construct a high-voltage transmission line inspection model based on multimodal fusion perception; the high-voltage transmission line inspection model includes a depth completion module, a voxel screening module, a noise-resistant convolution module, and a multi-task detection head prediction module.

[0087] The depth completion module is used to perform depth completion on the three-dimensional point cloud data based on the two-dimensional image to obtain fused data.

[0088] Furthermore, the workflow of the depth completion module includes:

[0089] Step 1: Segment the two-dimensional image into multiple instances, and project the laser points onto the two-dimensional image;

[0090] Step 2: Sample the pixels within each instance and perform nearest neighbor association with the pixels projected by the laser point. Calculate the dynamic weight factor for each associated point using the following formula:

[0091] ;

[0092] where the symbols in the formula denote, in order: the dynamic weight factor of the v-th associated point; Sigmoid(·), the sigmoid activation function; the two-dimensional image features in three-dimensional space; the depth information of the v-th associated point; Linear(·), a linear transformation of its input; $C_d$, the extraction of two-dimensional image features in three-dimensional space; $m_v$, the abscissa of the two-dimensional image feature in three-dimensional space; and $n_v$, the ordinate of the two-dimensional image feature in three-dimensional space;

[0093] Step 3: Adjust the semantic features according to the dynamic weighting factor. The semantic features are used to modify the virtual points in three-dimensional space. The formula for adjusting the semantic features is as follows:

[0094] ;

[0095] where the defined term denotes the semantic features of the v-th associated point;

[0096] Step 4: Take the LiDAR as the origin of the coordinate system, with geographic east of the LiDAR as the x-axis, geographic south of the LiDAR as the y-axis, and the upward direction of the LiDAR as the z-axis; fuse and align the three-dimensional point cloud data and the three-dimensional virtual points in this coordinate system to obtain the fused data.
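
Steps 1 to 3 of the depth completion workflow can be illustrated with the rough sketch below. It assumes a pinhole camera with intrinsic matrix `K`, LiDAR points already expressed in the camera frame, and pixels already sampled inside one segmented instance. The weight is computed as a sigmoid of a linear map applied to the concatenated image feature and depth, and the semantic feature is scaled by that weight; both forms are assumptions chosen to match the symbol descriptions, and all function names are hypothetical.

```python
import numpy as np

def project_to_image(points_cam, K):
    """Project 3D points (camera frame, z forward) to pixels with a pinhole model."""
    uvw = points_cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3], points_cam[:, 2]

def nearest_lidar_pixel(sampled_px, lidar_px):
    """For each sampled pixel, return the index of the nearest projected LiDAR pixel."""
    d2 = ((sampled_px[:, None, :] - lidar_px[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def dynamic_weight(img_feat, depth, W, b):
    """Assumed form: Sigmoid(Linear([image feature, depth]))."""
    x = np.concatenate([img_feat, depth[:, None]], axis=1)
    return 1.0 / (1.0 + np.exp(-(x @ W + b)))

def back_project(px, depth, K):
    """Lift pixels with an associated depth back to 3D camera coordinates."""
    ones = np.ones((px.shape[0], 1))
    rays = np.concatenate([px, ones], axis=1) @ np.linalg.inv(K).T
    return rays * depth[:, None]

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
lidar_cam = np.array([[0.0, 0.0, 10.0], [1.0, 0.5, 20.0]])
lidar_px, lidar_depth = project_to_image(lidar_cam, K)

# Pixels sampled inside one instance mask (illustrative values).
sampled_px = np.array([[322.0, 241.0], [344.0, 252.0], [321.0, 239.0]])
idx = nearest_lidar_pixel(sampled_px, lidar_px)
virtual_depth = lidar_depth[idx]                 # depth borrowed from the nearest laser point
virtual_points = back_project(sampled_px, virtual_depth, K)

img_feat = np.full((3, 4), 0.5)                  # placeholder image features
W, b = np.full(5, 0.1), 0.0                      # placeholder linear-layer parameters
w = dynamic_weight(img_feat, virtual_depth, W, b)
adjusted_feat = w[:, None] * img_feat            # assumed form of the feature adjustment
```

The alignment of Step 4 is not shown; in this sketch it would amount to transforming `virtual_points` into the LiDAR-centered coordinate system and stacking them with the original point cloud.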

[0097] The voxel screening module is used to perform point cloud voxelization on the fused data.

[0098] Furthermore, the workflow of the voxel screening module includes:

[0099] Step 1: Perform point cloud voxelization on the fused data;

[0100] Step 2: During the point cloud voxelization process, sample the voxels of the virtual points in the three-dimensional space to obtain the voxels of the sampled virtual points in the three-dimensional space;

[0101] Step 3: Discard voxels within the range of ±5% of the total number of voxels to obtain the screened voxels;

[0102] In this embodiment, the voxel screening module is equipped with a drop convolutional layer AO. By default, during training optimization, the drop convolutional layer AO randomly drops voxels within the range of ±5% of the total number of voxels each time training passes through this layer.

[0103] Step 4: Integrate the selected voxels and the voxelized 3D point cloud data to obtain the fused voxelized point cloud data.

[0104] Furthermore, the sampling formula is:

[0105] ;

[0106] The voxels within the defined range are discarded according to the following formula:

[0107] ;

[0108] where $g$ denotes the total number of voxels of the sampled three-dimensional virtual points; the three count terms denote the numbers of voxels within 0-30 meters, within 30-60 meters, and beyond 60 meters, respectively; the symbol or denotes the discarded voxels; $i$ denotes the proportion of voxels discarded; and the remaining two terms denote the numbers of voxels discarded within 0-30 meters and within 30-60 meters, respectively.
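
The voxel screening workflow can be sketched roughly as follows. The distance bins (0-30 m, 30-60 m, beyond 60 m) come from the symbol descriptions above. The uniform random discard within the two nearer bins, the single `discard_ratio` parameter, and the function names are assumptions for illustration and are not the text's own formula.

```python
import numpy as np

def voxelize(points, voxel_size):
    """Return the unique integer voxel indices, shape (N_vox, 3), of a 3D point set."""
    idx = np.floor(points / voxel_size).astype(np.int64)
    return np.unique(idx, axis=0)

def screen_voxels(voxels, voxel_size, discard_ratio, rng):
    """Randomly discard a share of the voxels in the 0-30 m and 30-60 m bins.

    Voxels beyond 60 m are all kept. The per-bin uniform discard is an assumption.
    """
    centers = (voxels + 0.5) * voxel_size
    dist = np.linalg.norm(centers[:, :2], axis=1)   # horizontal range from the LiDAR
    keep = np.ones(len(voxels), dtype=bool)
    for lo, hi in [(0.0, 30.0), (30.0, 60.0)]:
        in_bin = np.flatnonzero((dist >= lo) & (dist < hi))
        n_drop = int(len(in_bin) * discard_ratio)
        if n_drop > 0:
            keep[rng.choice(in_bin, size=n_drop, replace=False)] = False
    return voxels[keep]

rng = np.random.default_rng(0)
# Synthetic virtual points in a 160 m x 160 m area around the sensor.
pts = rng.uniform(low=[-80, -80, -2], high=[80, 80, 6], size=(5000, 3))
vox = voxelize(pts, voxel_size=0.5)
kept = screen_voxels(vox, 0.5, discard_ratio=0.05, rng=rng)
print(len(vox), len(kept))
```

Discarding only in the nearer bins reflects the fact that nearby regions are densely covered by virtual points, while distant regions are sparse and benefit from keeping every voxel.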

[0109] The noise-resistant convolution module is used to perform noise-resistant processing on the voxelized point cloud data to obtain a fused 3D feature map.

[0110] Furthermore, the workflow of the noise-resistant convolution module includes:

[0111] Step 1: For three-dimensional space, based on the 3D index of the point cloud in the three-dimensional space after the point cloud voxelization, calculate the geometric features of non-empty voxels in the 3×3×3 neighborhood, encode all geometric features in the three-dimensional space, and obtain the geometric features of non-empty voxels in the 3×3 neighborhood.

[0112] Step 2: For two-dimensional space, transform the grid points of the voxelized fused data back to the original coordinate system, and project the grid points onto the two-dimensional image plane based on the LiDAR-camera calibration parameters; the calculation formula is:

[0113] ;

[0114] where the symbols denote, in order: the two-dimensional plane index vector; $p(\cdot)$, the projection; the inverse transformation of the data augmentation; the inverse transformation of the grid points to the original three-dimensional coordinate system; and $H$, the three-dimensional index vector;

[0115] Step 3: Based on the 2D index of the point cloud in the two-dimensional space of the fused data after voxelization of the point cloud, calculate the noise perception features of non-empty voxels in the 3×3 neighborhood, encode all the noise perception features in the two-dimensional space, and obtain the noise perception features of non-empty voxels in the 3×3 neighborhood.

[0116] Step 4: Concatenate the non-empty voxel geometric features in the 3×3 neighborhood with the non-empty voxel noise perception features in the 3×3 neighborhood to obtain a noise feature vector;

[0117] Step 5: Filter or remove the part corresponding to the noise feature vector in the fused data after voxelization of the point cloud to obtain the fused 3D feature map.

[0118] Furthermore, the concatenation is calculated as follows:

[0119] ;

[0120] ;

[0121] ;

[0122] ;

[0123] ;

[0124] ;

[0125] where the symbols denote, in order: the noise feature vector within the neighborhood of [i, b]; T, the transpose; the geometric features of a non-empty voxel within the 3×3 neighborhood; the noise perception features of a non-empty voxel within the 3×3 neighborhood; the encoding of all geometric features in three-dimensional space; $O(i+y, j+u, t)$, the value of the voxelized fused data at position $(i+y, j+u)$ in channel $t$; $K_h$, the kernel height; $K_w$, the kernel width; $Q$, the number of channels; $F(y, u, t, l)$, the weight of the $l$-th filter $F$ at position $(y, u)$ for channel $t$; $v$, the number of valid input features; $O_i$, the voxel features generated from the two-dimensional plane index vector; $\mathbb{R}$, the set of real numbers; the max pooling operation; and $S_{合}$, the feature obtained by fusing multiple adjacent voxel features generated from the two-dimensional plane index vector.
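
Step 2 of the noise-resistant convolution workflow, which maps a three-dimensional voxel index to a two-dimensional image-plane index, can be sketched as below. The sketch assumes that the data augmentation is a rotation about the z-axis followed by a uniform scaling, that the extrinsic calibration is a 4×4 matrix `T_cam_lidar`, and that the camera is a pinhole model with intrinsics `K`. All function names and parameter values are hypothetical.

```python
import numpy as np

def index_to_grid_point(H, voxel_size, pc_min):
    """Convert integer 3D voxel indices to metric grid points (voxel centers)."""
    return (H + 0.5) * voxel_size + pc_min

def undo_augmentation(points, rot_z, scale):
    """Invert an assumed augmentation p_aug = scale * (R @ p), R being a rotation about z."""
    c, s = np.cos(rot_z), np.sin(rot_z)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    # Row-vector form: p_aug = scale * p @ R.T, hence p = (p_aug / scale) @ R.
    return (points / scale) @ R

def project(points_lidar, T_cam_lidar, K):
    """Project LiDAR-frame points to (sub-pixel) image coordinates."""
    homo = np.concatenate([points_lidar, np.ones((len(points_lidar), 1))], axis=1)
    cam = (homo @ T_cam_lidar.T)[:, :3]
    uvw = cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
# LiDAR axes (x forward, y left, z up) to camera axes (x right, y down, z forward).
T_cam_lidar = np.array([[0.0, -1.0, 0.0, 0.0],
                        [0.0, 0.0, -1.0, 0.0],
                        [1.0, 0.0, 0.0, 0.0],
                        [0.0, 0.0, 0.0, 1.0]])

H = np.array([[20, 0, 0]])                               # one 3D voxel index
grid = index_to_grid_point(H, voxel_size=0.5, pc_min=np.array([0.0, -0.25, -0.25]))
original = undo_augmentation(grid, rot_z=0.0, scale=1.0)  # no augmentation in this example
uv = project(original, T_cam_lidar, K)
H_2d = np.floor(uv).astype(np.int64)                     # 2D plane index of the voxel
```

With the 2D index available, the 3×3 image-plane neighbors of each non-empty voxel can be gathered in the same way that the 3D index is used to gather spatial neighbors.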

[0126] The multi-task detection head prediction module is used to classify the fused 3D feature map to obtain the category.

[0127] Furthermore, the workflow of the multi-task detection head prediction module includes:

[0128] Step 1: The multi-task detection head prediction module is equipped with three convolutional blocks; the first layer of each convolutional block downsamples the fused 3D feature map through a convolution with a stride of 2, and extracts features from the downsampled fused 3D feature map through multiple convolutional sequences with a stride of 1 in each convolutional block to obtain the feature map;

[0129] Step 2: Upsample the feature map extracted from each convolutional block and stitch them together to obtain a multi-scale high-resolution feature map;

[0130] Step 3: Use the detection head to classify the multi-scale high-resolution feature map to obtain the category.
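
The three-block structure of the multi-task detection head can be illustrated with the NumPy sketch below. It assumes the fused feature map has been flattened to a two-dimensional map of shape (channels, height, width), uses 3×3 kernels with ReLU, upsamples by nearest-neighbor repetition, and brings every block output to the resolution of the first block. These choices, the channel counts, and the random weights are assumptions; the final classification step is omitted.

```python
import numpy as np

def conv2d(x, w, stride):
    """Naive 3x3 convolution with zero padding 1 and ReLU.

    x: (C_in, H, W); w: (C_out, C_in, 3, 3).
    """
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    h_out = (h + 2 - 3) // stride + 1
    w_out = (wd + 2 - 3) // stride + 1
    out = np.zeros((w.shape[0], h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            patch = xp[:, i * stride:i * stride + 3, j * stride:j * stride + 3]
            out[:, i, j] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2]))
    return np.maximum(out, 0.0)

def conv_block(x, weights):
    """The first layer downsamples with stride 2; the remaining layers use stride 1."""
    x = conv2d(x, weights[0], stride=2)
    for w in weights[1:]:
        x = conv2d(x, w, stride=1)
    return x

def upsample(x, factor):
    """Nearest-neighbor upsampling by an integer factor."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def multi_scale_features(x, blocks):
    """Run the blocks in sequence, upsample each output, and concatenate along channels."""
    feats, scale = [], 1
    for weights in blocks:
        x = conv_block(x, weights)
        scale *= 2
        feats.append(upsample(x, scale // 2))   # bring back to the first block's resolution
    return np.concatenate(feats, axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16, 16))
blocks = [[rng.normal(size=(8, c_in, 3, 3)) * 0.1, rng.normal(size=(8, 8, 3, 3)) * 0.1]
          for c_in in (4, 8, 8)]
out = multi_scale_features(x, blocks)
print(out.shape)
```

Concatenating the upsampled outputs lets the detection head see both fine detail from the first block and wider context from the deeper blocks at a single resolution.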

[0131] S4: Train the high-voltage transmission line inspection model based on the training set.

[0132] Specifically, the process includes the following steps:

[0133] Step 1: Predicted bounding boxes and ground-truth bounding boxes are set for the training set; the training set is input into the high-voltage transmission line inspection model, and the loss computed from the output predicted bounding boxes by the loss function is backpropagated through the model to update its parameters;

[0134] The loss function is expressed as:

[0135] ;

[0136] where $L$ denotes the loss function; $N_s$ denotes the number of detection boxes sampled during the training phase; $\beta_1$, $\beta_2$, and $\beta_3$ denote the first, second, and third loss coefficients; $L_{cls}$ denotes the classification loss function for the predicted bounding boxes; $L_{reg}$ denotes the bounding box regression loss function; and $L_{dir}$ denotes the direction classification loss function;

[0137] The expression for the bounding box classification loss function is:

[0138] ;

[0139] where $N$ denotes the number of categories; $y_a$ denotes the ground-truth label of the a-th category; and $p_a$ denotes the predicted probability of the a-th category;

[0140] The expression for the bounding box regression loss function is:

[0141] ;

[0142] where $N_{pos}$ denotes the number of anchor boxes; SmoothL1(·) denotes the SmoothL1 loss function; $dx_b$, $dy_b$, $dw_b$, and $dh_b$ denote the horizontal displacement, vertical displacement, width scaling, and height scaling correction values of the bounding box coordinates relative to the b-th anchor box; and $t_x$, $t_y$, $t_w$, and $t_h$ denote the horizontal displacement, vertical displacement, width scaling, and height scaling corrections of the ground-truth bounding box coordinates;

[0143] The expression for the orientation classification loss function is:

[0144] ;

[0145] where $L(l, l_{gt})$ denotes the angle regression loss function; $l$ denotes the predicted angle value; and $l_{gt}$ denotes the annotated ground-truth angle value.

[0146] Step 2: Repeat Step 1 until convergence or the maximum number of iterations is reached to obtain the trained high-voltage transmission line inspection model.
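
One plausible reading of the loss terms described above is sketched below. It assumes the total loss is a weighted sum of the three terms normalized by $N_s$, the classification loss is a cross-entropy over the $N$ categories, and the regression loss averages SmoothL1 residuals of (dx, dy, dw, dh) over the anchor boxes. The pairing of each coefficient with a loss term, the default coefficient values, and the function names are assumptions; the direction loss is passed in as a precomputed number.

```python
import numpy as np

def cross_entropy(y, p, eps=1e-12):
    """Classification loss: -sum_a y_a * log(p_a) over the N categories."""
    return float(-np.sum(y * np.log(p + eps)))

def smooth_l1(x, beta=1.0):
    """Element-wise SmoothL1: 0.5 * x^2 / beta if |x| < beta, else |x| - 0.5 * beta."""
    ax = np.abs(x)
    return np.where(ax < beta, 0.5 * ax ** 2 / beta, ax - 0.5 * beta)

def regression_loss(pred, target):
    """Mean over anchor boxes of the summed SmoothL1 residuals.

    pred, target: arrays of shape (N_pos, 4) holding (dx, dy, dw, dh) and (tx, ty, tw, th).
    """
    return float(smooth_l1(pred - target).sum(axis=1).mean())

def total_loss(l_cls, l_reg, l_dir, n_s, betas=(1.0, 2.0, 0.2)):
    """Assumed form: (beta1 * L_cls + beta2 * L_reg + beta3 * L_dir) / N_s.

    The coefficient values are illustrative placeholders.
    """
    b1, b2, b3 = betas
    return (b1 * l_cls + b2 * l_reg + b3 * l_dir) / n_s

y = np.array([0.0, 1.0, 0.0])            # one-hot ground-truth label
p = np.array([0.2, 0.7, 0.1])            # predicted class probabilities
l_cls = cross_entropy(y, p)

pred = np.array([[0.5, 0.0, 0.0, 2.0]])  # predicted corrections for one anchor box
target = np.zeros((1, 4))                # ground-truth corrections
l_reg = regression_loss(pred, target)

l_dir = 0.1                              # placeholder direction loss value
loss = total_loss(l_cls, l_reg, l_dir, n_s=1)
print(round(loss, 4))
```

SmoothL1 behaves quadratically for small residuals and linearly for large ones, which keeps gradients bounded when a predicted box is far from its target.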

[0147] S5: Input the validation set into the trained high-voltage transmission line inspection model, and after filtering by the post-processing module, output the target detection results.

[0148] Furthermore, the workflow of the post-processing module is as follows:

[0149] Step 1: For each of the output categories, sort the detection boxes from high to low according to the confidence level corresponding to the category of power equipment;

[0150] Step 2: Select the detection box with the highest confidence and add it to the final detection results list;

[0151] Step 3: Traverse the remaining detection boxes and discard those whose intersection over union (IoU) with the selected detection box is greater than the threshold; in this embodiment, the threshold is set to 0.7.

[0152] Step 4: Repeat steps 1-3 until all detection boxes have been traversed to obtain the target detection result.
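
The post-processing steps above correspond to non-maximum suppression, which can be sketched as follows for the boxes of a single category. The sketch assumes axis-aligned two-dimensional boxes given as (x1, y1, x2, y2), which is a simplification of oriented 3D boxes; the function names are hypothetical, and the 0.7 threshold is the value given in this embodiment.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, each given as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.7):
    """Return the indices of the boxes kept, in descending order of confidence."""
    order = np.argsort(-scores)          # Step 1: sort by confidence, high to low
    keep = []
    while order.size > 0:
        best = order[0]                  # Step 2: take the most confident remaining box
        keep.append(int(best))
        rest = order[1:]
        # Step 3: discard boxes that overlap the selected box by more than the threshold
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep                          # Step 4: loop ends when every box is handled

boxes = np.array([[0.0, 0.0, 10.0, 10.0],
                  [0.5, 0.0, 10.5, 10.0],
                  [20.0, 20.0, 30.0, 30.0]])
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
print(kept)
```

In a multi-category setting, the same function would be called once per category on that category's boxes and confidences.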

[0153] The inspection method provided in this embodiment has the following beneficial effects:

[0154] 1. This method can not only evaluate in real time in complex and constrained environments, but also eliminates the tedious operation of manual visual retrieval and reduces the occurrence of misjudgments.

[0155] 2. This method integrates multiple modal sensing technologies, which can further improve the ability to perform refined 3D sensing in complex and constrained environments, thereby improving detection accuracy.

[0156] 3. The provided depth completion module solves the problem of the difficulty in integrating different representations of point clouds and images. It uses image features to generate virtual points, performs depth completion on the original point cloud, and then performs fusion and alignment. By making good use of the semantic features of the image, it can provide a more comprehensive scene understanding, which helps to detect high-voltage transmission lines more accurately.

[0157] 4. The provided voxel filtering module addresses the computational burden caused by the virtual points generated during depth completion: it discards useless virtual points generated from the image, which improves the computational efficiency and robustness of the model.

[0158] 5. The provided noise-resistant convolution module is designed to address the issue of a large amount of noise generated in virtual points during depth completion. This module removes noise and improves the accuracy of high-voltage transmission line inspection models in complex and constrained scenarios.

[0159] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0160] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are relatively specific and detailed, they should not be construed as limiting the scope of the patent application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this patent application should be determined by the appended claims.

Claims

1. A multi-modal fusion method for inspecting high-voltage transmission lines in complex restricted scenes, characterized in that it comprises:
S1: Collect two-dimensional images and three-dimensional point cloud data of high-voltage transmission lines;
S2: Based on the two-dimensional images and the three-dimensional point cloud data, construct a high-voltage transmission line detection dataset, and divide the dataset proportionally into a training set and a validation set;
S3: Construct a high-voltage transmission line inspection model with multimodal fusion perception; the high-voltage transmission line inspection model includes a depth completion module, a voxel screening module, a noise-resistant convolution module, and a multi-task detection head prediction module;
The depth completion module is used to perform depth completion on the three-dimensional point cloud data based on the two-dimensional image to obtain fused data; the workflow of the depth completion module includes:
Step 1: Segment the two-dimensional image into multiple instances, and project the laser points onto the two-dimensional image;
Step 2: Sample the pixels within each instance, perform nearest-neighbor association between the sampled pixels and the pixels onto which the laser points are projected, and calculate the dynamic weighting factor for each associated point. The calculation formula is as follows: [formula not reproduced]; the quantities in the formula are: the dynamic weighting factor of the v-th associated point; Sigmoid(·), the sigmoid activation function; the two-dimensional image features within three-dimensional space; the depth information of the v-th associated point; Linear(·), a linear transformation of the input into the output; C_d, the extraction of two-dimensional image features from three-dimensional space; m_v, the abscissa of the two-dimensional image feature in three-dimensional space; n_v, the ordinate of the two-dimensional image feature in three-dimensional space;
Step 3: Adjust the semantic features according to the dynamic weighting factor; the semantic features are used to modify the virtual points in three-dimensional space. The calculation formula for adjusting the semantic features is as follows: [formula not reproduced]; the formula involves the semantic feature of the v-th associated point;
Step 4: Establish a coordinate system with the lidar as the origin, the east of the geolocation radar as the x-axis, the south of the geolocation radar as the y-axis, and the upward direction of the lidar as the z-axis; fuse and align the three-dimensional point cloud data with the virtual points in three-dimensional space in this coordinate system to obtain the fused data;
The voxel screening module is used to perform point cloud voxelization on the fused data;
The noise-resistant convolution module is used to perform noise-resistant processing on the voxelized fused point cloud data to obtain a fused 3D feature map;
The multi-task detection head prediction module is used to classify the fused 3D feature map to obtain the category;
S4: Train the high-voltage transmission line inspection model based on the training set;
S5: Input the validation set into the trained high-voltage transmission line inspection model, and after filtering by the post-processing module, output the target detection results.

2. The multi-modal fusion method for inspecting high-voltage transmission lines in complex restricted scenes as claimed in claim 1, wherein, in S2, constructing the high-voltage transmission line detection dataset based on the two-dimensional images and the three-dimensional point cloud data includes: labeling 60% of the collected two-dimensional images and three-dimensional point cloud data to construct the training set; and using the remaining two-dimensional images and three-dimensional point cloud data as the validation set.
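
The 60% / remainder division can be sketched in Python as follows. The function name `split_dataset`, the file names, and the seeded shuffle before splitting are assumptions for illustration; the claim states only the proportion.

```python
import random


def split_dataset(samples, train_fraction=0.6, seed=0):
    """Split paired (image, point cloud) samples into training and validation sets."""
    shuffled = list(samples)
    # Shuffling is an assumption; the claim does not say how samples are chosen.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical file names, used only to show the call.
pairs = [(f"img_{i}.png", f"cloud_{i}.pcd") for i in range(10)]
train_set, val_set = split_dataset(pairs)
```

Keeping each image paired with its point cloud during the split ensures both modalities of a scene land in the same subset.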

3. The multi-modal fusion method for inspecting high-voltage transmission lines in complex restricted scenes as claimed in claim 1, wherein the workflow of the voxel screening module includes:
Step 1: Perform point cloud voxelization on the fused data;
Step 2: During the point cloud voxelization process, sample the voxels of the virtual points in three-dimensional space to obtain the sampled voxels of the virtual points in three-dimensional space;
Step 3: Discard the voxels and voxels within the range of ±5% of the total number to obtain the filtered voxels;
Step 4: Integrate the filtered voxels with the voxelized three-dimensional point cloud data to obtain the fused voxelized point cloud data.

4. The multi-modal fusion method for inspecting high-voltage transmission lines in complex restricted scenes as claimed in claim 1, wherein the workflow of the noise-resistant convolution module includes:
Step 1: For three-dimensional space, based on the 3D index of the voxelized point cloud in three-dimensional space, calculate the geometric features of non-empty voxels in the 3×3×3 neighborhood, encode all geometric features in three-dimensional space, and obtain the geometric features of non-empty voxels in the 3×3 neighborhood;
Step 2: For two-dimensional space, transform the grid points of the voxelized fused data back to the original coordinate system, and project the grid points onto the two-dimensional image plane based on the camera calibration parameters of the LiDAR; the calculation formula is: [formula not reproduced]; the quantities in the formula are: the two-dimensional plane index vector; p(·), the projection; the inverse transform of data augmentation; the inverse transformation of grid points to the original 3D coordinate system; H, the three-dimensional index vector;
Step 3: Based on the 2D index of the voxelized fused data in two-dimensional space, calculate the noise perception features of non-empty voxels in the 3×3 neighborhood, encode all the noise perception features in two-dimensional space, and obtain the noise perception features of non-empty voxels in the 3×3 neighborhood;
Step 4: Concatenate the geometric features of non-empty voxels in the 3×3 neighborhood with the noise perception features of non-empty voxels in the 3×3 neighborhood to obtain a noise feature vector;
Step 5: Filter or remove the part of the voxelized fused data corresponding to the noise feature vector to obtain the fused 3D feature map.
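
The mapping in Step 2 of claim 4, from a three-dimensional voxel index to a two-dimensional plane index, can be illustrated with the sketch below. Everything concrete here is an assumption: the voxel size and range origin, an augmentation consisting of a y-flip followed by global scaling, and a single 3x4 matrix `P` standing in for the calibration parameters. The function names are illustrative, and the claimed formula itself is not reproduced in the text.

```python
import math


def voxel_index_to_point(index, voxel_size, range_min):
    """Centre of the voxel with integer index (ix, iy, iz), in metres."""
    return tuple(lo + (i + 0.5) * s
                 for i, s, lo in zip(index, voxel_size, range_min))


def undo_flip_and_scale(point, scale=1.0, flip_y=False):
    """Inverse of an assumed augmentation: y-flip followed by global scaling."""
    x, y, z = (c / scale for c in point)
    return (x, -y if flip_y else y, z)


def project_to_image(point, P):
    """Project a 3D point with a 3x4 matrix P (assumed to combine the
    extrinsic and intrinsic calibration); returns pixel coordinates (u, v)."""
    x, y, z = point
    h = [row[0] * x + row[1] * y + row[2] * z + row[3] for row in P]
    return (h[0] / h[2], h[1] / h[2])


def voxel_index_to_pixel_index(index, voxel_size, range_min, P,
                               scale=1.0, flip_y=False):
    """3D index -> grid point -> original coordinates -> 2D plane index."""
    point = voxel_index_to_point(index, voxel_size, range_min)
    original = undo_flip_and_scale(point, scale, flip_y)
    u, v = project_to_image(original, P)
    return (math.floor(u), math.floor(v))
```

Undoing the augmentation before projecting matters because the calibration parameters relate the image to the original, unaugmented coordinate system.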

5. The multi-modal fusion method for inspecting high-voltage transmission lines in complex restricted scenes as claimed in claim 1, wherein the workflow of the multi-task detection head prediction module includes:
Step 1: The multi-task detection head prediction module is equipped with three convolutional blocks; the first layer of each convolutional block downsamples the fused 3D feature map through a convolution with a stride of 2, and the subsequent sequence of stride-1 convolutions in each convolutional block extracts features from the downsampled fused 3D feature map to obtain a feature map;
Step 2: Upsample the feature maps extracted by each convolutional block and concatenate them to obtain multi-scale high-resolution feature maps;
Step 3: Use the detection head to classify the multi-scale high-resolution feature maps to obtain the category.

6. The multi-modal fusion method for inspecting high-voltage transmission lines in complex restricted scenes as claimed in claim 1, wherein the process of training the high-voltage transmission line inspection model based on the training set includes:
Step 1: Predicted bounding boxes and actual bounding boxes are set in the training set; the training set is input into the high-voltage transmission line inspection model, the loss function is evaluated on the predicted bounding boxes output by the model, and the loss is backpropagated to update the model parameters;
The loss function is expressed as: [formula not reproduced]; where L denotes the loss function; N_s denotes the number of detection boxes sampled during the training phase; β_1, β_2 and β_3 denote the first, second and third loss coefficients; L_cls denotes the classification loss function for predicted bounding boxes; L_reg denotes the bounding box regression loss function; L_dir denotes the direction classification loss function;
The expression for the classification loss function is: [formula not reproduced]; where N denotes the number of categories; y_a denotes the actual category label of the a-th category; p_a denotes the predicted probability of the a-th category;
The expression for the bounding box regression loss function is: [formula not reproduced]; where N_pos denotes the number of anchor boxes; SmoothL1(·) denotes the Smooth L1 loss function; dx_b, dy_b, dw_b and dh_b denote the horizontal displacement, vertical displacement, width scaling and height scaling correction values of the bounding box coordinates relative to the b-th anchor box; t_x, t_y, t_w and t_h denote the horizontal displacement, vertical displacement, width scaling and height scaling corrections of the actual bounding box coordinates;
The expression for the direction classification loss function is: [formula not reproduced]; where L(λ, λ_gt) denotes the angle regression loss function; λ denotes the predicted angle value; λ_gt denotes the labeled actual angle value;
Step 2: Repeat Step 1 until convergence or the maximum number of iterations is reached to obtain the trained high-voltage transmission line inspection model.
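
The loss terms defined in claim 6 can be illustrated with the Python sketch below. The forms used are assumptions, since the formulas are not reproduced in the text: a cross-entropy classification loss over N categories, a Smooth L1 regression loss averaged over anchor boxes, and a total loss formed as the β-weighted sum of the three terms divided by N_s. The default coefficient values and function names are placeholders.

```python
import math


def cross_entropy(y_true, p_pred, eps=1e-12):
    """Assumed form of L_cls: -sum over categories of y_a * log(p_a)."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, p_pred))


def smooth_l1(x, beta=1.0):
    """Smooth L1: quadratic for |x| < beta, linear otherwise."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta


def box_regression_loss(pred_offsets, target_offsets):
    """Assumed form of L_reg: mean over anchor boxes of Smooth L1 applied
    to the (dx, dy, dw, dh) residuals against (t_x, t_y, t_w, t_h)."""
    total = 0.0
    for pred, target in zip(pred_offsets, target_offsets):
        total += sum(smooth_l1(p - t) for p, t in zip(pred, target))
    return total / max(len(pred_offsets), 1)


def total_loss(l_cls, l_reg, l_dir, n_s, betas=(1.0, 2.0, 0.2)):
    """Assumed combination: (b1*L_cls + b2*L_reg + b3*L_dir) / N_s.
    The beta values are placeholders, not values from the patent."""
    b1, b2, b3 = betas
    return (b1 * l_cls + b2 * l_reg + b3 * l_dir) / n_s
```

Smooth L1 behaves like a squared error for small residuals and like an absolute error for large ones, which limits the influence of badly matched boxes on the gradient.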

7. The multi-modal fusion method for inspecting high-voltage transmission lines in complex restricted scenes as claimed in claim 1, wherein the workflow of the post-processing module is as follows:
Step 1: For each output category, sort the detection boxes in descending order of the confidence score for that category of power equipment;
Step 2: Select the detection box with the highest confidence and add it to the final detection results list;
Step 3: Traverse the remaining detection boxes and discard those whose intersection-over-union (IoU) with the detection box selected in Step 2 is greater than the threshold;
Step 4: Repeat steps 1-3 until all detection boxes have been traversed to obtain the target detection result.

8. The multi-modal fusion method for inspecting high-voltage transmission lines in complex restricted scenes as claimed in claim 3, wherein the voxels of the virtual points in three-dimensional space are sampled using the following formula: [formula not reproduced]; and the voxels discarded within the defined ranges are calculated as follows: [formula not reproduced]; where ζ denotes the sum of the voxels of the sampled virtual points in three-dimensional space; η denotes the discarded voxels; θ denotes the proportion of voxels discarded; and the remaining quantities in the formulas are: the number of voxels within the range of 0-30 meters; the number of voxels within the range of 30-60 meters; the number of voxels beyond 60 meters; the number of voxels discarded within the range of 0-30 meters; and the number of voxels discarded within the range of 30-60 meters.
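
The distance-based discarding of virtual-point voxels in claims 3 and 8 can be sketched as follows. This is one plausible reading, not the claimed formula: voxels are binned by Euclidean distance from the lidar origin into 0-30 m, 30-60 m and beyond 60 m, a proportion `theta` of the voxels in the two nearer bins is dropped at random, and all far voxels are kept. The function names and the default `theta` are assumptions.

```python
import math
import random


def bin_by_distance(voxel_centers):
    """Group voxel indices into near (0-30 m), mid (30-60 m), far (>60 m)."""
    bins = {"near": [], "mid": [], "far": []}
    for i, (x, y, z) in enumerate(voxel_centers):
        d = math.sqrt(x * x + y * y + z * z)
        key = "near" if d < 30.0 else "mid" if d < 60.0 else "far"
        bins[key].append(i)
    return bins


def discard_virtual_voxels(voxel_centers, theta=0.5, seed=0):
    """Return sorted indices of kept voxels after dropping a fraction
    theta of the near and mid voxels; far voxels are all kept."""
    rng = random.Random(seed)
    bins = bin_by_distance(voxel_centers)
    kept = list(bins["far"])
    for key in ("near", "mid"):
        members = bins[key]
        n_drop = int(len(members) * theta)
        dropped = set(rng.sample(members, n_drop))
        kept.extend(i for i in members if i not in dropped)
    return sorted(kept)
```

Discarding mainly among nearby voxels reflects the fact that virtual points are densest close to the sensor, so thinning them reduces computation while leaving sparse distant regions intact.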

9. The multi-modal fusion method for inspecting high-voltage transmission lines in complex restricted scenes as claimed in claim 4, wherein the concatenation is calculated by the following formulas: [six formulas not reproduced]; the quantities in the formulas are: the noise feature vector within the interval's neighborhood, indexed by i and b; T, the transpose; the geometric features of a non-empty voxel within a 3×3 neighborhood; the noise perception features of a non-empty voxel within a 3×3 neighborhood; the encoding of all geometric features in three-dimensional space; O(i+y, j+u, t), the pixel value of the voxelized fused point cloud data at position (i+y, j+u) in channel t; K_h, the height; K_w, the width; Q, the number of channels; F(y, u, t, l), the weight of the l-th filter F at position (y, u) in channel t; v, the number of valid input features; O_i, the voxel features generated by the two-dimensional plane index vector; the set of real numbers; max pooling; and S_合 (the subscript 合 means "combined"), the feature generated by fusing multiple adjacent voxel features generated from the two-dimensional plane index vector.