Crack detection method and system based on double-path feature extraction and gating fusion
By employing a dual-path feature extraction and gating fusion method, the invention addresses the trade-off between accuracy and computational load in UAV road surface crack detection, enabling efficient and accurate detection of both small and large cracks and improving the real-time performance and robustness of UAV detection.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- EAST CHINA JIAOTONG UNIVERSITY
- Filing Date
- 2026-02-06
- Publication Date
- 2026-06-12
AI Technical Summary
Existing UAV-based methods for detecting road surface cracks struggle to balance maintaining accuracy with reducing computational load. They lack differentiated processing of semantic and detailed features, leading to missed detections of minute cracks and interference from background noise. This results in an imbalance in multi-scale perception capabilities, making it difficult to simultaneously detect both minute and long-distance cracks.
A dual-path feature extraction and gating fusion method is adopted. Semantic and detail features are extracted in parallel through standard convolution and depthwise separable convolution. Combined with gating attention fusion and feature pyramid aggregation, multi-scale detection features are generated to achieve accurate crack detection.
It significantly reduces the number of model parameters and computational load, enhances the stability and robustness of detection, and can simultaneously capture both small and large cracks, thereby improving the accuracy and real-time performance of detection.
Figure CN121661353B_ABST
Description
Technical Field
[0001] This invention relates to the field of computer vision technology, specifically to a crack detection method and system based on dual-path feature extraction and gating fusion.
Background Technology
[0002] Road surface cracks, as a major manifestation of early road surface damage, are a key indicator for assessing road health. With the development of UAV low-altitude remote sensing technology, non-contact automatic inspection combined with YOLO series target detection algorithms has become an industry trend. However, existing methods still have significant shortcomings in UAV embedded platform applications.
[0003] Existing detection models face three main problems when directly applied to UAV crack detection: first, the single-path serial feature extraction architecture struggles to balance maintaining accuracy with reducing computational cost, and excessive lightweighting leads to missed detection of tiny cracks; second, there is a lack of differentiated processing of semantic and detail features, and simple feature fusion methods easily confuse background noise such as shadows and water stains with real cracks; and third, there is an imbalance in multi-scale perception capabilities, making it difficult to simultaneously capture tiny cracks accurately and detect the integrity of long-distance cracks.
Summary of the Invention
[0004] This invention provides a crack detection method and system based on dual-path feature extraction and gating fusion, which solves the problem of balancing real-time performance and accuracy in UAV road surface crack detection.
[0005] To achieve the above objectives, the present invention provides the following technical solution:
[0006] This invention relates to a crack detection method based on dual-path feature extraction and gating fusion, comprising:
[0007] S100: Acquire the road surface image to be detected, perform size adjustment and normalization processing on the road surface image to obtain the network input tensor;
[0008] S200: Input the network input tensor into a feature extraction network. At multiple scale levels, features are extracted in parallel through a first feature extraction path and a second feature extraction path. The first feature extraction path uses standard convolution to extract semantic features and obtains a first feature map. The second feature extraction path uses depthwise separable convolution to extract detail features and obtains a second feature map.
[0009] S300: For each scale level, gated attention fusion is performed on the corresponding first feature map and second feature map; adaptive gate weights are generated through local context extraction and global feature extraction, and the first feature map and second feature map are weighted and fused based on the gate weights to obtain a fused feature map;
[0010] S400: Perform bidirectional feature aggregation on the fused feature map from top to bottom and from bottom to top to obtain multi-scale detection features;
[0011] S500: Input the multi-scale detection features into the detection head for prediction to obtain the prediction result; perform confidence filtering and non-maximum suppression on the prediction result, and output the crack detection result.
[0012] As a preferred embodiment of the present invention, before the network input tensor is input into the feature extraction network, the method further includes:
[0013] The network input tensor is convolutionally downsampled to obtain intermediate features;
[0014] The intermediate features are subjected to max pooling and to cascaded convolution operations, respectively, to obtain the first branch features and the second branch features.
[0015] The first branch features and the second branch features are concatenated along the channel dimension and then fused by convolution to obtain the initial feature map;
[0016] The initial feature map is used as input to the feature extraction network.
[0017] As a preferred embodiment of the present invention, the plurality of scale levels include a first scale level, a second scale level, and a third scale level;
[0018] The feature map resolution of the first scale level is 1 / 8 of the input image, the feature map resolution of the second scale level is 1 / 16 of the input image, and the feature map resolution of the third scale level is 1 / 32 of the input image.
[0019] As a preferred embodiment of the present invention, the step of obtaining the first feature map includes:
[0020] Perform standard convolution downsampling on the input features;
[0021] The downsampled features are extracted using a residual module containing multiple stacked bottleneck units to obtain the first feature map.
[0022] As a preferred embodiment of the present invention, the step of obtaining the second feature map includes:
[0023] Perform depthwise separable convolution downsampling on the input features;
[0024] The downsampled features are extracted by stacking multiple lightweight convolution operators. The outputs of each layer are concatenated along the channel dimension and then aggregated by a squeeze convolution and an excitation convolution. Finally, a residual connection is made with the input to obtain the second feature map.
[0025] As a preferred embodiment of the present invention, the gated attention fusion step includes:
[0026] Local context features are obtained by performing local context extraction on the first feature map and the second feature map respectively through depthwise separable convolution and dilated convolution.
[0027] Global context features are obtained by performing global feature extraction on the first feature map and the second feature map respectively through global average pooling and global max pooling.
[0028] After fusing the local context features and global context features, the adaptive gating weights are generated through an activation function;
[0029] The first feature map and the second feature map are cross-gated and weighted by the gating weights to obtain the fused feature map.
[0030] As a preferred embodiment of the present invention, before performing bidirectional feature aggregation from top to bottom and bottom to top on the fused feature map, the method further includes:
[0031] Multi-scale contextual information is extracted from the fused feature map at the third scale level through max pooling operations with different kernel sizes, and the results are concatenated.
[0032] Enhance the response weights of key feature channels through a self-attention mechanism.
[0033] As a preferred embodiment of the present invention, the bidirectional feature aggregation of the fused feature map, both top-down and bottom-up, includes:
[0034] After upsampling the fused feature map of the third scale level, it is concatenated with the fused feature maps of the second scale level and the first scale level in the channel dimension and then convolved and fused to achieve top-down aggregation.
[0035] After downsampling the fused feature map of the first scale level, it is concatenated with the fused feature maps of the second and third scale levels in the channel dimension and then convolved and fused to achieve bottom-up aggregation.
[0036] As a preferred technical solution of the present invention, inputting the multi-scale detection features into the detection head for prediction includes:
[0037] The crack category confidence is predicted by the convolutional layers of the classification branch;
[0038] The bounding box parameters of the crack are predicted by the convolutional layer of the regression branch, and the bounding box parameters include the center point coordinates and width and height dimensions.
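The post-processing named in S500 (confidence filtering followed by non-maximum suppression) can be illustrated on boxes given by center coordinates and width/height, as described above. The following is a minimal sketch, not the patented implementation: the function names, the greedy NMS variant, and the thresholds 0.25 and 0.45 are illustrative assumptions that do not come from the source.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (cx, cy, w, h), as in the regression branch


def to_corners(box: Box) -> Tuple[float, float, float, float]:
    """Convert a center-format box to corner format (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two center-format boxes."""
    ax1, ay1, ax2, ay2 = to_corners(a)
    bx1, by1, bx2, by2 = to_corners(b)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def postprocess(
    boxes: Sequence[Box],
    scores: Sequence[float],
    conf_thres: float = 0.25,  # assumed threshold, not from the source
    iou_thres: float = 0.45,   # assumed threshold, not from the source
) -> List[int]:
    """Confidence filtering, then greedy NMS; returns indices of kept boxes."""
    order = sorted(
        (i for i, s in enumerate(scores) if s >= conf_thres),
        key=lambda i: scores[i],
        reverse=True,
    )
    kept: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[j]) <= iou_thres for j in kept):
            kept.append(i)
    return kept
```

In this sketch, two heavily overlapping predictions on the same crack collapse to the higher-scoring one, and low-confidence predictions are dropped before any overlap test is made.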
[0039] This invention also proposes a crack detection system based on dual-path feature extraction and gating fusion, comprising:
[0040] The image preprocessing module is used to acquire the road surface image to be detected, and to perform size adjustment and normalization processing on the road surface image to obtain the network input tensor;
[0041] The dual-path feature extraction module is used to input the network input tensor into the feature extraction network, and extract features in parallel through the first feature extraction path and the second feature extraction path at multiple scale levels; the first feature extraction path uses standard convolution to extract semantic features to obtain a first feature map; the second feature extraction path uses depthwise separable convolution to extract detail features to obtain a second feature map.
[0042] The gated attention fusion module is used to perform gated attention fusion on the corresponding first feature map and second feature map for each scale level; by extracting local context and global features, an adaptive gate weight is generated, and the first feature map and second feature map are weighted and fused based on the gate weight to obtain a fused feature map;
[0043] The feature pyramid aggregation module is used to perform bidirectional feature aggregation on the fused feature map from top to bottom and from bottom to top to obtain multi-scale detection features;
[0044] The detection and post-processing module is used to input the multi-scale detection features into the detection head for prediction and obtain the prediction results; perform confidence filtering and non-maximum suppression on the prediction results and output the crack detection results.
[0045] The beneficial effects of this invention are:
[0046] 1. This invention constructs a dual-path parallel feature extraction structure, which enables the standard convolutional path to maintain stable extraction of deep semantic features, while the lightweight path reduces computational complexity and captures high-frequency texture details through depthwise separable convolution and HGBlock modules. The two work together to significantly reduce the number of model parameters and computational load, while effectively avoiding the problem of loss of micro-crack features caused by excessive lightweighting.
[0047] 2. The gated attention fusion mechanism introduced in this invention generates adaptive gating weights through local context extraction and global feature extraction, selectively weights and fuses dual-path features, and cross-modulates them, thereby enhancing the effective feature response related to the crack target and suppressing the interference of background noise such as shadows and water stains, which significantly improves the detection stability and robustness of the model in complex road environments.
[0048] 3. This invention performs dual-path extraction and gated fusion at three scale levels: P3, P4, and P5. Combined with the bidirectional aggregation mechanism of the feature pyramid, it enables shallow high-resolution features to obtain deep semantic information and deep features to obtain shallow localization information, thereby achieving comprehensive and accurate detection capability for crack targets of different scales, from micro-cracks to large through-cracks.
Attached Figure Description
[0049] The accompanying drawings are provided to further illustrate the invention and form part of the specification. They are used in conjunction with embodiments of the invention to explain the invention and do not constitute a limitation thereof. In the drawings:
[0050] Figure 1 is a flowchart of the crack detection method based on dual-path feature extraction and gating fusion of the present invention;
[0051] Figure 2 is a schematic diagram of the crack detection system based on dual-path feature extraction and gating fusion according to the present invention;
[0052] Figure 3 is a flowchart of the dual-path parallel feature extraction structure of the present invention;
[0053] Figure 4 is a flowchart of the gating attention fusion module of the present invention;
[0054] Figure 5 shows detection effect diagrams of the method of the present invention under different road surface scenarios.
Detailed Implementation
[0055] The preferred embodiments of the present invention will be described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described herein are for illustration and explanation only and are not intended to limit the present invention.
[0056] This invention can be implemented on computing devices equipped with GPUs, such as industrial computers with at least 6GB of video memory, airborne embedded computing platforms (such as Jetson Orin NX), or servers. This invention can be implemented using a deep learning inference framework based on tensor computation. Those skilled in the art can replace or equivalently adjust the module structure and computational flow provided by this invention with different convolutional kernel sizes, dilation rates, or attention structures, all of which fall within the scope of protection of this invention.
[0057] Example 1: As shown in Figure 1, the crack detection method based on dual-path feature extraction and gating fusion includes:
[0058] S100: Acquire the road surface image to be detected, perform size adjustment and normalization processing on the road surface image to obtain the network input tensor;
[0059] Specifically, the road surface image to be detected, collected by the drone, is first read. To meet the network input requirements, the image is scaled to a preset size (e.g., 640 pixels) while maintaining its aspect ratio, and the short side is then filled with zero padding or constant padding to ensure a uniform size. In this embodiment, the input resolution is preferably 640×640, but it can also be set to 800×800 or other equivalent sizes according to actual needs.
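The aspect-preserving resize and padding described above can be sketched as a small geometry calculation. This is only an illustration under stated assumptions: the function name is hypothetical, the long side is assumed to be scaled to the preset size, and the padding is assumed to be split evenly between the two sides of the short dimension (the source only says that the short side is padded).

```python
def letterbox_geometry(width: int, height: int, target: int = 640) -> dict:
    """Return the resized size and per-side padding for an aspect-preserving fit."""
    scale = target / max(width, height)   # long side is scaled to `target`
    new_w = round(width * scale)
    new_h = round(height * scale)
    pad_w = target - new_w                # total padding still needed per axis
    pad_h = target - new_h
    left, top = pad_w // 2, pad_h // 2    # assumption: centered padding
    return {
        "scale": scale,
        "resized": (new_w, new_h),
        "pad": (left, top, pad_w - left, pad_h - top),  # left, top, right, bottom
    }
```

For a 1920×1080 frame and a 640×640 target, the image would be resized to 640×360 and padded with 140 rows above and below.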
[0060] Image pixel values are mapped to floating-point numbers in the range [0, 1], and mean-variance normalization is then performed. Let $c$ denote a channel of the input image, and let the pixel value of channel $c$ at coordinate $(x, y)$ be $I_c(x, y)$. The normalized pixel value $\hat{I}_c(x, y)$ is calculated as follows:
[0061] $\hat{I}_c(x, y) = \dfrac{I_c(x, y) - \mu_c}{\sigma_c}$;
[0062] where $\mu_c$ and $\sigma_c$ are the mean and standard deviation of channel $c$ for the target road surface dataset, calculated on the training dataset. For a pavement crack dataset, the mean and standard deviation of the three color channels are determined based on the pixel distribution characteristics of the specific dataset.
[0063] The image data layout is converted from $(H, W, C)$ (height, width, channel) to $(C, H, W)$, and a batch dimension $B$ is added to form the final network input tensor $X \in \mathbb{R}^{B \times 3 \times H \times W}$, where $B$ is the batch size, which is typically set to 1 during the inference phase; 3 is the number of color channels (RGB); and $H$ and $W$ are the height and width of the image, respectively.
[0064] The above preprocessing steps convert the original road surface image into a standardized tensor that meets the input requirements of the neural network, laying the foundation for subsequent feature extraction and detection.
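The normalization and layout change in this step can be sketched in plain Python on nested lists. This is a minimal sketch rather than the embodiment's code: the function name is hypothetical, and the mean and standard deviation passed in are placeholders, since the source states that the real values are computed from the training dataset.

```python
from typing import List, Sequence


def preprocess(
    image_hwc: Sequence[Sequence[Sequence[int]]],
    mean: Sequence[float],
    std: Sequence[float],
) -> List[List[List[List[float]]]]:
    """Convert a uint8 HWC image to a normalized float tensor shaped (1, C, H, W)."""
    height, width = len(image_hwc), len(image_hwc[0])
    channels = len(mean)
    chw = [
        [
            [
                # Map to [0, 1], then apply per-channel mean/std normalization.
                (image_hwc[y][x][c] / 255.0 - mean[c]) / std[c]
                for x in range(width)
            ]
            for y in range(height)
        ]
        for c in range(channels)
    ]
    return [chw]  # leading batch dimension, B = 1 at inference
```

With the placeholder statistics mean = std = 0.5 per channel, a pixel value of 255 maps to 1.0 and a pixel value of 0 maps to -1.0.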
[0065] S200: Input the network input tensor into a feature extraction network. At multiple scale levels, features are extracted in parallel through a first feature extraction path and a second feature extraction path. The first feature extraction path uses standard convolution to extract semantic features and obtains a first feature map. The second feature extraction path uses depthwise separable convolution to extract detail features and obtains a second feature map.
[0066] Furthermore, before the network input tensor is input into the feature extraction network, the method further includes:
[0067] The network input tensor is convolutionally downsampled to obtain intermediate features;
[0068] The intermediate features are subjected to max pooling and to cascaded convolution operations, respectively, to obtain the first branch features and the second branch features.
[0069] The first branch features and the second branch features are concatenated along the channel dimension and then fused by convolution to obtain the initial feature map;
[0070] The initial feature map is used as input to the feature extraction network.
[0071] Specifically, before the network input tensor $X$ is input into the feature extraction network, a lightweight initial feature mapping is first performed using the HGStem module. The preprocessed tensor $X$ is input into the HGStem module, which replaces the first layer of a traditional convolutional network and retains rich edge information while reducing resolution through a parallel branching structure.
[0072] The input $X$ first passes through the first convolutional layer $\mathrm{Conv}_1$ (stride 2) for initial downsampling, giving the intermediate features $F_1$. The features $F_1$ then enter two parallel branches: the pooling branch performs max pooling (kernel size 2, stride 1) and outputs the first branch features $F_{pool}$; the convolution branch passes through two convolutional layers in sequence, $\mathrm{Conv}_{2a}$ and $\mathrm{Conv}_{2b}$, and outputs the second branch features $F_{conv}$. The outputs of the two branches are concatenated along the channel dimension and then fused by the compression convolutions $\mathrm{Conv}_3$ and $\mathrm{Conv}_4$ to obtain the initial feature map $F_0$:
[0073] $F_0 = \mathrm{Conv}_4\big(\mathrm{Conv}_3\big(\mathrm{Concat}(F_{pool}, F_{conv})\big)\big)$;
[0074] where $\mathrm{Concat}(\cdot)$ denotes concatenation along the channel dimension. The initial feature map $F_0$ serves as the input to the feature extraction network and enters the subsequent dual-path parallel feature extraction stage.
[0075] Furthermore, the plurality of scale levels includes a first scale level, a second scale level, and a third scale level;
[0076] The feature map resolution of the first scale level is 1 / 8 of the input image, the feature map resolution of the second scale level is 1 / 16 of the input image, and the feature map resolution of the third scale level is 1 / 32 of the input image.
[0077] Specifically, the data enters the deep extraction stage of the backbone network, where dual-path parallel processing is performed at multiple scale levels. These multiple scale levels include a first scale level P3, a second scale level P4, and a third scale level P5. The feature map resolution of the first scale level P3 is 1 / 8 of the input image, the feature map resolution of the second scale level P4 is 1 / 16 of the input image, and the feature map resolution of the third scale level P5 is 1 / 32 of the input image. This multi-scale design enables the network to simultaneously capture crack targets of different scales, effectively detecting everything from small cracks to large through-cracks.
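As a quick check of the scale levels above, the feature-map sizes follow directly from the downsampling factors 8, 16, and 32. The helper below is an illustrative sketch with a hypothetical name, assuming the input height and width are divisible by 32.

```python
# Downsampling factor of each scale level relative to the input image.
STRIDES = {"P3": 8, "P4": 16, "P5": 32}


def feature_map_sizes(height: int, width: int) -> dict:
    """Return the (height, width) of the feature map at each scale level."""
    return {name: (height // s, width // s) for name, s in STRIDES.items()}
```

For the preferred 640×640 input this gives 80×80 at P3, 40×40 at P4, and 20×20 at P5, so the finest level keeps the most spatial detail for small cracks while the coarsest level covers the largest context.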
[0078] Furthermore, the step of obtaining the first feature map includes:
[0079] Perform standard convolution downsampling on the input features;
[0080] The downsampled features are extracted using a residual module containing multiple stacked bottleneck units to obtain the first feature map.
[0081] Furthermore, the step of obtaining the second feature map includes:
[0082] Perform depthwise separable convolution downsampling on the input features;
[0083] The downsampled features are extracted by stacking multiple lightweight convolution operators. The outputs of each layer are concatenated along the channel dimension and then aggregated by a squeeze convolution and an excitation convolution. Finally, a residual connection is made with the input to obtain the second feature map.
[0084] Specifically, for each scale level, let the input features be $F_{in}$. Features are extracted in parallel through the first feature extraction path (path A) and the second feature extraction path (path B).
[0085] The first feature extraction path uses standard convolution to extract semantic features. For the input features $F_{in}$, standard convolutional downsampling with a stride of 2 is first performed to reduce the spatial resolution of the feature map. Then, the downsampled features are extracted using a C3k2 residual module containing multiple stacked bottleneck units. The C3k2 module contains $n$ bottleneck units and, through dense residual connections, extracts deep semantic information and overall contour features of the road surface image to obtain the first feature map. The number of bottleneck units, $n$, is determined based on the feature representation requirements at different scale levels. In this embodiment, the preferred value of $n$ is between 2 and 4. The standard convolutional path ensures the model's ability to identify the main crack structure and maintains stable extraction of deep semantic features.
[0086] The second feature extraction path uses depthwise separable convolution to extract detail features. For the input features $F_{in}$, downsampling is first performed using a depthwise separable convolution with a stride of 2. Depthwise separable convolution decomposes standard convolution into two steps: depthwise convolution and pointwise convolution. Its computational cost is only a fraction of that of standard convolution, and the saving grows with the convolution kernel size $k$, which greatly reduces the computational load. Subsequently, the downsampled features are extracted using the HGBlock module. The HGBlock module consists of $m$ layers of lightweight convolution operators stacked together, focusing on capturing high-frequency details such as edge gradient changes and fine textures of road surface cracks.
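The cost saving of depthwise separable convolution can be made concrete by counting multiply-accumulate operations. The sketch below uses the standard textbook cost model (bias terms ignored); the function names and the example channel counts are illustrative assumptions, not values from the source. Under this model the cost ratio is $1/C_{out} + 1/k^2$.

```python
def standard_conv_macs(c_in: int, c_out: int, k: int, h_out: int, w_out: int) -> int:
    """Multiply-accumulates of a standard k x k convolution."""
    return k * k * c_in * c_out * h_out * w_out


def depthwise_separable_macs(c_in: int, c_out: int, k: int, h_out: int, w_out: int) -> int:
    """Multiply-accumulates of a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in * h_out * w_out   # one k x k filter per input channel
    pointwise = c_in * c_out * h_out * w_out   # 1 x 1 conv mixes the channels
    return depthwise + pointwise


def cost_ratio(c_in: int, c_out: int, k: int, h_out: int = 1, w_out: int = 1) -> float:
    """Depthwise separable cost divided by standard cost (equals 1/c_out + 1/k**2)."""
    return depthwise_separable_macs(c_in, c_out, k, h_out, w_out) / standard_conv_macs(
        c_in, c_out, k, h_out, w_out
    )
```

For example, with a 3×3 kernel, 64 input channels, and 128 output channels, the ratio is about 0.119, i.e. roughly one eighth of the standard convolution cost.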
[0087] Let the transformation of the $i$-th lightweight convolution layer be $f_i(\cdot)$. Then the intermediate feature sequence $\{Y_i\}$ is:
[0088] $Y_i = f_i(Y_{i-1}), \quad i = 1, 2, \ldots, m$;
[0089] where $Y_0$ is the input feature and $Y_i$ is the output feature of the $i$-th layer. After the outputs of each layer are concatenated along the channel dimension, the concatenated sequence is aggregated by a squeeze convolution (SC) and an excitation convolution (EC), and a residual connection with the input $Y_0$ gives the second feature map $F_B$:
[0090] $F_B = \mathrm{EC}\big(\mathrm{SC}\big(\mathrm{Concat}(Y_1, Y_2, \ldots, Y_m)\big)\big) + Y_0$;
[0091] The squeeze convolution compresses the number of channels in the concatenated features, while the excitation convolution enhances the response of key features. The residual connection ensures the effective transmission of feature information and avoids the vanishing gradient problem. This lightweight path, through a combination of depthwise separable convolution and HGBlock, significantly reduces computational complexity while maintaining a keen ability to capture minute crack texture features.
[0092] Through the above dual-path parallel design, the corresponding first feature map $F_A$ and second feature map $F_B$ are obtained at each of the three scale levels P3, P4, and P5, providing complementary semantic and detail features for the subsequent gated attention fusion.
[0093] S300: For each scale level, gated attention fusion is performed on the corresponding first feature map and second feature map; adaptive gate weights are generated through local context extraction and global feature extraction, and the first feature map and second feature map are weighted and fused based on the gate weights to obtain a fused feature map;
[0094] Furthermore, the gated attention fusion step includes:
[0095] Local context features are obtained by performing local context extraction on the first feature map and the second feature map respectively through depthwise separable convolution and dilated convolution.
[0096] Global context features are obtained by performing global feature extraction on the first feature map and the second feature map respectively through global average pooling and global max pooling.
[0097] After fusing the local context features and global context features, the adaptive gating weights are generated through an activation function;
[0098] The first feature map and the second feature map are cross-gated and weighted by the gating weights to obtain the fused feature map.
[0099] Specifically, for each scale level, gated attention fusion is performed on the first feature map $F_A$ output by the first feature extraction path and the second feature map $F_B$ output by the second feature extraction path. Adaptive gating weights are generated through local context extraction and global feature extraction to achieve selective fusion of features from both paths.
[0100] First, multi-scale context extraction is performed: local context extraction and global feature extraction are applied separately to the first feature map $F_A$ and the second feature map $F_B$.
[0101] Local context extraction utilizes depthwise separable convolution and dilated convolution to extract local features. Depthwise separable convolution operations are performed on the input features to capture the texture details of local cracks. Subsequently, dilated convolution (DilationConv) is introduced to expand the receptive field, obtaining a larger range of contextual information without increasing the number of parameters, thus obtaining local contextual features. The dilation rate of the dilated convolution is determined based on the feature map resolution at different scale levels, preferably ranging from 2 to 4, to effectively expand the receptive field while maintaining the feature map size.
[0102] Global feature extraction involves parallel execution of global average pooling and global max pooling operations. Global average pooling calculates the average value across the spatial dimension for each channel of the feature map, obtaining channel-level statistical features; global max pooling extracts the maximum response value across the spatial dimension for each channel, capturing salient features. The results of the two pooling operations are then fused to obtain global contextual features. These global contextual features contain the environmental semantic information of the entire road surface, helping to distinguish cracks from background noise.
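The two global pooling operations can be sketched on a small channel-first feature map stored as nested lists. This is an illustrative sketch with a hypothetical function name; how the two pooled vectors are subsequently fused is not specified in the source, so the sketch simply returns both.

```python
from typing import List, Sequence, Tuple


def global_pool(
    feature_chw: Sequence[Sequence[Sequence[float]]],
) -> Tuple[List[float], List[float]]:
    """Return per-channel global average and global max over the spatial dimensions."""
    avg: List[float] = []
    peak: List[float] = []
    for channel in feature_chw:
        values = [v for row in channel for v in row]  # flatten H x W
        avg.append(sum(values) / len(values))         # channel-level statistic
        peak.append(max(values))                      # most salient response
    return avg, peak
```

A channel with a single strong response and an otherwise flat background has a low average but a high maximum, which is why the two statistics carry complementary information.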
[0103] Secondly, adaptive gating weights are generated. The extracted local and global context features are fused using a 1×1 convolution to integrate channel-dimensional information. Subsequently, an adaptive gating weight map is generated using the Sigmoid activation function. The Sigmoid function maps feature values to the [0, 1] interval, and its output value represents the importance of the feature at the corresponding location. In this embodiment, the Sigmoid activation function is preferably used to generate the gating weights; however, in specific application scenarios, the Softmax normalization function can also be used to normalize the weights of multiple feature branches.
[0104] Next, bidirectional cross-gating modulation is performed. To enhance information flow between the two paths, a bidirectional cross-gating interaction mechanism is constructed. Using the fused gating weight information, the first feature map $F_A$ and the second feature map $F_B$ are cross-modulated. Specifically, the semantic features of the first feature map are gated using the texture information of the second feature map, and vice versa, to obtain the interactive features $F_{inter}$. The calculation formula is as follows:
[0105] $F_{inter} = F_A' \odot \sigma(F_B') + F_B' \odot \sigma(F_A')$;
[0106] where $F_A'$ and $F_B'$ are the first and second feature maps after context extraction, respectively, $\sigma(\cdot)$ is the Sigmoid activation function, and $\odot$ denotes element-wise multiplication. This bidirectional cross-gating mechanism allows semantic features and detail features to guide and enhance each other, highlighting effective features related to the crack target while suppressing background noise interference.
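The cross-gating idea, in which each path is weighted by a Sigmoid gate derived from the other path, can be sketched on flattened feature values. This is one plausible reading, not the patent's confirmed formula: the function names are hypothetical, and summing the two gated terms is an assumption made for illustration.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Logistic function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def cross_gate(f_a: Sequence[float], f_b: Sequence[float]) -> List[float]:
    """Gate path A by sigmoid(path B) and path B by sigmoid(path A), element-wise.

    Assumption: the two gated terms are summed to form the interactive feature.
    """
    return [a * sigmoid(b) + b * sigmoid(a) for a, b in zip(f_a, f_b)]
```

Where the detail path responds strongly, its gate approaches 1 and the semantic feature passes through almost unchanged; where it responds weakly or negatively, the gate shrinks and the semantic feature is attenuated, and the same holds in the opposite direction.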
[0107] Finally, the weighted fusion output is completed. The channel dimensions of the interactive features are adjusted by a projection convolution, and a residual connection with the original input then produces the final fused feature map. The residual connection ensures the complete transmission of feature information, avoiding the loss of effective information due to gating operations.
[0108] Through the aforementioned gated attention fusion process, fused feature maps are obtained at the P3, P4, and P5 scale levels, respectively. These fused feature maps combine the semantic robustness of the first feature map with the detail richness of the second feature map, enabling accurate identification of crack targets at different scales under complex lighting conditions and background interference.
[0109] S400: Perform bidirectional feature aggregation on the fused feature map from top to bottom and from bottom to top to obtain multi-scale detection features;
[0110] Furthermore, before performing bidirectional feature aggregation from top to bottom and bottom to top on the fused feature map, the method further includes:
[0111] Multi-scale contextual information is extracted from the fused feature map at the third scale level through max pooling operations with different kernel sizes, and the results are concatenated.
[0112] Enhance the response weights of key feature channels through a self-attention mechanism.
[0113] Specifically, before performing bidirectional feature aggregation from top to bottom and bottom to top on the fused feature map, deep semantic enhancement processing is first performed on the fused feature map of the third scale level P5.
[0114] The fused feature map from the third-scale layer P5 is sequentially input into the Spatial Pyramid Pooling - Fast (SPPF) module and the C2PSA module. The SPPF module extracts multi-scale contextual information through max pooling operations with different effective kernel sizes. Specifically, a max pooling operation with a kernel size of 5 is applied three times in cascade. This cascaded structure achieves multi-scale feature extraction with equivalent receptive fields of 5, 9, and 13. Each pooling stage captures contextual information over a different receptive field range while maintaining the spatial size of the feature map. The pooling results from the multiple scales are then concatenated along the channel dimension to fuse the feature representations from different receptive fields, thereby enhancing the model's ability to understand the overall contour and spatial distribution of the crack target.
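The claim that three cascaded kernel-5 poolings behave like poolings with receptive fields of 5, 9, and 13 can be checked with a one-dimensional sketch. The function names are hypothetical, and stride-1 pooling with "same" padding (border windows simply truncated) is assumed, which is what keeps the feature size unchanged.

```python
from typing import List, Sequence, Tuple


def max_pool_1d(values: Sequence[float], kernel: int) -> List[float]:
    """Stride-1 max pooling that keeps the length by truncating windows at the borders."""
    half = kernel // 2
    n = len(values)
    return [max(values[max(0, i - half): min(n, i + half + 1)]) for i in range(n)]


def sppf_1d(values: Sequence[float], kernel: int = 5) -> Tuple[List[float], ...]:
    """Apply the same pooling three times in cascade and return all three results."""
    y1 = max_pool_1d(values, kernel)   # receptive field 5
    y2 = max_pool_1d(y1, kernel)       # equivalent receptive field 9
    y3 = max_pool_1d(y2, kernel)       # equivalent receptive field 13
    return y1, y2, y3                  # concatenated along channels in the 2D module
```

Reusing one small kernel in cascade gives the same outputs as separate poolings with kernels 9 and 13, but each stage only scans a window of 5, which is the reason this arrangement is cheaper than three independent large-kernel poolings.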
[0115] Subsequently, the response weights of key feature channels are further enhanced using the C2PSA module. The C2PSA module utilizes a multi-head self-attention mechanism to model the channel dimensions of the feature map and calculate the correlation between different channels. Through the learned attention weights, the response intensity of key feature channels related to the crack target is adaptively enhanced, while suppressing interference from background noise channels, generating deeper features with stronger semantic expressive power.
[0116] Furthermore, the bidirectional feature aggregation of the fused feature map, both top-down and bottom-up, includes:
[0117] The fused feature map of the third scale level is upsampled, concatenated in the channel dimension with the fused feature maps of the second scale level and the first scale level, and then fused by convolution to achieve top-down aggregation.
[0118] The fused feature map of the first scale level is downsampled, concatenated in the channel dimension with the fused feature maps of the second and third scale levels, and then fused by convolution to achieve bottom-up aggregation.
[0119] Specifically, top-down feature aggregation is performed first. The fused feature map of the third-scale layer P5 is upsampled to increase its spatial resolution to match that of the second-scale layer P4. The upsampled P5 feature map is then concatenated with the fused feature map of P4 along the channel dimension, and finally, the feature information from both scales is integrated through a convolutional fusion operation. The fusion formula is:
[0120] $F_{\mathrm{out}} = \mathrm{Conv}\left(\mathrm{Concat}\left(\mathrm{Up}(F_{\mathrm{deep}}),\ F_{\mathrm{shallow}}\right)\right)$;
[0121] where $\mathrm{Up}(\cdot)$ denotes the upsampling operation, $F_{\mathrm{deep}}$ denotes the deep features, $F_{\mathrm{shallow}}$ denotes the shallow features, $\mathrm{Concat}(\cdot)$ denotes concatenation along the channel dimension, and $\mathrm{Conv}(\cdot)$ denotes the convolutional fusion operation. Through this process, deep semantic information from the third-scale layer is transferred to the second-scale layer, enhancing the semantic expressive power of the mid-scale feature map.
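A single top-down fusion step (upsample the deep map, concatenate along channels, fuse by convolution) can be sketched as follows. This is a minimal illustration on nested lists: nearest-neighbour 2× upsampling and a 1×1 convolution as the fusion operator are assumptions chosen for brevity, and all function names are hypothetical.

```python
def upsample_nearest_2x(fmap):
    """Nearest-neighbour 2x upsampling of a [C][H][W] nested list."""
    out = []
    for channel in fmap:
        rows = []
        for row in channel:
            wide = [v for v in row for _ in range(2)]  # duplicate each column
            rows.append(wide)
            rows.append(list(wide))                    # duplicate each row
        out.append(rows)
    return out


def concat_channels(a, b):
    """Concatenate two [C][H][W] maps along the channel axis."""
    assert len(a[0]) == len(b[0]) and len(a[0][0]) == len(b[0][0])
    return a + b


def conv1x1(fmap, weights):
    """1x1 convolution: weights[o][c] mixes input channel c into output channel o."""
    h, w = len(fmap[0]), len(fmap[0][0])
    return [
        [[sum(wo[c] * fmap[c][y][x] for c in range(len(fmap))) for x in range(w)]
         for y in range(h)]
        for wo in weights
    ]


def top_down_fuse(deep, shallow, weights):
    """Upsample the deep map, concatenate with the shallow map, then fuse."""
    return conv1x1(concat_channels(upsample_nearest_2x(deep), shallow), weights)


deep = [[[1.0, 2.0], [3.0, 4.0]]]            # 1 channel, 2 x 2 (coarse level)
shallow = [[[0.5] * 4 for _ in range(4)]]    # 1 channel, 4 x 4 (finer level)
fused = top_down_fuse(deep, shallow, weights=[[1.0, 2.0]])  # 2 channels in, 1 out
print(fused[0][0])  # [2.0, 2.0, 3.0, 3.0]
```

The bottom-up direction has the same shape, with a stride-2 convolution in place of the upsampling step.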
[0122] Following the same approach, the fused P4 feature map is upsampled and concatenated with the fused feature map of the first scale layer P3 along the channel dimension, followed by convolutional fusion to achieve a complete top-down feature transfer process. This process allows the shallow high-resolution feature map to acquire rich semantic information from deeper layers, improving the semantic understanding of small crack targets.
[0123] Secondly, bottom-up feature aggregation is performed. The fused feature map of the first scale layer P3 is downsampled by a convolution with a stride of 2 to reduce the spatial resolution of the feature map to match that of the second scale layer P4. The downsampled P3 feature map and the P4 feature map are then concatenated along the channel dimension and integrated through a convolutional fusion operation. This process passes the fine localization information of the shallow high-resolution layer on to the mid-level features, enhancing the spatial localization accuracy of the mid-scale features.
[0124] The fused P4 feature map is further downsampled and concatenated with the feature map of the third-scale layer P5 along the channel dimension, followed by convolutional fusion to complete the bottom-up feature aggregation process. This process transmits shallow spatial detail information back to the deeper layers, enabling the deep features to maintain strong semantic expression while possessing more accurate spatial localization capabilities.
[0125] Through the aforementioned top-down and bottom-up bidirectional feature aggregation, the feature maps at the P3, P4, and P5 scale levels fully exchange semantic and spatial localization information, resulting in multi-scale detection features. These multi-scale detection features combine deep semantic understanding with shallow spatial localization accuracy, ensuring that the model achieves both accurate classification and precise bounding box localization when detecting cracks at different scales.
[0126] S500: Input the multi-scale detection features into the detection head for prediction to obtain the prediction result; perform confidence filtering and non-maximum suppression on the prediction result, and output the crack detection result.
[0127] Furthermore, inputting the multi-scale detection features into the detection head for prediction includes:
[0128] The crack category confidence is predicted by the convolutional layers of the classification branch;
[0129] The bounding box parameters of the crack are predicted by the convolutional layer of the regression branch, and the bounding box parameters include the center point coordinates and width and height dimensions.
[0130] Specifically, the multi-scale detection features obtained in step S400 are input into the detection head for prediction. The multi-scale detection features include feature maps at three levels: the first scale level P3, the second scale level P4, and the third scale level P5, which correspond to the detection of small-scale cracks, medium-scale cracks, and large-scale cracks, respectively.
[0131] A decoupled detection head structure is used to predict feature maps at each scale level. The decoupled detection head separates the classification and regression tasks into two independent prediction branches, which are processed separately by dedicated convolutional layers, avoiding the problem of mutual interference between classification and localization tasks in traditional coupled detection heads.
[0132] For each scale level of the feature map, category prediction is first performed through a classification branch. The classification branch extracts features using a series of convolutional layers, and finally predicts the category confidence of the crack using a 1×1 convolutional layer. The category confidence represents the probability of a crack target existing at each location on the feature map and the confidence level that the target belongs to a specific crack category. The output classification feature map has a dimension of $N_c \times H_i \times W_i$, where $N_c$ is the number of crack categories, and $H_i$ and $W_i$ are the height and width of the feature map at the $i$-th scale level.
[0133] Simultaneously, bounding box prediction is performed through a regression branch. This regression branch also employs a series of convolutional layers to extract features, and finally predicts the bounding box parameters of the crack using a 1×1 convolutional layer. These bounding box parameters include the center point coordinates $(x, y)$ and the width and height dimensions $(w, h)$. To address the elongated shape and blurred edges typically characteristic of road surface cracks, the regression branch employs the Distribution Focal Loss (DFL) function for optimization training. DFL learns the probability distribution of the distances to the four boundaries of the bounding box, rather than directly regressing deterministic values. This effectively represents the uncertainty of crack edge locations, thereby improving the accuracy of locating irregular crack targets. The output regression feature map has a dimension of $N_b \times H_i \times W_i$, where $N_b$ is the number of channels for the bounding box parameters.
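How a distribution over boundary distances turns into a box can be illustrated with a short decoding sketch. It assumes each of the four sides is represented by 16 discrete bins (a hypothetical count), that the predicted distance is the expectation of the softmax distribution over those bins, and that distances are expressed in units of the feature-map stride; none of these specifics are stated in this description.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dfl_expected_distance(logits):
    """Expected bin index under the softmax distribution over distance bins."""
    probs = softmax(logits)
    return sum(i * p for i, p in enumerate(probs))


def decode_box(anchor_xy, side_logits, stride):
    """Convert four per-side distributions (left, top, right, bottom)
    into (cx, cy, w, h) in input-image pixels."""
    ax, ay = anchor_xy
    l, t, r, b = (dfl_expected_distance(s) * stride for s in side_logits)
    x1, y1, x2, y2 = ax - l, ay - t, ax + r, ay + b
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)


# A confident edge: nearly all probability mass on bin 3.
peaked = [-60.0] * 16
peaked[3] = 60.0

# A blurred edge: mass split evenly between bins 2 and 3.
split = [-60.0] * 16
split[2] = split[3] = 60.0

print(round(dfl_expected_distance(peaked), 3))  # 3.0
print(round(dfl_expected_distance(split), 3))   # 2.5
```

The second case shows the point of the distributional form: an ambiguous edge yields an intermediate distance instead of forcing a hard choice between neighbouring bins.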
[0134] The detection heads at the three scale levels output prediction results at their respective scales, yielding the raw prediction tensor containing class confidence and bounding box parameters.
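The size of the raw prediction set can be estimated with simple arithmetic. The sketch below assumes a 640×640 network input, feature-map strides of 8, 16, and 32 as stated elsewhere in this document, and one candidate box per feature-map location, which is an assumption and not something the description specifies.

```python
def prediction_grid(input_size, strides):
    """Return {stride: (height, width)} of the prediction maps for a square input."""
    return {s: (input_size // s, input_size // s) for s in strides}


grids = prediction_grid(640, (8, 16, 32))
total = sum(h * w for h, w in grids.values())

print(grids)  # {8: (80, 80), 16: (40, 40), 32: (20, 20)}
print(total)  # 8400
```

Under these assumptions the post-processing stage starts from several thousand candidates per image, which is why cheap threshold filtering is applied before non-maximum suppression.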
[0135] Subsequently, the prediction results are post-processed, including confidence filtering and non-maximum suppression.
[0136] First, confidence threshold filtering is performed. A confidence threshold is set, all predicted candidate boxes are traversed, and the class confidence score of each candidate box is calculated. Candidate boxes whose class confidence score is below the threshold are removed, and the valid candidate boxes with higher confidence are retained. The threshold value is determined based on the accuracy and recall requirements of the actual application scenario, and is preferably set to 0.25 in this embodiment. Confidence filtering quickly discards a large number of low-quality prediction boxes, reducing the computational burden of the subsequent non-maximum suppression.
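The filtering step can be sketched in a few lines. The candidate representation (a box tuple paired with a list of per-class scores) and the function name are illustrative assumptions; only the 0.25 threshold comes from this embodiment.

```python
def filter_by_confidence(candidates, threshold=0.25):
    """Keep candidates whose best class score reaches the threshold.

    Each candidate is (box, class_scores); each result is (box, class_id, score).
    """
    kept = []
    for box, scores in candidates:
        class_id = max(range(len(scores)), key=scores.__getitem__)
        if scores[class_id] >= threshold:
            kept.append((box, class_id, scores[class_id]))
    return kept


candidates = [
    ((10, 10, 50, 20), [0.10, 0.80]),    # confident, class 1
    ((12, 11, 52, 21), [0.05, 0.20]),    # below threshold, dropped
    ((200, 40, 260, 44), [0.30, 0.10]),  # just above threshold, class 0
]
kept = filter_by_confidence(candidates)
print(len(kept))  # 2
```

Raising the threshold trades recall for fewer false alarms, which is the trade-off the paragraph above leaves to the application scenario.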
[0137] Secondly, the Non-Maximum Suppression (NMS) algorithm is applied to remove redundant detection boxes. For the remaining candidate boxes of the same category, the Intersection over Union (IoU) between any two candidate boxes is calculated. IoU is defined as the ratio of the intersection area to the union area of two bounding boxes, and measures the degree of overlap between the two detection boxes. If the IoU between two candidate boxes is greater than the set NMS threshold, the two candidate boxes are considered to detect the same crack target; the candidate box with the higher confidence is retained, while the redundant candidate box with the lower confidence is suppressed and removed. The NMS threshold value is determined based on the density and overlap of crack targets; in this embodiment, a value of 0.45 is preferred. Non-maximum suppression eliminates duplicate detections of the same target, ensuring that each crack target corresponds to only one optimal prediction box.
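A greedy per-class version of this procedure can be sketched as follows. Boxes are given here as corner coordinates (x1, y1, x2, y2) for convenience, which is an assumption, since the description parameterizes boxes by center, width, and height; the function names are illustrative.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.45):
    """Greedy NMS over (box, class_id, score) tuples, applied within each class."""
    kept = []
    for det in sorted(detections, key=lambda d: d[2], reverse=True):
        box, class_id, _ = det
        # Keep the box unless a higher-scoring kept box of the same class overlaps it too much.
        if all(k[1] != class_id or iou(box, k[0]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept


detections = [
    ((0, 0, 100, 20), 0, 0.90),     # best box for a long thin crack
    ((5, 0, 105, 20), 0, 0.80),     # near-duplicate of the first, same class
    ((0, 100, 100, 120), 0, 0.60),  # separate crack, same class
    ((5, 0, 105, 20), 1, 0.70),     # overlaps the first but belongs to another class
]
result = nms(detections)
print([score for _, _, score in result])  # [0.9, 0.7, 0.6]
```

The duplicate with score 0.80 has an IoU of about 0.90 with the top box and is suppressed, while the overlapping box of a different class survives because suppression is applied only within a category.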
[0138] Finally, the retained bounding box coordinates are mapped back from the network feature map scale to the coordinate system of the original input image. Since the image was scaled and padded during preprocessing, the normalized bounding box coordinates are adjusted according to the scaling ratio and padding parameters and converted to pixel coordinates in the original image coordinate system. The bounding rectangle of each crack is then drawn on the original road surface image and labeled with the crack category and confidence score, and the final crack detection result image is output.
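The inverse of the scale-and-pad preprocessing can be sketched as below. The sketch assumes boxes are already expressed in pixels of the 640×640 network input and that padding is split evenly on both sides of the shorter dimension; the description does not specify the padding placement, so both points are assumptions, as are the function names. The 4096×3072 size is the original resolution cited in Example 2.

```python
def letterbox_params(orig_w, orig_h, target=640):
    """Scale factor and padding of an aspect-preserving resize into a square canvas."""
    scale = min(target / orig_w, target / orig_h)
    new_w, new_h = round(orig_w * scale), round(orig_h * scale)
    pad_x, pad_y = (target - new_w) / 2, (target - new_h) / 2  # assumed centred padding
    return scale, pad_x, pad_y


def to_original(box, orig_w, orig_h, target=640):
    """Map an (x1, y1, x2, y2) box from network-input pixels to original-image pixels."""
    scale, pad_x, pad_y = letterbox_params(orig_w, orig_h, target)
    x1, y1, x2, y2 = box
    mapped = ((x1 - pad_x) / scale, (y1 - pad_y) / scale,
              (x2 - pad_x) / scale, (y2 - pad_y) / scale)
    # Clamp to the image bounds so boxes touching the padding stay valid.
    limits = (orig_w, orig_h, orig_w, orig_h)
    return tuple(min(max(v, 0.0), lim) for v, lim in zip(mapped, limits))


# 4096 x 3072 image: scale 0.15625, resized to 640 x 480, 80 px of padding top and bottom.
print(letterbox_params(4096, 3072))                      # (0.15625, 0.0, 80.0)
print(to_original((160, 200, 480, 440), 4096, 3072))     # (1024.0, 768.0, 3072.0, 2304.0)
```

Subtracting the padding before dividing by the scale matters: reversing the order would shift every box by the padding expressed in the wrong coordinate system.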
[0139] Through the above-described multi-scale detection and post-processing process, the method of the present invention can accurately detect crack targets of different scales and shapes in road surface images acquired by UAVs, realize the automatic identification and location of road surface cracks, and meet the real-time and accuracy requirements of UAV road surface inspection.
[0140] Example 2: This example uses a real-world application scenario of a city's road maintenance department to illustrate the application effect of the method of the present invention in the task of inspecting road surface cracks using a drone.
[0141] The city's road maintenance department is responsible for the daily maintenance of the city's main and secondary roads. Early detection and timely maintenance of road surface cracks are crucial for ensuring road lifespan and driving safety. Traditional manual inspection methods are inefficient and pose safety hazards in busy traffic areas. To achieve efficient and safe detection of road surface cracks, the department adopted this invention's crack detection system based on dual-path feature extraction and gating fusion, mounted on a drone equipped with an onboard embedded computing platform for intelligent inspection of road surface cracks. As shown in Figure 2, the system includes: an image preprocessing module, a dual-path feature extraction module, a gated attention fusion module, a feature pyramid aggregation module, and a detection and post-processing module. The system is used to execute all the process steps of the crack detection method based on dual-path feature extraction and gated fusion in the above embodiments. The working principles and beneficial effects of the system correspond one-to-one with those of the method.
[0142] Specifically, after system deployment, the raw road surface images collected by the UAV are first transmitted to the onboard computing platform. The system preprocesses the images according to step S100, scaling them to 640×640 resolution while maintaining the aspect ratio and zero-padding the shorter side. The RGB three-channel pixel values are normalized and then standardized using the mean and standard deviation obtained from the training dataset, and the data is converted into the tensor format that is input to the network.
[0143] The preprocessed tensor first passes through the HGStem module, which uses a dual-branch structure to downsample the feature map resolution from 640×640 to 80×80, outputting an initial feature map with 64 channels.
[0144] In the feature extraction stage of the backbone network, dual-path parallel processing is performed at three scale levels: P3 (80×80), P4 (40×40), and P5 (20×20). As shown in Figure 3, at layer P3, path A uses standard convolution and a C3k2 module containing 3 bottleneck units to output a semantic feature map with 256 channels; path B uses depthwise separable convolution and an HGBlock module to output a detail feature map with 256 channels. Layers P4 and P5 are processed using the same architecture, with output channels of 512 and 1024, respectively.
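The parameter saving of the depthwise separable path relative to the standard convolution path can be illustrated with the usual weight-count formulas. The sketch assumes a single channel-preserving 3×3 layer without bias or normalization parameters at each of the three channel widths; it does not count the full C3k2 or HGBlock modules.

```python
def standard_conv_params(c_in, c_out, k=3):
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k=3):
    """Weights of a k x k depthwise convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


for channels in (256, 512, 1024):
    std = standard_conv_params(channels, channels)
    dws = depthwise_separable_params(channels, channels)
    print(channels, std, dws, round(std / dws, 2))
# 256 589824 67840 8.69
# 512 2359296 266752 8.84
# 1024 9437184 1057792 8.92
```

For a 3×3 kernel the ratio approaches 9 as the channel count grows, which is the basic reason the detail path adds comparatively little to the model size.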
[0145] At each scale level, the feature maps from both paths are input to a gated attention fusion process. As shown in Figure 4, for the two 256-channel feature maps at layer P3, local context is first extracted using depthwise separable convolutions (kernel size 3×3) and dilated convolutions with a dilation rate of 2. Simultaneously, global average pooling and global max pooling are performed to extract global features. After fusion, a gated weight map is generated using 1×1 convolutions, and the dual-path features are cross-modulated and weighted to output a 256-channel fused feature map. The same operation is performed at layers P4 and P5.
[0146] The P5-level fused features enter the SPPF module, where three cascaded max pooling operations with a kernel size of 5×5 are performed. After the results are concatenated, the channel response is enhanced through the C2PSA module.
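Two ingredients of the gating inputs can be made concrete with a small sketch: the effective window covered by a dilated 3×3 kernel, and the per-channel descriptors produced by global average and global max pooling. The function names are illustrative, and the sketch does not attempt to reproduce the gating formula itself.

```python
def effective_kernel(k, dilation):
    """Side length of the window spanned by a k x k kernel with the given dilation."""
    return k + (k - 1) * (dilation - 1)


def global_descriptors(fmap):
    """Per-channel global average and global max of a [C][H][W] nested list."""
    avg, mx = [], []
    for channel in fmap:
        values = [v for row in channel for v in row]
        avg.append(sum(values) / len(values))
        mx.append(max(values))
    return avg, mx


print(effective_kernel(3, 2))  # 5

fmap = [
    [[1.0, 2.0], [3.0, 6.0]],  # channel 0
    [[0.0, 0.0], [0.0, 4.0]],  # channel 1: a single strong response
]
print(global_descriptors(fmap))  # ([3.0, 1.0], [6.0, 4.0])
```

A 3×3 kernel with a dilation rate of 2 therefore covers a 5×5 neighbourhood with the same nine weights, and the max descriptor preserves the isolated strong response in channel 1 that the average dilutes.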
[0147] In the feature pyramid aggregation stage, feature P5 is upsampled and then merged with P4, and then upsampled again and merged with P3 to complete top-down aggregation; subsequently, feature P3 is downsampled and then merged with P4, and then downsampled again and merged with P5 to complete bottom-up aggregation.
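The tensor shapes involved in the two aggregation passes can be traced with a short bookkeeping sketch using the channel counts and resolutions given above. It assumes that each fusion convolution maps the concatenated tensor back to the receiving level's own channel count and that resampling leaves the channel count unchanged; the description does not state the fused channel widths, so the concatenated widths shown are illustrative.

```python
def aggregate_shapes(levels):
    """Trace (direction, route, concatenated_channels, spatial_size) per fusion step.

    levels maps a level name to (channels, size), ordered from shallow to deep.
    """
    names = list(levels)
    steps = []
    # Top-down: upsample the deeper map by 2 and concatenate with the shallower one.
    for i in range(len(names) - 1, 0, -1):
        deep, shallow = names[i], names[i - 1]
        (c_d, s_d), (c_s, s_s) = levels[deep], levels[shallow]
        assert s_d * 2 == s_s, "2x upsampling must match the shallower resolution"
        steps.append(("top-down", f"{deep}->{shallow}", c_d + c_s, s_s))
    # Bottom-up: stride-2 downsample the shallower map and concatenate with the deeper one.
    for i in range(len(names) - 1):
        shallow, deep = names[i], names[i + 1]
        (c_s, s_s), (c_d, s_d) = levels[shallow], levels[deep]
        assert s_s // 2 == s_d, "stride-2 downsampling must match the deeper resolution"
        steps.append(("bottom-up", f"{shallow}->{deep}", c_s + c_d, s_d))
    return steps


levels = {"P3": (256, 80), "P4": (512, 40), "P5": (1024, 20)}
steps = aggregate_shapes(levels)
for step in steps:
    print(step)
# ('top-down', 'P5->P4', 1536, 40)
# ('top-down', 'P4->P3', 768, 80)
# ('bottom-up', 'P3->P4', 768, 40)
# ('bottom-up', 'P4->P5', 1536, 20)
```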
[0148] Finally, the detection features at the P3, P4, and P5 scales are input into the decoupled detection head. The classification branch predicts the crack category confidence, and the regression branch predicts the bounding box parameters. In the post-processing stage, the confidence threshold is set to 0.25, the NMS threshold is set to 0.45, and redundant detection boxes are filtered out before the coordinates are mapped back to the original 4096×3072 resolution image.
[0149] Figure 5 shows the system's detection results for different road surface scenarios. From left to right, they include: longitudinal cracks in the curb area, transverse cracks in the center of the road surface, alligator cracks on the gray road surface, cracks near road markings, mesh cracks in the parking space area, and ring cracks around manhole covers. The location and type of each crack are marked with differently colored bounding boxes in the detection results image, verifying the system's ability to detect cracks of different shapes and scales.
[0150] The system processes a single image in approximately 28 milliseconds, meeting the requirements for real-time inspection and providing road maintenance departments with an efficient and intelligent detection solution.
[0151] Finally, it should be noted that the above descriptions are merely preferred embodiments of the present invention and are not intended to limit the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing embodiments or make equivalent substitutions for some of the technical features. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A crack detection method based on dual-path feature extraction and gating fusion, characterized in that the method comprises:
S100: Acquire the road surface image to be detected, and perform size adjustment and normalization processing on the road surface image to obtain the network input tensor;
S200: Convolutional downsampling is performed on the network input tensor to obtain intermediate features; max pooling is performed on the intermediate features to obtain first branch features; cascaded convolution operations are performed on the intermediate features to obtain second branch features; the first branch features and the second branch features are concatenated along the channel dimension and then fused by convolution to obtain an initial feature map; the initial feature map is input into the feature extraction network, and features are extracted in parallel through the first feature extraction path and the second feature extraction path at multiple scale levels; the first feature extraction path uses standard convolution to extract semantic features to obtain a first feature map; the second feature extraction path uses depthwise separable convolution to extract detail features to obtain a second feature map; the multiple scale levels include a first scale level, a second scale level, and a third scale level; the feature map at the first scale level has a resolution of 1/8 of the input image, the feature map at the second scale level has a resolution of 1/16 of the input image, and the feature map at the third scale level has a resolution of 1/32 of the input image; the steps for obtaining the first feature map include: performing standard convolution downsampling on the input features; and extracting features from the downsampled features using a residual module containing multiple stacked bottleneck units to obtain the first feature map;
S300: For each scale level, gated attention fusion is performed on the corresponding first feature map and second feature map; adaptive gating weights are generated through local context extraction and global feature extraction, and the first feature map and second feature map are weighted and fused based on the gating weights to obtain a fused feature map; the steps of the gated attention fusion include: local context features are obtained by performing local context extraction on the first feature map and the second feature map respectively through depthwise separable convolution and dilated convolution; global context features are obtained by performing global feature extraction on the first feature map and the second feature map respectively through global average pooling and global max pooling; after fusing the local context features and global context features, the adaptive gating weights are generated through an activation function; the first and second feature maps are cross-gated and weighted using the gating weights to obtain the fused feature map; multi-scale context information is extracted from the fused feature map at the third scale level by max pooling operations with different kernel sizes and then concatenated; the response weights of key feature channels are enhanced by a self-attention mechanism;
S400: Perform bidirectional feature aggregation on the fused feature map from top to bottom and bottom to top to obtain multi-scale detection features, including: upsampling the fused feature map at the third scale level, then concatenating it with the fused feature maps at the second scale level and the first scale level in the channel dimension and performing convolutional fusion to achieve top-down aggregation; downsampling the fused feature map at the first scale level, then concatenating it with the fused feature maps at the second scale level and the third scale level in the channel dimension and performing convolutional fusion to achieve bottom-up aggregation;
S500: Input the multi-scale detection features into the detection head for prediction to obtain the prediction result; perform confidence filtering and non-maximum suppression on the prediction result, and output the crack detection result.
2. The crack detection method based on dual-path feature extraction and gating fusion according to claim 1, characterized in that the steps for obtaining the second feature map include: performing depthwise separable convolution downsampling on the input features; extracting features from the downsampled features by stacking multiple lightweight convolution operators; concatenating the outputs of each layer along the channel dimension and aggregating them by squeeze convolution and excitation convolution; and finally making a residual connection with the input to obtain the second feature map.
3. The crack detection method based on dual-path feature extraction and gating fusion according to claim 1, characterized in that inputting the multi-scale detection features into the detection head for prediction includes: predicting the crack category confidence by the convolutional layers of the classification branch; and predicting the bounding box parameters of the crack by the convolutional layers of the regression branch, wherein the bounding box parameters include the center point coordinates and the width and height dimensions.
4. A system for implementing the crack detection method based on dual-path feature extraction and gating fusion as described in any one of claims 1-3, characterized in that the system comprises: an image preprocessing module, used to acquire the road surface image to be detected, and to perform size adjustment and normalization processing on the road surface image to obtain the network input tensor; a dual-path feature extraction module, used to input the network input tensor into the feature extraction network, and extract features in parallel through the first feature extraction path and the second feature extraction path at multiple scale levels, wherein the first feature extraction path uses standard convolution to extract semantic features to obtain a first feature map, and the second feature extraction path uses depthwise separable convolution to extract detail features to obtain a second feature map; a gated attention fusion module, used to perform gated attention fusion on the corresponding first feature map and second feature map for each scale level, wherein adaptive gating weights are generated by extracting local context and global features, and the first feature map and second feature map are weighted and fused based on the gating weights to obtain a fused feature map; a feature pyramid aggregation module, used to perform bidirectional feature aggregation on the fused feature map from top to bottom and from bottom to top to obtain multi-scale detection features; and a detection and post-processing module, used to input the multi-scale detection features into the detection head for prediction to obtain the prediction results, perform confidence filtering and non-maximum suppression on the prediction results, and output the crack detection results.