Bridge concrete crack detection method in complex background based on deep learning

By combining deep learning methods with the U-Net network and the Transformer model, the problems of low efficiency and insufficient accuracy in detecting concrete cracks in bridges under complex backgrounds are solved, and fast and accurate crack segmentation and identification are achieved.

CN116823800B · Active · Publication Date: 2026-06-26 · CHONGQING JIAOTONG UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CHONGQING JIAOTONG UNIV
Filing Date
2023-07-17
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing technologies are inefficient and inaccurate in detecting cracks in bridge concrete under complex backgrounds, making it difficult to quickly and accurately segment and identify them.

Method used

A deep learning-based bridge crack detection method is adopted, which uses an attention fusion feature extraction network and a multi-scale convolutional attention module in the U-Net network architecture, combined with a shallow feature extraction network of Transformer. The model is trained by cross-entropy loss and Dice loss to achieve the fusion of high-level semantic features and positional contour features of cracks.

Benefits of technology

It achieves fast and accurate segmentation and recognition of bridge concrete cracks in complex backgrounds, reduces misjudgment of pixels in complex backgrounds, and improves detection efficiency and accuracy.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a bridge concrete crack detection method under a complex background based on deep learning, which uses a bridge crack identification and detection model trained by a deep learning method to segment, identify and predict crack regions in a bridge crack image; the bridge crack identification and detection model extracts a high-level semantic feature map of the bridge crack image through an attention fusion feature extraction network, extracts a position contour feature map of the bridge crack image through a shallow feature extraction network, and then predicts a crack segmentation and identification detection result of the bridge crack image according to the fusion result of the two, so as to reduce the misjudgment of complex background pixels and accurately realize the positioning of the crack region; the method can more quickly and accurately realize the segmentation, identification and extraction of the bridge concrete crack under the complex background, thereby improving the problems of slow crack segmentation and identification speed and insufficient accuracy caused by excessive background noise.

Description

Technical Field

[0001] This invention relates to the fields of bridge structure inspection technology and neural network technology, specifically to a method for detecting concrete cracks in bridges under complex backgrounds based on deep learning.

Background Technology

[0002] With the development and improvement of China's infrastructure and transportation system, the number of bridges built is rapidly increasing. At the same time, more and more bridges are entering the stage requiring maintenance and repair. The most common bridge defect is concrete cracking, accounting for over 76% of concrete bridge damage. The main causes of crack formation are as follows: the low tensile strength of concrete structures, which leads to uneven stress and cracking under prolonged excessive load; frequent exposure to the elements, which makes concrete structures susceptible to deformation from temperature changes and frequent rain erosion; and the quality of construction materials and construction techniques, which directly affects the quality and service life of the concrete structure. The location and morphology of cracks on bridges provide a wealth of information about internal structural damage, degradation, and potential risks. Therefore, crack detection in the concrete structure of bridges is a necessary and crucial task for assessing the bridge's health and for subsequent structural maintenance and repair.

[0003] Currently, commonly used methods for detecting cracks in bridge concrete mainly fall into two categories: manual inspection and image-based vision. Manual inspection relies primarily on human observation, with inspectors using telescopes, ropes, scaffolding, trucks, etc.; however, manual inspection is inefficient, prone to missing cracks, and suffers from high labor costs and significant safety risks. Image-based vision methods, on the other hand, rely on the development of drone technology. Drones equipped with cameras acquire images of the entire bridge surface, and then image-based vision recognition methods are used to analyze the acquired images to detect bridge cracks.

[0004] Image-based visual methods are efficient but also complex, making them an important research direction for bridge concrete crack detection. In 2017, ZHANG et al. established an efficient architecture based on a convolutional neural network (CNN) that could detect 3D asphalt surface cracks at the pixel level, but it lacked a pooling layer. Yuan Weiqi et al. used a grayscale thresholding method combined with interference removal based on a combination of binary and grayscale images to identify cracks, but this was not suitable for situations where the difference between cracks and the background was not significant. Ruan Xiaoli et al. introduced the idea of treating crack regions as connected regions and filtering out non-cracks based on crack feature parameters, which could identify the width of smaller cracks, but was still limited by grayscale differences. In 2018, YANG et al. introduced a fully convolutional network (FCN) to solve crack identification and measurement simultaneously, but the measurement error was large. Wang Sen et al. constructed a novel CrackFCN model, achieving high-precision crack detection and reducing false labeling in complex backgrounds, but the processing efficiency was still not high enough. In 2019, Zhou Ying et al. combined crack fragment stitching and image processing methods to achieve high-precision measurement of crack width, but the identification of cracks in more complex backgrounds still needs to be improved.

[0005] Therefore, how to more quickly and accurately segment and identify bridge concrete cracks in complex backgrounds has become a problem that needs further research and solutions.

Summary of the Invention

[0006] In view of the above-mentioned shortcomings of the existing technology, the purpose of this invention is to provide a method for detecting bridge concrete cracks in complex backgrounds based on deep learning, which can achieve faster and more accurate segmentation and identification of bridge concrete cracks in complex backgrounds.

[0007] To solve the above-mentioned technical problems, the present invention adopts the following technical solution:

[0008] A method for detecting concrete cracks in bridges under complex backgrounds based on deep learning is proposed. The method acquires the image of the bridge crack to be processed, inputs it into a pre-trained bridge crack recognition and detection model, and obtains the crack segmentation, recognition and detection results of the image of the bridge crack to be processed.

[0009] The bridge crack identification and detection model includes an attention fusion feature extraction network based on the U-Net network architecture and a shallow feature extraction network based on a multi-scale convolutional attention module and a Transformer. The bridge crack identification and detection model takes the input bridge crack image as the input to the multi-scale convolutional attention module and the shallow feature extraction network, respectively. The attention fusion feature extraction network extracts the high-level semantic feature map of the bridge crack image, and the shallow feature extraction network extracts the positional contour feature map of the bridge crack image. Then, the high-level semantic feature map and the positional contour feature map of the bridge crack image are multiplied together to fuse and generate the crack segmentation identification and detection result of the bridge crack image as the output.

[0010] In the above-mentioned method for detecting bridge concrete cracks in complex backgrounds based on deep learning, as a preferred embodiment, the attention fusion feature extraction network includes four encoding layers, one bottleneck layer, and four decoding layers connected in sequence.

[0011] Each encoding layer includes a cascaded multi-scale convolutional attention module and a strip pooling module; wherein, the input of the encoding layer serves as the input of the multi-scale convolutional attention module, the output of the multi-scale convolutional attention module serves as the input of the strip pooling module, and the output of the strip pooling module serves as the output of the encoding layer;

[0012] Each decoding layer consists of two cascaded convolutional layers and a convolutional upsampling module; wherein, the input of the decoding layer serves as the input of the first convolutional layer, the output of the first convolutional layer serves as the input of the second convolutional layer, the output of the second convolutional layer serves as the input of the convolutional upsampling module, and the output of the convolutional upsampling module serves as the output of the decoding layer;

[0013] The bottleneck layer consists of two cascaded convolutional layers;

[0014] In this network, the bridge crack image input to the attention fusion feature extraction network serves as the input to the first encoding layer; the output of each encoding layer serves as the input to the next layer it is connected to in the attention fusion feature extraction network; the output of the fourth encoding layer serves as the input to the bottleneck layer, and the output of the bottleneck layer, after convolutional upsampling, is concatenated with the output of the multi-scale convolutional attention module in the fourth encoding layer to serve as the input to the fourth decoding layer; the input to each decoding layer is a stitched image formed by concatenating the output of the multi-scale convolutional attention module in its corresponding encoding layer with the output of the previous layer in the attention fusion feature extraction network; and the output of the first decoding layer serves as the overall output of the attention fusion feature extraction network.

[0015] In the above-mentioned method for detecting bridge concrete cracks in complex backgrounds based on deep learning, as a preferred embodiment, the shallow feature extraction network includes two multi-scale convolutional attention modules, one strip pooling module, and one Transformer module.

[0016] In this system, the bridge crack image input to the shallow feature extraction network is used as the input to the first multi-scale convolutional attention module and the Transformer module, respectively. The output of the first multi-scale convolutional attention module is used as the input to the strip pooling module. The output of the strip pooling module is concatenated with the output of the Transformer module and used as the input to the second multi-scale convolutional attention module. The output of the second multi-scale convolutional attention module is used as the overall output of the shallow feature extraction network.

[0017] In the above-mentioned deep learning-based method for detecting bridge concrete cracks in complex environments, as a preferred embodiment, the processing procedure of the multi-scale convolutional attention module includes the following steps:

[0018] After performing a 5×5 depthwise convolution on the input feature map, depthwise strip convolutions are performed on the depthwise convolution map in two separate branches. One branch performs 1×7 and 7×1 strip convolutions, and the other branch performs 1×11 and 11×1 strip convolutions. The two feature maps of different scales obtained from the strip convolutions of the two branches are concatenated with the depthwise convolution map, and then a multi-channel feature map is calculated through a convolution with a kernel of 1. Finally, the multi-channel feature map is multiplied with the input feature map to obtain the attention feature map as the output.

[0019] In the above-mentioned deep learning-based method for detecting bridge concrete cracks in complex environments, as a preferred embodiment, the processing procedure of the strip pooling module includes the following steps:

[0020] The input feature map is subjected to pixel width strip pooling and pixel height strip pooling respectively, resulting in a pixel width pooled feature map of size 1×W and a pixel height pooled feature map of size H×1, where H and W are the pixel height and pixel width of the input feature map, respectively. Then, the pixel width pooled feature map is expanded by a one-dimensional convolution operation to obtain an expanded feature map of size H×W based on pixel width, and the pixel height pooled feature map is expanded by a one-dimensional convolution operation to obtain an expanded feature map of size H×W based on pixel height. The two expanded feature maps are superimposed and fused, and then successively subjected to a convolution operation with a kernel of 1 and a sigmoid activation function operation to obtain an activation feature map. Finally, the activation feature map is multiplied by the input feature map to obtain the pooled downsampled feature map as the output.

[0021] In the above-mentioned deep learning-based method for detecting bridge concrete cracks in complex environments, as a preferred embodiment, the bridge crack identification and detection model is trained in the following manner:

[0022] Bridge crack sample images with pre-annotated crack-region segmentation masks are used as training samples to form a training sample set, which is then input into the bridge crack recognition and detection model. A total loss function containing cross-entropy loss and Dice loss is constructed to evaluate the bridge crack recognition performance. The model parameters of the bridge crack recognition and detection model are optimized and updated with the goal of minimizing the total loss function, thereby training the bridge crack recognition and detection model.

[0023] In the above-mentioned deep learning-based method for detecting bridge concrete cracks in complex backgrounds, as a preferred embodiment, the total loss function is:

[0024] $L_{Total} = L_{CE} + \lambda L_{Dice}$;

[0025]

[0026]

[0027] where $L_{Total}$ denotes the total loss function, $L_{CE}$ denotes the cross-entropy loss, and $L_{Dice}$ denotes the Dice loss; $g_i$ denotes the true segmentation mask value at the i-th pixel position in the bridge crack sample image, and $p_i$ denotes the segmentation mask value at the i-th pixel position in the bridge crack sample image predicted by the bridge crack identification and detection model, $i \in \{1, 2, \ldots, H \times W\}$, where $H$ and $W$ are the pixel height and pixel width of the bridge crack sample image, respectively; $\lambda$ is the weight coefficient.
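Paragraphs [0025] and [0026] carry no formula here, so the exact component losses are not known. As a minimal sketch, assuming the standard binary cross-entropy and the usual soft Dice form (both assumptions, not necessarily the formulas of the patent), the total loss could be computed as follows. The names `total_loss`, `lam`, and `eps` are hypothetical; `pred` holds the predicted values p_i and `target` the true values g_i, flattened over all H×W pixels.

```python
import math


def total_loss(pred, target, lam=1.0, eps=1e-7):
    """L_Total = L_CE + lam * L_Dice over flattened per-pixel values.

    pred   -- predicted crack probabilities p_i in [0, 1]
    target -- ground-truth mask values g_i in {0, 1}
    Both component formulas are assumed standard forms, not taken from the source.
    """
    n = len(pred)
    # Assumed binary cross-entropy, averaged over pixels; eps guards log(0).
    ce = -sum(
        g * math.log(max(p, eps)) + (1 - g) * math.log(max(1 - p, eps))
        for p, g in zip(pred, target)
    ) / n
    # Assumed soft Dice loss: 1 - 2*sum(g*p) / (sum(g) + sum(p)).
    overlap = sum(p * g for p, g in zip(pred, target))
    dice = 1.0 - (2.0 * overlap + eps) / (sum(pred) + sum(target) + eps)
    return ce + lam * dice
```

The cross-entropy term scores each pixel independently, while the Dice term scores the overlap of the predicted and true crack regions as a whole, which matters when cracks occupy only a small fraction of the image.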

[0028] Compared with the prior art, the present invention has the following beneficial technical effects:

[0029] 1. This invention uses deep learning methods to train a bridge crack recognition and detection model. After training, it performs segmentation, recognition and prediction of crack regions in bridge crack images, which can quickly achieve segmentation, recognition and detection of concrete cracks in bridges under complex backgrounds.

[0030] 2. The bridge crack identification and detection model used in the method of the present invention extracts high-level semantic feature maps of bridge crack images through an attention fusion feature extraction network and extracts positional contour feature maps of bridge crack images through a shallow feature extraction network. Then, based on the fusion result of the two, the crack segmentation identification and detection result of the bridge crack image is predicted. It comprehensively utilizes the wide channel characteristics of the attention fusion feature extraction network and the crack detail enhancement performance of the shallow feature extraction network, reduces the misjudgment of complex background pixels, and accurately realizes the localization of crack areas.

[0031] 3. This invention improves the training effect and recognition performance of the bridge crack identification and detection model by jointly training the model using cross-entropy loss and Dice loss.

[0032] 4. The present invention provides a method for detecting bridge concrete cracks in complex backgrounds based on deep learning. This method can achieve faster and more accurate segmentation, identification and extraction of bridge concrete cracks in complex backgrounds, thereby improving the problems of slow crack segmentation and identification speed and insufficient accuracy caused by excessive background noise.

Attached Figure Description

[0033] Figure 1 is a schematic diagram of the framework of the bridge crack identification and detection model used in the method of the present invention.

[0034] Figure 2 is a schematic diagram of the structure of the attention fusion feature extraction network.

[0035] Figure 3 is a schematic diagram of the processing flow of the multi-scale convolutional attention module (MSCA).

[0036] Figure 4 is a schematic diagram of the processing flow of the strip pooling module (SP).

[0037] Figure 5 is a schematic diagram of the shallow feature extraction network.

Detailed Implementation

[0038] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. The components of the embodiments of the present invention described and shown in the accompanying drawings can generally be arranged and designed in various different configurations. Therefore, the following detailed description of the embodiments of the present invention provided in the accompanying drawings is not intended to limit the scope of the claimed invention, but only to illustrate selected embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort are within the scope of protection of the present invention.

[0039] The present invention will now be described in further detail with reference to the accompanying drawings and specific embodiments.

[0040] This invention proposes a method for detecting concrete cracks in bridges under complex backgrounds based on deep learning. The method inputs the acquired bridge crack image to be processed into a pre-trained bridge crack recognition and detection model to obtain the crack segmentation, recognition and detection results of the bridge crack image to be processed, thereby realizing the detection of concrete cracks in bridges.

[0041] The bridge crack identification and detection model used in the method of this invention includes an attention fusion feature extraction network based on the U-Net network architecture and a shallow feature extraction network based on a multi-scale convolutional attention module and a Transformer. The bridge crack identification and detection model takes the input bridge crack image as the input to the multi-scale convolutional attention module and the shallow feature extraction network, respectively. The attention fusion feature extraction network extracts the high-level semantic feature map of the bridge crack image, and the shallow feature extraction network extracts the positional contour feature map of the bridge crack image. Then, the high-level semantic feature map and the positional contour feature map of the bridge crack image are multiplied together to fuse and generate the crack segmentation identification and detection result of the bridge crack image as the output.
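The fusion step described above amounts to an element-wise product of the two feature maps. The following is a minimal sketch, assuming both maps have the same size and hold values in [0, 1]. The final thresholding and the parameter `threshold` are illustrative assumptions; the source does not say how the fused map is turned into a binary mask. The function name `fuse_and_segment` is hypothetical.

```python
def fuse_and_segment(semantic, contour, threshold=0.5):
    """Multiply two equally sized 2-D maps element-wise and binarise the result."""
    if len(semantic) != len(contour) or any(
        len(rs) != len(rc) for rs, rc in zip(semantic, contour)
    ):
        raise ValueError("feature maps must have the same shape")
    fused = [[s * c for s, c in zip(rs, rc)] for rs, rc in zip(semantic, contour)]
    # Thresholding is an illustrative assumption, not part of the source text.
    mask = [[1 if v >= threshold else 0 for v in row] for row in fused]
    return fused, mask


semantic = [[0.9, 0.8], [0.2, 0.9]]   # high-level branch responds on three pixels
contour = [[0.9, 0.1], [0.9, 0.8]]    # shallow branch disagrees on pixel (0, 1)
fused, mask = fuse_and_segment(semantic, contour)
```

With a product, a pixel keeps a high score only when both branches respond to it, so a background pixel flagged by one branch alone is suppressed.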

[0042] The following is a more detailed description of the bridge concrete crack detection method based on deep learning in complex backgrounds.

[0043] 1. Image preprocessing

[0044] To better utilize the method of this invention for bridge concrete crack detection, it is best to preprocess the bridge crack images collected in the bridge inspection report, including screening and standardizing resolution and size. Specifically, images with stains, shadows, graffiti, lines, or other defects that could hinder detection can be manually filtered from the bridge inspection report. Then, for images with different sizes and resolutions, image interpolation algorithms can be used to standardize all bridge crack images to the same resolution and size.
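The size standardization step can be sketched as follows. The source only says "image interpolation algorithms"; bilinear interpolation is used here as one common choice, which is an assumption. The image is taken to be a grayscale list of rows, and the function name `resize_bilinear` is hypothetical.

```python
def resize_bilinear(img, out_h, out_w):
    """Resize a grayscale image (list of rows) with bilinear interpolation."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for y in range(out_h):
        # Map the output pixel centre back into input coordinates, then clamp.
        fy = min(max((y + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1.0)
        y0 = int(fy)
        y1 = min(y0 + 1, in_h - 1)
        wy = fy - y0
        row = []
        for x in range(out_w):
            fx = min(max((x + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1.0)
            x0 = int(fx)
            x1 = min(x0 + 1, in_w - 1)
            wx = fx - x0
            # Blend the four neighbouring input pixels.
            top = img[y0][x0] * (1 - wx) + img[y0][x1] * wx
            bottom = img[y1][x0] * (1 - wx) + img[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out
```

Calling `resize_bilinear(image, 512, 512)` on every collected image would bring the whole set to one size; the target size of 512 is only an example value.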

[0045] After these preprocessing steps, the images can be used for detecting concrete cracks in bridges. Furthermore, once their crack regions have been annotated with segmentation masks, these preprocessed bridge crack images can serve as sample images for the bridge crack identification and detection model, for use as training or test samples.

[0046] 2. Bridge Crack Identification and Detection Model

[0047] After training, the bridge crack identification and detection model is used to segment and identify cracks in bridge crack images. The framework of the bridge crack identification and detection model used in the method of this invention is shown in Figure 1. It mainly consists of two parts: an attention fusion feature extraction network and a shallow feature extraction network, which will be described in detail below.

[0048] 2.1 Attention Fusion Feature Extraction Network

[0049] The attention fusion feature extraction network, based on the U-Net network architecture, is used to extract high-level semantic feature maps from bridge crack images. In practical implementation, the attention fusion feature extraction network is modified from the U-Net network architecture. The modification involves introducing a strip pooling (SP) module into the U-Net network to replace the conventional convolutional downsampling processing in the encoding structure. This is done to collect rich contextual information to detect crack information in complex scenes and reduce misjudgments of background pixels. At the same time, the conventional convolutional blocks in the encoding structure of the conventional U-Net network are replaced with a multi-scale convolutional attention (MSCA) module. This module aggregates local information while establishing relationships between different channels, evoking spatial attention to crack information, thereby suppressing interference from complex backgrounds such as lines and graffiti. Ultimately, this achieves multi-scale information interaction and improves the accuracy of crack detection.

[0050] Specifically, the structure of the attention fusion feature extraction network is based on the U-Net network architecture. As shown in Figure 2, it includes four encoding layers, one bottleneck layer, and four decoding layers connected in sequence.

[0051] Each encoding layer includes a cascaded multi-scale convolutional attention module (MSCA) and a strip pooling module (SP); wherein, the input of the encoding layer serves as the input of the multi-scale convolutional attention module (MSCA), the output of the multi-scale convolutional attention module (MSCA) serves as the input of the strip pooling module (SP), and the output of the strip pooling module (SP) serves as the output of the encoding layer.

[0052] Each decoding layer includes two cascaded convolutional layers (Conv) and a convolutional upsampling module (UC); wherein, the input of the decoding layer serves as the input of the first convolutional layer (Conv1), the output of the first convolutional layer (Conv1) serves as the input of the second convolutional layer (Conv2), the output of the second convolutional layer (Conv2) serves as the input of the convolutional upsampling module (UC), and the output of the convolutional upsampling module (UC) serves as the output of the decoding layer.

[0053] The bottleneck layer consists of two cascaded convolutional layers, Conv.

[0054] As shown in Figure 2, the bridge crack image input to the attention fusion feature extraction network is used as the input to the first encoding layer; the output of each encoding layer is used as the input to the next layer it is connected to in the attention fusion feature extraction network; the output of the fourth encoding layer is used as the input to the bottleneck layer, and the output of the bottleneck layer, after convolutional upsampling, is concatenated with the output of the multi-scale convolutional attention module in the fourth encoding layer to form the input to the fourth decoding layer; the input of each decoding layer is the concatenation of the output of the multi-scale convolutional attention module in its corresponding encoding layer with the output of the previous layer in the attention fusion feature extraction network; the output of the first decoding layer is used as the overall output of the attention fusion feature extraction network.

[0055] The processing procedure of the multi-scale convolutional attention module (MSCA) is shown in Figure 3. Specifically: after performing a 5×5 depthwise convolution on the input feature map, depthwise strip convolutions are performed on the depthwise convolution map in two separate branches. One branch performs 1×7 and 7×1 strip convolutions, and the other branch performs 1×11 and 11×1 strip convolutions. The two feature maps of different scales obtained by the strip convolutions of the two branches are concatenated with the depthwise convolution map, and then a multi-channel feature map is calculated by a convolution with a kernel of 1. Finally, the multi-channel feature map is multiplied with the input feature map to obtain the attention feature map as the output.
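The MSCA steps can be traced on a single channel with plain Python. This is a minimal sketch under stated assumptions: the second branch is taken as a 1×11 convolution followed by an 11×1 convolution, by analogy with the first branch; the averaging kernels from the hypothetical helper `box` stand in for learned weights; and the 1×1 convolution over the three concatenated maps is reduced to a weighted sum with the hypothetical weights `mix`. In a depthwise convolution each channel is filtered independently, so one channel is enough to show the flow.

```python
def conv2d_same(img, kernel):
    """Zero-padded, stride-1 2-D convolution (cross-correlation) on one channel."""
    h, w = len(img), len(img[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    yy, xx = y + i - kh // 2, x + j - kw // 2
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += img[yy][xx] * kernel[i][j]
            out[y][x] = acc
    return out


def box(kh, kw):
    """Placeholder averaging kernel; a trained model would learn these weights."""
    return [[1.0 / (kh * kw)] * kw for _ in range(kh)]


def msca(feat, mix=(1 / 3, 1 / 3, 1 / 3)):
    """Single-channel MSCA sketch: returns the attention-weighted input."""
    base = conv2d_same(feat, box(5, 5))                            # 5x5 depthwise conv
    b7 = conv2d_same(conv2d_same(base, box(1, 7)), box(7, 1))      # 1x7 then 7x1
    b11 = conv2d_same(conv2d_same(base, box(1, 11)), box(11, 1))   # 1x11 then 11x1 (assumed)
    h, w = len(feat), len(feat[0])
    # 1x1 convolution over the three concatenated maps = per-pixel weighted sum.
    attn = [[mix[0] * base[y][x] + mix[1] * b7[y][x] + mix[2] * b11[y][x]
             for x in range(w)] for y in range(h)]
    # Element-wise product with the input yields the attention feature map.
    return [[attn[y][x] * feat[y][x] for x in range(w)] for y in range(h)]


# A 12x12 map containing one vertical, crack-like line.
feat = [[1.0 if x == 6 else 0.0 for x in range(12)] for y in range(12)]
out = msca(feat)
```

Because the last step multiplies by the input, positions where the input is zero stay zero, and the attention map only re-weights positions that already carry a response.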

[0056] Simply stacking conventional convolutional blocks does not make the network attend to more crack pixels. The multi-scale convolutional attention module (MSCA) addresses this in two ways. First, it evokes spatial attention through simple element-wise multiplication, increasing the network's focus on cracks, reducing the extraction of useless information, and effectively suppressing interference from background information in the image. Furthermore, the MSCA contains a depthwise convolutional branch, which can aggregate local feature information of cracks, extracting abstract features of cracks even in complex and occluded image backgrounds and enhancing the feature extraction capability of the encoding structure. Second, lightweight strip convolutions reduce computational costs while aiding in the extraction of features from strip-shaped objects, making them suitable for crack detection. Therefore, this invention replaces the standard convolutional blocks in the U-Net network's encoding structure with the MSCA to enhance crack feature extraction.

[0057] The processing procedure of the strip pooling module (SP) is shown in Figure 4. Specifically: pixel width strip pooling and pixel height strip pooling are performed on the input feature map to obtain a pixel width pooled feature map of size 1×W and a pixel height pooled feature map of size H×1, respectively, where H and W are the pixel height and pixel width of the input feature map, respectively; then, the pixel width pooled feature map is expanded by a one-dimensional convolution operation to obtain an expanded feature map of size H×W based on pixel width, and the pixel height pooled feature map is expanded by a one-dimensional convolution operation to obtain an expanded feature map of size H×W based on pixel height. The two expanded feature maps are superimposed and fused, and then successively subjected to a convolution operation with a kernel of 1 and a Sigmoid activation function operation to obtain an activation feature map; finally, the activation feature map is multiplied by the input feature map to obtain the pooled downsampled feature map as the output.
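The strip pooling steps can be sketched on a single channel as follows. Several details are assumptions: average pooling is used because the source does not name the pooling operator, the three-tap `kernel` and the scalar `w1x1` are placeholders for learned weights, and the expansion to H×W is done by repeating each strip along the missing axis. As in the description, the output has the same H×W size as the input; any resolution reduction between encoding layers is not shown.

```python
import math


def conv1d_same(vec, kernel):
    """Zero-padded 1-D convolution that keeps the input length."""
    half = len(kernel) // 2
    return [
        sum(vec[i + j - half] * k for j, k in enumerate(kernel)
            if 0 <= i + j - half < len(vec))
        for i in range(len(vec))
    ]


def strip_pooling(feat, kernel=(0.25, 0.5, 0.25), w1x1=1.0):
    """Single-channel strip pooling gate applied to a 2-D feature map."""
    h, w = len(feat), len(feat[0])
    # Width strip: average each column over the height -> size 1 x W.
    width_strip = [sum(feat[y][x] for y in range(h)) / h for x in range(w)]
    # Height strip: average each row over the width -> size H x 1.
    height_strip = [sum(row) / w for row in feat]
    # 1-D convolutions, then broadcast both strips to H x W and add them.
    width_strip = conv1d_same(width_strip, kernel)
    height_strip = conv1d_same(height_strip, kernel)
    # 1x1 convolution (a scalar weight here) followed by the sigmoid activation.
    gate = [[1.0 / (1.0 + math.exp(-w1x1 * (width_strip[x] + height_strip[y])))
             for x in range(w)] for y in range(h)]
    # The activation map re-weights the input element-wise.
    return [[gate[y][x] * feat[y][x] for x in range(w)] for y in range(h)]
```

Each gate value combines the statistics of one full row and one full column, so a long thin structure aligned with either axis raises the gate along its whole length.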

[0058] Cracks typically occupy a small area in an image and are long and thin. Furthermore, conventional downsampling limits the contextual information of cracks and cannot effectively handle background obstacles such as drawn lines or graffiti. Based on these considerations, this invention introduces a strip pooling module (SP) into the encoding structure to better capture local contextual information, thereby reducing misjudgments of background pixels.

[0059] The attention fusion feature extraction network processes the input bridge crack image as follows: the bridge crack image is processed sequentially by the multi-scale convolutional attention module (MSCA) and strip pooling module (SP) in the first, second, third, and fourth encoding layers, then enters the bottleneck layer for two convolutional layers followed by upsampling. It then sequentially passes through the fourth, third, second, and first decoding layers for convolution and upsampling. During decoding, the upsampled image from the previous layer is concatenated with the feature map output by the multi-scale convolutional attention module in the corresponding encoding layer of the decoding layer, and then input into the decoding layer for processing. Finally, the high-level semantic feature map of the output bridge crack image is obtained.
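The data flow above can be checked with a small resolution trace. This sketch rests on two assumptions that the source does not state: each encoding layer halves the spatial resolution, and the first decoding layer leaves the resolution unchanged so that the output matches the input size. The function name `trace_attention_fusion` and the stage labels are hypothetical.

```python
def trace_attention_fusion(size, levels=4):
    """Return (stage, resolution) pairs for a square input of side `size`."""
    trace, skips = [], []
    res = size
    for k in range(1, levels + 1):
        trace.append((f"enc{k}.MSCA", res))
        skips.append(res)      # the MSCA output is kept for the skip connection
        res //= 2              # assumption: the SP stage halves the resolution
        trace.append((f"enc{k}.SP", res))
    trace.append(("bottleneck.conv x2", res))
    res *= 2
    trace.append(("bottleneck.upsample", res))
    for k in range(levels, 0, -1):
        skip = skips.pop()
        # Concatenation requires both inputs to share a spatial size.
        assert skip == res, "skip connection and upsampled map differ in size"
        trace.append((f"dec{k}.concat+conv x2", res))
        if k > 1:              # assumption: the first decoding layer does not upsample
            res *= 2
            trace.append((f"dec{k}.upsample", res))
    return trace


for stage, res in trace_attention_fusion(256):
    print(f"{stage:22s} {res} x {res}")
```

The internal assertion confirms that, under these assumptions, each decoding layer receives a skip feature map of the same size as the upsampled map it is concatenated with.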

[0060] When the attention fusion feature extraction network processes bridge crack images, the encoding stage employs four layers of multi-scale convolutional attention modules (MSCA) and strip pooling modules (SP). After four layers of strip pooling downsampling, the resolution is progressively reduced to obtain image information at different scales. The crack feature information in the bridge crack image gradually transitions from points, lines, gradients, and other low-level information to contours and more abstract semantic information at the high level. The entire network completes the extraction and combination of features "from fine to coarse," making the high-level semantic feature information extracted from the bridge crack image more comprehensive. On the other hand, in the subsequent decoding stage, if the image is simply upsampled from low resolution to high resolution, it loses sensitivity to detail information. Therefore, the upsampled image from the previous layer is concatenated with the feature map output by the multi-scale convolutional attention module in the encoding layer corresponding to the decoding layer, and then the processing of that decoding layer is performed. This operation directly transfers the accurate gradient, point, line, and other information extracted by the encoding layer to the decoding layer at the same level. This is equivalent to adding detail information to the general area of the crack target, enabling the attention fusion feature extraction network to obtain more accurate crack segmentation and recognition results.

[0061] 2.2 Shallow Feature Extraction Network

[0062] The attention fusion feature extraction network is mainly used to extract high-level semantic features related to cracks in bridge crack images. High-level semantic context can improve the detection performance for larger structures, but crack boundaries may be lost or become discontinuous, and shadows and stains are easily misidentified as cracks. In contrast, the shallow features hidden in bridge crack images better reflect the detailed information of cracks, containing rich location and contour information. Therefore, a shallow branch is added to extract shallow features of cracks and obtain their location and contour information, so that cracks can be located more accurately.

[0063] Specifically, the structure of the shallow feature extraction network is shown in Figure 5. It includes two multi-scale convolutional attention modules, one strip pooling module, and one Transformer module. The multi-scale convolutional attention module MSCA and the strip pooling module SP are the same as described above; the Transformer module is a mature and commonly used network module in the field and will not be described in detail here.

[0064] Here, the bridge crack image input to the shallow feature extraction network is used as the input to both the first multi-scale convolutional attention module MSCA1 and the Transformer module. The output of the first multi-scale convolutional attention module MSCA1 is used as the input to the strip pooling module SP. The output of the strip pooling module SP is concatenated with the output of the Transformer module and used as the input to the second multi-scale convolutional attention module MSCA2. The output of the second multi-scale convolutional attention module MSCA2 is used as the overall output of the shallow feature extraction network.

[0065] The shallow feature extraction network processes the input bridge crack image as follows: the bridge crack image is first processed through two branches. One branch is processed by the first multi-scale convolutional attention module MSCA1 and the strip pooling module SP in sequence. The other branch is processed by the Transformer module for feature extraction. The processing results of the two branches are concatenated and then processed by the second multi-scale convolutional attention module MSCA2 to obtain the position contour feature map of the bridge crack image as the output.

[0066] The shallow feature extraction network extracts the contour features of crack points and lines in bridge crack images through the first multi-scale convolutional attention module (MSCA1) and the strip pooling module (SP), making good use of the multi-scale convolutional attention module's ability to represent local features. At the same time, it extracts global location features in bridge crack images through the Transformer module, making good use of the Transformer module's ability to model global context features. Then, the two are concatenated and then shallowly fused through the second multi-scale convolutional attention module (MSCA2) to obtain the location contour feature map of the bridge crack image, which makes more efficient use of contour and location feature information and prevents information redundancy.
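
The data flow of the shallow feature extraction network can be sketched as follows. This is only a wiring illustration under stated assumptions: the four modules are passed in as hypothetical placeholder callables, feature maps are nested lists indexed [channel][row][column], and the placeholders are assumed to preserve spatial size so that the two branches can be concatenated. The function names `shallow_branch` and `identity` are illustrative and not part of the invention.

```python
def shallow_branch(image, msca1, sp, transformer, msca2):
    """Wire the four modules as described: two parallel paths, then fusion.

    All four callables are hypothetical placeholders that take and return
    feature maps stored as nested lists indexed [channel][row][column].
    """
    local = sp(msca1(image))          # path 1: MSCA1 -> SP (local contours)
    global_ctx = transformer(image)   # path 2: Transformer (global position)
    fused = local + global_ctx        # channel-wise concatenation
    return msca2(fused)               # MSCA2 fuses the concatenated features


def identity(fmap):
    """Shape-preserving stand-in for a real module (assumption)."""
    return fmap


image = [[[0.1, 0.2], [0.3, 0.4]]]    # 1 channel, 2x2 pixels
out = shallow_branch(image, identity, identity, identity, identity)
print(len(out), len(out[0]), len(out[0][0]))  # 2 2 2
```

With identity placeholders the output simply carries one channel from each branch, which makes the concatenation step visible; real modules would change the channel contents and counts.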

[0067] Finally, the high-level semantic feature map obtained by the attention fusion feature extraction network and the positional contour feature map obtained by the shallow feature extraction network are fused through multiplication. The bridge crack recognition and detection model predicts the crack segmentation and detection results of the bridge crack image based on the fusion result. This comprehensively utilizes the wide-channel characteristics of the attention fusion feature extraction network and the crack detail enhancement performance of the shallow feature extraction network, reduces misjudgment of complex background pixels, and accurately locates the crack region. This enables faster and more accurate segmentation, recognition, and extraction of bridge concrete cracks in complex backgrounds, thus improving the problems of slow crack segmentation and recognition speed and insufficient accuracy caused by excessive background noise.
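
A small numeric sketch may help show why fusion by multiplication can suppress background false positives. The values below are invented for illustration, the assumption that both maps hold responses in the range 0 to 1 is not stated in the source, and `fuse_by_multiplication` is a hypothetical helper name.

```python
def fuse_by_multiplication(semantic, contour):
    """Element-wise product of two [H][W] maps of equal size."""
    if len(semantic) != len(contour) or any(
        len(s) != len(c) for s, c in zip(semantic, contour)
    ):
        raise ValueError("feature maps must have the same size")
    return [[s * c for s, c in zip(srow, crow)]
            for srow, crow in zip(semantic, contour)]


# Hypothetical responses for a 2x3 patch (illustrative values only).
semantic = [[0.9, 0.8, 0.1],
            [0.9, 0.2, 0.0]]   # high-level branch: responds to a crack and a shadow
contour = [[1.0, 0.0, 0.5],
           [1.0, 1.0, 0.0]]    # shallow branch: responds to crack contours only

fused = fuse_by_multiplication(semantic, contour)
print(fused)
```

In this toy example the pixel at row 0, column 1 has a strong high-level response but no contour support, so its fused value is zero, while pixels supported by both branches keep a high response.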

[0068] 3. Training of the bridge crack identification and detection model

[0069] In training the bridge crack identification and detection model, this invention employs the following method:

[0070] Bridge crack sample images with pre-annotated crack-region segmentation masks are used as training samples to form a training sample set, which is then input into the bridge crack recognition and detection model. A total loss function containing cross-entropy loss and Dice loss is constructed to evaluate the bridge crack recognition performance. The model parameters of the bridge crack recognition and detection model are optimized and updated with the goal of minimizing the total loss function, thereby training the bridge crack recognition and detection model.

[0071] The total loss function is:

[0072] $L_{Total} = L_{CE} + \lambda L_{Dice}$;

[0073]

[0074]

[0075] where $L_{Total}$ denotes the total loss function, $L_{CE}$ denotes the cross-entropy loss, and $L_{Dice}$ denotes the Dice loss; $g_i$ denotes the true segmentation mask value at the i-th pixel position in the bridge crack sample image; $p_i$ denotes the predicted segmentation mask value at the i-th pixel position in the bridge crack sample image, as predicted by the bridge crack identification and detection model, with $i \in \{1, 2, \ldots, H \times W\}$, where H and W are the pixel height and pixel width of the bridge crack sample image, respectively; $\lambda$ is the weight coefficient.
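
The combined loss can be sketched in plain Python as follows. The expressions for the cross-entropy loss and the Dice loss in paragraphs [0073] and [0074] are not reproduced in this text, so the sketch assumes the commonly used forms: binary cross-entropy averaged over all pixels, and one minus the Dice coefficient. The small constant `eps`, the default weight `lam=1.0`, and all function names are assumptions made for illustration.

```python
import math


def cross_entropy_loss(g, p, eps=1e-7):
    """Mean binary cross-entropy over all pixels (assumed form)."""
    total = 0.0
    for gi, pi in zip(g, p):
        pi = min(max(pi, eps), 1.0 - eps)   # clamp to avoid log(0)
        total += gi * math.log(pi) + (1.0 - gi) * math.log(1.0 - pi)
    return -total / len(g)


def dice_loss(g, p, eps=1e-7):
    """1 - Dice coefficient (assumed form; eps guards against 0/0)."""
    intersection = sum(gi * pi for gi, pi in zip(g, p))
    return 1.0 - (2.0 * intersection + eps) / (sum(g) + sum(p) + eps)


def total_loss(g, p, lam=1.0):
    """L_Total = L_CE + lambda * L_Dice, with lam as the weight coefficient."""
    return cross_entropy_loss(g, p) + lam * dice_loss(g, p)


# Flattened 2x2 masks: g is the ground truth, p the predicted probabilities.
g = [1.0, 0.0, 0.0, 1.0]
good = [0.9, 0.1, 0.1, 0.9]
bad = [0.1, 0.9, 0.9, 0.1]
print(total_loss(g, good) < total_loss(g, bad))  # True
```

The cross-entropy term scores each pixel independently, while the Dice term scores the overlap of the whole predicted region with the annotated region, which is why the two are often combined when the crack pixels make up only a small share of the image.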

[0076] This invention trains a bridge crack recognition and detection model by jointly using cross-entropy loss and Dice loss. This enables the model to better adapt to the complex morphological features of cracks of different sizes and shapes in bridge crack images, and it is also better suited to bridge crack images with complex background interference such as lines and graffiti. This improves the training effect and recognition performance of the bridge crack recognition and detection model, and ultimately achieves faster and more accurate segmentation and recognition of bridge concrete cracks in complex backgrounds.

[0077] 4. Overview

[0078] In summary, the method of the present invention has the following technical advantages:

[0079] 1. This invention uses deep learning methods to train a bridge crack recognition and detection model. After training, it performs segmentation, recognition and prediction of crack regions in bridge crack images, which can quickly achieve segmentation, recognition and detection of concrete cracks in bridges under complex backgrounds.

[0080] 2. The bridge crack identification and detection model used in the method of the present invention extracts high-level semantic feature maps of bridge crack images through an attention fusion feature extraction network and extracts positional contour feature maps of bridge crack images through a shallow feature extraction network. Then, based on the fusion result of the two, the crack segmentation identification and detection result of the bridge crack image is predicted. It comprehensively utilizes the wide channel characteristics of the attention fusion feature extraction network and the crack detail enhancement performance of the shallow feature extraction network, reduces the misjudgment of complex background pixels, and accurately realizes the localization of crack areas.

[0081] 3. This invention improves the training effect and recognition performance of the bridge crack identification and detection model by jointly training the model using cross-entropy loss and Dice loss.

[0082] 4. The present invention provides a method for detecting bridge concrete cracks in complex backgrounds based on deep learning. This method can achieve faster and more accurate segmentation, identification and extraction of bridge concrete cracks in complex backgrounds, thereby improving the problems of slow crack segmentation and identification speed and insufficient accuracy caused by excessive background noise.

[0083] Finally, it should be noted that although the invention has been described with reference to preferred embodiments thereof, those skilled in the art should understand that various changes in form and detail may be made thereto without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A method for detecting concrete cracks in bridges under complex backgrounds based on deep learning, characterized in that: the bridge crack image to be processed is obtained and input into a pre-trained bridge crack recognition and detection model to obtain the crack segmentation, recognition and detection results of the bridge crack image to be processed. The bridge crack identification and detection model includes an attention fusion feature extraction network based on the U-Net network architecture and a shallow feature extraction network based on a multi-scale convolutional attention module and a Transformer. The bridge crack identification and detection model takes the input bridge crack image as the input to the attention fusion feature extraction network and the shallow feature extraction network, respectively. The attention fusion feature extraction network extracts the high-level semantic feature map of the bridge crack image, and the shallow feature extraction network extracts the positional contour feature map of the bridge crack image. Then, the high-level semantic feature map and the positional contour feature map of the bridge crack image are multiplied together to fuse and generate the crack segmentation identification and detection result of the bridge crack image as the output. The attention fusion feature extraction network comprises four sequentially connected encoding layers, one bottleneck layer, and four decoding layers. Each encoding layer includes a cascaded multi-scale convolutional attention module and a strip pooling module. The input of the encoding layer serves as the input to the multi-scale convolutional attention module, the output of the multi-scale convolutional attention module serves as the input to the strip pooling module, and the output of the strip pooling module serves as the output of the encoding layer. Each decoding layer comprises two cascaded convolutional layers and a convolutional upsampling module. 
The input of the decoding layer serves as the input to the first convolutional layer, the output of the first convolutional layer serves as the input to the second convolutional layer, the output of the second convolutional layer serves as the input to the convolutional upsampling module, and the output of the convolutional upsampling module serves as the output of the decoding layer; the bottleneck layer consists of two cascaded convolutional layers; the bridge crack image input to the attention fusion feature extraction network is used as the input to the first encoding layer; the output of each encoding layer is used as the input to the next layer it is connected to in the attention fusion feature extraction network; the output of the fourth encoding layer is used as the input to the bottleneck layer, and the output of the bottleneck layer is convolved and upsampled before being concatenated with the output of the multi-scale convolutional attention module in the fourth encoding layer as the input to the fourth decoding layer; the input of each decoding layer is the concatenation of the output of the multi-scale convolutional attention module in its corresponding encoding layer and the output of the previous layer in the attention fusion feature extraction network; the output of the first decoding layer is used as the overall output of the attention fusion feature extraction network; the shallow feature extraction network includes two multi-scale convolutional attention modules, one strip pooling module, and one Transformer module. The bridge crack image input to the shallow feature extraction network is used as the input to the first multi-scale convolutional attention module and the Transformer module, respectively. The output of the first multi-scale convolutional attention module is used as the input to the strip pooling module. 
The output of the strip pooling module is concatenated with the output of the Transformer module and used as the input to the second multi-scale convolutional attention module. The output of the second multi-scale convolutional attention module is used as the overall output of the shallow feature extraction network. The processing steps of the multi-scale convolutional attention module include the following: after a 5×5 depthwise convolution is performed on the input feature map, the resulting depthwise convolution map is passed through two branches of depthwise strip convolutions, one branch performing 1×7 and 7×1 strip convolutions, and the other branch performing 1×11 and 11×1 strip convolutions; the two feature maps of different scales obtained from the strip convolutions of the two branches are concatenated with the depthwise convolution map, and then a multi-channel feature map is obtained by convolution with a kernel of 1; finally, the multi-channel feature map is multiplied with the input feature map to obtain the attention feature map as the output.

2. The method for detecting bridge concrete cracks in complex backgrounds based on deep learning according to claim 1, characterized in that, The processing steps of the strip pooling module include the following: The input feature map is subjected to pixel width strip pooling and pixel height strip pooling respectively, resulting in a pixel width pooled feature map of size 1×W and a pixel height pooled feature map of size H×1, where H and W are the pixel height and pixel width of the input feature map, respectively. Then, the pixel width pooled feature map is expanded by a one-dimensional convolution operation to obtain an expanded feature map of size H×W based on pixel width, and the pixel height pooled feature map is expanded by a one-dimensional convolution operation to obtain an expanded feature map of size H×W based on pixel height. The two expanded feature maps are superimposed and fused, and then successively subjected to a convolution operation with a kernel of 1 and a sigmoid activation function operation to obtain an activation feature map. Finally, the activation feature map is multiplied by the input feature map to obtain the pooled downsampled feature map as the output.

3. The method for detecting bridge concrete cracks in complex backgrounds based on deep learning according to claim 1, characterized in that the bridge crack identification and detection model is trained in the following way: bridge crack sample images with pre-annotated crack-region segmentation masks are used as training samples to form a training sample set, which is then input into the bridge crack recognition and detection model. A total loss function containing cross-entropy loss and Dice loss is constructed to evaluate the bridge crack recognition performance. The model parameters of the bridge crack recognition and detection model are optimized and updated with the goal of minimizing the total loss function, thereby training the bridge crack recognition and detection model.

4. The method for detecting bridge concrete cracks in complex backgrounds based on deep learning according to claim 3, characterized in that the total loss function is: $L_{Total} = L_{CE} + \lambda L_{Dice}$; where $L_{Total}$ represents the total loss function, $L_{CE}$ represents the cross-entropy loss, and $L_{Dice}$ represents the Dice loss; $g_i$ represents the true segmentation mask value at the i-th pixel position in the bridge crack sample image; $p_i$ represents the predicted segmentation mask value at the i-th pixel position in the bridge crack sample image, as predicted by the bridge crack identification and detection model, with $i \in \{1, 2, \ldots, H \times W\}$; H and W are the pixel height and pixel width of the bridge crack sample image, respectively; $\lambda$ is the weight coefficient.