[0037] To facilitate understanding of the present application, the present application is described more fully below with reference to the related drawings. Preferred embodiments of the present application are shown in the accompanying drawings. However, the present application may be implemented in many different forms and is not limited to the embodiments described herein; rather, these embodiments are provided so that the disclosure of this application will be thorough and complete.
[0038] As shown in Figure 1, the ConcreteCrackSegNet-based concrete crack detection method includes the following steps:
[0039] 1. Collect photos of bridge concrete structures and highways by drone photography, industrial cameras and the like; some of the photos contain cracks and other defects;
[0040] 2. Use the LabelMe software to label the photos. LabelMe is a tool for annotating images; it is used here to draw bounding polygons around cracks and defects. After saving, a JSON file is generated: if the photo contains a crack defect, the file contains the coordinates of each vertex of the polygon enclosing the crack defect; if the photo contains no crack defect, the JSON file has no polygon data;
[0041] 3. The labeled dataset contains 5000 images in total. The dataset is divided into training, validation and test sets at a ratio of 6:2:2, yielding 3000 training images, 1000 validation images and 1000 test images;
[0042] 4. Construct the ConcreteCrackSegNet concrete crack detection network model proposed by the present invention. In this experiment the GPU is an NVIDIA GeForce RTX 2080, the operating system is Ubuntu, and the PyTorch 1.10 framework is used for deep learning.
[0043] 5. After the model is constructed, a loss function needs to be set; the present invention uses a softmax-based main loss function plus an auxiliary loss function.
[0044] 6. Input the training data and train the model using the loss function.
[0045] 7. During training, the validation set is read and the model parameters are adjusted accordingly.
[0046] 8. The maximum number of training epochs is set to 1000; training continues until 1000 epochs are completed.
[0047] 9. The best-performing model during training is saved as the best model.
[0048] 10. The best model is further evaluated on the test set to demonstrate the effectiveness of training.
[0049] 11. In practical applications, when inspecting a new concrete structure, photos are first collected by drone or industrial camera and then converted to the standard input size (the standard input size refers to the same image input size used when training the model). In this patent the standard input size is (3, 512, 512), where 3 is the number of channels and (512, 512) is the width and height.
[0050] 12. Then load the best model trained in step 9 and input the pictures collected in step 11 (a minimal inference sketch follows this list).
[0051] 13. The model predicts on the input to achieve crack segmentation of the image.
[0052] 14. A picture of the segmented cracks is then obtained.
[0053] 15. These segmented crack pictures of the concrete structure are submitted to the maintenance department for corresponding crack treatment.
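A minimal PyTorch sketch of steps 11–13, assuming the best model from step 9 was saved as a whole module at a hypothetical path best_model.pth; the file name, photo path and loading style are illustrative assumptions, not part of the disclosure:

```python
import torch
from PIL import Image
from torchvision import transforms

# Steps 11-13 as a sketch; paths and the saved-model format are assumptions.
preprocess = transforms.Compose([
    transforms.Resize((512, 512)),   # standard input size (3, 512, 512)
    transforms.ToTensor(),
])

model = torch.load("best_model.pth", map_location="cpu")   # step 12: load best model
model.eval()

image = Image.open("site_photo.jpg").convert("RGB")        # step 11: collected photo
x = preprocess(image).unsqueeze(0)                         # (1, 3, 512, 512)

with torch.no_grad():
    logits = model(x)                                      # (1, 2, 512, 512): crack / background
mask = logits.argmax(dim=1)                                # step 13: per-pixel prediction
```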
[0054] The ConcreteCrackSegNet used in step 4 above is proposed according to the specific characteristics of concrete crack segmentation, verified through experiments: the color boundaries of concrete cracks are not obvious, and the cracks have a certain continuity. The present invention collected actual concrete crack photos, and the collection includes all kinds of cracks with indistinct boundaries; the experiments fully considered the indistinguishability and continuity of cracks. The network structure is shown in Figure 2. The input image is an RGB three-channel image of size (3, 512, 512), where 3 is the number of channels and (512, 512) is the width and height. The input image is fed into two paths: one is the spatial path module, and the other is the ResNet feature extraction path.
[0055] ResNet (https://arxiv.org/abs/1512.03385v1) proposes the idea of residual learning: by passing the input directly to the output, the integrity of the information is protected, and the entire network only needs to learn the difference between input and output, which simplifies the learning objective and difficulty. ResNet has many bypasses that connect the input directly to subsequent layers; this structure is also called a shortcut or skip connection. A pre-trained ResNet model is used for image feature extraction. The present invention uses a pre-trained ResNet101 and takes the outputs of the last three stages before the fully connected layer, with sizes C (1024, 32, 32), D (2048, 16, 16) and E (2048, 1, 1). By using the pre-trained ResNet to extract three levels of features from concrete crack images, features at multiple resolutions can be fully extracted by the existing model, and these features fully retain the existing crack information.
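For illustration, a sketch of this three-level feature extraction, assuming C, D and E are taken from layer3, layer4 and the global average pool of torchvision's pre-trained ResNet101 (an assumption inferred from the stated tensor sizes):

```python
import torch
from torchvision import models

# Extract the three feature levels C, D, E named above from a pre-trained
# ResNet101; the layer choice is inferred from the stated tensor sizes.
backbone = models.resnet101(pretrained=True)
backbone.eval()

x = torch.randn(1, 3, 512, 512)              # RGB input of size (3, 512, 512)
x = backbone.maxpool(backbone.relu(backbone.bn1(backbone.conv1(x))))
x = backbone.layer2(backbone.layer1(x))
C = backbone.layer3(x)                       # (1, 1024, 32, 32)
D = backbone.layer4(C)                       # (1, 2048, 16, 16)
E = backbone.avgpool(D)                      # (1, 2048, 1, 1)
```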
[0056] The original input image A also passes through the spatial path module; the output has a size of (256, 64, 64) and is denoted as B.
[0057] The feature C extracted by ResNet passes through the bottleneck attention module, and the size remains (1024, 32, 32), denoted as F; F is then input to the attention refinement module, and the size remains (1024, 32, 32), denoted as H; H then passes through upsampling module 1, and the size becomes (1024, 64, 64), denoted as J.
[0058] The feature D extracted by ResNet passes through the convolutional block attention module, and the size remains (2048, 16, 16), denoted as G.
[0059] The size of the feature E extracted by ResNet is (2048, 1, 1). G and E are multiplied, and the result has a size of (2048, 16, 16), denoted as I. I passes through upsampling module 2, and the size becomes (2048, 64, 64), denoted as K.
[0060] J and K have different numbers of channels but the same spatial size of (64, 64), so they are concatenated along the channel dimension; the result has a size of (3072, 64, 64) and is denoted as L, where 3072 is the sum of the 1024 channels of J and the 2048 channels of K. L then passes through a normalized attention module, and the output size remains (3072, 64, 64), denoted as M.
[0061] B and M above have different numbers of channels but the same width and height of (64, 64). They are passed through a feature fusion module, and the output size is (2, 64, 64), denoted as N.
[0062] The output N passes through an upsampling module, giving a size of (2, 512, 512), denoted as O; this is followed by a convolution module whose output size remains (2, 512, 512), denoted as P. Prediction then outputs the segmented concrete crack image, in which cracked parts are white and non-cracked parts are black.
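The shape bookkeeping of paragraphs [0056] to [0062] can be checked with dummy tensors; in this sketch the attention modules are replaced by identity stand-ins, so only the stated sizes are verified, not the actual computations:

```python
import torch
import torch.nn.functional as F

B = torch.randn(1, 256, 64, 64)      # spatial path output
C = torch.randn(1, 1024, 32, 32)     # ResNet feature C
D = torch.randn(1, 2048, 16, 16)     # ResNet feature D
E = torch.randn(1, 2048, 1, 1)       # ResNet feature E

J = F.interpolate(C, scale_factor=2)   # BAM -> refinement -> upsampling 1 (sizes only)
I = D * E                              # broadcast multiply: (1, 2048, 16, 16)
K = F.interpolate(I, scale_factor=4)   # upsampling 2: (1, 2048, 64, 64)
L = torch.cat([J, K], dim=1)           # (1, 3072, 64, 64)
# normalized attention keeps (1, 3072, 64, 64); feature fusion yields N = (1, 2, 64, 64)
N = torch.randn(1, 2, 64, 64)
O = F.interpolate(N, size=(512, 512))  # (1, 2, 512, 512), then the convolution module
```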
[0063] The structure of the above spatial path module is shown in Figure 3. The spatial path module contains three convolution units, each consisting of a convolution with kernel size 3 and stride 2, followed by batch normalization and an activation function. Each convolution unit therefore halves the spatial size of its input, and the final output feature map is 1/8 of the original size. In concrete crack image segmentation, it is difficult for traditional image segmentation methods to retain a sufficiently high resolution for the input image while also retaining enough spatial information. Using the spatial path module retains enough spatial information on a separate feature extraction path while the other path keeps the receptive field large enough, so that the output feature map carries global spatial information and can more accurately distinguish the location of cracks with precise pixel information.
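A minimal sketch of the spatial path under these stated constraints (three conv units, kernel 3, stride 2, batch normalization and activation); the intermediate channel widths of 64 and 128 are assumptions, since only the final 256 channels are stated:

```python
import torch.nn as nn

class ConvUnit(nn.Module):
    """One convolution unit: conv (kernel 3, stride 2) + BN + ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.block(x)

class SpatialPath(nn.Module):
    """Three conv units: (3, 512, 512) -> (256, 64, 64), i.e. 1/8 resolution."""
    def __init__(self):
        super().__init__()
        self.units = nn.Sequential(ConvUnit(3, 64), ConvUnit(64, 128), ConvUnit(128, 256))
    def forward(self, x):
        return self.units(x)
```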
[0064] The above bottleneck attention (BAM) module is taken from the paper "BAM: Bottleneck Attention Module". For a 3-dimensional feature map, such as the feature map C (1024, 32, 32) extracted from the concrete crack picture in the present invention, the bottleneck attention (BAM) module outputs a feature map of the same size, but the output emphasizes the more important elements of the crack image. In the bottleneck attention (BAM) module, the processing of the input feature map is divided into two branches, one spatial and one channel, which focus on "where to look" and "what to look at", respectively. For an input feature map F, the attention map generated by the bottleneck attention (BAM) module is M(F), and the optimized output feature map F′ is:
[0065] F′ = F + F ⊗ M(F)
[0066] where ⊗ represents element-wise multiplication.
[0067] The formula for calculating M(F) is:
[0068] M(F) = σ(M_c(F) + M_s(F))
where M_c(F) is the channel attention, M_s(F) is the spatial attention, and the two constitute the two processing branches. σ is the sigmoid function. Before the addition, both attention maps are resized and broadcast to the shape C × H × W of the input feature map.
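A compact BAM sketch following the two formulas above; the reduction ratio and dilation follow the BAM paper's defaults and are assumptions here, not values from the present disclosure:

```python
import torch
import torch.nn as nn

class BAM(nn.Module):
    """F' = F + F * sigmoid(Mc(F) + Ms(F)), with channel and spatial branches."""
    def __init__(self, channels, reduction=16, dilation=4):
        super().__init__()
        mid = channels // reduction
        self.channel_att = nn.Sequential(        # "what to look at"
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, mid, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1),
        )
        self.spatial_att = nn.Sequential(        # "where to look"
            nn.Conv2d(channels, mid, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, 1, 1),
        )
    def forward(self, f):
        # the (N, C, 1, 1) and (N, 1, H, W) maps broadcast to (N, C, H, W) before adding
        m = torch.sigmoid(self.channel_att(f) + self.spatial_att(f))
        return f + f * m
```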
[0069] The above convolutional block attention (CBAM) module is taken from the paper "CBAM: Convolutional Block Attention Module". For a three-dimensional input feature map F, the convolutional block attention module sequentially generates a one-dimensional channel attention map M_c and a two-dimensional spatial attention map M_s, calculated as follows:
[0070] F′ = M_c(F) ⊗ F
[0071] F″ = M_s(F′) ⊗ F′
[0072] F″ is then the optimized final output feature map. Applied to the feature map D (2048, 16, 16) extracted from the concrete crack image in the present invention, which is a high-dimensional feature map of relatively low resolution, the sequential connection of one-dimensional channel attention and two-dimensional spatial attention preserves the "where to look" and "what to look at" information to the greatest extent, making the subsequent crack segmentation more accurate.
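A minimal CBAM sketch matching the sequential formulas above; the shared-MLP reduction ratio and the 7×7 spatial kernel follow the CBAM paper's defaults and are assumptions here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBAM(nn.Module):
    """F' = Mc(F) * F, then F'' = Ms(F') * F'."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        mid = channels // reduction
        self.mlp = nn.Sequential(                 # shared MLP for channel attention
            nn.Conv2d(channels, mid, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)
    def forward(self, f):
        # 1-D channel attention from average- and max-pooled descriptors
        mc = torch.sigmoid(self.mlp(F.adaptive_avg_pool2d(f, 1))
                           + self.mlp(F.adaptive_max_pool2d(f, 1)))
        f = mc * f
        # 2-D spatial attention from channel-wise mean and max
        s = torch.cat([f.mean(dim=1, keepdim=True),
                       f.max(dim=1, keepdim=True).values], dim=1)
        return torch.sigmoid(self.spatial(s)) * f
```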
[0073] The structure of the above attention refinement module is shown in Figure 4. The input S1 first passes through an average pooling layer, which captures the global context information; after a further transformation, the output is denoted S3. Finally, the original input S1 and S3 are multiplied, and the result has the same size as S1, keeping the same size as the original input. Using the attention refinement module for concrete crack image segmentation better captures global image information and computes an attention vector to guide feature learning, so that global information can be integrated into the network without upsampling.
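A sketch of the attention refinement module; the text elides the step between the average pooling and S3, so a 1×1 convolution with batch normalization and a sigmoid is assumed here (the usual form of this module in BiSeNet-style networks):

```python
import torch
import torch.nn as nn

class AttentionRefinement(nn.Module):
    """Multiply the input S1 by an attention vector computed from its global context."""
    def __init__(self, channels):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)       # captures global context
        self.conv = nn.Conv2d(channels, channels, 1)
        self.bn = nn.BatchNorm2d(channels)
    def forward(self, s1):
        s3 = torch.sigmoid(self.bn(self.conv(self.pool(s1))))   # assumed middle step
        return s1 * s3                            # output keeps the size of S1
```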
[0074] The structure of the above normalized attention module is shown in Figure 5. For the input feature map F1, a batch normalization (BN) function is applied first, the result is multiplied by a weight vector W_γ, and finally a sigmoid function produces the output. Its formula is: M_c = sigmoid(W_γ(BN(F1))).
[0075] The batch normalization is calculated as shown in the following formula:
[0076] B_out = γ · (B_in − μ_B) / √(σ_B² + ε) + β
[0077] where B_in represents the input vector, μ_B represents the mean of the input vector, σ_B represents the standard deviation of the input vector, γ represents the scaling factor, which is a trainable parameter, and β represents the offset, which is also a trainable parameter. ε is a small constant that prevents an invalid division when the standard deviation is 0; its value is 0.00001 in the present invention.
[0078] The weight is calculated as W_γ = γ_i / ∑_j γ_j, where the sum runs over all channels j.
[0079] The normalized attention module is used in concrete crack segmentation. It performs the attention calculation according to the weights of the multi-channel image features that were extracted by ResNet, processed and concatenated, highlighting key elements such as cracks and edges, so that subsequent segmentation is more precise.
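A sketch of the normalized attention module per the formula M_c = sigmoid(W_γ(BN(F1))); multiplying the attention map back onto the input is an assumption, made so that the module preserves the stated (3072, 64, 64) size:

```python
import torch
import torch.nn as nn

class NormalizedAttention(nn.Module):
    """Weight each channel by its BN scale factor: W_gamma = gamma_i / sum_j gamma_j."""
    def __init__(self, channels, eps=1e-5):       # eps = 0.00001 as stated
        super().__init__()
        self.bn = nn.BatchNorm2d(channels, eps=eps)
    def forward(self, f1):
        x = self.bn(f1)
        gamma = self.bn.weight                    # trainable scaling factors
        w = gamma / gamma.sum()                   # per-channel weights W_gamma
        att = torch.sigmoid(x * w.view(1, -1, 1, 1))
        return f1 * att                           # assumed application to the input
```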
[0080] The structure of the above feature fusion module is shown in Figure 6. For multi-channel inputs, such as input 1 (B in Figure 2) and input 2 (M in Figure 2), the features come from different paths and differ in resolution and extraction accuracy, so they cannot simply be fused by addition. B mainly captures spatial information, while M captures context information. Therefore, the feature fusion module first concatenates B and M into one vector and then passes it through a convolution unit to obtain the output, denoted P1. P1 is average-pooled and passed through convolution unit 2 and convolution unit 3 to obtain P2. P1 and P2 are then multiplied, giving P3. P3 and P1 are then added to obtain the output P4. Here P2 acts as a weight vector that assigns a weight to each dimension of the feature P1, achieving feature selection and feature fusion. For concrete crack images, the multi-path inputs include both high-resolution and low-resolution features; effective feature fusion allows each feature to be fully weighted, highlights crack characteristics and facilitates accurate segmentation.
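A sketch of the feature fusion module as described (concatenate, conv unit, average pool, conv units 2 and 3, multiply, add); the activations inside conv units 2 and 3 are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusion(nn.Module):
    """Fuse the spatial feature B and context feature M into a weighted output P4."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv1 = nn.Sequential(               # conv unit on the spliced vector
            nn.Conv2d(in_channels, out_channels, 1),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True),
        )
        self.conv2 = nn.Conv2d(out_channels, out_channels, 1)   # conv unit 2
        self.conv3 = nn.Conv2d(out_channels, out_channels, 1)   # conv unit 3
    def forward(self, b, m):
        p1 = self.conv1(torch.cat([b, m], dim=1))  # vector splicing of B and M
        w = F.adaptive_avg_pool2d(p1, 1)           # average pooling
        p2 = torch.sigmoid(self.conv3(F.relu(self.conv2(w))))   # weight vector P2
        p3 = p1 * p2                               # feature selection
        return p1 + p3                             # P4: feature fusion output
```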
[0081] The loss function of the above ConcreteCrackSegNet consists of a main loss function and auxiliary loss functions. The main loss function measures the output of the entire ConcreteCrackSegNet network, while auxiliary loss functions are added to measure the outputs of the context path. All loss functions are computed with softmax. A parameter α balances the weights of the main loss function and the auxiliary loss functions. The loss function formula is:
[0082] L(X; W) = l_p(X; W) + α ∑_{i=2}^{K} l_i(X_i; W)
[0083] where l_p is the loss of the main loss function, X_i is the output feature of stage i, l_i is the loss of the auxiliary loss function at the i-th stage, and K is 3 in the present invention.
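A sketch of this joint loss; nn.CrossEntropyLoss applies softmax internally, matching the softmax-based losses above, and the value of α is a placeholder since the text does not specify it:

```python
import torch.nn as nn

class JointLoss(nn.Module):
    """l_p on the main output plus alpha-weighted auxiliary losses l_i."""
    def __init__(self, alpha=1.0):                # alpha value is an assumption
        super().__init__()
        self.ce = nn.CrossEntropyLoss()           # softmax cross-entropy
        self.alpha = alpha
    def forward(self, main_out, aux_outs, target):
        # aux_outs are assumed already upsampled to the label resolution
        loss = self.ce(main_out, target)                                      # l_p
        loss = loss + self.alpha * sum(self.ce(x, target) for x in aux_outs)  # sum of l_i
        return loss
```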
[0084] On the basis of the above embodiment, the present invention evaluates the performance of the ConcreteCrackSegNet model on concrete crack test data collected from Yantai highway bridges.
[0085] All experiments were carried out on computers with the following specifications: the software environment is based on the Ubuntu operating system, Python 3.8 is the main programming language, and experiments were run on the PyTorch deep learning framework. For each type of experiment, in addition to ConcreteCrackSegNet, the present invention runs typical image segmentation models, including ENet, FCN, LinkNet, SegNet and UNet.
[0086] The ENet (Efficient Neural Network) segmentation network is particularly good at low-latency operation because it has few parameters; FCN (Fully Convolutional Networks) replaces fully connected layers with fully convolutional layers and was the first segmentation model to achieve a major breakthrough; the LinkNet segmentation model is also based on an encoder-decoder architecture and achieves good accuracy with few parameters; the SegNet segmentation model is designed for efficient semantic segmentation; UNet is a symmetric encoder-decoder architecture shaped like the letter U, originally used for medical image segmentation.
[0087] For the crack segmentation task in the present invention, the following evaluation metrics are used: accuracy, average IoU, precision (P), recall (R) and F1. The F1 score is the harmonic mean of precision and recall. Crack pixels (white pixels in the image) are defined as positive samples, and pixels are classified into four types based on the combination of label and prediction: true positive (TP), false positive (FP), true negative (TN) and false negative (FN).
[0088] Accuracy is defined as the proportion of correctly classified pixels among all pixels; precision is defined as the ratio of correctly predicted crack pixels to all predicted crack pixels; recall is defined as the ratio of correctly predicted crack pixels to all true crack pixels; the F1 score is the harmonic mean of precision and recall. Precision is given by formula (9), recall by formula (10), and F1 by formula (11):
[0089] Precision = TP / (TP + FP)        (9)
[0090] Recall = TP / (TP + FN)        (10)
[0091] F1 = 2 × Precision × Recall / (Precision + Recall)        (11)
[0092] Intersection over Union (IoU) reflects the degree of overlap between two objects. In the present invention, IoU is evaluated on the "crack" category to provide a measure of the overlap between actual and predicted concrete cracks, as in the formula:
[0093] IoU = TP / (TP + FP + FN)        (12)
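For illustration, the pixel metrics above computed from binary masks (crack pixels = 1); this helper assumes each mask contains at least one predicted and one true crack pixel so the denominators are nonzero:

```python
import numpy as np

def crack_metrics(pred: np.ndarray, label: np.ndarray):
    """Precision, recall, F1 and IoU on the 'crack' (positive) category."""
    tp = np.sum((pred == 1) & (label == 1))
    fp = np.sum((pred == 1) & (label == 0))
    fn = np.sum((pred == 0) & (label == 1))
    precision = tp / (tp + fp)                           # formula (9)
    recall = tp / (tp + fn)                              # formula (10)
    f1 = 2 * precision * recall / (precision + recall)   # formula (11)
    iou = tp / (tp + fp + fn)                            # formula (12)
    return precision, recall, f1, iou
```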
[0094] The test results of the different segmentation models on the established concrete crack defect image dataset are shown in Table 1:
[0095] Table 1 Test results of different segmentation models
[0096]
Segmentation model     Accuracy/%   Average IoU   Recall/%   F1/%
ENet                   83.39        80.04         82.68      84.5
FCN                    87.25        79.85         89.19      89.77
LinkNet                87.05        79.49         85.36      88.85
SegNet                 84.46        70.25         73.45      82.09
UNet                   90.01        80.51         70.04      73.09
ConcreteCrackSegNet    91.63        82.14         94.52      91.54
[0097] From the test results in Table 1, it can be seen that ConcreteCrackSegNet is significantly better than the other segmentation models in accuracy (91.63%) and F1 (91.54%), achieving very good results in concrete crack segmentation.
[0098] Figure 7 shows the segmentation effect of ConcreteCrackSegNet on concrete cracks; the left part is the original image and the right part is the segmentation result. As can be seen from the figure, ConcreteCrackSegNet segments the images very accurately and can detect cracks in concrete structures early, providing support for the maintenance department's preventive maintenance decisions and saving considerable manpower and material resources.
[0099] The above are only preferred embodiments of the present invention and are not intended to limit the present invention in other forms. Any person skilled in the art may use the technical content disclosed above to make changes or modify it into equivalent embodiments. However, any simple modifications, equivalent changes and variations made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical solutions of the present invention, still fall within the protection scope of the technical solutions of the present invention.