A road crack segmentation method based on an improved U-Net network
By improving the U-Net network structure and combining it with residual neural networks and attention mechanism modules, the method addresses the incomplete semantic information and weak contextual connections of the U-Net network in road crack segmentation, thereby improving segmentation accuracy and robustness.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HANGZHOU DIANZI UNIV
- Filing Date
- 2022-12-29
- Publication Date
- 2026-06-19
Figure CN115830038B_ABST
Description
Technical Field
[0001] This invention relates to a road crack segmentation method in computer image processing, and more particularly to a road crack segmentation method based on an improved U-Net network.
Background Technology
[0002] In recent years, owing to rapid urbanization in China, the scale of highway construction has continued to expand, and the mileage of highways in operation has increased dramatically. Both existing and newly constructed highways require maintenance, and road cracks are one of the most common road surface hazards. Obtaining road crack information quickly and efficiently is fundamental for relevant departments to conduct road maintenance and make scientific decisions. However, road crack images have blurred crack boundaries, low contrast with the surrounding environment, and complex topological structures, so most existing segmentation methods still have significant shortcomings in obtaining receptive fields and extracting image feature information.
[0003] Deep learning has attracted wide attention in recent years and is used extensively in fields such as medicine and autonomous driving. Although its origins date back to the 1940s, its development was long hampered by insufficient computer hardware and a lack of rich datasets. Rapid advances in CPUs and GPUs, together with the construction of large-scale datasets, have since laid the foundation for its resurgence. Convolutional neural networks (CNNs), a crucial component of deep learning, have grown explosively thanks to these hardware advances and the emergence of numerous artificial datasets. The U-Net network in particular has proven very helpful for image processing with limited sample sizes and has achieved significant progress in image segmentation. However, U-Net directly fuses feature maps from the encoding and decoding stages and ignores the differing importance of features from different channels, which results in redundant information. Furthermore, U-Net suffers from incomplete semantic information and weak contextual connections, leading to insufficient accuracy in road crack segmentation. Improving the U-Net network structure can enhance its efficiency and accuracy.
Summary of the Invention
[0004] The technical problem solved by this invention is that U-Net networks suffer from incomplete semantic information, weak contextual information, and a large amount of redundant information generated when simply splicing feature maps during the encoding and decoding stages of U-Net networks, which leads to a decrease in segmentation accuracy. This invention proposes a road crack segmentation method based on an improved U-Net.
[0005] The technical solution of the present invention is as follows:
[0006] Step 1: Preprocess the original road crack dataset to obtain the preprocessed road crack dataset;
[0007] Step 2: Construct an improved U-Net road crack segmentation network;
[0008] Step 3: Input the preprocessed road crack dataset into the improved U-Net road crack segmentation network for training to obtain the trained road crack segmentation network model;
[0009] Step 4: Use the trained road crack segmentation network model to segment the preprocessed road crack image to obtain the segmentation result map of the road crack.
[0010] In step two, the improved U-Net network includes 5 stacked U-Net modules, 9 residual modules, 8 convolutional modules, 4 attention mechanism modules, 4 pooling layers, 4 upsampling modules, 4 spatiotemporal sequence modules, and a first convolutional layer;
[0011] The input of the improved U-Net road crack segmentation network is used as the input of the first stacked U-Net module. The first stacked U-Net module is connected to the first residual module. The first residual module is connected to the first convolution module, the first pooling layer, and the first spatiotemporal sequence module. The first pooling layer is connected to the second residual module after passing through the second stacked U-Net module. The second residual module is connected to the second convolution module, the second pooling layer, and the second spatiotemporal sequence module. The second pooling layer is connected to the third residual module after passing through the third stacked U-Net module. The third residual module is connected to the third convolution module, the third pooling layer, and the third spatiotemporal sequence module. The third pooling layer is connected to the fourth residual module after passing through the fourth stacked U-Net module. The fourth residual module is connected to the fourth convolution module, the fourth pooling layer, and the fourth spatiotemporal sequence module. The fourth pooling layer is connected to the fifth residual module after passing through the fifth stacked U-Net module. The fifth residual module is connected to the fourth spatiotemporal sequence module. The output of the fifth residual module is channel-concatenated with the output of the fourth spatiotemporal sequence module and then input into the sixth residual module. The output of the sixth residual module is added to the output of the fourth convolution module after passing through the fourth attention mechanism module and then input into the fifth convolution module. The output of the fifth convolution module is used as the input of the third spatiotemporal sequence module. 
The output of the fifth convolution module is channel-concatenated with the output of the third spatiotemporal sequence module and then input into the seventh residual module.
[0012] The output of the seventh residual module and the output of the third convolution module are added together by the third attention mechanism module and then input into the sixth convolution module. The output of the sixth convolution module is used as the input of the second spatiotemporal sequence module. The output of the sixth convolution module is channel-concatenated with the output of the second spatiotemporal sequence module by the third upsampling module and then input into the eighth residual module.
[0013] The output of the eighth residual module is added to the output of the second convolution module via the second attention mechanism module and then input into the seventh convolution module. The output of the seventh convolution module is used as the input of the first spatiotemporal sequence module. The output of the seventh convolution module is channel-concatenated with the output of the first spatiotemporal sequence module via the output of the fourth upsampling module and then input into the ninth residual module.
[0014] The output of the ninth residual module and the output of the first convolutional module after passing through the first attention mechanism module are added together and then input into the eighth convolutional module. The output of the eighth convolutional module after passing through the first convolutional layer is used as the output of the improved U-Net road crack segmentation network.
[0015] The five stacked U-Net modules have the same structure. Each stacked U-Net module includes a tenth residual module, an eleventh residual module, a ninth convolutional module, a fifth upsampling module, and a fifth pooling layer. The input of the stacked U-Net module is used as the input of the tenth residual module. The tenth residual module is connected to the fifth pooling layer and the ninth convolutional module. The fifth pooling layer is connected to the fifth upsampling module after passing through the eleventh residual module. The output of the fifth upsampling module and the output of the ninth convolutional module are concatenated and used as the output of the stacked U-Net module.
[0016] The nine residual modules have the same structure. Each residual module includes four convolutional layers and a first SE module. The input of the residual module is used as the input of the second and third convolutional layers. The second convolutional layer is connected to the SE module after passing through the fourth convolutional layer. The third convolutional layer is connected to the fifth convolutional layer. The output of the residual module is the sum of the output of the fifth convolutional layer and the output of the SE module.
[0017] The residual module includes four convolutional layers and a second SE module. The input of the residual module is used as the input of the sixth, seventh, and eighth convolutional layers. The outputs of the seventh and eighth convolutional layers are channel-concatenated and then input into the ninth convolutional layer. The output of the ninth convolutional layer is obtained by adding the output of the second SE module to the output of the sixth convolutional layer.
[0018] The four attention mechanism modules have the same structure. Each attention mechanism module includes a channel attention module and a spatial attention module. The input of the attention mechanism module is used as the input of the channel attention module. The output of the channel attention module multiplied by the input of the attention mechanism module is called the intermediate output. The intermediate output is used as the input of the spatial attention module. The output of the spatial attention module multiplied by the intermediate output is used as the output of the attention mechanism module.
[0019] The channel attention module includes a tenth convolutional layer, two max pooling layers, two average pooling layers, and a first activation layer. The input of the channel attention module is used as the input of the first max pooling layer and the first average pooling layer. The first max pooling layer and the first average pooling layer are connected to the tenth convolutional layer. The tenth convolutional layer is connected to the second max pooling layer and the second average pooling layer. The outputs of the second max pooling layer and the second average pooling layer are multiplied and then input into the first activation layer. The output of the first activation layer is used as the output of the channel attention module.
[0020] The spatial attention module includes a second max pooling layer, a second average pooling layer, an eleventh convolutional layer, and a second activation layer. The input of the spatial attention module is used as the input of the second max pooling layer. The second max pooling layer is connected to the second activation layer after passing through the second average pooling layer and the eleventh convolutional layer. The output of the second activation layer is used as the output of the spatial attention module.
[0021] Compared with the prior art, the above-described technical solution of the present invention has the following advantages:
[0022] For road crack image data with complex backgrounds and significant interference, this invention designs a road crack detection network based on a U-Net nested attention mechanism. Improvements are made to the traditional U-Net's downsampling layer, upsampling layer, and skip connection part. To effectively enhance feature representation capabilities and adapt to different situations, a U-Net nested sampling method is proposed for the downsampling process, where each downsampling step uses an encoder-decoder structure. Simultaneously, combined with residual neural networks, two residual convolutional modules based on the Squeeze-and-Excitation (SE) module are proposed and applied to the traditional U-Net upsampling process and the nested U-Net model. Finally, CBAM and an SE-based residual attention module are added to the skip connections to improve prediction accuracy and sensitivity.
[0023] Compared with existing mainstream segmentation methods, this invention improves the performance of the network model, achieves higher segmentation accuracy, and enhances the robustness of the algorithm.
Attached Figure Description
[0024] Figure 1 is a flowchart of the method of the present invention;
[0025] Figure 2 is a schematic diagram of the improved U-Net network structure described in this invention;
[0026] Figure 3 is a schematic diagram of the stacked U-Net module described in this invention;
[0027] Figure 4 is a schematic diagram of a residual module based on the Squeeze-and-Excitation (SE) module described in this invention, namely the RE module;
[0028] Figure 5 is a schematic diagram of a residual module based on the Squeeze-and-Excitation (SE) module described in this invention, namely a convolutional residual block;
[0029] Figure 6 is a schematic diagram of the attention mechanism module CBAM described in this invention;
[0030] Figure 7 is a schematic diagram of the sub-modules (CAM and SAM) of CBAM described in this invention.
Detailed Implementation
[0031] The technical solution of the invention will be further described below with reference to the accompanying drawings and embodiments.
[0032] This invention provides a road crack segmentation method based on an improved U-Net, which achieves road crack segmentation and provides accurate image segmentation maps for urban road maintenance.
[0033] As shown in Figure 1, the present invention includes the following steps:
[0034] Step 1: Preprocess the original road crack dataset to obtain the preprocessed road crack dataset;
[0035] The specific process of step one is as follows:
[0036] 1) The dataset in step one consists of 500 road crack images taken with a mobile phone by Lei et al. at Temple University, with a label generated for each image. Each label image has the same file name as its original image; the original images are in JPG format and the semantic label images are in PNG format. The pixel resolution of the road crack images is 200*1500*3.
[0037] 2) Images with too large a pixel size are not conducive to training a neural network model. Therefore, to meet the network model's requirements on input image size and to facilitate training, each image is cropped into 16 non-overlapping regions, and only the regions containing more than 1000 crack pixels are retained. The resulting crack dataset contains 1517 crack images in the training group.
[0038] 3) Using data augmentation, 3792 crack images and their corresponding label images are obtained. The road crack images and their corresponding label images are divided into a training set, a validation set, and a test set containing 3034, 624, and 134 images, respectively.
[0039] 4) Denoise the images by setting a threshold and cutting off the grayscale values outside the threshold. Slice the road crack images to obtain an image size of 256*256*3.
[0040] The preprocessing steps are image cropping, data augmentation, image denoising, and image slicing.
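The cropping, filtering, and threshold steps above can be illustrated with a minimal NumPy sketch. It is not the implementation used in this invention: the 4 x 4 grid layout, the clipping bounds `low` and `high`, the interpretation of "cutting off" as clipping, and all function names are assumptions made for illustration. The text states only that there are 16 non-overlapping regions, a 1000-pixel criterion, and a threshold.

```python
import numpy as np

def crop_tiles(image, rows=4, cols=4):
    """Split an H x W (x C) array into rows*cols non-overlapping tiles (4 x 4 grid is assumed)."""
    h, w = image.shape[:2]
    th, tw = h // rows, w // cols  # remainder pixels at the bottom/right edges are dropped
    return [image[r * th:(r + 1) * th, c * tw:(c + 1) * tw]
            for r in range(rows) for c in range(cols)]

def keep_cracked(image_tiles, mask_tiles, min_crack_pixels=1000):
    """Keep (image, mask) tile pairs whose mask has more than min_crack_pixels nonzero pixels."""
    return [(img, msk) for img, msk in zip(image_tiles, mask_tiles)
            if int(np.count_nonzero(msk)) > min_crack_pixels]

def clip_gray(image, low=10, high=245):
    """Threshold-style denoising sketch: clip gray values outside [low, high] (bounds are hypothetical)."""
    return np.clip(image, low, high)

# Synthetic example: a 1200 x 1600 image with one 40-pixel-wide vertical crack band.
image = np.full((1200, 1600, 3), 128, dtype=np.uint8)
mask = np.zeros((1200, 1600), dtype=np.uint8)
mask[:, 100:140] = 255

pairs = keep_cracked(crop_tiles(image), crop_tiles(mask))
print(len(crop_tiles(image)), len(pairs))  # 16 4
```

In this synthetic example only the four tiles in the leftmost grid column contain the crack band, so only those four pairs are retained.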
[0041] Step 2: Construct an improved U-Net road crack segmentation network;
[0042] In step two, as shown in Figure 2, the improved U-Net network includes 5 stacked U-Net modules, 9 residual modules, 8 convolutional modules, 4 attention mechanism modules (CBAM), 4 pooling layers, 4 upsampling modules, 4 spatiotemporal sequence modules, and a first convolutional layer;
[0043] The input of the improved U-Net road crack segmentation network is used as the input of the first stacked U-Net module. The first stacked U-Net module is connected to the first residual module. The first residual module is connected to the first convolution module, the first pooling layer, and the first spatiotemporal sequence module. The first pooling layer is connected to the second residual module after passing through the second stacked U-Net module. The second residual module is connected to the second convolution module, the second pooling layer, and the second spatiotemporal sequence module. The second pooling layer is connected to the third residual module after passing through the third stacked U-Net module. The third residual module is connected to the third convolution module, the third pooling layer, and the third spatiotemporal sequence module. The third pooling layer is connected to the fourth residual module after passing through the fourth stacked U-Net module. The fourth residual module is connected to the fourth convolution module, the fourth pooling layer, and the fourth spatiotemporal sequence module. The fourth pooling layer is connected to the fifth residual module after passing through the fifth stacked U-Net module. The fifth residual module is connected to the fourth spatiotemporal sequence module. The output of the fifth residual module is channel-concatenated with the output of the fourth spatiotemporal sequence module and then input into the sixth residual module. The output of the sixth residual module is added to the output of the fourth convolution module after passing through the fourth attention mechanism module and then input into the fifth convolution module. The output of the fifth convolution module is used as the input of the third spatiotemporal sequence module. 
The output of the fifth convolution module is channel-concatenated with the output of the third spatiotemporal sequence module and then input into the seventh residual module.
[0044] The output of the seventh residual module and the output of the third convolution module are added together by the third attention mechanism module and then input into the sixth convolution module. The output of the sixth convolution module is used as the input of the second spatiotemporal sequence module. The output of the sixth convolution module is channel-concatenated with the output of the second spatiotemporal sequence module by the third upsampling module and then input into the eighth residual module.
[0045] The output of the eighth residual module is added to the output of the second convolution module via the second attention mechanism module and then input into the seventh convolution module. The output of the seventh convolution module is used as the input of the first spatiotemporal sequence module. The output of the seventh convolution module is channel-concatenated with the output of the first spatiotemporal sequence module via the output of the fourth upsampling module and then input into the ninth residual module.
[0046] The output of the ninth residual module and the output of the first convolutional module after passing through the first attention mechanism module CBAM are added together and then input into the eighth convolutional module. The output of the eighth convolutional module after passing through the first convolutional layer is used as the output of the improved U-Net road crack segmentation network. In specific implementation, the first to fifth pooling layers are max pooling.
[0047] The five stacked U-Net modules have the same structure. As shown in Figure 3, each stacked U-Net module includes a tenth residual module, an eleventh residual module, a ninth convolutional module, a fifth upsampling module, and a fifth pooling layer. The input of the stacked U-Net module is used as the input of the tenth residual module. The tenth residual module is connected to the fifth pooling layer and the ninth convolutional module. The fifth pooling layer is connected to the fifth upsampling module after passing through the eleventh residual module. The output of the fifth upsampling module and the output of the ninth convolutional module are concatenated and then used as the output of the stacked U-Net module.
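The wiring of a stacked U-Net module can be traced with a small shape-level sketch. This is an illustration only: the learned residual and convolutional sub-modules are replaced by identity placeholders, and 2x2 max pooling with nearest-neighbour upsampling is assumed, since the pooling and upsampling factors are not stated above. The function names are hypothetical.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling on an (H, W, C) array with even H and W."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def upsample_2x(x):
    """Nearest-neighbour 2x upsampling on an (H, W, C) array."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def stacked_unet_module(x, residual_a, residual_b, conv):
    """residual -> {pool -> residual -> upsample, conv} -> channel concatenation."""
    r = residual_a(x)                                 # tenth residual module
    deep = upsample_2x(residual_b(max_pool_2x2(r)))   # fifth pooling -> eleventh residual -> fifth upsampling
    shallow = conv(r)                                 # ninth convolutional module
    return np.concatenate([deep, shallow], axis=-1)   # concatenate along the channel axis

identity = lambda t: t  # placeholder standing in for each learned sub-module
x = np.random.default_rng(0).random((256, 256, 3))
y = stacked_unet_module(x, identity, identity, identity)
print(y.shape)  # (256, 256, 6)
```

With identity placeholders the output keeps the input resolution and has twice the channels, which shows why the pooled branch must be upsampled before concatenation.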
[0048] The nine residual modules have the same structure. As shown in Figure 4, each residual module includes four convolutional layers and a first SE module. The input of the residual module is used as the input of the second and third convolutional layers. The second convolutional layer is connected to the first SE module after passing through the fourth convolutional layer. The third convolutional layer is connected to the fifth convolutional layer. The output of the residual module is the sum of the output of the fifth convolutional layer and the output of the first SE module.
[0049] As shown in Figure 5, the residual module may include four convolutional layers and a second SE module. The input of the residual module is used as the input of the sixth, seventh, and eighth convolutional layers. The outputs of the seventh and eighth convolutional layers are channel-concatenated and then input into the ninth convolutional layer. The output of the ninth convolutional layer, after passing through the second SE module, is added to the output of the sixth convolutional layer, and the sum is used as the output of the residual module.
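As a hedged illustration of the first residual variant (Figure 4), the sketch below wires two convolutional branches and a Squeeze-and-Excitation block in NumPy. Several details are assumptions not given above: the convolutions are modelled as 1x1 channel-mixing matrices, activations between convolutions are omitted, the SE block follows the commonly published form (global average pooling, two dense layers, Sigmoid gating), and the reduction ratio `r`, the parameter names, and the random weights are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv1x1(x, w):
    """1x1 convolution on (H, W, C_in) with weights of shape (C_in, C_out)."""
    return x @ w

def se_block(x, w_down, w_up):
    """Squeeze-and-Excitation: global average pool, two dense layers, channel-wise rescale."""
    squeezed = x.mean(axis=(0, 1))                               # (C,)
    gates = sigmoid(np.maximum(squeezed @ w_down, 0.0) @ w_up)   # (C,), each value in (0, 1)
    return x * gates

def se_residual(x, p):
    """One branch: conv -> conv -> SE. Other branch: conv -> conv. Output: their sum."""
    main = se_block(conv1x1(conv1x1(x, p["w2"]), p["w4"]), p["se_down"], p["se_up"])
    skip = conv1x1(conv1x1(x, p["w3"]), p["w5"])
    return main + skip

rng = np.random.default_rng(0)
c_in, c_out, r = 3, 16, 4  # r is an assumed SE reduction ratio
params = {
    "w2": rng.normal(size=(c_in, c_out)), "w4": rng.normal(size=(c_out, c_out)),
    "w3": rng.normal(size=(c_in, c_out)), "w5": rng.normal(size=(c_out, c_out)),
    "se_down": rng.normal(size=(c_out, c_out // r)),
    "se_up": rng.normal(size=(c_out // r, c_out)),
}
out = se_residual(rng.random((32, 32, c_in)), params)
print(out.shape)  # (32, 32, 16)
```

Because the skip branch has its own convolutions, the two branches can change the channel count together, which a plain identity shortcut could not do.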
[0050] The four attention mechanism modules CBAM have the same structure. As shown in Figure 6, each attention mechanism module CBAM includes a channel attention module CAM and a spatial attention module SAM. The input of the attention mechanism module CBAM is used as the input of the channel attention module. The output of the channel attention module multiplied by the input of the attention mechanism module CBAM is called the intermediate output. The intermediate output is used as the input of the spatial attention module. The output of the spatial attention module multiplied by the intermediate output is used as the output of the attention mechanism module CBAM.
[0051] As shown in Figure 7(a), the channel attention module includes a tenth convolutional layer, two max pooling layers, two average pooling layers, and a first activation layer. The input of the channel attention module is used as the input of the first max pooling layer and the first average pooling layer. The first max pooling layer and the first average pooling layer are connected to the tenth convolutional layer, and the tenth convolutional layer is connected to the second max pooling layer and the second average pooling layer. The outputs of the second max pooling layer and the second average pooling layer are multiplied and then input into the first activation layer. The output of the first activation layer is used as the output of the channel attention module.
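[0052] The formula for calculating channel attention is:
[0053] $$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) = \sigma\big(W_1(W_0(F^c_{avg})) + W_1(W_0(F^c_{max}))\big)$$
[0054] where $\sigma$ denotes the Sigmoid activation function, $W_0 \in \mathbb{R}^{C/r \times C}$, and $W_1 \in \mathbb{R}^{C \times C/r}$.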
[0055] As shown in Figure 7(b), the spatial attention module includes a second max pooling layer, a second average pooling layer, an eleventh convolutional layer, and a second activation layer. The input of the spatial attention module is used as the input of the second max pooling layer. The second max pooling layer is connected to the second activation layer after passing through the second average pooling layer and the eleventh convolutional layer in sequence. The output of the second activation layer is used as the output of the spatial attention module.
[0056] The formula for calculating spatial attention is:
[0057] $$M_s(F) = \sigma\big(f^{7 \times 7}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])\big)$$
[0058] where $\sigma$ denotes the Sigmoid activation function and $f^{7 \times 7}$ denotes a convolution operation with a 7x7 kernel.
[0059] The overall calculation process of CBAM can be summarized as follows:
[0060] $$F' = M_c(F) \otimes F$$
[0061] $$F'' = M_s(F') \otimes F'$$
[0062] where $F''$ denotes the final output and $\otimes$ denotes element-wise multiplication, i.e., the values of two feature maps at the same position are multiplied directly.
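The two-stage attention computation (channel attention applied first, then spatial attention on the intermediate output) can be sketched in NumPy as follows. The sketch follows the commonly published CBAM formulation, in which the two pooled descriptors pass through a shared two-layer MLP and are summed before the Sigmoid; that choice, the reduction ratio `r`, the random weights, and all function names are assumptions for illustration. The dense weights are stored transposed relative to the usual column-vector convention because the code multiplies row vectors.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(f, w0, w1):
    """Shared two-layer MLP on avg- and max-pooled channel descriptors, summed, then Sigmoid."""
    mlp = lambda v: np.maximum(v @ w0, 0.0) @ w1
    return sigmoid(mlp(f.mean(axis=(0, 1))) + mlp(f.max(axis=(0, 1))))   # (C,)

def spatial_attention(f, kernel):
    """k x k convolution over the stacked channel-wise avg and max maps, then Sigmoid."""
    stacked = np.stack([f.mean(axis=-1), f.max(axis=-1)], axis=-1)       # (H, W, 2)
    k = kernel.shape[0]
    padded = np.pad(stacked, ((k // 2, k // 2), (k // 2, k // 2), (0, 0)))
    windows = np.lib.stride_tricks.sliding_window_view(padded, (k, k), axis=(0, 1))  # (H, W, 2, k, k)
    return sigmoid(np.einsum("hwckl,klc->hw", windows, kernel))          # (H, W)

def cbam(f, w0, w1, kernel):
    f1 = f * channel_attention(f, w0, w1)                  # intermediate output
    return f1 * spatial_attention(f1, kernel)[..., None]   # final output

rng = np.random.default_rng(0)
c, r = 16, 4  # r is an assumed reduction ratio
f = rng.random((32, 32, c))
out = cbam(f, rng.normal(size=(c, c // r)), rng.normal(size=(c // r, c)),
           rng.normal(size=(7, 7, 2)))
print(out.shape)  # (32, 32, 16)
```

Both attention maps lie in (0, 1), so the module can only rescale features downward; it changes neither the spatial size nor the channel count of its input.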
[0063] In practice, a network model for road crack segmentation is built on the TensorFlow deep learning framework, and the following improvements are made to U-Net:
[0064] For the downsampling process, a nested U-Net sampling method is proposed, in which an encoder-decoder structure is used in each downsampling process while retaining the original U-shaped symmetrical structure. Combined with residual neural networks, two residual modules based on the Squeeze-and-Excitation (SE) module are proposed and applied to the upsampling process of traditional U-Net and nested U-Net models. Finally, CBAM and an SE-based residual attention module, i.e., the spatiotemporal sequence module, are added to the skip connections to prevent gradient vanishing or gradient exploding problems caused by the increase of neural network model depth.
[0065] Step 3: Input the preprocessed road crack dataset into the improved U-Net road crack segmentation network for training to obtain the trained road crack segmentation network model;
[0066] The specific process of step three is as follows:
[0067] 1) Input the training set into the network model built on TensorFlow and begin training the road crack image segmentation network based on the improved U-Net. Train the network model through forward propagation, and optimize the parameters using the Adam optimizer. Set the hyperparameters of the Adam optimizer to their default values, the batch size to 16, the initial learning rate to 0.001, and the number of epochs to 300. Adjust the learning rate using a decay strategy with a decay rate of 0.95. Select a loss function, and perform backpropagation based on the error computed by the loss function to update the parameter values in the network model. Repeat the above process until the loss function value converges to the set range. Validate the obtained network model to obtain the optimal network model.
[0068] 2) During model training, a hybrid loss function based on binary cross-entropy and Jaccard loss is used to address the problems of traditional loss functions in road crack segmentation. The formula is as follows:
[0069]
[0070]
[0071]
[0072] In these formulas, the three loss terms are the overall loss function, the binary cross-entropy loss function used in binary classification tasks, and the Jaccard loss function; the remaining symbols denote the prediction result and the probability of belonging to the corresponding class; N is the total number of samples; and i is the sample index.
[0073] 3) The validation set is used to validate the network model during the training process. When a suitable learning cycle is found, i.e. when the network converges, the training is terminated early and the parameters of the trained network model are saved.
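A minimal sketch of two training ingredients named in step three, a hybrid loss built from binary cross-entropy and Jaccard terms and a decaying learning rate, is given below in plain Python. It is illustrative only and several choices are assumptions: the two loss terms are combined as an unweighted sum, the Jaccard term uses a soft (probability-based) form with a smoothing constant, and the decay rate of 0.95 is applied once per epoch. The text above does not specify the weighting, the exact Jaccard form, or the decay interval.

```python
import math

def bce_loss(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over N samples (pixels)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def jaccard_loss(y_true, y_prob, smooth=1.0):
    """Soft Jaccard loss: 1 - (intersection + smooth) / (union + smooth)."""
    inter = sum(y * p for y, p in zip(y_true, y_prob))
    union = sum(y_true) + sum(y_prob) - inter
    return 1.0 - (inter + smooth) / (union + smooth)

def hybrid_loss(y_true, y_prob):
    """Unweighted sum of the two terms (the weighting is an assumption)."""
    return bce_loss(y_true, y_prob) + jaccard_loss(y_true, y_prob)

def decayed_lr(epoch, initial_lr=0.001, decay_rate=0.95):
    """Exponential decay applied once per epoch (the interval is an assumption)."""
    return initial_lr * decay_rate ** epoch

labels = [1, 1, 0, 0]
good = [0.9, 0.8, 0.2, 0.1]
bad = [0.1, 0.2, 0.8, 0.9]
print(hybrid_loss(labels, good) < hybrid_loss(labels, bad))  # True
print(decayed_lr(0), decayed_lr(10))
```

The Jaccard term measures region overlap directly, so it is less dominated by the large number of background pixels than the cross-entropy term alone, which is one common reason for combining the two on thin structures such as cracks.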
[0074] Step 4: Use the trained road crack segmentation network model to segment the preprocessed road crack image to obtain the segmentation result map of the road crack.
[0075] This invention uses the Dice coefficient, the Jaccard coefficient, and accuracy, which are commonly used in computer image segmentation, as evaluation metrics. They are calculated as follows:
[0076] $$\mathrm{Dice} = \frac{2TP}{2TP + FP + FN}$$
[0077] $$\mathrm{Jaccard} = \frac{TP}{TP + FP + FN}$$
[0078] $$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
[0079] Here, TP (true positive) is the number of positive samples predicted as positive; FP (false positive) is the number of negative samples predicted as positive; FN (false negative) is the number of positive samples predicted as negative; and TN (true negative) is the number of negative samples predicted as negative. "True" means the detection result is correct, and "false" means the detection result is incorrect. "Positive" means the detection result shows the expected target, and "negative" means the detection result does not show the expected target.
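The three evaluation metrics can be computed from the confusion counts as in the following plain-Python sketch. It uses the standard definitions of the Dice coefficient, Jaccard coefficient, and accuracy in terms of TP, FP, FN, and TN; the function names and the toy label vectors are illustrative, and the sketch assumes at least one positive pixel so that no denominator is zero.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels (1 = crack, 0 = background)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def dice(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)

def jaccard(tp, fp, fn):
    return tp / (tp + fp + fn)

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)                                      # 2 1 1 4
print(round(dice(tp, fp, fn), 3), jaccard(tp, fp, fn),
      accuracy(tp, fp, fn, tn))                            # 0.667 0.5 0.75
```

Accuracy counts true negatives, so on crack images dominated by background it can stay high even when the crack itself is poorly segmented; Dice and Jaccard ignore TN and are therefore more sensitive to crack overlap.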
Claims
1. A road crack segmentation method based on an improved U-Net network, characterized in that, Includes the following steps: Step 1: Preprocess the original road crack dataset to obtain the preprocessed road crack dataset; Step 2: Construct an improved U-Net road crack segmentation network; Step 3: Input the preprocessed road crack dataset into the improved U-Net road crack segmentation network for training to obtain the trained road crack segmentation network model; Step 4: Use the trained road crack segmentation network model to segment the preprocessed road crack image to obtain the segmentation result image of the road crack. In step two, the improved U-Net network includes 5 stacked U-Net modules, 9 residual modules, 8 convolutional modules, 4 attention mechanism modules, 4 pooling layers, 4 upsampling modules, 4 spatiotemporal sequence modules, and a first convolutional layer; The input of the improved U-Net road crack segmentation network is used as the input of the first stacked U-Net module. The first stacked U-Net module is connected to the first residual module. The first residual module is connected to the first convolution module, the first pooling layer, and the first spatiotemporal sequence module. The first pooling layer is connected to the second residual module after passing through the second stacked U-Net module. The second residual module is connected to the second convolution module, the second pooling layer, and the second spatiotemporal sequence module. The second pooling layer is connected to the third residual module after passing through the third stacked U-Net module. The third residual module is connected to the third convolution module, the third pooling layer, and the third spatiotemporal sequence module. The third pooling layer is connected to the fourth residual module after passing through the fourth stacked U-Net module. 
The fourth residual module is connected to the fourth convolution module, the fourth pooling layer, and the fourth spatiotemporal sequence module. The fourth pooling layer is connected to the fifth residual module after passing through the fifth stacked U-Net module. The fifth residual module is connected to the fourth spatiotemporal sequence module. The output of the fifth residual module is channel-concatenated with the output of the fourth spatiotemporal sequence module and then input into the sixth residual module. The output of the sixth residual module is added to the output of the fourth convolution module after passing through the fourth attention mechanism module and then input into the fifth convolution module. The output of the fifth convolution module is used as the input of the third spatiotemporal sequence module. The output of the fifth convolution module is channel-concatenated with the output of the third spatiotemporal sequence module and then input into the seventh residual module. The output of the seventh residual module and the output of the third convolution module are added together by the third attention mechanism module and then input into the sixth convolution module. The output of the sixth convolution module is used as the input of the second spatiotemporal sequence module. The output of the sixth convolution module is channel-concatenated with the output of the second spatiotemporal sequence module by the third upsampling module and then input into the eighth residual module. The output of the eighth residual module is added to the output of the second convolution module via the second attention mechanism module and then input into the seventh convolution module. The output of the seventh convolution module is used as the input of the first spatiotemporal sequence module. 
The output of the seventh convolution module is channel-concatenated with the output of the first spatiotemporal sequence module via the output of the fourth upsampling module and then input into the ninth residual module. The output of the ninth residual module and the output of the first convolutional module after passing through the first attention mechanism module are added together and then input into the eighth convolutional module. The output of the eighth convolutional module after passing through the first convolutional layer is used as the output of the improved U-Net road crack segmentation network. The spatiotemporal sequence module is a residual attention module based on SE; The five stacked U-Net modules have the same structure. Each stacked U-Net module includes a tenth residual module, an eleventh residual module, a ninth convolutional module, a fifth upsampling module, and a fifth pooling layer. The input of the stacked U-Net module is used as the input of the tenth residual module. The tenth residual module is connected to the fifth pooling layer and the ninth convolutional module. The fifth pooling layer is connected to the fifth upsampling module after passing through the eleventh residual module. The output of the fifth upsampling module and the output of the ninth convolutional module are concatenated and used as the output of the stacked U-Net module.
2. The road crack segmentation method based on an improved U-Net network according to claim 1, characterized in that the nine residual modules have the same structure. Each residual module includes four convolutional layers and a first SE module. The input of the residual module is used as the input of the second and third convolutional layers. The second convolutional layer is connected to the first SE module through the fourth convolutional layer. The third convolutional layer is connected to the fifth convolutional layer. The output of the residual module is the sum of the output of the fifth convolutional layer and the output of the first SE module.
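The two-branch wiring of the residual module in claim 2 can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the SE module is written in its commonly used squeeze-and-excitation form (global average pooling, two fully connected layers with ReLU and sigmoid, channel-wise scaling), which the claim does not spell out, and the four convolutional layers are passed in as callables because their kernel sizes and weights are not given. All function and parameter names are hypothetical.

```python
import math


def se_block(x, w1, w2):
    """Squeeze-and-excitation on a C x H x W nested list (assumed standard form).

    w1: hidden x C weight matrix, w2: C x hidden weight matrix (illustrative values).
    """
    # Squeeze: global average pooling, one value per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # Excitation: fully connected + ReLU, then fully connected + sigmoid.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    scale = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale every channel by its weight.
    return [[[v * s for v in row] for row in ch] for ch, s in zip(x, scale)]


def add_maps(a, b):
    """Element-wise sum of two feature maps with identical shape."""
    return [[[p + q for p, q in zip(ra, rb)] for ra, rb in zip(ca, cb)]
            for ca, cb in zip(a, b)]


def residual_module(x, conv2, conv3, conv4, conv5, se):
    """Wiring of claim 2: (conv2 -> conv4 -> SE) summed with (conv3 -> conv5)."""
    main = se(conv4(conv2(x)))       # second conv -> fourth conv -> first SE module
    shortcut = conv5(conv3(x))       # third conv -> fifth conv
    return add_maps(shortcut, main)  # module output is the sum of both branches
```

Passing the layers as callables keeps the sketch independent of any particular deep learning framework; in practice each callable would be a learned convolution.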
3. The road crack segmentation method based on an improved U-Net network according to claim 1, characterized in that the residual module includes four convolutional layers and a second SE module. The input of the residual module is used as the input of the sixth, seventh, and eighth convolutional layers. The outputs of the seventh and eighth convolutional layers are channel-concatenated and then input into the ninth convolutional layer. The output of the ninth convolutional layer is obtained by adding the output of the second SE module to the output of the sixth convolutional layer.
4. The road crack segmentation method based on an improved U-Net network according to claim 1, characterized in that the four attention mechanism modules have the same structure. Each attention mechanism module includes a channel attention module and a spatial attention module. The input of the attention mechanism module is used as the input of the channel attention module. The product of the output of the channel attention module and the input of the attention mechanism module is called the intermediate output. The intermediate output is used as the input of the spatial attention module. The product of the output of the spatial attention module and the intermediate output is used as the output of the attention mechanism module. The channel attention module includes a tenth convolutional layer, two max pooling layers, two average pooling layers, and a first activation layer. The input of the channel attention module is used as the input of the first max pooling layer and the first average pooling layer. The first max pooling layer and the first average pooling layer are connected to the tenth convolutional layer. The tenth convolutional layer is connected to the second max pooling layer and the second average pooling layer. The outputs of the second max pooling layer and the second average pooling layer are multiplied and then input into the first activation layer. The output of the first activation layer is used as the output of the channel attention module. The spatial attention module includes a second max pooling layer, a second average pooling layer, an eleventh convolutional layer, and a second activation layer. The input of the spatial attention module is used as the input of the second max pooling layer. The second max pooling layer is connected to the second activation layer through the second average pooling layer and the eleventh convolutional layer. The output of the second activation layer is used as the output of the spatial attention module.
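The two multiplications that define the attention mechanism module in claim 4 can be sketched as follows. Only the outer wiring follows the claim (channel attention times input gives the intermediate output, spatial attention times the intermediate output gives the module output). The two inner functions are deliberately simplified stand-ins and are assumptions: they omit the tenth and eleventh convolutional layers and the second pooling stages, and use a sigmoid as the activation, none of which the claim specifies in detail. All names are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def channel_attention(x):
    """Stand-in: one weight per channel from global max and average pooling.

    The pooled values are multiplied before the activation, mirroring the
    claim's wording; the convolutional layer is omitted (assumption).
    """
    weights = []
    for ch in x:
        flat = [v for row in ch for v in row]
        weights.append(sigmoid(max(flat) * (sum(flat) / len(flat))))
    return weights


def spatial_attention(x):
    """Stand-in: one weight per pixel from the maximum across channels.

    The average pooling and convolutional layers are omitted (assumption).
    """
    h, w = len(x[0]), len(x[0][0])
    return [[sigmoid(max(ch[i][j] for ch in x)) for j in range(w)]
            for i in range(h)]


def attention_module(x, ca=channel_attention, sa=spatial_attention):
    """Wiring of claim 4 on a C x H x W nested list."""
    cw = ca(x)
    # Intermediate output: input scaled channel by channel.
    mid = [[[v * c for v in row] for row in ch] for ch, c in zip(x, cw)]
    sw = sa(mid)
    # Module output: intermediate output scaled pixel by pixel.
    return [[[v * sw[i][j] for j, v in enumerate(row)]
             for i, row in enumerate(ch)] for ch in mid]
```

The `ca` and `sa` parameters make it possible to substitute a fuller channel or spatial attention function without touching the wiring.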