A honeycomb structure air-coupled ultrasonic defect identification method based on dilated convolution

By combining dilated convolution and depthwise separable convolution into an encoder-decoder network model, the problems of ambiguity and noise interference in defect identification in air-coupled ultrasonic testing are solved, achieving high-precision and efficient defect identification that is adaptable to defect identification of different scales and types.

CN122243874A · Pending · Publication Date: 2026-06-19 · BEIHANG UNIV +1


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: BEIHANG UNIV
Filing Date: 2026-02-10
Publication Date: 2026-06-19


IPC: G06T7/00; G06V10/774; G06V10/764; G06V10/40; G06V10/82; G06N3/045

AI Tagging

Application Domain

Image analysis; Character and pattern recognition

Technology Topics

Pattern recognition; Data set


AI Technical Summary

Technical Problem

Existing air-coupled ultrasonic testing technology has difficulty effectively distinguishing defects from background noise in composite honeycomb sandwich structures, resulting in high rates of missed detection and false detection. Furthermore, its generalization ability is insufficient, making it unable to adapt to the needs of defect identification of different types and scales.

Method used

An encoder-decoder network model based on dilated convolution is adopted. By combining dilated convolution and depthwise separable convolution, multi-scale feature extraction and boundary optimization are achieved, and an encoder-decoder network is built for defect identification.

Benefits of technology

It achieves accurate segmentation and recognition of defect features, reduces computational load, improves recognition accuracy and robustness, solves the problems of blurred defect boundaries and noise interference, and is adaptable to defect recognition of different scales and types.


Abstract

This invention discloses a method for air-coupled ultrasonic defect identification of honeycomb structures based on dilated convolution, belonging to the field of air-coupled ultrasonic defect detection technology for honeycomb sandwich structures. The method includes the following steps: acquiring detection images of the honeycomb sandwich structure using an air-coupled ultrasonic detection device; processing the air-coupled ultrasonic detection images to obtain a dataset for model training; constructing an encoder-decoder network based on dilated convolution; training the dilated convolution-based encoder-decoder network using the acquired dataset; and automatically identifying defects in new air-coupled ultrasonic detection images using the trained encoder-decoder network. This invention flexibly controls the resolution of the encoded features through dilated convolution and reduces the computational load to about 1 / 8 of that of standard convolution, solving the problem of balancing high accuracy and real-time performance, and achieving a dynamic balance between the accuracy and operational efficiency of defect identification in air-coupled ultrasonic detection images.


Description

Technical Field

[0001] This invention relates to the field of air-coupled ultrasonic defect detection technology for honeycomb sandwich structures, and specifically to a method for identifying air-coupled ultrasonic defects in honeycomb sandwich structures based on an encoder-decoder network with dilated convolution.

Background Technology

[0002] Composite honeycomb sandwich structures, with their high strength, low weight, and excellent fatigue resistance, have become core load-bearing components for high-end equipment such as aerospace vehicles and high-speed transportation vehicles. Honeycomb sandwich structures are composed of multiple layers of anisotropic skins and honeycomb cores such as aluminum or aramid. While their heterogeneous layered characteristics provide lightweight structural advantages, they also make the structures prone to defects such as panel-core debonding and interlayer delamination. These hidden defects can easily propagate under complex loads, potentially causing performance failure and even threatening overall structural safety.

[0003] Air-coupled ultrasonic testing has become a primary inspection method for composite honeycomb sandwich structures due to its high sensitivity and resolution, lack of liquid coupling requirement, and adaptability to complex structures. However, current air-coupled ultrasonic testing images generally suffer from problems such as blurred defect features, severe background noise interference, and discontinuous defect boundaries. Conventional feature recognition methods based on grayscale thresholding and edge detection operators are unable to effectively distinguish defects from background noise, resulting in high false negative and false positive rates. Furthermore, the methods lack generalization ability and cannot adapt to the needs of identifying defects of different types and scales.

[0004] Deep learning-based encoder-decoder network models automatically learn deep semantic features of images, demonstrating powerful multi-scale feature extraction capabilities and boundary optimization performance in semantic segmentation tasks. This significantly improves the accuracy and robustness of target recognition in complex scenes, providing a new technical approach for defect identification in ultrasonic inspection images. Therefore, this invention proposes a defect identification method for air-coupled ultrasonic inspection images using an encoder-decoder network model based on dilated convolution, achieving accurate segmentation and effective identification of defect features in air-coupled ultrasonic inspection images.

Summary of the Invention

[0005] To achieve the above objectives, the technical solution of the present invention is as follows: A method for identifying air-coupled ultrasonic defects in a honeycomb sandwich structure based on a dilated convolution encoder-decoder network, comprising the following steps:

[0006] Step 1) Acquire inspection images of the honeycomb sandwich structure using an air-coupled ultrasonic testing device;

[0007] Step 2) Process the air-coupled ultrasonic inspection images to obtain the dataset for model training;

[0008] Step 3) Construct an encoder-decoder network based on dilated convolution;

[0009] Step 4) Train the dilated convolution-based encoder-decoder network using the acquired dataset;

[0010] Step 5) Use the trained encoder-decoder network to automatically identify defects in new air-coupled ultrasonic inspection images.

[0011] Preferably, step 2) specifically involves selecting an equal number of normal images and images with defects from the air-coupled ultrasound detection images as an image set, and then dividing the image set into a test set and a training set. Both the test set and the training set include an equal number of normal images and images with defects. The test set and the training set constitute the dataset of the encoder-decoder network model.
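The class-balanced split described in step 2) can be sketched as follows. This is a minimal illustration only: the file names, the 20% test fraction, and the function name `balanced_split` are assumptions that do not appear in the patent text, which specifies only that both sets contain equal numbers of normal and defective images.

```python
import random


def balanced_split(normal, defective, test_fraction=0.2, seed=0):
    """Split two image lists into class-balanced training and test sets.

    Labels: 0 = normal image, 1 = image with defects.
    """
    n = min(len(normal), len(defective))      # equal number from each class
    n_test = int(round(n * test_fraction))    # per-class test count (assumed ratio)
    rng = random.Random(seed)
    normal_sel = rng.sample(normal, n)
    defect_sel = rng.sample(defective, n)
    test = [(p, 0) for p in normal_sel[:n_test]] + [(p, 1) for p in defect_sel[:n_test]]
    train = [(p, 0) for p in normal_sel[n_test:]] + [(p, 1) for p in defect_sel[n_test:]]
    return train, test


# Hypothetical file names, used only to exercise the function.
normal_images = [f"normal_{i:03d}.png" for i in range(60)]
defect_images = [f"defect_{i:03d}.png" for i in range(50)]
train_set, test_set = balanced_split(normal_images, defect_images)
```

With 60 normal and 50 defective images, 50 images per class are kept, giving 40 + 40 training samples and 10 + 10 test samples.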

[0012] Preferably, in step 3), the encoder-decoder network model based on dilated convolution includes an encoder module and a decoder module. The encoder module includes a backbone network and a pyramid pooling network. The backbone network and the pyramid pooling network extract shallow and deep features of the air-coupled ultrasonic image, respectively. The decoder module recovers spatial information through an upsampling module and fuses the shallow features extracted by the encoder module.

[0013] Preferably, the backbone network includes several inverted residual blocks. Each inverted residual block includes a channel expansion layer, a feature extraction layer, and a channel compression layer. The channel expansion layer uses 1×1 convolutions to expand the number of input channels to a higher dimension. The feature extraction layer performs channel-wise feature extraction and spatial filtering on high-dimensional features through depthwise separable convolutions. The channel compression layer uses 1×1 convolutions to compress the number of channels back to a lower dimension. In each inverted residual block, if the input and output dimensions are the same, a skip connection directly adds the input to the output.

[0014] Preferably, the pyramid pooling network includes dilated convolution, which changes the sampling interval of the convolution kernel by introducing a dilation rate r in the convolution kernel. By combining convolutional layers with different dilation rates, the extraction of feature information of multi-scale detection images can be achieved.

[0015] For a two-dimensional ultrasound image, given a position i on the output feature map y and a convolution kernel w, the calculation expression for dilated convolution on the input feature map x is as follows:

[0016] $$y[i] = \sum_{k} x[i + r \cdot k]\, w[k] \tag{1}$$

[0017] The dilation rate r determines the sampling step size of the input signal. When r=1, the dilated convolution is the standard convolution. Increasing the value of r can adaptively expand the receptive field of the convolution kernel and capture defect features at a larger scale. Decreasing the value of r focuses on local details and optimizes the accuracy of defect boundary recognition. A 3×3 convolution kernel is used.

[0018] Preferably, the pyramid pooling network includes atrous spatial pyramid pooling (ASPP). The ASPP module consists of a 1×1 convolutional layer, three 3×3 dilated convolutional layers, and a global average pooling layer. The 1×1 standard convolutional layer is used to capture the local information of the shallowest layer. The three 3×3 dilated convolutions with dilation rates of 6, 12, and 18 act on the input feature map in parallel, so that each convolutional kernel has a different receptive field and captures defect features at a different scale. The global average pooling is used to capture the background or scene information of the entire image and summarize the information of the entire feature map into a single global feature. The feature maps of the different branches generated by the ASPP module are merged by concatenation and then integrated by a 1×1 convolution. Finally, batch normalization and a ReLU activation function are applied.

[0019] Preferably, the pyramid pooling network includes depthwise separable convolution, which decomposes standard convolution into depthwise convolution and pointwise convolution. Depthwise convolution performs spatial convolution independently on each input channel, focusing on local feature extraction. Pointwise convolution achieves information fusion between each input channel through a 1×1 convolution kernel, reconstructing the feature dimension. While maintaining detection accuracy, depthwise separable convolution performs sparse sampling and cross-channel fusion through depthwise convolution and pointwise convolution, reducing the number of model parameters and computation to about 1 / 8 to 1 / 9 of that of standard convolution. Dilated separable convolution is applied to the ASPP module, setting each branch in the module to a 3×3 dilated depthwise convolution and a 1×1 pointwise convolution.

[0020] Preferably, in the decoder module, the low-resolution feature map containing deep information from the encoder output is upsampled by 4 times using bilinear interpolation, so that its spatial resolution is the same as that of the shallow features output by the backbone network; 1×1 convolution is used to reduce the dimensionality of the shallow features, reducing the number of channels from 256 / 512 to 48; in the decoder module, the shallow feature map is concatenated with the upsampled deep semantic feature map in depth dimension; after feature fusion, the fused feature map is further processed by two 3×3 convolutions; finally, the fused feature map is upsampled by 4 times using bilinear interpolation to restore the feature map to the same resolution as the original input image.

[0021] Preferably, the specific process of step 4) is as follows: initialize all parameters of the encoder-decoder network and input the training set and test set into the network; perform forward propagation, calculate the accuracy of the network on the test set from the results, and determine whether the accuracy has reached a predetermined value or whether a predetermined number of training iterations has been reached; if either condition is met, end the training; otherwise, perform backpropagation to calculate the updates to the weights and biases, update the weights and biases accordingly, substitute the updated parameters into the network, and perform forward propagation again, repeating this cycle so that the error decreases continuously.
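The stopping logic of step 4) can be sketched as a small control loop. This is only an illustration of the two stop criteria (target accuracy or maximum number of iterations): the names `train_until`, `evaluate`, `update`, and the `ToyModel` stand-in are hypothetical, and the toy model simply gains 10 percentage points of accuracy per update instead of performing real forward and backward passes.

```python
def train_until(evaluate, update, target_accuracy, max_iterations):
    """Evaluate, check the stop criteria, and otherwise update the parameters."""
    history = []
    for iteration in range(max_iterations):
        accuracy = evaluate()            # forward pass and test-set accuracy
        history.append(accuracy)
        if accuracy >= target_accuracy:  # first criterion: accuracy reached
            return "accuracy", iteration, history
        update()                         # backpropagation and parameter update
    # second criterion: predetermined number of iterations reached
    return "max_iterations", max_iterations, history


class ToyModel:
    """Hypothetical stand-in for the network; accuracy is tracked in percent."""

    def __init__(self):
        self.accuracy_percent = 50

    def evaluate(self):
        return self.accuracy_percent

    def update(self):
        self.accuracy_percent = min(100, self.accuracy_percent + 10)


model = ToyModel()
reason, updates, history = train_until(
    model.evaluate, model.update, target_accuracy=90, max_iterations=20
)
```

In this toy run the loop stops on the accuracy criterion after four updates; lowering `max_iterations` below that makes it stop on the iteration limit instead.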

[0022] Preferably, step 5) involves importing the new air-coupled ultrasonic detection image into the trained encoder-decoder network model to achieve automatic identification of defects in the detection image.

[0023] Compared with the prior art, the beneficial effects of the present invention are as follows:

[0024] (1) By using dilated convolution to flexibly control the resolution of the encoded features and reducing the computational load to about 1 / 8 of that of standard convolution, the problem of balancing high accuracy with real-time performance is solved, and a dynamic balance between defect recognition accuracy and operating efficiency is achieved.

[0025] (2) The encoder-decoder network model based on dilated convolution is applied to defect recognition in air-coupled ultrasonic inspection images of honeycomb sandwich structures. Its multi-scale feature extraction capability and boundary optimization advantage resolve the blurring and noise interference in these images, so that defects such as debonding and delamination are accurately identified.

Description of the Drawings

[0026] Figure 1 is a schematic diagram of the encoder-decoder network structure based on dilated convolution;

[0027] Figure 2 compares the original air-coupled ultrasonic testing image of a honeycomb sandwich structure with the defect identification results, where (a) is the amplitude-based detection image; (b) is the Canny edge detection result; and (c) is the defect identification result of this method.

Detailed Implementation

[0028] The present invention will be further illustrated below with reference to the accompanying drawings and specific embodiments. It should be understood that the following specific embodiments are for illustrative purposes only and are not intended to limit the scope of the invention.

[0029] Example:

[0030] This method uses an air-coupled ultrasonic testing system to scan and image a honeycomb sandwich structure, acquiring ultrasonic C-scan images containing defect information. After preprocessing such as normalization and preliminary noise suppression, the images are input into a dilated convolution-based encoder-decoder network model, which extracts semantic information from the inspection images and restores the details of defect boundaries. This solves the problem of blurred defect boundaries in air-coupled ultrasonic testing images and achieves efficient segmentation and accurate identification of defect features. The dilated convolution-based encoder-decoder network model consists of an encoder that progressively reduces the feature maps and extracts higher-level semantic information, and a decoder that progressively restores spatial information. A schematic diagram is shown in Figure 1.

[0031] Furthermore, the encoder consists of two modules: a backbone network and a pyramid pooling module, which extract shallow and deep features from air-coupled ultrasound images, respectively.

[0032] Furthermore, this method uses MobileNet V2 as the backbone network to extract shallow features from air-coupled ultrasonic images. This network consists of a series of inverted residual blocks. Unlike the residual structure of ResNet, an inverted residual block is wide in the middle and narrow at both ends, which improves training speed and accuracy. Each inverted residual block consists of three parts: channel expansion, feature extraction, and channel compression. The channel expansion layer uses 1×1 convolutions to expand the number of input channels to a higher dimension; the feature extraction layer performs channel-wise feature extraction and spatial filtering on the high-dimensional features using depthwise separable convolutions, reducing computation and parameter count compared to standard convolution operations; the channel compression layer uses 1×1 convolutions to compress the number of channels back to a lower dimension, removing redundant information. In each inverted residual block, if the input and output dimensions are the same, a skip connection directly adds the input to the output. This not only improves the network's expressive power but also keeps the number of parameters and the computational complexity low, enabling the network to maintain high efficiency while possessing strong feature extraction capabilities.
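The expand, filter, compress structure of an inverted residual block can be illustrated by counting its convolution weights. The sketch below is not taken from the patent: the expansion factor of 6, the 32-channel example, the stride argument, and the function name are assumptions chosen for illustration, and bias and batch normalization parameters are ignored.

```python
def inverted_residual_summary(c_in, c_out, expansion=6, stride=1, kernel=3):
    """Channel widths and weight counts of one inverted residual block."""
    c_mid = c_in * expansion                       # widened middle section
    weights = {
        "expand_1x1": c_in * c_mid,                # channel expansion
        "depthwise_3x3": kernel * kernel * c_mid,  # one k x k filter per channel
        "project_1x1": c_mid * c_out,              # channel compression
    }
    return {
        "channels": (c_in, c_mid, c_out),
        "weights": weights,
        "total_weights": sum(weights.values()),
        # The skip connection needs identical input and output dimensions.
        "skip_connection": stride == 1 and c_in == c_out,
    }


block = inverted_residual_summary(c_in=32, c_out=32)
```

For the assumed 32-channel block the middle section is 192 channels wide, yet the depthwise stage contributes only 1,728 of the 14,016 weights, which is why the widening remains cheap.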

[0033] Furthermore, pyramid pooling includes:

[0034] Pyramid pooling includes dilated convolution.

[0035] As an extension of standard convolution, dilated convolution changes the sampling interval of the convolution kernel by introducing a dilation rate *r*, allowing for flexible adjustment of the receptive field of the kernel without increasing network parameters. By combining convolutional layers with different dilation rates, multi-scale detection image feature extraction can be achieved.

[0036] For a two-dimensional ultrasound image, given a position i on the output feature map y and a convolution kernel w, the calculation expression for dilated convolution on the input feature map x is as follows:

[0037] $$y[i] = \sum_{k} x[i + r \cdot k]\, w[k] \tag{1}$$

[0038] The dilation rate *r* determines the sampling stride of the input signal. When *r* = 1, dilated convolution reduces to standard convolution. Increasing the value of *r* expands the receptive field of the convolution kernel, capturing defect features at a larger scale; decreasing the value of *r* focuses on local details, optimizing the accuracy of defect boundary recognition. This method uses a 3×3 convolution kernel and adjusts the dilation rate by designing parallel or serial structures with different *r* values to extract defect features at different scales, from micro-delamination to large-area debonding.
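The effect of the dilation rate can be seen in a direct, unoptimized implementation of dilated convolution. The sketch below is only an illustration: it assumes zero padding, a square odd-sized kernel centred on the output position, and the cross-correlation convention used by deep learning frameworks; the function and variable names are not from the patent.

```python
def dilated_conv2d(x, w, r=1):
    """y[i, j] = sum over (m, n) of x[i + r*m, j + r*n] * w[m, n], zero padded."""
    height, width = len(x), len(x[0])
    k = len(w)                 # square kernel with odd side length
    half = k // 2
    y = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            acc = 0.0
            for m in range(k):
                for n in range(k):
                    ii = i + r * (m - half)   # sampled row, spaced by r
                    jj = j + r * (n - half)   # sampled column, spaced by r
                    if 0 <= ii < height and 0 <= jj < width:
                        acc += x[ii][jj] * w[m][n]
            y[i][j] = acc
    return y


# A single bright pixel in a 7x7 image, filtered with a 3x3 kernel of ones.
impulse = [[0.0] * 7 for _ in range(7)]
impulse[3][3] = 1.0
ones = [[1.0] * 3 for _ in range(3)]

y_r1 = dilated_conv2d(impulse, ones, r=1)  # responses fill a 3x3 neighbourhood
y_r2 = dilated_conv2d(impulse, ones, r=2)  # same 9 taps spread over a 5x5 area
```

Both calls use nine kernel weights; with r = 2 the taps are two pixels apart, so the kernel covers a larger area without any additional parameters.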

[0039] Furthermore, pyramid pooling includes atrous spatial pyramid pooling.

[0040] The Atrous Spatial Pyramid Pooling (ASPP) module mainly consists of a 1×1 convolutional layer, three 3×3 dilated convolutional layers, and a global average pooling layer. It expands the receptive field, fuses multi-scale features, and thus improves semantic segmentation accuracy. The 1×1 standard convolutional layer captures the shallowest local information, reduces the number of channels in the feature map, and decreases computational cost. The three 3×3 dilated convolutions with dilation rates of 6, 12, and 18 operate in parallel on the input feature map, allowing each convolutional kernel to have a different receptive field and capture defect features at different scales. The global average pooling layer captures background or scene information from the entire image, summarizing the information from the entire feature map into a single global feature.

[0041] The feature maps generated by the ASPP module from different branches are merged together by concatenation, and then integrated by a 1×1 convolution. Finally, batch normalization and ReLU activation function are used for further processing to ensure that the features can be effectively utilized in the decoding layer.
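The different receptive fields of the ASPP branches follow from the effective size of a dilated kernel, k + (k − 1)(r − 1). The sketch below tabulates this for the five branches and the channel count after concatenation; the 256 channels per branch and the function names are assumptions for illustration, since the patent text does not state the branch width.

```python
ASPP_DILATION_RATES = (6, 12, 18)  # dilation rates given in the description


def effective_kernel_size(kernel, rate):
    """Side length of the area covered by a dilated kernel."""
    return kernel + (kernel - 1) * (rate - 1)


def aspp_branch_table(branch_channels=256):
    """List the ASPP branches with their effective kernel sizes."""
    branches = [("1x1 conv", 1)]
    branches += [
        (f"3x3 conv, r={r}", effective_kernel_size(3, r))
        for r in ASPP_DILATION_RATES
    ]
    branches.append(("global average pooling", None))  # covers the whole map
    concat_channels = branch_channels * len(branches)  # input to the 1x1 fusion
    return branches, concat_channels


branches, concat_channels = aspp_branch_table()
```

Under these assumptions the three dilated branches cover 13×13, 25×25, and 37×37 areas, and the fusing 1×1 convolution receives 1,280 concatenated channels.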

[0042] Furthermore, pyramid pooling includes depthwise separable convolution.

[0043] Depthwise separable convolution decomposes standard convolution into depthwise convolution (also known as channel-wise convolution) and pointwise convolution (1×1 convolution). Depthwise convolution performs spatial convolution independently on each input channel, focusing on local feature extraction; pointwise convolution achieves information fusion between each input channel through a 1×1 convolution kernel, reconstructing the feature dimension. While maintaining detection accuracy, depthwise separable convolution reduces the number of model parameters and computational cost to about 1 / 8 to 1 / 9 of standard convolution by performing sparse sampling and cross-channel fusion through depthwise and pointwise convolution.
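The figure of about 1 / 8 to 1 / 9 can be checked with a short calculation. For a k×k kernel, a standard convolution needs k²·C_in·C_out weights, while the separable form needs k²·C_in (depthwise) plus C_in·C_out (pointwise), so the ratio is 1 / C_out + 1 / k². The sketch below evaluates this for a 3×3 kernel; the 256-channel layer is an assumed example, not a value from the patent.

```python
def separable_cost_ratio(c_out, kernel=3):
    """Closed-form ratio of separable to standard convolution cost."""
    return 1.0 / c_out + 1.0 / (kernel * kernel)


def conv_weights(c_in, c_out, kernel=3):
    """Weight counts of a standard and a depthwise separable convolution."""
    standard = kernel * kernel * c_in * c_out
    separable = kernel * kernel * c_in + c_in * c_out  # depthwise + pointwise
    return standard, separable


standard, separable = conv_weights(256, 256)
ratio = separable / standard
```

For the assumed 256-channel layer the ratio lies between 1 / 9 and 1 / 8. It approaches 1 / 9 as the number of output channels grows, and the same ratio applies to multiply-accumulate operations because both counts scale with the feature map size.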

[0044] Within the PyTorch framework, depthwise convolution supports dilated convolutions, and depthwise separable convolutions can thus form dilated separable convolutions. The original ASPP module uses multiple standard 3×3 dilated convolutions with different dilation rates in parallel on top of spatial pyramid pooling, resulting in high computational cost. Therefore, this method applies dilated separable convolutions to the ASPP module, replacing the 3×3 dilated convolutions (r=6, 12, 18) in each branch with 3×3 dilated depthwise convolutions and 1×1 pointwise convolutions. Without sacrificing detection accuracy, this reduces the computational burden of the ASPP module, maintains multi-scale feature extraction capabilities, and significantly improves model efficiency.

[0045] Furthermore, in the decoder module, although the model captures rich multi-scale contextual information in the encoder through the backbone network and ASPP module, the feature map resolution is low due to the downsampling process in the convolution operation. This results in insufficient detail representation of defect boundaries, easily leading to blurry or inaccurate edge segmentation. To address the low-resolution feature maps, the decoder recovers spatial information through upsampling and fuses the shallow features extracted from the encoder. This improves the precision of semantic segmentation while maintaining high accuracy, thus enhancing the segmentation effect on defect boundaries. The specific feature processing flow of the decoder is as follows:

[0046] (1) Upsampling.

[0047] The decoder module upsamples the low-resolution feature map containing deep information from the encoder output by a factor of 4 using bilinear interpolation, so that its spatial resolution is the same as that of the shallow features output by the backbone network, in order to facilitate further fusion and processing.

[0048] (2) 1×1 convolution dimensionality reduction.

[0049] The shallow features output by the encoder backbone network usually contain a large number of channels. To prevent this large number of shallow feature channels from obscuring the semantic information of the encoded features, a 1×1 convolution is used to reduce the dimensionality of the shallow features, reducing the number of channels from 256 / 512 to 48 and thereby also reducing the computational complexity.

[0050] (3) Feature concatenation.

[0051] The shallow features output by the encoder backbone network retain a significant amount of spatial information, providing good preservation of details such as image edges and contours. By concatenating the shallow feature maps with the upsampled deep semantic feature maps along the depth dimension, the deep semantic information and shallow spatial information are fully integrated, preserving both global information and restoring defect boundary details.

[0052] (4) 3×3 convolution.

[0053] After feature fusion, the fused feature map is further processed by two 3×3 convolutions to reduce the aliasing effect that may be caused by feature fusion and gradually refine the defect segmentation boundary.

[0054] (5) Upsampling.

[0055] Finally, the fused feature map is upsampled by 4 times using bilinear interpolation to restore the feature map to the same resolution as the original input image, ensuring that the defect segmentation result can accurately correspond to every pixel in the input image, thus achieving pixel-by-pixel semantic segmentation.

[0056] The decoder enables the model to generate finer defect edges and details while maintaining high semantic understanding capabilities, thereby improving the performance of semantic segmentation tasks.
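The five decoder steps above can be followed as a trace of tensor shapes. Because the final ×4 upsampling restores the input resolution and the first ×4 upsampling matches the shallow features, the shallow features sit at 1/4 and the deep features at 1/16 of the input resolution. The sketch below is illustrative only: the 256 deep channels, the 256 channels after the 3×3 convolutions, and the function name are assumptions, while the reduction to 48 channels and the two ×4 upsamplings come from the description.

```python
def decoder_shape_trace(height, width, deep_channels=256,
                        shallow_channels=256, refined_channels=256):
    """Return (step, (channels, height, width)) pairs for the decoder."""
    assert height % 16 == 0 and width % 16 == 0, "input must be divisible by 16"
    deep = (deep_channels, height // 16, width // 16)       # encoder output
    shallow = (shallow_channels, height // 4, width // 4)   # backbone features

    deep_up = (deep[0], deep[1] * 4, deep[2] * 4)           # (1) bilinear x4
    shallow_red = (48, shallow[1], shallow[2])              # (2) 1x1 conv to 48
    fused = (deep_up[0] + shallow_red[0], deep_up[1], deep_up[2])  # (3) concat
    refined = (refined_channels, fused[1], fused[2])        # (4) two 3x3 convs
    output = (refined[0], refined[1] * 4, refined[2] * 4)   # (5) bilinear x4
    return [
        ("upsample deep x4", deep_up),
        ("reduce shallow to 48", shallow_red),
        ("concatenate", fused),
        ("two 3x3 convolutions", refined),
        ("upsample x4", output),
    ]


trace = decoder_shape_trace(512, 512)
```

For a 512×512 input, the deep map grows from 32×32 to 128×128, is concatenated with the 48-channel shallow map into 304 channels, and is finally restored to 512×512.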

[0057] Figure 2 compares the original air-coupled ultrasonic testing image of a honeycomb sandwich structure with the defect identification results. (a) is the conventional amplitude-based air-coupled ultrasonic testing image, (b) is the conventional Canny edge detection result, and (c) is the defect identification result of the proposed method. As shown in the figure, the conventional Canny edge detection method is affected by ultrasonic image noise and captures only fragments of some defect edges, failing to fully characterize the defect morphology; it detected 9 defects and missed 5 (5 of 14), a false negative rate exceeding 35%. The proposed method accurately segments the defect regions and completely restores the shape and boundary of each defect.

[0058] It should be noted that the above content merely illustrates the technical concept of the present invention and should not be construed as limiting the scope of protection of the present invention. For those skilled in the art, various improvements and modifications can be made without departing from the principle of the present invention, and all such improvements and modifications fall within the scope of protection of the claims of the present invention.

Claims

1. A method for identifying air-coupled ultrasonic defects in a honeycomb structure based on dilated convolution, characterized in that the method includes the following steps: Step 1) acquire inspection images of the honeycomb sandwich structure using an air-coupled ultrasonic testing device; Step 2) process the air-coupled ultrasonic inspection images to obtain the dataset for model training; Step 3) construct an encoder-decoder network based on dilated convolution; Step 4) train the dilated convolution-based encoder-decoder network using the acquired dataset; Step 5) use the trained encoder-decoder network to automatically identify defects in new air-coupled ultrasonic inspection images.

2. The method according to claim 1, characterized in that step 2) involves selecting an equal number of normal images and images with defects from the air-coupled ultrasonic inspection images as an image set. The image set is then divided into a test set and a training set. Both the test set and the training set include an equal number of normal images and images with defects. The test set and the training set together form the dataset for the encoder-decoder network model.

3. The method according to claim 1, characterized in that, in step 3), the encoder-decoder network model based on dilated convolution includes an encoder module and a decoder module. The encoder module includes a backbone network and a pyramid pooling network. The backbone network and the pyramid pooling network extract shallow and deep features of the air-coupled ultrasonic image, respectively. The decoder module recovers spatial information through an upsampling module and fuses the shallow features extracted by the encoder module.

4. The method according to claim 3, characterized in that the backbone network includes several inverted residual blocks, each containing a channel expansion layer, a feature extraction layer, and a channel compression layer. The channel expansion layer uses 1×1 convolutions to expand the number of input channels to a higher dimension. The feature extraction layer performs channel-wise feature extraction and spatial filtering on the high-dimensional features using depthwise separable convolutions. The channel compression layer uses 1×1 convolutions to compress the number of channels back to a lower dimension. In each inverted residual block, if the input and output dimensions are the same, a skip connection directly adds the input to the output.

5. The method according to claim 3, characterized in that the pyramid pooling network includes dilated convolution, which changes the sampling interval of the convolution kernel by introducing a dilation rate r in the convolution kernel. By combining convolutional layers with different dilation rates, the extraction of feature information of multi-scale detection images can be achieved. For a two-dimensional ultrasonic image, given a position i on the output feature map y and a convolution kernel w, the calculation expression for dilated convolution on the input feature map x is as follows: $y[i] = \sum_{k} x[i + r \cdot k]\, w[k]$ (1). The dilation rate r determines the sampling step size of the input signal. When r=1, the dilated convolution is the standard convolution. Increasing the value of r expands the receptive field of the convolution kernel and captures defect features at a larger scale. Decreasing the value of r focuses on local details and optimizes the accuracy of defect boundary recognition. A 3×3 convolution kernel is used.

6. The method according to claim 5, characterized in that the pyramid pooling network includes atrous spatial pyramid pooling (ASPP), which consists of a 1×1 convolutional layer, three 3×3 dilated convolutional layers, and a global average pooling layer. The 1×1 standard convolutional layer is used to capture the local information of the shallowest layer. The three 3×3 dilated convolutions with dilation rates of 6, 12, and 18 act on the input feature map in parallel, so that each convolutional kernel has a different receptive field and captures defect features at a different scale. The global average pooling is used to capture the background or scene information of the entire image and summarize the information in the entire feature map into a single global feature. The feature maps of the different branches generated by the ASPP module are merged by concatenation, then integrated by a 1×1 convolution, and finally processed by batch normalization and a ReLU activation function.

7. The method according to claim 6, characterized in that the pyramid pooling network includes depthwise separable convolution, which decomposes standard convolution into depthwise convolution and pointwise convolution. Depthwise convolution performs spatial convolution independently on each input channel, focusing on local feature extraction. Pointwise convolution achieves information fusion between the input channels through a 1×1 convolution kernel, reconstructing the feature dimension. While maintaining detection accuracy, depthwise separable convolution performs sparse sampling and cross-channel fusion through depthwise convolution and pointwise convolution, reducing the number of model parameters and the computation to about 1 / 8 to 1 / 9 of those of standard convolution. Dilated separable convolution is applied to the ASPP module, setting each branch in the module to a 3×3 dilated depthwise convolution and a 1×1 pointwise convolution.

8. The method according to claim 3, characterized in that, in the decoder module, the low-resolution feature map containing deep information from the encoder output is upsampled by a factor of 4 using bilinear interpolation, so that its spatial resolution is the same as that of the shallow features output by the backbone network. A 1×1 convolution is used to reduce the dimensionality of the shallow features, reducing the number of channels from 256 / 512 to 48. The shallow feature map is concatenated with the upsampled deep semantic feature map along the depth dimension. After feature fusion, the fused feature map is further processed by two 3×3 convolutions. Finally, the fused feature map is upsampled by a factor of 4 using bilinear interpolation to restore the feature map to the same resolution as the original input image.

9. The method according to claim 1, characterized in that the specific process of step 4) is as follows: initialize all parameters of the encoder-decoder network and input the training set and test set into the network; perform forward propagation, calculate the accuracy of the network on the test set from the results, and determine whether the accuracy has reached a predetermined value or whether a predetermined number of training iterations has been reached. If either condition is met, training ends; otherwise, backpropagation is performed to calculate the updates to the weights and biases. The weights and biases are updated accordingly, and the updated parameters are substituted into the network for forward propagation again, repeating this cycle so that the error decreases continuously.

10. The method according to claim 1, characterized in that step 5) involves importing the new air-coupled ultrasonic inspection image into the trained encoder-decoder network model to achieve automatic identification of defects in the inspection image.