A steel surface defect detection method based on cross-domain dynamic reorganization U-Net

By using the cross-domain dynamic recombination U-Net method, the problems of missing small targets and interference from complex backgrounds in steel surface defect detection are solved, achieving high-precision and high-efficiency defect detection, improving detection accuracy and edge recovery capability, and meeting the needs of real-time industrial detection.

CN122289237A · Pending · Publication Date: 2026-06-26 · BENGBU COLLEGE +2

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BENGBU COLLEGE
Filing Date
2026-04-09
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing technologies for detecting defects on steel surfaces suffer from problems such as missed detection of small targets, interference from complex backgrounds, and blurred edges, making it difficult to achieve high-precision and high-efficiency detection.

Method used

A detection method based on cross-domain dynamic reconstruction U-Net is adopted. By introducing a cross-domain focusing attention network in the encoder and a dynamic channel reconstruction upsampling unit in the decoder, the multi-scale feature focusing capability and edge recovery accuracy of small defects are enhanced, while reducing computational redundancy.

Benefits of technology

It improves detection accuracy and computational efficiency, achieving high-precision defect detection. The mean intersection over union (mIoU) is improved by 2.23%, the edge recovery accuracy is improved by 3.32%, and the model inference speed reaches 63 FPS, meeting the needs of real-time industrial detection.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122289237A_ABST
Patent Text Reader

Abstract

This invention discloses a steel surface defect detection method based on cross-domain dynamic reconstruction U-Net. Addressing issues such as missed detection of small targets, interference from complex backgrounds, and blurred edges in steel surface defect detection, a cross-domain dynamic reconstruction U-Net (CDR-Unet) model is constructed. This model is an improvement on the U-Net architecture, introducing a cross-domain focusing attention network in the encoder section. This network enhances the multi-scale feature focusing capability for small defects such as inclusions through local-global dual-path interaction. In the decoder section, a dynamic channel reconstruction upsampling unit is introduced, combined with adaptive channel shuffling and lightweight upsampling strategies. This reduces computational load while improving the geometric continuity recovery accuracy of scratch defects, meeting the real-time detection needs of industry and providing a lightweight solution for high-precision defect detection in complex industrial scenarios.

Description

Technical Field

[0001] This invention belongs to the field of computer vision and industrial inspection technology, and specifically relates to a method for detecting surface defects in steel based on cross-domain dynamic recombination U-Net.

Background Technology

[0002] As one of the most widely used raw materials for mechanical components, the quality of steel not only affects the performance and service life of equipment, but also national defense and the development of the national economy. Early detection and accurate identification of surface defects in steel are crucial for ensuring product quality and improving production efficiency. With the advancement of Industry 4.0 and the development of intelligent manufacturing, intelligent detection of surface defects in steel has become a key technology for improving steel quality and reducing failure rates.

[0003] In recent years, deep learning technology has demonstrated significant advantages in the field of industrial defect detection due to its powerful feature representation capabilities. Detection models, represented by Convolutional Neural Networks (CNN), Faster R-CNN, and the YOLO series, have achieved autonomous extraction and classification of defect features through end-to-end learning mechanisms, greatly improving the intelligence level of detection systems. However, when faced with problems such as sub-millimeter-level micro-defects in high-resolution images in industrial scenarios, complex background interference caused by reflective textures and oxide patches on metal surfaces, and edge blurring due to low contrast between defect areas and normal substrates, traditional models often suffer from missed detections and false detections due to limited receptive fields and imperfect feature fusion mechanisms. In addition, existing algorithms generally suffer from high computational redundancy and a large number of parameters, making it difficult to meet the stringent requirements of lightweight models and inference efficiency for real-time production line detection.

[0004] While U-Net and its variants have achieved breakthroughs in semantic segmentation tasks through their encoder-decoder architecture, their symmetrical structure still faces bottlenecks in complex industrial scenarios, such as multi-scale feature loss, insufficient edge localization accuracy, and limited model generalization ability. This leads to particularly prominent topological degradation problems in typical defects such as thin cracks and point pits. Therefore, exploring new network architectures that balance detection accuracy and computational efficiency has become a key path to overcome the bottlenecks in the implementation of industrial defect detection technology.

Summary of the Invention

[0005] The purpose of this invention is to provide a steel surface defect detection method based on cross-domain dynamic recombination U-Net, which solves the problems of missed detection of small targets, interference from complex backgrounds and blurred edges in steel surface defect detection, and achieves high-precision and high-efficiency defect detection.

[0006] The technical solution to achieve the purpose of this invention is: a method for detecting surface defects in steel based on cross-domain dynamic recombination U-Net, comprising the following steps:

[0007] Step 1: Build and train a cross-domain dynamic recombination U-Net model, denoted as CDR-Unet model;

[0008] Step 2: Input the image of the steel surface to be detected into the trained CDR-Unet model;

[0009] Step 3: Obtain the pixel-level defect segmentation map output by the CDR-Unet model to complete the detection of defects on the steel surface;

[0010] The CDR-Unet model is based on an improved U-Net architecture. In the encoder part, a cross-domain focusing attention network is introduced to replace part of the standard convolution. In the decoder part, a dynamic channel reassembly upsampling unit is introduced to replace the standard upsampling module, which is used to restore the geometric continuity of defects while making the model lightweight.

[0011] Furthermore, the construction and processing flow of the Cross-Domain Focused Attention Network (CDFA) is as follows:

[0012] Step 1-1: Construct and train a cross-domain dynamic recombination U-Net (CDR-Unet) model; input the feature map X ∈ R^(H'×W'×C) into two parallel cross-domain branches;

[0013] Step 1-2: Add a cross-domain module to the branch section and adopt a dual-path cross-domain attention mechanism. Use different domain size parameters, d=2 and d=4, to capture local and global features respectively, so as to achieve multi-domain information fusion.

[0014] 1) Local feature path in the pixel domain (d=2)

[0015] The input features are divided into (H'/2)×(W'/2) local domains, and the local attention weights of each domain are computed using a lightweight MLP:

[0016] (1)

[0017] where X_patch denotes a 2×2 local domain, the Flatten operation stretches a three-dimensional local tensor into a one-dimensional vector, the MLP is used to learn the relative importance of the four pixels, and Softmax is the normalization operation;

[0018] This approach focuses on the fine features of small regions and is highly sensitive to minor defects, effectively extracting feature information from small targets.
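The formula image for (1) is not preserved in this text. Going only by the definitions given for it, the local weight of a 2×2 domain has the form Softmax(MLP(Flatten(X_patch))). The following is a minimal NumPy sketch under that reading; the hidden width, the ReLU activation, the random parameters, and the function names are illustrative assumptions, not details from the patent.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def local_patch_weights(x_patch, w1, b1, w2, b2):
    """Softmax(MLP(Flatten(x_patch))) for one 2x2xC domain -> 4 pixel weights."""
    v = x_patch.reshape(-1)               # Flatten: (2, 2, C) -> (4*C,)
    h = np.maximum(0.0, v @ w1 + b1)      # hidden layer, ReLU is an assumption
    logits = h @ w2 + b2                  # one logit per pixel of the domain
    return softmax(logits)

rng = np.random.default_rng(0)
C, hidden = 8, 16                         # hypothetical sizes
x_patch = rng.normal(size=(2, 2, C))
w1, b1 = rng.normal(size=(4 * C, hidden)), np.zeros(hidden)
w2, b2 = rng.normal(size=(hidden, 4)), np.zeros(4)
weights = local_patch_weights(x_patch, w1, b1, w2, b2)
print(weights, weights.sum())
```

The four weights are non-negative and sum to 1, so they can be read as the relative importance of the four pixels in the domain.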

[0019] 2) Global feature path in the feature domain (d=4)

[0020] The features are divided into (H'/4)×(W'/4) feature domains of size 4×4. Because the 4×4 feature domain is larger than the pixel-domain size, this branch has a stronger ability to collect global information and captures global features better. The global attention weights are calculated as follows:

[0021] (2)

[0022] where Q, K, and V are the query, key, and value matrices generated by linear transformation of the input features, and d is the dimension of the key vector (distinct from the domain size parameter d above), used as the scaling factor to stabilize the gradient; QK^T is a 16×16 similarity score matrix, representing the semantic relevance between pairs of tokens.
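The definitions attached to formula (2) (query, key, and value matrices, a key-dimension scaling factor, and a 16×16 score matrix) match the standard scaled dot-product attention, Softmax(QK^T/√d)V, applied to the 16 tokens of one 4×4 domain. A minimal NumPy sketch under that assumption follows; the channel count, key dimension, and random projection matrices are hypothetical.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Return (output, attention) for token matrices q, k, v of shape (N, d_k)."""
    d_k = k.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                 # (N, N) similarity scores
    scores = scores - scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)  # row-wise softmax
    return attn @ v, attn

rng = np.random.default_rng(0)
C, d_k = 32, 8                                      # hypothetical sizes
domain = rng.normal(size=(4, 4, C))                 # one 4x4 feature domain
tokens = domain.reshape(16, C)                      # 16 tokens
wq, wk, wv = (rng.normal(size=(C, d_k)) for _ in range(3))
out, attn = scaled_dot_product_attention(tokens @ wq, tokens @ wk, tokens @ wv)
print(attn.shape, out.shape)
```

Each row of the attention matrix sums to 1, so every output token is a weighted average of the 16 value vectors in its domain.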

[0023] Step 1-3: Concatenate and fuse the outputs of the two branches with the feature map processed by another serial convolution path;

[0024] Step 1-4: Apply channel attention and spatial attention mechanisms to the fused feature map in sequence.

[0025] A global context squeeze-and-excitation (compression-incentive) mechanism is used to dynamically adjust channel weights, optimize computing resource allocation, and reduce computational redundancy. The channel attention weights are calculated as follows:

[0026] (3)

[0027] Where GAP is global average pooling, and σ is the Sigmoid function, which suppresses the response of the noise channel.

[0028] Spatial attention weights are calculated using two-channel (Max-Avg) spatial saliency modeling:

[0029] (4)

[0030] where the terms denote, respectively, the average pooling operation, the max pooling operation, and a convolution operation using a 7×7 convolutional layer.

[0031] Finally, the attention weights are fused with the input features by tensor multiplication to enhance key regions. The output features, obtained after dual channel and spatial attention enhancement, are:

[0032] (5)
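Formulas (3) to (5) are not preserved in this text. The surviving definitions (GAP followed by a Sigmoid for channels; average and max pooling followed by a 7×7 convolution for space; fusion by multiplication) resemble the familiar channel-then-spatial attention pattern. The sketch below is one plausible reading, not the patent's exact formulas: the two-layer bottleneck, the Sigmoid on the spatial map, the order of multiplication, and all names and sizes are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    """x: (H, W, C). GAP -> bottleneck MLP (assumed) -> Sigmoid -> (C,) weights."""
    s = x.mean(axis=(0, 1))                          # global average pooling
    return sigmoid(np.maximum(0.0, s @ w1) @ w2)

def conv2d_same(img, kernel):
    """Naive zero-padded correlation: (H, W, K) with a (k, k, K) kernel -> (H, W)."""
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)))
    h, w = img.shape[:2]
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + k, j:j + k, :] * kernel)
    return out

def spatial_attention(x, kernel):
    """Stack channel-wise Avg and Max maps, apply a 7x7 conv, then Sigmoid."""
    pooled = np.stack([x.mean(axis=-1), x.max(axis=-1)], axis=-1)  # (H, W, 2)
    return sigmoid(conv2d_same(pooled, kernel))                    # (H, W)

def dual_attention(x, w1, w2, kernel):
    xc = x * channel_attention(x, w1, w2)             # channel weights first
    return xc * spatial_attention(xc, kernel)[..., None]

rng = np.random.default_rng(0)
H, W, C = 8, 8, 8                                     # hypothetical sizes
x = rng.normal(size=(H, W, C))
w1, w2 = rng.normal(size=(C, C // 2)), rng.normal(size=(C // 2, C))
kernel = rng.normal(size=(7, 7, 2))
y = dual_attention(x, w1, w2, kernel)
print(y.shape)
```

Because both weight maps lie between 0 and 1, the output keeps the input shape and never exceeds the input in magnitude; enhancement here is relative, with less relevant channels and positions attenuated more strongly.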

[0033] Furthermore, the processing steps for the local feature path in the pixel domain include:

[0034] Step 2-1, Unfold operation: Split the input feature tensor into multiple non-overlapping p×p small domains, and output a tensor with shape (p×p, H'/p, W'/p, C'), where p is the domain size, H' and W' are the spatial dimensions of the input features, from which the number of domains, (H'/p)×(W'/p), is calculated, and C' is the feature embedding dimension of the attention subspace.

[0035] Step 2-2, Cross-channel layer normalization (CFN): Normalize the averaged domain features to eliminate feature bias caused by uneven illumination and reflection from the steel plate surface;

[0036] Step 2-3, Softmax calculation: The attention weights for each domain are calculated using the Softmax function, with values ranging from 0 to 1.

[0037] Step 2-4, Feature selection and cross-domain fusion: Based on attention weights, the original domain features are weighted and filtered. High-weight domains retain details through residual connections and enhance defect features through channel scaling. Low-weight domains truncate feature values to suppress background interference. Adjacent domains are smoothly transitioned through bilinear interpolation. The high-weight and low-weight domains are set according to thresholds.
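The Unfold operation of Step 2-1 can be illustrated with plain array reshaping. This is a minimal NumPy sketch that splits an (H', W', C') tensor into non-overlapping p×p domains and returns the (p×p, H'/p, W'/p, C') layout named in the text; the function name and the ordering of pixels inside a domain (row-major) are assumptions.

```python
import numpy as np

def unfold_domains(x, p):
    """Split (H, W, C) into non-overlapping p x p domains -> (p*p, H//p, W//p, C)."""
    h, w, c = x.shape
    assert h % p == 0 and w % p == 0, "H and W must be divisible by p"
    x = x.reshape(h // p, p, w // p, p, c)   # (H/p, p, W/p, p, C)
    x = x.transpose(1, 3, 0, 2, 4)           # (p, p, H/p, W/p, C)
    return x.reshape(p * p, h // p, w // p, c)

x = np.arange(16).reshape(4, 4, 1)           # toy 4x4 single-channel map
d = unfold_domains(x, 2)
print(d.shape)                               # (4, 2, 2, 1)
print(d[:, 0, 0, 0])                         # pixels of the top-left domain: [0 1 4 5]
```

Indexing the first axis selects a pixel position inside each domain, while the second and third axes select the domain, which makes per-domain reductions such as a mean or a Softmax straightforward.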

[0038] Furthermore, the construction and processing flow of the Dynamic Channel Reassembly Upsampling Unit (DCRU) is as follows:

[0039] Step 3-1: The input feature map is first upsampled using bilinear interpolation to enlarge the spatial resolution to 2H×2W, where H is the height of the input feature map and W is its width.

[0040] Step 3-2: Input the upsampled feature map into the depthwise separable convolutional layer. By calculating the separation of channels and spatial dimensions, the background interference response intensity is reduced while preserving the detail capture capability of the 3×3 convolutional kernel.

[0041] Step 3-3: Perform channel grouping and shuffling operations on the output of depthwise separable convolutions, dividing the total number of channels into 32 independent subgroups. Each subgroup is equipped with a dedicated local feature extractor, and the feature interaction between groups is promoted through periodic channel permutation.

[0042] Step 3-4: Compress and project the features after channel shuffling using 1×1 convolution, adjust the number of channels to the target dimension, and apply sparsification coefficients to redundant channels based on feature importance scores to eliminate redundant feature responses.

[0043] Step 3-5: Perform residual connections and summation between the processed feature map and the corresponding shallow high-resolution feature map transmitted from the encoder. Through cross-layer connections, recover the detailed information of small indentations and reduce the number of model parameters.
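The channel grouping and shuffling of Step 3-3 can be illustrated with the common reshape-transpose-reshape shuffle. The text only states that channels are split into 32 subgroups and periodically permuted, so treating this as a ShuffleNet-style interleave is an assumption, as are the function name and the toy 64-channel input.

```python
import numpy as np

def channel_shuffle(x, groups):
    """x: (H, W, C). Interleave channels across `groups` equal subgroups."""
    h, w, c = x.shape
    assert c % groups == 0, "C must be divisible by the number of groups"
    x = x.reshape(h, w, groups, c // groups)  # split channels into groups
    x = x.transpose(0, 1, 3, 2)               # swap group and within-group axes
    return x.reshape(h, w, c)

x = np.arange(64, dtype=float).reshape(1, 1, 64)  # channel i holds the value i
y = channel_shuffle(x, 32)
print(y[0, 0, :4])                                # [0. 2. 4. 6.]
```

After the shuffle, neighbouring output channels come from different subgroups, so a following grouped or 1×1 convolution mixes information across the original groups.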

[0044] Furthermore, the encoder part of the CDR-Unet model consists of multiple stages of convolution, CDFA modules, and downsampling stacked together. When the input image size is 512×512×3, it specifically includes:

[0045] First stage: Perform a 3×3 64-channel pull-through convolution, pass through a CDFA module once to obtain a preliminary effective feature layer of [512,512,64], and then perform 2×2 max pooling to obtain a feature layer of [256,256,64].

[0046] The second stage: Perform a 3×3 128-channel convolution, pass through a CDFA module once to obtain a feature layer of [256,256,128], and then perform 2×2 max pooling to obtain a feature layer of [128,128,128].

[0047] The third stage: Perform a 3×3 256-channel convolution, pass through a CDFA module once to obtain a feature layer of [128,128,256], and then perform 2×2 max pooling to obtain a feature layer of [64,64,256].

[0048] Fourth stage: Perform a 3×3 512-channel convolution, pass through a CDFA module once to obtain a feature layer of [64,64,512], and then perform 2×2 max pooling to obtain a feature layer of [32,32,512].

[0049] Fifth stage: Perform a 3×3 512-channel convolution and pass it through a CDFA module to obtain a preliminary effective feature layer of [32,32,512].
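The five encoder stages can be checked with a short shape trace. The sketch below only tracks tensor shapes, using the channel widths and 2×2 pooling stated above; the function name is hypothetical, and it assumes the 3×3 convolutions and CDFA modules preserve spatial size, as the listed shapes imply.

```python
def encoder_shapes(size=512, channels=(64, 128, 256, 512, 512)):
    """Return the feature-layer shape kept at each encoder stage (before pooling)."""
    shapes = []
    for stage, c in enumerate(channels):
        shapes.append((size, size, c))   # after 3x3 conv + CDFA
        if stage < len(channels) - 1:
            size //= 2                   # 2x2 max pooling halves H and W
    return shapes

skips = encoder_shapes()
for s in skips:
    print(s)
```

The trace reproduces the five layers listed in the text, from [512,512,64] down to [32,32,512].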

[0050] Furthermore, the decoder part of the CDR-Unet model utilizes the five preliminary effective feature layers obtained from the encoder part for progressive upsampling and feature fusion, specifically including:

[0051] The feature layer [32,32,512] obtained in the fifth stage is input into the DCRU module for upsampling to obtain a feature map of [64,64,512].

[0052] The upsampled feature map is concatenated with the [64,64,512] feature map of the fourth stage of the encoder to obtain the [64,64,1024] feature map.

[0053] The concatenated feature map is subjected to a 3×3 256-channel depass convolution for channel compression and feature enhancement, resulting in a feature map of [64, 64, 256].

[0054] The above feature map is input into the DCRU module for upsampling to obtain a feature map of [128, 128, 256], which is then concatenated with the [128, 128, 256] feature map from the third stage.

[0055] Repeat the DCRU upsampling, stitching, and convolution operations described above until the original input image resolution of 512×512 is restored.
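The decoder path can be traced the same way. The text gives the first fusion explicitly (1024 channels compressed to 256) and then says to repeat; the output widths assumed here for the later stages (128, 64, 64) are illustrative guesses, not values from the patent, and the function name is hypothetical.

```python
def decoder_shapes(skips, out_channels=(256, 128, 64, 64)):
    """Trace (upsampled, concatenated, compressed) shapes for each decoder stage.

    out_channels beyond the first entry are assumptions for illustration.
    """
    h, w, c = skips[-1]
    trace = []
    for (sh, sw, sc), oc in zip(reversed(skips[:-1]), out_channels):
        h, w = 2 * h, 2 * w                      # DCRU doubles H and W
        assert (h, w) == (sh, sw), "skip feature must match spatially"
        trace.append(((h, w, c), (h, w, c + sc), (h, w, oc)))
        c = oc                                   # 3x3 conv compresses channels
    return trace

skips = [(512, 512, 64), (256, 256, 128), (128, 128, 256), (64, 64, 512), (32, 32, 512)]
trace = decoder_shapes(skips)
for up, cat, conv in trace:
    print(up, "->", cat, "->", conv)
```

The first two rows match the shapes stated in the text ([64,64,512] to [64,64,1024] to [64,64,256], then [128,128,256]), and four upsampling stages bring the map back to 512×512.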

[0056] Furthermore, p takes the value of 2, and q takes the value of 4.

[0057] Compared with the prior art, the significant advantages of this invention are:

[0058] 1) High detection accuracy: Through the local-global dual-path interaction of the CDFA module, the ability to focus on multi-scale features of small defects such as inclusions is enhanced. The mean intersection over union (mIoU) on the NEU-Seg dataset reaches 80.97%, which is 2.23% higher than the original U-Net.

[0059] 2) Good edge recovery: Through the adaptive channel shuffling and lightweight upsampling strategy of the DCRU module, the geometric continuity recovery accuracy of linear defects such as scratches is effectively improved, and the PA index of scratch segmentation is improved by 3.32%.

[0060] 3) High computational efficiency: The DCRU module reduces the upsampling computation by 72% while ensuring a model inference speed of 63 FPS, which can meet the needs of industrial real-time detection.

[0061] 4) Strong robustness: Cross-channel layer normalization (CFN) in CDFA effectively eliminates feature biases caused by uneven lighting, surface reflection, and other industrial scenarios, improving the robustness of the model in complex backgrounds.
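For readers unfamiliar with the two metrics quoted above, the following sketch shows how mIoU and PA are commonly computed from a confusion matrix. It assumes PA means pixel accuracy, which the text does not spell out; the tiny label maps are made-up toy data, not results from the patent.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    idx = y_true.ravel() * num_classes + y_pred.ravel()
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def mean_iou(cm):
    """Mean over classes of TP / (TP + FP + FN); empty classes count as 0."""
    tp = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    return float(np.mean(tp / np.maximum(union, 1)))

def pixel_accuracy(cm):
    """Fraction of pixels whose predicted class equals the true class."""
    return float(np.diag(cm).sum() / cm.sum())

y_true = np.array([[0, 0, 1], [0, 1, 1], [2, 2, 2]])   # toy ground truth
y_pred = np.array([[0, 0, 1], [0, 0, 1], [2, 2, 1]])   # toy prediction
cm = confusion_matrix(y_true, y_pred, 3)
print(round(mean_iou(cm), 4), round(pixel_accuracy(cm), 4))   # 0.6389 0.7778
```

In this toy case the per-class IoUs are 3/4, 2/4, and 2/3, giving a mean of 23/36, while 7 of the 9 pixels are labelled correctly.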

[0062] The present invention will now be described in further detail with reference to the accompanying drawings.

Attached Figure Description

[0063] Figure 1 is a schematic diagram illustrating the principle of the steel surface defect detection method based on cross-domain dynamic recombination U-Net of the present invention.

[0064] Figure 2 is the overall flowchart of the CDR-Unet model of the present invention.

[0065] Figure 3 is the overall flowchart of the CDFA cross-domain focusing attention network of the present invention.

[0066] Figure 4 is a flowchart of the pixel domain branch module (d=2) of the present invention.

[0067] Figure 5 is a comparison chart of the crack segmentation performance of various algorithms on the NEU-Seg dataset.

Detailed Implementation

[0068] With reference to Figure 1, the present invention proposes a method for detecting surface defects in steel based on cross-domain dynamic recombination U-Net, comprising the following steps:

[0069] Step 1: The input feature map is processed through two parallel domain modules and a series of convolutional modules. One domain module has a domain size of 2 (i.e., a 2×2 matrix), which yields more detailed local features. The other domain module has a domain size of 4×4, which yields more comprehensive global features. The resulting new feature maps are concatenated and then processed through a channel + spatial attention mechanism to obtain the output feature map. This achieves bidirectional interaction between the local detail pixel domain (PD) and the global semantic feature domain (FD), overcoming the single-domain limitations of traditional attention mechanisms. Furthermore, through dynamic weight allocation, it strengthens attention focus on small defect regions and suppresses background interference.

[0070] Step 2-1: Input features (H'×W'×C). The module receives multi-scale feature maps from the encoder, where H'×W' is the spatial resolution of the current decoding stage and C is the channel dimension. Feature maps are passed through skip connections to preserve high-frequency details of the original image.

[0071] Step 2-2: Add a cross-domain module to the branch section and adopt a dual-path cross-domain attention mechanism. Use different domain size parameters d=2 and d=4 to capture local and global features respectively, so as to achieve multi-domain information fusion.

[0072] Step 2-3: Local feature path in the pixel domain (d=2)

[0073] The input features are divided into (H'/2)×(W'/2) local domains, and the local attention weights of each domain are computed using a lightweight MLP:

[0074] (1)

[0075] This approach focuses on the fine features of small regions and is highly sensitive to minor defects, effectively extracting feature information from small targets.

[0076] Step 2-4: Global feature path in the feature domain (d=4)

[0077] The features are divided into (H'/4)×(W'/4) feature domains of size 4×4. Because the 4×4 feature domain is larger than the pixel-domain size, this branch has a stronger ability to collect global information and captures global features better.

[0078] (2)

[0079] In addition to cross-domain branches, the CDFA module also includes multi-layer convolutional operations to capture more detailed feature information. Simultaneously, an attention module is added, fusing channel and spatial attention mechanisms to further enhance the extracted features and improve extraction performance.

[0080] Step 3-1: Use a global context squeeze-and-excitation (compression-incentive) mechanism to dynamically adjust channel weights, optimize computing resource allocation, and reduce computational redundancy.

[0081] (3)

[0082] Where GAP is global average pooling, and σ is the Sigmoid function, which suppresses the response of the noise channel.

[0083] Step 3-2: Model spatial saliency using two-channel (Max-Avg) saliency:

[0084] (4)

[0085] Finally, attention weights are fused using tensor multiplication to enhance key regions:

[0086] (5)

[0087] Step 3-3: After the features obtained from the fusion of the three branches are processed through a channel-space attention mechanism, the output feature dimension is H′×W′×C, while the number of channels remains unchanged. This improves the information richness of the extracted features without altering the number of channels. Compared to the ordinary convolution step in the original Unet network, this upgrade of the extraction function is achieved without affecting the overall network framework.

[0088] Step 4, the feature domain module algorithm, is similar, except the domain size is changed to 4×4. This module achieves refined perception of steel surface defects through local feature analysis and dynamic weight allocation. Its processing flow can be divided into five core stages.

[0089] Step 4-1: The input feature tensor F is split into multiple non-overlapping small regions of size p×p using the Unfold operation, and the output is a tensor with shape (p×p, H'/p, W'/p, C'). This operation transforms global features into a local observation perspective, significantly enhancing the ability to capture minute defects (such as hairline cracks). Then, a Mean operation is performed, averaging the features across all channels of each region in dimension -1 to generate a grayscale image representing the overall saliency of the region. This operation effectively suppresses background noise such as oxide scale texture and oil residue on the steel surface, while highlighting the abnormal features of defect areas (such as scratches).

[0090] Step 4-2: Cross-channel Layer Feature Normalization (CFN) is used to standardize the domain features. The normalization process eliminates feature biases caused by uneven lighting, steel plate surface reflections, and other industrial scenarios, improving the model's robustness to low-contrast defects (such as interlayers). The processed feature values stabilize in the [-1, 1] range, accelerating model convergence. The normalized features are then subjected to Softmax calculation. The Softmax function is used to calculate the attention weight (0~1) for each domain, dynamically identifying key defect regions.

[0091] High-weighted region (weight > 0.7): corresponds to significant defects such as cracks and inclusions, with a 2-3 times enhancement in characteristic response;

[0092] Medium weight region (0.3~0.7): handles scar defects with blurred edges while preserving detailed features;

[0093] Low-weighted regions (weight < 0.3): suppress the feature contributions of interfering areas such as oxide scale and water stains;
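The three weight bands above can be expressed as a simple per-domain scaling rule. In this sketch the thresholds 0.3 and 0.7 come from the text, while the gain of 2.0 (the lower end of the stated 2-3 times enhancement), truncation to exactly zero for low-weight domains, and the function name are illustrative assumptions.

```python
import numpy as np

def select_features(domains, weights, low=0.3, high=0.7, gain=2.0):
    """Scale each domain's features according to its attention-weight band.

    domains: (N, D) features per domain; weights: (N,) attention weights in [0, 1].
    High band (> high) is amplified, low band (< low) is zeroed, the rest is kept.
    """
    scale = np.where(weights > high, gain, np.where(weights < low, 0.0, 1.0))
    return domains * scale[:, None]

weights = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
domains = np.ones((5, 2))
out = select_features(domains, weights)
print(out[:, 0])                      # [0. 1. 1. 1. 2.]
```

Weights exactly at 0.3 or 0.7 fall in the medium band here, following the strict inequalities in the text.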

[0094] Step 4-3: Perform feature selection on the obtained features, and based on attention weights, perform corresponding soft filtering on the two different domains in the original domain, followed by cross-domain fusion:

[0095] 1) Gating enhancement: High-weighted domains retain details through residual connections and enhance defect features using channel scaling.

[0096] 2) Background suppression: Feature value truncation is performed in the low-weight domain to reduce invalid computation.

[0097] 3) Cross-domain fusion: Adjacent domains are smoothly transitioned through bilinear interpolation to avoid jagged edges at the segmentation boundaries.

[0098] Finally, the output features undergo channel compression through a 1×1 convolution and are then passed to subsequent networks.

[0099] Step 5: To address the challenge of detail recovery in steel surface defect detection, a lightweight and high-precision feature upsampling scheme is proposed. Its core innovation lies in the use of multi-scale feature fusion and dynamic channel recombination strategies to adaptively adjust the channel grouping strategy according to feature complexity, breaking through the limitations of fixed mixing mode and significantly improving the edge restoration capability and computational efficiency of small defects.

[0100] Step 5-1: In the deep feature resolution recovery stage, the module uses bilinear upsampling for basic amplification to ensure the continuity of defect geometric features (such as crack propagation direction). To avoid the checkerboard artifact problem caused by traditional transposed convolution, depthwise separable convolution is introduced to optimize spatial feature extraction—by separating the calculation of channels and spatial dimensions, while retaining the detail capture capability of the 3×3 convolution kernel, the response intensity of background interference such as oxide scale texture is reduced by 62%, and the computational cost is reduced to 10% of that of standard convolution.

[0101] Step 5-2: To address the channel isolation issue in depthwise convolution, a group shuffling mechanism is employed.

[0102] First, the channels are grouped equally into 32 independent subgroups (Ng=32). Each subgroup is equipped with a dedicated local feature extractor, focusing on learning feature patterns in a specific spatial frequency domain. Then, cross-group shuffling is performed. Through periodic channel permutation, feature interaction between groups is promoted, and cross-channel semantic associations are enhanced (such as the collaborative representation of crack depth and width).

[0103] Step 5-3: Feature compression. A 1×1 convolution kernel is used to construct a fully connected channel projection matrix, compressing the number of channels to the target dimension. Then, based on the feature importance score, a sparsity coefficient is applied to the redundant channels to eliminate redundant feature responses.

[0104] Step 5-4: Add the upsampling results to the shallow high-resolution features, and recover the detailed information of the tiny indentations through cross-layer connections. This design reduces the number of model parameters to 34% of traditional methods while ensuring real-time performance (processing time per frame is 8.3ms).
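The cost saving attributed to depthwise separable convolution in Step 5-1 can be sanity-checked with a multiply-accumulate count. The sketch assumes the standard depthwise-plus-pointwise factorization, and the 64×64×512 feature size is taken from the decoder shapes only as an example; the text does not say which layer its 10% figure refers to.

```python
def conv_macs(h, w, c_in, c_out, k=3):
    """Multiply-accumulates of a standard k x k convolution."""
    return h * w * c_in * c_out * k * k

def depthwise_separable_macs(h, w, c_in, c_out, k=3):
    """Depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise

h = w = 64
c_in = c_out = 512
ratio = depthwise_separable_macs(h, w, c_in, c_out) / conv_macs(h, w, c_in, c_out)
print(round(ratio, 4))   # 0.1131
```

The ratio simplifies to 1/c_out + 1/k², which for a 3×3 kernel and wide layers approaches 1/9, or about 11%, the same order as the figure quoted in the text.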

[0105] Step 6: After the input features pass through the DCRU dynamic channel reassembly upsampling unit, H×W becomes 2H×2W without changing the number of channels, thus achieving the upsampling process. Compared to traditional upsampling steps such as double upsampling, the DCRU dynamic channel reassembly upsampling unit, through depthwise separable convolution coupled with dynamic channel reassembly, distinguishes defect features from background noise, achieving multi-scale fusion and noise suppression. The channel shuffling mechanism eliminates checkerboard artifacts, and residual connections restore sub-pixel-level crack edges, meeting lightweight design requirements.
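The basic amplification from H×W to 2H×2W can be illustrated with a small separable bilinear interpolation. This is a minimal NumPy sketch, not the DCRU itself: it covers only the bilinear step, and the half-pixel sampling convention with edge clamping and the function names are assumptions.

```python
import numpy as np

def _interp_axis(x, axis):
    """Linearly interpolate x to twice its length along one axis."""
    n = x.shape[axis]
    src = (np.arange(2 * n) + 0.5) / 2.0 - 0.5   # output centres in input coords
    src = np.clip(src, 0, n - 1)                 # clamp at the borders
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, n - 1)
    t = src - lo
    shape = [1] * x.ndim
    shape[axis] = -1
    t = t.reshape(shape)                         # broadcast along the other axes
    return np.take(x, lo, axis=axis) * (1 - t) + np.take(x, hi, axis=axis) * t

def bilinear_upsample_2x(x):
    """x: (H, W) or (H, W, C) -> (2H, 2W) or (2H, 2W, C); channels unchanged."""
    return _interp_axis(_interp_axis(x, 0), 1)

x = np.array([[0.0, 2.0], [4.0, 6.0]])
y = bilinear_upsample_2x(x)
print(y.shape)    # (4, 4)
print(y[0])       # [0.  0.5 1.5 2. ]
```

Interpolating smoothly between neighbouring values, rather than learning a transposed convolution, is what keeps thin structures continuous after enlargement.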

[0106] The present invention will now be described in detail with reference to the accompanying drawings and embodiments.

[0107] Example

[0108] With reference to Figure 2, the present invention proposes a method for detecting surface defects in steel based on cross-domain dynamic recombination U-Net, comprising the following steps:

[0109] Step 1: Build and train a cross-domain dynamic recombination U-Net (CDR-Unet) model;

[0110] Step 2: Input the image of the steel surface to be detected into the trained CDR-Unet model;

[0111] Step 3: Obtain the pixel-level defect segmentation map output by the CDR-Unet model to complete the detection of defects on the steel surface.

[0112] As shown in Figure 2, the CDR-Unet model is an improvement on the U-Net architecture, and the algorithm consists of an encoder part and a decoder part.

[0113] I. Encoder Section

[0114] The encoder section is used for backbone feature extraction and consists of multiple stages of convolution, CDFA modules, and downsampling (max pooling) stacked together. When the input image size is 512x512x3, the specific execution method is as follows:

[0115] The first stage involves performing a 64-channel pull-through convolution of [3,3] and passing it through a CDFA module to obtain a preliminary effective feature layer of [512,512,64]. Then, 2x2 max pooling is performed to obtain a feature layer of [256,256,64].

[0116] The second stage involves performing a 128-channel convolution of [3,3] and passing it through a CDFA module to obtain a preliminary effective feature layer of [256,256,128]. Then, a 2x2 max pooling is performed to obtain a feature layer of [128,128,128].

[0117] Similarly, the encoder section can be used to obtain five preliminary effective feature layers of different scales (sizes of 512, 256, 128, 64, and 32, respectively), which will be used for feature fusion in the decoder section.

[0118] II. Decoder Section

[0119] The decoder section enhances feature extraction by performing progressive DCRU upsampling on the five preliminary effective feature layers obtained from the encoder section, and then fusing features to obtain a final effective feature layer that integrates all features. Specifically, this includes:

[0120] The feature layer [32,32,512] obtained in the fifth stage is input into the DCRU module for upsampling to obtain a feature map of [64,64,512].

[0121] The upsampled feature map is concatenated with the [64,64,512] feature map from the fourth stage of the encoder to obtain the [64,64,1024] feature map.

[0122] The concatenated feature map is subjected to a [3,3] 256-channel decreasing convolution (FMN) for channel compression and feature enhancement, resulting in a [64,64,256] feature map.

[0123] The above feature map is input into the DCRU module for upsampling to obtain a feature map of [128,128,256], which is then concatenated with the feature map of [128,128,256] from the third stage.

[0124] Repeat the DCRU upsampling, stitching, and depass convolution operations until the original input image resolution of 512×512 is restored, and output the final pixel-level segmentation result.

[0125] With reference to Figure 3, the construction and processing flow of the Cross-Domain Focused Attention Network (CDFA) is as follows:

[0126] Step 2-1: Input features (H'×W'×C). The module receives multi-scale feature maps from the encoder, where H'×W' is the spatial resolution of the current decoding stage and C is the channel dimension. Feature maps are passed through skip connections to preserve high-frequency details of the original image.

[0127] Step 2-2: Add a cross-domain module to the branch section and adopt a dual-path cross-domain attention mechanism. Use different domain size parameters d=2 and d=4 to capture local and global features respectively, so as to achieve multi-domain information fusion.

[0128] Step 2-3: Local feature path in the pixel domain (d=2)

[0129] The input features are divided into (H'/2)×(W'/2) local domains, and the local attention weights of each domain are computed using a lightweight MLP:

[0130] (1)

[0131] This approach focuses on the fine features of small regions and is highly sensitive to minor defects, effectively extracting feature information from small targets.

[0132] Step 2-4: Global feature path in the feature domain (d=4)

[0133] The features are divided into (H'/4)×(W'/4) feature domains of size 4×4. Because the 4×4 feature domain is larger than the pixel-domain size, this branch has a stronger ability to collect global information and captures global features better.

[0134] (2)

[0135] In addition to cross-domain branches, the CDFA module also includes multi-layer convolutional operations to capture more detailed feature information. Simultaneously, an attention module is added, fusing channel and spatial attention mechanisms to further enhance the extracted features and improve extraction performance.

[0136] With reference to Figure 4, the processing steps for the pixel-domain local feature path (d=2) include:

[0137] Step 3-1, Unfold operation: Split the input feature tensor into multiple non-overlapping 2×2 small domains, and output a tensor with shape (4, H' / 2, W' / 2, C').

[0138] Step 3-2, Mean operation: Take the mean over the channel dimension, that is, compute the mean of all channel features in each domain, to generate a feature map representing the overall saliency of the domain. This effectively suppresses background noise and highlights defect areas.

[0139] Step 3-3, Cross-channel layer normalization (CFN): Normalize the averaged domain features to eliminate feature bias caused by industrial conditions such as uneven lighting and reflections on the steel plate surface, improving the model's robustness to low-contrast defects.

[0140] Step 3-4, Softmax calculation: The attention weight of each domain is calculated using the Softmax function, with values ranging from 0 to 1, to dynamically identify key defect areas (e.g., high-weight domains correspond to significant defects, and low-weight domains correspond to background interference).

[0141] Step 3-5, Feature selection and cross-domain fusion: The original domain features are weighted and filtered according to the attention weights. High-weight domains retain details through residual connections and have their defect features enhanced by channel scaling; low-weight domains have their feature values truncated to suppress background interference; adjacent domains are blended smoothly by bilinear interpolation to avoid jagged segmentation boundaries.

[0142] Step 3-6: The output features are channel-compressed by a 1×1 convolution and then passed to the subsequent network.
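Steps 3-1 to 3-6 can be illustrated with a minimal NumPy sketch. It is one possible reading, not the patented implementation: the normalization and the Softmax are both taken over the four positions of each 2×2 domain (the text does not fix the axis), the weighting is combined with a plain residual addition, and the truncation of low-weight features and the bilinear smoothing of Step 3-5 are omitted. The names `unfold`, `fold`, `domain_weights`, and `local_path` are hypothetical.

```python
import numpy as np


def unfold(x, p=2):
    """Step 3-1: (H, W, C) -> (p*p, H/p, W/p, C), non-overlapping p x p domains."""
    h, w, c = x.shape
    x = x.reshape(h // p, p, w // p, p, c)
    return x.transpose(1, 3, 0, 2, 4).reshape(p * p, h // p, w // p, c)


def fold(u, p=2):
    """Inverse of unfold: (p*p, H/p, W/p, C) -> (H, W, C)."""
    _, hp, wp, c = u.shape
    u = u.reshape(p, p, hp, wp, c).transpose(2, 0, 3, 1, 4)
    return u.reshape(hp * p, wp * p, c)


def domain_weights(u, eps=1e-5):
    """Steps 3-2 to 3-4: channel mean, normalization, Softmax.

    Returns weights of shape (p*p, H/p, W/p) that sum to 1 within each domain.
    """
    s = u.mean(axis=-1)                                      # Step 3-2
    s = (s - s.mean(axis=0)) / np.sqrt(s.var(axis=0) + eps)  # Step 3-3 (assumed axis)
    s = s - s.max(axis=0)                                    # stabilize the exponent
    e = np.exp(s)
    return e / e.sum(axis=0)                                 # Step 3-4


def local_path(x, w_out, p=2):
    """Steps 3-1 to 3-6 for x: (H, W, C) and a 1x1 projection w_out: (C, C_out)."""
    u = unfold(x, p)
    weights = domain_weights(u)
    y = u + u * weights[..., None]       # Step 3-5: weighting plus residual
    return fold(y, p) @ w_out            # Step 3-6: 1x1 convolution as a matmul
```

For an 8×8×3 input, `unfold` returns a tensor of shape (4, 4, 4, 3), which matches the (4, H′/2, W′/2, C′) layout stated in Step 3-1.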

[0143] The construction and processing flow of the Dynamic Channel Reassembly Upsampling Unit (DCRU) is as follows:

[0144] Step 4-1: The input feature map is first upsampled by bilinear interpolation, enlarging its spatial resolution to 2H×2W and preserving the continuity of defect geometric features (such as the crack propagation direction).

[0145] Step 4-2: Input the upsampled feature map into the depthwise separable convolutional layer. By separating the channel and spatial dimension calculation, the response intensity of background interference such as oxide skin texture is reduced while retaining the detail capture capability of the 3×3 convolutional kernel, and the amount of computation is also greatly reduced.

[0146] Step 4-3: Perform channel grouping and shuffling operations on the output of the depthwise separable convolution. Divide the total number of channels equally into multiple independent subgroups (e.g., 32 groups). Each subgroup is equipped with a dedicated local feature extractor, focusing on learning feature patterns in a specific spatial frequency domain. Then, through periodic channel permutation, promote feature interaction between groups and enhance cross-channel semantic association.

[0147] Step 4-4: Compress and project the shuffled features using 1×1 convolution to adjust the number of channels to the target dimension, and apply sparsity coefficients to redundant channels based on feature importance scores to eliminate redundant feature responses.

[0148] Step 4-5: Add the processed feature map to the corresponding shallow high-resolution feature map passed from the encoder through a residual connection. These cross-layer connections recover the detailed information of tiny indentations while significantly reducing the number of model parameters.
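The data flow of Steps 4-1 to 4-5 can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: the bilinear upsampling uses corner-aligned sampling, the dedicated per-group feature extractors of Step 4-3 and the importance-based sparsification of Step 4-4 are omitted, and all function and parameter names are hypothetical.

```python
import numpy as np


def upsample2x_bilinear(x):
    """Step 4-1: (H, W, C) -> (2H, 2W, C), corner-aligned bilinear interpolation."""
    h, w, _ = x.shape
    ys, xs = np.linspace(0, h - 1, 2 * h), np.linspace(0, w - 1, 2 * w)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    wy, wx = (ys - y0)[:, None, None], (xs - x0)[None, :, None]
    top = x[y0][:, x0] * (1 - wx) + x[y0][:, x1] * wx
    bottom = x[y1][:, x0] * (1 - wx) + x[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy


def depthwise_conv3x3(x, kernel):
    """Depthwise part of Step 4-2: one 3x3 filter per channel, zero padding.

    x: (H, W, C); kernel: (3, 3, C).
    """
    h, w, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros(x.shape)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w, :] * kernel[dy, dx, :]
    return out


def channel_shuffle(x, groups):
    """Step 4-3: interleave channels across groups (C must be divisible by groups)."""
    h, w, c = x.shape
    x = x.reshape(h, w, groups, c // groups)
    return x.transpose(0, 1, 3, 2).reshape(h, w, c)


def dcru(x, skip, dw_kernel, pw_weight, proj_weight, groups=32):
    """x: (H, W, C); skip: (2H, 2W, C_out); pw_weight: (C, C); proj_weight: (C, C_out)."""
    up = upsample2x_bilinear(x)                          # Step 4-1
    y = depthwise_conv3x3(up, dw_kernel) @ pw_weight     # Step 4-2: depthwise + pointwise
    y = channel_shuffle(y, groups)                       # Step 4-3
    y = y @ proj_weight                                  # Step 4-4: 1x1 projection
    return y + skip                                      # Step 4-5: residual summation
```

As a rough indication of the saving mentioned in Step 4-2, a standard 3×3 convolution with 256 input and 256 output channels has 9·256·256 = 589,824 weights, whereas the depthwise separable form has 9·256 + 256·256 = 67,840 (biases ignored). The 256-channel figure is only an example.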

[0149] Figure 5 shows a comparison of the defect segmentation performance of various algorithms on the NEU-Seg dataset. It is evident that this invention can clearly segment common steel defects (inclusions, patches, scratches) in the images, marking the defective regions in white and the remaining defect-free regions in black.

Claims

1. A method for detecting steel surface defects based on cross-domain dynamic recombination U-Net, characterized in that it comprises the following steps: Step 1: Build and train a cross-domain dynamic recombination U-Net model, denoted the CDR-Unet model; Step 2: Input the image of the steel surface to be inspected into the trained CDR-Unet model; Step 3: Obtain the pixel-level defect segmentation map output by the CDR-Unet model, completing the detection of defects on the steel surface; wherein the CDR-Unet model is based on an improved U-Net architecture: in the encoder part, a cross-domain focusing attention network is introduced to replace part of the standard convolutions; in the decoder part, a dynamic channel reassembly upsampling unit is introduced to replace the standard upsampling module, restoring the geometric continuity of defects while keeping the model lightweight.

2. The method for detecting steel surface defects based on cross-domain dynamic recombination U-Net according to claim 1, characterized in that the construction and processing flow of the cross-domain focused attention network is as follows: Step 1-1: Construct and train a cross-domain dynamic recombination U-Net model; the input feature map, of size H′×W′×C, enters two parallel cross-domain branches, where H′×W′ is the spatial resolution of the current decoding stage and C is the channel dimension; Step 1-2: Add a cross-domain module to the branch section and adopt a dual-path cross-domain attention mechanism, using different domain size parameters d=2 and d=4 to capture local and global features respectively, so as to achieve multi-domain information fusion; 1) Local feature path in the pixel domain: the input features are divided into local domains, and the local attention weights of each domain are computed using a lightweight MLP according to formula (1), where X_patch refers to a 2×2 local domain, the Flatten operation stretches the three-dimensional local tensor into a one-dimensional vector, the MLP computation is used to learn the relative importance of the four pixels, and Softmax is the normalization operation; 2) Global feature path in the feature domain: the features are divided into 4×4 feature domains; because a 4×4 feature domain is larger than the pixel domain, this branch captures global features better; the global attention weights are calculated by formula (2), where Q, K, and V are the query, key, and value matrices generated by linear transformation of the input features, d is the dimension of the key vector, and QK^T is a 16×16 similarity score matrix representing the semantic relevance between any two tokens; Step 1-3: Concatenate and fuse the outputs of the two branches with the feature map processed by another serial convolution path; Step 1-4: Apply channel attention and spatial attention mechanisms to the fused feature map in sequence; a global-context squeeze-and-excitation mechanism is used to dynamically adjust the channel weights, with the channel attention weights calculated by formula (3), where GAP is global average pooling and σ is the Sigmoid function; the spatial attention weights are calculated by dual-channel spatial saliency modeling according to formula (4), whose terms denote the average pooling operation, the max pooling operation, and a convolution with a 7×7 convolutional layer; the final attention weights are fused by tensor multiplication, and the output features, obtained from the input features after dual channel and spatial attention enhancement, are given by formula (5).

3. The method for detecting steel surface defects based on cross-domain dynamic recombination U-Net according to claim 2, characterized in that the processing steps for the local feature path in the pixel domain include: Step 2-1, Unfold operation: Split the input feature tensor into multiple non-overlapping p×p small domains, and output a tensor with shape (p×p, H′/p, W′/p, C′), where p is the domain size, H′ and W′ are the spatial dimensions of the input features, and C′ is the feature embedding dimension of the attention subspace; Step 2-2, Cross-channel layer normalization: Normalize the averaged domain features to eliminate feature deviations caused by uneven illumination and reflections on the steel plate surface; Step 2-3, Softmax calculation: The attention weight of each domain is calculated using the Softmax function, with values ranging from 0 to 1; Step 2-4, Feature selection and cross-domain fusion: The original domain features are weighted and filtered according to the attention weights; high-weight domains retain details through residual connections and have their defect features enhanced by channel scaling; low-weight domains have their feature values truncated to suppress background interference; adjacent domains are blended smoothly by bilinear interpolation.

4. The method for detecting steel surface defects based on cross-domain dynamic recombination U-Net according to claim 1, characterized in that the construction and processing flow of the dynamic channel reconstruction upsampling unit is as follows: Step 3-1: The input feature map is first upsampled by bilinear interpolation, enlarging its spatial resolution to 2H×2W, where H is the height of the feature map and W is its width; Step 3-2: Input the upsampled feature map into a depthwise separable convolutional layer; by separating the channel and spatial computations, the response intensity of background interference is reduced while the detail capture capability of the 3×3 convolutional kernel is preserved; Step 3-3: Perform channel grouping and shuffling on the output of the depthwise separable convolution, dividing the total number of channels into 32 independent subgroups; each subgroup is equipped with a dedicated local feature extractor, and feature interaction between groups is promoted through periodic channel permutation; Step 3-4: Compress and project the shuffled features using a 1×1 convolution to adjust the number of channels to the target dimension, and apply sparsification coefficients to redundant channels based on feature importance scores to eliminate redundant feature responses; Step 3-5: Add the processed feature map to the corresponding shallow high-resolution feature map passed from the encoder through a residual connection; these cross-layer connections recover the detailed information of indentations, thereby reducing the number of model parameters.

5. The method for detecting steel surface defects based on cross-domain dynamic recombination U-Net according to claim 1, characterized in that the encoder part of the CDR-Unet model consists of multiple stacked stages of convolution, CDFA modules, and downsampling, the input image size is 512×512×3, and the encoder specifically includes: First stage: Perform a 3×3 64-channel convolution, pass through a CDFA module once to obtain a preliminary effective feature layer of [512,512,64], and then perform 2×2 max pooling to obtain a feature layer of [256,256,64]; Second stage: Perform a 3×3 128-channel convolution, pass through a CDFA module once to obtain a feature layer of [256,256,128], and then perform 2×2 max pooling to obtain a feature layer of [128,128,128]; Third stage: Perform a 3×3 256-channel convolution, pass through a CDFA module once to obtain a feature layer of [128,128,256], and then perform 2×2 max pooling to obtain a feature layer of [64,64,256]; Fourth stage: Perform a 3×3 512-channel convolution, pass through a CDFA module once to obtain a feature layer of [64,64,512], and then perform 2×2 max pooling to obtain a feature layer of [32,32,512]; Fifth stage: Perform a 3×3 512-channel convolution and pass through a CDFA module once to obtain a preliminary effective feature layer of [32,32,512].

6. The method for detecting steel surface defects based on cross-domain dynamic recombination U-Net according to claim 5, characterized in that the decoder part of the CDR-Unet model uses the five preliminary effective feature layers obtained from the encoder part for progressive upsampling and feature fusion, specifically including: The feature layer of [32,32,512] obtained in the fifth stage is input into the DCRU module for upsampling to obtain a feature map of [64,64,512]; the upsampled feature map is concatenated with the [64,64,512] feature map from the fourth stage of the encoder to obtain a feature map of [64,64,1024]; the concatenated feature map undergoes a 3×3 256-channel convolution for channel compression and feature enhancement, resulting in a feature map of [64,64,256]; this feature map is input into the DCRU module for upsampling to obtain a feature map of [128,128,256], which is then concatenated with the [128,128,256] feature map from the third stage; the DCRU upsampling, concatenation, and convolution operations described above are repeated until the original input image resolution of 512×512 is restored.

7. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the steps of the method described in any one of claims 1-6.

8. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, it implements the method described in any one of claims 1-6.

9. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the method described in any one of claims 1-6.