Method, device and equipment for detecting ultra-small defects and storage medium

CN122265632A · Pending · Publication Date: 2026-06-23 · SHENZHEN NTEK TESTING TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHENZHEN NTEK TESTING TECH
Filing Date
2026-03-26
Publication Date
2026-06-23


Abstract

The application provides a method and device for detecting a very small defect, equipment and a storage medium. The method is based on a pre-trained improved neural network. The improved neural network is trained by using a combined loss function including a Focal Loss classification loss function and a GIoU Loss bounding box regression loss function. The improved neural network comprises a backbone network, a bidirectional feature pyramid network, an attention enhancement module and a detection head. The method comprises the following steps: inputting a target image to be detected into the pre-trained improved neural network; performing feature extraction on the target image based on the backbone network to obtain a multi-scale feature map; performing fusion processing on the multi-scale feature map based on the bidirectional feature pyramid network to obtain an enhanced feature map; performing feature recalibration on the enhanced feature map by the attention enhancement module to obtain an optimized feature map; and performing processing on the optimized feature map by the detection head to output category information and position information of the very small defect in the target image.

Description

Technical Field

[0001] This application relates to the field of industrial vision technology, and in particular to a method, apparatus, device, and storage medium for detecting extremely small defects.

Background Technology

[0002] In the field of high-end precision manufacturing, although deep learning-based surface defect detection methods have excellent recognition potential, their supervised learning paradigm relies heavily on massive amounts of accurately labeled data. However, in actual industrial scenarios, samples of extremely small defects are scarce, their shapes are varied, and prior knowledge is insufficient, leading to a dual dilemma of high labeling costs and severely inadequate generalization ability under small sample sizes.

Summary of the Invention

[0003] This application provides a method, apparatus, device, and storage medium for detecting extremely small defects, which can improve the reliability and accuracy of industrial vision systems in precision inspection and reduce training costs.

[0004] In a first aspect, embodiments of this application provide a method for detecting minute defects. The method is based on a pre-trained improved neural network, which is trained using a combined loss function comprising a Focal Loss classification loss function and a GIoU Loss bounding box regression loss function. The pre-trained improved neural network includes a backbone network, a bidirectional feature pyramid network, an attention enhancement module, and a detection head. The method includes: inputting the target image to be detected into the pre-trained improved neural network; performing feature extraction on the target image based on the backbone network to obtain a multi-scale feature map; fusing the multi-scale feature map based on the bidirectional feature pyramid network to obtain an enhanced feature map; recalibrating the enhanced feature map by the attention enhancement module to obtain an optimized feature map, wherein the attention enhancement module includes a channel attention submodule and a spatial attention submodule, the channel attention submodule is used to adjust the weights of the feature channels, and the spatial attention submodule is used to focus on the spatial location of the defect; and processing the optimized feature map by the detection head to output the category and location information of minute defects in the target image.

[0005] In a second aspect, embodiments of this application provide a device for detecting extremely small defects. The device can invoke a pre-trained improved neural network, which is trained using a combined loss function including a Focal Loss classification loss function and a GIoU Loss bounding box regression loss function. The pre-trained improved neural network includes a backbone network, a bidirectional feature pyramid network, an attention enhancement module, and a detection head. The device is used to execute any of the methods for detecting extremely small defects described in the embodiments of this application, and includes: an image transmission module, used to input the target image to be detected into the pre-trained improved neural network; a feature extraction module, used to extract features from the target image based on the backbone network to obtain a multi-scale feature map; a feature fusion module, used to fuse the multi-scale feature map based on the bidirectional feature pyramid network to obtain an enhanced feature map; a feature calibration module, used to recalibrate the enhanced feature map through the attention enhancement module to obtain an optimized feature map, wherein the attention enhancement module includes a channel attention submodule and a spatial attention submodule in sequence, the channel attention submodule is used to adjust the weights of the feature channels, and the spatial attention submodule is used to focus on the spatial location of the defect; and a result output module, used to process the optimized feature map through the detection head and output the category and location information of minute defects in the target image.

[0006] In a third aspect, embodiments of this application provide a detection device, which includes a memory and a processor. The memory is used to store a computer program. The processor is configured to execute the computer program and, in executing the computer program, to implement the method for detecting minute defects as described in any of the embodiments of this application.

[0007] In a fourth aspect, embodiments of this application provide a computer-readable storage medium storing a computer program, which, when executed by a processor, causes the processor to implement the method for detecting minute defects as described in any of the embodiments of this application.

[0008] This application provides a method for detecting minute defects. The method is based on a pre-trained improved neural network, which is trained using a combined loss function including a Focal Loss classification loss function and a GIoU Loss bounding box regression loss function. The pre-trained improved neural network includes a backbone network, a bidirectional feature pyramid network, an attention enhancement module, and a detection head. The method includes: inputting the target image to be detected into the pre-trained improved neural network; extracting features from the target image based on the backbone network to obtain multi-scale feature maps; fusing the multi-scale feature maps based on the bidirectional feature pyramid network to obtain enhanced feature maps; recalibrating the enhanced feature maps using the attention enhancement module to obtain optimized feature maps, wherein the attention enhancement module includes a channel attention submodule and a spatial attention submodule, the channel attention submodule being used to adjust the weights of feature channels, and the spatial attention submodule being used to focus on the spatial location of the defect; and processing the optimized feature map using the detection head to output the category and location information of the minute defects in the target image. In the above method, an improved neural network trained with a combined loss function including Focal Loss and GIoU Loss is used. The network extracts multi-scale features through the backbone network, achieves bidirectional fusion of shallow and deep features through the bidirectional feature pyramid network, performs channel and spatial recalibration through the attention enhancement module, and finally outputs the results through the detection head. Without significantly increasing data dependence, this method enhances the model's feature capture ability, localization accuracy, and overall detection robustness for scarce and minute defects, effectively solving the technical problem of insufficient generalization performance for detecting minute defects under small sample conditions.

Brief Description of the Drawings

[0009] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0010] Figure 1 is a schematic diagram of an improved neural network provided in an embodiment of this application;
Figure 2 is a schematic flowchart of a method for detecting minute defects provided in an embodiment of this application;
Figure 3 is a schematic diagram of the composition of an attention enhancement module provided in an embodiment of this application;
Figure 4 is a schematic block diagram of a device for detecting minute defects provided in an embodiment of this application;
Figure 5 is a flowchart of a method for constructing a detection system for minute defects provided in an embodiment of this application.

Detailed Description

[0011] To make the objectives, technical solutions, and advantages of this application clearer, the embodiments of this application will be described below with reference to the accompanying drawings.

[0012] The terms "first" and "second," etc., used in the specification, claims, and drawings of this application are used to distinguish different objects, not to describe a specific order. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or apparatus that includes a series of steps or units is not limited to the listed steps or units, but may optionally include steps or units not listed, or may optionally include other steps or units inherent to these processes, methods, products, or apparatuses.

[0013] The term "embodiment" as used herein means that a particular feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of this application. The appearance of this phrase in various places throughout the specification does not necessarily refer to the same embodiment, nor is it a separate or alternative embodiment mutually exclusive with other embodiments. It will be explicitly and implicitly understood by those skilled in the art that the embodiments described herein can be combined with other embodiments.

[0014] It should be understood that in this application, "at least one (item)" means one or more, "more than one" means two or more, "at least two (items)" means two or three or more, and "and / or" is used to describe the relationship between related objects, indicating that there can be three relationships. For example, "A and / or B" can mean: only A exists, only B exists, and A and B exist simultaneously, where A and B can be singular or plural. The character " / " generally indicates that the related objects before and after are in an "or" relationship. "At least one (item) of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one (item) of a, b, or c can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c can be single or multiple.

[0015] Please refer to Figure 1, which is a schematic diagram of an improved neural network provided by an embodiment of this application. As shown in Figure 1, the improved neural network includes an input port, a backbone network, a bidirectional feature pyramid, an attention enhancement module, a detection head, and a loss function, connected in sequence.

[0016] The input port is the image input end of the neural network. It receives a standardized input image of size 640x640x3, providing raw data of uniform specification for subsequent feature extraction, and is the starting point of feature input for the entire network. The target image to be detected is fed into the backbone network through the input port. The backbone network first performs preliminary feature extraction and dimensionality compression through initial convolutional downsampling. It then uses the C2f module to realize cross-stage feature connection, while a high-resolution branch is built to enhance the preservation and transmission of shallow features. A MobileViT lightweight Transformer is integrated to improve the efficiency and accuracy of feature extraction. The backbone outputs the P3, P4, and P5 multi-scale feature maps, laying the foundation for subsequent feature fusion.

[0017] The multi-scale feature map output from the backbone network is fed into the bidirectional feature pyramid PAN-FPN. This module performs deep fusion and enhancement of multi-scale features through bidirectional operations of top-down upsampling and feature fusion and bottom-up downsampling and feature fusion, generating enhanced multi-scale feature maps to achieve complementarity and optimization of features at different scales.

[0018] The enhanced multi-scale feature maps enter the attention enhancement module, which integrates channel attention and spatial attention mechanisms: first, global pooling is used to achieve feature selection for channel attention, and then channel pooling combined with convolution operations is used to complete feature enhancement for spatial attention. Through the dual attention mechanism, key features are highlighted and redundant information is suppressed, thereby further improving the effectiveness of the feature maps. The attention-enhanced feature map is fed into the detection head, which contains three detection branches of different scales: Head1 (80x80), Head2 (40x40), and Head3 (20x20). The multi-scale detection head can adapt to the detection needs of targets of different sizes and complete the detection and recognition of targets.

[0019] The output of the detection head is fed into the loss function module, which combines three loss functions in a weighted sum to obtain the total loss: GIoU Loss (which adds a minimum enclosing box penalty) is used as the regression loss, Focal Loss with α=0.25 and γ=2.0 is used as the classification loss, and BCE Loss is used as the confidence loss. Jointly computing these losses evaluates the error of the network's detection results and provides the basis for optimizing the network parameters.
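
The weighted sum in [0019] can be written compactly as follows. This is only a notational sketch: the weight symbols are hypothetical, since the application does not state the weight values.

```latex
L_{\text{total}} = \lambda_{\text{box}}\, L_{\text{GIoU}}
                 + \lambda_{\text{cls}}\, L_{\text{Focal}}(\alpha = 0.25,\ \gamma = 2.0)
                 + \lambda_{\text{obj}}\, L_{\text{BCE}},
\qquad \lambda_{\text{box}},\ \lambda_{\text{cls}},\ \lambda_{\text{obj}} > 0
```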

[0020] Please refer to Figure 2, which is a schematic flowchart of a method for detecting extremely small defects provided in an embodiment of this application. As shown in Figure 2, this method is based on a pre-trained improved neural network, which is trained using a combined loss function that includes a Focal Loss classification loss function and a GIoU Loss bounding box regression loss function. The pre-trained improved neural network includes a backbone network, a bidirectional feature pyramid network, an attention enhancement module, and a detection head. The method for detecting minute defects includes steps S101 to S105.

[0021] S101. Input the target image to be detected into the pre-trained improved neural network.

[0022] For example, the target image to be detected is input into the pre-trained improved neural network. The target image is typically a high signal-to-noise ratio image obtained through a physical enhancement imaging system.

[0023] The imaging process of the target image involves the interaction between a specific structured light field, generated by a computational super-oscillating optical element, and the sample surface. This interaction produces a nonlinear optical response to the edge and depth features of subwavelength defects such as nanoscale scratches or micro-dust. A nonlinearly amplified digital image is then formed by recording changes in light intensity, phase, or polarization state with a high-sensitivity photodetector in a non-imaging scanning manner.

[0024] The training process of the pre-trained improved neural network incorporates specialized optimization strategies for extremely small defects. The training dataset is constructed using a small-sample strategy that combines a model pre-trained on virtual defect data with semi-automatic annotation of real data. During model training, a combined loss function is applied, consisting of the Focal Loss classification loss function (whose parameters are adjusted to mitigate extreme positive-negative sample imbalance) and the GIoU Loss bounding box regression loss function (used to improve localization accuracy), to optimize the parameters and ensure the network can robustly recognize rare and varied defects. The input operation is performed on the computing device: the image data is converted into a tensor format that meets the requirements of the network input layer, for example a three-dimensional array resized to 640 by 640 pixels with pixel values normalized to the 0-1 range.
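
The input conversion described at the end of [0024] might look like the following minimal sketch. The function name `to_network_input` is hypothetical, and nearest-neighbour resizing with NumPy is only a stand-in; the application does not say which resizing method is used.

```python
import numpy as np

def to_network_input(image, size=640):
    """Resize an HxWx3 uint8 image to size x size and scale pixels to [0, 1].

    Nearest-neighbour resizing is an assumption made for illustration.
    """
    h, w = image.shape[:2]
    rows = np.arange(size) * h // size   # source row for each output row
    cols = np.arange(size) * w // size   # source column for each output column
    resized = image[rows[:, None], cols[None, :]]
    return resized.astype(np.float32) / 255.0

# Example with a synthetic 480x720 RGB frame.
frame = np.random.default_rng(0).integers(0, 256, size=(480, 720, 3), dtype=np.uint8)
tensor = to_network_input(frame)
```

The result is a 640x640x3 float array, matching the input size stated for the network's input port.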

[0025] In some embodiments, before inputting the target image to be detected into a pre-trained improved neural network, the method further includes: scanning and recording the object to be detected based on a super-oscillating optical element and a photodetector to obtain multiple initial scan images; and preprocessing the initial scan images to obtain the target image to be detected.

[0026] Multiple initial scan images are obtained by scanning and recording the object under test using a super-oscillating optical element and a photodetector. In this process, the computational super-oscillating optical element generates a specific structured light field. When this light field illuminates the sample surface, it produces a nonlinear response to subwavelength features such as the edges and depth of defects, encoding the defect information into the scattered light field. A high-sensitivity photodetector records the changes in light intensity, phase, or polarization state point by point during scanning. This non-imaging scanning detection nonlinearly amplifies the defect information and generates a defect response image with a high signal-to-noise ratio, thereby breaking the diffraction limit at the physical level and yielding initial scan images that contain information about extremely small defects.

[0027] The initial scan images are preprocessed to eliminate noise and enhance defect features, yielding the target image to be detected. The preprocessing specifically includes: denoising the initial scan images using filtering algorithms, such as wavelet denoising or non-local means filtering, to eliminate system noise and background interference; and applying contrast enhancement algorithms, such as adaptive histogram equalization or homomorphic filtering, to strengthen the contrast between the defect area and the background, making even minute defects more prominent and clear in the image. The image obtained after this preprocessing is the target image used for subsequent neural network detection.

[0028] S102. Based on the backbone network, feature extraction is performed on the target image to obtain a multi-scale feature map.

[0029] For example, feature extraction is performed on the input target image based on the backbone network to obtain a multi-scale feature map containing semantic information and spatial details at different levels.

[0030] The backbone network is not the original architecture, but rather a specially modified version adapted for detecting extremely small defects. The modifications include introducing a high-resolution auxiliary detection branch in the shallow feature extraction stage. This branch performs convolution operations directly on the high-spatial-resolution feature maps to capture the finest defects in the image. For example, nanometer-scale scratches may be only a few pixels in size, and the introduction of this branch significantly improves the initial recall rate for such tiny targets. In addition, some traditional convolutional modules in the backbone network are replaced with lightweight Transformer modules, such as the MobileViT module. Through their built-in self-attention mechanism, these modules perform global context modeling of the entire image, helping the network understand the overall morphology of defect regions rather than misjudging local image noise, thus enhancing the discriminative power of the feature representations.

[0031] Furthermore, the backbone network effectively transmits high-resolution shallow features rich in detail but weak in semantics to deeper network layers by increasing cross-layer connections. This, combined with a dense connection structure, promotes gradient flow and prevents feature information from being diluted during propagation.

[0032] The feature extraction process is carried out layer by layer. Through a series of convolution, normalization, and activation function operations, the output usually includes feature maps from the shallow, middle, and deep stages of the network, for example with spatial sizes of 80x80, 40x40, and 20x20 pixels respectively and a channel count that increases from stage to stage. These feature maps together constitute the multi-scale feature maps, containing both details and semantics, required for subsequent processing.
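
The example sizes in [0032] are consistent with a 640x640 input downsampled by factors of 8, 16, and 32. The short sketch below makes that arithmetic explicit. The stride values are an assumption inferred from the stated sizes; the application does not give them. The labels P3 to P5 follow the naming used for the backbone outputs.

```python
def feature_map_sizes(input_size=640, strides=(8, 16, 32)):
    """Spatial size of each backbone output for a square input.

    The strides are assumed values, chosen to reproduce the sizes in the text.
    """
    return {f"P{level}": input_size // stride
            for level, stride in zip((3, 4, 5), strides)}

sizes = feature_map_sizes()
```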

[0033] S103. Based on the bidirectional feature pyramid network, the multi-scale feature maps are fused to obtain the enhanced feature maps.

[0034] For example, a bidirectional feature pyramid network is used to fuse multi-scale feature maps to integrate multi-level information and output an enhanced feature map.

[0035] The bidirectional feature pyramid network constructs a top-down feature fusion path. In this path, the strong semantic information contained in the deep feature maps is upsampled to improve their spatial resolution. Then, it is added element-wise or concatenated channel-wise with the high-resolution feature maps of the corresponding scale from the shallow layers of the backbone network, thereby enhancing the detailed features with semantic information.

[0036] The bidirectional feature pyramid network also adds a bottom-up secondary feature fusion path. In this path, the shallow features that have already been fused through the top-down path and thus possess certain semantic meaning are downsampled again, and then fused with deeper features to construct a closed bidirectional feature flow, forming a feature pyramid structure.

[0037] This bidirectional, multi-level fusion mechanism, such as performing two top-down and two bottom-up cross-fusions, ensures that each scale of the feature map used for prediction contains both high-frequency details from the shallow layer for accurately locating the edges of tiny defects, and abstract semantics from the deep layer for accurately determining the defect category. This significantly improves the network's ability to represent tiny and ambiguous defect features, providing a key feature foundation for solving the challenge of balancing sensitivity and accuracy in detecting extremely small defects.
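
The bidirectional flow described in [0035] to [0037] can be sketched in NumPy as follows. This is a simplified illustration under several assumptions not stated in the application: all three maps share the same channel count, fusion is element-wise addition (the text also allows channel-wise concatenation), upsampling is nearest-neighbour, downsampling is 2x2 max pooling in place of a learned strided convolution, and only one top-down and one bottom-up pass are shown. The function names are hypothetical.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour upsampling of a (C, H, W) map by a factor of 2."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def downsample2x(x):
    """2x2 max pooling of a (C, H, W) map (stand-in for a strided convolution)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def bidirectional_fuse(p3, p4, p5):
    # Top-down path: deep semantics flow into the higher-resolution maps.
    t4 = p4 + upsample2x(p5)
    t3 = p3 + upsample2x(t4)
    # Bottom-up path: the fused shallow details flow back into the deeper maps.
    n3 = t3
    n4 = t4 + downsample2x(n3)
    n5 = p5 + downsample2x(n4)
    return n3, n4, n5

rng = np.random.default_rng(0)
p3, p4, p5 = (rng.standard_normal((16, s, s)) for s in (80, 40, 20))
n3, n4, n5 = bidirectional_fuse(p3, p4, p5)
```

Each output keeps the spatial size of its input level, so the three fused maps can be passed on to the attention enhancement module scale by scale.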

[0038] S104. The enhanced feature map is recalibrated using the attention enhancement module to obtain an optimized feature map. The attention enhancement module includes a channel attention submodule and a spatial attention submodule. The channel attention submodule is used to adjust the weights of the feature channels, and the spatial attention submodule is used to focus on the spatial location of the defect.

[0039] For example, the attention enhancement module is a serial structure that sequentially includes a channel attention submodule and a spatial attention submodule.

[0040] The channel attention submodule first receives the enhanced feature map as input and employs a multi-branch pooling strategy to retain more comprehensive contextual information. Within the channel attention submodule, global average pooling and global max pooling operations are simultaneously performed on the input feature map, yielding a global description vector for the channel's average response and a global description vector used to highlight the most significant response of each channel, respectively.

[0041] These two description vectors are fed into a multilayer perceptron with shared parameters. This multilayer perceptron typically has a dimensionality-reduction layer followed by a dimensionality-restoration layer, which lets it learn the nonlinear interactions between channels.

[0042] After element-wise addition of the two feature vectors output by the multilayer perceptron, a channel attention weight vector ranging from 0 to 1 is generated using the sigmoid activation function. Each element in this weight vector corresponds to a channel of the input feature map. By performing channel-wise multiplication of this channel attention weight vector with the original input feature map, different feature channels are reweighted, strengthening channels that strongly respond to the identification of specific defects (such as micropores) while suppressing irrelevant channels containing a large amount of background noise. This completes the feature recalibration of the channel dimension, outputting a channel-recalibrated feature map.
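
The channel recalibration in [0040] to [0042] can be illustrated with the NumPy sketch below. The reduction ratio `r`, the ReLU between the two perceptron layers, and the random weights are assumptions made for illustration; the application only states that the perceptron has a reduction layer and a restoration layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    """x: (C, H, W); w1: (C//r, C) reduction weights; w2: (C, C//r) restoration weights."""
    avg = x.mean(axis=(1, 2))   # global average pooling -> (C,)
    mx = x.max(axis=(1, 2))     # global max pooling -> (C,)

    def mlp(v):
        # Shared two-layer perceptron; the ReLU is an assumed activation.
        return w2 @ np.maximum(w1 @ v, 0.0)

    weights = sigmoid(mlp(avg) + mlp(mx))       # (C,), each value in (0, 1)
    return x * weights[:, None, None], weights  # channel-wise reweighting

rng = np.random.default_rng(0)
C, r = 16, 4
x = rng.standard_normal((C, 20, 20))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
y, weights = channel_attention(x, w1, w2)
```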

[0043] Subsequently, the channel-recalibrated feature map is fed into the spatial attention submodule, which performs average pooling and max pooling along the channel dimension to obtain two single-channel feature maps. These summarize the average and maximum responses across all channels at each spatial location.

[0044] The two single-channel feature maps are concatenated along the channel dimension to obtain a dual-channel feature map. This dual-channel feature map is then subjected to dimensionality reduction and feature transformation using a standard convolutional layer to generate a single-channel spatial attention weight map, which is then normalized using the Sigmoid function.

[0045] The spatial attention weight map is multiplied pixel-wise with the channel recalibration feature map to selectively enhance the spatial location of the feature map. This allows the neural network to spontaneously focus computational resources on pixel regions that may contain defects, while suppressing interference from background regions, thereby outputting an optimized feature map that is refined in both the channel dimension and focused in the spatial dimension.
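
The spatial recalibration in [0043] to [0045] can be sketched in the same style. The 7x7 kernel size, the zero padding, and the random kernel values are assumptions; the application only specifies a standard convolutional layer followed by a Sigmoid.

```python
import numpy as np

def conv2d_same(x, kernel):
    """x: (C_in, H, W); kernel: (C_in, k, k) with odd k. Returns a zero-padded (H, W) map."""
    _, h, w = x.shape
    k = kernel.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((h, w))
    for i in range(k):
        for j in range(k):
            # Contribution of kernel tap (i, j), summed over the input channels.
            out += np.tensordot(kernel[:, i, j], xp[:, i:i + h, j:j + w], axes=1)
    return out

def spatial_attention(x, kernel):
    """x: (C, H, W). Returns the reweighted map and the (H, W) attention map."""
    avg = x.mean(axis=0)              # average over channels -> (H, W)
    mx = x.max(axis=0)                # maximum over channels -> (H, W)
    stacked = np.stack([avg, mx])     # dual-channel map -> (2, H, W)
    attn = 1.0 / (1.0 + np.exp(-conv2d_same(stacked, kernel)))  # Sigmoid
    return x * attn[None], attn       # pixel-wise reweighting

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 20, 20))
kernel = rng.standard_normal((2, 7, 7)) * 0.1
y, attn = spatial_attention(x, kernel)
```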

[0046] S105. The optimized feature map is processed by the detection head to output the category and location information of tiny defects in the target image.

[0047] For example, a detection head typically consists of convolutional layers responsible for mapping optimized feature maps to detection predictions. The detection head is used to simultaneously predict object classification confidence and regress bounding box coordinates at each preset anchor point or grid cell of the feature map.

[0048] The classification confidence prediction outputs the probability value of each preset category (such as scratches, dust, holes) through the convolutional layer. Its training process is supervised by the aforementioned classification loss function (i.e., Focal Loss). This loss function reduces the contribution of easily classified background samples to the total loss through its adjustment factor, forcing the network to focus more on learning difficult-to-distinguish defect samples and background, so that it can still make a high confidence response to sparsely occurring defects even in the case of extremely imbalanced categories.
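
To make the down-weighting effect concrete, the sketch below evaluates the standard binary focal loss for a single prediction, using the α=0.25 and γ=2.0 values given for the loss function module. It assumes the application uses this standard form; the probabilities in the example are made-up illustrative values.

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one prediction.

    p: predicted probability of the positive (defect) class; y: 0 or 1.
    """
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy_background = focal_loss(0.01, 0)  # background pixel classified confidently and correctly
hard_defect = focal_loss(0.10, 1)      # defect that the model nearly misses
```

The hard defect sample contributes a loss several orders of magnitude larger than the easy background sample, which is the rebalancing behaviour the paragraph describes.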

[0049] Bounding box coordinate regression outputs the center point coordinates and the width and height offsets of the predicted box through another convolutional layer. Its training process is supervised by the aforementioned bounding box regression loss function (i.e., GIoU Loss). This loss function not only measures the intersection over union (IoU) between the predicted box and the ground truth box, but also introduces the smallest enclosing box of the two as a penalty term. Even when the predicted box and the ground truth box do not overlap initially, it provides an effective gradient direction that guides the predicted box toward the ground truth box. This significantly improves the accuracy of bounding box regression in scenarios with extremely high localization accuracy requirements, such as small defects.
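
The behaviour for non-overlapping boxes can be checked with a small sketch of the standard GIoU definition. It is assumed here that the application uses this form. Boxes are given as corner coordinates (x1, y1, x2, y2), which is a convention chosen for the example rather than the center and size encoding mentioned in the text.

```python
def giou(box_a, box_b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Area of the smallest axis-aligned box that encloses both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (enclose - union) / enclose

def giou_loss(box_a, box_b):
    return 1.0 - giou(box_a, box_b)

truth = (0, 0, 1, 1)
near_miss = giou_loss(truth, (2, 0, 3, 1))  # no overlap, one unit away
far_miss = giou_loss(truth, (4, 0, 5, 1))   # no overlap, three units away
```

Plain IoU is zero for both misses, whereas the GIoU loss is larger for the more distant box, so the loss still indicates which direction reduces the error.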

[0050] The output of the detection head is structured data containing bounding box coordinates, category labels, and corresponding confidence scores. This data undergoes post-processing steps such as non-maximum suppression to filter out highly overlapping low-confidence predicted boxes, resulting in a list of all identified minute defects in the target image and their precise location coordinates within the image.
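
The post-processing step can be illustrated with a greedy non-maximum suppression sketch. The IoU threshold of 0.5, the per-class suppression, and the detection tuples are assumptions for illustration; the application does not specify them.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(detections, iou_threshold=0.5):
    """Greedy per-class NMS over a list of (box, score, label) tuples."""
    kept = []
    for det in sorted(detections, key=lambda d: d[1], reverse=True):
        box, _, label = det
        # Keep a box unless a higher-scoring box of the same class overlaps it too much.
        if all(k[2] != label or iou(box, k[0]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept

detections = [
    ((10, 10, 20, 20), 0.90, "scratch"),
    ((11, 11, 21, 21), 0.60, "scratch"),  # overlaps the first box heavily
    ((50, 50, 56, 56), 0.80, "dust"),
]
kept = nms(detections)
```

Here the lower-confidence scratch box is suppressed, leaving one scratch and one dust detection.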

[0051] This application provides a method for detecting minute defects. The method is based on a pre-trained improved neural network, which is trained using a combined loss function including a Focal Loss classification loss function and a GIoU Loss bounding box regression loss function. The pre-trained improved neural network includes a backbone network, a bidirectional feature pyramid network, an attention enhancement module, and a detection head. The method includes: inputting the target image to be detected into the pre-trained improved neural network; extracting features from the target image based on the backbone network to obtain multi-scale feature maps; fusing the multi-scale feature maps based on the bidirectional feature pyramid network to obtain enhanced feature maps; recalibrating the enhanced feature maps using the attention enhancement module to obtain optimized feature maps, wherein the attention enhancement module includes a channel attention submodule and a spatial attention submodule, the channel attention submodule being used to adjust the weights of feature channels, and the spatial attention submodule being used to focus on the spatial location of the defect; and processing the optimized feature map using the detection head to output the category and location information of the minute defects in the target image. In the above method, an improved neural network trained with a combined loss function including Focal Loss and GIoU Loss is used. The network extracts multi-scale features through the backbone network, achieves bidirectional fusion of shallow and deep features through the bidirectional feature pyramid network, performs channel and spatial recalibration through the attention enhancement module, and finally outputs the results through the detection head. Without significantly increasing data dependence, this method enhances the model's feature capture ability, localization accuracy, and overall detection robustness for scarce and minute defects, effectively solving the technical problem of insufficient generalization performance for detecting minute defects under small sample conditions.

[0052] To more clearly illustrate the technical solution of this application, the technical solution of this application will be described below through specific embodiments. It should be noted that the specific embodiments are used to expand the description of the technical solution of this application, and are not intended to limit this application.

[0053] In some embodiments, feature extraction of the target image through a backbone network to obtain a multi-scale feature map includes: performing initial convolution and downsampling operations on the target image to obtain a first-level shallow feature map; performing convolution and feature transformation on the first-level shallow feature map to obtain a second-level mid-level feature map; performing further downsampling and feature transformation on the second-level mid-level feature map to obtain a third-level deep feature map; enhancing the first-level shallow feature map through a high-resolution auxiliary branch to generate an enhanced shallow feature map; and outputting the enhanced shallow feature map, the second-level mid-level feature map, and the third-level deep feature map together as a multi-scale feature map.

[0054] For example, the backbone network, designed to address the challenge of detecting minute defects, begins by performing an initial convolution operation on the input target image. This operation typically uses a set of 3x3 kernels with a stride of 1, possibly with padding to maintain the spatial size of the feature map. It is followed by a downsampling convolution or max pooling operation with a stride of 2, which initially extracts low-level visual features and reduces the data dimensionality. The resulting tensor is defined as the first-level shallow feature map. This map maintains a high spatial resolution, for example one-quarter the size of the original input image, and retains the finest original details of the image, such as edges, corners, and textures, which are crucial for recognizing the subtle contours of nanoscale scratches or dust particles.
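
As a worked check of the sizes described above, the following sketch applies the standard convolution output-size formula. The 640-pixel input side, the padding of 1, and the 3x3 kernel of the stride-2 step are assumptions chosen for illustration; the application does not fix these values.

```python
def conv_out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output side length of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


# Hypothetical 640x640 input image.
input_side = 640
# Initial 3x3 convolution, stride 1, padding 1: the spatial size is preserved.
after_conv = conv_out_size(input_side, kernel=3, stride=1, padding=1)
# Downsampling step with stride 2 (assumed 3x3 kernel, padding 1): the side is halved.
after_down = conv_out_size(after_conv, kernel=3, stride=2, padding=1)
print(after_conv, after_down)
```

Under these assumptions the first-level shallow feature map has half the side length of the input, which corresponds to one quarter of its pixel count.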

[0055] Subsequently, further convolution and feature transformations are applied to this first-level shallow feature map, gradually reducing the spatial size of the feature map while increasing the number of channels and improving the semantic abstraction capability of the features. The feature tensor output after this stage of processing is the second-level intermediate feature map.

[0056] A lightweight Transformer module (such as the MobileViT module) is introduced to perform deeper downsampling and feature transformation on the second-level mid-layer feature map. The self-attention mechanism of the lightweight Transformer module is used to model the global context of the image, thereby helping the network distinguish between real defect regions and local noise, resulting in the third-level deep feature map. The third-level deep feature map has the smallest spatial size but the most channels, carrying the most global and abstract semantic understanding of the image content.

[0057] In parallel with the main process described above, the key improvement lies in the establishment of a high-resolution auxiliary detection branch. This branch directly takes the first-level shallow feature map as input and performs feature enhancement and refinement through several convolutional layers that maintain spatial resolution. This is used to strengthen and retain high-frequency spatial details that are crucial for the detection of small defects. After processing by this branch, an enhanced shallow feature map is obtained.

[0058] The enhanced shallow feature map, the second-level mid-level feature map, and the third-level deep feature map complement each other in terms of spatial scale and semantic depth. The output is a multi-scale feature map, which provides an input foundation with rich details and high-level semantics for the subsequent feature fusion stage.

[0059] In some embodiments, a bidirectional feature pyramid network is used to fuse multi-scale feature maps to obtain an enhanced feature map, including: upsampling a third-level deep feature map and fusing it with a second-level mid-level feature map to generate a first intermediate fused feature map; upsampling the first intermediate fused feature map and fusing it with an enhanced shallow feature map to generate a second intermediate fused feature map; downsampling the second intermediate fused feature map and fusing it with the first intermediate fused feature map to generate a third intermediate fused feature map; downsampling the third intermediate fused feature map and fusing it with a third-level deep feature map to generate a fourth intermediate fused feature map; and aggregating the second and fourth intermediate fused feature maps to output an enhanced feature map.

[0060] For example, multi-scale feature maps are used as input to a bidirectional feature pyramid network and pass through two parallel feature fusion paths, one from top to bottom and the other from bottom to top.

[0061] In the top-down fusion path, the third-level deep feature map is selected and upsampled using bilinear interpolation or transposed convolution to expand its spatial size to match that of the second-level mid-level feature map. Then, the upsampled deep features are added element-wise to the second-level mid-level feature map to inject strong semantic information from the deep layers into the mid-level features, thus generating the first intermediate fused feature map. Continuing along the top-down path, the first intermediate fused feature map is upsampled again so that its resolution matches that of the enhanced shallow feature map. The result is then fused, through channel concatenation, with the enhanced shallow feature map, which is rich in the finest spatial details. This fusion combines the semantically enhanced features with the original image details to produce the second intermediate fused feature map.

[0062] In the bottom-up secondary fusion path, the second intermediate fusion feature map is downsampled through a convolutional layer with a stride of 2, reducing its size to the same as the first intermediate fusion feature map. Then, the downsampled features are fused element-wise with the first intermediate fusion feature map. This operation allows detailed information from the high-resolution features to be fed back into the lower-resolution feature stream, generating the third intermediate fusion feature map. Continuing along the bottom-up path, the third intermediate fusion feature map is downsampled again to the same scale as the initial input third-level deep feature map and fused with it. This achieves reverse information transfer and enhancement from shallow details to deep semantics, outputting the fourth intermediate fusion feature map.

[0063] After the above two paths, the second intermediate fusion feature map, which represents the result of mixing top-level details and semantics, is concatenated with the fourth intermediate fusion feature map, which represents the bottom-level semantics enhanced by detail feedback, along the channel dimension. Channel adjustment and fusion may be performed through a lightweight 1x1 convolutional layer to output an enhanced feature map. The enhanced feature map deeply integrates high-resolution spatial details and multi-level semantic context at multiple scales.
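
The data flow of the two fusion paths can be sketched in NumPy as follows. This is only a shape-level illustration under stated assumptions: nearest-neighbour upsampling stands in for bilinear interpolation, 2x2 average pooling stands in for the stride-2 convolution, the 1x1 projections use random placeholder weights, and the function names (`bifpn_fuse`, `project`, and so on) are hypothetical. The application does not say how the second and fourth intermediate maps are brought to a common resolution before concatenation; upsampling the fourth map is assumed here.

```python
import numpy as np


def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) map (stand-in for bilinear interpolation)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def downsample2x(x):
    """2x2 average pooling of a (C, H, W) map (stand-in for a stride-2 convolution)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))


def project(x, out_channels, rng):
    """1x1 convolution with random placeholder weights, used only to align channel counts."""
    weights = rng.standard_normal((out_channels, x.shape[0])) / np.sqrt(x.shape[0])
    return np.einsum("oc,chw->ohw", weights, x)


def bifpn_fuse(shallow, mid, deep, out_channels, rng):
    """shallow: (Cs, 4H, 4W), mid: (Cm, 2H, 2W), deep: (Cd, H, W)."""
    # Top-down path.
    f1 = project(upsample2x(deep), mid.shape[0], rng) + mid   # first map: element-wise addition
    f2 = np.concatenate([upsample2x(f1), shallow], axis=0)    # second map: channel concatenation
    # Bottom-up path (element-wise addition assumed for both fusions).
    f3 = project(downsample2x(f2), f1.shape[0], rng) + f1     # third map
    f4 = project(downsample2x(f3), deep.shape[0], rng) + deep  # fourth map
    # Aggregation: resize f4 to the resolution of f2 (assumption), concatenate, apply 1x1 conv.
    f4_up = upsample2x(upsample2x(f4))
    return project(np.concatenate([f2, f4_up], axis=0), out_channels, rng)
```

In a trained network the projections would be learned layers; here they only make the channel counts compatible so the additions and concatenations are well defined.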

[0064] In some embodiments, the enhanced feature map is recalibrated by the attention enhancement module to obtain an optimized feature map, including: inputting the enhanced feature map into the channel attention submodule to obtain a channel recalibrated feature map; and inputting the channel recalibrated feature map into the spatial attention submodule to obtain an optimized feature map.

[0065] In one embodiment, the enhanced feature map is input to the channel attention submodule to obtain the channel recalibrated feature map. Specifically, this includes: performing global average pooling and global max pooling on the input features to generate average pooling feature vectors and max pooling feature vectors; inputting the average pooling feature vectors and max pooling feature vectors into the same shared multilayer perceptron for processing and outputting the processed feature vectors; adding the two processed feature vectors element-wise and generating channel attention weight vectors through the sigmoid activation function; and performing channel-wise multiplication of the channel attention weight vectors with the input features to output the channel recalibrated feature map.

[0066] For example, the enhanced feature map serves as input to the channel attention submodule, which employs a dual-branch pooling strategy to comprehensively capture the channel-level global context. Specifically, global average pooling and global max pooling operations are performed on the input feature map over the spatial dimensions (i.e., height and width). The global average pooling operation calculates the arithmetic mean of all pixel values in each feature channel to characterize the overall channel activity, generating an average pooled feature vector. The global max pooling operation extracts the maximum value among all pixels in each feature channel, generating a max pooled feature vector that reflects the most significant response of the channel. Subsequently, the average pooled feature vector and the max pooled feature vector are each input into a multilayer perceptron with shared parameters. This multilayer perceptron typically consists of two fully connected layers: the first layer reduces the channel dimensionality to reduce computation and introduces non-linearity, while the second layer restores the dimensionality to the original number of channels. Through this network, the complex dependencies between channels are learned. The two processed feature vectors output by the multilayer perceptron are added element-wise, and each value of the sum is mapped to the range of 0 to 1 using a sigmoid activation function, thus generating a channel attention weight vector. Each scalar value in this vector corresponds to a specific channel of the original input feature map and represents the importance weight of that channel for the current detection task. This channel attention weight vector is then multiplied channel-wise with the input enhanced feature map, that is, each coefficient in the weight vector scales the pixel values at all spatial locations of the corresponding channel, thereby outputting a feature map that has been selectively enhanced along the channel dimension, i.e., the channel recalibrated feature map. This process strengthens the feature channels that respond strongly to specific types of defects (such as micropores or scratches).
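
The channel recalibration steps described above can be sketched in NumPy as follows. This is a minimal illustration, not the application's implementation: the function name `channel_attention`, the ReLU non-linearity, and the weight shapes `w1` (C/r by C) and `w2` (C by C/r) are assumptions, and the batch dimension is omitted.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def channel_attention(x, w1, w2):
    """x: enhanced feature map (C, H, W); w1: (C // r, C); w2: (C, C // r)."""
    avg_vec = x.mean(axis=(1, 2))   # global average pooling -> (C,)
    max_vec = x.max(axis=(1, 2))    # global max pooling -> (C,)

    def shared_mlp(v):
        # The same two-layer perceptron (reduce, ReLU, restore) is applied to both vectors.
        return w2 @ np.maximum(w1 @ v, 0.0)

    # Element-wise sum of the two branches, squashed into the range 0 to 1.
    weights = sigmoid(shared_mlp(avg_vec) + shared_mlp(max_vec))
    # Scale every spatial position of each channel by that channel's weight.
    return x * weights[:, None, None], weights
```

Returning the weight vector alongside the recalibrated map is a convenience for inspection only.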

[0067] In one specific embodiment, the attention enhancement module of this application is improved. Please refer to Figure 3, which is a schematic diagram of the components of the attention enhancement module. As shown in Figure 3, the improved attention enhancement module is a spatial and channel collaborative attention module (SCSA), which consists of two core sub-modules connected in series: shareable multi-semantic spatial attention (SMSA) and progressive channel self-attention (PCSA). First, spatial feature enhancement of the input feature map is completed through SMSA, and then its output is used as the input of PCSA for channel feature enhancement. Through dual attention optimization in the spatial and channel dimensions, the key information of the feature map is accurately selected and highlighted, effectively suppressing redundant features and improving feature representation capability.

[0068] The SMSA (shareable multi-semantic spatial attention) sub-module takes a feature map of dimensions B×C×H×W as input. First, average pooling is performed on the input feature map to obtain two one-dimensional feature maps, one along the height direction with dimensions B×C×H and one along the width direction with dimensions B×C×W. Then, each of the two feature maps is split into four feature parts, each with C/4 channels, containing one local feature and three global features at different scales. Subsequently, multi-scale depthwise shared 1D convolution (MS-DWConv1d) with kernel sizes of 3, 5, 7, and 9 is used to extract multi-scale information from all the split feature parts. The extracted features are then concatenated, and after group normalization with four groups (GroupNorm-4), the spatial attention weights in the height and width directions are calculated through a Sigmoid gating mechanism. The weights are then multiplied element-wise with the original input feature map to complete the feature enhancement in the spatial dimension.
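
A rough NumPy sketch of the SMSA steps is given below for a single sample (batch dimension omitted). Several points are assumptions made for illustration rather than details from the application: the kernels are random placeholders passed in by the caller, the same kernels are reused for the height and width branches as one reading of "shared", group normalization has no learnable scale or shift, and `np.convolve` is used as the per-channel 1D filter.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def group_norm(x, groups, eps=1e-5):
    """Normalise a (C, L) array over each group of C // groups channels."""
    c, length = x.shape
    g = x.reshape(groups, -1)
    g = (g - g.mean(axis=1, keepdims=True)) / np.sqrt(g.var(axis=1, keepdims=True) + eps)
    return g.reshape(c, length)


def ms_dwconv1d(x, kernels):
    """x: (C, L); kernels[i]: (C // 4, k_i), one 1D kernel per channel (depthwise)."""
    parts = np.split(x, len(kernels), axis=0)   # four parts of C // 4 channels each
    out = []
    for part, ker in zip(parts, kernels):
        rows = [np.convolve(row, k, mode="same") for row, k in zip(part, ker)]
        out.append(np.stack(rows))
    return np.concatenate(out, axis=0)          # back to (C, L)


def smsa(x, kernels):
    """x: (C, H, W) with H, W >= 9 and C divisible by 4."""
    x_h = x.mean(axis=2)   # (C, H): averaged over the width
    x_w = x.mean(axis=1)   # (C, W): averaged over the height
    a_h = sigmoid(group_norm(ms_dwconv1d(x_h, kernels), 4))
    a_w = sigmoid(group_norm(ms_dwconv1d(x_w, kernels), 4))
    # Broadcast the two directional weight maps over the input.
    return x * a_h[:, :, None] * a_w[:, None, :]
```

Because both gates lie between 0 and 1, the output never exceeds the input in magnitude; the module can only attenuate positions, with the least attenuation where both directional weights are high.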

[0069] The PCSA (progressive channel self-attention) sub-module takes the output feature map of SMSA as input. First, it reduces the spatial size of the B×C×H×W feature map using a downsampling method such as average pooling. Then, it normalizes the downsampled feature map and performs feature transformation by combining 1×1 convolution and depthwise convolution (DWConv) to generate the query, key, and value required for self-attention calculation, incorporating the channel attention (CA-MHSA) mechanism. The query and key are multiplied to obtain the attention matrix. After scaling to avoid numerical overflow, the channel attention weights are obtained through sigmoid gating and dropout regularization. The weights are then matrix-multiplied with the value to compute a weighted sum, and the channel-dimension enhanced feature map is output, completing the feature optimization process of the entire attention enhancement module.

[0070] In addition, the attention enhancement module also uses basic operations such as group normalization, dimensional transformation, and element-wise multiplication to provide technical support for attention calculation of SMSA and PCSA, ensuring efficient collaboration between the two sub-modules and accurate generation of attention weights.

[0071] In another embodiment, the channel recalibration feature map is input into the spatial attention submodule to obtain an optimized feature map. Specifically, this includes: performing average pooling and max pooling on the channel recalibration feature map in the channel dimension to obtain an average pooling feature map and a max pooling feature map; and concatenating the average pooling feature map and the max pooling feature map in the channel dimension to obtain a multi-channel aggregated feature map. Convolution and Sigmoid activation are applied to the multi-channel aggregated feature map to generate a spatial attention weight map; the spatial attention weight map and the channel recalibrated feature map are multiplied pixel by pixel to output an optimized feature map.

[0072] For example, the spatial attention submodule takes the channel recalibration feature map as input and performs feature aggregation along the channel dimension. Specifically, it performs average pooling and max pooling operations on the input channel recalibration feature map along the channel axis. The average pooling operation calculates the average value of all channels at each spatial location (i.e., each pixel) to generate an average pooled feature map that reflects the average response intensity of each location across all channels.

[0073] Max pooling extracts the maximum value of all channels at each spatial location, generating a single-channel max pooling feature map that highlights the most significant channel response at each location. Subsequently, the resulting average pooling feature map and max pooling feature map are concatenated along the channel dimension. Since both are single-channel, the concatenation generates a multi-channel aggregated feature map with two channels, summarizing the spatial importance of the feature maps from both "average" and "maximum" perspectives.

[0074] This multi-channel aggregated feature map is processed using a standard convolutional layer. This layer typically uses a 7x7 or 3x3 kernel to fuse and compress information from two channels, and learns the correlation between spatial locations through its weights, outputting a single-channel feature map. This single-channel feature map is then processed through a sigmoid activation function, normalizing each pixel value to the range of 0 to 1, thereby generating a spatial attention weight map. The value of each pixel in this map represents the importance of the corresponding spatial location for the defect detection task.

[0075] The generated spatial attention weight map is multiplied pixel by pixel with the input channel recalibration feature map. That is, the importance coefficient of each position in the spatial attention weight map is used to modulate the feature response values of all channels in the channel recalibration feature map at that spatial position, thereby outputting an optimized feature map. The optimized feature map significantly highlights pixel regions that may contain defects and suppresses the interference of background noise.
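
The spatial attention steps described above can be sketched in NumPy as follows. This is a minimal illustration for one sample: the function names are hypothetical, the convolution kernel is a caller-supplied placeholder rather than a learned weight, and zero padding is assumed so that the weight map keeps the input's height and width.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def conv2d_same(x, kernel):
    """Cross-correlate x (C_in, H, W) with kernel (C_in, k, k); zero padding keeps H x W."""
    _, k, _ = kernel.shape
    pad = k // 2
    h, w = x.shape[1:]
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((h, w))
    for i in range(k):
        for j in range(k):
            # Weighted sum over input channels for kernel offset (i, j).
            out += np.einsum("c,chw->hw", kernel[:, i, j], xp[:, i:i + h, j:j + w])
    return out


def spatial_attention(x, kernel):
    """x: channel recalibrated feature map (C, H, W); kernel: (2, k, k) with odd k."""
    avg_map = x.mean(axis=0, keepdims=True)                   # (1, H, W)
    max_map = x.max(axis=0, keepdims=True)                    # (1, H, W)
    aggregated = np.concatenate([avg_map, max_map], axis=0)   # (2, H, W)
    weight_map = sigmoid(conv2d_same(aggregated, kernel))     # (H, W), values in 0..1
    # Every channel at a given position is scaled by that position's weight.
    return x * weight_map[None, :, :], weight_map
```

A 7x7 kernel gives each weight a wider spatial context than a 3x3 kernel at a higher computational cost; either fits the `(2, k, k)` shape expected here.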

[0076] In some embodiments, before training the improved neural network, the method further includes: setting a classification loss function to address extreme imbalance between positive and negative samples; setting a bounding box regression loss function to improve the accuracy of locating minor defects; and weighted summing the classification loss function and the bounding box regression loss function to form a combined loss function.
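
Written out, the weighted summation takes the following form. The symbols are assumptions introduced here for illustration: $L_{FL}$ denotes the Focal Loss classification loss, $L_{GIoU}$ denotes the GIoU Loss bounding box regression loss, and $\lambda_{cls}$ and $\lambda_{box}$ are hypothetical weighting coefficients whose values the application does not specify.

```latex
L_{total} = \lambda_{cls} \, L_{FL} + \lambda_{box} \, L_{GIoU},
\qquad \lambda_{cls} > 0, \; \lambda_{box} > 0
```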

[0077] In some embodiments, a classification loss function is set to address the extreme imbalance between positive and negative samples, including: defining a balance factor and a focus factor, wherein the balance factor is used to adjust the influence between positive and negative samples, and the focus factor is used to adjust the weights of easy and difficult samples; constructing a Focal Loss classification loss function based on the balance factor and the focus factor, wherein the Focal Loss classification loss function introduces a modulation factor to focus model training on samples that are difficult to distinguish; and configuring the parameter values of the balance factor and the focus factor, taking into account the sparse foreground characteristic of extremely small defects, to obtain the classification loss function.

[0078] For example, the classification loss function set to address the extreme imbalance between positive and negative samples in the detection of extremely small defects is a targeted improvement of the standard cross-entropy loss. The setting process requires explicitly defining two core adjustment parameters: the balance factor $\alpha_t$ and the focus factor $\gamma$. The balance factor $\alpha_t$ is used to adjust the relative contribution weights of positive and negative samples in the total loss, mitigating the class imbalance caused by background pixels far outnumbering defect pixels. The focus factor $\gamma$ is used to adjust the weights of easy and difficult samples, so that during training the model concentrates on the difficult samples that are hard to distinguish from the background. Based on the balance factor $\alpha_t$ and the focus factor $\gamma$, the Focal Loss classification loss function is constructed, the specific formula of which is: $FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$; where $p_t$ is the model's predicted probability for the target class; $\alpha_t$ is the balance factor, used to adjust the influence between positive and negative samples; and $\gamma$ is the focus factor, used to adjust the weights of easy and difficult samples.

[0079] When a sample is correctly classified with a high predicted probability (i.e., an easy sample), the modulation factor $(1 - p_t)^{\gamma}$ approaches 0, automatically reducing the weight of such samples in the total loss; conversely, for difficult samples with low predicted probabilities, the loss weight is largely preserved. This mechanism forces the model to focus its optimization on difficult-to-classify samples. In extremely small defect detection, the foreground (i.e., the defect) is extremely sparse and the vast majority of samples are simple, easily classified background samples, so the balance factor $\alpha_t$ and the focus factor $\gamma$ need to be specifically configured. A typical strategy is to increase the value of the focus factor $\gamma$ (e.g., set it to 2.0 or higher) to more strongly suppress the gradient contribution of the large number of simple negative samples; at the same time, the $\alpha_t$ value corresponding to positive samples can be appropriately reduced, or the $\alpha_t$ value corresponding to negative samples can be increased, to finely balance the loss ratio between positive and negative samples.

[0080] In its implementation, the construction of this classification loss function involves the following steps: First, create a FocalLoss class that inherits from a neural network module base class (such as nn.Module), defining learnable or pre-defined alpha and gamma parameters in its constructor. Second, in the forward propagation function, handle the output format specific to the object detection network and calculate the basic cross-entropy loss for each predicted box. Third, substitute the calculated basic loss, the predicted probability of the target class, and the alpha and gamma parameters into the Focal Loss formula to obtain the classification loss value. The classification loss function constructed and configured in this way is specifically designed for training networks that detect extremely small defects. It effectively alleviates class imbalance and improves the model's sensitivity to sparse, small defects.
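
The calculation at the heart of these steps can be sketched without any framework as follows. This is a per-sample, binary (defect versus background) illustration rather than the nn.Module class described above; the function name `focal_loss`, the default values `alpha=0.25` and `gamma=2.0`, and the clamping constant `eps` are assumptions.

```python
import math


def focal_loss(p: float, y: int, alpha: float = 0.25, gamma: float = 2.0,
               eps: float = 1e-7) -> float:
    """Binary focal loss for one prediction.

    p: predicted probability of the positive (defect) class.
    y: ground-truth label, 1 for defect and 0 for background.
    """
    p = min(max(p, eps), 1.0 - eps)              # keep log() finite
    p_t = p if y == 1 else 1.0 - p               # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha   # balance factor for the true class
    cross_entropy = -math.log(p_t)               # basic cross-entropy loss
    return alpha_t * (1.0 - p_t) ** gamma * cross_entropy


# An easy background sample contributes far less than a hard defect sample.
easy_negative = focal_loss(0.05, 0)
hard_positive = focal_loss(0.30, 1)
```

Setting `gamma=0.0` removes the modulation factor and leaves an alpha-weighted cross-entropy, which is a convenient sanity check when tuning the two parameters.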

[0081] In some embodiments, a bounding box regression loss function is set to improve the accuracy of small defect localization, including: defining a Generalized Intersection over Union (GIoU) function, which, on the basis of calculating the Intersection over Union (IoU) of the predicted bounding box and the true bounding box, introduces the area of the minimum bounding rectangle that encloses both the predicted and true bounding boxes as a penalty term, so as to provide an effective gradient signal when the predicted and true bounding boxes do not overlap; and constructing a GIoU Loss bounding box regression loss function based on the GIoU function, wherein the GIoU Loss bounding box regression loss function drives the predicted bounding box toward the true bounding box by minimizing the area of the minimum bounding rectangle, thus obtaining the bounding box regression loss function.

[0082] For example, a bounding box regression loss function is set up to improve the accuracy of locating tiny defects, aiming to address the vanishing gradient that occurs when the predicted and ground truth bounding boxes do not overlap, and the insufficient supervision of the relative positional relationship of the bounding boxes by traditional loss functions. The core of this setup process is defining a Generalized Intersection over Union (GIoU) function. The basic IoU function between the predicted bounding box A and the ground truth bounding box B is defined as the area of their intersection divided by the area of their union: $IoU = \frac{|A \cap B|}{|A \cup B|}$; building upon this, the GIoU function introduces a penalty term based on the smallest bounding rectangle C that simultaneously encloses both the predicted bounding box A and the ground truth bounding box B. The formula for calculating the GIoU is as follows: $GIoU = IoU - \frac{|C \setminus (A \cup B)|}{|C|}$; where $|C \setminus (A \cup B)|$ represents the area of the smallest bounding rectangle C minus the area of the union of A and B, i.e., the area within C that is not covered by either bounding box. The range of GIoU is [-1, 1], and GIoU equals 1 when the two bounding boxes perfectly overlap. When GIoU is used as a loss function, its form is: $L_{GIoU} = 1 - GIoU$; in the GIoU Loss bounding box regression loss function, even if the predicted bounding box and the true bounding box have no overlap (in which case the IoU is 0 and the traditional IoU Loss cannot provide a gradient direction), the presence of the penalty term $\frac{|C \setminus (A \cup B)|}{|C|}$ means that the loss function can still provide an effective gradient signal. This gradient signal drives the model not only to increase the overlap between the predicted and ground truth boxes during optimization (maximizing IoU), but also to reduce the area of their minimum bounding rectangle C, thereby forcing the predicted boxes to move closer to the ground truth boxes in a more reasonable direction and manner.

[0083] In its implementation, constructing the bounding box regression loss function involves the following steps: First, implement a bounding box IoU calculation function (e.g., the `bbox_iou` function) that supports calculating various IoU variants, including GIoU; internally, it needs to handle coordinate format conversion and correctly calculate the intersection, the union, and the area of the minimum bounding rectangle C. Second, create a GIoU Loss class that inherits from the neural network module base class, calls the aforementioned GIoU calculation function in its forward propagation function, and supports reduction operations such as summation or averaging of the loss. The GIoU Loss constructed in this way serves as the bounding box regression loss function. During training, it provides more robust and refined localization supervision for the model, improving the prediction accuracy of bounding boxes for extremely small defects.
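
The GIoU calculation can be sketched in plain Python as follows. This is an illustration for a single pair of axis-aligned boxes in `(x1, y1, x2, y2)` corner format with positive area; the function names `giou` and `giou_loss` are hypothetical, and the coordinate format conversion and loss reduction mentioned above are left out.

```python
def giou(box_a, box_b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2) with x2 > x1 and y2 > y1."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection area (zero when the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    # Union area.
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    # Area of the smallest rectangle C enclosing both boxes.
    c_area = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    # Penalty term: the share of C not covered by either box.
    return iou - (c_area - union) / c_area


def giou_loss(box_a, box_b):
    """GIoU Loss for one predicted / ground-truth pair."""
    return 1.0 - giou(box_a, box_b)


# Two non-overlapping predictions: IoU is 0 for both, but the loss still
# distinguishes the nearer one from the farther one.
truth = (0.0, 0.0, 1.0, 1.0)
near_loss = giou_loss((2.0, 0.0, 3.0, 1.0), truth)
far_loss = giou_loss((4.0, 0.0, 5.0, 1.0), truth)
```

The last lines illustrate the property the text relies on: although both predictions have zero overlap with the ground truth, the farther box receives the larger loss, so the optimizer still has a direction to move in.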

[0084] In some embodiments, before inputting the target image to be detected into the pre-trained improved neural network, the method further includes: scanning and recording the object to be detected based on a super-oscillating optical element and a photodetector to obtain multiple initial scan images; preprocessing the initial scan images to obtain preprocessed images; labeling the preprocessed images based on pre-trained model-assisted inference and manual correction to obtain precisely labeled images; constructing a training dataset using the precisely labeled images, and performing model validation and hyperparameter calibration to determine the optimal hyperparameter combination for training the improved neural network.

[0085] Specifically, a refined annotation strategy is employed to annotate preprocessed images to obtain accurately labeled images. Considering the high cost of annotating real samples, this embodiment adopts a "pre-training-inference-correction" cyclical annotation strategy. A basic object detection model (e.g., the YOLO model) is pre-trained on a large-scale, highly diverse virtual defect dataset. This virtual dataset is generated through computer simulation and contains various known and hypothetical defect types with accurate automatic annotations. Then, the pre-trained model is used to infer on unlabeled real preprocessed images to generate candidate defect prediction boxes. Due to the domain differences between virtual and real data, these prediction results can only serve as high-confidence "approximate annotations," requiring manual verification and correction of these candidate prediction boxes to eliminate false detections, supplement missed detections, and adjust the accuracy of the bounding boxes. This approach eliminates the need for re-annotation from scratch, reducing the cost of manual annotation.

[0086] A training dataset is constructed using the precisely labeled images mentioned above, and model validation and hyperparameter calibration are performed to determine the optimal combination of hyperparameters for training the improved neural network. A very small but precise set of real-world labeled data is used as a reference standard for model fine-tuning and hyperparameter search validation. Specifically, K-fold cross-validation can be used: the reference dataset is divided into K subsets, each of which is used in turn as the validation set to repeatedly evaluate the model's performance. Simultaneously, grid search or Bayesian optimization algorithms are combined to determine the optimal combination of hyperparameters such as learning rate, batch size, and number of fine-tuning epochs. This approach ensures that the model's detection capabilities in real-world scenarios are accurately calibrated, thereby achieving maximum performance improvement with minimal data overhead.
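
The combination of K-fold cross-validation and grid search can be sketched in plain Python as follows. The `evaluate` callback is a hypothetical placeholder that is assumed to fine-tune the model on the training indices with the given hyperparameters and return a validation score where higher is better; the application does not specify the metric, and Bayesian optimization is not covered by this sketch.

```python
from itertools import product


def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for each of k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    for i, val in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val


def grid_search(n_samples, k, grid, evaluate):
    """Return the hyperparameter combination with the best mean validation score.

    grid: dict mapping a hyperparameter name to a list of candidate values.
    evaluate: hypothetical callback (params, train_idx, val_idx) -> float.
    """
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [evaluate(params, train, val)
                  for train, val in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score
```

A call might pass a grid such as `{"lr": [1e-2, 1e-3], "batch_size": [4, 8], "epochs": [10, 20]}`; in practice the samples are usually shuffled before the folds are formed, which this sketch omits.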

[0087] Please refer to Figure 4, which is a schematic block diagram of a device for detecting minute defects according to an embodiment of this application. The device 200 can invoke a pre-trained improved neural network, which is trained using a combined loss function including a Focal Loss classification loss function and a GIoU Loss bounding box regression loss function. The pre-trained improved neural network includes a backbone network, a bidirectional feature pyramid network, an attention enhancement module, and a detection head. The device 200 is used to execute any of the minute defect detection methods described in the embodiments of this application and can be configured in a server.

[0088] The server can be a standalone server, a server cluster, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), and big data and artificial intelligence platforms.

[0089] As shown in Figure 4, the device 200 for detecting minute defects includes: an image transmission module 201, a feature extraction module 202, a feature fusion module 203, a feature calibration module 204, and a result output module 205.

[0090] Image transmission module 201 is used to input the target image to be detected into a pre-trained improved neural network.

[0091] The feature extraction module 202 is used to extract features from the target image based on the backbone network to obtain a multi-scale feature map.

[0092] The feature fusion module 203 is used to fuse multi-scale feature maps based on a bidirectional feature pyramid network to obtain enhanced feature maps.

[0093] The feature calibration module 204 is used to recalibrate the enhanced feature map through the attention enhancement module to obtain an optimized feature map. The attention enhancement module includes a channel attention submodule and a spatial attention submodule. The channel attention submodule adjusts the weights of the feature channels, and the spatial attention submodule focuses on the spatial location of the defects.

[0094] The result output module 205 is used to process the optimized feature map through the detection head and output the category and location information of tiny defects in the target image.

[0095] In a separate embodiment, please refer to Figure 5, which is a flowchart illustrating a method for constructing a detection system for minute defects provided in an embodiment of this application.

[0096] As shown in Figure 5, the first stage focuses on dataset construction. To overcome the physical limitations of traditional optical microscopes, which are constrained by the diffraction limit and cannot stably resolve minute features, this scheme employs computational superoscillatory optical elements to generate specific illumination light fields. The light field interacts with the sample surface, nonlinearly amplifying and modulating subwavelength-scale defect features into the detection signal, thereby directly generating defect response images with a high signal-to-noise ratio at the physical level and providing a high-quality data source for subsequent algorithm processing. To address the core challenge of scarce samples and difficult annotation of minute defects in real industrial production, the process introduces a refined annotation strategy. The core idea of this strategy is to minimize the cost of manually annotating minute defects. It typically uses a model pre-trained on a large-scale virtual defect dataset to perform preliminary inference on real images, generating high-confidence candidate annotations. Human intervention is only required for rapid verification and correction, thus greatly improving annotation efficiency and laying the foundation for few-shot learning.

[0097] After obtaining high-quality labeled data, the process enters the core stage of optimizing the YOLOv11 model. This stage includes multiple parallel improvement paths. Firstly, a multi-scale feature fusion network is constructed. Addressing the issue of insufficient fusion of semantic and positional information for small targets in the original network, structures such as bidirectional feature pyramids are introduced to enhance the interaction and fusion between features at different levels, ensuring that the feature maps used for prediction simultaneously contain rich details and high-level semantics. Secondly, the loss function is improved, aiming to simultaneously solve two major algorithmic challenges in detecting extremely small defects: extreme imbalance between positive and negative samples and difficulty in accurate localization. Improvements to the classification loss function (such as using Focal Loss with adjusted parameters) effectively alleviate the class imbalance problem by reducing the loss weights of a large number of simple background samples, forcing the model to focus on learning defect samples that are difficult to distinguish. Improvements to the regression loss function (such as using GIoU Loss) improve the model's localization accuracy and training stability in complex backgrounds, providing effective gradient signals even when the predicted bounding box does not overlap with the ground truth bounding box, driving accurate convergence of the predicted bounding box.
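The two loss terms and their weighted combination can be sketched as follows. This is a minimal, framework-free illustration for a single prediction, not the applicant's implementation. The values `alpha=0.25` and `gamma=2.0` are commonly used Focal Loss defaults and are assumptions here, since the application only mentions adjusted parameters. The weights `w_cls` and `w_box` are likewise hypothetical.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one prediction.

    p is the predicted defect probability and y is the label (1 = defect,
    0 = background). The (1 - p_t) ** gamma factor shrinks the loss of
    easy, well-classified samples such as plain background.
    """
    p = min(max(p, 1e-7), 1.0 - 1e-7)          # avoid log(0)
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def giou_loss(box_a, box_b):
    """1 - GIoU for two (x1, y1, x2, y2) boxes with positive area."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest axis-aligned box enclosing both boxes.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    giou = iou - (hull - union) / hull
    return 1.0 - giou


def combined_loss(cls_loss, box_loss, w_cls=1.0, w_box=1.0):
    """Weighted sum of the classification and box regression losses."""
    return w_cls * cls_loss + w_box * box_loss


# Disjoint boxes: the loss still grows with distance, so it carries a
# useful training signal even when IoU is zero.
near = giou_loss((0, 0, 1, 1), (2, 0, 3, 1))
far = giou_loss((0, 0, 1, 1), (4, 0, 5, 1))
```

In the example, `far` is larger than `near` although both pairs have zero overlap, which is the property the paragraph above attributes to GIoU Loss.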

[0098] Further optimization is achieved by integrating channel and spatial attention mechanisms, which aim to suppress background noise and amplify weak defect-related signals, thereby improving overall detection sensitivity. Specifically, the channel attention module evaluates the importance of the different feature channels and emphasizes those most effective for identifying minute defects. The spatial attention module operates on the spatial dimensions of the feature map, enabling the model to adaptively focus on image regions where defects may exist and improving the sensitivity and robustness of minute-defect detection.
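The channel-then-spatial recalibration can be sketched in NumPy, following the sequence of operations recited in claim 4. This is an illustrative sketch under assumptions, not the applicant's implementation: the ReLU hidden activation, the reduced hidden width of the shared perceptron, the 7x7 kernel, and all function and parameter names are hypothetical, and the weights are random rather than trained.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(x, w1, w2):
    """x: (C, H, W). w1: (C_hidden, C) and w2: (C, C_hidden) form the shared MLP."""
    avg_vec = x.mean(axis=(1, 2))          # global average pooling -> (C,)
    max_vec = x.max(axis=(1, 2))           # global max pooling -> (C,)

    def shared_mlp(v):
        return w2 @ np.maximum(w1 @ v, 0.0)   # ReLU hidden layer (assumed)

    weights = sigmoid(shared_mlp(avg_vec) + shared_mlp(max_vec))
    return x * weights[:, None, None]      # channel-by-channel scaling


def spatial_attention(x, kernel):
    """x: (C, H, W). kernel: (2, k, k) with odd k, applied with zero padding."""
    pooled = np.stack([x.mean(axis=0), x.max(axis=0)])   # (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))
    h, w = x.shape[1:]
    logits = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            # Cross-correlation, as convolution layers in deep learning do.
            logits[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return x * sigmoid(logits)[None, :, :]   # pixel-by-pixel scaling


def attention_enhance(x, w1, w2, kernel):
    """Channel recalibration followed by spatial recalibration."""
    return spatial_attention(channel_attention(x, w1, w2), kernel)


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))           # enhanced feature map (C, H, W)
w1 = rng.standard_normal((2, 8)) * 0.1
w2 = rng.standard_normal((8, 2)) * 0.1
kernel = rng.standard_normal((2, 7, 7)) * 0.1
out = attention_enhance(x, w1, w2, kernel)     # same shape as x
```

Because both weight maps pass through a Sigmoid, each feature value is scaled by a factor between 0 and 1, so the module re-weights the feature map without changing its shape.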

[0099] After the model improvements, the process enters the application and evolution phase. The improved YOLOv11 is trained so that the model learns the patterns of minute defects from data. After deployment, the system supports real-time detection and introduces an online rapid fine-tuning mechanism. This mechanism allows the system to continuously learn from newly encountered data, achieving small-sample adaptation and improving the system's robustness to potential changes in defect morphology on the production line. The entire process forms a closed loop: the system continuously optimizes the model and its parameters based on detection feedback, iteratively improving detection performance in practice until stable and reliable detection of minute defects is achieved. Taken as a whole, Figure 5 shows the technical framework of an end-to-end solution integrating physical enhancement, algorithmic innovation, and online learning.

[0100] This application provides a detection device, which includes a memory and a processor; the memory is used to store a computer program; the processor is used to execute the computer program and, when executing the computer program, implement a method for detecting minute defects as described in any of the embodiments of this application.

[0101] This application provides a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, it enables the processor to implement a method for detecting minute defects as described in any of the embodiments of this application.

[0102] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in this application, and these modifications or substitutions should all be covered within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.

Claims

1. A method for detecting extremely small defects, characterized in that, The method is based on a pre-trained improved neural network, which is trained by using a combined loss function that includes Focal Loss classification loss function and GIoU Loss bounding box regression loss function; The pre-trained improved neural network includes: a backbone network, a bidirectional feature pyramid network, an attention enhancement module, and a detection head; the method includes: The target image to be detected is input into the pre-trained improved neural network; Based on the backbone network, feature extraction is performed on the target image to obtain a multi-scale feature map; The multi-scale feature map is fused based on the bidirectional feature pyramid network to obtain an enhanced feature map. The enhanced feature map is recalibrated by the attention enhancement module to obtain an optimized feature map; wherein, the attention enhancement module includes a channel attention submodule and a spatial attention submodule, the channel attention submodule is used to adjust the weights of the feature channels, and the spatial attention submodule is used to focus on the spatial location of the defect; The detection head processes the optimized feature map and outputs the category and location information of minute defects in the target image.

2. The target detection method according to claim 1, characterized in that, The step of extracting features from the target image through the backbone network to obtain a multi-scale feature map includes: The target image is subjected to initial convolution and downsampling operations to obtain the first-level shallow feature map; The first-level shallow feature map is convolved and transformed to obtain the second-level middle-level feature map; The second-level middle-level feature map is further downsampled and transformed to obtain the third-level deep feature map; The first-level shallow feature map is enhanced by a high-resolution auxiliary branch to generate an enhanced shallow feature map; The enhanced shallow feature map, the second-level middle-level feature map, and the third-level deep feature map are collectively output as the multi-scale feature map.

3. The target detection method according to claim 2, characterized in that, The process of fusing the multi-scale feature maps through the bidirectional feature pyramid network to obtain enhanced feature maps includes: The third-level deep feature map is upsampled and fused with the second-level middle-level feature map to generate a first intermediate fused feature map; The first intermediate fused feature map is upsampled and fused with the enhanced shallow feature map to generate a second intermediate fused feature map; The second intermediate fused feature map is downsampled and fused with the first intermediate fused feature map to generate a third intermediate fused feature map; The third intermediate fused feature map is downsampled and fused with the third-level deep feature map to generate a fourth intermediate fused feature map; The second intermediate fused feature map and the fourth intermediate fused feature map are aggregated to output the enhanced feature map.

4. The target detection method according to claim 1, characterized in that, The step of recalibrating the enhanced feature map through the attention enhancement module to obtain an optimized feature map includes: The enhanced feature map is input into the channel attention submodule to obtain the channel recalibration feature map, specifically including: Global average pooling and global max pooling are performed on the input features respectively to generate average pooling feature vectors and max pooling feature vectors; The average pooling feature vector and the max pooling feature vector are respectively input into the same shared multilayer perceptron for processing, and the processed feature vector is output. The two processed feature vectors are added element by element, and a channel attention weight vector is generated by using the Sigmoid activation function. The channel attention weight vector is multiplied with the input feature channel by channel to output the channel recalibration feature map. The channel recalibration feature map is input into the spatial attention submodule to obtain the optimized feature map, specifically including: In the channel dimension, average pooling and max pooling are performed on the channel recalibration feature map respectively to obtain average pooling feature map and max pooling feature map; The average pooling feature map and the max pooling feature map are concatenated along the channel dimension to obtain a multi-channel aggregated feature map; The multi-channel aggregated feature map is subjected to convolution and Sigmoid activation to generate a spatial attention weight map; The spatial attention weight map and the channel recalibration feature map are multiplied pixel by pixel to output the optimized feature map.

5. The target detection method according to claim 1, characterized in that, The method further includes the following steps before training the improved neural network: Set a classification loss function to address the extreme imbalance between positive and negative samples; Set a bounding box regression loss function to improve the accuracy of locating minute defects; The combined loss function is formed by weighted summation of the classification loss function and the bounding box regression loss function.

6. The target detection method according to claim 1, characterized in that, Before inputting the target image to be detected into the pre-trained improved neural network, the method further includes: Multiple initial scan images are obtained by scanning and recording the object to be detected using ultra-oscillating optical elements and photodetectors; The initial scan images are preprocessed to obtain the target image to be detected.

7. The target detection method according to claim 1, characterized in that, Before inputting the target image to be detected into the pre-trained improved neural network, the method further includes: Multiple initial scan images are obtained by scanning and recording the object to be detected using ultra-oscillating optical elements and photodetectors; The initial scan images are preprocessed to obtain preprocessed images; Based on pre-trained model-assisted inference and manual correction, the preprocessed images are labeled to obtain accurately labeled images; A training dataset is constructed using the accurately labeled images, and model validation and hyperparameter calibration are performed to determine the optimal combination of hyperparameters for training the improved neural network.

8. A device for detecting extremely small defects, characterized in that, The device for detecting minute defects can call a pre-trained improved neural network, which is trained by using a combined loss function that includes Focal Loss classification loss function and GIoU Loss bounding box regression loss function. The pre-trained improved neural network includes: a backbone network, a bidirectional feature pyramid network, an attention enhancement module, and a detection head. The device for detecting extremely small defects is used to execute the method for detecting extremely small defects as described in any one of claims 1-7. The device for detecting extremely small defects includes: The image transmission module is used to input the target image to be detected into the pre-trained improved neural network; The feature extraction module is used to extract features from the target image based on the backbone network to obtain a multi-scale feature map; The feature fusion module is used to fuse the multi-scale feature maps based on the bidirectional feature pyramid network to obtain an enhanced feature map; The feature calibration module is used to recalibrate the enhanced feature map through the attention enhancement module to obtain an optimized feature map; wherein, the attention enhancement module includes a channel attention submodule and a spatial attention submodule in sequence, the channel attention submodule is used to adjust the weight of the feature channels, and the spatial attention submodule is used to focus on the spatial location of the defect; The result output module is used to process the optimized feature map through the detection head and output the category and location information of tiny defects in the target image.

9. A detection device, characterized in that, The detection device includes a memory and a processor; The memory is used to store a computer program; The processor is configured to execute the computer program and, in executing the computer program, implement the method for detecting minute defects as described in any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a computer program that, when executed by a processor, causes the processor to implement the method for detecting minute defects as described in any one of claims 1 to 7.