A mask substrate defect image recognition method based on deep learning

By combining multimodal image data processing and a two-stage training method (self-supervised pre-training followed by few-sample fine-tuning) with a CNN-Transformer hybrid encoder and a multimodal information fusion perception network, the method addresses the limitations of existing mask substrate defect identification and the scarcity of defect samples, with the aim of high-precision defect identification that generalizes well.

CN122244050A · Pending · Publication Date: 2026-06-19 · HUNAN OMNISUN INFORMATION MATERIAL CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: HUNAN OMNISUN INFORMATION MATERIAL CO LTD
Filing Date: 2026-05-22
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing deep learning-based mask substrate defect identification methods have three limitations. They struggle to capture the global contextual relationship between a defect and its surrounding area. Defect samples are scarce and annotation is costly, yet these methods cannot make use of the large number of defect-free samples readily available in factories. They also underuse the available information, which makes complex scenarios difficult to handle.

Method used

Multimodal image data is preprocessed, and a two-stage training paradigm of self-supervised pre-training followed by few-sample fine-tuning is adopted. The method combines a CNN-Transformer hybrid encoder with a multimodal information fusion perception network, performs defect identification through a multi-task learning framework, and applies confidence judgment and manual verification to achieve adaptive calibration of the model.

Benefits of technology

It significantly improves the accuracy and generalization ability of defect identification, and can quickly adapt to changes in defect patterns brought about by new products and processes, ensuring that the model maintains high accuracy and stability during long-term operation.

✦ Generated by Eureka AI based on patent content.


Abstract

This application relates to the field of mask substrate image recognition technology, and more particularly to a deep learning-based method for mask substrate defect image recognition. The method includes: denoising and contrast enhancement of multimodal image data using differentiable operators to obtain standardized multimodal data; training a CNN-Transformer hybrid encoder and a multimodal information fusion perception network through a two-stage training paradigm of self-supervised pre-training and few-sample fine-tuning to obtain a defect recognition model; inputting the multimodal image data of the mask substrate to be detected and the corresponding design data into the trained defect recognition model, and obtaining the defect recognition result through a multi-task learning framework; judging the confidence level of the defect recognition result, and transferring results with confidence levels below a preset threshold to manual review, and updating the defect recognition model online to achieve adaptive calibration. This application helps improve the accuracy and generalization ability of defect recognition.

Description

Technical Field

[0001] This application relates to the field of mask substrate image recognition technology, and in particular to a method for mask substrate defect image recognition based on deep learning.

Background Technology

[0002] The photomask substrate is a core component in semiconductor photolithography. Surface defects (such as particle contamination, pattern distortion, edge roughness, and transparency defects) can affect the transfer accuracy of wafer patterns through photolithographic amplification, leading to wafer functional failure. Especially in advanced semiconductor processes, the defect detection accuracy of the photomask substrate directly determines the yield of semiconductor products. Currently, deep learning-based photomask substrate defect identification methods have gradually replaced traditional manual inspection and traditional machine vision inspection methods, becoming the mainstream technology in the industry.

[0003] However, existing deep learning-based mask substrate defect identification methods still suffer from numerous technical bottlenecks, making it difficult to meet the stringent requirements of advanced semiconductor manufacturing: First, the model architecture has limitations. While traditional CNN models can effectively extract local features, they struggle to capture the global contextual relationship between defects and their surrounding areas, resulting in poor identification of defects such as large-area contamination and irregular scratches. Second, defect samples are scarce and labeling costs are extremely high. Existing methods are highly dependent on labeled samples and cannot utilize the massive amounts of defect-free samples readily available in factories, leading to weak model generalization ability and an inability to quickly adapt to changes in defect patterns brought about by new products and processes. Third, information utilization is insufficient. Most methods use single-modal images for defect identification, making it difficult to handle complex scenarios such as transparent defects and low-contrast defects.

[0004] Therefore, there is an urgent need for a multi-dimensional mask substrate defect identification method to address the shortcomings of existing technologies and improve the accuracy and generalization ability of defect identification.

Summary of the Invention

[0005] Therefore, it is necessary to provide a deep learning-based method for mask substrate defect image recognition that can improve the accuracy and generalization ability of defect identification, in order to address the above-mentioned technical problems.

[0006] In a first aspect, this application provides a method for identifying mask substrate defects based on deep learning, the method comprising: Multimodal image data of a mask substrate is acquired, and the multimodal image data is processed by a differentiable operator for denoising and contrast enhancement to obtain standardized multimodal data. By employing a two-stage training paradigm of self-supervised pre-training and few-sample fine-tuning, a defect recognition model is obtained by training a CNN-Transformer hybrid encoder and a multimodal information fusion perception network based on the standardized multimodal data and synthetic defect data. The synthetic defect data is generated by defect synthesis technology based on GAN or diffusion model, which conforms to the physical laws of mask substrate defects. The multimodal image data of the mask substrate to be detected and the corresponding design data are input into the trained defect recognition model. Through a multi-task learning framework, the bounding box, category, and pixel-level mask of the defect are output simultaneously to obtain the defect recognition result. The confidence level of the defect identification results is determined, and results with a confidence level lower than a preset threshold are transferred to manual review. The reviewed data is used as incremental data to update the defect identification model online, so as to achieve adaptive calibration of the model.

[0007] In one embodiment, the step of training a CNN-Transformer hybrid encoder and a multimodal information fusion perception network, based on the standardized multimodal data and synthetic defect data and using the two-stage training paradigm of self-supervised pre-training and few-sample fine-tuning, to obtain a defect recognition model includes: Synthetic defect images that conform to the physical laws of mask substrate defects are generated by a defect synthesis model, and then mixed with real labeled defect samples in the standardized multimodal data to form a training set for few-sample fine-tuning. Using a large amount of unlabeled normal image data of mask substrates, the image feature extraction part of the CNN-Transformer hybrid encoder is pre-trained in a self-supervised manner to obtain a pre-trained model with a strong generalization foundation. With the pre-trained model as initialization, the CNN-Transformer hybrid encoder and the multimodal information fusion perception network are jointly fine-tuned under supervision on the training set. During fine-tuning, the local and global feature extraction of the hybrid encoder and the cross-modal attention fusion weights between image features and design data in the fusion perception network are optimized simultaneously. After joint fine-tuning, the converged CNN-Transformer hybrid encoder and the multimodal information fusion perception network are integrated into an end-to-end model that serves as the defect recognition model.

[0008] In one embodiment, the step of using a large amount of unlabeled normal image data of the mask substrate to perform self-supervised pre-training on the image feature extraction part of the CNN-Transformer hybrid encoder to obtain a pre-trained model with strong generalization capabilities includes: A large number of unlabeled normal image data of mask substrates obtained from different batches and under different process conditions in the mask substrate production line are collected to form a self-supervised pre-training dataset. A self-supervised task is constructed by contrastive learning or mask image modeling methods, and the self-supervised task is executed on the self-supervised pre-training dataset. The backbone network responsible for image feature extraction in the CNN-Transformer hybrid encoder is iteratively trained until the preset loss function converges, so that the backbone network learns the texture structure, edge contours and intrinsic consistency features between multimodal images of the normal image of the mask substrate, thereby obtaining a pre-trained model with strong generalization foundation.

[0009] In one embodiment, the construction of the self-supervised task through contrastive learning or masked image modeling methods includes: When constructing a self-supervised task through contrastive learning, a data augmentation strategy is applied to each input image in the self-supervised pre-training dataset to generate two different augmented views of the same image as a positive sample pair, with augmented views of other images serving as negative samples. The backbone network is then optimized using a contrastive loss function so that the feature vectors of positive sample pairs are close to each other and the feature vectors of negative sample pairs are far apart, thereby learning a general feature representation with augmentation invariance. When constructing a self-supervised task through masked image modeling, each input image in the self-supervised pre-training dataset is divided into multiple image blocks. Some image blocks are randomly masked according to a preset ratio. A self-supervised task is constructed to restore the pixel values of the masked image blocks using the context information of the unmasked image blocks. The backbone network is optimized by the reconstruction loss function to learn the local texture details and global structural prior knowledge of the mask substrate.
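
The contrastive objective described in [0009] can be illustrated with a short sketch. The application only says "a contrastive loss function"; the InfoNCE form, the `temperature` value, and the function names below are assumptions chosen for illustration, and plain Python lists stand in for the backbone's feature vectors.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def info_nce_loss(view_a, view_b, temperature=0.1):
    """Average InfoNCE loss over a batch (assumed loss form, not from the source).

    view_a[i] and view_b[i] are embeddings of two augmented views of image i
    (the positive pair); view_b[j] with j != i act as negatives for view_a[i].
    """
    a = [l2_normalize(v) for v in view_a]
    b = [l2_normalize(v) for v in view_b]
    total = 0.0
    for i, anchor in enumerate(a):
        # Similarity of this anchor to every candidate in the other view.
        logits = [sum(p * q for p, q in zip(anchor, other)) / temperature
                  for other in b]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the positive pair as the correct "class".
        total += log_denom - logits[i]
    return total / len(a)
```

The loss is small when each view is most similar to its own partner and large when it is more similar to views of other images, which is the "positives close, negatives far apart" behaviour the paragraph describes.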

[0010] In one embodiment, the acquisition of multimodal image data of the mask substrate, and the denoising and contrast enhancement processing of the multimodal image data using a differentiable operator to obtain standardized multimodal data, includes: Multimodal image data of the mask substrate is acquired through a multi-sensor synchronous acquisition device, and design data corresponding to the mask substrate is obtained synchronously. The multimodal image data is input into a differentiable denoising operator for denoising, wherein the differentiable denoising operator adopts an improved variable step-size median filter structure. The denoised multimodal image data is input into a differentiable contrast enhancement operator for contrast enhancement processing to obtain standardized multimodal data, wherein the differentiable contrast enhancement operator adopts an adaptive contrast constraint structure.

[0011] In one embodiment, the step of inputting the multimodal image data into a differentiable denoising operator for denoising includes: The improved variable step-size median filter structure is implemented as a differentiable denoising operator that supports gradient backpropagation, so that the differentiable denoising operator can be used as part of the neural network computation graph for forward computation and backpropagation. The local noise intensity of the multimodal image data is estimated to obtain noise distribution information in each region of the image, and the filtering step size parameter of the differentiable denoising operator is adjusted according to the noise distribution information. Based on the filtering step size parameter, the pixel values of the image are sorted and weighted within the filtering window in a differentiable manner, and a feature map after denoising is output.
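
The application does not spell out how the median filter is made differentiable. One possible reading of "sorted and weighted within the filtering window in a differentiable manner" is a softmax-weighted average that approaches the median as a temperature shrinks. The sketch below is a one-dimensional illustration under that assumption; `soft_median`, `temperature`, and `radius` are hypothetical names, and the noise-dependent adjustment of the step size is left out.

```python
import math

def soft_median(window, temperature):
    """Smooth surrogate for the median of a window of pixel values.

    Each sample is scored by its total absolute distance to all samples in the
    window; the true median minimizes that total. A softmax over the negative
    totals yields smooth weights, so the output is a weighted average that is
    differentiable with respect to the inputs.
    """
    scores = [-sum(abs(x - y) for y in window) / temperature for x in window]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return sum(e / z * x for e, x in zip(exps, window))

def soft_median_filter_1d(signal, radius, temperature=0.05):
    """Apply the soft median over a sliding window of half-width `radius`."""
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out.append(soft_median(signal[lo:hi], temperature))
    return out
```

With a small temperature the filter removes an isolated spike much like a hard median would; a larger temperature moves it toward a plain mean, which is one way a learnable parameter could trade noise suppression against edge preservation.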

[0012] In one embodiment, the step of inputting the denoised multimodal image data into a differentiable contrast enhancement operator for contrast enhancement processing to obtain standardized multimodal data includes: The adaptive contrast constraint structure is implemented as a differentiable contrast enhancement operator that supports gradient backpropagation, so that the differentiable contrast enhancement operator can be used as part of the neural network computation graph for forward computation and backpropagation. Local grayscale distribution analysis is performed on the denoised multimodal image data to identify potential defect regions and background regions, and the grayscale contrast difference between the potential defect regions and the background regions is calculated. Based on the grayscale contrast difference, the contrast gain parameter of the differentiable contrast enhancement operator is adjusted, and based on the contrast gain parameter, a differentiable contrast stretching and limiting operation is performed pixel by pixel on the denoised multimodal image data to output a feature map that has undergone contrast enhancement processing.
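
As a rough illustration of "differentiable contrast stretching and limiting", the sketch below uses a sigmoid curve: it steepens grey-level differences around a center value and saturates smoothly instead of clipping. The sigmoid form, the gain rule, and the names `contrast_gain`, `target_gap`, and `max_gain` are assumptions for illustration; the application does not specify the adaptive contrast constraint structure in this detail. Pixel values are assumed to lie in [0, 1].

```python
import math

def contrast_gain(defect_mean, background_mean,
                  target_gap=0.5, max_gain=8.0, eps=1e-3):
    """Choose a larger gain when the defect/background grey-level gap is small.

    The gain is limited to [1, max_gain] so that low-contrast regions are
    stretched but noise is not amplified without bound.
    """
    gap = abs(defect_mean - background_mean)
    return max(1.0, min(max_gain, target_gap / (gap + eps)))

def stretch_and_limit(pixel, center, gain):
    """Sigmoid contrast curve: slope `gain` at `center`, output kept in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-4.0 * gain * (pixel - center)))

def enhance(image, center, gain):
    """Apply the curve pixel by pixel to a 2D list of grey levels."""
    return [[stretch_and_limit(p, center, gain) for p in row] for row in image]
```

Because the curve is smooth everywhere, gradients can flow through it to whatever produces `gain` and `center`, which is the property the paragraph relies on.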

[0013] In one embodiment, the step of inputting the multimodal image data of the mask substrate to be detected and the corresponding design data into the trained defect recognition model, and synchronously outputting the bounding box, category, and pixel-level mask of the defect through a multi-task learning framework to obtain the defect recognition result, includes: The multimodal image data of the mask substrate to be detected is processed by the differentiable operators for denoising and contrast enhancement to obtain the standardized multimodal data of the mask substrate to be detected. The standardized multimodal data of the mask substrate to be detected is input into the CNN-Transformer hybrid encoder. Local texture and edge features are extracted through the CNN branch, global context dependencies are modeled through the Transformer encoder, and multi-scale image features are output through multi-scale feature fusion. The design data is transformed to obtain design baseline features, and the multimodal information fusion perception network in the defect recognition model is used to adaptively align and fuse the multi-scale image features with the design baseline features to obtain joint features. The joint features are input into a multi-task learning framework, which simultaneously outputs the bounding box, category, and pixel-level mask of the defect to obtain the defect recognition result.

[0014] In one embodiment, the step of inputting the joint features into a multi-task learning framework and simultaneously outputting the defect's bounding box, category, and pixel-level mask to obtain the defect recognition result includes: The joint features are input into the shared detection head of the multi-task learning framework, and joint inference is performed through the localization branch, classification branch, and segmentation branch. The localization branch outputs the defect bounding box, the classification branch outputs the defect category, and the segmentation branch outputs the pixel-level defect mask. The localization, classification, and segmentation results are post-processed and fused to remove duplicate detections and false defects, yielding the final defect identification result.
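
The post-processing in [0014] is described only as removing duplicate detections and false defects. A common way to do this is a score filter followed by per-category non-maximum suppression (NMS); the sketch below assumes that approach, and the thresholds and the detection dictionary layout are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def postprocess(detections, score_threshold=0.3, iou_threshold=0.5):
    """Drop low-score detections, then suppress overlapping same-category boxes.

    Each detection is a dict with 'box', 'score', and 'category' keys.
    """
    candidates = sorted(
        (d for d in detections if d["score"] >= score_threshold),
        key=lambda d: d["score"],
        reverse=True,
    )
    kept = []
    for det in candidates:
        # Keep a box unless it heavily overlaps a higher-scoring box
        # of the same category that was already kept.
        duplicate = any(
            det["category"] == k["category"]
            and iou(det["box"], k["box"]) >= iou_threshold
            for k in kept
        )
        if not duplicate:
            kept.append(det)
    return kept
```

A fuller pipeline could also check that each kept box agrees with its segmentation mask, which would be one way to realise the "fusion" of the three branch outputs.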

[0015] In summary, this application includes the following beneficial technical effects: Denoising and contrast enhancement are performed on multimodal image data using differentiable operators, preserving defect edges and texture details while suppressing noise, significantly improving the clarity and recognizability of defect features and enhancing defect recognition accuracy from the source. Through self-supervised pre-training and small-sample fine-tuning, a large number of defect-free normal samples are fully utilized, reducing defect labeling dependence and improving generalization. The use of a CNN-Transformer hybrid encoder and a multimodal information fusion perception network balances local details and global context, accurately distinguishing real defects. Joint inference via a multi-task learning framework allows for mutual constraints and enhancements between localization, classification, and segmentation, improving defect localization accuracy and classification accuracy. Combined with design data, it can accurately identify structural defects such as missing graphics and line width deviations. Confidence judgment and manual review to correct erroneous samples continuously improve the model's recognition accuracy and stability. The reviewed data is used as incremental data for online updates to the defect recognition model, enabling it to adapt to new processes, new batches, and new defect patterns, maintaining strong generalization ability over the long term.

Description of the Drawings

[0016] Figure 1 is a flowchart illustrating a deep learning-based mask substrate defect image recognition method in one embodiment. Figure 2 is a flowchart illustrating a deep learning-based mask substrate defect image recognition method in another embodiment.

Detailed Implementation

[0017] This invention provides a method for identifying mask substrate defects based on deep learning.

[0018] The embodiments of the present invention will now be described in more detail with reference to the accompanying drawings. While some embodiments of the present invention are shown in the drawings, it should be understood that the present invention can be implemented in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided to provide a more thorough and complete understanding of the present disclosure. It should be understood that the accompanying drawings and embodiments are for illustrative purposes only and are not intended to limit the scope of protection of the present invention.

[0019] In the description of the embodiments disclosed in this invention, the term "comprising" and similar terms should be understood as open-ended inclusion, i.e., "including but not limited to". The term "based on" should be understood as "at least partially based on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The terms "first", "second", etc., may refer to different or the same objects. Other explicit and implicit definitions may also be included below.

[0020] For ease of understanding, the specific process of the embodiments of the present invention is described below. Referring to Figure 1, one embodiment of the deep learning-based mask substrate defect image recognition method of this invention includes: S100: Acquire multimodal image data of the mask substrate, and perform denoising and contrast enhancement on the multimodal image data using differentiable operators to obtain standardized multimodal data.

[0021] Specifically, firstly, multimodal image data of the mask substrate is acquired, including optical images, scanning electron microscope images, and images acquired under multispectral or multi-light source conditions. Visual information of the same mask substrate under different imaging modalities is obtained through a multi-sensor synchronous acquisition device to compensate for the insufficient ability of a single imaging modality to identify specific defect types. Then, the multimodal image data is processed for denoising and contrast enhancement using differentiable operators. Specifically, differentiable denoising operators and differentiable contrast enhancement operators are embedded in the front end of the neural network as components of the network computation graph, and are trained end-to-end along with the subsequent detection model. During training, the parameters of the differentiable denoising and contrast enhancement operators are not fixed preset values, but are adaptively optimized using a backpropagation algorithm based on the loss function of the final defect identification task. This ensures that the denoising intensity and contrast enhancement level are matched to the identification task objective, effectively preserving the edge contours and texture details of defects while suppressing imaging noise, thereby improving the clarity and recognizability of defect features from the data source.

[0022] S200: Using a two-stage training paradigm of self-supervised pre-training and few-sample fine-tuning, train a CNN-Transformer hybrid encoder and a multimodal information fusion perception network based on the standardized multimodal data and synthetic defect data to obtain a defect recognition model.

[0023] Specifically, after data preprocessing, a two-stage training paradigm of "self-supervised pre-training + few-sample fine-tuning" is adopted. Based on standardized multimodal data and synthetic defect data, a CNN-Transformer hybrid encoder and a multimodal information fusion perception network are trained to obtain a defect recognition model. In the self-supervised pre-training stage, a large amount of unlabeled normal image data of the mask substrate is used. Through self-supervised tasks such as contrastive learning or mask image modeling, the image feature extraction backbone network in the hybrid encoder learns the general visual representation of the mask substrate image, obtaining a pre-trained model with strong generalization foundation. In the few-sample fine-tuning stage, the pre-trained model is used as the initial parameters, and joint supervised fine-tuning is performed using a small number of labeled real defect samples and synthetic defect data generated based on generative adversarial networks or diffusion models. The synthetic defect data conforms to the physical laws of mask substrate defects and covers various defects with different locations, sizes, and transparency, effectively expanding the diversity and scale of training samples. Through the two-stage training paradigm, the model can quickly adapt to the defect recognition task using only a small number of labeled samples, significantly reducing the dependence on large-scale defect labeled data.

[0024] S300: Input the multimodal image data of the mask substrate to be inspected and the corresponding design data into the trained defect recognition model, and, through a multi-task learning framework, synchronously output the bounding box, category, and pixel-level mask of the defect to obtain the defect recognition result.

[0025] Specifically, during the model inference stage, the multimodal image data of the mask substrate to be inspected and the corresponding design data are input into the trained defect recognition model. Through a multi-task learning framework, the model simultaneously outputs the defect bounding box, defect category, and pixel-level mask to obtain the defect recognition result. Using the shared detection head of the multi-task learning framework, the three tasks of defect localization, classification, and segmentation are executed in parallel, completing a comprehensive defect analysis in a single forward propagation. Simultaneously, the multimodal information fusion perception network performs cross-modal fusion comparison of image features and design data features, accurately distinguishing real defects from the design's structure and background noise. The final output defect recognition result includes the defect's spatial location information, category label, and pixel-level contour range, providing complete diagnostic data support for subsequent mask substrate process adjustments.

[0026] S400: Assess the confidence level of the defect identification results and transfer results with a confidence level below a preset threshold to manual review. The reviewed data is then used as incremental data to update the defect identification model online, thereby achieving adaptive calibration of the model.

[0027] Specifically, to further improve the model's long-term stability and adaptability, the confidence level of defect identification results is assessed, and results with confidence levels below a preset threshold are automatically transferred to a manual review process. High-quality labeled data after manual review is used as incremental data for online incremental updates and optimization of the defect identification model. This allows the model to continuously learn the feature distributions of new processes, new batches, and new defect patterns, preventing performance degradation due to changes in operating conditions or process drift. Simultaneously, the model's identification performance indicators are monitored in real time under different production batches and process conditions. When a significant performance decline occurs, a model calibration mechanism is automatically triggered to achieve adaptive calibration, ensuring the defect identification system maintains optimal detection performance throughout long-term operation.
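
The routing logic in S400 can be sketched as follows. The threshold value, the result dictionary layout, and the `IncrementalBuffer` class are hypothetical; the application does not say how reviewed samples are batched or how the online update itself is performed, so the sketch stops at the point where an update would be triggered.

```python
def route_by_confidence(results, threshold=0.8):
    """Split results into auto-accepted ones and ones sent to manual review."""
    accepted, review_queue = [], []
    for result in results:
        if result["confidence"] >= threshold:
            accepted.append(result)
        else:
            review_queue.append(result)
    return accepted, review_queue

class IncrementalBuffer:
    """Collects manually reviewed samples until there are enough for an update."""

    def __init__(self, batch_size=2):
        self.batch_size = batch_size
        self.samples = []

    def add(self, sample, reviewed_label):
        """Store a reviewed sample; return True when an update should run."""
        self.samples.append({**sample, "label": reviewed_label})
        return len(self.samples) >= self.batch_size

    def drain(self):
        """Hand the accumulated batch to the (not shown) update step."""
        batch, self.samples = self.samples, []
        return batch
```

In use, each low-confidence result would be passed to `add` together with the reviewer's label, and a `True` return would be the cue to fine-tune the model on `drain()`.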

[0028] In one embodiment, as shown in Figure 2, S200 includes: S210: Generate synthetic defect images that conform to the physical laws of mask substrate defects through a defect synthesis model, and mix them with real labeled defect samples in the standardized multimodal data to form a training set for small-sample fine-tuning. S220: Use a large amount of unlabeled normal image data of mask substrates to perform self-supervised pre-training on the image feature extraction part of the CNN-Transformer hybrid encoder, thereby obtaining a pre-trained model with a strong generalization foundation. S230: Using the pre-trained model as initialization, perform joint supervised fine-tuning of the CNN-Transformer hybrid encoder and the multimodal information fusion perception network on the training set, simultaneously optimizing the local and global feature extraction of the hybrid encoder and the cross-modal attention fusion weights between image features and design data in the fusion perception network. S240: After joint fine-tuning, integrate the converged CNN-Transformer hybrid encoder and the multimodal information fusion perception network into an end-to-end model that serves as the defect recognition model.

[0029] Specifically, firstly, a defect synthesis model built using generative adversarial networks or diffusion models automatically generates synthetic defect images that conform to the physical laws of mask substrate defects by learning the morphological characteristics and distribution patterns of real defects. These synthetic defect images are then mixed with real, labeled defect samples from standardized multimodal data to form a complete training set for subsequent small-sample fine-tuning. Secondly, before proceeding to supervised fine-tuning, a self-supervised pre-training phase is performed to fully utilize the abundant unlabeled normal mask substrate image data readily available in semiconductor manufacturing lines. Specifically, a large number of normal multimodal images generated under different batches and process conditions on the mask substrate production line are collected; these images can be used directly without any manual defect labeling. For the backbone network part of the CNN-Transformer hybrid encoder responsible for image feature extraction, self-supervised learning methods such as contrastive learning or mask image modeling are used to construct a self-supervised task, which is then iteratively trained on the unlabeled dataset. Taking contrastive learning as an example, random data augmentation is applied to the same input image, generating two different augmented views as positive sample pairs. Simultaneously, augmented views of other images in the dataset are used as negative sample pairs. The backbone network parameters are optimized using a contrastive loss function, ensuring that positive sample pairs are close to each other in the feature space and negative sample pairs are far apart. This forces the backbone network to learn a universal visual representation invariant to image augmentation operations. 
After sufficient iterative training until the preset loss function converges, the backbone network has fully extracted the inherent texture structure, edge contours, and intrinsic consistency features between different imaging modalities in the normal image of the mask substrate. The obtained pre-trained model weights provide a high-quality feature initialization foundation for subsequent fine-tuning for defect recognition tasks, effectively reducing the model's dependence on large-scale defect-labeled samples. Then, based on the pre-trained model, a small-sample supervised fine-tuning stage is entered. The model weights saved in the self-supervised pre-training stage are used as initialization parameters for the CNN-Transformer hybrid encoder and the multimodal information fusion perception network, and supervised joint fine-tuning is performed on the training set composed of a mixture of synthetic defect images and real defect samples. During fine-tuning, all learnable parameters of the model participate in gradient updates, specifically including three key components. First, the parameters of the lightweight CNN branch in the CNN-Transformer hybrid encoder, which is responsible for extracting local texture and edge detail features of the image. Optimizing this branch helps enhance the model's sensitivity to tiny defects such as particle-like or pinhole-like defects. Second, the parameters of the Transformer encoder, which models long-distance dependencies between feature blocks through a self-attention mechanism. Optimizing this component helps the model accurately capture the global contrast patterns between defects such as large-area contamination and irregular scratches and the surrounding normal areas. Third, the parameters of the cross-modal attention fusion module in the multimodal information fusion perception network. This module achieves adaptive alignment and fusion of image features and design data features through a cross-attention mechanism.
Optimizing this module helps the model accurately distinguish between real defects and background noise introduced by the design structure itself. Through the simultaneous joint optimization of these three parameters, the model can quickly converge to a parameter space with high discriminative ability for various defects on the mask substrate, while retaining the general visual priors learned during the pre-training stage. Finally, after the loss function of the joint supervised fine-tuning stage converges and the model performance meets the preset indicators, the fine-tuned CNN-Transformer hybrid encoder and the multimodal information fusion perception network are integrated into a unified end-to-end network architecture. This end-to-end model can directly receive raw multimodal image data and corresponding design data as input. After denoising and enhancement processing by the differentiable preprocessing module, it sequentially passes through the feature extraction of the hybrid encoder and the cross-modal information alignment of the multimodal fusion network, finally outputting the localization, classification, and segmentation results of defects. This end-to-end model is used as the output of the trained defect recognition model for subsequent online inference and quality judgment of the mask substrate to be detected. Through the above two-stage training paradigm of "self-supervised pre-training + small sample fine-tuning", this invention can train a defect recognition model with high accuracy and high generalization ability with only a small number of real defect labeled samples, effectively solving the contradiction between the scarcity of labeled data and the generalization requirements of the model in the field of mask substrate defect detection.
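
The cross-attention fusion of image features and design data features mentioned in [0029] can be sketched with scaled dot-product attention, where image tokens act as queries and design tokens as keys and values. This is an illustration only: the learned query/key/value projections are replaced by the identity, the residual connection is an assumption, and plain Python lists stand in for feature tensors.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def cross_attention(image_tokens, design_tokens):
    """Fuse image tokens with design tokens via scaled dot-product attention.

    Returns the fused tokens and the attention weights, one weight row per
    image token. Projections are omitted (identity) to keep the sketch short.
    """
    d = len(image_tokens[0])
    fused, all_weights = [], []
    for query in image_tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in design_tokens]
        weights = softmax(scores)
        # Weighted sum of design tokens: the design context for this image token.
        context = [sum(w * value[j] for w, value in zip(weights, design_tokens))
                   for j in range(d)]
        # Residual connection (assumed): keep the image feature, add the context.
        fused.append([q + c for q, c in zip(query, context)])
        all_weights.append(weights)
    return fused, all_weights
```

Each image token ends up paired mostly with the design token it resembles, which gives later layers a reference against which a deviation from the intended pattern can stand out.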

[0030] In one embodiment, a pre-trained model with strong generalization capabilities is obtained by using a large amount of unlabeled normal image data of mask substrates to perform self-supervised pre-training on the image feature extraction part of the CNN-Transformer hybrid encoder, including: A large amount of unlabeled normal image data of mask substrates, obtained from different batches and under different process conditions in the mask substrate production line, is collected to form a self-supervised pre-training dataset. A self-supervised task is constructed by contrastive learning or masked image modeling, and the self-supervised task is executed on the self-supervised pre-training dataset. The backbone network responsible for image feature extraction in the CNN-Transformer hybrid encoder is iteratively trained until the preset loss function converges, so that the backbone network learns the texture structure, edge contours, and intrinsic consistency features between multimodal images of normal mask substrates, thereby obtaining a pre-trained model with a strong generalization foundation.

[0031] Specifically, firstly, a self-supervised pre-training dataset is constructed. A large amount of multimodal image data of normal products generated under different production batches and process conditions is collected from the mask substrate production line. Since these images are all defect-free normal product images, they can be used directly without any manual defect annotation, resulting in extremely low data acquisition costs. This fully utilizes the massive normal sample resources easily accumulated in semiconductor manufacturing lines, effectively circumventing the industry bottleneck of scarce defect samples and high annotation costs in the field of mask substrate defect detection. Then, a self-supervised learning task is constructed and executed on the constructed self-supervised pre-training dataset. Specifically, contrastive learning or mask image modeling methods can be used to construct the self-supervised task. After the self-supervised task is constructed, it is applied to the self-supervised pre-training dataset to iteratively train the backbone network responsible for image feature extraction in the CNN-Transformer hybrid encoder. During training, the convergence of the preset loss function is continuously monitored. When the loss function value tends to stabilize and no longer decreases significantly, the training is considered to have reached convergence. Through the aforementioned self-supervised training process, the backbone network autonomously mines and learns the inherent texture structure features, edge contour features, and intrinsic consistency features between different imaging modalities in the normal image of the mask substrate without any manual annotation. At this point, the feature representation learned by the backbone network has a stable perception capability and a strong generalization foundation for the normal image structure of the mask substrate. 
The obtained pre-trained model weights provide a high-quality feature initialization foundation for subsequent fine-tuning of small samples for defect recognition tasks, enabling the model to quickly converge to a parameter space with high discriminative ability for various defects using only a small number of labeled defect samples.
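
As an illustration of the convergence criterion described in [0031], the following minimal Python sketch treats training as converged when the mean loss over the most recent window is no longer meaningfully lower than the mean over the preceding window. The function name `has_converged`, the window length, and the relative tolerance are illustrative assumptions; this application does not specify them.

```python
def has_converged(losses, window=5, rel_tol=1e-3):
    """Return True when the loss has plateaued.

    Compares the mean of the last `window` losses with the mean of the
    `window` losses before them (both values are hypothetical defaults).
    """
    if len(losses) < 2 * window:
        return False  # not enough history to judge
    previous = sum(losses[-2 * window:-window]) / window
    current = sum(losses[-window:]) / window
    # Converged when the improvement is below a small fraction of the loss.
    return (previous - current) <= rel_tol * abs(previous)


# A steadily decreasing loss curve versus a flat one.
decreasing = [1.0 / (i + 1) for i in range(10)]
plateau = [0.5] * 10
still_training = has_converged(decreasing)
finished = has_converged(plateau)
```

In practice such a check would be evaluated once per epoch on the pre-training loss, and the tolerance would be tuned to the loss scale.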

[0032] In one embodiment, constructing a self-supervised task using contrastive learning or masked image modeling methods includes: When constructing a self-supervised task through contrastive learning, a data augmentation strategy is applied to each input image in the self-supervised pre-training dataset to generate two different augmented views of the same image as positive sample pairs and augmented views of other images as negative sample pairs. The backbone network is optimized using a contrastive loss function to make the feature vectors of positive sample pairs closer to each other and the feature vectors of negative sample pairs farther apart, thereby learning a general feature representation with augmentation invariance. When constructing a self-supervised task through masked image modeling, each input image in the self-supervised pre-training dataset is divided into multiple image blocks. Some image blocks are randomly masked according to a preset ratio, and a self-supervised task is constructed to restore the pixel values of the masked image blocks using the context information of the unmasked image blocks. The backbone network is optimized using a reconstruction loss function to learn the local texture details and global structural prior knowledge of the mask substrate.
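
The contrastive objective above can be sketched as follows. This application does not name a specific contrastive loss, so the InfoNCE-style form used here, the temperature value, and the function name `info_nce_loss` are assumptions made for illustration only. Row i of `z1` and row i of `z2` stand for the feature vectors of the two augmented views of image i; all other rows act as negatives.

```python
import numpy as np


def info_nce_loss(z1, z2, temperature=0.1):
    """InfoNCE-style loss: (z1[i], z2[i]) is the positive pair,
    (z1[i], z2[j]) with j != i are negative pairs."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature              # cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))     # positives lie on the diagonal


# Toy features: matching views give a low loss, mismatched views a high one.
views_a = np.eye(4)
aligned_loss = info_nce_loss(views_a, views_a)
shuffled_loss = info_nce_loss(views_a, np.roll(views_a, 1, axis=0))
```

Minimizing this quantity pulls the two views of the same image together and pushes views of different images apart, which matches the behaviour described for the contrastive loss function.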

[0033] Specifically, when constructing a self-supervised task using contrastive learning, data augmentation strategies, including random cropping, color jittering, and Gaussian blur, are applied to each input image in the self-supervised pre-training dataset to generate two different augmented views of the same image. These two augmented views constitute a positive sample pair. Simultaneously, augmented views of other images in the dataset are used as negative sample pairs. Based on the constructed positive and negative sample pairs, the backbone network is optimized using a contrastive loss function. The mechanism of this contrastive loss function is to calculate the similarity between the feature vectors of the positive sample pairs and the feature vectors of the negative sample pairs in the feature space. The backbone network parameters are adjusted through gradient backpropagation so that the feature vectors of the positive sample pairs are closer to each other in the feature space, while the feature vectors of the negative sample pairs are further apart. Through iterative contrastive learning with a large number of positive and negative sample pairs, the backbone network gradually learns a universal feature representation that is invariant to image augmentation operations. That is, regardless of the data augmentation transformation of the input image, the network can extract stable features that reflect the essential structure of the image, thereby capturing the common texture and structural information of the normal image of the mask substrate across batches and process conditions. When constructing a self-supervised task using masked image modeling, each input image in the self-supervised pre-training dataset is divided into multiple equally sized image patch sequences according to a preset size. A portion of these image patches is randomly selected from the sequence for masking according to a preset masking ratio. 
This involves setting the pixel values of the selected image patches to zero or replacing them with random noise, resulting in a corrupted input image that retains some visible image patches. The self-supervised task then feeds the corrupted input image into the backbone network. The network is trained to predict and reconstruct the original pixel values corresponding to the masked image patches based on the contextual information provided by the unmasked image patches. By calculating the reconstruction loss function between the reconstructed pixel values and the true original pixel values, the backbone network parameters are iteratively optimized using the backpropagation algorithm, gradually enabling the network to infer missing global information from locally visible information. After sufficient mask reconstruction training, the backbone network can deeply understand the pixel-level local texture details and global semantic structure prior knowledge in the normal image of the mask substrate, providing strong feature support for the refined identification of minute defects in subsequent defect detection tasks.
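
The masking and reconstruction-loss steps of masked image modeling can be sketched as below. The patch size, the masking ratio of 0.75, the zero-fill choice, the mean-squared-error loss, and the helper names `patchify`, `random_mask`, and `masked_mse` are all illustrative assumptions; the backbone network that would produce the reconstruction is omitted.

```python
import numpy as np


def patchify(img, patch):
    """Split an (H, W) image into non-overlapping patches of shape (N, patch*patch)."""
    h, w = img.shape
    blocks = img.reshape(h // patch, patch, w // patch, patch)
    return blocks.transpose(0, 2, 1, 3).reshape(-1, patch * patch)


def random_mask(num_patches, ratio, rng):
    """Boolean vector marking which patches are masked, at the preset ratio."""
    n_mask = int(round(ratio * num_patches))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.permutation(num_patches)[:n_mask]] = True
    return mask


def masked_mse(pred, target, mask):
    """Reconstruction loss computed only over the masked patches."""
    return float(((pred[mask] - target[mask]) ** 2).mean())


rng = np.random.default_rng(0)
image = rng.random((16, 16))                      # stand-in for a normal substrate image
patches = patchify(image, 4)                      # 16 patches of 16 pixels each
mask = random_mask(len(patches), 0.75, rng)
corrupted = np.where(mask[:, None], 0.0, patches) # masked patches set to zero
baseline_loss = masked_mse(corrupted, patches, mask)
```

Here `baseline_loss` is the loss of a model that outputs zeros for every masked patch; training would drive the loss of the real reconstruction below this value.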

[0034] In one embodiment, multimodal image data of a mask substrate is acquired, and the multimodal image data is processed by a differentiable operator for denoising and contrast enhancement to obtain standardized multimodal data, including: Multimodal image data of the mask substrate is acquired using a multi-sensor synchronous acquisition device, and design data corresponding to the mask substrate is obtained simultaneously. The multimodal image data is input into a differentiable denoising operator for denoising, wherein the differentiable denoising operator adopts an improved variable step-size median filtering structure. The denoised multimodal image data is input into a differentiable contrast enhancement operator for contrast enhancement processing to obtain standardized multimodal data, wherein the differentiable contrast enhancement operator adopts an adaptive contrast constraint structure.

[0035] Specifically, a multi-sensor synchronous acquisition device is used to acquire multimodal image data of the mask substrate, while simultaneously acquiring the corresponding design data, including CAD design drawings or the GDSII design rule database, to provide an ideal design benchmark for cross-modal comparison in the subsequent defect identification process. Then, the acquired multimodal image data is input into a differentiable denoising operator for denoising processing. This differentiable denoising operator employs an improved variable-step median filtering structure, which transforms traditional median filtering into a differentiable operation that supports gradient backpropagation. This allows it to be embedded as a pre-component of the neural network computation graph into the front end of the detection model, establishing trainable connections with subsequent network layers. During denoising, the operator adaptively adjusts the filtering step size parameter according to the local noise distribution of the input image. It automatically increases the step size in noise-dense areas to enhance smoothing, and automatically decreases the step size in defect edge areas to protect detailed features from blurring, thereby effectively suppressing imaging noise while preserving the defect boundary contours and texture information to the maximum extent. Next, the denoised multimodal image data is input into a differentiable contrast enhancement operator for contrast enhancement. This operator employs an adaptive contrast constraint structure and is also embedded in the front end of a neural network as part of the computational graph. By analyzing the grayscale distribution of local image regions, the operator automatically identifies the grayscale difference between potential defect areas and the background area, and adaptively adjusts the contrast gain parameter based on this difference. 
A larger gain is applied to low-contrast areas where the grayscale difference between the defect and the background is small to enhance the defect's discernibility, while the gain amplitude is limited in high-contrast areas to avoid information distortion caused by over-enhancement. After these two steps of denoising and contrast enhancement, standardized multimodal data is finally output, providing a high-quality input data foundation for subsequent feature extraction and defect identification.

[0036] In one embodiment, inputting multimodal image data into a differentiable denoising operator for denoising includes: The improved variable step-size median filter structure is implemented as a differentiable denoising operator that supports gradient backpropagation, enabling the differentiable denoising operator to be used as part of the neural network computation graph for forward computation and backpropagation. Local noise intensity is estimated for multimodal image data to obtain noise distribution information for each region of the image, and the filtering step-size parameter of the differentiable denoising operator is adjusted according to the noise distribution information. Based on the filtering step-size parameter, the pixel values of the image are sorted and weighted within the filtering window in a differentiable manner, and the denoised feature map is output.

[0037] Specifically, firstly, the improved variable step-size median filter structure is made differentiable to support gradient backpropagation, thus enabling it to function as part of the neural network computation graph. The differentiable denoising operator, embedded in the front end of the neural network, can directly receive raw multimodal image data as input and participate in forward computation and backward propagation. Its internal parameters are no longer fixed preset values but can be iteratively optimized using gradient signals during model training. In the denoising process, local noise intensity is first estimated from the input multimodal image data. By analyzing the statistical characteristics of pixel values in different regions of the image, noise distribution information at different spatial locations is obtained. This noise distribution information reflects which areas in the image are more severely affected by noise and which areas are relatively clear. Then, based on the obtained noise distribution information, the filtering step size parameter of the differentiable denoising operator is adaptively adjusted. Specifically, in areas with high noise intensity, a larger filtering step size is automatically used to expand the coverage of the filtering window and enhance the smoothing and denoising effect; in areas with defect edges or rich textures, a smaller filtering step size is automatically used to narrow the effective range of the filtering window, avoiding blurring of defect edges or loss of detail due to excessive smoothing. Through this content-based adaptive step size adjustment mechanism, a dynamic balance between denoising intensity and detail preservation is achieved. Finally, using the determined adaptive step size parameters, a differentiable sorting and weighted aggregation operation is performed on the pixel values of the input image within the corresponding filtering window. 
This sorting and weighted aggregation operation maintains differentiability with respect to the input variables, making the output of the entire filtering process differentiable with respect to the input. This ensures that the gradient of the loss function can be smoothly backpropagated from subsequent network layers to the input layer during end-to-end training, thus completing the joint optimization of the denoising operator parameters.
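
One possible way to realize differentiable sorting and weighted aggregation is sketched below. It is not the operator of this application, whose exact formulation is not disclosed. The sketch assumes pixel values in [0, 1], computes a soft rank for each pixel in a window with a logistic function, and weights pixels by how close their soft rank is to the middle rank. The adaptive step size is approximated by blending a small and a large window with a smooth gate driven by the local standard deviation. A real operator would also need to separate noise from defect edges, which this simple statistic does not do. All function names and constants are hypothetical, and NumPy is used only to show the forward computation; every step is built from smooth operations that an automatic differentiation framework could backpropagate through.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def _sigmoid(x):
    # Numerically stable logistic function.
    return 0.5 * (1.0 + np.tanh(0.5 * x))


def _windows(img, radius):
    """Return every (2r+1)x(2r+1) neighbourhood, flattened: shape (H, W, n)."""
    k = 2 * radius + 1
    padded = np.pad(img, radius, mode="reflect")
    win = sliding_window_view(padded, (k, k))
    return win.reshape(img.shape[0], img.shape[1], k * k)


def soft_median(img, radius, tau=0.05, sharpness=2.0):
    """Smooth approximation of a median filter."""
    v = _windows(img, radius)
    n = v.shape[-1]
    # Soft rank of each pixel: how many window pixels lie below it (smoothly).
    diff = (v[..., :, None] - v[..., None, :]) / tau
    soft_rank = _sigmoid(diff).sum(axis=-1)
    # Weight pixels by closeness of their soft rank to the middle rank n/2.
    logits = -sharpness * (soft_rank - n / 2.0) ** 2
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return (weights * v).sum(axis=-1)


def adaptive_soft_median(img, noise_ref=0.05, softness=0.01):
    """Blend a 3x3 and a 5x5 soft median according to local noise strength."""
    local_std = _windows(img, 1).std(axis=-1)        # crude local noise estimate
    gate = _sigmoid((local_std - noise_ref) / softness)
    return (1.0 - gate) * soft_median(img, 1) + gate * soft_median(img, 2)


# A flat image with a single impulse-noise pixel.
noisy = np.full((8, 8), 0.5)
noisy[4, 4] = 1.0
denoised = adaptive_soft_median(noisy)
```

In this toy case the impulse at position (4, 4) is pulled back to the background level, while flat regions pass through unchanged.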

[0038] In one embodiment, the denoised multimodal image data is input into a differentiable contrast enhancement operator for contrast enhancement processing to obtain standardized multimodal data, including: The adaptive contrast constraint structure is implemented as a differentiable contrast enhancement operator that supports gradient backpropagation, enabling the differentiable contrast enhancement operator to be used as part of the neural network computation graph for forward computation and backpropagation. Local gray-level distribution analysis is performed on the denoised multimodal image data to identify potential defect regions and background regions, and the gray-level contrast difference between the potential defect regions and background regions is calculated. Based on the gray-level contrast difference, the contrast gain parameter of the differentiable contrast enhancement operator is adjusted, and based on the contrast gain parameter, differentiable contrast stretching and constraint operations are performed pixel by pixel on the denoised multimodal image data, outputting a feature map with contrast enhancement processing.

[0039] Specifically, firstly, the adaptive contrast constraint structure is implemented as a differentiable contrast enhancement operator supporting backpropagation, enabling it to be used as part of the neural network computation graph for both forward computation and backpropagation. This differentiable contrast enhancement operator is embedded in the front end of the neural network, following the differentiable denoising operator, and receives the denoised feature map as input. During the contrast enhancement process, local gray-level distribution analysis is first performed on the denoised multimodal image data. By segmenting the image into blocks or scanning a sliding window, the gray-level histogram distribution characteristics of each local region are statistically analyzed. Based on these gray-level distribution characteristics, potential defect regions and normal background regions in the image are automatically identified and distinguished. Potential defect regions typically manifest as local abnormal regions where the gray-level values deviate from the surrounding background. After region identification, the gray-level contrast difference between each potential defect region and its adjacent background region is further calculated. This difference directly reflects the identifiability of the defect in the current image. Then, based on the calculated gray-level contrast difference value, the contrast gain parameter of the differentiable contrast enhancement operator is adaptively adjusted. Specifically, for areas with small gray-level contrast differences, indicating that the gray-level values of the defect and the background are relatively close and difficult for the human eye or detection algorithm to distinguish, a larger contrast gain parameter is automatically applied to stretch the gray-level dynamic range of the local area, enhancing the visibility and recognizability of the defect. 
For areas with large gray-level contrast differences, a smaller contrast gain parameter is used or the original contrast level is maintained to avoid over-enhancement that could cause image distortion or introduce artifacts. Finally, based on the determined contrast gain parameter, differentiable contrast stretching and constraint operations are performed pixel-by-pixel on the denoised multimodal image data. The contrast stretching operation performs a linear or non-linear mapping transformation of the pixel gray-level values according to the gain parameter, while the constraint operation imposes upper and lower bound constraints on the stretched pixel values to prevent pixel value overflow. The entire contrast stretching and constraint operation maintains differentiability with respect to the input variables, ensuring that the gradient can be smoothly backpropagated, allowing the contrast gain parameter to be continuously optimized in end-to-end joint training. Finally, a contrast-enhanced feature map is output as standardized multimodal data for subsequent modules.
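
The gain adjustment and bounded stretching described in [0039] can be illustrated with the following sketch. It makes several assumptions not stated in this application: pixel values lie in [0, 1], local contrast is measured by the standard deviation of a 3x3 window, the gain moves smoothly between 1 and a hypothetical `max_gain`, and the upper and lower bounds are enforced by a logistic squashing function rather than a hard clip. The squashing keeps the operation smooth but slightly compresses values far from mid-gray.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def _sigmoid(x):
    # Numerically stable logistic function.
    return 0.5 * (1.0 + np.tanh(0.5 * x))


def enhance_contrast(img, max_gain=3.0, contrast_ref=0.05, softness=0.01, radius=1):
    """Return (enhanced image, per-pixel gain map)."""
    k = 2 * radius + 1
    padded = np.pad(img, radius, mode="reflect")
    win = sliding_window_view(padded, (k, k))         # (H, W, k, k)
    local_mean = win.mean(axis=(-2, -1))
    local_std = win.std(axis=(-2, -1))                # local contrast estimate
    # Low local contrast -> gain near max_gain; high contrast -> gain near 1.
    gain = 1.0 + (max_gain - 1.0) * _sigmoid((contrast_ref - local_std) / softness)
    stretched = local_mean + gain * (img - local_mean)
    # Smooth constraint to (0, 1); slope is 1 around mid-gray (0.5).
    bounded = _sigmoid(4.0 * (stretched - 0.5))
    return bounded, gain


# A faint defect and a strong defect on the same flat background.
faint = np.full((8, 8), 0.5)
faint[4, 4] = 0.52
strong = np.full((8, 8), 0.5)
strong[4, 4] = 0.9
faint_out, faint_gain = enhance_contrast(faint)
strong_out, strong_gain = enhance_contrast(strong)
```

The faint defect receives a gain close to `max_gain`, while the already visible defect is left almost unamplified, which mirrors the adaptive behaviour described above.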

[0040] In one embodiment, the multimodal image data of the mask substrate to be detected and the corresponding design data are input into the trained defect recognition model. Through a multi-task learning framework, the model synchronously outputs the defect bounding box, category, and pixel-level mask, obtaining the defect recognition results, including: The multimodal image data of the mask substrate to be inspected is processed by a differentiable operator for denoising and contrast enhancement to obtain standardized multimodal data of the mask substrate to be inspected. The standardized multimodal data of the mask substrate to be inspected is input into a CNN-Transformer hybrid encoder. Local texture and edge features are extracted through CNN branches, global context dependencies are modeled through the Transformer encoder, and multi-scale image features are output through multi-scale feature fusion. The design data is transformed to obtain design baseline features, and the multi-scale image features and design baseline features are adaptively aligned and fused through the multimodal information fusion perception network in the defect recognition model to obtain joint features. The joint features are input into a multi-task learning framework, which simultaneously outputs the bounding box, category, and pixel-level mask of the defect to obtain the defect recognition result.

[0041] Specifically, the acquired multimodal image data of the mask substrate to be detected is first input into a differentiable denoising operator and a differentiable contrast enhancement operator embedded in the front end of a neural network. Denoising and contrast enhancement are performed sequentially to obtain standardized multimodal data of the mask substrate to be detected. The parameters of the aforementioned differentiable preprocessing operator have been optimized to their optimal state through end-to-end joint optimization during the model training phase. Therefore, it can adaptively standardize image quality based on the noise distribution characteristics and gray-level contrast features of the current image to be detected, obtaining standardized multimodal data of the mask substrate to be detected, providing high-quality input for subsequent feature extraction. Then, the standardized multimodal data is input into a trained CNN-Transformer hybrid encoder for multi-scale image feature extraction. Within the hybrid encoder, a lightweight CNN branch first rapidly extracts local texture and edge features from the input image, obtaining feature maps rich in low-level details. Subsequently, the feature maps output by the CNN branch are segmented into sequential feature blocks and fed into the Transformer encoder. A multi-head self-attention mechanism calculates the association weights between these feature blocks, modeling long-distance dependencies between different spatial locations in the image and capturing global contextual contrast information between defects and surrounding normal areas. Finally, a multi-scale feature fusion module extracts feature maps at different depths from the encoder, weighting and fusing shallow high-resolution detail features, mid-level semantic features, and deep global abstract features to generate multi-scale image features that possess both detail representation capabilities and global semantic understanding capabilities. 
Simultaneously, the design data corresponding to the mask substrate to be detected is input into the multi-modal information fusion perception network of the defect recognition model for processing. The design data passes through the feature transformation layer of the design data branch, transforming the ideal geometric information stored in the CAD design drawings or GDSII design rule database into a design baseline feature map that matches the image feature dimensions. This design baseline feature map encodes design rule information such as the standard pattern layout, line width specifications, and graphic positions of the mask substrate. After obtaining multi-scale image features and design baseline features, the cross-modal attention fusion module in the multimodal information fusion perception network adaptively aligns and deeply fuses the two types of features. This module employs a cross-attention mechanism, using image features as query vectors and design baseline features as key-value vectors, to calculate the semantic correlation between each spatial location in the image and the corresponding design standard. Based on this, the image features and design features are weighted and fused to generate joint features. Through this cross-modal fusion process, actual imaging information and ideal design information are organically integrated, and the differences between the two are explicitly encoded into the joint features, providing a basis for accurately distinguishing real defects from the structural features and background noise of the design itself. Finally, the fused joint features are input into a multi-task learning framework. Through one forward propagation calculation, the three tasks of defect localization, classification and segmentation are completed simultaneously, and the complete defect recognition result including defect bounding box coordinates, defect category label and pixel-level defect mask is output.
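
The cross-attention fusion step, with image features as queries and design baseline features as keys and values, can be sketched as a single-head scaled dot-product attention. The single head, the residual addition used to form the joint features, the random projection matrices, and the function name `cross_attention_fuse` are assumptions for illustration; this application does not specify these details.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention_fuse(image_feats, design_feats, w_q, w_k, w_v):
    """image_feats: (N_img, d); design_feats: (N_design, d).

    Returns joint features of shape (N_img, d) and the attention map
    of shape (N_img, N_design).
    """
    q = image_feats @ w_q                  # queries from image features
    k = design_feats @ w_k                 # keys from design baseline features
    v = design_feats @ w_v                 # values from design baseline features
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    joint = image_feats + attn @ v         # residual fusion (an assumption)
    return joint, attn


rng = np.random.default_rng(0)
d = 8
image_feats = rng.normal(size=(6, d))      # 6 image feature tokens
design_feats = rng.normal(size=(5, d))     # 5 design feature tokens
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
joint, attn = cross_attention_fuse(image_feats, design_feats, w_q, w_k, w_v)
```

Each row of `attn` sums to one and indicates how strongly an image location attends to each design location.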

[0042] In one embodiment, the joint features are input into a multi-task learning framework, which simultaneously outputs the defect's bounding box, category, and pixel-level mask, resulting in defect recognition results including: The joint features are input into the shared detection head of the multi-task learning framework, and joint inference is performed through localization, classification and segmentation branches. The localization branch outputs the defect bounding box, the classification branch outputs the defect category and the segmentation branch outputs the pixel-level defect mask. The localization result, classification result and segmentation result are post-processed and fused to remove duplicate detection and false defects, and the final defect recognition result is obtained.

[0043] Specifically, the multi-task learning framework adopts a shared detection head design in its structure. This involves building a shared feature transformation layer on top of the joint features to further abstract and integrate the fused joint features. Then, three functionally independent task branches are connected in parallel on top of the shared features: a localization branch, a classification branch, and a segmentation branch. These three branches share the same input features but are independent in parameters, each focusing on completing a different recognition subtask. During inference, the joint features obtained after cross-modal fusion are input into the shared detection head. The localization branch encodes the target location information in the joint features, predicting the center point coordinates, width, and height of each potential defect region through a bounding box regression network, and outputting the bounding box coordinates of the defect. The classification branch discriminates the semantic category information in the joint features, calculating the probability score of each detected defect region belonging to each category through a fully connected classification layer, and outputting the category label of the defect. Defect categories include particle contamination, image distortion, edge roughness, and transparent defects. The segmentation branch performs pixel-level dense prediction on the joint features, outputting a two-dimensional mask map with the same resolution as the input image through pixel-by-pixel classification. Each pixel is assigned a label indicating either a foreground defect or background, forming a pixel-accurate description of the defect contour and extent. After obtaining the initial output results from the three branches, the localization, classification, and segmentation results are post-processed and fused. 
Specifically, the pixel-level mask output from the segmentation branch is cropped based on the bounding box output from the localization branch, ensuring that each detected defect instance is associated with its corresponding spatial range. Simultaneously, the detection results are filtered using the category confidence score output from the classification branch, eliminating low-quality detection boxes with confidence scores below a preset threshold. Furthermore, a non-maximum suppression algorithm is used to remove duplicate detection boxes for the same defect region, eliminating redundant results. In addition, connected component analysis of the segmentation mask further filters out pseudo-defect regions that are too small or whose shapes do not conform to the physical laws of defects. After these post-processing fusion steps, the final defect identification result is generated. This result includes the precise bounding box, reliable category label, and fine pixel-level contour mask for each defect instance, providing comprehensive and accurate defect diagnostic information for subsequent process adjustments and quality traceability of the mask substrate.
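
The post-processing fusion described in [0043] can be sketched as follows. The sketch applies confidence filtering, greedy non-maximum suppression, and a mask-area check inside each bounding box. The area check is a simplified stand-in for the connected component analysis mentioned above, and the thresholds, the `(x1, y1, x2, y2)` box convention, and the function names are illustrative assumptions.

```python
import numpy as np


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def postprocess(boxes, scores, mask, conf_thresh=0.5, iou_thresh=0.5, min_area=4):
    """Return indices of detections kept after filtering and suppression."""
    # 1) Drop detections below the confidence threshold, sort the rest by score.
    order = sorted((i for i, s in enumerate(scores) if s >= conf_thresh),
                   key=lambda i: -scores[i])
    kept = []
    for i in order:
        # 2) Non-maximum suppression against already accepted boxes.
        if any(iou(boxes[i], boxes[j]) > iou_thresh for j in kept):
            continue
        # 3) Crop the segmentation mask to the box and reject tiny regions.
        x1, y1, x2, y2 = (int(round(v)) for v in boxes[i])
        if int(mask[y1:y2, x1:x2].sum()) < min_area:
            continue
        kept.append(i)
    return kept


mask = np.zeros((20, 20), dtype=np.uint8)
mask[2:8, 2:8] = 1                                  # one real defect region
boxes = [(2, 2, 8, 8), (3, 3, 9, 9), (12, 12, 16, 16), (14, 14, 18, 18)]
scores = [0.9, 0.8, 0.3, 0.7]
kept = postprocess(boxes, scores, mask)
```

In this example the second box is suppressed as a duplicate of the first, the third is rejected for low confidence, and the fourth is rejected because the mask contains no defect pixels inside it.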

[0044] The above are all preferred embodiments of this application, and are not intended to limit the scope of protection of this application. Therefore, all equivalent changes made in accordance with the structure, shape and principle of this application should be covered within the scope of protection of this application.

Claims

1. A method for image recognition of mask substrate defects based on deep learning, characterized in that it comprises: Multimodal image data of a mask substrate is acquired, and the multimodal image data is processed by a differentiable operator for denoising and contrast enhancement to obtain standardized multimodal data. By employing a two-stage training paradigm of self-supervised pre-training and few-sample fine-tuning, a defect recognition model is obtained by training a CNN-Transformer hybrid encoder and a multimodal information fusion perception network based on the standardized multimodal data and synthetic defect data. The synthetic defect data is generated by a defect synthesis technique based on a GAN or a diffusion model and conforms to the physical laws of mask substrate defects. The multimodal image data of the mask substrate to be detected and the corresponding design data are input into the trained defect recognition model. Through a multi-task learning framework, the bounding box, category, and pixel-level mask of the defect are output simultaneously to obtain the defect recognition result. The confidence level of the defect recognition result is determined, and results with a confidence level lower than a preset threshold are transferred to manual review. The reviewed data is used as incremental data to update the defect recognition model online, so as to achieve adaptive calibration of the model.

2. The method for identifying mask substrate defects based on deep learning according to claim 1, characterized in that, The two-stage training paradigm, employing self-supervised pre-training and few-sample fine-tuning, trains a CNN-Transformer hybrid encoder and a multimodal information fusion perception network based on the standardized multimodal data and synthetic defect data to obtain a defect recognition model, including: A synthetic defect image that conforms to the physical laws of mask substrate defects is generated by a defect synthesis model, and then mixed with real labeled defect samples in the standardized multimodal data to form a training set for few-sample fine-tuning. By utilizing a large amount of unlabeled normal image data of mask substrates, the image feature extraction part of the CNN-Transformer hybrid encoder is pre-trained in a self-supervised manner to obtain a pre-trained model with a strong generalization foundation. Using the pre-trained model as initialization, the CNN-Transformer hybrid encoder and the multimodal information fusion perception network are jointly fine-tuned under supervision using the training set. During the fine-tuning process, the local and global feature extraction of the hybrid encoder and the cross-modal attention fusion weights of image features and design data in the fusion perception network are optimized simultaneously. After joint fine-tuning, the converged CNN-Transformer hybrid encoder and the multimodal information fusion perception network are integrated into an end-to-end model as a defect recognition model.

3. The method for identifying mask substrate defects based on deep learning according to claim 2, characterized in that, The method of using a large amount of unlabeled normal image data of mask substrates to perform self-supervised pre-training on the image feature extraction part of the CNN-Transformer hybrid encoder to obtain a pre-trained model with strong generalization foundation includes: A large number of unlabeled normal image data of mask substrates obtained from different batches and under different process conditions in the mask substrate production line are collected to form a self-supervised pre-training dataset. A self-supervised task is constructed by contrastive learning or mask image modeling methods, and the self-supervised task is executed on the self-supervised pre-training dataset. The backbone network responsible for image feature extraction in the CNN-Transformer hybrid encoder is iteratively trained until the preset loss function converges, so that the backbone network learns the texture structure, edge contours and intrinsic consistency features between multimodal images of the normal image of the mask substrate, thereby obtaining a pre-trained model with strong generalization foundation.

4. The mask substrate defect image recognition method based on deep learning according to claim 3, characterized in that the step of constructing a self-supervised task by contrastive learning or masked image modeling includes: when the self-supervised task is constructed by contrastive learning, applying a data augmentation strategy to each input image in the self-supervised pre-training dataset to generate two different augmented views of the same image as a positive sample pair, taking augmented views of other images as negative samples, and optimizing the backbone network with a contrastive loss function so that the feature vectors of positive sample pairs are pulled close together and the feature vectors of negative sample pairs are pushed far apart, thereby learning a general feature representation that is invariant to augmentation; and when the self-supervised task is constructed by masked image modeling, dividing each input image in the self-supervised pre-training dataset into multiple image blocks, randomly masking some of the image blocks according to a preset ratio, constructing a self-supervised task that restores the pixel values of the masked image blocks from the context information of the unmasked image blocks, and optimizing the backbone network with a reconstruction loss function so that it learns the local texture details and global structural prior knowledge of the mask substrate.
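As a rough illustration of the two self-supervised objectives described in claim 4, the sketch below computes an InfoNCE-style contrastive loss and a masked-patch reconstruction loss in plain Python. The function names, the temperature value, and the choice of InfoNCE and mean squared error are assumptions made for illustration; the claim itself only requires a contrastive loss function and a reconstruction loss function.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce_loss(view_a, view_b, temperature=0.1):
    """InfoNCE loss over a batch of feature vectors (assumed loss, not from the claim).

    view_a[i] and view_b[i] come from two augmentations of image i (positive pair);
    view_b[j] with j != i acts as a negative for view_a[i].
    """
    a = [l2_normalize(v) for v in view_a]
    b = [l2_normalize(v) for v in view_b]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [sum(p * q for p, q in zip(anchor, other)) / temperature for other in b]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denominator - logits[i]
    return total / len(a)


def random_patch_mask(num_patches, mask_ratio, rng):
    """Return a boolean list marking which image blocks are hidden from the encoder."""
    num_masked = int(round(num_patches * mask_ratio))
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def masked_reconstruction_loss(target_patches, predicted_patches, mask):
    """Mean squared error over the pixels of masked blocks only."""
    errors = [
        (t - p) ** 2
        for tp, pp, hidden in zip(target_patches, predicted_patches, mask)
        if hidden
        for t, p in zip(tp, pp)
    ]
    return sum(errors) / len(errors) if errors else 0.0
```

In this sketch the contrastive loss is small when each feature vector is most similar to its own second view and large when the views are mismatched, while the reconstruction loss ignores unmasked blocks entirely, so the backbone is only rewarded for inferring hidden content from context.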

5. The mask substrate defect image recognition method based on deep learning according to claim 1, characterized in that the step of acquiring multimodal image data of the mask substrate and performing denoising and contrast enhancement on the multimodal image data using differentiable operators to obtain standardized multimodal data includes: acquiring the multimodal image data of the mask substrate through a multi-sensor synchronous acquisition device, and synchronously obtaining the design data corresponding to the mask substrate; inputting the multimodal image data into a differentiable denoising operator for denoising, wherein the differentiable denoising operator adopts an improved variable step-size median filter structure; and inputting the denoised multimodal image data into a differentiable contrast enhancement operator for contrast enhancement to obtain the standardized multimodal data, wherein the differentiable contrast enhancement operator adopts an adaptive contrast constraint structure.

6. The mask substrate defect image recognition method based on deep learning according to claim 5, characterized in that the step of inputting the multimodal image data into a differentiable denoising operator for denoising includes: implementing the improved variable step-size median filter structure as a differentiable denoising operator that supports gradient backpropagation, so that the differentiable denoising operator can participate in forward computation and backpropagation as part of the neural network computation graph; estimating the local noise intensity of the multimodal image data to obtain noise distribution information for each region of the image, and adjusting the filtering step-size parameter of the differentiable denoising operator according to the noise distribution information; and, based on the filtering step-size parameter, sorting and weighting the pixel values of the image within the filtering window in a differentiable manner, and outputting a denoised feature map.
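One way to picture a median filter whose sorting and weighting step stays differentiable is a softmax-weighted "soft median". The sketch below is a minimal one-dimensional illustration, not the patented operator: the names `soft_median`, `local_noise`, and `denoise_row` are hypothetical, and the filtering step-size parameter is modeled here as a temperature `tau` derived from a local standard deviation, which is an assumption about how noise information could steer the filter.

```python
import math


def soft_median(values, tau):
    """Smooth surrogate for the median of a window.

    Each sample is weighted by a softmax over its negative total absolute
    distance to the other samples. Small tau approaches the true median,
    large tau approaches the mean, and the result is smooth in the inputs.
    """
    costs = [sum(abs(v - u) for u in values) for v in values]
    lowest = min(costs)  # shift costs so the largest weight is exp(0) = 1
    weights = [math.exp(-(c - lowest) / tau) for c in costs]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def local_noise(values):
    """Standard deviation of the window, used as a crude noise estimate (assumption)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def denoise_row(row, radius=1, base_tau=1.0):
    """Apply the soft median along one image row, replicating edge pixels.

    Noisier windows get a smaller tau, i.e. a harder, more median-like filter.
    """
    out = []
    last = len(row) - 1
    for i in range(len(row)):
        window = [row[min(max(j, 0), last)] for j in range(i - radius, i + radius + 1)]
        tau = base_tau / (1.0 + local_noise(window))
        out.append(soft_median(window, tau))
    return out
```

For example, `denoise_row([10, 10, 200, 10, 10])` suppresses the isolated spike because the outlier has the largest total distance to its neighbors and therefore receives a near-zero weight. In a real training pipeline the same computation would be written with an automatic differentiation framework so gradients can flow through it.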

7. The mask substrate defect image recognition method based on deep learning according to claim 5, characterized in that the step of inputting the denoised multimodal image data into a differentiable contrast enhancement operator for contrast enhancement to obtain standardized multimodal data includes: implementing the adaptive contrast constraint structure as a differentiable contrast enhancement operator that supports gradient backpropagation, so that the differentiable contrast enhancement operator can participate in forward computation and backpropagation as part of the neural network computation graph; performing local grayscale distribution analysis on the denoised multimodal image data to identify potential defect regions and background regions, and calculating the grayscale contrast difference between the potential defect regions and the background regions; and adjusting the contrast gain parameter of the differentiable contrast enhancement operator based on the grayscale contrast difference, and, based on the contrast gain parameter, performing a differentiable contrast stretching and limiting operation pixel by pixel on the denoised multimodal image data to output a contrast-enhanced feature map.
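The contrast gain adjustment and the stretching and limiting operation in claim 7 can be sketched with smooth functions so that gradients remain defined everywhere. The example below is an illustration under stated assumptions, not the claimed operator: the 2-sigma rule for flagging potential defect pixels, the exponential gain schedule, and the `tanh` limiter, along with all function and parameter names, are hypothetical choices.

```python
import math


def contrast_difference(pixels, k=2.0):
    """Gap between the mean gray level of potential-defect and background pixels.

    Pixels further than k standard deviations from the mean are treated as
    potential defects (an assumed rule for this sketch).
    """
    mean = sum(pixels) / len(pixels)
    std = math.sqrt(sum((p - mean) ** 2 for p in pixels) / len(pixels))
    defect = [p for p in pixels if abs(p - mean) > k * std]
    background = [p for p in pixels if abs(p - mean) <= k * std]
    if not defect or not background:
        return 0.0
    return abs(sum(defect) / len(defect) - sum(background) / len(background))


def adaptive_gain(diff, max_boost=3.0, scale=0.1):
    """Weak defect/background contrast yields a gain near 1 + max_boost;
    strong contrast yields a gain near 1 (no extra amplification)."""
    return 1.0 + max_boost * math.exp(-diff / scale)


def soft_stretch(pixels, gain, limit=0.5):
    """Stretch gray levels around the mean with slope `gain`, softly limited
    to mean +/- limit by tanh so the operation stays smooth."""
    mean = sum(pixels) / len(pixels)
    return [mean + limit * math.tanh(gain * (p - mean) / limit) for p in pixels]
```

With gray levels normalized to [0, 1], a faint defect pixel at 0.6 on a 0.5 background receives a gain of roughly 2.1 in this sketch, so its separation from the background about doubles, while the `tanh` limiter keeps every output within `limit` of the local mean.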

8. The mask substrate defect image recognition method based on deep learning according to claim 1, characterized in that the step of inputting the multimodal image data of the mask substrate to be detected and the corresponding design data into the trained defect recognition model, and synchronously outputting the defect bounding box, category, and pixel-level mask through a multi-task learning framework to obtain a defect recognition result, includes: performing denoising and contrast enhancement on the multimodal image data of the mask substrate to be detected using the differentiable operators to obtain standardized multimodal data of the mask substrate to be detected; inputting the standardized multimodal data of the mask substrate to be detected into the CNN-Transformer hybrid encoder, extracting local texture and edge features through the CNN branch, modeling global context dependencies through the Transformer encoder, and outputting multi-scale image features through multi-scale feature fusion; transforming the design data to obtain design baseline features, and using the multimodal information fusion perception network in the defect recognition model to adaptively align and fuse the multi-scale image features with the design baseline features to obtain joint features; and inputting the joint features into the multi-task learning framework, which simultaneously outputs the bounding box, category, and pixel-level mask of the defect to obtain the defect recognition result.
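The alignment and fusion of image features with design baseline features can be pictured as cross-attention, in which each image feature queries the design features. The sketch below is a minimal single-head version in plain Python without learned projection matrices; the function names and the residual addition are assumptions for illustration, and the fusion perception network in the claim is not specified at this level of detail.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(image_tokens, design_tokens):
    """Single-head scaled dot-product cross-attention (illustrative only).

    Each image token is a query; design tokens act as keys and values.
    The attended design feature is added back to the image token, giving
    a joint feature of the same dimension.
    """
    dim = len(image_tokens[0])
    fused, attention = [], []
    for q in image_tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in design_tokens]
        weights = softmax(scores)
        context = [sum(w * v[d] for w, v in zip(weights, design_tokens)) for d in range(dim)]
        fused.append([x + c for x, c in zip(q, context)])
        attention.append(weights)
    return fused, attention
```

Each row of the returned attention weights sums to 1 and shows which design features an image location relied on. A trained network would additionally apply learned query, key, and value projections, which are the kind of fusion weights that fine-tuning would adjust.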

9. The mask substrate defect image recognition method based on deep learning according to claim 8, characterized in that the step of inputting the joint features into the multi-task learning framework and simultaneously outputting the bounding box, category, and pixel-level mask of the defect to obtain the defect recognition result includes: inputting the joint features into the shared detection head of the multi-task learning framework, and performing joint inference through a localization branch, a classification branch, and a segmentation branch; outputting the defect bounding box from the localization branch, the defect category from the classification branch, and the pixel-level defect mask from the segmentation branch; and post-processing and fusing the localization, classification, and segmentation results to remove duplicate detections and false defects, thereby obtaining the final defect recognition result.
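The post-processing step that removes duplicate detections and false defects is commonly realized with a confidence threshold followed by non-maximum suppression (NMS). The claim does not name a specific algorithm, so the sketch below is an assumed, minimal greedy per-class NMS; the dictionary keys `box`, `label`, and `score` and both threshold values are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def postprocess(detections, score_threshold=0.5, iou_threshold=0.5):
    """Drop low-confidence detections (likely false defects), then keep only
    the highest-scoring box among same-class boxes that overlap heavily."""
    candidates = sorted(
        (d for d in detections if d["score"] >= score_threshold),
        key=lambda d: d["score"],
        reverse=True,
    )
    kept = []
    for det in candidates:
        duplicate = any(
            k["label"] == det["label"] and iou(k["box"], det["box"]) > iou_threshold
            for k in kept
        )
        if not duplicate:
            kept.append(det)
    return kept
```

A pixel-level mask predicted by the segmentation branch could be carried along in each detection dictionary and retained or discarded together with its box, so that the three branch outputs stay consistent after filtering.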