A cross-category industrial defect detection method, device and medium
Using a pre-trained normalizing flow model and a masked autoencoder, image patches in suspected defect regions are masked and reconstructed to generate an anomaly score map. This addresses the difficulty that existing techniques have with cross-category detection and enables efficient defect detection across diverse product categories.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NINGBO HAITANG INFORMATION TECH CO LTD
- Filing Date
- 2022-12-23
- Publication Date
- 2026-06-16
[Figure: abstract drawing of CN116152174B]
Abstract
Description
Technical Field
[0001] This invention relates to the field of defect detection technology, specifically to a cross-category industrial defect detection method, device, and medium.
Background Technology
[0002] Currently, there are two main directions in the field of industrial product defect detection: object detection and anomaly detection. Object detection-based defect detection methods have been extensively studied, but addressing defect detection from the perspective of anomaly detection presents many challenges. Defect detection involves several difficulties and challenges: uncertainty—defects are associated with many uncertainties, such as uncertain visual features; some defects are uncertain until they actually occur. Scarcity—defect samples are usually quite scarce; collecting a large dataset of labeled defects is extremely difficult and almost impossible. Heterogeneity—defects are irregular; therefore, one type of defect may exhibit completely different visual features from another, and even defects of the same type may show variations in features. These characteristics of defects sometimes prevent object detection-based methods from working effectively, but anomaly detection-based methods can address these difficulties and challenges to some extent.
[0003] Currently, anomaly detection is mainly based on normal feature modeling. These methods typically only require normal samples for network training and focus on the features of normal samples. During detection, samples whose features deviate significantly from normal features are considered anomalies. The biggest advantage of this type of method is that it does not require labeled anomaly samples as training samples, which can greatly reduce the manpower and financial costs of data collection. This makes it more attractive when anomalies are not fully known and data collection is expensive. However, the problem with this type of method is that it may classify all samples with features different from normal features as anomalies.
[0004] However, considering the issues of category independence and generality, existing anomaly detection methods typically require training a specific model for each industrial product category. This one-to-one paradigm incurs higher computational and memory overhead, and in practical applications it demands more resources to store different model weights. Furthermore, new product categories often emerge in real-world scenarios, but these trained models cannot be directly applied to new categories. This can lead to system failures in new scenarios, while maintaining the system through retraining or fine-tuning is not cost-effective. Therefore, existing anomaly detection methods remain unsatisfactory for real-world scenarios.
Summary of the Invention
[0005] The technical problem to be solved by the present invention is to provide, in view of the current state of the prior art, a cross-category industrial defect detection method, device and medium. The method obtains suspected defect regions in the image through a pre-trained normalizing flow model and masks the corresponding image blocks. The masked image blocks are reconstructed using the remaining unmasked image blocks. Finally, the defect regions of the sample to be detected are determined through the anomaly score map, so that the industrial defect detection method works across categories.
[0006] The technical solution adopted by the present invention to solve the above-mentioned technical problems is as follows:
[0007] A first aspect of the present invention provides a cross-category industrial defect detection method, comprising:
[0008] S1: The acquired sample image to be detected is processed by a pre-trained normalizing flow model to obtain the suspected defect regions in the image, and the image blocks corresponding to the suspected defect regions are masked;
[0009] S2: Based on the trained masked autoencoder, the masked image blocks are reconstructed using the remaining unmasked image blocks;
[0010] S3: The uncertainty between the visual words of the reconstructed image blocks and the original visual words of the masked image blocks is used as the reconstruction error, the corresponding anomaly score map is generated, and the defect region of the sample to be detected is determined based on the anomaly score map.
[0011] Furthermore, the steps for obtaining suspected defect regions in an image using a pre-trained normalizing flow model include:
[0012] S11: For the input sample image to be detected, the sample is represented in the form of local image patches and the features of each image patch are extracted by a convolutional neural network.
[0013] S12: Subtract the corresponding prototype features from the features in each image patch to obtain the residual features;
[0014] S13: Input the residual features into the pre-trained normalizing flow model to generate the likelihood score of each image patch;
[0015] S14: Sort all image blocks in the sample image to be detected according to their likelihood scores, and determine, by a pre-set score threshold, which image blocks (those with the lowest likelihood scores) serve as the masking regions.
[0016] Furthermore, after the features of each image block are extracted from the normal image sample set using a convolutional neural network, key features are extracted using a core set sampling mechanism as the normal prototype feature set. At each step, the core set sampling mechanism takes, for every feature in the non-core set, the closest feature in the existing core set as its reference feature, and then selects the non-core feature that is farthest from its reference feature as a new core feature. This is repeated until a sufficient number of normal features have been selected as the normal prototype feature set.
[0017] Furthermore, the steps for training the normalizing flow model include:
[0018] T1: Normal image samples in the training set are represented as local image patches, and the features of each image patch are extracted by a convolutional neural network to obtain a normal feature set;
[0019] T2: Utilize the core set sampling mechanism to extract key features from the normal feature set as normal prototype features;
[0020] T3: The residual features, obtained by subtracting the corresponding prototype features from the features of each image patch in the training set, are input into the normalizing flow model and transformed into the latent feature space. By constraining the latent feature space to conform to a Gaussian distribution, the model generates high likelihood scores for normal residual features.
[0021] Furthermore, the masked autoencoder includes an encoder and a decoder. The encoder extracts features of the unmasked image blocks in the input image sequence, and the decoder combines the features of the unmasked image blocks with the special mask tokens of the masked image blocks to reconstruct visual words for the masked image blocks.
[0022] Furthermore, the step of training the masked autoencoder includes:
[0023] A1: The input normal image training samples are represented in the form of local image patches, and some image patches are masked. The masked image patches are transformed into visual word representations by a pre-trained word encoder to obtain the original visual word representations.
[0024] A2: Input the unmasked image block into the encoder to extract features from the image block;
[0025] A3: Reconstruct the visual word representation of the masked image patches based on the features of the unmasked image patches to obtain the visual word representation of the reconstructed image patches;
[0026] A4: The masked autoencoder is trained based on the visual word representations of the reconstructed image patches and the original visual word representations, so that the masked autoencoder can reconstruct masked image patches using unmasked image patches.
[0027] Furthermore, the visual word representation is obtained by converting the pixel values of the image blocks.
[0028] Furthermore, when training the masked autoencoder, the masking of image blocks employs one of the following masking strategies, or a combination of two or more of them: random masking strategy, continuous region-based masking strategy, region-restricted masking strategy, frequency-based masking strategy, or dynamic masking strategy.
[0029] In a second aspect, the present invention provides a cross-category industrial defect detection device, comprising at least one processor and at least one memory, wherein the memory stores a computer program that, when executed by the processor, enables the processor to perform a cross-category industrial defect detection method.
[0030] A third aspect of the present invention provides a computer-readable storage medium that, when instructions in the storage medium are executed by a processor within a device, enables the device to perform a cross-category industrial defect detection method.
[0031] Compared with the prior art, the present invention has at least the following beneficial effects:
[0032] (1) The cross-category industrial defect detection method described in this invention can be modeled using only normal samples, and the sample defects can be detected and located, thus avoiding the need for a large-scale labeled defect sample dataset.
[0033] (2) The suspected defect region in the image is obtained by a pre-trained normalizing flow model, and the image blocks corresponding to the suspected defect region are masked. The masked image blocks are reconstructed using the unmasked image blocks. For newly emerging industrial product categories, the model under this method can be used for defect detection of the new category without retraining, which makes the method well suited to scenarios such as industrial defect detection with diversified product categories.
Description of the Drawings
[0034] Figure 1 is a flowchart of the cross-category industrial defect detection method provided in an embodiment of the present invention;
[0035] Figure 2 is a flowchart of the steps for obtaining suspected defect regions in an image provided in an embodiment of the present invention;
[0036] Figure 3 is a flowchart of the steps for training the normalizing flow model provided in an embodiment of the present invention;
[0037] Figure 4 is a schematic diagram of the framework for training the normalizing flow model provided in an embodiment of the present invention;
[0038] Figure 5 is a flowchart of the steps for training the masked autoencoder provided in an embodiment of the present invention;
[0039] Figure 6 is a schematic diagram of the framework for training the masked autoencoder provided in an embodiment of the present invention.
Detailed Description of the Embodiments
[0040] The following are specific embodiments of the present invention, which are described in conjunction with the accompanying drawings. However, the present invention is not limited to these embodiments.
[0041] Example 1
[0042] As shown in Figure 1, the present invention provides a cross-category industrial defect detection method, comprising the following steps:
[0043] S1: The acquired sample image to be detected is processed by a pre-trained normalizing flow model to obtain the suspected defect regions in the image, and the image blocks corresponding to the suspected defect regions are masked;
[0044] S2: Based on the trained masked autoencoder, the masked image blocks are reconstructed using the remaining unmasked image blocks;
[0045] S3: The uncertainty between the visual words of the reconstructed image blocks and the original visual words of the masked image blocks is used as the reconstruction error, the corresponding anomaly score map is generated, and the defect region of the sample to be detected is determined based on the anomaly score map.
[0046] Attention-based networks such as ViT and Swin Transformer can be used as feature extraction networks in the normalizing flow model. The suspected defect regions of the sample image to be detected are obtained through the pre-trained normalizing flow model and used as masking regions. The masked image patches are reconstructed using the remaining unmasked image patches. Finally, the uncertainty between the visual words of the reconstructed image patches and the original visual words of the masked image patches is used as the reconstruction error of the masked region, from which an anomaly score map is generated.
[0047] The normalizing flow model and the masked autoencoder model designed in this invention are both category-independent. For a newly emerging industrial product category, the models can be applied to that category without retraining, which suits application scenarios such as industrial defect detection with diverse product categories. Because only normal samples are needed, large-scale labeled defect sample datasets are also unnecessary.
[0048] Furthermore, as shown in Figure 2, the steps for obtaining suspected defect regions in an image using a pre-trained normalizing flow model include:
[0049] S11: For the input sample image to be detected, the sample is represented in the form of local image patches and the features of each image patch are extracted by a convolutional neural network.
[0050] S12: Subtract the corresponding prototype features from the features in each image patch to obtain the residual features;
[0051] S13: Input the residual features into the pre-trained normalizing flow model to generate the likelihood score of each image patch;
[0052] S14: Sort all image blocks in the sample image to be detected according to their likelihood scores, and determine, by a pre-set score threshold, which image blocks (those with the lowest likelihood scores) serve as the masking regions.
[0053] The input sample to be detected is represented as local image patches, and the features of each image patch are extracted. The nearest prototype feature is subtracted from the feature of each image patch to obtain the residual feature, which is input into the normalizing flow model to generate the likelihood score of the image patch. For normal image patches the network generally generates a higher likelihood score, while for defective image patches it generally generates a lower one. All image patches in the sample are then sorted in ascending order of likelihood score, so that patches ranked nearer the front are more likely to be suspected defect regions, and the top m% of image patches are selected as the masking region. With an appropriately chosen hyperparameter m, the masking region covers as many of the suspected defect regions as possible. For samples of a new category, the network does not need to be retrained; only the normal prototype features of that category need to be extracted. These normal prototype features are combined with the image patch features to generate residual features, which are input into the normalizing flow network to directly generate the likelihood scores of the image patches in the new-category sample and hence the masking region.
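The selection of the top m% of patches described in this paragraph can be sketched as follows. This is a minimal illustration, assuming the per-patch likelihood scores are already available as a flat array; the function name `select_mask_patches` and the example scores are hypothetical and do not come from the patent.

```python
import numpy as np

def select_mask_patches(likelihood_scores: np.ndarray, m_percent: float) -> np.ndarray:
    """Return a boolean array marking the m% of patches with the lowest likelihood."""
    n = likelihood_scores.size
    k = int(np.ceil(n * m_percent / 100.0))
    k = min(max(k, 0), n)
    # Ascending sort: the least likely (most suspicious) patches come first.
    order = np.argsort(likelihood_scores, kind="stable")
    mask = np.zeros(n, dtype=bool)
    mask[order[:k]] = True
    return mask

# Hypothetical likelihood scores for 8 patches; two of them are clearly low.
scores = np.array([-1.2, -0.9, -7.5, -1.0, -6.8, -1.1, -0.8, -1.3])
mask = select_mask_patches(scores, m_percent=25)
```

With 8 patches and m = 25, the two patches with the lowest scores are marked for masking.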
[0054] Furthermore, after the features of each image block are extracted from the normal image sample set using a convolutional neural network, key features are extracted using a core set sampling mechanism as the normal prototype feature set. At each step, the core set sampling mechanism takes, for every feature in the non-core set, the closest feature in the existing core set as its reference feature, and then selects the non-core feature that is farthest from its reference feature as a new core feature. This is repeated until a sufficient number of normal features have been selected as the normal prototype feature set.
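One common way to realize this kind of core set sampling is greedy farthest-point selection. The sketch below is an illustration under assumptions, not the patent's implementation: it uses Euclidean distance and a randomly chosen first core feature, neither of which is specified in the text, and the name `greedy_coreset` is hypothetical.

```python
import numpy as np

def greedy_coreset(features: np.ndarray, num_prototypes: int, seed: int = 0) -> np.ndarray:
    """Greedy minimax selection; returns the indices of the chosen core features."""
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    num_prototypes = min(num_prototypes, n)
    first = int(rng.integers(n))  # assumption: random initial core feature
    selected = [first]
    # Distance from every feature to its nearest core feature (its reference feature).
    min_dist = np.linalg.norm(features - features[first], axis=1)
    while len(selected) < num_prototypes:
        nxt = int(np.argmax(min_dist))  # feature farthest from its reference feature
        selected.append(nxt)
        dist_new = np.linalg.norm(features - features[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_new)
    return np.array(selected)

# Toy normal feature set with three well-separated groups of two features each.
features = np.array([[0.0, 0.0], [0.1, 0.0],
                     [10.0, 0.0], [10.1, 0.0],
                     [0.0, 10.0], [0.0, 10.1]])
idx = greedy_coreset(features, num_prototypes=3)
prototypes = features[idx]
```

On this toy set the three selected prototypes fall in three different groups, which is the coverage behavior the sampling mechanism aims for.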
[0055] As shown in Figure 3 and Figure 4, the steps for training the normalizing flow model include:
[0056] T1: Normal image samples in the training set are represented as local image patches, and the features of each image patch are extracted by a convolutional neural network to obtain a normal feature set;
[0057] T2: Utilize the core set sampling mechanism to extract key features from the normal feature set as normal prototype features;
[0058] T3: The residual features, obtained by subtracting the corresponding prototype features from the features of each image patch in the training set, are input into the normalizing flow model and transformed into the latent feature space. By constraining the latent feature space to conform to a Gaussian distribution, the model generates high likelihood scores for normal residual features.
[0059] During the training phase of the normalizing flow model, preprocessing is required: the image is scaled and cropped to the specified resolution, the RGB channels are scaled to the range [0,1], and each channel is then standardized using the mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225]. During the training phase of the masked autoencoder network for image block-level reconstruction, the preprocessing randomly crops a region from the image, scales that region, and then applies the same [0,1] scaling and mean and standard deviation standardization. During the testing phase, the image preprocessing is the same as for training the normalizing flow model: cropping and scaling to the specified resolution, followed by normalization. A convolutional neural network pre-trained on ImageNet is used as the feature extraction network. Normal samples in the training set are represented as local image patches, and the features of each image patch are extracted to obtain a normal feature set. Key features are then extracted from the normal feature set as normal prototype features using the minimax core set sampling mechanism.
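The channel normalization mentioned above can be sketched as follows, assuming an 8-bit RGB image stored as a height by width by 3 array. Resizing and cropping are omitted, and the function name `normalize_rgb` is hypothetical.

```python
import numpy as np

# Per-channel statistics quoted in the text (the usual ImageNet values).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalize_rgb(image_uint8: np.ndarray) -> np.ndarray:
    """Scale an HxWx3 uint8 image to [0, 1], then standardize each channel."""
    scaled = image_uint8.astype(np.float32) / 255.0
    return (scaled - IMAGENET_MEAN) / IMAGENET_STD

# A black 2x2 image maps to -mean/std in every pixel.
black = np.zeros((2, 2, 3), dtype=np.uint8)
out = normalize_rgb(black)
```

Note that after the mean and standard deviation step the values are no longer confined to [0,1]; only the intermediate scaling is.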
[0060] Specifically, for the normal feature set $F = \{f_1, \dots, f_n\}$, the normal prototype feature set $P$ is generated by repeatedly adding the feature selected in the following manner:
[0061] $$f^{*} = \arg\max_{f \in F \setminus P} \; \min_{p \in P} \lVert f - p \rVert_2, \qquad P \leftarrow P \cup \{f^{*}\}$$
[0062] Then, the features of image patches are extracted from the input samples in the same way. The feature $x_i$ of each image patch is matched with its nearest neighbor in the normal prototype feature set as a reference feature $p_i$:
[0063] $$p_i = \arg\min_{p \in P} \lVert x_i - p \rVert_2$$
[0064] Subtracting the reference feature from the image patch feature yields the residual feature:
[0065] $$r_i = x_i - p_i$$
[0066] The training process uses a normalizing flow network $\varphi_\theta$ to transform the residual features into the latent feature space, $z_i = \varphi_\theta(r_i)$. The latent feature space is then constrained to follow a Gaussian distribution, and the corresponding loss function is as follows:
[0067] $$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{d}{2}\log(2\pi) + \frac{1}{2} z_i^{\top} z_i - \log\lvert\det J_i\rvert \right]$$
[0068] Where $N$ represents the total number of image patches, $J_i = \partial \varphi_\theta(r_i) / \partial r_i$ represents the Jacobian matrix, and $d$ denotes the dimension of the feature.
[0069] Then, by constraining the latent feature space to conform to a Gaussian distribution, the network is able to generate high likelihood scores for normal residual features.
[0070] In the normalizing flow model, the log-likelihood score of each image patch can be measured as follows:
[0071] $$\log p(r_i) = -\frac{d}{2}\log(2\pi) - \frac{1}{2} z_i^{\top} z_i + \log\lvert\det J_i\rvert$$
[0072] Then, by subtracting the likelihood score of each image patch from the maximum likelihood score, the likelihood score is transformed into an image patch anomaly score:
[0073] $$s_i = \max_{r \in R} \log p(r) - \log p(r_i)$$
[0074] Where $R$ represents all residual features and $\log p(r_i)$ represents the likelihood score of an image patch. For normal image patches, the network generally generates a higher likelihood score, while for abnormal image patches, it generally generates a lower likelihood score. Then, all image patches in the sample are sorted in ascending order of likelihood score, or equivalently in descending order of anomaly score. Image patches ranked nearer the front are more likely to be suspected defect areas. Therefore, the top m% of image patches can be selected as masked regions. By setting a reasonable hyperparameter m, the masked regions cover as many of the suspected defect areas as possible.
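The chain from patch features to anomaly scores described in the preceding paragraphs can be sketched numerically as follows. This is only an illustration under stated assumptions: the trained normalizing flow is replaced by an identity map (latent equals residual, log-determinant zero), the standard Gaussian log-density is used as the likelihood score, and all function names are hypothetical.

```python
import numpy as np

def nearest_prototype_residuals(feats: np.ndarray, protos: np.ndarray) -> np.ndarray:
    """Subtract from each patch feature its nearest prototype (the reference feature)."""
    d2 = ((feats[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    return feats - protos[nearest]

def gaussian_log_likelihood(z: np.ndarray, log_det_jac: np.ndarray) -> np.ndarray:
    """Log-likelihood of residuals whose flow output is z, under a standard Gaussian."""
    d = z.shape[1]
    return -0.5 * d * np.log(2 * np.pi) - 0.5 * (z ** 2).sum(axis=1) + log_det_jac

def anomaly_from_likelihood(log_lik: np.ndarray) -> np.ndarray:
    """Maximum likelihood score minus each patch's score; larger means more anomalous."""
    return log_lik.max() - log_lik

feats = np.array([[1.0, 1.0], [1.1, 0.9], [4.0, 4.0]])   # toy patch features
protos = np.array([[1.0, 1.0], [0.0, 0.0]])              # toy normal prototypes
residuals = nearest_prototype_residuals(feats, protos)
# Stand-in for a trained flow: identity transform, so z = r and log|det J| = 0.
log_lik = gaussian_log_likelihood(residuals, np.zeros(len(residuals)))
anomaly = anomaly_from_likelihood(log_lik)
```

The third patch lies far from every prototype, so its residual is large, its likelihood is low, and its anomaly score is the highest of the three.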
[0075] In one specific embodiment, the hyperparameter m used when generating the masking region can be set to an empirical value. Because defect sizes differ between product categories, m will differ as well: generally, a larger m can be set for product categories with larger defects, and a smaller m for product categories with smaller defects.
[0076] Furthermore, the masked autoencoder includes an encoder and a decoder. The encoder extracts features of the unmasked image blocks in the input image sequence, and the decoder combines the features of the unmasked image blocks with the special mask tokens of the masked image blocks to reconstruct visual words for the masked image blocks.
[0077] Both the encoder and decoder of the masked autoencoder are built from Transformer modules. The decoder has more network layers than the encoder, which enables it to better reconstruct the visual word representation of the masked region. Before the unmasked image block features and the special mask token [M] of the masked image blocks are fed into the decoder, positional encoding features are added to them to indicate the position of each image block in the image. This helps the decoder to better reconstruct the masked image blocks.
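How the decoder input might be assembled from encoder features, the mask token [M], and positional encodings can be sketched as follows. The shapes, the function name `build_decoder_input`, and the toy values are assumptions for illustration; the Transformer layers themselves are not shown.

```python
import numpy as np

def build_decoder_input(visible_feats: np.ndarray, visible_idx: np.ndarray,
                        num_patches: int, mask_token: np.ndarray,
                        pos_embed: np.ndarray) -> np.ndarray:
    """Place encoder features at visible positions, [M] elsewhere, then add positions."""
    tokens = np.tile(mask_token, (num_patches, 1))  # start with [M] at every position
    tokens[visible_idx] = visible_feats             # overwrite the unmasked positions
    return tokens + pos_embed                       # positional encoding marks each patch's location

num_patches, dim = 4, 2
visible_idx = np.array([0, 2])                      # patches 1 and 3 are masked
visible_feats = np.array([[1.0, 1.0], [2.0, 2.0]])  # toy encoder outputs
mask_token = np.array([9.0, 9.0])                   # toy stand-in for the learned [M] token
pos_embed = np.zeros((num_patches, dim))            # zeros here only to keep the example readable
decoder_in = build_decoder_input(visible_feats, visible_idx, num_patches,
                                 mask_token, pos_embed)
```

In a trained model the mask token and positional encodings would be learned or fixed non-zero vectors; the toy values only make the placement easy to inspect.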
[0078] As shown in Figure 5 and Figure 6, the steps for training the masked autoencoder include:
[0079] A1: The input normal image training samples are represented in the form of local image patches, and some image patches are masked. The masked image patches are transformed into visual word representations by a pre-trained word encoder to obtain the original visual word representations.
[0080] A2: Input the unmasked image block into the encoder to extract features from the image block;
[0081] A3: Reconstruct the visual word representation of the masked image patches based on the features of the unmasked image patches to obtain the visual word representation of the reconstructed image patches;
[0082] A4: The masked autoencoder is trained based on the visual word representations of the reconstructed image patches and the original visual word representations, so that the masked autoencoder can reconstruct masked image patches using unmasked image patches.
[0083] A set of normal sample images is fed into a pre-trained ViT network to extract a normal feature set. A core set sampling mechanism is then used to extract key features from the normal feature set as normal prototype features.
[0084] In the training of the patch-level masked autoencoder network, normal samples in the training set are represented as local image patches. The goal of training is to enable the masked autoencoder to infer masked image patches using unmasked image patches. The training loss function of the masked autoencoder can be expressed as follows:
[0085] $$\mathcal{L}_{\mathrm{MAE}} = -\frac{1}{M} \sum_{m=1}^{M} \sum_{i} y_{m,i} \log p_{m,i}$$
[0086] Where $M$ indicates the number of masked image blocks, $p_{m,i}$ represents the probability that the m-th image patch belongs to the i-th visual word, and $y_{m,i}$ equals 1 if the original visual word of the m-th masked image patch is the i-th visual word and 0 otherwise. In the image patch-level reconstruction masked autoencoder model, the sample is input into the masked autoencoder, which reconstructs the masked image patches using the remaining unmasked image patches. Because the masked autoencoder has learned, when trained on the normal sample set, how to infer the masked region from normal image patches, the masked image patches in the test sample will be reconstructed according to the normal pattern.
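A loss of this kind, which compares predicted visual word probabilities with the original visual words of the masked patches, is commonly implemented as a cross-entropy averaged over the masked patches. The sketch below assumes that form; the function name `masked_cross_entropy` and the toy inputs are hypothetical.

```python
import numpy as np

def masked_cross_entropy(logits: np.ndarray, target_words: np.ndarray) -> float:
    """Mean negative log-probability of the original visual word of each masked patch."""
    # Numerically stable log-softmax over the visual word vocabulary.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(len(target_words)), target_words]
    return float(-picked.mean())

# Two masked patches, a vocabulary of 4 visual words, uninformative predictions.
logits = np.zeros((2, 4))
target_words = np.array([0, 3])
loss = masked_cross_entropy(logits, target_words)
```

With uniform predictions over 4 visual words the loss equals log 4, and it decreases as the decoder assigns more probability to the original visual words.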
[0087] Furthermore, visual word representations are obtained by transforming the pixel values of image blocks.
[0088] Visual words are similar to word vector representations in the field of natural language processing. They are high-level discrete semantic representations of images. By converting the pixel values of the original image patches into visual words, the reconstruction target carries more high-level semantic information and effectively avoids interference from high-frequency detail features.
[0089] Furthermore, when training the masked autoencoder, the masking of image blocks employs one of the following masking strategies, or a combination of two or more of them: random masking strategy, continuous region-based masking strategy, region-restricted masking strategy, frequency-based masking strategy, or dynamic masking strategy.
[0090] The random masking strategy is the most basic and direct masking strategy, which involves randomly selecting image blocks for masking.
[0091] The continuous region-based masking strategy reflects the fact that defective regions appear as continuous areas in the image. By selecting adjacent image blocks for masking, this strategy better simulates the occurrence of defects, so that the network does not produce large reconstruction errors for continuous normal masked regions and therefore does not misjudge continuous normal regions as defective regions.
[0092] The region-restricted masking strategy, by limiting the masked region to the target foreground region, allows the network to focus more on reconstructing the foreground target and increases its ability to reconstruct complex foreground regions.
[0093] The frequency-based masking strategy, by increasing the probability that high-frequency regions are masked, makes the network focus more on high-frequency regions that are more difficult to reconstruct, thereby increasing the network's reconstruction robustness.
[0094] A dynamic masking strategy records the network's reconstruction effect on each region during network training, and then increases the masking probability of regions that are currently difficult for the network to reconstruct. This allows the network to focus more on regions that are currently difficult to reconstruct, thereby increasing the network's reconstruction robustness.
[0095] By combining the various masking methods mentioned above, the combined masking strategy can enable the trained masked autoencoder to have a stronger reconstruction capability and better detect defects.
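Two of the strategies above, random masking and continuous region-based masking, and their combination can be sketched on a patch grid as follows. The grid size, mask ratio, block size, and function names are illustrative assumptions; the region-restricted, frequency-based, and dynamic strategies would need foreground, frequency, or training-history information that is not modeled here.

```python
import numpy as np

def random_mask(grid_h: int, grid_w: int, ratio: float,
                rng: np.random.Generator) -> np.ndarray:
    """Mask a fixed fraction of patches chosen uniformly at random."""
    n = grid_h * grid_w
    k = int(round(n * ratio))
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=k, replace=False)] = True
    return mask.reshape(grid_h, grid_w)

def block_mask(grid_h: int, grid_w: int, block_h: int, block_w: int,
               rng: np.random.Generator) -> np.ndarray:
    """Mask one rectangular block of adjacent patches at a random location."""
    top = int(rng.integers(0, grid_h - block_h + 1))
    left = int(rng.integers(0, grid_w - block_w + 1))
    mask = np.zeros((grid_h, grid_w), dtype=bool)
    mask[top:top + block_h, left:left + block_w] = True
    return mask

rng = np.random.default_rng(0)
rand = random_mask(8, 8, 0.25, rng)
block = block_mask(8, 8, 3, 3, rng)
combined = rand | block  # a patch is masked if either strategy selects it
```

Taking the union is only one possible way to combine strategies; alternating strategies between training iterations would be another.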
[0096] In the image patch-level reconstruction masked autoencoder model, the input samples come from the samples whose masked regions were masked in the previous stage. The samples are input into the masked autoencoder, which uses the remaining unmasked image patches to reconstruct the masked image patches. Because the masked autoencoder has learned how to infer the masked regions from the normal image patches when it is trained on the normal sample set, the masked image patches in the samples will be reconstructed in the normal mode.
[0097] The uncertainty between the visual words of a detected sample and its reconstructed visual words can be used as a measure of anomaly score. The uncertainty of each image patch is calculated as follows:
[0098] $$U = -\sum_{i} p_i \log p_i$$
[0099] Where $p_i$ represents the probability that the image patch belongs to the i-th visual word. Since there is greater uncertainty between visual words in defective regions, defect detection and localization can be achieved.
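One plausible reading of this per-patch uncertainty is the entropy of the predicted distribution over visual words. The sketch below assumes that reading, which the text does not state explicitly; the function name `patch_uncertainty` is hypothetical.

```python
import numpy as np

def patch_uncertainty(probs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Entropy of each patch's predicted visual word distribution (last axis)."""
    # eps guards against log(0) for visual words with zero probability.
    return -(probs * np.log(probs + eps)).sum(axis=-1)

probs = np.array([
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain prediction
    [1.00, 0.00, 0.00, 0.00],  # fully confident prediction
])
u = patch_uncertainty(probs)
```

A patch whose reconstruction is ambiguous gets a high score, while a confidently reconstructed patch scores near zero.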
[0100] Anomaly score maps are generated by calculating the uncertainty between the visual words of the reconstructed samples and the original samples. Based on the final anomaly score map, image-level and pixel-level anomaly thresholds are used to determine defective products and defective regions. If any point or region in the final anomaly score map has an anomaly score greater than the set threshold, the product is considered defective and the corresponding region is considered a defective region; if no point or region has an anomaly score greater than the set threshold, the product is considered good.
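The thresholding step can be sketched as follows. The use of the maximum of the score map as the image-level score is an assumption made for this example, as are the function name `decide` and the threshold values.

```python
import numpy as np

def decide(score_map: np.ndarray, pixel_threshold: float, image_threshold: float):
    """Return (is_defective, defect_region) from an anomaly score map."""
    defect_region = score_map > pixel_threshold
    # Assumption: the image-level score is the largest score anywhere in the map.
    is_defective = bool(score_map.max() > image_threshold)
    return is_defective, defect_region

score_map = np.array([[0.1, 0.2],
                      [0.9, 0.3]])
is_defective, region = decide(score_map, pixel_threshold=0.5, image_threshold=0.5)
```

Here one location exceeds both thresholds, so the sample is flagged as defective and that location is reported as the defect region.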
[0101] The above embodiments of the present invention effectively achieve cross-category defect detection by designing a normalizing flow model and using a masked autoencoder network with image block-level reconstruction. Because the models in both stages are category-independent, for a new product category the algorithm only needs some normal samples to extract the normal prototype feature set, without retraining the models. This is suitable for application scenarios such as industrial defect detection with diverse product categories.
[0102] In another embodiment of the present invention, a cross-category industrial defect detection device is also provided, including at least one processor and at least one memory, wherein the memory stores a computer program that, when executed by the processor, enables the processor to perform the cross-category industrial defect detection method described above.
[0103] In another embodiment of the invention, a computer-readable storage medium is also provided, which, when the instructions in the storage medium are executed by a processor within a device, enables the device to perform the cross-category industrial defect detection method described in any of the preceding claims.
[0104] Those skilled in the art will understand that embodiments of the present invention can be provided as methods or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0105] This invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0106] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0107] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0108] Furthermore, it should be noted that all directional indications (such as up, down, left, right, front, back, etc.) in the embodiments of the present invention are only used to explain the relative positional relationship and movement of each component in a certain specific posture (as shown in the figure). If the specific posture changes, the directional indication will also change accordingly.
[0109] Furthermore, in this invention, descriptions involving terms such as "first," "second," and "a" are for descriptive purposes only and should not be construed as indicating or implying their relative importance or implicitly specifying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one of that feature. In the description of this invention, "a plurality of" means at least two, such as two, three, etc., unless otherwise explicitly specified.
[0110] In this invention, unless otherwise explicitly specified and limited, the terms "connection," "fixed," etc., should be interpreted broadly. For example, "fixed" can mean a fixed connection, a detachable connection, or an integral part; it can mean a mechanical connection or an electrical connection; it can mean a direct connection or an indirect connection through an intermediate medium; it can mean the internal communication of two components or the interaction between two components, unless otherwise explicitly limited. Those skilled in the art can understand the specific meaning of the above terms in this invention according to the specific circumstances.
[0111] Furthermore, the technical solutions of the various embodiments of the present invention can be combined with each other, but only if they are feasible for those skilled in the art. If the combination of technical solutions is contradictory or cannot be implemented, it should be considered that such combination of technical solutions does not exist and is not within the scope of protection claimed by the present invention.
[0112] The specific embodiments described herein are merely examples that illustrate the spirit of the invention. Those skilled in the art to which this invention pertains may modify or supplement the described specific embodiments, or substitute similar methods for them, without departing from the scope defined by the spirit of the invention.
Claims
1. A cross-category industrial defect detection method, characterized in that it comprises:
S1: processing an acquired sample image to be detected with a pre-trained normalized flow model to obtain the suspected defect regions in the image, and masking the image patches corresponding to the suspected defect regions;
S2: reconstructing the masked image patches from the remaining unmasked image patches using a trained masked autoencoder;
S3: using the uncertainty between the visual words of the reconstructed image patches and the original visual words of the masked image patches as the reconstruction error, generating the corresponding anomaly score map, and determining the defect regions of the sample to be detected based on the anomaly score map;
wherein the steps for training the normalized flow model include:
T1: representing the normal image samples in the training set as local image patches, and extracting the features of each image patch with a convolutional neural network to obtain a normal feature set;
T2: using a core set sampling mechanism to extract key features from the normal feature set as normal prototype features;
T3: subtracting the corresponding prototype features from the features of each image patch in the training set to obtain residual features, inputting the residual features into the normalized flow model, and transforming them into the latent feature space, wherein, by constraining the latent feature space to conform to a Gaussian distribution, the model generates high likelihood scores for normal residual features;
and wherein the core set sampling mechanism is to select, from the existing core set, the feature closest to the normal feature set as the reference feature, and then to select, from the non-core set, the feature farthest from the reference feature as a new core feature, repeating until a sufficient number of normal features have been selected as the normal prototype feature set.
2. A cross-category industrial defect detection method according to claim 1, characterized in that the steps for obtaining the suspected defect regions in an image using the pre-trained normalized flow model include:
S11: representing the input sample image to be detected as local image patches, and extracting the features of each image patch with a convolutional neural network;
S12: subtracting the corresponding prototype features from the features of each image patch to obtain the residual features;
S13: inputting the residual features into the pre-trained normalized flow model to generate the likelihood score of each image patch;
S14: sorting all image patches in the sample image to be detected by their likelihood scores, and determining, by a preset score threshold, the image patches with the corresponding scores as the masked regions.
3. A cross-category industrial defect detection method according to claim 1, characterized in that the masked autoencoder includes an encoder and a decoder, wherein the encoder is used to extract the features of the unmasked image patches in the input image sequence, and the decoder combines the features of the unmasked image patches with the special encoded words of the masked image patches to reconstruct the visual words of the masked image patches.
4. A cross-category industrial defect detection method according to claim 3, characterized in that the steps for training the masked autoencoder include:
A1: representing the input normal image training samples as local image patches, masking some of the image patches, and transforming the masked image patches into visual word representations with a pre-trained word encoder to obtain the original visual word representations;
A2: inputting the unmasked image patches into the encoder to extract their features;
A3: reconstructing the visual word representations of the masked regions from the features of the unmasked image patches to obtain the visual word representations of the reconstructed image patches;
A4: training the masked autoencoder based on the visual word representations of the reconstructed image patches and the original visual word representations, so that the masked autoencoder can reconstruct masked image patches from unmasked image patches.
5. A cross-category industrial defect detection method according to claim 4, characterized in that the visual word representation is obtained by converting the pixel values of the image patches.
6. A cross-category industrial defect detection method according to claim 4, characterized in that, when training the masked autoencoder, the image patches are masked using one of the following masking strategies, or a combination of two or more of them: a random masking strategy, a continuous region-based masking strategy, a region-restricted masking strategy, a frequency-based masking strategy, or a dynamic masking strategy.
7. A cross-category industrial defect detection device, comprising at least one processor and at least one memory, wherein the memory stores a computer program that, when executed by the processor, enables the processor to perform the cross-category industrial defect detection method according to any one of claims 1-6.
8. A computer-readable storage medium, wherein instructions in the storage medium, when executed by a processor within a device, enable the device to perform the cross-category industrial defect detection method of any one of claims 1-6.
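As a non-limiting illustration of steps T2 and S12, the core set sampling mechanism and the residual feature computation might be sketched in Python as follows. The sketch reads the core set mechanism as greedy farthest-point selection, which is an interpretation of the claim wording rather than a confirmed implementation. It further assumes NumPy, Euclidean distance, a randomly chosen first core feature, and that the "corresponding" prototype of a patch feature is its nearest prototype. None of these choices, nor the function names `coreset_sample` and `residual_features`, are specified by the claims.

```python
import numpy as np


def coreset_sample(features: np.ndarray, num_prototypes: int, seed: int = 0) -> np.ndarray:
    """Greedily pick prototype indices from an (N, D) array of normal patch features.

    Returns an integer array of shape (num_prototypes,).
    """
    n = features.shape[0]
    if not 1 <= num_prototypes <= n:
        raise ValueError("num_prototypes must be between 1 and the number of features")

    # Assumption: the first core feature is drawn at random,
    # because the claims do not say how the core set is initialised.
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(n))]

    # For every feature, the distance to its closest core feature
    # (its "reference feature" in the wording of claim 1).
    min_dist = np.linalg.norm(features - features[selected[0]], axis=1)

    while len(selected) < num_prototypes:
        # The feature farthest from its reference feature becomes a new core feature.
        idx = int(np.argmax(min_dist))
        selected.append(idx)
        # Update each feature's distance to its closest core feature.
        new_dist = np.linalg.norm(features - features[idx], axis=1)
        min_dist = np.minimum(min_dist, new_dist)

    return np.array(selected)


def residual_features(features: np.ndarray, prototypes: np.ndarray) -> np.ndarray:
    """Subtract from each patch feature its nearest prototype (assumed pairing rule)."""
    # Pairwise Euclidean distances between features and prototypes, shape (N, K).
    dists = np.linalg.norm(features[:, None, :] - prototypes[None, :, :], axis=2)
    nearest = np.argmin(dists, axis=1)
    return features - prototypes[nearest]
```

In such a sketch, `coreset_sample` would be run once on the normal feature set from step T1, and `residual_features` would then be applied both during training (T3) and at detection time (S12) before the residuals are passed to the normalized flow model. The flow model itself and the score threshold of step S14 are left out because the claims do not fix their form.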