A zero-shot industrial anomaly detection method based on aligned visual perception and evidence chain reasoning

Using the AlignAD aligned visual perception module and evidence chain reasoning, the method generates high-quality anomaly priors and outputs structured evidence. This addresses three problems in existing zero-shot industrial anomaly detection: insufficient fine-grained visual perception, the modality gap and the resulting difficulty of knowledge transfer, and the lack of executable closed-loop decision-making. The result is pixel-level localization together with an executable closed-loop decision output.

CN122242731A · Pending · Publication Date: 2026-06-19 · BEIJING UNIV OF TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: BEIJING UNIV OF TECH
Filing Date: 2026-03-12
Publication Date: 2026-06-19


Abstract

This invention discloses a zero-shot industrial anomaly detection method based on aligned visual perception and evidence chain reasoning. The method acquires anomaly heatmaps and scores through the AlignAD aligned visual perception module, compresses them into structured evidence via an evidence parser, and then drives a multimodal large model to output anomaly judgment, description, causal analysis, and maintenance suggestions. This invention improves the quality of anomaly perception through a location-content-consistency three-dimensional alignment mechanism, achieving closed-loop reasoning from detection to decision-making, significantly improving the accuracy and interpretability of zero-shot industrial anomaly detection.

Description

Technical Field

[0001] This invention relates to the field of industrial visual quality inspection and anomaly detection, and in particular to a zero-shot industrial anomaly detection method based on aligned visual perception and evidence chain reasoning, intended for production scenarios where anomaly samples are scarce or unknown defects occur frequently. The method is used to locate, interpret, and trace product surface defects, and supports integration with production-line re-inspection, rework, and process improvement workflows.

Background Technology

[0002] Industrial anomaly detection (IAD) is a crucial aspect of quality control in modern manufacturing. Its goal is to automatically identify and locate various defects on the surface, structure, or assembly process of products during large-scale production. Existing industrial anomaly detection methods mostly employ unsupervised or weakly supervised paradigms, distinguishing between normal and abnormal samples by learning the feature distribution of normal samples. Technical approaches primarily include memory / probabilistic model methods based on feature embedding, as well as reconstruction-based autoencoder, GAN, or diffusion model methods. However, these traditional methods typically follow a closed-world "one class, one model" approach, heavily relying on large amounts of good product data and manual thresholding. The models can only handle categories and defect forms that have appeared during the training phase. When production lines, products, or processes change, retraining and parameter tuning are often required, resulting in high deployment and maintenance costs, making it difficult to meet the demands for flexible quality inspection and low-cost implementation.

[0003] To overcome the limitations of the closed-world setting, the Zero-shot Anomaly Detection (ZSAD) paradigm has begun to incorporate pre-trained vision-language models (VLMs). By constructing textual prompts such as "normal / abnormal", anomaly detection is transformed into image-text similarity matching, which enables the detection and localization of unseen categories and defects while relying only on prior knowledge of good products. However, existing CLIP-based ZSAD methods often rely heavily on predefined templates and are essentially still performing binary discrimination in the semantic space: the semantic coverage of their prompts is limited, making it difficult to cover open-ended unknown defects, and they lack the fine-grained localization and semantic interpretation capabilities required in industrial scenarios, which easily leads to spatial ambiguity and insensitivity to minor defects.

[0004] In recent years, Multimodal Large Language Models (MLLMs) have demonstrated advantages in instruction following and open-ended reasoning. Some studies have attempted to input anomaly heatmaps or image features into MLLMs to generate anomaly presence judgments and natural language descriptions, thus alleviating the problem that traditional methods "only output scores / heatmaps and lack explanations." However, directly applying general-purpose MLLMs to industrial anomaly detection still has significant shortcomings. First, general-purpose MLLMs lack the ability to perceive small, fine-grained defects in high-resolution industrial images. Second, IAD-related knowledge is often embedded in the text domain and is difficult to project effectively onto visual representations, resulting in a modality gap and difficulties in knowledge transfer. Third, even when natural language descriptions can be generated, stable, procedural outputs regarding causal hypotheses and actionable maintenance suggestions are still lacking, making it difficult to form an executable decision-making loop.

[0005] To compensate for the limitations of MLLM in visual detail, some methods introduce "Vision Experts (VEs)" to output anomaly prior maps (such as pixel-level anomaly maps / saliency cues), which are then input into MLLM along with textual prompts to enhance anomaly understanding. However, existing solutions mostly still rely on single-channel anomaly maps or local saliency cues as priors, resulting in problems such as static and coarse priors and sensitivity to localization noise. Furthermore, they lack sufficient constraints on the consistency between anomaly locations and global patterns, and lack auditable and reproducible structured semantic inputs for industrial operations, making it difficult to support the stable closed-loop analysis and decision support required by production lines.

[0006] In summary, existing technologies face at least three critical issues that urgently need to be addressed:

[0007] Insufficient fine-grained visual perception. Industrial defects often exhibit characteristics such as "small size, weak contrast, thin / oblique shape, and irregular edges." Traditional ZSAD prompt matching and general MLLM are unstable in their response to such fine-grained evidence, and are prone to background diffusion, discontinuous boundaries, or missed detections, making it difficult to meet the requirements for pixel-level accurate positioning.

[0008] Modality gap and difficulty in knowledge transfer. The category semantics, location priors, morphological priors, and industry knowledge required for anomaly detection are mostly contained in text or empirical rules. However, existing methods lack mechanisms to explicitly inject this knowledge into pixel-level visual representations and align it with them, so prompt semantics are only loosely coupled to local defect evidence, which leads to spatial ambiguity and insensitivity to minor defects.

[0009] There is a lack of executable closed-loop decision outputs. Existing methods mostly stop at outputting anomaly scores / heatmaps or one-time natural language descriptions, which makes it difficult to transform pixel-level evidence into auditable and reproducible structured semantics to drive subsequent reasoning. As a result, there are gaps in the process-oriented decision chain of "whether it is abnormal - what is the abnormality - why it happened - how to deal with it", making it difficult to adapt to real production line operation and maintenance scenarios.

[0010] To address the aforementioned problems in existing zero-shot industrial anomaly detection methods, this invention proposes a zero-shot industrial anomaly detection method based on aligned visual perception and evidence chain reasoning. The method comprises an aligned visual perception stage and an evidence chain reasoning stage. In the aligned visual perception stage, high-quality anomaly priors are generated using position-aware dynamic prompts (PA-DyPrompt) and content-adaptive multi-shape convolution (SK-CMI), and a global-local consistency constraint (GLAC) aligns the image-level judgment with the pixel-level anomaly responses, reducing false positives and false negatives caused by background noise. In the evidence chain reasoning stage, pixel-level evidence such as the anomaly prior map and anomaly scores is parsed into structured anomaly evidence (including at least location, area ratio, confidence, severity, and morphological and texture statistics) and input together with visual prompts into a multimodal large model for evidence chain reasoning, which outputs anomaly conclusions, defect descriptions, cause ranking, and maintenance suggestions, thereby forming an executable closed-loop report.

Summary of the Invention

[0011] This invention proposes a zero-shot industrial anomaly detection method based on aligned visual perception and evidence chain reasoning. The overall process is as follows. First, an industrial image to be inspected is acquired and input into the AlignAD aligned visual perception module to obtain raw anomaly evidence such as a pixel-level anomaly heatmap and an image-level anomaly score. Then, an evidence parser compresses the pixel-level evidence into interpretable and reproducible structured anomaly evidence. Next, a prompt generator fills the structured evidence into a predefined task template to generate text prompts, which are input into a multimodal large model together with the visual prompts obtained from the same image through a visual encoder and a query converter. Finally, under the constraint of the evidence chain, the multimodal large model outputs the anomaly judgment, defect description, cause analysis, and maintenance suggestions, forming an end-to-end closed-loop reasoning process. The detailed steps are as follows:

[0012] (1) Aligned visual perception module AlignAD: The quality of the anomaly priors is improved through a three-dimensional "position-content-consistency" alignment mechanism, comprising position-aware dynamic prompts (PA-DyPrompt), content-adaptive multi-shape convolution (SK-CMI), and the global-local consistency constraint (GLAC).

[0013] 1) Position-Aware Dynamic Prompt (PA-DyPrompt): The image space is divided into a nine-cell (3×3) grid that serves as a position prior, and position phrases are explicitly encoded in the prompt template, making the text prototype sensitive to the defect position. At the same time, a meta-network dynamically modulates the learnable context vector according to the global semantics of the image, thereby realizing context-tunable prompt generation.

[0014]

[0015] The quantities in this formula are: the initial learnable context vector; the dynamically adjusted context vector; the change in the context vector; the global feature of the image; the meta-network; and the set of learnable parameters of the meta-network (weight matrices and bias vectors), which are randomly initialized and optimized end to end via backpropagation.
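The position-aware dynamic prompt described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the patented implementation: the position phrases, the prompt wording, the two-layer ReLU meta-network, and the names `POSITION_PHRASES`, `build_prompts`, and `dynamic_context` are all hypothetical. Only the nine-cell grid and the idea of shifting a learnable context vector by a meta-network output computed from the global image feature are taken from the text.

```python
import numpy as np

# Hypothetical phrases for the 3x3 (nine-cell) position prior, in row-major order.
POSITION_PHRASES = [
    "top left", "top center", "top right",
    "middle left", "center", "middle right",
    "bottom left", "bottom center", "bottom right",
]

def build_prompts(object_name):
    """Return one normal prompt and nine position-specific abnormal prompts."""
    normal = f"a photo of a normal {object_name}"
    abnormal = [f"a photo of a {object_name} with a defect at the {p}"
                for p in POSITION_PHRASES]
    return normal, abnormal

def dynamic_context(ctx0, g, w1, b1, w2, b2):
    """Shift the learnable context ctx0 (n_ctx, d) by a meta-network output.

    g is the global image feature (d_img,). The meta-network is assumed to be
    a two-layer MLP with ReLU; its output (d,) is added to every context token.
    """
    hidden = np.maximum(0.0, g @ w1 + b1)   # (h,)
    delta = hidden @ w2 + b2                # (d,) change in the context vector
    return ctx0 + delta                     # (n_ctx, d) adjusted context
```

In a full system the adjusted context would be concatenated with the tokenized position phrase and class name before the text encoder; that step depends on the chosen vision-language backbone and is omitted here.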

[0016] 2) Content-Adaptive Multi-Shape Convolution (SK-CMI): SK-CMI models the input in parallel with multi-branch, multi-shape, multi-scale convolution kernels, including isotropic square kernels of different scales and asymmetric rectangular kernels for horizontally or vertically elongated defects, and adaptively reweights the branches according to content through selective kernel fusion.

[0017]

[0018]

[0019]

[0020] The quantities in these formulas are: the output of each branch; the sum of the branch outputs; global average pooling, which yields the channel descriptor vector z; the activation function; the multilayer perceptron (MLP), whose learnable parameters (weight matrices and bias vectors) are randomly initialized at the start of training and obtained through end-to-end backpropagation and optimizer updates; the branch-selection weight matrix; the weight of each branch; and the final weighted output. This mechanism improves the modeling of slender, oblique, and irregularly shaped defects.
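The selective kernel fusion step can be illustrated with a short NumPy sketch. It assumes the common selective-kernel pattern (sum the branches, global average pooling, a small MLP, then a softmax across branches); the softmax, the ReLU activation, and the function name `selective_kernel_fuse` are assumptions, and the branch convolutions themselves are left out.

```python
import numpy as np

def selective_kernel_fuse(branches, w1, b1, w2, b2):
    """Fuse K branch outputs of shape (C, H, W) with content-dependent weights.

    branches: list of K arrays, e.g. outputs of square and 1xN / Nx1 kernels
              (the convolutions themselves are omitted here).
    w1, b1:   first MLP layer, shapes (C, h) and (h,)
    w2, b2:   second MLP layer, shapes (h, K*C) and (K*C,)
    Returns the fused map (C, H, W) and the branch weights (K, C).
    """
    stacked = np.stack(branches)                   # (K, C, H, W)
    k, c = stacked.shape[:2]
    summed = stacked.sum(axis=0)                   # (C, H, W)
    z = summed.mean(axis=(1, 2))                   # global average pooling -> (C,)
    hidden = np.maximum(0.0, z @ w1 + b1)          # ReLU (assumed activation)
    logits = (hidden @ w2 + b2).reshape(k, c)      # one logit per branch and channel
    logits = logits - logits.max(axis=0, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=0, keepdims=True)  # softmax across branches
    fused = (weights[:, :, None, None] * stacked).sum(axis=0)
    return fused, weights
```

Because the weights sum to one across branches for every channel, the fused map is a per-channel convex combination of the branch outputs, so a channel can lean on an elongated kernel where the content calls for it.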

[0021] 3) Global-Local Consistency Constraint (GLAC): The GLAC mechanism solves the misalignment problem between visual features and text semantics through two stages: global-local alignment at the feature level and pixel-global consistency constraint at the decision level.

[0022] (a) Global-local alignment: Before calculating the anomaly score, the visual features are dynamically recalibrated using text prototypes to ensure semantic consistency of features.

[0023] Global feature alignment: The global feature of the image interacts with the text prototype T, and the aligned global feature is generated through a residual structure:

[0024]

[0025] Local feature alignment: Affine parameters are generated from the text prototypes, and the FiLM mechanism applies gated recalibration to the local features at each layer, amplifying the response in potentially anomalous regions and yielding modulated features aligned with the text prototypes:

[0026]

[0027] (b) Pixel-Global Consistency Constraint: Scores are calculated based on aligned features, and image-level decisions are forced to maintain logical consistency with pixel-level evidence.

[0028] Anomaly score calculation: At the image level, the similarity between the aligned global feature and the text prototype gives the image-level anomaly probability. At the pixel level, the similarity between the aligned local features and the text prototype is computed, and upsampling followed by a maximum across layers extracts the most salient pixel evidence. The final total score is then calculated as follows:

[0029]

[0030] PGC-Loss consistency constraint: First, Top-K pooling is applied to the pixel-level anomaly probability map at each scale to obtain the single-scale pixel evidence strength. The pixel evidence from all scales is then averaged to obtain the final pixel evidence strength:

[0031]

[0032] Here k is the number of pixels involved in evidence aggregation (5366 here), the Top-K operator computes the mean of the k largest pixel values in the input probability map, and L is the number of layers involved in the aggregation (7 here).

[0033] In addition, to strengthen the logical consistency between the image-level judgment and the pixel-level evidence, the pixel evidence strength obtained by multi-scale Top-K aggregation is used to constrain the image-level anomaly probability, and the pixel-global consistency loss is defined as:

[0034]
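The multi-scale Top-K aggregation described above can be sketched in NumPy as follows. The aggregation (mean of the k largest values per scale, then an average over scales, with k = 5366) follows the text. The exact form of the consistency loss is not recoverable from this text, so the squared gap used in `pgc_loss` is an assumption, as are all function names.

```python
import numpy as np

def topk_mean(prob_map, k):
    """Mean of the k largest values of a probability map (any shape)."""
    flat = prob_map.ravel()
    k = min(k, flat.size)                      # guard for maps smaller than k
    return float(np.sort(flat)[-k:].mean())

def pixel_evidence(prob_maps, k=5366):
    """Average the per-scale Top-K evidence over all scales."""
    return float(np.mean([topk_mean(p, k) for p in prob_maps]))

def pgc_loss(p_img, prob_maps, k=5366):
    """Assumed form: squared gap between image-level probability and pixel evidence."""
    return (p_img - pixel_evidence(prob_maps, k)) ** 2
```

Averaging the k strongest responses, rather than taking the single maximum, makes the pixel evidence less sensitive to an isolated spurious peak.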

[0035] (2) Evidence parsing stage: The evidence parser compresses the image-level anomaly score and the pixel-level anomaly heatmap output by AlignAD into a set of interpretable and reproducible key-value evidence, including confidence, area ratio, location label, severity level, size level, and several morphological / texture statistical features.

[0036] 1) Unified Confidence: We construct an intermediate value by combining the image-level score and the heatmap peak, and then obtain the normalized confidence value using the Sigmoid function.

[0037]

[0038] Here the image-level anomaly score and the peak of the pixel-level heatmap X are fused using a balancing weight coefficient set to 0.55, which gives appropriate attention to global image features during fusion; m is the intermediate fusion score, and C is the normalized confidence.
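A minimal sketch of the unified confidence follows. It assumes the intermediate score is a convex combination in which the 0.55 coefficient weights the image-level score and the remainder weights the heatmap peak; that assignment, and the name `unified_confidence`, are assumptions based on the description rather than the original formula.

```python
import math

def unified_confidence(image_score, heatmap, alpha=0.55):
    """Fuse the image-level score with the heatmap peak, then squash to (0, 1).

    heatmap is a 2-D list of floats. alpha weights the image-level score and
    (1 - alpha) weights the peak (assumed reading of the 0.55 coefficient).
    Returns (m, C): the intermediate fusion score and the confidence.
    """
    peak = max(max(row) for row in heatmap)
    m = alpha * image_score + (1.0 - alpha) * peak
    confidence = 1.0 / (1.0 + math.exp(-m))   # sigmoid normalization
    return m, confidence
```

For example, an image-level score of 0.8 with a heatmap peak of 0.9 gives m = 0.55 × 0.8 + 0.45 × 0.9 = 0.845 before the sigmoid.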

[0039] 2) Binary mask and area ratio: A percentile of the heatmap responses is taken as the threshold, the binary mask is obtained, and the area ratio is calculated:

[0040]

[0041] Here the percentile parameter used to determine the segmentation threshold is set to 97; the adaptive segmentation threshold is obtained from quantile statistics of the pixel response distribution of the current heatmap and therefore varies with the input image; the binary indicator function takes the value 1 if the condition is met and 0 otherwise; the result is the binary anomaly mask; H and W are the height and width of the feature map; the total area is the number of anomalous pixels in the mask; and R is the resulting area ratio of the anomalous region.
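The percentile thresholding can be sketched with NumPy as below. The 97th percentile comes from the text; the use of `>=` for the indicator condition and the function name are assumptions.

```python
import numpy as np

def mask_and_area_ratio(heatmap, percentile=97):
    """Threshold a heatmap at its own percentile and measure the masked area.

    Returns (mask, ratio, tau): the binary mask, the fraction of pixels set,
    and the adaptive threshold computed from this heatmap.
    """
    tau = np.percentile(heatmap, percentile)      # image-specific threshold
    mask = (heatmap >= tau).astype(np.uint8)      # indicator function
    ratio = mask.sum() / mask.size                # anomalous pixels / (H * W)
    return mask, float(ratio), float(tau)
```

A strict `>` comparison is an equally plausible reading; the two differ only when many pixels tie at the threshold value.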

[0042] 3) Location Labels: First, the geometric centroid of the anomaly region is calculated based on a binary mask. Then, the coordinates of this centroid are mapped to a pre-defined nine-grid space to determine the location labels. The centroid coordinates are calculated as follows:

[0043]

[0044] Here the formula gives the coordinates of the geometric centroid of the anomalous region, computed from the x and y coordinates of the image pixels and the value of the binary mask at each coordinate.
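The centroid computation and the mapping to the nine-cell grid can be sketched as follows. The label strings, the equal-thirds partition of the image, and the name `location_label` are assumptions; the centroid itself is the mask-weighted mean of the pixel coordinates, as described.

```python
import numpy as np

# Hypothetical label names for the nine-cell grid, indexed [row][col].
GRID_LABELS = [
    ["top-left", "top-center", "top-right"],
    ["middle-left", "center", "middle-right"],
    ["bottom-left", "bottom-center", "bottom-right"],
]

def location_label(mask):
    """Return (x_c, y_c, label) for a binary mask of shape (H, W), or None if empty."""
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None                       # no anomalous pixels, no centroid
    x_c, y_c = xs.mean(), ys.mean()       # mask-weighted mean of coordinates
    col = min(int(3 * x_c / w), 2)        # split width into three equal bands
    row = min(int(3 * y_c / h), 2)        # split height into three equal bands
    return float(x_c), float(y_c), GRID_LABELS[row][col]
```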

[0045] 4) Severity rating: A comprehensive severity score is obtained by weighting and fusing the area proportion and image-level anomaly scores.

[0046]

[0047] Here the overall severity is computed with a coefficient that balances the weights of the area ratio and the anomaly score. The coefficient is set to 0.6, so that the severity assessment focuses more on the actual physical coverage of the defect while still considering the model's overall anomaly discrimination. R is the area ratio of the anomalous region, and the remaining term is the image-level anomaly score.
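Read literally, the description corresponds to a convex combination of the area ratio and the image-level score. The following is a reconstruction under that assumption, not the original formula: the symbols $S$, $\beta$, and $s_{\mathrm{img}}$ are placeholders chosen here, and assigning the 0.6 weight to the area ratio follows the statement that severity focuses more on physical coverage. The numeric values in the worked example are illustrative only.

```latex
S = \beta R + (1 - \beta)\, s_{\mathrm{img}}, \qquad \beta = 0.6
% Worked example with illustrative values R = 0.03 and s_img = 0.80:
% S = 0.6 \times 0.03 + 0.4 \times 0.80 = 0.018 + 0.320 = 0.338
```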

[0048] 5) Size rating and statistical features: Small, medium and large are defined by the pixel length of the major axis of the circumscribed ellipse, and geometric shape, boundary sharpness and texture variation statistics are extracted from the mask and neighborhood to form unified structured evidence.

[0049] (3) Closed-loop report output stage: The prompt generator fills the above structured evidence into the predefined task template to generate text prompts and concatenates them with the visual prompts output by the query converter to form a unified prompt space for the multimodal large model. Under the prompt constraints, the multimodal large model outputs, in sequence, the anomaly judgment, defect description, cause ranking, and maintenance / repair suggestions, thus forming a closed loop of "discovery-quantification-diagnosis-disposal".
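The template-filling step can be illustrated with plain string formatting. The template wording, the field names, and the function `build_text_prompt` are hypothetical; the text only specifies that structured evidence is filled into a predefined task template and that the model answers the four questions in sequence.

```python
# Hypothetical task template; field names mirror the evidence listed in the text.
TEMPLATE = (
    "Structured anomaly evidence:\n"
    "- confidence: {confidence:.2f}\n"
    "- area ratio: {area_ratio:.1%}\n"
    "- location: {location}\n"
    "- severity: {severity}\n"
    "- size: {size}\n"
    "Using only this evidence and the image, answer in order:\n"
    "1. Is the product anomalous?\n"
    "2. Describe the defect.\n"
    "3. Rank the likely causes.\n"
    "4. Suggest maintenance or repair actions."
)

def build_text_prompt(evidence):
    """Fill the structured evidence (a dict) into the task template."""
    return TEMPLATE.format(**evidence)
```

Keeping the evidence as key-value pairs means the same dictionary can be logged alongside the model output, which is what makes the report auditable and reproducible.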

[0050] (4) To achieve coordinated optimization of pixel-level localization and image-level discrimination, the AlignAD aligned visual perception module is trained jointly with a pixel-level segmentation loss and an image-level discrimination loss. The overall objective function can be expressed as:

[0051]

[0052] Here the first term is the pixel-level segmentation loss, which improves the localization accuracy and boundary consistency of anomalous regions; the second term is the image-level discrimination loss, which improves global normal / abnormal discrimination and constrains the consistency between the global prediction and the local responses; and the weighting coefficient is set to 0.5 here.

[0053] 1) Pixel-level segmentation loss: In one embodiment, the multi-scale anomaly response map is converted into a foreground probability map using softmax, and the pixel-level segmentation loss is defined by combining class imbalance suppression, region consistency, and boundary constraints:

[0054]

[0055]

[0056]

[0057]

[0058]

[0059] Here, for each scale, the formulas use the ground-truth binary mask after scale alignment, the pixel-level response, and the pixel-level anomaly probability map. FocalCE alleviates the foreground / background sample imbalance; it is built on a pixel-level binary cross-entropy operator that is averaged pixel by pixel over the pixel responses and ground-truth masks, with the foreground-background balance coefficient set to 0.25 and the focusing factor set to 2. Dice, the overlap-coefficient operator, improves the consistency of region overlap. WBCE enhances boundary sharpness and structural consistency: a boundary indication map is obtained by passing the ground-truth mask through an edge detection operator, and the pixel-level binary cross-entropy is weighted according to these boundary weights. The two weighting coefficients are set to 0.2 and 0.1, respectively.
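The three terms named above can be sketched in NumPy as follows. This is an illustration under assumptions rather than the original formulas: the inputs are already foreground probability maps (the softmax step is omitted), the edge operator is a simple 4-neighbour label difference, the boundary weight of 4.0 is arbitrary, and the 0.2 and 0.1 coefficients are assigned to the Dice and boundary terms in the order they are mentioned. The balance coefficient 0.25 and focusing factor 2 are from the text.

```python
import numpy as np

EPS = 1e-7

def focal_ce(p, y, alpha=0.25, gamma=2.0):
    """Focal binary cross-entropy, averaged over pixels."""
    p = np.clip(p, EPS, 1.0 - EPS)
    pt = np.where(y == 1, p, 1.0 - p)            # probability of the true class
    at = np.where(y == 1, alpha, 1.0 - alpha)    # foreground/background balance
    return float(np.mean(-at * (1.0 - pt) ** gamma * np.log(pt)))

def dice_loss(p, y):
    """1 - Dice overlap between the soft prediction and the mask."""
    inter = np.sum(p * y)
    return float(1.0 - (2.0 * inter + EPS) / (np.sum(p) + np.sum(y) + EPS))

def boundary_map(y):
    """Mark pixels with a differently labelled 4-neighbour (simple edge operator)."""
    pad = np.pad(y, 1, mode="edge")
    diff = np.zeros(y.shape, dtype=bool)
    for dy, dx in ((0, 1), (2, 1), (1, 0), (1, 2)):   # up, down, left, right
        diff |= pad[dy:dy + y.shape[0], dx:dx + y.shape[1]] != y
    return diff.astype(float)

def weighted_bce(p, y, boundary_weight=4.0):
    """Binary cross-entropy with extra weight on boundary pixels."""
    p = np.clip(p, EPS, 1.0 - EPS)
    w = 1.0 + boundary_weight * boundary_map(y)
    bce = -(y * np.log(p) + (1 - y) * np.log(1.0 - p))
    return float(np.sum(w * bce) / np.sum(w))

def segmentation_loss(probs, masks, lam_dice=0.2, lam_edge=0.1):
    """Average over scales of FocalCE + lam_dice * Dice + lam_edge * WBCE."""
    per_scale = [focal_ce(p, y) + lam_dice * dice_loss(p, y)
                 + lam_edge * weighted_bce(p, y)
                 for p, y in zip(probs, masks)]
    return float(np.mean(per_scale))
```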

[0060] 2) Image-level discrimination loss: In one embodiment, the image-level discrimination loss consists of a binary classification supervision term, a ranking constraint term, and a pixel-global consistency term.

[0061]

[0062]

[0063]

[0064] Here the formulas use the image-level anomaly score and the image-level label. The binary classification term is a cross-entropy loss. The ranking constraint loss is the arithmetic mean over all sample pairs drawn from the anomalous sample set A and the normal sample set B, where the two indices run over the anomalous and normal sample sets and the corresponding terms are the discrimination scores of those samples. The pixel-global consistency loss is given by Equation (9). The two weighting coefficients are set to 0.5 and 0.2, respectively.
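The three-term image-level loss can be sketched as follows. The pairwise structure of the ranking term (a mean over all anomalous-normal pairs) is from the text; the hinge form with a margin of 0.5, the squared-gap consistency term, the order in which the 0.5 and 0.2 coefficients are assigned, and all function names are assumptions.

```python
import numpy as np

def bce(scores, labels, eps=1e-7):
    """Binary cross-entropy between scores in (0, 1) and 0/1 labels."""
    s = np.clip(scores, eps, 1.0 - eps)
    return float(np.mean(-(labels * np.log(s) + (1 - labels) * np.log(1.0 - s))))

def ranking_loss(scores, labels, margin=0.5):
    """Mean hinge penalty over all (anomalous, normal) pairs; 0 if a set is empty."""
    a = scores[labels == 1]                 # anomalous sample set A
    b = scores[labels == 0]                 # normal sample set B
    if a.size == 0 or b.size == 0:
        return 0.0
    gaps = a[:, None] - b[None, :]          # score gap for every pair
    return float(np.mean(np.maximum(0.0, margin - gaps)))

def image_level_loss(scores, labels, pixel_evidence, lam_rank=0.5, lam_pgc=0.2):
    """BCE + lam_rank * ranking + lam_pgc * pixel-global consistency."""
    pgc = float(np.mean((scores - pixel_evidence) ** 2))   # assumed squared gap
    return (bce(scores, labels)
            + lam_rank * ranking_loss(scores, labels)
            + lam_pgc * pgc)
```

The ranking term only asks that every anomalous sample score higher than every normal sample by the margin, which complements the per-sample cross-entropy when score calibration differs across product categories.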

[0065] 3) Two-stage training strategy: In addition, AlignAD training is preferably divided into two stages:

[0066] (a) Backbone training stage: Freeze the vision-language pre-trained backbone, update the parameters of the decoding head, prompt learner, and fusion head, and optimize using the joint loss shown in Equation (14);

[0067] (b) Adaptation fine-tuning stage: Fix the backbone parameters and only fine-tune the vision adapter in small steps to further adapt to industrial anomaly distribution and improve cross-domain robustness.
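The two-stage schedule amounts to choosing which parameter groups receive gradient updates in each stage. The framework-independent sketch below records that choice; the group names follow the text, while the dictionary layout, the learning-rate values, and the function `trainable_flags` are illustrative assumptions.

```python
# Parameter-group names follow the text; the step sizes are illustrative only.
STAGES = {
    "stage_a": {"trainable": {"decoding_head", "prompt_learner", "fusion_head"},
                "lr": 1e-3},
    "stage_b": {"trainable": {"vision_adapter"}, "lr": 1e-5},   # small steps
}
ALL_GROUPS = ["backbone", "decoding_head", "prompt_learner",
              "fusion_head", "vision_adapter"]

def trainable_flags(stage):
    """Map every parameter group to True (updated) or False (frozen) for a stage."""
    active = STAGES[stage]["trainable"]
    return {name: name in active for name in ALL_GROUPS}
```

In a deep learning framework these flags would typically be applied by toggling gradient tracking on each group's parameters before building the optimizer for that stage.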

[0068] The inventiveness of this invention is mainly reflected in:

[0069] (1) This invention proposes an aligned visual perception module, AlignAD, which introduces PA-DyPrompt, SK-CMI, and GLAC to improve the quality of the anomaly priors at the source and to achieve systematic "location-content-consistency" alignment for industrial defects.

[0070] (2) The present invention constructs an evidence parsing and prompt generation mechanism, compresses pixel-level abnormal evidence into auditable and reproducible structured semantic tags, and drives a multimodal large model to perform evidence chain reasoning, thereby realizing a closed-loop output from anomaly detection to executable suggestions.

[0071] (3) This invention explicitly constrains the consistency between image-level judgment and pixel-level evidence by using pixel-global consistency loss PGC-Loss (Top-k aggregation), thereby reducing the interference of background noise and pseudo-high response on localization and interpretation.

[0072] (4) The present invention adopts a two-stage training paradigm (backbone training + adaptation fine-tuning) to enhance the adaptability to industrial abnormal distributions while maintaining the stability of pre-trained representations, and improve the robustness in zero-shot scenarios.

Description of the Drawings

[0073] Figure 1 is the overall architecture diagram of the present invention.

[0074] Figure 2 is an architecture diagram of the AlignAD aligned visual perception module of the present invention.

Detailed Implementation

[0075] This invention provides a zero-shot industrial anomaly detection method based on aligned visual perception and evidence chain reasoning. The method generates high-quality pixel-level anomaly priors through the vision expert AlignAD and parses the anomaly priors into auditable structured anomaly evidence. The prompt generator then fills the evidence into a predefined task template. The visual prompts and text prompts jointly drive the multimodal large model to output the anomaly judgment, defect description, cause ranking, and maintenance suggestions, thereby forming a closed-loop decision-making process of "discovery-quantification-diagnosis-disposal".

[0076] Experimental data comes from two publicly available industrial anomaly detection datasets: MVTec-AD and VisA. MVTec-AD contains 15 categories, with a total of 3629 training samples and 1725 test samples; VisA more closely resembles the complex background of real production lines, containing 9621 normal samples and 1200 anomaly samples, covering 12 categories. To test cross-domain generalization, we trained on MVTec-AD and tested on VisA, or vice versa.

[0077] The present invention adopts the following technical methods and implementation steps:

[0078] First, the industrial image to be inspected is acquired and input into the AlignAD aligned visual perception module to obtain raw anomaly evidence such as a pixel-level anomaly heatmap and an image-level anomaly score. Then, the evidence parser compresses the pixel-level evidence into interpretable and reproducible structured anomaly evidence. Next, the prompt generator fills the structured evidence into a predefined task template to generate text prompts, which are input into the multimodal large model together with the visual prompts obtained from the same image through the visual encoder and query converter. Finally, under the constraint of the evidence chain, the multimodal large model outputs the anomaly judgment, defect description, cause analysis, and maintenance suggestions, forming an end-to-end closed-loop reasoning process.

[0079] The detailed steps are as follows:

[0080] (1) Aligned visual perception module AlignAD: The quality of the anomaly priors is improved through a three-dimensional "position-content-consistency" alignment mechanism, comprising position-aware dynamic prompts (PA-DyPrompt), content-adaptive multi-shape convolution (SK-CMI), and the global-local consistency constraint (GLAC).

[0081] 1) Position-Aware Dynamic Prompt (PA-DyPrompt): The image space is divided into a nine-cell (3×3) grid that serves as a position prior, and position phrases are explicitly encoded in the prompt template, making the text prototype sensitive to the defect position. At the same time, a meta-network dynamically modulates the learnable context vector according to the global semantics of the image, thereby realizing context-tunable prompt generation.

[0082]

[0083] The quantities in this formula are: the initial learnable context vector; the dynamically adjusted context vector; the change in the context vector; the global feature of the image; the meta-network; and the set of learnable parameters of the meta-network (weight matrices and bias vectors), which are randomly initialized and optimized end to end via backpropagation.

[0084] 2) Content-Adaptive Multi-Shape Convolution (SK-CMI): SK-CMI models the input in parallel with multi-branch, multi-shape, multi-scale convolution kernels, including isotropic square kernels of different scales and asymmetric rectangular kernels for horizontally or vertically elongated defects, and adaptively reweights the branches according to content through selective kernel fusion.

[0085]

[0086]

[0087]

[0088] The quantities in these formulas are: the output of each branch; the sum of the branch outputs; global average pooling, which yields the channel descriptor vector z; the activation function; the multilayer perceptron (MLP), whose learnable parameters (weight matrices and bias vectors) are randomly initialized at the start of training and obtained through end-to-end backpropagation and optimizer updates; the branch-selection weight matrix; the weight of each branch; and the final weighted output. This mechanism improves the modeling of slender, oblique, and irregularly shaped defects.

[0089] 3) Global-Local Consistency Constraint (GLAC): The GLAC mechanism solves the misalignment problem between visual features and text semantics through two stages: global-local alignment at the feature level and pixel-global consistency constraint at the decision level.

[0090] (a) Global-local alignment: Before calculating the anomaly score, the visual features are dynamically recalibrated using text prototypes to ensure semantic consistency of features.

[0091] Global feature alignment: The global feature of the image interacts with the text prototype T, and the aligned global feature is generated through a residual structure:

[0092]

[0093] Local feature alignment: Affine parameters are generated from the text prototypes, and the FiLM mechanism applies gated recalibration to the local features at each layer, amplifying the response in potentially anomalous regions and yielding modulated features aligned with the text prototypes:

[0094]

[0095] (b) Pixel-Global Consistency Constraint: Scores are calculated based on aligned features, and image-level decisions are forced to maintain logical consistency with pixel-level evidence.

[0096] Anomaly score calculation: At the image level, the similarity between the aligned global feature and the text prototype gives the image-level anomaly probability. At the pixel level, the similarity between the aligned local features and the text prototype is computed, and upsampling followed by a maximum across layers extracts the most salient pixel evidence. The final total score is then calculated as follows:

[0097]

[0098] PGC-Loss consistency constraint: First, Top-K pooling is applied to the pixel-level anomaly probability map at each scale to obtain the single-scale pixel evidence strength. The pixel evidence from all scales is then averaged to obtain the final pixel evidence strength:

[0099]

[0100] Here k is the number of pixels involved in evidence aggregation (5366 here), the Top-K operator computes the mean of the k largest pixel values in the input probability map, and L is the number of layers involved in the aggregation (7 here).

[0101] In addition, to strengthen the logical consistency between the image-level judgment and the pixel-level evidence, the pixel evidence strength obtained by multi-scale Top-K aggregation is used to constrain the image-level anomaly probability, and the pixel-global consistency loss is defined as:

[0102]

[0103] (2) Evidence parsing stage: The evidence parser compresses the image-level anomaly score and the pixel-level anomaly heatmap output by AlignAD into a set of interpretable and reproducible key-value evidence, including confidence, area ratio, location label, severity level, size level, and several morphological / texture statistical features.

[0104] 1) Unified Confidence: We construct an intermediate value by combining the image-level score and the heatmap peak, and then obtain the normalized confidence value using the Sigmoid function.

[0105]

[0106] Here the image-level anomaly score and the peak of the pixel-level heatmap X are fused using a balancing weight coefficient set to 0.55, which gives appropriate attention to global image features during fusion; m is the intermediate fusion score, and C is the normalized confidence.

[0107] 2) Binary mask and area ratio: A percentile of the heatmap responses is taken as the threshold, the binary mask is obtained, and the area ratio is calculated:

[0108]

[0109] Here the percentile parameter used to determine the segmentation threshold is set to 97; the adaptive segmentation threshold is obtained from quantile statistics of the pixel response distribution of the current heatmap and therefore varies with the input image; the binary indicator function takes the value 1 if the condition is met and 0 otherwise; the result is the binary anomaly mask; H and W are the height and width of the feature map; the total area is the number of anomalous pixels in the mask; and R is the resulting area ratio of the anomalous region.

[0110] 3) Location Labels: First, the geometric centroid of the anomaly region is calculated based on a binary mask. Then, the coordinates of this centroid are mapped to a pre-defined nine-grid space to determine the location labels. The centroid coordinates are calculated as follows:

[0111]

[0112] Here, the result is the pair of coordinates of the geometric centroid of the anomaly region; x and y are the coordinates of the image pixels; and the value of the binary mask at coordinates (x, y) acts as the weight of that pixel.
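A minimal sketch of the centroid computation and the nine-grid mapping is given below. The label strings and the equal-thirds partition of the image are assumptions; the source states only that the centroid is mapped to a preset nine-grid space.

```python
ROWS = ("top", "middle", "bottom")   # hypothetical label vocabulary
COLS = ("left", "center", "right")


def centroid(mask):
    """Geometric centroid (x, y) of the nonzero pixels, or None for an empty mask."""
    total = sum(map(sum, mask))
    if total == 0:
        return None
    cx = sum(x * v for row in mask for x, v in enumerate(row)) / total
    cy = sum(y * v for y, row in enumerate(mask) for v in row) / total
    return cx, cy


def location_label(mask):
    """Map the centroid to one of nine cells of a 3 x 3 grid."""
    c = centroid(mask)
    if c is None:
        return "none"
    cx, cy = c
    h, w = len(mask), len(mask[0])
    col = min(int(3 * cx / w), 2)
    row = min(int(3 * cy / h), 2)
    return f"{ROWS[row]}-{COLS[col]}"


mask = [[0] * 6 for _ in range(6)]
mask[0][5] = 1
mask[1][5] = 1
label = location_label(mask)
```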

[0113] 4) Severity rating: A comprehensive severity score is obtained by weighting and fusing the area proportion and image-level anomaly scores.

[0114]

[0115] Here, the result is the comprehensive severity score; the coefficient that balances the weights of the area ratio and the anomaly score is set to 0.6, so that the severity assessment focuses more on the actual physical coverage of the defect while still considering the model's overall anomaly discrimination; R is the area ratio of the anomaly region; and the remaining term is the image-level anomaly score.
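The weighted fusion can be sketched as below, assuming that the 0.6 coefficient applies to the area ratio (consistent with the stated emphasis on physical coverage). The cut-offs that turn the score into a level are hypothetical; the source does not give them.

```python
BETA = 0.6  # from the text; assumed to weight the area ratio


def severity_score(area_ratio, image_score, beta=BETA):
    """Weighted fusion of area ratio and image-level anomaly score."""
    return beta * area_ratio + (1.0 - beta) * image_score


def severity_level(score, low=0.33, high=0.66):
    """Map the score to a level; the two cut-offs are illustrative assumptions."""
    if score < low:
        return "low"
    if score < high:
        return "medium"
    return "high"


score = severity_score(0.5, 0.8)
level = severity_level(score)
```

For an area ratio of 0.5 and an image-level score of 0.8, the fused score is 0.6 × 0.5 + 0.4 × 0.8 = 0.62.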

[0116] 5) Size rating and statistical features: Small, medium and large are defined by the pixel length of the major axis of the circumscribed ellipse, and geometric shape, boundary sharpness and texture variation statistics are extracted from the mask and neighborhood to form unified structured evidence.

[0117] (3) Closed-loop report output stage: The prompt generator fills the above structured evidence into a predefined task template to generate text prompts, which are concatenated with the visual prompts output by the query converter to form a unified prompt space for the multimodal large model. Under these prompt constraints, the multimodal large model outputs, in sequence, the anomaly judgment, defect description, cause ranking, and maintenance / repair suggestions, thus forming a closed loop of "discovery-quantification-diagnosis-disposal".
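How structured evidence might be filled into a task template can be sketched as follows. The template wording, the evidence keys, and the function name are all hypothetical; the source does not disclose the actual template, and concatenation with the visual prompts is not shown.

```python
# Hypothetical task template; the actual wording is not given in the source.
TEMPLATE = (
    "Structured evidence: confidence={confidence:.2f}; "
    "area ratio={area_ratio:.1%}; location={location}; "
    "severity={severity}; size={size}.\n"
    "Using only this evidence and the image, answer in order: "
    "(1) is the part anomalous, (2) describe the defect, "
    "(3) rank likely causes, (4) suggest maintenance actions."
)


def build_text_prompt(evidence):
    """Fill the key-value evidence into the task template."""
    return TEMPLATE.format(**evidence)


evidence = {
    "confidence": 0.70,
    "area_ratio": 0.03,
    "location": "top-right",
    "severity": "medium",
    "size": "small",
}
prompt = build_text_prompt(evidence)
```

Keeping the evidence as plain key-value pairs makes the generated prompt reproducible for the same image, which matches the stated goal of interpretable and reproducible evidence.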

[0118] (4) To achieve coordinated optimization of pixel-level localization and image-level discrimination, we adopt a joint training strategy of pixel-level segmentation loss and image-level discrimination loss for the AlignAD alignment visual perception module. The overall objective function can be expressed as:

[0119]

[0120] Here, the pixel-level segmentation loss is used to improve the localization accuracy and boundary consistency of abnormal regions; the image-level discrimination loss is used to improve the global normal / abnormal discrimination capability and to constrain the consistency between the global prediction and the local response; and the weighting coefficient is set to 0.5 here.

[0121] 1) Pixel-level segmentation loss: In one embodiment, the multi-scale anomaly response map is converted into a foreground probability map using softmax, and, combined with class imbalance suppression, region consistency, and boundary constraints, the pixel-level segmentation loss is defined as:

[0122]

[0123]

[0124]

[0125]

[0126]

[0127] Here, for each scale, the terms denote the ground-truth binary mask after scale alignment, the pixel-level response, and the pixel-level anomaly probability map. FocalCE is used to alleviate the foreground / background sample imbalance; it is built on a pixel-level binary cross-entropy operator that is computed pixel by pixel between the pixel response and the ground-truth mask and then averaged, with the foreground-background balance coefficient set to 0.25 and the focusing factor set to 2. Dice, the overlap coefficient operator, is used to improve the consistency of region overlap. The boundary term is used to enhance boundary sharpness and structural consistency: the boundary indication map is obtained by passing the ground-truth mask through an edge detection operator, and WBCE is the pixel-level binary cross-entropy weighted according to the boundary weights. The two weighting coefficients are set to 0.2 and 0.1, respectively.
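The three components named above can be sketched for a single scale on flattened pixel lists. Several details are assumptions because the formulas are not shown in the source: the standard focal form with balance 0.25 and focusing factor 2, the pairing of 0.2 with the Dice term and 0.1 with the boundary term, the `1 + weight * boundary` weighting, and the simple sum of the three terms. All function names are hypothetical.

```python
import math

EPS = 1e-7


def focal_ce(probs, mask, alpha=0.25, gamma=2.0):
    """Focal cross-entropy averaged over pixels."""
    total = 0.0
    for p, g in zip(probs, mask):
        p = min(max(p, EPS), 1.0 - EPS)
        p_t = p if g == 1 else 1.0 - p          # probability of the true class
        a_t = alpha if g == 1 else 1.0 - alpha  # foreground/background balance
        total += -a_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


def dice_loss(probs, mask):
    """One minus the soft Dice overlap coefficient."""
    inter = sum(p * g for p, g in zip(probs, mask))
    return 1.0 - (2.0 * inter + EPS) / (sum(probs) + sum(mask) + EPS)


def weighted_bce(probs, mask, boundary, boundary_weight=1.0):
    """Binary cross-entropy with extra weight on boundary pixels (assumed weighting)."""
    total = 0.0
    for p, g, b in zip(probs, mask, boundary):
        p = min(max(p, EPS), 1.0 - EPS)
        bce = -(g * math.log(p) + (1 - g) * math.log(1.0 - p))
        total += (1.0 + boundary_weight * b) * bce
    return total / len(probs)


def pixel_loss(probs, mask, boundary, w_dice=0.2, w_boundary=0.1):
    """Assumed combination: focal term plus weighted Dice and boundary terms."""
    return (focal_ce(probs, mask)
            + w_dice * dice_loss(probs, mask)
            + w_boundary * weighted_bce(probs, mask, boundary))


mask = [1, 1, 0, 0]
boundary = [0, 1, 1, 0]  # toy boundary indication map
good = pixel_loss([0.99, 0.99, 0.01, 0.01], mask, boundary)
bad = pixel_loss([0.2, 0.3, 0.8, 0.7], mask, boundary)
```

In a multi-scale setting, the same function would be applied to each scale and the results combined; how the scales are combined is not specified in the text available here.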

[0128] 2) Image-level discrimination loss: In one embodiment, the image-level discrimination loss consists of a binary classification supervision term, a ranking constraint term, and a pixel-global consistency term.

[0129]

[0130]

[0131]

[0132] Here, the terms denote the image-level anomaly score and the image-level label; the cross-entropy loss is used for binary classification; the ranking constraint loss takes the arithmetic mean over all sample pairs formed from the abnormal sample set A and the normal sample set B, with the two indices running over the abnormal and normal sample sets, respectively, and the paired scores being the discrimination scores of the corresponding samples; the pixel-global consistency loss is given by equation (9); and the two weighting coefficients are set to 0.5 and 0.2, respectively.
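The three terms of the image-level loss can be sketched as follows. The source states that the ranking term averages over all abnormal-normal pairs but does not show its form; the margin-based hinge used here, the squared-difference consistency term, and the pairing of 0.5 with the ranking term and 0.2 with the consistency term are assumptions. Function names are hypothetical.

```python
import math

EPS = 1e-7


def bce(scores, labels):
    """Binary cross-entropy averaged over samples."""
    total = 0.0
    for s, y in zip(scores, labels):
        s = min(max(s, EPS), 1.0 - EPS)
        total += -(y * math.log(s) + (1 - y) * math.log(1.0 - s))
    return total / len(scores)


def ranking_loss(scores, labels, margin=0.5):
    """Mean hinge penalty over all (abnormal, normal) pairs; the hinge form is assumed."""
    abnormal = [s for s, y in zip(scores, labels) if y == 1]
    normal = [s for s, y in zip(scores, labels) if y == 0]
    if not abnormal or not normal:
        return 0.0
    pairs = [max(0.0, margin - (a - n)) for a in abnormal for n in normal]
    return sum(pairs) / len(pairs)


def image_loss(scores, labels, evidences, w_rank=0.5, w_pgc=0.2):
    """Classification term plus weighted ranking and pixel-global consistency terms."""
    pgc = sum((s - e) ** 2 for s, e in zip(scores, evidences)) / len(scores)
    return bce(scores, labels) + w_rank * ranking_loss(scores, labels) + w_pgc * pgc


labels = [1, 0]
separated = image_loss([0.9, 0.1], labels, evidences=[0.85, 0.15])
confused = image_loss([0.4, 0.3], labels, evidences=[0.9, 0.1])
```

The ranking term is zero once every abnormal score exceeds every normal score by at least the margin, so it mainly penalizes batches in which the two groups overlap.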

[0133] 3) Two-stage training strategy: In addition, AlignAD training is preferably divided into two stages:

[0134] (a) Main training stage: Freeze the visual-language pre-trained backbone, update the parameters of the decoding head, prompt learner, and fusion head, and optimize using the joint loss shown in Equation (14);

[0135] (b) Adaptation and fine-tuning stage: With the backbone parameters fixed, only the visual adapter is fine-tuned in small steps to further adapt to industrial anomalies and improve cross-domain robustness. Hyperparameter settings are shown in Table 1.

[0136] Table 1 Hyperparameter Settings

[0137] The training results are shown in Tables 2 and 3.

[0138] Table 2 Image-level performance comparison results

[0139]

[0140] Table 2 presents the comparative experimental results of the proposed method and existing methods on the MVTec-AD and VisA datasets for the image-level detection task. On the MVTec-AD dataset, the proposed method achieves Image-AUC, AP, and F1-Max of 92.1%, 96.8%, and 93.0%, respectively. Compared with the state-of-the-art method FiLo, the proposed method improves Image-AUC by 0.9 percentage points, and it outperforms all compared methods on every metric. This improvement supports the effectiveness of the position-aware dynamic prompts (PA-DyPrompt) and the content-adaptive multi-shape convolution (SK-CMI) of this invention: by explicitly encoding the nine-grid position prior in the prompt template and introducing multi-scale, multi-shape convolution kernels, the model's response strength to defect regions is enhanced, and abnormal morphological features are characterized more accurately.

[0141] The VisA dataset contains complex textured backgrounds and significant reflective interference. As a result, the global ranking metric Image-AUC, which is sensitive to background noise, is slightly lower for the proposed method (83.1%) than for FiLo (83.9%). However, the proposed method achieves AP and F1-Max scores of 86.2% and 81.0%, respectively, which measure detection accuracy and robustness and significantly outperform the other compared methods. This indicates that the proposed method has advantages in suppressing background interference, focusing on abnormal regions, and handling the boundary consistency of weak-contrast and small defects. In summary, the experimental data show that the zero-shot detection method based on aligned visual perception proposed in this invention maintains high detection accuracy and robustness across different industrial scenarios, addressing the high false detection rate and the difficulty of locating small defects that existing technologies exhibit under complex backgrounds.

[0142] Table 3 Pixel-level performance comparison results

[0143] Table 3 presents the comparative experimental results of the proposed method and existing methods on the MVTec-AD and VisA datasets for the pixel-level localization task. On the MVTec-AD dataset, the proposed method reaches Pixel-AUC, PRO, and F1-Max of 92.5%, 84.7%, and 51.1%, respectively. On the VisA dataset, the proposed method also achieves the best results, with the same three metrics reaching 96.0%, 87.2%, and 35.0%, respectively. The experimental data show that the proposed method outperforms all compared methods on all key metrics on both datasets. This improvement is mainly attributed to the "position-content-consistency" three-dimensional alignment mechanism of the proposed method: position-aware dynamic prompts enhance the alignment accuracy between text semantics and spatial location, enabling the model to capture abnormal signals in specific regions more sensitively; content-adaptive multi-shape convolution strengthens the consistency between semantic features and defect morphology, improving the modeling of slender or irregular defects; and global-local consistency constraints unify the pixel-level response and the image-level judgment, avoiding logical fragmentation. In summary, the experimental results verify that the present invention can generate compact, coherent, and clearly defined anomaly heatmaps even under complex background interference and low contrast, thereby improving the pixel-level localization accuracy and reliability of industrial anomaly detection.

[0144] To verify the anomaly understanding, cause inference, and decision support capabilities of the method of this invention in industrial scenarios, we set up two types of interactive commands: one is the "potential cause analysis" command, and the other is the "correction and maintenance disposal suggestion" command. The method of this invention was compared with comparative models (such as AnomalyGPT, MiniGPT-4, Qwen2.5-VL, etc.) under the same input conditions to examine whether the models can transform pixel-level anomaly evidence into traceable and actionable diagnostic conclusions and disposal plans. Since "whether it is an anomaly / defect description" is a basic perception output, and all models can provide acceptable answers in some examples, we focus on explaining the "cause analysis" and "disposal suggestion" outputs, which are more closely aligned with production line operation and maintenance decisions, to demonstrate the closed-loop decision support effect of this invention.

[0145] Potential cause analysis. As shown in Table 4, under the "potential cause" command, the outputs of the comparative models MiniGPT-4 and Qwen2.5-VL are mostly generalized enumerations that lack a correspondence with specific defect evidence, making it difficult to form a traceable priority ranking. Although the comparative model AnomalyGPT can point out the presence of anomalies in the image, its explanation of their causes is still too general to provide a clear path for engineering troubleshooting. In contrast, the method of this invention generates a ranked list of candidate causes based on pixel-level evidence of the anomaly region and provides specific hypotheses related to the industrial process, thereby providing verifiable diagnostic clues for subsequent workstation troubleshooting and equipment / process verification.

[0146] Table 4. Qualitative comparison of the proposed method with three other large models regarding the anomaly cause task.

[0147] Corrective and maintenance recommendations. As shown in Table 5, under the "corrective / maintenance recommendation" command, the comparative models lack structured evidence related to the scenario and therefore tend to give general or only weakly relevant recommendations, which makes it difficult to guide on-site handling directly. The comparative model AnomalyGPT also struggles to generate step-by-step, implementable maintenance plans. In contrast, the method of this invention further transforms diagnostic conclusions into a hierarchical action list, including at least immediate measures, corrective measures, and preventive measures, thereby achieving a closed-loop output from "discovering anomalies" to "guiding handling" and improving on-site execution efficiency and traceability.

[0148] Table 5. Qualitative comparison of the proposed method with three other large models regarding the anomaly maintenance task.

[0149]

Claims

1. A zero-shot industrial anomaly detection method based on aligned visual perception and evidence chain reasoning, characterized in that it includes the following steps:
(1) acquiring the industrial image to be detected, inputting it into the AlignAD aligned visual perception module, and outputting a pixel-level anomaly heatmap and an image-level anomaly score;
(2) compressing the pixel-level anomaly heatmap and the image-level anomaly score into structured anomaly evidence by an evidence parser;
(3) filling the structured anomaly evidence into a predefined task template by the prompt generator to generate text prompts;
(4) concatenating the text prompts with the visual prompts extracted from the same image by the visual encoder and the query converter, and inputting them into the multimodal large model;
(5) outputting, by the multimodal large model under the constraint of the evidence chain, the anomaly determination, defect description, cause analysis, and maintenance suggestions.

2. The method according to claim 1, characterized in that the AlignAD aligned visual perception module includes: a PA-DyPrompt module for dynamically generating text prototypes based on the spatial location prior of the image; an SK-CMI module for modeling defects of different shapes through multi-branch convolution; and a GLAC module for constraining the consistency between the image-level decision and the pixel-level response.
(1) Position-aware dynamic prompts (PA-DyPrompt): The image space is divided into a nine-grid position prior, and the position phrase is explicitly encoded in the prompt template, making the text prototype sensitive to the defect position; at the same time, a meta-network is introduced to dynamically modulate the learnable context vector according to the global semantics of the image, thereby realizing context-adjustable prompt generation. In the accompanying formulas, the terms denote, respectively, the initial learnable context vector, the dynamically adjusted context vector, the change in the context vector, the global features of the image, the meta-network, and the set of learnable parameters of the meta-network, namely its weight matrix and bias vector, which is randomly initialized and optimized end to end via backpropagation.
(2) Content-adaptive multi-shape convolution (SK-CMI): SK-CMI uses multi-branch, multi-shape, and multi-scale convolution kernels for parallel modeling, including isotropic square kernels of different scales and asymmetric rectangular kernels for horizontal / vertical slender defects, and achieves content-adaptive reweighting through selective kernel fusion. In the accompanying formulas, the terms denote, respectively, the output of each branch, the weighted sum, global average pooling, the channel description vector z obtained through global average pooling, the activation function, and the multilayer perceptron (MLP).
The set of learnable parameters of the multilayer perceptron, namely its weight matrix and bias vector, is randomly initialized in the training phase and obtained through end-to-end backpropagation and optimizer updates. The remaining terms denote the branch selection weight matrix, the weight of each branch, and the final output after weighting.
(3) Global-local consistency constraint (GLAC): The GLAC mechanism addresses the misalignment between visual features and text semantics in two stages: global-local alignment at the feature level and a pixel-global consistency constraint at the decision level.
1) Global-local alignment: Before the anomaly scores are calculated, the visual features are dynamically recalibrated using the text prototypes to ensure semantic consistency of the features.
Global feature alignment: The global features of the image interact with the text prototype T, and the aligned global features are generated through a residual structure.
Local feature alignment: Affine parameters are generated from the text prototypes, and the FiLM mechanism applies gated recalibration to the local features of each layer to amplify the response in potentially anomalous regions, thereby obtaining modulated features aligned with the text prototypes.
2) Pixel-global consistency constraint: Scores are calculated from the aligned features, and logical consistency is enforced between the image-level decision and the pixel-level evidence.
Anomaly score calculation: At the image level, the similarity between the aligned global features and the text prototype yields the image-level anomaly probability. At the pixel level, the similarity between the modulated local features and the text prototype is calculated, and the most significant pixel evidence s is extracted through upsampling and an inter-layer maximum; the final total score is then calculated.
PGC-Loss consistency constraint: First, Top-K pooling is applied to the pixel anomaly probability map at each scale to obtain the single-scale pixel evidence strength. The pixel evidence from all scales is then averaged to obtain the final pixel evidence strength, where k is the number of pixels involved in evidence aggregation (set to 5366 here), the Top-K operator computes the mean of the k largest pixel values in the input probability map, and L is the number of layers involved in the aggregation (7 here). In addition, to strengthen the logical consistency between the image-level judgment and the pixel-level evidence, the pixel evidence strength obtained by multi-scale Top-K aggregation is used to constrain the image-level anomaly probability, which defines the pixel-global consistency loss.

3. The method according to claim 1, characterized in that the structured anomaly evidence includes at least one of the following: confidence, area ratio, location label, severity level, size level, morphological statistical features, and texture statistical features;
(1) Unified confidence: an intermediate value is constructed by combining the image-level score and the heatmap peak, and the normalized confidence is obtained by applying the sigmoid function. In the accompanying formula, the inputs are the image-level anomaly score and the pixel-level heatmap X; the balancing weight coefficient is set to 0.55 so that global image features receive appropriate attention during fusion; m is the intermediate fusion score; and C is the normalized confidence;
(2) Binary mask and area ratio: a given percentile of the heatmap responses is taken as the threshold, the binary mask is obtained, and the area ratio is calculated. The percentile parameter used to determine the segmentation threshold is set to 97; the adaptive segmentation threshold is obtained from quantile statistics of the pixel response distribution of the current heatmap and therefore varies with the input image; the binary indicator function takes the value 1 if the condition is met and 0 otherwise; for the generated binary anomaly mask, H and W are the height and width of the feature map, respectively; the sum over the mask gives the total area of abnormal pixels; and R is the resulting area ratio of the abnormal region;
(3) Location label: first, the geometric centroid of the abnormal region is calculated from the binary mask; then, the coordinates of the centroid are mapped to a preset nine-grid space to determine the location label. In the centroid formula, the result is the pair of coordinates of the geometric centroid of the anomaly region, x and y are the coordinates of the image pixels, and the value of the binary mask at coordinates (x, y) acts as the weight of that pixel;
(4) Severity level: a comprehensive severity score is obtained by weighting and fusing the area ratio and the image-level anomaly score. The coefficient that balances the weights of the area ratio and the anomaly score is set to 0.6, so that the severity assessment focuses more on the actual physical coverage of the defect while still considering the model's overall anomaly discrimination; R is the area ratio of the anomaly region, and the remaining term is the image-level anomaly score;
(5) Size level and statistical features: small, medium, and large are defined by the pixel length of the major axis of the circumscribed ellipse, and geometric shape, boundary sharpness, and texture variation statistics are extracted from the mask and its neighborhood to form unified structured evidence.

4. The method according to claim 1, characterized in that the training of the multimodal large model adopts a two-stage strategy: the first stage freezes the visual-language pre-trained backbone and updates the decoding head and the prompt learner; the second stage fixes the backbone parameters.

5. The method according to claim 1, characterized in that the AlignAD aligned visual perception module is trained with a joint training strategy of pixel-level segmentation loss and image-level discrimination loss. In the overall objective function, the pixel-level segmentation loss is used to improve the localization accuracy and boundary consistency of abnormal regions; the image-level discrimination loss is used to improve the global normal / abnormal discrimination capability and to constrain the consistency between the global prediction and the local response; and the weighting coefficient is set to 0.5 here;
(1) Pixel-level segmentation loss: in one embodiment, the multi-scale anomaly response map is converted into a foreground probability map using softmax, and the pixel-level segmentation loss is defined by combining class imbalance suppression, region consistency, and boundary constraints. For each scale, the terms denote the ground-truth binary mask after scale alignment, the pixel-level response, and the pixel-level anomaly probability map. FocalCE is used to alleviate the foreground / background sample imbalance; it is built on a pixel-level binary cross-entropy operator that is computed pixel by pixel between the pixel response and the ground-truth mask and then averaged, with the foreground-background balance coefficient set to 0.25 and the focusing factor set to 2. Dice, the overlap coefficient operator, is used to improve the consistency of region overlap. The boundary term is used to enhance boundary sharpness and structural consistency: the boundary indication map is obtained by passing the ground-truth mask through an edge detection operator, and WBCE is the pixel-level binary cross-entropy weighted according to the boundary weights. The two weighting coefficients are set to 0.2 and 0.1, respectively;
(2) Image-level discrimination loss: the image-level discrimination loss consists of a binary classification supervision term, a ranking constraint term, and a pixel-global consistency term. The terms denote the image-level anomaly score and the image-level label; the cross-entropy loss is used for binary classification; the ranking constraint loss takes the arithmetic mean over all sample pairs formed from the abnormal sample set A and the normal sample set B, with the two indices running over the abnormal and normal sample sets, respectively, and the paired scores being the discrimination scores of the corresponding samples; the pixel-global consistency loss is given by equation (9); and the two weighting coefficients are set to 0.5 and 0.2, respectively.