Method and system for dynamic semantic alignment defect detection based on material texture perception
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- NORTHEASTERN UNIV CHINA
- Filing Date
- 2026-05-28
- Publication Date
- 2026-06-23
Figure CN122265293A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of surface defect detection, and in particular to a dynamic semantic alignment defect detection method and system based on material texture perception.
Background Technology
[0002] Industrial surface defect detection is a core component of the manufacturing quality control process, widely used in surface quality monitoring of products such as automotive parts and metal castings. With the development of multimodal pre-trained large-scale models, zero-shot anomaly localization using Vision-Language Models (VLMs) has become an important technological direction in this field. Its core advantage lies in its ability to accurately identify randomly occurring unknown defects in industrial production without the need for pre-collecting and labeling a large number of defect samples.
[0003] Currently, the main technical approaches for detecting industrial surface defects include the following. Supervised learning methods, exemplified by the YOLO series, train a model on a large number of manually collected and labeled images containing specific defects (such as cracks, porosity, and scratches). This approach relies heavily on expensive manual labeling and cannot identify novel defects outside the training set.
[0004] Zero-shot detection methods based on the CLIP architecture, such as the existing WinCLIP technique, predefine a set of standard static text prompts (e.g., a photo of a normal surface, a photo of a defective surface), extract their features using a text encoder, and match them by sliding-window similarity against the patch features extracted by a visual encoder. During the inference phase, the model performs a global scan of the image to be inspected and outputs the localization result based on the learned defect features.
[0005] However, in the complex industrial production line environment, existing static zero-shot inspection technologies based on standard CLIP have significant drawbacks. Semantic drift caused by the material environment: industrial parts are made of a variety of materials (such as highly reflective aluminum alloy and dark-textured gray cast iron), and their surface background textures differ greatly.
[0006] Projection bias of static prompt words: because fixed static prompt words are used, their semantic projection in the feature space drifts significantly as the material background changes. For example, a defect description that performs well on a rough cast iron surface may be misjudged as noise interference on a smooth aluminum alloy surface because of changes in lighting, leading to serious false detections. Existing technologies cannot perceive the material environment of the object under inspection and cannot dynamically adjust the description criteria based on material properties, resulting in poor detection stability and difficulty in meeting the requirements of industrial applications. Furthermore, during the inference phase, the text description of standard CLIP cannot adapt dynamically to the physical environment.
Summary of the Invention
[0007] Technical problem to be solved: In view of the above shortcomings and deficiencies of the prior art, the present invention provides a dynamic semantic alignment defect detection method and system based on material texture perception, which solves the technical problem of semantic drift in standard CLIP caused by interference from mixed materials on the production line.
[0008] To achieve the above objectives, the main technical solutions adopted by the present invention include the following. In a first aspect, the present invention provides a dynamic semantic alignment defect detection method based on material texture perception, comprising the following steps:
Step 1: Input the image to be detected into the visual coding branch and the material context awareness branch respectively, so as to extract the visual patch feature set and the material context vector in parallel;
Step 2: Use a text encoder to extract the initial semantic feature vector of the preset defect text prompt words;
Step 3: Input the material context vector into the pre-trained prompt agent network, calculate the material offset, and fuse the initial semantic feature vector with the material offset to generate the evolved semantic features;
Step 4: Calculate the similarity between each visual patch feature in the visual patch feature set and the evolved semantic features;
Step 5: Based on the similarity, calculate the probability score of each patch belonging to the anomaly category using the temperature parameter, and generate an anomaly heatmap from the probability scores of all patches;
Step 6: Post-process the anomaly heatmap to output the localization bounding box and confidence level of the defect.
[0009] As a further improvement to the method of the present invention, in step 1, the visual coding branch extracts the local visual patch feature set of the image through a pre-trained visual encoder, namely the ViT backbone network in the CLIP architecture. The image I is divided into patches, and the visual feature extracted for each patch forms the local visual patch feature set;
[0010] The material context awareness branch employs a lightweight convolution operator: the image I is input in parallel into the material/texture context extractor, which captures the surface's brightness distribution, roughness patterns, and reflection characteristics and outputs a material context vector representing the physical properties of the current detection environment.
[0011] As a further improvement to the method of the present invention, in step 2, the text encoder E_text maps the preset set of defect text prompts into a high-dimensional feature space, generating an initial semantic benchmark that contains a normal-state vector and an abnormal-state vector;
[0012] The text encoder is used to extract the initial semantic feature vector as follows:
[0013] ;
[0014] In the formula, the initial semantic feature vector includes the standard-defined normal vector and the anomaly vector.
[0015] As a further improvement to the method of the present invention, in step 3, the material context vector is input into a prompt agent network composed of a multilayer perceptron (MLP), which transforms physical environment attributes into a dynamic offset in the feature space:
[0016] ;
[0017] The initial semantic vector is then fused with the material offset to derive the evolved semantic features adapted to the current specific material environment, thereby achieving a mapping from general semantics to physical-environment semantics:
[0018] ;
[0019] In the formula, the material offset is used to fine-tune the boundary between abnormal and normal in the feature space based on material properties;
[0020] The evolved semantic features include the normal vector and the anomaly vector adapted to the current material.
[0021] As a further improvement to the method of the present invention, in step 4, patch-level alignment is computed against the evolved semantic features, comparing local image features with the semantic benchmark under the current material constraints using the cosine similarity formula;
[0022] The formula for calculating cosine similarity is:
[0023] ;
[0024] As a further improvement to the method of the present invention, in step 5, a Softmax function with a temperature parameter is used to calculate the probability score of each patch belonging to the anomaly category:
[0025] ;
[0026] The probability scores of all patches are restored, according to their spatial locations, into an anomaly heatmap at the original resolution.
[0027] As a further improvement to the method of the present invention, in step 6, the post-processing includes: using a density clustering algorithm to denoise the anomaly heatmap and performing binarization according to a preset threshold.
[0028] In a second aspect, the present invention provides a dynamic semantic alignment defect detection system based on material texture awareness, used to execute the dynamic semantic alignment defect detection method based on material texture awareness described in any one of the first aspects above, comprising:
[0029] The feature extraction module inputs the image to be detected into the visual coding branch and the material context awareness branch respectively, so as to extract the visual patch feature set and material context vector in parallel;
[0030] The semantic construction module is used to extract the initial semantic feature vector of preset standard text prompt words using a text encoder;
[0031] The semantic evolution module is used to input the material context vector into a pre-trained cue agent network, and fuse the generated material offset with the initial semantic feature vector to output the evolved semantic features;
[0032] The similarity calculation module is used to calculate the similarity between the visual patch feature set and the evolutionary semantic features;
[0033] The probability reconstruction module is used to calculate the probability score of each patch belonging to the anomaly category based on the similarity using temperature parameters, and to generate an anomaly heatmap based on the probability scores of all patches.
[0034] The defect output module performs post-processing on the abnormal heatmap and outputs the location bounding box and confidence level of the defect.
[0035] In a third aspect, the present invention provides a computer-readable storage medium having a computer program stored thereon, wherein the program, when executed, implements the dynamic semantic alignment defect detection method based on material texture perception as described in any of the first aspects above.
[0036] In a fourth aspect, the present invention provides a storage device, including a storage medium and a processor, wherein the storage medium stores a computer program, and the program, when executed by the processor, implements the dynamic semantic alignment defect detection method based on material texture perception as described in any one of the first aspects above.
[0037] Beneficial effects
[0038] The beneficial effects of this invention are:
[0039] To address the semantic drift in standard CLIP caused by mixed-material interference on production lines, a dynamic multimodal feature alignment method and system incorporating a physical-mechanism-guided purification path is proposed. By introducing a prompt agent mechanism that is aware of different materials, the zero-shot detection model gains the ability to adaptively adjust the detection criteria and semantic projection boundary based on the physical properties of the workpiece surface (such as reflectivity and roughness).
[0040] This invention overcomes the shortcomings of existing static descriptions based on standard CLIP, namely poor detection stability, high false-alarm rates, and insufficient sensitivity in environments with varying materials. Without requiring any defect sample annotation, it uses a dynamic meta-prompt evolution and purification mechanism to keep the fluctuation of detection accuracy within 3.0%, compared with the 15%–25% fluctuation of traditional static prompt word schemes, demonstrating robustness and adaptability across materials.
[0041] For pinhole-like micropores and microcracks with a diameter of less than 0.3 mm on the surface of automobile cylinder blocks, the micro-defect detection rate of this invention reaches 60.8%, which is 18.8 percentage points higher than the 42.0% of the prior art. Through semantic attribute purification and differential amplification in score reconstruction, FB-CLIP accurately locates defects such as pores, cracks, and scratches, and also removes pseudo-anomaly areas caused by texture drift interference.
[0042] The invention employs a mechanism-guided purification path to suppress structural interference and a lightweight design for the prompt agent network, whose core MLP has less than 1.5% of the parameters of a standard ViT. It has low computational overhead and a runtime of less than 50 ms, and is easily integrated for online real-time detection on edge devices.
Attached Figure Description
[0043] Figure 1 is a flowchart of the dynamic semantic alignment defect detection method based on material texture perception provided in an embodiment of the present invention;
[0044] Figure 2 is an architecture diagram of the dynamic semantic alignment defect detection method based on material texture perception provided in an embodiment of the present invention;
[0045] Figure 3 is a schematic diagram of patch probability score restoration and anomaly heatmap synthesis based on material-aware dynamic prompt guidance provided in an embodiment of the present invention;
[0046] Figure 4 is a schematic diagram of DBSCAN density clustering and denoising based on material-aware dynamic prompts provided in an embodiment of the present invention;
[0047] Figure 5 is a schematic diagram of the binarization and final detection result output process based on material-aware dynamic prompts provided in an embodiment of the present invention;
[0048] Figure 6 is the training convergence curve of the multimodal sample anomaly localization model on a mixed dataset in an embodiment of the present invention;
[0049] Figure 7 is the original input image in an embodiment of the present invention;
[0050] Figure 8 is the defect ground-truth map in an embodiment of the present invention;
[0051] Figure 9 shows the results of the traditional zero-shot localization method in an embodiment of the present invention;
[0052] Figure 10 is a schematic diagram of the qualitative localization results of FB-CLIP in an embodiment of the present invention;
[0053] Figure 11 is a module architecture diagram of the dynamic semantic alignment defect detection system based on material texture perception provided in an embodiment of the present invention.
Detailed Implementation
[0054] To better understand the above technical solutions, exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. Although exemplary embodiments of the present invention are shown in the drawings, it should be understood that the present invention can be implemented in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that the present invention can be understood more clearly and thoroughly, and that the scope of the present invention can be fully conveyed to those skilled in the art.
[0055] In a first aspect, as shown in Figures 1 and 2, this embodiment of the invention provides a dynamic semantic alignment defect detection method based on material texture perception, including the following steps. Step 1: Input the image to be detected into the visual coding branch and the material context awareness branch respectively, so as to extract the visual patch feature set and the material context vector in parallel.
[0056] The visual coding branch extracts the local patch feature set of the image through a pre-trained visual encoder, namely the ViT backbone network in the CLIP architecture.
[0057] The image I is divided into patches, and the visual feature extracted for each patch forms the visual feature set.
[0058] This visual feature set not only includes the visual representation of the image, but also couples the distribution patterns of spatial texture.
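As a minimal sketch of the patch division only: a ViT-style encoder first cuts the image into non-overlapping square tiles. The function name `split_into_patches`, the toy 4 x 4 image, and the tile size are assumptions for illustration; in the actual encoder each tile is then embedded and passed through transformer layers to obtain the patch feature, which this sketch does not do.

```python
def split_into_patches(image, patch):
    """Split an H x W image (list of rows) into non-overlapping patch x patch tiles.

    Tiles are returned in row-major order. H and W must be multiples of `patch`.
    """
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be a multiple of the patch size")
    return [[row[x:x + patch] for row in image[y:y + patch]]
            for y in range(0, h, patch)
            for x in range(0, w, patch)]

# Toy 4 x 4 "image" whose pixel value encodes its position (10 * row + column).
image = [[10 * r + c for c in range(4)] for r in range(4)]
patches = split_into_patches(image, 2)
print(len(patches))   # number of tiles: (4 / 2) * (4 / 2)
print(patches[1])     # the top-right tile
```

The row-major order matters later: it is what lets per-patch scores be placed back at their spatial locations when the heatmap is assembled.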
[0059] The material context-aware branch employs a lightweight convolution operator (such as RFAConv receptive field attention convolution), which explicitly characterizes the anisotropic brightness reflectivity and micro-geometric roughness of the workpiece surface by calculating the attention distribution weights of the pixel-level receptive field in the feature space.
[0060] Image I is input in parallel into the material/texture context extractor, which captures the surface's brightness distribution, roughness patterns, and reflection properties, and outputs a material context vector representing the physical properties of the current detection environment.
[0061] Through its parallel branch architecture, this step explicitly decouples the high-dimensional visual representation from the physical environment attributes, thereby filtering out at the source the common-mode noise interference caused by background texture.
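To make the idea of a material context vector concrete, the following sketch replaces the learned lightweight convolution branch with hand-crafted statistics. This is a swapped-in stand-in, not the RFAConv-based extractor described above: the function name `material_context`, the four chosen statistics, and the `highlight_level` threshold are all assumptions.

```python
from statistics import mean, pstdev

def material_context(gray, highlight_level=0.9):
    """Summarise a grayscale image (rows of floats in [0, 1]) as a 4-element vector.

    Hypothetical hand-crafted stand-in for a learned material branch:
    [mean brightness, brightness spread, roughness proxy, highlight fraction].
    """
    pixels = [p for row in gray for p in row]
    # Roughness proxy: mean absolute difference between horizontal neighbours.
    diffs = [abs(row[x + 1] - row[x]) for row in gray for x in range(len(row) - 1)]
    highlights = sum(1 for p in pixels if p >= highlight_level) / len(pixels)
    return [mean(pixels), pstdev(pixels), mean(diffs) if diffs else 0.0, highlights]

# Two toy surfaces: a smooth, bright one and a rough, dark one.
smooth_bright = [[0.95, 0.95, 0.96], [0.95, 0.96, 0.95]]
rough_dark = [[0.10, 0.40, 0.05], [0.35, 0.08, 0.45]]
ctx_bright = material_context(smooth_bright)
ctx_dark = material_context(rough_dark)
print(ctx_bright)
print(ctx_dark)
```

The two toy surfaces yield clearly different vectors, which is the property the downstream prompt agent relies on, whatever extractor actually produces the vector.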
[0062] Step 2: Use a text encoder to extract the initial semantic feature vector of the preset defective text prompt words.
[0063] Specifically, the text encoder E_text maps the preset set of defect text prompts into a high-dimensional feature space, generating an initial semantic benchmark that contains a normal-state vector and an abnormal-state vector.
[0064] The text encoder is used to extract the initial semantic feature vector:
[0065] ;
[0066] In the formula, the initial semantic feature vector includes the standard-defined normal vector and the anomaly vector.
[0067] The defect text prompt set includes a normal-state description (e.g., "a photo of a perfect metal surface") and an abnormal-state description (e.g., "a photo of a metal surface with cracks").
[0068] To ensure the robustness of the dynamic modulation path, a spatial geometric constraint mechanism is introduced: the objective function is optimized so that the normal vector and the anomaly vector are geometrically separated within the feature space. This constraint reserves sufficient room on the feature manifold for subsequent dynamic semantic evolution, reducing the risk of mapping failure due to model initialization bias.
[0069] Step 3: Input the material context vector into the pre-trained prompt agent network, calculate the material offset, and fuse the initial semantic feature vector with the material offset to generate the evolved semantic features.
[0070] To address the semantic drift caused by the background material, a semantic prompt agent (Prompt Agent) is introduced to dynamically modulate the initial semantic vector.
[0071] Specifically, the material context vector is input into a prompt agent network composed of a multilayer perceptron (MLP), which transforms physical environment attributes into a dynamic offset in the feature space:
[0072] ;
[0073] Subsequently, the initial semantic vector is fused with the material offset to derive the evolved semantic features adapted to the current specific material environment, thereby achieving a mapping from general semantics to physical-environment semantics:
[0074] ;
[0075] In the formula, the material offset fine-tunes the boundary between abnormal and normal in the feature space based on material properties (such as the high reflectivity of aluminum alloy or the messy texture of cast iron). This includes automatically compensating for feature deviations caused by highlighted areas in highly reflective environments, preventing the model from misjudging normal reflections as defects.
[0076] The evolved semantic features include the normal vector and the anomaly vector adapted to the current material.
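Step 3 can be sketched as follows, under several assumptions that the text does not confirm: the fusion is taken to be element-wise addition followed by re-normalisation, the MLP has one ReLU hidden layer, and it emits a separate offset for the normal and the anomaly vector. All names (`mlp_offset`, `evolve_prompts`) and the hand-picked toy weights are hypothetical.

```python
import math

def mlp_offset(context, w1, b1, w2, b2):
    """Two-layer perceptron: material context vector -> offset in text feature space."""
    hidden = [max(0.0, sum(w * c for w, c in zip(row, context)) + b)  # ReLU layer
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def evolve_prompts(t_normal, t_anomaly, context, params):
    """Additive fusion (assumed): evolved = normalize(initial + offset)."""
    d = len(t_normal)
    delta = mlp_offset(context, *params)  # length 2 * d: [normal part | anomaly part]
    evolved_n = l2_normalize([t + o for t, o in zip(t_normal, delta[:d])])
    evolved_a = l2_normalize([t + o for t, o in zip(t_anomaly, delta[d:])])
    return evolved_n, evolved_a

# Toy weights (hypothetical): 2-D context -> 2 hidden units -> 2 * 2 offset values.
params = (
    [[0.5, -0.2], [0.1, 0.3]], [0.0, 0.0],
    [[0.1, 0.0], [0.0, 0.1], [-0.1, 0.0], [0.0, -0.1]], [0.0, 0.0, 0.0, 0.0],
)
t_n, t_a = evolve_prompts([1.0, 0.0], [0.0, 1.0], [0.9, 0.1], params)
print(t_n, t_a)
```

Re-normalising after the addition keeps the evolved vectors on the unit sphere, so a later cosine comparison is unaffected by the magnitude of the offset; whether the original method does this is not stated.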
[0077] Step 4: Calculate the similarity between each visual patch feature in the visual patch feature set and the evolved semantic features.
[0078] Specifically, patch-level alignment is computed against the evolved semantic features: the local features of the image are compared patch by patch with the semantic benchmark under the current material constraints using the cosine similarity formula.
[0079] The formula for calculating cosine similarity is:
[0080] ;
[0081] This step compares each spatial patch with the dynamically adjusted anomaly description to obtain the initial response intensity. The response reflects more than shallow visual similarity: it quantifies the semantic anomaly degree of the patch features after material interference has been excluded, aligning the visual features with the evolved semantics in a physically consistent way.
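The patch-level comparison uses the standard cosine similarity, the dot product of two vectors divided by the product of their norms. The sketch below applies it to every patch feature against an evolved normal and anomaly vector; the function names and the toy 2-D features are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def patch_similarities(patch_features, t_normal, t_anomaly):
    """For each patch feature, return (similarity to normal, similarity to anomaly)."""
    return [(cosine(f, t_normal), cosine(f, t_anomaly)) for f in patch_features]

# Toy 2-D features: the first patch leans "normal", the second leans "anomalous".
features = [[0.9, 0.1], [0.2, 0.8]]
sims = patch_similarities(features, [1.0, 0.0], [0.0, 1.0])
print(sims)
```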
[0082] Step 5: Based on the similarity, calculate the probability score of each patch belonging to the anomaly category using the temperature parameter, and generate an anomaly heatmap based on the probability scores of all patches.
[0083] As shown in Figure 3, to further enhance localization sensitivity, a Softmax function with a temperature parameter is used to calculate the probability score of each patch belonging to the anomaly category:
[0084] ;
[0085] This formula achieves differential amplification by comparing the evolved normal and anomaly semantics. Specifically, because both the normal vector and the anomaly vector have undergone material-adaptive correction, the contrast between the numerator and denominator effectively cancels out the common-mode interference caused by background textures, so that the probability score focuses on abnormal texture mutations (i.e., actual defects).
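The formula itself does not survive in this text. As a hedged illustration only, a two-class temperature Softmax consistent with the description can be written with assumed symbols: $s_i^{n}$ and $s_i^{a}$ for the cosine similarities of patch $i$ to the evolved normal and anomaly vectors, $\tau$ for the temperature, and $p_i$ for the anomaly score. None of these symbols come from the source.

```latex
p_i = \frac{\exp\!\left(s_i^{a}/\tau\right)}
           {\exp\!\left(s_i^{n}/\tau\right) + \exp\!\left(s_i^{a}/\tau\right)}
    = \frac{1}{1 + \exp\!\left(-\,(s_i^{a} - s_i^{n})/\tau\right)}

% Adding the same bias b to both similarities leaves the score unchanged:
\frac{\exp\!\left((s_i^{a}+b)/\tau\right)}
     {\exp\!\left((s_i^{n}+b)/\tau\right) + \exp\!\left((s_i^{a}+b)/\tau\right)} = p_i
```

In this form the score depends only on the difference $s_i^{a} - s_i^{n}$, which is one way to read the cancellation of common-mode interference: any shift shared by both similarities drops out, and a smaller $\tau$ steepens the response to the remaining difference.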
[0086] Then, the probability scores of all patches are restored, according to their spatial locations, into an anomaly heatmap at the original resolution.
[0087] In this step, the Softmax function with the temperature parameter performs nonlinear smoothing and sharpening of the similarity distribution. This differential amplification effect significantly widens the score gap between the evolved anomaly response and the normal response, so that tiny defects originally hidden in the background texture (such as pinholes with a diameter < 0.3 mm) exhibit a significant highlight response on the probability score map, thereby suppressing common-mode background texture interference.
[0088] Furthermore, the temperature parameter is adaptively optimized based on the global feature distribution of the current image. Exploiting the exponential property of the Softmax function, a local texture-mutation differential amplification mechanism is applied to the similarity differences: the extremely weak similarity fluctuations caused by minute defects with diameters < 0.3 mm are stretched nonlinearly in the probability mapping space, while common-mode background noise is suppressed. Minute defects thus appear as significant highlight responses on the score distribution map, addressing the signal-submersion problem to which traditional zero-shot models are prone.
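The amplification effect of the temperature can be checked numerically. The sketch below assumes a two-class Softmax over the (normal, anomaly) similarities and arranges the per-patch scores into a patch-resolution grid; upsampling to the original image resolution is omitted. The function names and the temperature values tried are assumptions, and the adaptive choice of temperature described above is not modelled.

```python
import math

def anomaly_probability(s_normal, s_anomaly, tau):
    """Two-class softmax over (normal, anomaly) similarities at temperature tau."""
    m = max(s_normal, s_anomaly)          # subtract the max for numerical stability
    e_n = math.exp((s_normal - m) / tau)
    e_a = math.exp((s_anomaly - m) / tau)
    return e_a / (e_n + e_a)

def to_heatmap(scores, grid_w):
    """Arrange a flat, row-major list of patch scores into rows of width grid_w."""
    return [scores[i:i + grid_w] for i in range(0, len(scores), grid_w)]

# A weak similarity gap of 0.02 is stretched as the temperature shrinks.
for tau in (1.0, 0.07, 0.01):
    print(tau, round(anomaly_probability(0.20, 0.22, tau), 3))

sims = [(0.30, 0.10), (0.20, 0.22), (0.25, 0.05), (0.10, 0.35)]
heatmap = to_heatmap([anomaly_probability(n, a, 0.07) for n, a in sims], grid_w=2)
print(heatmap)
```

With equal similarities the score is exactly 0.5 at any temperature, so the temperature changes only how sharply the score moves away from 0.5, not which side of it a patch falls on.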
[0089] Step 6: Post-process the anomaly heatmap to output the localization bounding box and confidence level of the defect.
[0090] As shown in Figure 4, the post-processing includes: using a density clustering algorithm to denoise the anomaly heatmap and performing binarization according to a preset threshold.
[0091] To address the discrete artifacts that minor defects may leave in the initial score map, the anomaly probability score map is spatially reconstructed. Post-processing logic based on density clustering (DBSCAN) is employed, with a density threshold and a minimum number of neighborhood points set. This method performs spatial topological correlation verification to separate high-response defect regions from environmental noise. It filters out fragmented false alarms caused by texture drift, so that the final defect localization bounding boxes have high confidence and morphological accuracy, achieving robust identification of minute defects on the surface of automotive cylinder blocks.
[0092] The DBSCAN density clustering algorithm searches for neighborhood clusters whose density meets the criteria and performs spatial morphological topology correlation verification in pixel space. Unlike traditional binarization methods, this post-processing logic automatically filters out discrete artifacts caused by local random reflections of the material, ambient light fluctuations, or sensor noise, so that only true defect regions with spatial continuity and structural morphological correlation are retained. Every final defect localization bounding box has passed this topological verification, supporting high accuracy and a low false alarm rate in complex industrial production line environments.
[0093] Subsequently, the final defect localization bounding box and defect confidence score are output, as shown in Figure 5.
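The post-processing can be sketched as threshold, cluster, then box. The DBSCAN below is a plain textbook implementation; the threshold, `eps`, and `min_pts` values are hypothetical, and taking the maximum score inside a cluster as its confidence is an assumption, since the text does not say how confidence is derived.

```python
def dbscan(points, eps, min_pts):
    """Plain DBSCAN on 2-D points; returns one label per point (-1 = noise)."""
    def neighbours(i):
        xi, yi = points[i]
        return [j for j, (x, y) in enumerate(points)
                if (x - xi) ** 2 + (y - yi) ** 2 <= eps * eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)            # includes the point itself
        if len(seeds) < min_pts:
            labels[i] = -1               # provisional noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster      # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = neighbours(j)
            if len(reach) >= min_pts:    # expand only from core points
                queue.extend(reach)
    return labels

def detect(heatmap, threshold=0.5, eps=1.5, min_pts=3):
    """Binarise, cluster, and return boxes as (x0, y0, x1, y1, confidence)."""
    points = [(x, y) for y, row in enumerate(heatmap)
              for x, p in enumerate(row) if p >= threshold]
    boxes = {}
    for (x, y), lab in zip(points, dbscan(points, eps, min_pts)):
        if lab == -1:
            continue                     # isolated responses are dropped as noise
        score = heatmap[y][x]
        x0, y0, x1, y1, conf = boxes.get(lab, (x, y, x, y, score))
        boxes[lab] = (min(x0, x), min(y0, y), max(x1, x), max(y1, y), max(conf, score))
    return [boxes[k] for k in sorted(boxes)]

toy_heatmap = [
    [0.1, 0.1, 0.1, 0.1, 0.1, 0.9],     # isolated high response at (5, 0)
    [0.1, 0.8, 0.9, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.95, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
]
print(detect(toy_heatmap))
```

In the toy heatmap the single bright pixel in the corner has no neighbours and is discarded, while the 2 x 2 block survives as one box, which is the behaviour the denoising step is meant to provide.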
[0094] To verify the effectiveness of the dynamic semantic alignment defect detection method and system based on material texture perception provided in this embodiment of the invention, a comprehensive verification experiment was conducted on three representative industrial surface datasets: AES3D (aluminum alloy), VisA (multi-material), and KolektorSDD2 (rough cast iron).
[0095] The training configuration on the actual production line hybrid dataset (AES3D, VisA, KolektorSDD2) is as follows:
[0096] Optimizer: AdamW, weight decay factor set to 0.01.
[0097] Learning rate: the initial learning rate is set to ..., and cosine annealing is used for decay.
[0098] Batch Size: 32.
[0099] Training epochs: Based on the aforementioned convergence analysis, it is set to 1000 epochs.
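For reference, a single-cycle cosine annealing schedule can be written in closed form. The sketch below uses the 1000-epoch horizon from the configuration above; the initial learning rate `BASE_LR` is a placeholder, because the value is missing from this text, and the zero minimum learning rate is likewise an assumption.

```python
import math

def cosine_annealed_lr(epoch, total_epochs, base_lr, min_lr=0.0):
    """Learning rate at `epoch` for one cosine decay from base_lr down to min_lr."""
    progress = min(max(epoch / total_epochs, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

BASE_LR = 1e-4   # hypothetical placeholder; the source does not state the value
for epoch in (0, 250, 500, 750, 1000):
    print(epoch, cosine_annealed_lr(epoch, 1000, BASE_LR))
```

The rate starts at the initial value, passes through half of it at the midpoint, and reaches the minimum at the final epoch, decaying slowly at both ends and fastest in the middle.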
[0100] Furthermore, to ensure that the algorithm can run in real time in industrial settings, deployment tests were conducted in the following environments:
[0101] Processor CPU: Intel Core i9-12900K @ 3.2GHz.
[0102] Graphics card GPU: NVIDIA GeForce RTX 3090 (24GB VRAM).
[0103] Software framework: PyTorch 2.0.1, CUDA 11.8.
[0104] Single-frame inference latency: when processing images [size missing], the total runtime, including material awareness, semantic evolution, and similarity alignment, is less than [amount missing] (about [amount missing]), which fully meets the needs of online real-time detection of workpieces on actual production lines.
[0105] During the experiment, a multimodal architecture based on CLIP (Contrastive Language-Image Pre-training) was used as the backbone network, and its specific hyperparameter configuration is shown in Table 1:
[0106] Table 1 Hyperparameter Configuration Table
[0107] The core component of the method proposed in this embodiment, namely the material-aware prompt agent model, was trained offline on the aforementioned hybrid dataset. The performance evolution during training is shown in Figure 6. Experimental results show that the model achieves initial convergence after approximately 700 rounds.
[0108] To further validate and ensure the model's robustness in handling extreme reflections and complex textures, an additional 300 training iterations were performed after convergence.
[0109] Specifically, the loss metrics showed a significant and rapid downward trend in the first 100 training rounds, and then plateaued and stabilized around the 700th round.
[0110] To further ensure the model's stability when switching between different materials (such as from highly reflective aluminum alloy to dark-textured cast iron), training continued until the 1000th epoch. Key evaluation metrics, including precision, recall, and AUROC, showed high numerical stability on both the training and validation sets between epochs 700 and 800. However, with continued training, although the loss on the training set still decreased slightly, the evaluation metrics on the validation set began to decline slowly, indicating that the model may have begun to overfit.
[0111] Therefore, this invention ultimately selects the parameters around the 800th round, where the performance is most stable, as the optimal model state. This process ensures that the material-aware branch can accurately extract the physical context and guide the dynamic evolution of the semantic vector, thereby achieving optimal generalization performance in subsequent zero-shot detection tasks.
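One simple way to pick the "most stable" checkpoint is to compare a moving average of the validation metric across saved epochs rather than a single peak value. The sketch below is an assumption about how such a selection could be automated, not the procedure used here; the function name, window size, and the scores are synthetic and are not results from this document.

```python
def select_checkpoint(val_scores, window=3):
    """Return the epoch whose centred moving average of validation scores is highest.

    val_scores maps epoch -> validation metric (for example AUROC) at saved checkpoints.
    """
    epochs = sorted(val_scores)
    half = window // 2
    best_epoch, best_avg = None, float("-inf")
    for i, epoch in enumerate(epochs):
        lo, hi = max(0, i - half), min(len(epochs), i + half + 1)
        avg = sum(val_scores[e] for e in epochs[lo:hi]) / (hi - lo)
        if avg > best_avg:
            best_epoch, best_avg = epoch, avg
    return best_epoch

# Synthetic numbers for illustration only; they are not measurements from the source.
toy_scores = {600: 0.90, 700: 0.94, 800: 0.95, 900: 0.93, 1000: 0.91}
print(select_checkpoint(toy_scores))
```

Averaging over neighbouring checkpoints favours an epoch that sits on a plateau over one that spikes once, which matches the stated preference for stability over a single best value.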
[0112] As shown in Figures 7 to 10, compared with WinCLIP, which uses traditional static prompt words, the method proposed in this invention improves accuracy, stability, and the ability to detect minute defects. It accurately locates minute defects without producing false-defect noise, and can thus distinguish true defects from false ones.
[0113] Table 2 Comparison of Detection Accuracy
[0114] As shown in Table 3, the method proposed in this embodiment has a high degree of adaptability to different physical materials (high reflectivity and rough texture).
[0115] Table 3 Detection rates for different physical materials
[0116] In a second aspect, as shown in Figure 11, this embodiment of the invention provides a dynamic semantic alignment defect detection system based on material texture perception, used to execute the dynamic semantic alignment defect detection method based on material texture perception described in any one of the first aspects above, including:
[0117] The feature extraction module is used to input the image to be detected into the visual coding branch and the material perception branch respectively, so as to extract the visual patch feature set and material context vector in parallel;
[0118] The semantic construction module is used to extract the initial semantic feature vector of preset defective text prompt words using a text encoder;
[0119] The semantic evolution module is used to input the material context vector into a pre-trained cue agent network, and fuse the generated material offset with the initial semantic feature vector to output the evolved semantic features;
[0120] The similarity calculation module is used to calculate the similarity between the visual patch feature set and the evolutionary semantic features;
[0121] The probability reconstruction module is used to calculate the probability score of each patch belonging to the anomaly category based on the similarity using temperature parameters, and to generate an anomaly heatmap based on the probability scores of all patches.
[0122] The defect output module performs post-processing on the abnormal heatmap and outputs the location bounding box and confidence level of the defect.
[0123] In a third aspect, embodiments of the present invention provide a computer-readable storage medium having a computer program stored thereon, wherein the program, when executed, implements the dynamic semantic alignment defect detection method based on material texture perception as described in any of the first aspects above.
[0124] In a fourth aspect, embodiments of the present invention provide a storage device, including a storage medium and a processor, wherein the storage medium stores a computer program, and when the program is executed by the processor, it implements the dynamic semantic alignment defect detection method based on material texture perception as described in any of the first aspects above.
[0125] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0126] Obviously, those skilled in the art can make various modifications and variations to this invention without departing from its spirit and scope. Therefore, if these modifications and variations fall within the scope of the claims of this invention and their equivalents, then this invention should also include these modifications and variations.
[0127] Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention. Those skilled in the art can make modifications, alterations, substitutions and variations to the above embodiments within the scope of the present invention.
Claims
1. A dynamic semantic alignment defect detection method based on material texture perception, characterized in that it comprises the following steps: Step 1: input the image to be inspected into a visual coding branch and a material context awareness branch respectively, so as to extract a visual patch feature set and a material context vector in parallel; Step 2: use a text encoder to extract the initial semantic feature vector of preset defect text prompt words; Step 3: input the material context vector into a pre-trained cue agent network to calculate a material offset, and fuse the initial semantic feature vector with the material offset to generate evolved semantic features; Step 4: calculate the similarity between each visual patch feature in the visual patch feature set and the evolved semantic features; Step 5: based on the similarity, calculate the probability score of each patch belonging to the anomaly category using a temperature parameter, and generate an anomaly heatmap from the probability scores of all patches; Step 6: post-process the anomaly heatmap to output the location bounding box and confidence level of the defect.
2. The dynamic semantic alignment defect detection method based on material texture perception according to claim 1, characterized in that, in step 1, the visual coding branch extracts a set of local visual patch features of the image through a pre-trained visual encoder, namely the ViT backbone network of the CLIP architecture; the image $I$ is divided into $N$ patches, and the extracted visual feature set is $F = \{f_1, f_2, \ldots, f_N\}$, where each visual feature $f_i$ is a $D$-dimensional vector; the material context awareness branch employs a lightweight convolution operator and feeds the image $I$ in parallel into a material/texture context extractor, which captures the brightness distribution, roughness patterns, and reflection characteristics of the surface and outputs a material context vector $c$ representing the physical properties of the current inspection environment.
3. The dynamic semantic alignment defect detection method based on material texture perception according to claim 2, characterized in that, in step 2, the text encoder $E_{\text{text}}$ maps the preset set of defect text prompt words $P$ into a high-dimensional feature space, generating an initial semantic benchmark $T$ that contains a normal-state vector $t_n$ and an abnormal-state vector $t_a$; extracting the initial semantic feature vector using the text encoder comprises: $T = E_{\text{text}}(P)$; in the formula, $T$ includes the normal vector $t_n$ and the anomaly vector $t_a$ under the standard definition.
4. The dynamic semantic alignment defect detection method based on material texture perception according to claim 3, characterized in that, in step 3, the material context vector $c$ is input into a cue agent network composed of a multilayer perceptron (MLP), which transforms the physical environment attributes into a dynamic offset in the feature space: $\Delta = \mathrm{MLP}(c)$; the initial semantic vector is fused with the material offset to derive evolved semantic features $T'$ adapted to the current specific material environment, thereby achieving a mapping from general semantics to physical-environment semantics; the material offset $\Delta$ is used to fine-tune the boundary between abnormal and normal in the feature space according to the material properties; the evolved semantic features $T'$ include a normal vector $t_n'$ and an anomaly vector $t_a'$ adapted to the current material.
5. The dynamic semantic alignment defect detection method based on material texture perception according to claim 4, characterized in that, in step 4, patch-level alignment is performed based on the evolved semantic features: the local features of the image are compared patch by patch with the semantic benchmark under the current material constraints using cosine similarity, calculated as: $s_i = \dfrac{f_i \cdot t}{\lVert f_i \rVert \, \lVert t \rVert}$, where $f_i$ is the $i$-th visual patch feature and $t$ is a vector of the evolved semantic features (the normal vector or the anomaly vector).
6. The dynamic semantic alignment defect detection method based on material texture perception according to claim 5, characterized in that, in step 5, a Softmax function with a temperature parameter $\tau$ is used to calculate the probability score $p_i$ of each patch belonging to the anomaly category: $p_i = \dfrac{\exp(s_i^{a}/\tau)}{\exp(s_i^{n}/\tau) + \exp(s_i^{a}/\tau)}$, where $s_i^{n}$ and $s_i^{a}$ are the similarities between the $i$-th patch and the normal and anomaly vectors, respectively; the probability scores $p_i$ of all patches are arranged according to their spatial positions and restored to the original resolution to form the anomaly heatmap.
7. The dynamic semantic alignment defect detection method based on material texture perception according to claim 6, characterized in that, in step 6, the post-processing includes: denoising the anomaly heatmap using a density clustering algorithm and binarizing it according to a preset threshold.
8. A dynamic semantic alignment defect detection system based on material texture perception, characterized in that the system is used to perform the dynamic semantic alignment defect detection method based on material texture perception according to any one of claims 1 to 7, and comprises: a feature extraction module, used to input the image to be inspected into the visual coding branch and the material context awareness branch respectively, so as to extract the visual patch feature set and the material context vector in parallel; a semantic construction module, used to extract the initial semantic feature vector of the preset defect text prompt words using a text encoder; a semantic evolution module, used to input the material context vector into a pre-trained cue agent network and to fuse the generated material offset with the initial semantic feature vector to output the evolved semantic features; a similarity calculation module, used to calculate the similarity between the visual patch feature set and the evolved semantic features; a probability reconstruction module, used to calculate, based on the similarity and using a temperature parameter, the probability score of each patch belonging to the anomaly category, and to generate an anomaly heatmap from the probability scores of all patches; and a defect output module, used to post-process the anomaly heatmap and to output the location bounding box and confidence level of the defect.
9. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the dynamic semantic alignment defect detection method based on material texture perception according to any one of claims 1 to 7.
10. A storage device comprising a storage medium and a processor, the storage medium storing a computer program, characterized in that, when the processor executes the computer program, it implements the dynamic semantic alignment defect detection method based on material texture perception according to any one of claims 1 to 7.
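For illustration only, the semantic evolution and similarity steps recited in claims 4 and 5 can be sketched as follows. The sketch assumes a cue agent network with a single hidden layer and ReLU activation, and it assumes that fusion is an element-wise addition of the offset to the initial semantic vector; the claims say only that the two are "fused", so the additive form is an assumption. All function names, weights, and vectors are hypothetical toy values.

```python
import math


def mlp_offset(context, w1, b1, w2, b2):
    """One-hidden-layer perceptron with ReLU (assumed architecture).

    Maps a material context vector to an offset in the semantic feature space.
    w1, w2 are weight matrices given as lists of rows; b1, b2 are bias lists.
    """
    hidden = [max(0.0, sum(w * x for w, x in zip(row, context)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]


def fuse(initial, offset):
    """Additive fusion of the initial semantic vector and the offset (assumed)."""
    return [t + d for t, d in zip(initial, offset)]


def cosine_similarity(u, v):
    """Cosine similarity between two vectors; returns 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy example in a 2-dimensional feature space.
context = [1.0, 0.0]                      # material context vector
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [[0.5, 0.0], [0.0, 0.5]], [0.0, 0.0]

offset = mlp_offset(context, w1, b1, w2, b2)   # [0.5, 0.0]
initial_anomaly = [0.0, 1.0]                   # initial anomaly text vector
evolved_anomaly = fuse(initial_anomaly, offset)

patch_feature = [1.0, 2.0]                     # one visual patch feature
similarity = cosine_similarity(patch_feature, evolved_anomaly)
```

In this toy case the offset rotates the anomaly vector toward the patch feature, so the similarity rises compared with the unshifted text vector, which is the effect the material offset is intended to have on the normal/abnormal boundary.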