A transformer defect detection method based on pixel-text matching of a CLIP model

By employing a pixel-to-text matching method based on the CLIP model and utilizing Vision Transformer and Text Transformer for transformer defect detection, the problem of insufficient datasets is solved, achieving efficient defect detection and recognition, and improving the accuracy and flexibility of detection.

CN118918095B · Active · Publication Date: 2026-06-19 · ANHUI UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: ANHUI UNIV
Filing Date: 2024-08-13
Publication Date: 2026-06-19


Abstract

This invention discloses a transformer defect detection method based on the CLIP model and pixel-text matching, belonging to the field of transformer equipment defect detection technology. It addresses the problem of effectively detecting transformer defects when relevant datasets are insufficient. The invention transforms the original image-text matching in CLIP into pixel-text matching and uses pixel-text score maps to guide transformer defect detection. Defect data of transformer equipment is collected and converted into image-text pairs, which are then input into the model. Multimodal data is mapped to the same multimodal space, and image embeddings and text embeddings representing "normal" and "abnormal" states are extracted. Pixel-text score maps are calculated and fed into the FPN image decoder, which is supervised using real labels. After training, the model is applied to a dataset of transformer equipment defects to obtain the final segmentation results of the defects.

Description

Technical Field

[0001] This invention belongs to the field of transformer equipment defect detection technology, and relates to a transformer defect detection method based on CLIP model pixel-text matching.

Background Technology

[0002] Transformers are indispensable equipment in power transmission systems. Their use reduces energy loss during transmission, improves transmission efficiency, and facilitates the conversion and distribution of electrical energy to meet the needs of different users. However, transformers may experience abnormalities or hidden dangers during operation, which could lead to malfunctions in the power transmission system. These defects include oil leaks, metal corrosion, and component damage. Oil leaks can cause overheating and increase the risk of insulation aging, while metal corrosion can make the transformer more susceptible to mechanical damage. If these defects are not repaired promptly, they can lead to power outages and even affect the effectiveness and reliability of power supply. Therefore, regular inspection of transformers is crucial to detect and address potential problems early, ensuring the safe operation of the power system.

[0003] Defect detection is particularly challenging because image segmentation data for transformer equipment defects is difficult to obtain. In defect detection, models need to generalize to anomalies across different domains, where foreground features, background features, and anomaly regions can vary significantly, such as defects on different products or different power equipment. Zero-shot anomaly detection (ZSAD) requires a detection model trained on auxiliary data to detect anomalies without any training samples from the target dataset. Recently, pre-trained vision-language models, such as CLIP (Contrastive Language–Image Pre-training), have demonstrated strong zero-shot recognition capabilities on various visual tasks. However, CLIP's effectiveness in zero-shot anomaly detection is limited by a lack of defect-related knowledge and the complexity of transferring image-text pair matching to pixel prediction.

Summary of the Invention

[0004] The technical solution of this invention addresses the problem of how to effectively detect transformer defects when relevant datasets are insufficient.

[0005] The present invention solves the above-mentioned technical problems through the following technical solutions:

[0006] A transformer defect detection method based on CLIP model pixel-text matching includes:

[0007] Step 1: Collect a dataset of transformer defects as a training set for pre-training the CLIP model;

[0008] Step 2: In terms of visual feature extraction, Vision Transformer is selected as the visual encoder to extract image features of power equipment defects and a linear layer is added to adapt to the transformer defect detection task; in terms of text prompts, the text prompts are written in the form of "normal" and "abnormal", and Text Transformer is selected as the text encoder to extract text embeddings representing "normal" and "abnormal", so that the model focuses on the abnormal regions of the image rather than the semantics of the object.

[0009] Step 3: Use pixel-text score maps to guide the CLIP model in dense prediction;

[0010] Step 4: Use segmentation loss as the training objective between dense prediction results and true labels, and use pixel-text matching loss to minimize the distance between positive text and image, and maximize the distance between negative text and image.

[0011] Step 5: Collect target defect data and convert it into multimodal data. Input the multimodal data into the CLIP model to obtain the final segmentation result.

[0012] Furthermore, the method described in step 2 for selecting the Vision Transformer as a visual encoder to extract image features of power equipment defects and adding a linear layer is as follows:

[0013] (1) Select Vision Transformer as the visual backbone, and denote the features of layers 1 to 12 as $\{x_i\}_{i=1}^{12}$. The features of the last layer are extracted as the visual features for dense prediction, represented as $x_{12} \in \mathbb{R}^{H_{12} W_{12} \times C}$, where $H_{12}$, $W_{12}$ and $C$ correspond to the height, width, and number of channels of the 12th-layer backbone feature map, respectively;

[0014] (2) Global features are obtained by performing global average pooling on the 12th-layer features, thereby obtaining $\bar{x}_{12} \in \mathbb{R}^{1 \times C}$;

[0015] (3) By concatenating the global features with the feature map and passing them to the multi-head self-attention (MHSA) layer, a combined representation is obtained. The formula for the combined representation is:

[0016] $[\bar{z}, z] = \mathrm{MHSA}([\bar{x}_{12}, x_{12}])$ (1)

[0017] where $x_{12}$ represents the features of the 12th layer of the Vision Transformer, and $\bar{x}_{12}$ represents the global features obtained by performing global average pooling on those features;

[0018] (4) The combined representation $[\bar{z}, z]$ is used as the output of the image encoder, and a linear layer is added after the image encoder.

[0019] Furthermore, the method described in step 2 for writing the text prompts in the forms of "normal" and "abnormal", and selecting TextTransformer as the text encoder to extract the text embeddings representing "normal" and "abnormal", is as follows:

[0020] Using learnable textual context, the concepts of "normal" and "abnormal" are learned by directly optimizing the context through backpropagation. Descriptive prompts containing the "normal" and "abnormal" states are used, and the average descriptor associated with each state is calculated as follows:

[0021] $e_k = \frac{1}{m_k} \sum_{i=1}^{m_k} e_k^{(i)}$ (2)

[0022] where $m_k$ is the number of text expressions representing state $k$, and $e_k^{(i)}$ is the embedding of the i-th text expression representing that state;

[0023] The input to the text encoder is represented as follows:

[0024] $[p, e_k], \quad k \in \{1, 2\}$ (3)

[0025] where $e_1$ and $e_2$ represent the embeddings of the "normal" and "abnormal" states, respectively, and $p$ is the corresponding learnable textual context.

[0026] Furthermore, the method for guiding the CLIP model to perform dense prediction using pixel-text score maps described in step 3 is as follows:

[0027] After creating different text embeddings for the "normal" and "abnormal" states, the text features $t$ are obtained. Subsequently, the pixel-text score map is calculated using the language-compatible feature map $z$ and the text features $t$ as follows:

[0028] $s = \hat{z}\,\hat{t}^{\top}$ (4)

[0029] where $\hat{z}$ and $\hat{t}$ are the $\ell_2$-normalized versions of $z$ and $t$ along the channel dimension, and the pixel-text score map $s$ represents the result of pixel-text matching;

[0030] The score map is concatenated to the last feature map $x_{12}$ and used as input to the image decoder to incorporate language priors: $x_{12}' = [x_{12}, s]$;

[0031] Using Semantic FPN as the image decoder enables the model to better recover image details; that is:

[0032] $M = \mathcal{D}(x_{12}')$ (5)

[0033] where $\mathcal{D}(\cdot)$ represents the image decoder and $M$ is the dense prediction result;

[0034] Binary cross-entropy loss is chosen as the training objective between the dense prediction results and the true labels:

[0035] $\mathcal{L}_{\mathrm{BCE}} = -\left[ y \log M + (1 - y) \log (1 - M) \right]$ (6)

[0036] where $\mathcal{L}_{\mathrm{BCE}}$ is the binary cross-entropy loss, $M$ represents the model's prediction result, and $y$ represents the true label. The loss approaches 0 when the model predicts correctly and becomes large when the prediction is incorrect.

[0037] The pixel-to-text matching loss used aims to transform image-level features into pixel-level features, replacing the cosine similarity in the contrastive loss function of the original CLIP model with a pixel-to-text score map.

[0038] Furthermore, the pixel-text matching loss is calculated as follows:

[0039] (7)

[0040] In equation (7), the loss is computed from the pixel-text score map of positive samples, the pixel-text score map of negative samples, and a hyperparameter.

[0041] An electronic device includes a memory and a processor, the memory being used to store a program that supports the processor in executing the above-described pixel-text matching transformer defect detection method based on the CLIP model, the processor being configured to execute the program stored in the memory.

[0042] A storage medium storing a computer program, which, when executed by a processor, performs the steps of the above-described pixel-text matching transformer defect detection method based on the CLIP model.

[0043] The advantages of this invention are:

[0044] This invention transforms the original image-text matching in CLIP into pixel-text matching and uses pixel-text score maps to guide the detection of transformer defects. First, defect data of transformer equipment is collected and converted into image-text pairs, which are then input into the model to map multimodal data into the same multimodal space. Next, image embeddings and text embeddings for representing "normal" and "abnormal" states are extracted, and then pixel-text score maps are calculated. These score maps are fed into the FPN image decoder and supervised with real labels. After training, the model is applied to the transformer equipment defect dataset to obtain the final segmentation result of transformer equipment defects. This invention transfers CLIP's powerful zero-shot recognition capability to the field of image segmentation, enabling effective transformer defect detection even without sufficient relevant datasets. The advantages are as follows:

1) Through multimodal learning, it fully utilizes the information from multimodal data, improving the accuracy of defect data annotation;

2) Leveraging CLIP's powerful zero-shot recognition capability, under natural language supervision, it can accurately detect the same defect even with significant scene variations, further improving the accuracy of transformer defect detection;

3) In the language domain, it independently learns text representations for abnormal and typical scenes, helping to bridge knowledge gaps and guiding the model to learn the concepts of "normal" and "abnormal";

4) In the image domain, visual features are fine-tuned, and the combination of global features and feature maps is used as the output of the image encoder, preserving sufficient spatial information and aligning well with language features; a linear layer is added after the image encoder, and fine-tuning the linear layer further refines the visual representation to adapt to the transformer defect detection task.

Attached Figure Description

[0045] Figure 1 is a flowchart of the transformer defect detection method based on CLIP model pixel-text matching according to Embodiment 1 of the present invention.

Detailed Implementation

[0046] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below in conjunction with the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0047] The technical solution of the present invention will be further described below with reference to the accompanying drawings and specific embodiments:

[0048] Embodiment 1

[0049] As shown in Figure 1, the transformer defect detection method based on CLIP model pixel-text matching in this embodiment includes the following:

[0050] 1. Visual Feature Extraction

[0051] The CLIP model, pre-trained with a large number of image-text pairs, learns strong semantic information, associating image content with extensive natural language descriptions during training. However, transferring the knowledge of the CLIP model to dense prediction is challenging, as the former requires consideration of both image and text representations, while the latter expects pixel-level output. To better adapt to dense prediction tasks, this embodiment utilizes pixel-text score maps to guide the CLIP model in dense prediction.

[0052] In this embodiment, the Vision Transformer is selected as the visual backbone, and the features of layers 1 to 12 are denoted as $\{x_i\}_{i=1}^{12}$. The features of the last layer are extracted as the visual features for dense prediction, represented as $x_{12} \in \mathbb{R}^{H_{12} W_{12} \times C}$, where $H_{12}$, $W_{12}$ and $C$ correspond to the height, width, and number of channels of the 12th-layer backbone feature map, respectively. Global features $\bar{x}_{12} \in \mathbb{R}^{1 \times C}$ are obtained by performing global average pooling on the 12th-layer features. By concatenating the global features with the feature map and passing them to a multi-head self-attention (MHSA) layer, a richer combined representation is obtained, the formula for which is:

[0053] $[\bar{z}, z] = \mathrm{MHSA}([\bar{x}_{12}, x_{12}])$ (1)

[0054] where $x_{12}$ represents the features of the 12th layer of the Vision Transformer, and $\bar{x}_{12}$ represents the global features obtained by performing global average pooling on those features.

[0055] Subsequently, this embodiment uses the combined representation $[\bar{z}, z]$ as the output of the image encoder. The combined representation not only preserves sufficient spatial information but also aligns well with language features. Furthermore, this embodiment adds a linear layer after the image encoder; fine-tuning this linear layer can further refine the visual representation to suit transformer defect detection tasks.
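The pooling, concatenation, and attention steps of paragraph [0052] can be sketched in plain Python as below. This is a minimal illustration, not the patented implementation: the names `x12`, `x_bar`, `z_bar` and `z` and the toy values are assumptions, and the attention uses a single head with identity query/key/value projections, whereas a real MHSA layer has learned projections and several heads.

```python
import math

def global_average_pool(tokens):
    """Average N token vectors (each of length C) into one global vector."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (a simplification)."""
    c = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(c) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(c)])
    return out

# Hypothetical 12th-layer feature map: H12 * W12 = 4 tokens, C = 3 channels.
x12 = [[1.0, 0.0, 0.0],
       [0.0, 1.0, 0.0],
       [0.0, 0.0, 1.0],
       [1.0, 1.0, 1.0]]

x_bar = global_average_pool(x12)           # global feature of length C
combined = self_attention([x_bar] + x12)   # attend over the concatenation [x_bar, x12]
z_bar, z = combined[0], combined[1:]       # global part and per-pixel feature map
```

The output keeps one vector per spatial position in `z`, which is what allows the later pixel-text matching to operate at pixel level.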

[0056] 2. Text prompts

[0057] Reducing the domain gap between vision and language can significantly improve the performance of CLIP models on downstream tasks, and one effective approach is to improve text prompts.

[0058] The CLIP model has demonstrated strong zero-shot recognition capabilities in image classification tasks, but its zero-shot anomaly detection capability is weak. One reason for this is that previous methods focused more on modeling the semantic categories of foreground objects than on the anomalies or normality in the image. This embodiment employs a Text Transformer as the text encoder and learns object-independent text prompts, which capture general normality and abnormality in the image regardless of the foreground objects. This allows the model to focus on anomalous regions of the image rather than object semantics. Trained on a relevant dataset containing anomalies, it can better distinguish normal from anomalous regions for different types of objects.

[0059] Unlike the original CLIP model, which uses manually designed templates such as "a photo of a [cls]", CoOp introduces learnable textual context as text prompts, resulting in better transferability in downstream classification tasks. Inspired by CoOp, this embodiment uses learnable textual context to learn the concepts of "normal" and "abnormal" by directly optimizing the context through backpropagation. This approach not only improves the model's flexibility and accuracy in handling different tasks but also enhances its ability to be applied in complex scenarios. This embodiment uses descriptive prompts that include the "normal" and "abnormal" states. For example, descriptors representing the "normal" state might contain words such as "perfect" or "normal". Conversely, descriptors representing the "abnormal" state might contain words such as "damaged" or "broken". The average descriptor associated with each of the "normal" and "abnormal" states is calculated:

[0060] $e_k = \frac{1}{m_k} \sum_{i=1}^{m_k} e_k^{(i)}$ (2)

[0061] where $m_k$ is the number of text expressions for state $k$, and $e_k^{(i)}$ represents the embedding of the i-th text expression of that state.

[0062] The input to the text encoder is represented as follows:

[0063] $[p, e_k], \quad k \in \{1, 2\}$ (3)

[0064] where $e_1$ and $e_2$ represent the embeddings of the "normal" and "abnormal" states, respectively, and $p$ is the corresponding learnable textual context.
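As a small illustration of the averaging in equation (2) and the text-encoder input in equation (3), the sketch below averages a few expression embeddings per state and prepends a context to each result. All vectors are made-up two-dimensional values, the function names are hypothetical, and placing the context before the state embedding is an assumption borrowed from CoOp-style prompting; in training, `p` would be updated by backpropagation rather than fixed.

```python
def average_descriptor(embeddings):
    """Mean of the m_k expression embeddings for one state (equation (2))."""
    m_k = len(embeddings)
    return [sum(col) / m_k for col in zip(*embeddings)]

def text_encoder_input(p, e_k):
    """Sequence fed to the text encoder: context vectors followed by the state embedding."""
    return p + [e_k]

# Hypothetical embeddings of descriptive expressions for each state.
normal_expr = [[1.0, 0.0], [0.8, 0.2]]      # e.g. "perfect", "normal"
abnormal_expr = [[0.0, 1.0], [0.2, 0.6]]    # e.g. "damaged", "broken"

e1 = average_descriptor(normal_expr)        # "normal" state embedding
e2 = average_descriptor(abnormal_expr)      # "abnormal" state embedding

# Learnable textual context with 2 context vectors (fixed here for illustration).
p = [[0.1, 0.1], [0.2, 0.2]]

inputs = [text_encoder_input(p, e) for e in (e1, e2)]
```

Because the prompts mention only states and not object names, the same two inputs can be reused across different equipment types.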

[0065] 3. Semantic-guided pixel-by-pixel prediction

[0066] After creating different text embeddings for the "normal" and "abnormal" states, the module concatenates them to obtain the text features $t$. Subsequently, this embodiment uses the language-compatible feature map $z$ and the text features $t$ to calculate the pixel-text score map, enabling the model to perform pixel-level predictions:

[0067] $s = \hat{z}\,\hat{t}^{\top}$ (4)

[0068] where $\hat{z}$ and $\hat{t}$ are the $\ell_2$-normalized versions of $z$ and $t$ along the channel dimension, and the pixel-text score map $s$ represents the result of pixel-to-text matching. The score map can be viewed as a lower-resolution segmentation result. Furthermore, this embodiment concatenates the score map to the last feature map $x_{12}$ as input to the image decoder, $x_{12}' = [x_{12}, s]$, to incorporate language priors and allow the image decoder to learn more information. This embodiment uses Semantic FPN as the image decoder, enabling the model to better recover image details:

[0069] $M = \mathcal{D}(x_{12}')$ (5)

[0070] where $\mathcal{D}(\cdot)$ represents the image decoder and $M$ is the dense prediction result.
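The score-map computation and the concatenation that feeds the decoder can be sketched as follows. The toy values, the names `x_last` and `decoder_input`, and the per-pixel argmax used to read the score map as a coarse segmentation are illustrative assumptions; the Semantic FPN decoder itself is not reproduced here.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length along its channel dimension."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def score_map(z, t):
    """Cosine similarity between every pixel feature and every text feature."""
    z_hat = [l2_normalize(px) for px in z]
    t_hat = [l2_normalize(tx) for tx in t]
    return [[sum(a * b for a, b in zip(px, tx)) for tx in t_hat] for px in z_hat]

# Hypothetical language-compatible feature map: 3 pixels, C = 2 channels.
z = [[2.0, 0.0], [0.0, 3.0], [1.0, 2.0]]
# Hypothetical text features for the "normal" and "abnormal" states.
t = [[1.0, 0.0], [0.0, 1.0]]

s = score_map(z, t)                              # shape: 3 pixels x 2 states

# Hypothetical last backbone feature map with the same number of pixels.
x_last = [[0.5, 0.5], [0.1, 0.9], [0.3, 0.7]]
decoder_input = [feat + scores for feat, scores in zip(x_last, s)]

# Reading the score map as a coarse segmentation: 0 = normal, 1 = abnormal.
coarse_labels = [row.index(max(row)) for row in s]
```

Each row of `decoder_input` has C + 2 channels, so the decoder sees both the visual features and the language-derived scores.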

[0071] This embodiment uses segmentation loss to better utilize pixel-text score maps for image segmentation. This embodiment selects binary cross-entropy loss as the training objective between dense prediction results and ground truth labels:

[0072] $\mathcal{L}_{\mathrm{BCE}} = -\left[ y \log M + (1 - y) \log (1 - M) \right]$ (6)

[0073] where $\mathcal{L}_{\mathrm{BCE}}$ is the binary cross-entropy loss, $M$ represents the model's prediction result, and $y$ represents the true label. The loss approaches 0 when the model predicts correctly and becomes large when the prediction is incorrect.
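A short numeric check of the binary cross-entropy behaviour described above is given below. It assumes the predictions are per-pixel probabilities in (0, 1) and that the loss is averaged over pixels; the clamping constant `eps` and the function name are illustrative choices rather than details from the patent.

```python
import math

def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for m, y in zip(pred, target):
        m = min(max(m, eps), 1.0 - eps)   # clamp to avoid log(0)
        total += -(y * math.log(m) + (1 - y) * math.log(1 - m))
    return total / len(pred)

labels = [1, 0]                            # pixel 0 is defective, pixel 1 is normal
good = bce_loss([0.99, 0.01], labels)      # confident and correct
bad = bce_loss([0.01, 0.99], labels)       # confident and wrong
```

The correct prediction gives a loss close to zero, while the wrong one is penalised heavily, matching the description in paragraph [0073].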

[0074] The original CLIP model minimizes the distance between positive sample image features and text features, and maximizes the distance between negative sample image features and text features through a contrastive loss function. To address the dense prediction problem adapted to transformer equipment defect detection, this embodiment uses a pixel-to-text matching loss designed to transform image-level features into pixel-level features, replacing the cosine similarity in the contrastive loss function of the original CLIP model with a pixel-to-text score map. The pixel-to-text matching loss is calculated as follows:

[0075] (7)

[0076] In equation (7), the loss is computed from the pixel-text score map of positive samples, the pixel-text score map of negative samples, and a hyperparameter with a default value of 0.05.
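Equation (7) is not reproduced in this text, so the sketch below is only one plausible reading, not the patent's formula. It assumes a CLIP-style contrastive form in which the per-pixel positive and negative scores pass through a temperature-scaled softmax, and it assumes the hyperparameter with default 0.05 plays the role of the temperature `tau`. The function name and sample scores are hypothetical.

```python
import math

def pixel_text_matching_loss(s_pos, s_neg, tau=0.05):
    """Mean over pixels of -log(exp(s+/tau) / (exp(s+/tau) + exp(s-/tau))).

    Written as log(1 + exp((s- - s+) / tau)), which is algebraically the same.
    Treating tau as the 0.05 hyperparameter is an assumption.
    """
    total = 0.0
    for sp, sn in zip(s_pos, s_neg):
        total += math.log1p(math.exp((sn - sp) / tau))
    return total / len(s_pos)

# Hypothetical cosine scores in [-1, 1] for two pixels.
aligned = pixel_text_matching_loss([0.9, 0.8], [0.1, 0.2])      # positive text wins
misaligned = pixel_text_matching_loss([0.1, 0.2], [0.9, 0.8])   # negative text wins
```

Under this reading, minimising the loss pushes each pixel's score for the positive text above its score for the negative text, which corresponds to the stated goal of pulling positive text and image together and pushing negative text and image apart.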

[0077] This invention employs multimodal data representation to depict defect features in transformer equipment, improving defect recognition accuracy. It utilizes a contrastive image-text pair paradigm to learn high-quality visual representations from natural language supervision. This method leverages pre-trained knowledge from CLIP to transfer the knowledge learned from image-text pairs to dense prediction for transformer defect detection. Using learnable textual context, it directly optimizes the context through backpropagation to learn the concepts of "normal" and "abnormal", where the text prompts include both normal and abnormal information independent of the foreground object. This results in better defect detection than current text prompt methods that are designed for object semantic alignment. By utilizing pixel-text score maps, the image-text matching problem is transformed into a pixel-text matching problem, guiding the model to perform pixel-by-pixel predictions. Simultaneously, the contrastive loss function is changed to a pixel-text matching loss function to adapt to the dense prediction task. This invention concatenates the pixel-text score map to the last feature map as input to the image decoder to incorporate language priors and allow the image decoder to learn more information.

[0078] Embodiment 2

[0079] An electronic device includes a memory and a processor, the memory being used to store a program that supports the processor in executing the pixel-text matching transformer defect detection method based on the CLIP model in Embodiment 1, the processor being configured to execute the program stored in the memory.

[0080] Embodiment 3

[0081] A storage medium storing a computer program, which, when executed by a processor, performs the steps of the pixel-text matching transformer defect detection method based on the CLIP model in Embodiment 1.

[0082] The above embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A transformer defect detection method based on CLIP model pixel-text matching, characterized in that it includes:

Step 1: Collect a dataset of transformer defects as a training set for pre-training the CLIP model;

Step 2: For visual feature extraction, Vision Transformer is selected as the visual encoder to extract image features of power equipment defects, and a linear layer is added to adapt to the transformer defect detection task. For text prompts, the text prompts are written in the forms of "normal" and "abnormal", and Text Transformer is selected as the text encoder to extract the text embeddings representing "normal" and "abnormal", so that the model focuses on the abnormal regions of the image rather than the semantics of the object. The method for writing text prompts in the forms of "normal" and "abnormal", and selecting Text Transformer as the text encoder to extract the text embeddings representing "normal" and "abnormal", is as follows: Using learnable textual context, the concepts of "normal" and "abnormal" are learned by directly optimizing the context through backpropagation. Descriptive prompts containing the "normal" and "abnormal" states are used, and the average descriptor associated with each state is calculated as follows: $e_k = \frac{1}{m_k} \sum_{i=1}^{m_k} e_k^{(i)}$, where $m_k$ is the number of text expressions representing state $k$, and $e_k^{(i)}$ is the embedding of the i-th text expression representing that state; the input to the text encoder is represented as follows: $[p, e_k], \; k \in \{1, 2\}$, where $e_1$ and $e_2$ represent the embeddings of the "normal" and "abnormal" states, respectively, and $p$ is the corresponding learnable textual context;

Step 3: Use the pixel-text score map to guide the CLIP model in dense prediction. The specific method is as follows: After creating different text embeddings for the "normal" and "abnormal" states, the text features $t$ are obtained. Subsequently, the pixel-text score map is calculated using the language-compatible feature map $z$ and the text features $t$ as follows: $s = \hat{z}\,\hat{t}^{\top}$, where $\hat{z}$ and $\hat{t}$ are the $\ell_2$-normalized versions of $z$ and $t$ along the channel dimension, and the pixel-text score map $s$ represents the result of pixel-text matching; the score map is concatenated to the last feature map and used as input to the image decoder to incorporate language priors: $x_{12}' = [x_{12}, s]$, with $x_{12} \in \mathbb{R}^{H_{12} W_{12} \times C}$, where $H_{12}$, $W_{12}$ and $C$ correspond to the height, width, and number of channels of the 12th-layer backbone feature map, respectively; using Semantic FPN as the image decoder enables the model to better recover image details, that is: $M = \mathcal{D}(x_{12}')$, where $\mathcal{D}(\cdot)$ represents the image decoder and $M$ is the dense prediction result; binary cross-entropy loss is chosen as the training objective between the dense prediction results and the true labels: $\mathcal{L}_{\mathrm{BCE}} = -\left[ y \log M + (1 - y) \log (1 - M) \right]$, where $\mathcal{L}_{\mathrm{BCE}}$ is the binary cross-entropy loss, $M$ represents the model prediction result, and $y$ represents the true label. The pixel-to-text matching loss used aims to transform image-level features into pixel-level features, replacing the cosine similarity in the contrastive loss function of the original CLIP model with a pixel-to-text score map;

Step 4: Use segmentation loss as the training objective between dense prediction results and true labels, and use pixel-text matching loss to minimize the distance between positive text and image, and maximize the distance between negative text and image;

Step 5: Collect target defect data and convert it into multimodal data. Input the multimodal data into the CLIP model to obtain the final segmentation result.

2. The transformer defect detection method based on CLIP model pixel-text matching according to claim 1, characterized in that the method described in step 2 for selecting Vision Transformer as a visual encoder to extract image features of power equipment defects and adding a linear layer is as follows: (1) Select Vision Transformer as the visual backbone, and denote the features of layers 1 to 12 as $\{x_i\}_{i=1}^{12}$. The features of the last layer are extracted as the visual features for dense prediction, represented as $x_{12} \in \mathbb{R}^{H_{12} W_{12} \times C}$; (2) Global features are obtained by performing global average pooling on the 12th-layer features, thereby obtaining $\bar{x}_{12} \in \mathbb{R}^{1 \times C}$; (3) By concatenating the global features with the feature map and passing them to the multi-head self-attention (MHSA) layer, a combined representation is obtained. The formula for the combined representation is: $[\bar{z}, z] = \mathrm{MHSA}([\bar{x}_{12}, x_{12}])$, where $x_{12}$ represents the features of the 12th layer of the Vision Transformer, and $\bar{x}_{12}$ represents the global features obtained by performing global average pooling on those features; (4) The combined representation $[\bar{z}, z]$ is used as the output of the image encoder, and a linear layer is added after the image encoder.

3. The transformer defect detection method based on CLIP model pixel-text matching according to claim 2, characterized in that the pixel-to-text matching loss is calculated from the pixel-text score map of positive samples, the pixel-text score map of negative samples, and a hyperparameter.

4. An electronic device, comprising a memory and a processor, characterized in that the memory is used to store a program that supports the processor in executing the pixel-text matching transformer defect detection method based on the CLIP model as described in any one of claims 1 to 3, and the processor is configured to execute the program stored in the memory.

5. A storage medium storing a computer program, characterized in that the computer program, when run by a processor, performs the steps of the pixel-text matching transformer defect detection method based on the CLIP model as described in any one of claims 1 to 3.