A small-sample open-set inspection method for new energy power stations based on a multimodal large model

By combining multimodal large models with image and text information, a generalized small-sample open-set target detection model is constructed, which solves the problems of incomplete information and category confusion in the inspection of new energy power plants, and achieves high-precision unknown category recognition and generalization capabilities.

CN120747665B · Active · Publication Date: 2026-06-12 · STATE POWER INVESTMENT GRP XIONGAN ENERGY CO LTD +2


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: STATE POWER INVESTMENT GRP XIONGAN ENERGY CO LTD
Filing Date: 2025-06-20
Publication Date: 2026-06-12

Application Information

CN120747665B

IPC: G06V10/774; G06V10/82; G06V10/764; G06N3/0455; G06N3/082; G06Q50/06

CPC: G06V10/774; G06V10/82; G06V10/765; G06N3/0455; G06N3/082; G06Q50/06; Y04S10/50

AI Tagging

Application Domain

Data processing applications; Neural learning methods


AI Technical Summary

Technical Problem

Existing inspection methods for new energy power plants suffer from incomplete single-modal image modeling information when faced with small sample sizes. This makes it difficult for the model to form a discriminative decision boundary for zero-sample classes, and it is prone to open set class confusion, making it unable to effectively identify unknown classes.

Method used

By employing a multimodal large model and combining image and text information, a generalized small-sample open-set target detection model is constructed. Through unknown class pseudo-sample mining and a cost-aware unknown class loss function, a fine-grained category decision boundary is formed, thus optimizing the unknown class decision boundary of the model.

Benefits of technology

It improves the detection accuracy and generalization ability of new energy power plant inspection with a small amount of training data, and can effectively identify known classes with small samples and known classes with zero samples, reducing false detections of unknown classes.

✦ Generated by Eureka AI based on patent content.


Abstract

This application discloses a small-sample open-set inspection method for new energy power stations based on a multimodal large model. The method constructs multi-text learnable prompts and, by aggregating task-related text information, guides the detection model to form finer-grained category decision boundaries. Because real unknown class samples are lacking during training, the method first treats the mining of unknown class pseudo-samples as a bipartite graph matching task, adding an unknown class virtual node and constructing a cost matrix to mine unknown class pseudo-samples, and then optimizes the model to form a compact unknown class decision boundary. To address the tendency of small-sample known classes and unknown classes to be confused, the method proposes a cost-aware unknown class optimization loss that considers both the classification and the localization quality of unknown classes, improving the open-set detection performance of the model. The method realizes the transition from a single-modal small vision model to a multimodal large vision model, and achieves good generalized small-sample open-set object detection performance with only a small amount of training data.


Description

Technical Field

[0001] This invention belongs to the field of multimodal large-model inspection technology for new energy power plants, and specifically relates to a small-sample open-set inspection method for new energy power plants based on a multimodal large model.

Background Technology

[0002] With the continuous expansion of the scale of new energy power plants, traditional manual inspection methods are inefficient and cannot meet the inspection needs of large-scale power plants. Existing inspection methods for new energy power plants mainly use drones equipped with infrared thermal imaging cameras to obtain infrared and visible light images, and then use deep learning-based target detection to identify defects. Intelligent inspection technologies such as drones can quickly cover large areas and improve inspection efficiency.

[0003] Deep learning-based methods rely on a large amount of manually labeled data, resulting in a large workload for data preparation and low algorithm execution efficiency. However, in the inspection of new energy power plants, some defects or anomalies may be very rare, leading to a limited number of samples. Multimodal large models can learn effectively in the case of small samples and enhance the model's understanding of new concepts through cross-modal information, which is of great significance for handling rare events in power plants. Lü et al. (Lü Tiangen, Hong Richang, He Jun. Multimodal guided local feature selection small sample learning method [J]. Journal of Software, 2023, 34(05):2068-2082.) proposed a method based on multimodal text feature measurement, which detects small sample targets in images by locally aligning text features with image features and then using prototype measurement. However, the above method is not effective when facing open set tasks. Although object detection methods based on large models have made great strides in the field of zero-shot learning, this object detection method still faces a serious open set problem. The model is prone to detecting unknown classes as known classes with textual prompts with high confidence. Generalized few-shot open set refers to enabling a model to detect known classes in both few and zero samples using a small number of training samples, while rejecting interference from unknown classes. Currently, large models still face the following two challenges in few-shot learning:

[0004] (1) Single-modal image modeling has incomplete information: Single-modal image modeling relies only on visual information and lacks support from other modal data (such as text, sound, etc.). This limits the model's comprehensive understanding of the scene. Without having seen an instance of a specific category, the model has difficulty forming a discriminative decision boundary for zero-shot classes, leading to difficulties in zero-shot recognition.

[0005] (2) Open set category confusion: The model overfits to the text prompts and image training data of known classes and loses the ability to generalize to unknown classes. When the model has never seen the category prompt corresponding to a target, it makes an incorrect prediction and detects the unknown class object as a known class with a similar appearance, causing category confusion.

Summary of the Invention

[0006] To address the shortcomings of the existing technology, this invention proposes a small-sample open-set inspection method for new energy power plants based on a multimodal large model. From the perspective of multimodal modeling, the invention jointly models text and images to construct a generalized small-sample open-set object detection large model. It designs an unknown class pseudo-sample mining method based on bipartite graph matching, multi-text learnable prompts, and a cost-aware unknown class loss function, enabling the detection model to form finer-grained category decision boundaries and achieve higher detection accuracy.

[0007] The technical solution adopted by the present invention to solve the aforementioned technical problem is as follows:

[0008] A method for inspecting new energy power plants based on a multimodal large model and a small sample open set is characterized by the following steps:

[0009] Step 1: Obtain the photovoltaic inspection image set, wind turbine inspection image set, and tower inspection image set during the inspection of new energy power plants. Each image set contains corresponding defect images and non-defect images. The number of images in each image set is no less than 100. Each image data in each image set includes an image with a corresponding defect category label and a corresponding text description. The defect images in each image set cover all known defect categories of the corresponding subject.

[0010] Step 2: Construct the multimodal large-model small-sample open-set detection model

[0011] The detection model includes an end-to-end ViT model, a text processing module, and an unknown class pseudo-sample mining module; the end-to-end ViT model consists of three parts: an image feature embedding module, a Transformer encoder, and an MLP classification head; the end-to-end ViT model is the basic detector.

[0012] The image feature embedding module first divides the input image P0 into multiple equal-sized blocks, then adds a positional code corresponding to the position of the block in image P0, then flattens each block into a one-dimensional vector, and then embeds each of these flattened blocks into a high-dimensional feature vector through a convolutional layer or linear transformation. Finally, a learnable class token is added to the front of the high-dimensional feature vector as a representative of global information, and is ultimately used for classification.

[0013] The Transformer encoder consists of multiple encoder layers. The output of the previous encoder layer serves as the input to the next encoder layer. Each encoder layer includes a first normalization layer, a multi-head attention layer, a second normalization layer, and a feedforward neural network. The feature T0 input to an encoder layer is first processed by the first normalization layer, and the result is input to the multi-head attention layer. The output of the multi-head attention layer is added to the feature T0 through a residual connection. The resulting feature T1 is then processed by the second normalization layer and input to the feedforward neural network. The output of the feedforward neural network is added to the feature T1 through a residual connection, and the resulting feature T2 is used as the output of the encoder layer.

[0014] The output class token of the last encoder layer of the Transformer encoder is used as the input to the MLP classification head. The MLP classification head classifies according to the input to obtain the final prediction result.

[0015] The text processing module includes a text encoder in a pre-trained CLIP model and a multi-text scientific embedding query library. First, the single-text prompts of all possible results of the inspection of the new energy power plant are modified into multi-text learnable prompts. Then, the feature vector of each learnable prompt is extracted by the text encoder. Finally, each learnable prompt and its corresponding feature vector are embedded into a data entry for storage to obtain the multi-text scientific embedding query library.

[0016] Image P0 is processed sequentially by the image feature embedding module and the Transformer encoder to obtain feature T_p. Feature T_p is linearly projected to obtain image features f. The category similarity between image features f and the multi-text scientific embedding query library is calculated, and the category prediction similarity score S is then obtained through a max pooling operation. The dimension of S is B×K, where B is the number of prediction boxes and K is the total number of prediction categories, equal to the number of all known classes plus the number of unknown classes that have been set.

[0017]

[0018] where sim(·,·) denotes cosine similarity; τ is the temperature coefficient, a predefined non-zero constant; q_i is the feature vector of the i-th learnable prompt in the multi-text scientific embedding query library; and Z is the number of learnable prompts in the multi-text scientific embedding query library;

[0019] The larger an element value in S, the more similar the corresponding prediction box is to the corresponding learnable prompt; that is, the learnable prompt is the predicted label of that prediction box. Based on S, the predicted label of each prediction box of image features f is obtained.

[0020] The feature T_p output by the Transformer encoder is input to the MLP classification head, which outputs a fixed-size set of N prediction boxes and the probability of each prediction box belonging to each category label. These N prediction boxes are then input to the unknown class pseudo-sample mining module. This module first constructs a cost matrix containing the virtual unknown class by adding unknown class virtual nodes to the bipartite graph, and then uses the Kuhn-Munkres algorithm to search the cost matrix for the best match of N elements at the lowest cost. The cost-aware unknown class loss is then calculated; the cost-aware unknown class loss function L_ux is defined as Equation (6):

[0021]

[0022] In Equation (6), the IoU term is the intersection-over-union (IoU) between the predicted bounding box of the unknown class under the best match and the true bounding boxes of the known classes in the data; the probability term is the predicted probability of the unknown class under the best match; c_i is the target category label; b_i denotes the centroid coordinates of the true bounding box together with its height and width relative to the image size; the corresponding predicted quantity is the centroid coordinates of the prediction box under the best match together with its height and width relative to the image size; and the probability that the prediction at the best-match index has the predicted label c_i is taken from the per-box category label probabilities output by the MLP classification head.

[0023] Step 3: Train the multimodal large-model small-sample open-set detection model

[0024] Step 3.1 First, adjust the format of each image in the photovoltaic inspection image set, wind turbine inspection image set, and tower inspection image set from Step 1 to meet the input format requirements of the basic detector. Then, divide the defect types in each image set into three parts: the first part is small sample known class data, the second part is zero sample known class data, and the third part is unknown class data. Small sample known class data refers to images where the corresponding defect category is labeled and the text description of the image includes the corresponding defect category. Zero sample known class data refers to images where the corresponding defect category is not labeled, but the text description of the image includes the corresponding defect category. Unknown class data refers to images where the corresponding defect category is labeled "unknown" and the text description of the image includes the "unknown" defect category.
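The three-way division in Step 3.1 can be sketched as a small routing function. This is only an illustration under stated assumptions: the function name `assign_split`, the use of `None` for an unlabeled image, and the substring check against the text description are hypothetical choices, not details given in the patent.

```python
from typing import Optional


def assign_split(label: Optional[str], description: str) -> str:
    """Route one image record to a data split, following the wording of Step 3.1.

    label:       annotated defect category, "unknown", or None when unlabeled
                 (the None convention is an assumption of this sketch).
    description: the text description attached to the image.
    """
    text = description.lower()
    if label is None:
        # Category is not labeled; it is only mentioned in the text description.
        return "zero_shot_known"
    if label.lower() == "unknown":
        if "unknown" not in text:
            raise ValueError("unknown-class record must mention 'unknown' in its description")
        return "unknown"
    if label.lower() in text:
        # Category is labeled and also appears in the text description.
        return "few_shot_known"
    raise ValueError("labeled category does not appear in the text description")


records = [
    ("oil leak", "image of a wind turbine with an oil leak defect"),
    (None, "image of a tower with a loose screw defect"),
    ("unknown", "image of a photovoltaic panel with an unknown defect"),
]
splits = [assign_split(label, text) for label, text in records]
```

The example descriptions reuse defect categories listed in the patent; the records themselves are made up for illustration.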

[0025] Step 3.2 The base detector is either the pre-trained ViT-B32 or ViT-L14 released by Google DeepMind; set the training batch size to 1, and set the learning rate, hyperparameters, and maximum number of iterations;

[0026] Step 3.3 Input the image data from the photovoltaic inspection image set, wind turbine inspection image set, and tower inspection image set into the pre-trained basic detector from Step 3.2, respectively. Each image data entry includes an image P_x and the corresponding text description W_x. Image P_x is processed sequentially by the image feature embedding module and the Transformer encoder to obtain the output feature T_x, and feature T_x is linearly projected to obtain image features f_x. The text description W_x is input into the text processing module, where the text encoder first extracts the feature vector of the text description; the feature vector corresponding to W_x is then stored in the multi-text scientific embedding query library. The category similarity between image features f_x and the multi-text scientific embedding query library is then calculated, and max pooling yields the category prediction similarity score S_x, from which the predicted label of each prediction box of image features f_x is obtained;

[0027] At the same time, feature T_x is input to the MLP classification head, which outputs a fixed-size set of N prediction boxes and the probability of each prediction box belonging to each predicted label. The predicted label with the highest probability is the category label of that prediction box. These N prediction boxes are then input to the unknown class pseudo-sample mining module. This module first constructs a cost matrix containing the virtual unknown class by adding unknown class virtual nodes to the bipartite graph. It then uses the Kuhn-Munkres algorithm to search the cost matrix for the best match of N elements at the lowest cost and calculates the value of the cost-aware unknown class loss function L_ux. Based on the value of L_ux, the feature vectors corresponding to the text description W_x in the multi-text scientific embedding query library are corrected by gradient backpropagation;

[0028] Based on the target bounding boxes and defect category labels annotated on image P_x, the category prediction similarity score S_x, the fixed-size set of N prediction boxes output by the MLP classification head, the probability of each box belonging to each predicted label, and the unknown class loss function L_ux, the training loss L of image P_x is calculated:

[0029]

[0030] where L_ce denotes the cross-entropy classification loss of the best match found by the unknown class pseudo-sample mining module; the true category label of the best match is the annotation label of the input image itself, and its predicted label is the result obtained from the category prediction similarity score S_x. The box regression term is the sum of the IoU loss and the L1 regression loss between the predicted defect-location bounding box output by the MLP classification head and the ground-truth bounding box, with λ_iou and λ_L1 as the weighting coefficients; ω is the weighting coefficient of the unknown class loss function L_ux;
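As a rough illustration of how the terms described in [0030] might be combined, the sketch below assumes a plain weighted sum over the matched boxes. The training-loss equation itself is not reproduced in this text, so the additive form, the function name `training_loss`, and the example weight values are all assumptions.

```python
from typing import Sequence


def training_loss(
    ce_losses: Sequence[float],   # cross-entropy loss per matched box (L_ce)
    iou_losses: Sequence[float],  # IoU loss per matched box
    l1_losses: Sequence[float],   # L1 box regression loss per matched box
    l_ux: float,                  # cost-aware unknown class loss L_ux
    lambda_iou: float,            # weighting coefficient of the IoU loss
    lambda_l1: float,             # weighting coefficient of the L1 loss
    omega: float,                 # weighting coefficient of L_ux
) -> float:
    """Weighted sum of the loss terms named in [0030] (assumed additive form)."""
    box_term = lambda_iou * sum(iou_losses) + lambda_l1 * sum(l1_losses)
    return sum(ce_losses) + box_term + omega * l_ux


# Illustrative numbers only; the patent does not state the weight values.
loss = training_loss(
    ce_losses=[0.5, 0.25],
    iou_losses=[0.2, 0.1],
    l1_losses=[0.1, 0.1],
    l_ux=0.4,
    lambda_iou=2.0,
    lambda_l1=5.0,
    omega=1.0,
)
```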

[0031] With the goal of minimizing the training loss of image P_x, the AdamW optimizer is used to update the network parameters of the base detector once, which completes the training on one image data entry. The next image data entry is then input into the base detector, using the network parameters obtained from the previous entry as its initial parameters. This process is repeated until the last image data entry has been trained, which completes one training round. The network parameters from the previous training round are then used as the initial parameters for the next round, and the process is repeated until the preset number of training rounds is reached, yielding the base detector with optimal parameters.

[0032] Step 4: Inspection of New Energy Power Plants

[0033] Obtain inspection photos of photovoltaic panels, wind turbines, or towers taken at the new energy power plant site. First adjust their format to meet the input format requirements of the basic detector, and then input them into the basic detector with the optimal parameters obtained in Step 3. The MLP classification head outputs the detection results, which indicate whether the inspected component contains defects and, if so, the defect category.

[0034] Compared with existing technologies, the beneficial effects of this invention are as follows: This invention provides a small-sample open-set inspection method for new energy power plants based on a multimodal large model. It constructs multi-text learnable prompts and, by aggregating task-related text information, guides the detection model to form finer-grained category decision boundaries, alleviating the problem of overly broad decision boundaries caused by single-text prompts. Since true unknown class samples are lacking during training, this invention treats the mining of unknown class pseudo-samples as a bipartite graph matching task for the first time. By adding virtual nodes for unknown classes and constructing a cost matrix, it mines unknown class pseudo-samples, thereby optimizing the model to form a compact unknown class decision boundary. To address the confusion between small-sample known classes and unknown classes, this invention proposes a cost-aware unknown class optimization loss, which improves the model's open-set detection performance by considering the classification and localization quality of unknown classes. This invention achieves the transition from a single-modal small vision model to a multimodal large vision model, requiring only a small amount of training data to obtain good generalized small-sample open-set object detection performance.

Description of the Drawings

[0035] Figure 1 is a schematic diagram of the principle and structure of the detection model of the small-sample open-set inspection method for new energy power plants based on a multimodal large model according to the present invention.

[0036] Figure 2 is a schematic diagram of the principle of the bipartite graph matching algorithm used by the unknown class pseudo-sample mining module of the detection model: (a) is a schematic diagram of the bipartite graph matching process between the real boxes and the predicted boxes in one embodiment, and (b) is the maximum matching graph of (a).

[0037] Figure 3 shows the results obtained by using the OWL-ViT prediction model to detect a set of images.

[0038] Figure 4 shows the results obtained by applying the multimodal large-model small-sample open-set inspection method of this invention to the same set of images as in Figure 3.

Detailed Implementation

[0039] Specific embodiments are given below with reference to the accompanying drawings. These specific embodiments are only used to illustrate the technical solutions of the present invention in detail, and are not intended to limit the scope of protection of this application.

[0040] This invention provides a small-sample open-set inspection method for new energy power plants based on a multimodal large model (hereinafter referred to as the method; see Figure 1). The method includes the following steps:

[0041] Step 1: Acquire image sets for photovoltaic (PV) power plant inspections, wind turbine inspections, and tower inspections. Each image set contains corresponding defect and non-defect images, with no fewer than 100 images in each set. Each image data entry in each set includes an image with a corresponding defect category label and a corresponding text description. The defect images in each set cover all known defect categories for the corresponding subject. For the PV inspection set, defect categories include hot spot defects, cracking defects, diode failure defects, contamination defects, and shading defects in PV panels. For the wind turbine inspection set, defect categories include images of dirt adhesion, oil leaks, scratches, paint peeling, rust, fading, corrosion, deformation, cracks, and wear. For the tower inspection set, defect categories include images of loose screws, rust, deformation, cracks, and damaged metal fittings. The defect and non-defect images are at least one of infrared and visible light images.

[0042] Step 2: Construct the multimodal large-model small-sample open-set detection model

[0043] The detection model comprises an end-to-end ViT (Vision Transformer) model, a text processing module, and an unknown class pseudo-sample mining module. The end-to-end ViT model consists of three parts: an image feature embedding module (patches), a Transformer encoder, and an MLP classification head. The end-to-end ViT model serves as the base detector.

[0044] The image feature embedding module first divides the input image P0 into multiple equal-sized small blocks (called patches). Then, it adds a positional code corresponding to the position of each small block in image P0. Next, it flattens each small block into a one-dimensional vector. Then, it embeds each flattened small block into a high-dimensional feature vector through a convolutional layer or linear transformation. Finally, it adds a learnable class token to the front of the high-dimensional feature vector as a representative of global information and uses it for classification.
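The patch embedding described above can be sketched in plain Python as follows. This is a minimal illustration, assuming a single-channel image and a linear projection; the function names are hypothetical, and the positional code is added after the projection (the usual ViT ordering), which is an assumption that differs from the order in which the paragraph lists the steps.

```python
from typing import List

Vector = List[float]


def split_into_patches(image: List[List[float]], patch: int) -> List[Vector]:
    """Cut an H x W single-channel image into patch x patch blocks, each flattened row-major."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    return [
        [image[r][c] for r in range(top, top + patch) for c in range(left, left + patch)]
        for top in range(0, height, patch)
        for left in range(0, width, patch)
    ]


def embed_patches(
    patches: List[Vector],
    weight: List[Vector],     # (patch*patch) x dim projection matrix
    positions: List[Vector],  # one positional code of length dim per patch
    class_token: Vector,      # learnable class token of length dim
) -> List[Vector]:
    """Project each flattened patch, add its positional code, and prepend the class token."""
    dim = len(class_token)
    tokens = [list(class_token)]
    for flat, pos in zip(patches, positions):
        projected = [sum(x * weight[k][d] for k, x in enumerate(flat)) for d in range(dim)]
        tokens.append([p + q for p, q in zip(projected, pos)])
    return tokens


# Toy 4 x 4 image split into four 2 x 2 patches and embedded into 2 dimensions.
image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
patches = split_into_patches(image, patch=2)
weight = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
positions = [[0.0, 0.0] for _ in patches]
tokens = embed_patches(patches, weight, positions, class_token=[0.5, 0.5])
```

In a trained model the projection matrix, positional codes, and class token would all be learned parameters; here they are fixed toy values.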

[0045] The Transformer encoder consists of multiple encoder layers. The output of one encoder layer serves as the input to the next. Each encoder layer includes a first normalization layer (Norm), a multi-head attention layer, a second normalization layer (Norm), and a feedforward network. A feature T0 input to an encoder layer is first processed by the first normalization layer (Norm), and the result is input to the multi-head attention layer. The output of the multi-head attention layer is added to feature T0 through a residual connection. The resulting feature T1 is then processed by the second normalization layer (Norm) and input to the feedforward network. The output of the feedforward network is added to feature T1 through a residual connection, and the resulting feature T2 is used as the output of that encoder layer.
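The data flow through one encoder layer (T0 to T1 to T2) can be sketched as below. The attention and feedforward sublayers are passed in as callables so that the sketch stays short; the function names, the parameter-free layer normalization, and the toy sublayers are assumptions made for illustration only.

```python
import math
from typing import Callable, List

Matrix = List[List[float]]
Sublayer = Callable[[Matrix], Matrix]


def layer_norm(tokens: Matrix, eps: float = 1e-6) -> Matrix:
    """Normalize each token to zero mean and unit variance (no learned scale or shift)."""
    out = []
    for tok in tokens:
        mean = sum(tok) / len(tok)
        var = sum((x - mean) ** 2 for x in tok) / len(tok)
        out.append([(x - mean) / math.sqrt(var + eps) for x in tok])
    return out


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum used for the residual connections."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def encoder_layer(t0: Matrix, attention: Sublayer, feed_forward: Sublayer) -> Matrix:
    """Norm -> attention -> residual add gives T1; Norm -> feedforward -> residual add gives T2."""
    t1 = add(attention(layer_norm(t0)), t0)
    t2 = add(feed_forward(layer_norm(t1)), t1)
    return t2


def zero_sublayer(tokens: Matrix) -> Matrix:
    """Toy sublayer that outputs zeros, so only the residual path contributes."""
    return [[0.0] * len(tok) for tok in tokens]


def identity_sublayer(tokens: Matrix) -> Matrix:
    """Toy sublayer that returns its (normalized) input unchanged."""
    return tokens


t0 = [[1.0, 3.0], [2.0, 2.5]]
unchanged = encoder_layer(t0, zero_sublayer, zero_sublayer)
shifted = encoder_layer(t0, identity_sublayer, zero_sublayer)
```

With both sublayers returning zeros, the residual connections pass T0 through unchanged, which is a quick way to check the wiring.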

[0046] The output class token of the last encoder layer of the Transformer encoder is used as the input to the MLP classification head. The MLP classification head classifies according to the input to obtain the final prediction result.

[0047] The text processing module includes the text encoder of a pre-trained CLIP model and a multi-text scientific embedding query library. First, the single-text prompts for all possible results of new energy power plant inspections (including a text description of the unknown class) are modified into multi-text learnable prompts. For example, a single-text prompt such as "image of a wind turbine with an oil leak defect" is modified into multi-text learnable prompts such as "wind turbine," "oil leak," and "image of a wind turbine." Then, the text encoder extracts the feature vector of each multi-text learnable prompt. Finally, each learnable prompt and its corresponding feature vector are stored together as a single data entry, resulting in the multi-text scientific embedding query library.
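One way to picture the query library is as a list of entries, each holding a prompt and its feature vector. In the sketch below, `PromptEntry`, `build_query_library`, and `toy_encode` are hypothetical names, and `toy_encode` is only a deterministic stand-in for the CLIP text encoder so that the example runs without any model; it does not produce meaningful embeddings.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PromptEntry:
    category: str          # inspection result the prompt belongs to
    prompt: str            # one of the multi-text learnable prompts
    feature: List[float]   # feature vector produced by the text encoder


def toy_encode(text: str, dim: int = 4) -> List[float]:
    """Stand-in for a real text encoder: folds character codes into a small vector."""
    vec = [0.0] * dim
    for i, ch in enumerate(text):
        vec[i % dim] += ord(ch) / 1000.0
    return vec


def build_query_library(
    multi_prompts: Dict[str, List[str]],
    encode: Callable[[str], List[float]],
) -> List[PromptEntry]:
    """Store every prompt together with its feature vector as one data entry."""
    return [
        PromptEntry(category, prompt, encode(prompt))
        for category, prompts in multi_prompts.items()
        for prompt in prompts
    ]


# The single prompt "image of a wind turbine with an oil leak defect" from the
# text, rewritten as the three prompts given in the patent's example.
multi_prompts = {
    "wind turbine oil leak": ["wind turbine", "oil leak", "image of a wind turbine"],
}
library = build_query_library(multi_prompts, toy_encode)
```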

[0048] Image P0 is processed sequentially by the image feature embedding module and the Transformer encoder to obtain feature T_p. Feature T_p is linearly projected to obtain image features f. The category similarity between image features f and the multi-text scientific embedding query library is calculated, and max pooling is then performed to obtain the category prediction similarity score S. The dimension of S is B×K, where B is the number of prediction boxes and K is the total number of prediction categories, equal to the number of all known classes plus the number of unknown classes that have been set.

[0049]

[0050] where sim(·,·) denotes cosine similarity; τ is the temperature coefficient, a non-zero constant set to 0.05; the image feature f is obtained by linear projection of the output features of the Transformer encoder; the feature vectors in the multi-text scientific embedding query library are extracted by the text encoder; q_i is the feature vector of the i-th learnable prompt in the multi-text scientific embedding query library; and Z is the number of learnable prompts in the multi-text scientific embedding query library.

[0051] The larger the value of an element in S, the more similar the corresponding predicted bounding box is to the corresponding learnable prompt; that is, the learnable prompt is the category label of that predicted bounding box. Based on S, the classification label of each predicted bounding box of image features f is obtained.
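The scoring step can be illustrated as follows: compute the temperature-scaled cosine similarity between each box feature and every prompt feature, then max-pool over the prompts of each category to get a B×K score matrix. The similarity equation is not reproduced in this text, so this exact form (in particular, the absence of any softmax normalization) is an assumption, as are the function names and the toy two-dimensional features. The value τ = 0.05 is the one stated in the description.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def similarity_scores(
    box_features: List[Vector],
    library: List[Tuple[str, Vector]],  # (category, prompt feature) entries
    categories: List[str],
    tau: float = 0.05,
) -> List[List[float]]:
    """Return S with one row per prediction box and one column per category."""
    scores = []
    for f in box_features:
        row = []
        for category in categories:
            sims = [cosine(f, q) / tau for c, q in library if c == category]
            row.append(max(sims))  # max pooling over the prompts of this category
        scores.append(row)
    return scores


def predicted_labels(scores: List[List[float]], categories: List[str]) -> List[str]:
    """Pick the category with the largest score for each prediction box."""
    return [categories[max(range(len(row)), key=row.__getitem__)] for row in scores]


categories = ["oil leak", "unknown"]
library = [("oil leak", [1.0, 0.0]), ("oil leak", [0.6, 0.8]), ("unknown", [0.0, 1.0])]
S = similarity_scores([[1.0, 0.0], [0.0, 1.0]], library, categories)
labels = predicted_labels(S, categories)
```

Here the second box is closer to the "unknown" prompt than to either "oil leak" prompt, so it is labeled unknown even though one "oil leak" prompt is fairly similar to it.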

[0052] The feature T_p output by the Transformer encoder is input to the MLP classification head, which outputs a fixed-size set of N prediction boxes and the probability of each prediction box belonging to each category label. These N prediction boxes are then input to the unknown class pseudo-sample mining module. This module first constructs a cost matrix containing the virtual unknown class by adding unknown class virtual nodes to the bipartite graph, and then uses a bipartite graph matching algorithm to mine unknown class samples from the predicted bounding boxes.

[0053] Because true unknown class samples are lacking during training, this invention constructs a cost matrix from the predicted bounding boxes by adding an unknown class virtual node to the bipartite graph. It then uses a bipartite graph matching algorithm to mine low-cost unknown class pseudo-samples, optimizes the cost-aware unknown class loss function, constructs the unknown class decision boundary, and ultimately achieves generalized small-sample open-set object detection. In graph theory, a matching is a set of edges in which no two edges share a common vertex. Bipartite graph matching refers to selecting edges in a bipartite graph such that every vertex belongs to at most one selected edge. The goal of bipartite graph matching is to find a maximum matching, i.e., a matching that contains the largest possible number of edges. The bipartite graph matching process between ground-truth and predicted bounding boxes in object detection is shown in Figure 2(a). Figure 2(b) is a maximum matching that contains 3 matching edges. Matching edges are edges that do not share a common prediction-box vertex. As shown in Figure 2(a), c(1,1) and c(2,576) form a bipartite graph matching, whereas c(1,2) and c(2,2) do not, because they share the common vertex prediction box 2. To mine unknown class pseudo-samples, this invention first adds an unknown class virtual node, virtual box 3, to the bipartite graph. By connecting the virtual node to each prediction box node, a cost matrix containing the virtual unknown class is constructed. Then, using a bipartite graph matching algorithm, unknown class samples are mined from the predicted bounding boxes, and the model is trained to form the unknown class decision boundary, rejecting interference from the unknown class.
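The mining idea can be sketched on a toy cost matrix: append one extra row for the virtual unknown node, find the lowest-cost assignment of rows to distinct prediction boxes, and treat the box assigned to the virtual row as the unknown class pseudo-sample. For brevity this sketch searches all assignments by brute force instead of running the Kuhn-Munkres algorithm named in the patent (both return a minimum-cost assignment, but brute force is only practical for very small inputs). The function name and all cost values are hypothetical.

```python
from itertools import permutations
from typing import List, Tuple


def match_with_virtual_node(
    gt_costs: List[List[float]],  # gt_costs[i][j]: cost of matching ground-truth box i to prediction j
    virtual_costs: List[float],   # cost of connecting the virtual unknown node to prediction j
) -> Tuple[List[int], int, float]:
    """Return (prediction index per ground-truth box, pseudo-sample index, total cost)."""
    rows = gt_costs + [virtual_costs]
    n_pred = len(virtual_costs)
    if n_pred < len(rows):
        raise ValueError("need at least as many predictions as ground-truth boxes plus one")
    best_cost, best_cols = float("inf"), None
    # Brute-force stand-in for Kuhn-Munkres: try every assignment of rows to distinct columns.
    for cols in permutations(range(n_pred), len(rows)):
        total = sum(rows[r][c] for r, c in enumerate(cols))
        if total < best_cost:
            best_cost, best_cols = total, cols
    return list(best_cols[:-1]), best_cols[-1], best_cost


# Two ground-truth boxes, four prediction boxes, one virtual unknown node.
gt_costs = [
    [0.10, 0.90, 0.80, 0.70],
    [0.90, 0.20, 0.85, 0.60],
]
virtual_costs = [0.95, 0.90, 0.30, 0.50]
assignment, pseudo_sample, total_cost = match_with_virtual_node(gt_costs, virtual_costs)
```

In this toy case the two ground-truth boxes take predictions 0 and 1, and prediction 2, which is cheap to connect to the virtual node but expensive for every ground-truth box, is mined as the unknown class pseudo-sample.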

[0054] To obtain the unknown class matching box, the cost weight Q_match of connecting the unknown class virtual node to each prediction box node is first calculated, and a cost matrix that includes the virtual unknown class is constructed. Consider a set of N predicted labels and, assuming N is greater than the number of targets in the image, regard y as the set of N true labels of the targets, zero-padded. First, a virtual unknown class label y_u is added to y in place of one padding element 0. Then, to find the maximum matching between the set y and the set of predictions, the best matching σ of N elements is searched for at the lowest cost, where σ ∈ G_N and G_N represents the cost matrix:

[0055]

[0056] where L_match denotes the matching cost between the true label y_i (i ≠ u) and the predicted label at index σ(i). This invention uses the Kuhn-Munkres algorithm to calculate the optimal match.

[0057] The matching cost is the sum of two terms: the distance between the predicted probability of the true class and 1, and the distance between the predicted bounding box and the ground-truth bounding box. Each element i of the true label set y can be viewed as y_i = (c_i, b_i), where c_i is the target class label (possibly empty) and b_i is defined as the center coordinates of the ground-truth bounding box together with its height and width relative to the image size. For the predicted label with index σ(i), the probability of class c_i is denoted p̂_σ(i)(c_i) and the predicted box is denoted b̂_σ(i). L_match is calculated as in equation (3):

[0058]

[0059] where the bounding box term represents the center distance and generalized intersection over union (GIoU) between the ground-truth bounding box and the predicted bounding box. To calculate Q_match, first define:

[0060]

[0061] where i, j = 1, …, N. The cost weight connecting the unknown class virtual node to each prediction box node is then expressed as:

[0062]

[0063] Here, the maximum term represents the largest of the connection costs between the predicted bounding boxes and the ground-truth bounding boxes, and N_gt indicates the number of labeled targets in the image, i.e., the number of predicted bounding boxes.
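The matching cost described above (a class term plus a box term) can be sketched as follows. This is an illustrative sketch only: it assumes boxes are given as normalized (cx, cy, w, h), uses an L1 distance plus (1 - GIoU) as the box term, and uses the hypothetical names `giou`, `box_cost`, and `match_cost`. It does not reproduce equations (3) to (5), and it does not compute Q_match.

```python
def to_corners(box):
    """Convert (cx, cy, w, h) to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


def giou(a, b):
    """Generalized IoU of two (cx, cy, w, h) boxes; ranges over (-1, 1]."""
    ax1, ay1, ax2, ay2 = to_corners(a)
    bx1, by1, bx2, by2 = to_corners(b)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Area of the smallest box enclosing both a and b.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (enclose - union) / enclose


def box_cost(gt_box, pred_box, lam_iou=1.0, lam_l1=1.0):
    """Box distance: weighted (1 - GIoU) plus L1 distance over coordinates."""
    l1 = sum(abs(g - p) for g, p in zip(gt_box, pred_box))
    return lam_iou * (1.0 - giou(gt_box, pred_box)) + lam_l1 * l1


def match_cost(gt_class, gt_box, pred_probs, pred_box):
    """Distance of the true-class probability from 1, plus the box distance."""
    return (1.0 - pred_probs[gt_class]) + box_cost(gt_box, pred_box)


gt = (0.25, 0.25, 0.5, 0.5)
far = (0.75, 0.75, 0.5, 0.5)
print(giou(gt, gt))    # identical boxes give the maximum value
print(giou(gt, far))   # disjoint boxes give a negative value
print(match_cost(0, gt, [0.9, 0.1], far))
```

A perfect prediction (probability 1 for the true class and an identical box) has cost 0, so lower cost means a better candidate edge in the bipartite graph.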

[0064] To ensure that the model correctly classifies the prompted known classes and rejects interference from unknown classes, the unknown class pseudo-samples obtained through mining are used to optimize the unknown class loss function, so that the model can distinguish known classes from unknown classes in the classification stage. At the same time, for the best match σ̂, the constraint between the intersection over union (IoU) of the unknown class candidate boxes with the ground-truth bounding boxes and the predicted probability of the unknown class is taken into account, in order to identify unknown targets accurately. A cost-aware unknown class loss function L_ux is therefore defined as equation (6):

[0065]

[0066] Here, the IoU term denotes the intersection over union between the unknown class predicted bounding boxes in the best match σ̂ and the ground-truth bounding boxes of known classes in the data, and the probability term denotes the predicted probability of the unknown class in the best match σ̂. c_i is the target class label, and b_i denotes the center coordinates of the ground-truth bounding box together with its height and width relative to the image size. b̂_σ̂(i) denotes the center coordinates of the prediction box in the best match σ̂ together with its height and width relative to the image size, and p̂_σ̂(i)(c_i) denotes the probability that the predicted label with index σ̂(i) in the best match is class c_i. The predicted probability of each class label for each prediction box is output by the MLP classification head.

[0067] From an optimization perspective, L_ux improves the model's ability to distinguish prompted known classes from unknown classes. It pushes the model to form a more compact decision boundary for unknown classes, prevents unknown classes from being detected as known classes, and improves open-set detection performance.

[0068] Step 3: Train the multimodal large model small-sample open-set detection model

[0069] Step 3.1 First, adjust the format of each image in the photovoltaic inspection image set, wind turbine inspection image set, and tower inspection image set from Step 1 to meet the input format requirements of the basic detector. Then, divide the defect types in each image set into three parts: the first part is small sample known class data, the second part is zero sample known class data, and the third part is unknown class data. Small sample known class data refers to images where the corresponding defect category is labeled and the text description of the image includes the corresponding defect category. Zero sample known class data refers to images where the corresponding defect category is not labeled, but the text description of the image includes the corresponding defect category. Unknown class data refers to images where the corresponding defect category is labeled "unknown" and the text description of the image includes the "unknown" defect category.

[0070] As one embodiment, the photovoltaic inspection image set is configured with hot spots and cracks as the two small-sample known classes, diode faults and pollution as the two zero-sample known classes, and shading as the one unknown class. The tower inspection image set is configured with screw loosening and rust as the two small-sample known classes, deformation and cracks as the two zero-sample known classes, and metal parts damage as the one unknown class. The wind turbine inspection image set is configured with dirt adhesion, oil leakage, and scratches as the three small-sample known classes, paint peeling and rust as the two zero-sample known classes, and fading, corrosion, deformation, cracks, and wear as the five unknown classes.
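The class split of this embodiment can be written down as a small lookup table. The sketch below is illustrative: the dictionary layout, the key names, and the helpers `role_of` and `annotation_label` are assumptions, while the class lists follow the embodiment above.

```python
SPLITS = {
    "photovoltaic": {
        "small_sample_known": ["hot spot", "crack"],
        "zero_sample_known": ["diode fault", "pollution"],
        "unknown": ["shading"],
    },
    "tower": {
        "small_sample_known": ["screw loosening", "rust"],
        "zero_sample_known": ["deformation", "crack"],
        "unknown": ["metal parts damage"],
    },
    "wind_turbine": {
        "small_sample_known": ["dirt adhesion", "oil leakage", "scratch"],
        "zero_sample_known": ["paint peeling", "rust"],
        "unknown": ["fading", "corrosion", "deformation", "crack", "wear"],
    },
}


def role_of(dataset, defect):
    """Return which part of the split a defect type belongs to."""
    for role, names in SPLITS[dataset].items():
        if defect in names:
            return role
    raise KeyError(f"{defect!r} is not a defect type of {dataset!r}")


def annotation_label(dataset, defect):
    """Unknown-class defects are annotated with the label 'unknown'."""
    return "unknown" if role_of(dataset, defect) == "unknown" else defect


# The same defect name can play different roles in different image sets.
print(role_of("photovoltaic", "crack"))           # small_sample_known
print(role_of("wind_turbine", "crack"))           # unknown
print(annotation_label("wind_turbine", "crack"))  # unknown
```

Keeping the split per image set matters because a defect such as cracks is a small-sample known class for photovoltaic panels but an unknown class for wind turbines.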

[0071] Step 3.2 The base detector is either the pre-trained ViT-B32 or ViT-L14 released by Google DeepMind; the training batch size is set to 1, the learning rate to 0.000003, the hyperparameters τ and ω to 0.05 and 1 respectively, and the maximum number of iterations is set to 100. A half-precision training strategy is adopted, using 16-bit floating-point numbers instead of 32-bit floating-point numbers for calculation.

[0072] Step 3.3 Input the image data from the photovoltaic inspection image set, wind turbine inspection image set, and tower inspection image set into the pre-trained basic detector of Step 3.2. Each image data item includes an image P_x and the corresponding text description W_x. Image P_x passes through the image feature embedding module (patches) and the Transformer encoder to produce the output feature T_x, and T_x is linearly projected to obtain the image features f_x. The text description W_x is input into the text processing module, where the text encoder first extracts the feature vector of the text description; the feature vector corresponding to W_x is then stored in the multi-text scientific embedding query library. Next, the category similarity between the image features f_x and the multi-text scientific embedding query library is calculated and max-pooled to obtain the category prediction similarity score S_x, which gives the predicted label of each prediction box of the image features f_x.
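The similarity scoring in this step can be sketched as follows. This is a hedged illustration, not the invention's equation (1): it assumes each category has several prompt embeddings in the query library, takes the cosine similarity of each box feature with each prompt, max-pools over the prompts of a category, and divides by the temperature τ = 0.05. The names `similarity_scores`, `predicted_labels`, and `prompt_bank`, and all vectors, are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_scores(box_features, prompt_bank, tau=0.05):
    """Return category names and a B x K score matrix (max over prompts per category)."""
    names = list(prompt_bank)
    scores = [
        [max(cosine(f, e) for e in prompt_bank[name]) / tau for name in names]
        for f in box_features
    ]
    return names, scores


def predicted_labels(names, scores):
    """Pick the category with the highest score for every prediction box."""
    return [names[max(range(len(row)), key=row.__getitem__)] for row in scores]


# Toy 2-D embeddings: two prompts for "hot spot", one each for the others.
prompt_bank = {
    "hot spot": [[1.0, 0.0], [0.9, 0.1]],
    "crack": [[0.0, 1.0]],
    "unknown": [[-1.0, -1.0]],
}
box_features = [[0.8, 0.2], [0.1, 0.9]]

names, S = similarity_scores(box_features, prompt_bank)
print(len(S), len(S[0]))           # B x K = 2 x 3
print(predicted_labels(names, S))  # ['hot spot', 'crack']
```

Because τ is positive, dividing by it before or after max pooling gives the same predicted labels; it only sharpens the scores that later feed the classification loss.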

[0073] At the same time, feature T_x is input to the MLP classification head, which outputs a fixed-size set of N predicted bounding boxes and the probability that each box belongs to each predicted label. The predicted label with the highest probability is the category label of that predicted box. The N predicted bounding boxes are then input to the unknown class pseudo-sample mining module. This module first constructs a cost matrix containing the virtual unknown class by adding an unknown class virtual node to the bipartite graph. It then uses a bipartite graph matching algorithm to mine unknown class samples from the predicted bounding boxes and calculates the value of the cost-aware unknown class loss function L_ux. Based on the value of L_ux, the feature vector corresponding to the text description W_x in the multi-text scientific embedding query library is corrected by gradient backpropagation.

[0074] The training loss L of image P_x is calculated from the target bounding boxes and defect category labels annotated on image P_x, the category prediction similarity score S_x, the fixed-size set of N predicted bounding boxes output by the MLP classification head, the probability that each box belongs to each predicted label, and the unknown class loss function L_ux:

[0075]

[0076] where L_ce denotes the cross-entropy classification loss of the best match σ̂ from the unknown class pseudo-sample mining module. The true category label of the best match is the annotation label of the input image itself, and its predicted label is the result predicted from the category prediction similarity score S_x. The bounding box loss is the sum of the IoU loss and the L1 regression loss between the predicted defect location bounding boxes output by the MLP classification head and the ground-truth bounding boxes; its weighting coefficients λ_iou and λ_L1 are both set to 1. ω is the weighting coefficient of the unknown class loss function L_ux.
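As a rough illustration of how these terms could combine, the sketch below computes a cross-entropy term from one row of similarity scores and forms a weighted sum with the box losses and the unknown class loss. The additive form is an assumption based on the description of the weighting coefficients, since equations (7) to (9) are not reproduced here; the function names and the numeric loss values are hypothetical.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over one row of scores."""
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return [s - log_z for s in scores]


def cross_entropy(scores, true_index):
    """Cross-entropy of one prediction box against its annotated category."""
    return -log_softmax(scores)[true_index]


def training_loss(l_ce, l_iou, l_l1, l_ux, lam_iou=1.0, lam_l1=1.0, omega=1.0):
    """Assumed weighted sum: classification + box losses + unknown class loss."""
    return l_ce + lam_iou * l_iou + lam_l1 * l_l1 + omega * l_ux


row = [4.0, 1.0, -2.0]  # similarity scores of one box over K = 3 categories
l_ce = cross_entropy(row, true_index=0)
total = training_loss(l_ce, l_iou=0.2, l_l1=0.1, l_ux=0.3)
print(round(l_ce, 4), round(total, 4))
```

With λ_iou, λ_L1, and ω all equal to 1, as in the embodiment, every term contributes with equal weight.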

[0077] With the goal of minimizing the training loss L of image P_x, the AdamW optimizer is used with weight decay set to 0.1. The network parameters of the basic detector are updated once, which completes training on one image data item. The next image data item is then input into the basic detector, with the network parameters obtained from the previous item used as its initial parameters. This process is repeated until the last image data item has been trained, which completes one training epoch. The network parameters from the previous epoch are then used as the initial parameters for the next epoch, and the process repeats until the preset number of training epochs is reached, yielding the basic detector with optimal parameters.
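The update schedule described above (one optimizer step per image data item, parameters carried from item to item and from epoch to epoch) can be sketched with a scalar AdamW update. This is a toy illustration under stated assumptions: the learning rate 3e-6 and weight decay 0.1 come from the text, while β1 = 0.9, β2 = 0.999, ε = 1e-8, the single scalar parameter, and the quadratic toy loss are assumptions made only so the example runs.

```python
import math


def adamw_step(theta, grad, state, lr=3e-6, weight_decay=0.1,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdamW update of a scalar parameter with decoupled weight decay."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])   # bias-corrected first moment
    v_hat = state["v"] / (1 - beta2 ** state["t"])   # bias-corrected second moment
    theta = theta * (1 - lr * weight_decay)          # decay applied to the weight itself
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps)


def train(samples, theta=1.0, epochs=2):
    """Batch size 1: one update per sample, parameters carried across epochs."""
    state = {"t": 0, "m": 0.0, "v": 0.0}
    for _ in range(epochs):
        for target in samples:
            grad = 2 * (theta - target)  # gradient of the toy loss (theta - target)^2
            theta = adamw_step(theta, grad, state)
    return theta


state = {"t": 0, "m": 0.0, "v": 0.0}
first = adamw_step(1.0, 0.5, state)
print(first)                # slightly below 1.0 after a single step
print(train([0.0, 0.0]))    # four steps in total (2 samples x 2 epochs)
```

With a learning rate this small, each step moves the parameter by only a few millionths, which is consistent with fine-tuning a pre-trained detector rather than training from scratch.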

[0078] Step 4: Inspection of New Energy Power Plants

[0079] Obtain inspection photos of photovoltaic panels, wind turbines, or towers taken at the new energy power plant site. First adjust their format to meet the input format requirements of the basic detector (resolution of 512×512), then input them into the basic detector with the optimal parameters obtained in Step 3. The MLP classification head outputs the detection results, which indicate whether the inspected component contains defects and, if so, the defect category.
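Turning the classification head output into an inspection result can be sketched as below. This is a minimal sketch under assumptions: each prediction box comes with one probability per category, the category with the highest probability is taken as the box label, and a hypothetical `"no defect"` background label marks boxes without defects. The function names and the probabilities are illustrative.

```python
def decode(box_probs, class_names):
    """Assign each prediction box the category with the highest probability."""
    detections = []
    for probs in box_probs:
        best = max(range(len(probs)), key=lambda k: probs[k])
        detections.append((class_names[best], probs[best]))
    return detections


def inspection_result(detections, background="no defect"):
    """Summarize whether any defect was found and which categories appear."""
    defects = [name for name, _ in detections if name != background]
    return {"has_defect": bool(defects), "categories": sorted(set(defects))}


class_names = ["no defect", "scratch", "oil leakage", "unknown"]
box_probs = [
    [0.90, 0.05, 0.03, 0.02],  # background box
    [0.10, 0.70, 0.10, 0.10],  # scratch
    [0.10, 0.15, 0.05, 0.70],  # rejected as unknown instead of a known class
]

detections = decode(box_probs, class_names)
print(detections)
print(inspection_result(detections))
```

Reporting "unknown" as its own category is what lets an operator see defects that fall outside the trained classes instead of having them merged into a known class.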

[0080] To verify the detection effect of the technical solution of this invention, the ViT-based method OWL-ViT (Minderer M, Gritsenko A, Stone A, et al. Simple open-vocabulary object detection with vision transformers[A]. Proceedings of the European Conference on Computer Vision[C]. Tel Aviv, Israel: Springer, 2022:728-755) was selected for comparison. To ensure a fair experiment, comparative experiments were run with both ViT-B32 and ViT-L14 as the feature extraction network for the multimodal ViT-based method OWL-ViT and for the method of this invention. Both methods were trained on the same dataset and then validated on the same dataset. In the specific implementation of this invention, three datasets were prepared: photovoltaic inspection images, wind turbine inspection images, and tower inspection images. The photovoltaic inspection images comprise 1,170 visible light images of photovoltaic power plants at the Guangzong New Energy Power Station in Handan and 3,520 infrared images from drone inspections in Guangzong, 4,690 inspection images in total. The wind turbine inspection images mainly comprise 22,893 inspection images of 33 wind turbines at the Annoji Power Station. The tower inspection images mainly comprise 1,198 drone inspection images of 31 towers in Baiyunling.

[0081] The photovoltaic dataset uses hot spots and cracks as the two small-sample known classes, diode faults and pollution as the two zero-sample known classes, and shading as the one unknown class. The tower dataset uses screw detachment and rust as the two small-sample known classes, deformation and cracks as the two zero-sample known classes, and metal accessory damage as the one unknown class. The wind turbine dataset uses dirt adhesion, oil leakage, and scratches as the three small-sample known classes, paint peeling and rust as the two zero-sample known classes, and fading, corrosion, deformation, cracks, and wear as the five unknown classes. All input images for the model are set to a resolution of 512×512.

[0082] For OWL-ViT, this invention pre-defines the text representation of unknown classes and uses the pseudo-sample mining algorithm based on the virtual node bipartite graph to mine unknown class pseudo-samples. It then trains with an unknown class loss function based on log-softmax, thereby achieving generalized small-sample open-set target detection. The evaluation metrics are mAP_fs (small sample), mAP_zs (zero sample), AR_U, WI, and AOSE. The experimental results on the photovoltaic inspection dataset and the tower inspection dataset are shown in Table 1.
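The open-set metrics are not defined in this text. As a hedged illustration, the sketch below follows their common usage in open-set detection work: AOSE counts unknown objects that are detected as some known class, WI compares precision on known classes without and with unknown objects present, and recall of unknown objects is shown at a single matching threshold as a simplification of AR_U. The record format, function names, and numbers are assumptions.

```python
def aose(records, known_classes):
    """Absolute open-set error: unknown objects detected as a known class."""
    return sum(1 for true, pred in records
               if true == "unknown" and pred in known_classes)


def unknown_recall(records):
    """Fraction of unknown objects that were detected as 'unknown'."""
    preds = [pred for true, pred in records if true == "unknown"]
    if not preds:
        return 0.0
    return sum(1 for pred in preds if pred == "unknown") / len(preds)


def wilderness_impact(precision_closed, precision_open):
    """Relative precision drop on known classes once unknown objects appear."""
    return precision_closed / precision_open - 1.0


# Each record is (true label, predicted label); None means the object was missed.
records = [
    ("unknown", "unknown"),
    ("unknown", "scratch"),   # open-set error
    ("unknown", None),        # missed
    ("scratch", "scratch"),
]
known = {"scratch", "oil leakage", "dirt adhesion"}

print(aose(records, known))           # 1
print(unknown_recall(records))        # one of three unknown objects recalled
print(wilderness_impact(0.80, 0.64))  # about 0.25
```

Lower WI and AOSE are better, while higher recall of unknown objects is better.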

[0083] Table 1. Small-sample open-set evaluation results on the photovoltaic inspection dataset and the tower inspection dataset.

[0084]

[0085] As shown in Table 1, the method of the present invention achieves a significant performance improvement in unknown class detection. Under the 10-shot setting of the tower inspection dataset, the average recall for unknown classes (AR_U) of the method of the present invention (ViT-L14) reaches 55.1%, a 12.8% improvement over OWL-ViT. These results show that the proposed method generalizes well to unknown categories and can detect and recognize categories not seen during training. The proposed method also improves detection performance for both small-sample known classes and zero-sample known classes: compared to OWL-ViT, the average small-sample detection result (mAP_fs) and zero-sample detection result (mAP_zs) improve by 7.0% and 8.1%, respectively. The method therefore not only generalizes well to unknown classes with limited training samples, but also improves recognition of small-sample known classes and zero-sample known classes.

[0086] This invention further evaluates the method on the more challenging wind turbine inspection dataset, with three classes set as small-sample known classes, two classes as zero-sample known classes, and five classes as unknown classes. Compared with the photovoltaic and tower inspection datasets, which each use two small-sample known classes, two zero-sample known classes, and one unknown class, the wind turbine inspection dataset is more challenging. Experimental results for the 1-shot, 5-shot, 10-shot, and 30-shot settings are shown in Table 2.

[0087] Table 2. Small-sample open-set evaluation results on the wind turbine inspection dataset.

[0088]

[0089] As shown in Table 2, the method of this invention achieves significant improvements in small-sample performance, zero-sample performance, and unknown class detection performance. In the 10-shot setting with ViT-B32 as the feature extraction network, the method of this invention improves on OWL-ViT by 8.8% in small-sample detection performance (mAP_fs), 2.0% in zero-sample detection performance (mAP_zs), and 6.6% in unknown class AR_U.

[0090] Visualization results on the 10-shot wind turbine inspection dataset are shown in Figure 3 and Figure 4. Compared with the OWL-ViT method, the method of this invention has certain advantages in detecting unknown objects. As shown in Figure 4, in the left image the method of this invention detects the mark defects on the wind turbine surface, while the OWL-ViT method treats them as background. In the middle image, the method of this invention detects two scratch defects. In the right image, the OWL-ViT method incorrectly detects a wind turbine component as a scratch, while the method of this invention detects it as an unknown class, which prevents class confusion. Overall, the visualization results show that the method of this invention has good open-set class detection capability.

[0091] Matters not described in this invention are applicable to the prior art.

Claims

1. A method for inspecting new energy power plants based on a multimodal large model and a small sample open set, characterized in that the method includes the following steps: Step 1: Obtain the photovoltaic inspection image set, wind turbine inspection image set, and tower inspection image set during the inspection of new energy power plants. Each image set contains defect images and non-defect images of the corresponding subject, and the number of images in each image set is no less than 100. Each image data item in each image set includes an image with a corresponding defect category label and a corresponding text description. The defect images in each image set cover all known defect categories of the corresponding subject. Step 2: Construct the multimodal large model small-sample open-set detection model. The detection model includes an end-to-end ViT model, a text processing module, and an unknown class pseudo-sample mining module. The end-to-end ViT model is the basic detector and consists of three parts: an image feature embedding module, a Transformer encoder, and an MLP classification head. The image feature embedding module first divides the input image P0 into multiple equal-sized blocks, then adds a positional encoding corresponding to the position of each block in image P0, then flattens each block into a one-dimensional vector, and then embeds each flattened block into a high-dimensional feature vector through a convolutional layer or linear transformation. Finally, a learnable class token is added in front of the high-dimensional feature vectors as a representative of global information, and is ultimately used for classification. The Transformer encoder consists of multiple encoder layers, where the output of each encoder layer serves as the input to the next encoder layer. Each encoder layer includes a first normalization layer, a multi-head attention layer, a second normalization layer, and a feedforward neural network.
The feature T0 input to an encoder layer is first processed by the first normalization layer, and the result is input to the multi-head attention layer. The output of the multi-head attention layer is combined with the feature T0 through a residual connection to give feature T1. T1 is then processed by the second normalization layer and input to the feedforward neural network. The output of the feedforward neural network is combined with T1 through a residual connection, and the resulting feature T2 is the output of the encoder layer. The output class token of the last encoder layer of the Transformer encoder is used as the input to the MLP classification head, which classifies its input to obtain the final prediction result. The text processing module includes the text encoder of a pre-trained CLIP model and a multi-text scientific embedding query library. First, the single-text prompts for all possible results of the new energy power plant inspection are modified into multi-text learnable prompts. Then, the feature vector of each learnable prompt is extracted by the text encoder. Finally, each learnable prompt and its corresponding feature vector are stored together as one data entry, which yields the multi-text scientific embedding query library. Image P0 is processed sequentially by the image feature embedding module and the Transformer encoder to obtain feature T_p, and T_p is linearly projected to obtain the image features f. The category similarity between the image features f and the multi-text scientific embedding query library is calculated and then max-pooled to obtain the category prediction similarity score S. The dimension of S is B×K, where B is the number of prediction boxes and K is the total number of prediction categories set, i.e., the sum of the number of all known categories and the number of unknown categories set; (1) where sim(·,·) denotes cosine similarity;
τ represents the temperature coefficient, which is a predefined non-zero constant; the prompt embedding term represents the feature vector of the i-th learnable prompt in the multi-text scientific embedding query library; Z is the number of learnable prompts in the multi-text scientific embedding query library. The larger the value of an element in S, the more similar the corresponding predicted bounding box is to the corresponding learnable prompt; that is, the learnable prompt is the predicted label of the predicted bounding box. Based on S, the predicted label of each prediction box of the image features f is obtained. The feature T_p output by the Transformer encoder is input into the MLP classification head, which outputs a fixed-size set of N predicted bounding boxes and the probability that each predicted bounding box belongs to each category label. The N predicted bounding boxes are input into the unknown class pseudo-sample mining module. This module first constructs a cost matrix containing the virtual unknown class by adding a virtual node to the bipartite graph. It then uses the Kuhn-Munkres algorithm to search the cost matrix for the lowest-cost best match of the N elements. The cost-aware unknown class loss is then calculated, and the cost-aware unknown class loss function L_ux is defined as equation (6): (6) where the IoU term denotes the intersection over union between the unknown class predicted bounding boxes in the best match σ̂ and the ground-truth bounding boxes of known classes in the data; the probability term denotes the predicted probability of the unknown class in the best match σ̂; c_i is the target category label, and b_i is the center coordinates of the ground-truth bounding box together with its height and width relative to the image size.
b̂_σ̂(i) denotes the center coordinates of the prediction box in the best match σ̂ together with its height and width relative to the image size; p̂_σ̂(i)(c_i) denotes the probability that the predicted label with index σ̂(i) in the best match σ̂ is category c_i; the predicted probability of the corresponding category label for each prediction box is output by the MLP classification head; Step 3: Train the multimodal large model small-sample open-set detection model. Step 3.1 First, adjust the format of each image in the photovoltaic inspection image set, wind turbine inspection image set, and tower inspection image set from Step 1 to meet the input format requirements of the basic detector. Then, divide the defect types in each image set into three parts: the first part is small sample known class data, the second part is zero sample known class data, and the third part is unknown class data. Small sample known class data refers to images where the corresponding defect category is labeled and the text description of the image includes the corresponding defect category. Zero sample known class data refers to images where the corresponding defect category is not labeled, but the text description of the image includes the corresponding defect category. Unknown class data refers to images where the corresponding defect category is labeled "unknown" and the text description of the image includes the "unknown" defect category. Step 3.2 The base detector is either the pre-trained ViT-B32 or ViT-L14 released by DeepMind; set the training batch size to 1, and set the learning rate, hyperparameters, and maximum number of iterations; Step 3.3 Input the image data from the photovoltaic inspection image set, wind turbine inspection image set, and tower inspection image set into the pre-trained basic detector of Step 3.2. Each image data item includes an image P_x and the corresponding text description W_x. Image P_x passes through the image feature embedding module and the Transformer encoder to produce the output feature T_x, and T_x is linearly projected to obtain the image features f_x; the text description W_x is input into the text processing module, where the text encoder first extracts the feature vector of the text description, and the feature vector corresponding to W_x is then stored in the multi-text scientific embedding query library; the category similarity between the image features f_x and the multi-text scientific embedding query library is then calculated and max-pooled to obtain the category prediction similarity score S_x, which gives the predicted label of each prediction box of the image features f_x; at the same time, feature T_x is input into the MLP classification head, which outputs a fixed-size set of N predicted bounding boxes and the probability that each predicted bounding box belongs to each predicted label. The predicted label with the highest probability is the category label of that predicted bounding box. The N predicted bounding boxes are input into the unknown class pseudo-sample mining module. This module first constructs a cost matrix containing the virtual unknown class by adding a virtual node to the bipartite graph. It then uses the Kuhn-Munkres algorithm to search the cost matrix for the lowest-cost best match of the N elements and calculates the value of the cost-aware unknown class loss function L_ux; based on the value of L_ux, the feature vector corresponding to the text description W_x in the multi-text scientific embedding query library is corrected by gradient backpropagation; the training loss L of image P_x is calculated from the target bounding boxes and defect category labels annotated on image P_x, the category prediction similarity score S_x, the fixed-size set of N predicted bounding boxes output by the MLP classification head, the probability that each bounding box belongs to each predicted label, and the unknown class loss function L_ux: (7)(8)(9) where L_ce denotes the cross-entropy classification loss of the best match σ̂ from the unknown class pseudo-sample mining module, in which the true category label of the best match is the annotation label of the input image itself and its predicted label is the result predicted from the category prediction similarity score S_x; the bounding box loss is the sum of the IoU loss and the L1 regression loss between the predicted defect location bounding boxes output by the MLP classification head and the ground-truth bounding boxes, with λ_iou and λ_L1 as its weighting coefficients; ω is the weighting coefficient of the unknown class loss function L_ux; with the goal of minimizing the training loss L of image P_x, the AdamW optimizer is used. The network parameters of the basic detector are updated once, which completes training on one image data item. The next image data item is then input into the basic detector, with the network parameters obtained from the previous item used as its initial parameters. This process is repeated until the last image data item has been trained, which completes one training round. The network parameters from the previous training round are then used as the initial parameters for the next training round, and the process repeats until the preset number of training rounds is reached, yielding the basic detector with optimal parameters.
Step 4: Inspection of new energy power plants. Obtain inspection photos of photovoltaic panels, wind turbines, or towers taken at new energy power plants. First adjust their format to meet the input format requirements of the basic detector, and then input them into the basic detector with the optimal parameters obtained in Step 3. The MLP classification head outputs the detection results, which indicate whether the inspected component contains defects and, if so, the defect category.

2. The method for inspecting new energy power plants based on a multimodal large model and a small sample open set according to claim 1, characterized in that, In step one, the defect categories in the photovoltaic inspection image set include images of hot spot defects, cracking defects, diode failure defects, pollution defects, and shading defects in photovoltaic panels; the defect categories in the wind turbine inspection image set include images of dirt adhesion, oil leakage, scratches, paint peeling, rust, fading, corrosion, deformation, cracks, and wear; and the defect categories in the pole tower inspection image set include images of loose screws, rust, deformation, cracks, and damaged metal fittings.

3. The method for inspecting new energy power plants based on a multimodal large model and a small sample open set according to claim 1, characterized in that, in step two, the temperature coefficient τ is set to 0.05.

4. The method for inspecting new energy power plants based on a multimodal large model and a small sample open set according to claim 1, characterized in that the Kuhn-Munkres algorithm is used to search the cost matrix for the lowest-cost best match of the N elements, and the specific process is as follows: To obtain the unknown class matching box, first calculate the cost weight Q_match connecting the unknown class virtual node to each prediction box node, and construct a cost matrix that includes the virtual unknown class; let ŷ denote the set of N predicted labels; assuming N is greater than the number of targets in the image, y is regarded as a set of size N consisting of the true labels of the targets padded with zero elements; first, one zero element in y is replaced by the unknown class virtual label y_u, in order to find the maximum matching between the sets y and ŷ by searching for the lowest-cost best match σ of the N elements, σ ∈ G_N, where G_N represents the cost matrix; (2) where L_match(y_i, ŷ_σ(i)) represents the matching cost between the true label y_i (i ≠ u) and the predicted label with index σ(i); the matching cost is the sum of the distance between the predicted probability of the true class and 1 and the distance between the predicted bounding box and the ground-truth bounding box; each element y_i of the true label set can be viewed as y_i = (c_i, b_i), where c_i is the target category label and b_i is defined as the center coordinates of the ground-truth bounding box together with its height and width relative to the image size; for the predicted label with index σ(i), the probability of category c_i is denoted p̂_σ(i)(c_i) and the prediction box is denoted b̂_σ(i); L_match is calculated as shown in equation (3): (3) where the bounding box term represents the sum of the IoU loss and the L1 regression loss between the predicted defect location bounding box output by the MLP classification head and the ground-truth bounding box; to calculate Q_match, first define: (4) where i, j = 1, …, N; the cost weight connecting the unknown class virtual node to each prediction box node is then expressed as: (5) where the maximum term represents the largest of the connection costs between the predicted bounding boxes and the ground-truth bounding boxes, and N_gt indicates the number of labeled targets in the image, i.e., the number of predicted bounding boxes.

5. The method for inspecting new energy power plants based on a multimodal large model and a small sample open set according to claim 2, characterized in that, in step three, the photovoltaic inspection image set uses hot spots and cracks as the two small-sample known classes, diode faults and pollution as the two zero-sample known classes, and shading as the one unknown class; the tower inspection image set uses loose screws and rust as the two small-sample known classes, deformation and cracks as the two zero-sample known classes, and damage to metal parts as the one unknown class; and the wind turbine inspection image set uses dirt, oil leaks, and scratches as the three small-sample known classes, paint peeling and rust as the two zero-sample known classes, and fading, corrosion, deformation, cracks, and wear as the five unknown classes.

6. The method for inspecting new energy power plants based on a multimodal large model and a small sample open set according to claim 1, characterized in that, In step three, the learning rate is set to 0.000003, the hyperparameters τ and ω are set to 0.05 and 1 respectively, and the maximum number of iterations is set to 100.

7. The method for inspecting new energy power plants based on a multimodal large model and a small sample open set according to claim 1, characterized in that, In step one, the defective image and the non-defective image are at least one of infrared image and visible light image.