Power station inspection image segmentation method and system based on multi-modal driving and dynamic optimization

CN122289698A · Pending · Publication Date: 2026-06-26 · SHANDONG JIANZHU UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHANDONG JIANZHU UNIV
Filing Date
2026-05-26
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing power equipment inspection image segmentation technology cannot simultaneously balance accuracy, generalization and engineering practicality. It suffers from problems such as high annotation cost, fixed pseudo-labels and large computational overhead, and cannot meet the high accuracy and high real-time requirements of power inspection.

Method used

A multimodal driven and dynamic optimization approach is adopted. Initial pseudo-labels are generated using a pre-trained visual language model and a multimodal large language model. Through cross-modal feature alignment and transition matrix optimization, a lightweight image segmentation network is constructed to achieve efficient and stable equipment defect segmentation.

Benefits of technology

It significantly improves the accuracy and generalization ability of power equipment defect segmentation, reduces annotation costs, adapts to complex field environments, is compatible with edge computing devices, and meets the high precision and real-time requirements of power inspection.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention proposes a multimodal driven and dynamically optimized method and system for power plant inspection image segmentation, belonging to the field of computer vision technology. The method includes: acquiring inspection sample images; generating initial pseudo-labels using a pre-trained visual language model and a multimodal large language model; inputting the inspection sample images into the visual model to obtain dense features, and aligning them cross-modally with text semantic embeddings to obtain enhanced visual features; constructing a transition matrix based on image features and enhanced visual features, and inputting these features into a decoder to obtain dual-path prediction masks, which are then fused to generate a segmentation probability map; optimizing the initial pseudo-labels based on the segmentation probability map and the transition matrix, and using the optimized pseudo-labels to construct a supervision signal that guides the generation of the transition matrix; and iteratively optimizing until convergence to obtain a trained image segmentation network, into which the inspection image to be segmented is input to obtain the segmentation result. This improves the segmentation accuracy and generalization ability for inspection images.

Description

Technical Field

[0001] This invention relates to the field of computer vision technology, and in particular to a multimodal driven and dynamically optimized image segmentation method and system for power plant inspection.

Background Technology

[0002] With the rapid development of smart grids, intelligent power grid inspection has become an important guarantee for the safe operation of power. Image segmentation technology has been widely used in power equipment defect identification and condition assessment. However, existing methods struggle to simultaneously achieve high accuracy, generalization, and engineering applicability.

[0003] Mainstream segmentation algorithms rely on a large number of pixel-level annotations. However, the complex structure of power equipment and the uneven distribution of defect samples lead to high annotation costs and a scarcity of effective data, limiting the actual performance of the models. Weakly supervised methods lower the annotation threshold, but their pseudo-labels are fixed and easily affected by lighting, occlusion, and complex backgrounds, resulting in semantic bias, blurred boundaries, and an inability to correct errors in real time. General-purpose large models lack power-domain-specific knowledge, are insufficiently accurate in identifying subtle equipment defects, and are not robust when they rely on a single visual feature. Furthermore, large models have high computational costs, making them difficult to deploy on edge devices such as drones and inspection robots, so they fail to meet the high-precision, real-time requirements of power inspection operations.

Summary of the Invention

[0004] To address the aforementioned issues, this invention proposes a multimodal driven and dynamically optimized power plant inspection image segmentation method and system, which improves the segmentation accuracy and generalization ability of inspection images, and achieves efficient, stable, and high-precision equipment defect segmentation with little or no annotation.

[0005] To achieve the above objectives, the present invention adopts the following technical solution:

In a first aspect, the present invention provides a multimodal driven and dynamically optimized power plant inspection image segmentation method, comprising:

Obtaining inspection sample images, using a pre-trained visual language model and a multimodal large language model to obtain image features and text semantic embeddings, and generating initial pseudo-labels;

Inputting the inspection sample images into the visual model to obtain dense features, and aligning the dense features with the text semantic embeddings across modalities to obtain enhanced visual features;

Constructing a transition matrix reflecting the spatial correlation between pixels based on the image features and enhanced visual features; inputting the image features and enhanced visual features into the decoder to obtain dual-path prediction masks, which are then fused to generate a segmentation probability map;

Optimizing the initial pseudo-labels based on the segmentation probability map and the transition matrix to obtain the final pseudo-labels; using the final pseudo-labels to construct a supervision signal to guide the generation of the transition matrix; and performing the optimization iteratively until convergence to obtain the trained image segmentation network;

Inputting the inspection image to be segmented into the trained image segmentation network to obtain the segmentation result.

[0006] In a second aspect, the present invention provides a multimodal driven and dynamically optimized power plant inspection image segmentation system, comprising:

An initial pseudo-label building unit, configured to acquire inspection sample images, use a pre-trained visual language model and a multimodal large language model to obtain image features and text semantic embeddings, and generate initial pseudo-labels;

A feature processing unit, configured to input the inspection sample images into the visual model to obtain dense features, and to perform cross-modal alignment between the dense features and the text semantic embeddings to obtain enhanced visual features;

A probability map acquisition unit, configured to construct a transition matrix reflecting the spatial correlation between pixels based on the image features and enhanced visual features, and to input the image features and enhanced visual features into the decoder to obtain dual-path prediction masks, which are then fused to generate a segmentation probability map;

A supervision unit, configured to optimize the initial pseudo-labels based on the segmentation probability map and the transition matrix to obtain the final pseudo-labels, use the final pseudo-labels to construct a supervision signal to guide the generation of the transition matrix, and perform iterative optimization until convergence to obtain the trained image segmentation network;

A segmentation unit, configured to input the inspection image to be segmented into the trained image segmentation network to obtain the segmentation result.

[0007] In a third aspect, the present invention provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the multimodal driven and dynamically optimized power plant inspection image segmentation method described in the first aspect.

[0008] In a fourth aspect, the present invention provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of the multimodal driven and dynamically optimized power plant inspection image segmentation method described in the first aspect.

[0009] Compared with the prior art, the beneficial effects of the present invention are as follows: This invention utilizes visual language models and large language models to generate initial pseudo-labels, significantly reducing reliance on manual annotation. Through cross-modal alignment of dense features and textual semantic embedding, it strengthens the semantic consistency of defect features, improving the robustness of identifying small-sample, low-contrast defects. By constructing a spatial correlation transition matrix and fusing dual-path prediction masks, it achieves accurate association and segmentation of pixel-level defect features, effectively suppressing noise interference caused by lighting and occlusion in inspection scenarios. The iterative optimization of pseudo-labels and the reverse constraint mechanism of the supervision signal enable model self-correction and accelerated convergence, effectively improving the segmentation accuracy and generalization ability of various defects such as instrument damage and insulator damage, providing efficient and reliable technical support for intelligent inspection of power equipment defects.

[0010] Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Attached Figure Description

[0011] The accompanying drawings, which form part of this invention, are used to provide a further understanding of the invention. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not constitute a limitation thereof.

[0012] Figure 1 is the main flowchart of a multimodal driven and dynamically optimized power plant inspection image segmentation method provided in an embodiment of the present invention;
Figure 2 is a basic framework diagram of a multimodal driven and dynamically optimized power plant inspection image segmentation method provided in an embodiment of the present invention;
Figure 3 is a flowchart of cross-modal feature alignment and enhancement provided in an embodiment of the present invention;
Figure 4 is a flowchart of the error correction transition matrix generation process provided in an embodiment of the present invention;
Figure 5 is a schematic diagram of a lightweight decoder provided in an embodiment of the present invention;
Figure 6 shows experimental effect diagrams provided in an embodiment of the present invention, wherein (a) is a schematic diagram for identifying instrument damage; (b) is a schematic diagram for identifying rusty metal; (c) is a schematic diagram for identifying bird nests; (d) is a schematic diagram for identifying oil stains on the ground; (e) is a schematic diagram for identifying abnormal door closure; and (f) is a schematic diagram for identifying insulator damage.

Detailed Implementation

[0013] The present invention will be further described below with reference to the accompanying drawings and embodiments.

[0014] As mentioned in the background section, with the accelerated development of smart grids, intelligent inspection of substations and transmission lines has become a crucial link in ensuring the safe and stable operation of the power system. Currently, image segmentation technology based on computer vision is widely used in defect detection and condition assessment of power equipment such as insulators, transformer bushings, and circuit breakers. However, in practical engineering applications, existing segmentation technologies still face many serious challenges.

[0015] Mainstream high-precision semantic segmentation models (such as Mask R-CNN) typically rely on massive amounts of pixel-level labeled data for fully supervised training. However, power equipment is diverse and structurally complex, and defect samples (such as insulator damage and hardware corrosion) exhibit long-tail distribution characteristics. Pixel-level labeling requires not only highly specialized knowledge but is also time-consuming and labor-intensive, resulting in a severe shortage of high-quality labeled data, which seriously restricts the generalization ability of segmentation models in the power field. Although weakly supervised semantic segmentation reduces labeling costs by utilizing image-level labels, these methods typically rely on class activation maps to complete foreground region mining and pseudo-label generation. However, the generated class activation maps (CAMs) often only focus on the most discriminative local regions of the object, making it difficult to cover the complete equipment outline, resulting in incomplete segmentation results.

[0016] Most existing weakly supervised segmentation methods follow a two-stage process: "offline generation of pseudo-labels to train the segmentation network." In this process, once generated, the pseudo-labels are fixed and used as label supervision for subsequent training. However, in complex field inspection scenarios, changes in lighting, vegetation occlusion, and background interference can easily lead to semantic ambiguity or boundary shifts in the initial pseudo-labels. Due to the lack of online correction mechanisms, these initial errors are amplified during training, causing the model to learn incorrect feature representations, resulting in missed or false detections of critical equipment components, failing to meet the high-precision defect detection requirements of power line inspection.

[0017] In recent years, although general segmentation models, represented by SAM (Segment Anything Model), have demonstrated strong zero-shot capabilities, they are mainly based on pre-training on natural images and lack expertise in the power field. When dealing with power equipment with specific geometric structures, such as insulator strings and equalizing rings, general models often struggle to accurately capture their fine structures and subtle defects. Furthermore, relying solely on a single visual feature (such as focusing only on texture or semantics) is insufficient to handle complex inspection environments, easily misidentifying background debris as equipment parts, or causing blurred segmentation boundaries in adverse weather conditions.

[0018] With the widespread application of edge devices such as drones and inspection robots, inspection scenarios place stringent demands on the computational efficiency of algorithms. While existing multimodal large language models (MLLMs) and large-scale visual foundation models (such as CLIP and DINO) possess powerful feature representation capabilities, their massive parameter counts make full fine-tuning or real-time inference in computationally limited edge environments highly impractical. How to retain the advantages of large models while reducing computational overhead and achieving efficient deployment in resource-constrained environments is a pressing technical challenge.

[0019] Based on this, the present invention proposes a multimodal driven and dynamically optimized power plant inspection image segmentation method, system, medium and equipment. By integrating multimodal prior knowledge and dynamic correction capabilities, a weakly supervised segmentation method adapted to edge computing environment is designed to solve the problems of scarce labeled data, low segmentation accuracy and high deployment cost in power inspection scenarios.

Example 1

[0020] As shown in Figure 1, this embodiment discloses a multimodal driven and dynamically optimized power plant inspection image segmentation method, including the following steps:
S1: Obtain inspection sample images, use pre-trained visual language models and multimodal large language models to obtain image features and text semantic embeddings, and generate initial pseudo-labels;
S2: Input the inspection sample images into the visual model to obtain dense features, and perform cross-modal alignment between the dense features and the text semantic embeddings to obtain enhanced visual features;
S3: Construct a transition matrix reflecting the spatial correlation between pixels based on image features and enhanced visual features; and input the image features and enhanced visual features into the decoder to obtain dual-path prediction masks, which are then fused to generate a segmentation probability map;
S4: Optimize the initial pseudo-labels based on the segmentation probability map and the transition matrix to obtain the final pseudo-labels; use the final pseudo-labels to construct a supervision signal to guide the generation of the transition matrix; perform iterative optimization until convergence to obtain the trained image segmentation network;
S5: Input the image to be segmented into the trained image segmentation network to obtain the segmentation result.

[0021] Next, with reference to Figure 2, the multimodal driven and dynamically optimized power plant inspection image segmentation method of this embodiment is described in detail.

[0022] The image segmentation method described in this embodiment is designed and implemented based on computer vision technology. The image segmentation network is constructed from a pre-trained visual language model (CLIP image encoder and CLIP text encoder), a visual model (DINO model), and a multimodal large language model. During training, the image segmentation network takes a visible light image and a text description as input, while during inference only the visible light image is required. In the feature extraction part, the pre-trained models CLIP and DINO replace the traditional encoder, and their parameters are kept frozen. CLIP is a pre-trained visual language model used to extract the CAM and image features, and DINO is a pre-trained self-supervised model used to extract image features. Lightweight cross-modal alignment then aligns the image features of DINO with the text features of CLIP, so that semantic information enhances the DINO features. The dual-modal correction module generates a transition matrix from these high-quality features and updates it continuously during training. The lightweight segmentation network generates a mask from the features and uses the optimized pseudo-labels as its supervision signal. The dynamic optimization engine uses the transition matrix and the segmentation results to generate refined pseudo-labels, while guiding the model to perform iterative optimization during training.

(I) Static CAM Generation Based on CLIP and MLLM Collaboration

[0023] This step aims to address the semantic ambiguity and inaccurate localization of the traditional CAM (class activation map) by generating static pseudo-labels for use in the online phase. A class activation map represents the region where the target is located.

[0024] First, obtain inspection sample images such as damaged insulators and missing electrical box covers, and their corresponding image-level category labels.

[0025] The inspection sample images are input into the frozen CLIP image encoder to extract image features. Simultaneously, the image-level category labels are input into the CLIP text encoder for encoding to extract text features.

[0026] The cosine similarity between the image features and the text features is calculated, and the resulting global similarity matrix is filtered, normalized, and weighted to generate a global class activation map (Global CAM), which captures the overall semantic region of interest in the image.
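The similarity-to-CAM step can be illustrated with a minimal NumPy sketch. It is not the patented implementation: the function name `cosine_cam`, the threshold `thresh`, and the min-max normalization are assumptions chosen for illustration, the weighting step is omitted, and random arrays stand in for CLIP patch and text features.

```python
import numpy as np

def cosine_cam(patch_feats, text_feat, grid_hw, thresh=0.0):
    """Build a class activation map from patch-to-text cosine similarity.

    patch_feats: (N, D) patch features; text_feat: (D,) text feature;
    grid_hw: (H, W) with H * W == N. All names here are illustrative.
    """
    p = patch_feats / (np.linalg.norm(patch_feats, axis=1, keepdims=True) + 1e-8)
    t = text_feat / (np.linalg.norm(text_feat) + 1e-8)
    sim = p @ t                                  # cosine similarity per patch
    sim = np.where(sim > thresh, sim, 0.0)       # filtering: drop weak responses
    span = sim.max() - sim.min()
    cam = (sim - sim.min()) / (span + 1e-8)      # min-max normalization to [0, 1]
    return cam.reshape(grid_hw)

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))                 # stand-in for CLIP patch features
text = feats[5].copy()                           # patch 5 matches the text exactly
cam = cosine_cam(feats, text, (4, 4))
```

In this toy setup the strongest activation falls on the patch whose feature equals the text feature, which is the behavior a similarity-based CAM is expected to show.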

[0027] Secondly, the inspection sample images are input into the frozen multimodal large language model (MLLM) to generate detailed text descriptions of the images; these fine-grained detailed text descriptions are then encoded using the CLIP text encoder to obtain text semantic embeddings.

[0028] The similarity between the image features and the text semantic embeddings is calculated and then filtered, normalized, and weight-enhanced to generate a patch-level class activation map (Patch CAM), which captures local details and boundary information of the object.

[0029] Finally, a complementary fusion strategy is adopted. The global class activation map is treated as a localization "seed", and the patch-level class activation map supplements it in the object regions it does not cover, yielding the final static activation map.

[0030] Here, the regions that the global class activation map leaves unactivated represent the potential inactive region of CLIP. The fusion uses the feature responses of the MLLM to fill in the blind spots of CLIP, significantly expanding the coverage of the active region without compromising the high-confidence localization results of CLIP. This effectively solves the problem that the traditional CAM only focuses on local discriminative regions.
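One way to realize such a complementary fusion is sketched below. The exact formula of the embodiment is not recoverable from this text, so the rule used here is an assumption: pixels where the global CAM falls below a hypothetical threshold `seed_thresh` are treated as the inactive region and take the larger of the two responses, while high-confidence seeds are left untouched.

```python
import numpy as np

def fuse_static_cam(cam_global, cam_patch, seed_thresh=0.5):
    """Fill low-response regions of the global CAM with the patch CAM.

    seed_thresh is a hypothetical cutoff separating CLIP "seed" pixels
    from the potentially inactive region; it is not taken from the source.
    """
    inactive = cam_global < seed_thresh                    # assumed inactive-region mask
    filled = np.maximum(cam_global, cam_patch)             # patch CAM can only add response
    return np.where(inactive, filled, cam_global)          # seeds keep their original value

cam_global = np.array([[0.9, 0.1], [0.2, 0.8]])
cam_patch = np.array([[0.3, 0.7], [0.1, 0.95]])
static_cam = fuse_static_cam(cam_global, cam_patch)
```

The seed pixels (0.9 and 0.8) are unchanged even where the patch CAM disagrees, and only the weakly activated pixels are supplemented.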

[0031] As one implementation method, if the application scenario has sufficient computing resources (such as cloud server clusters) and pursues ultimate accuracy, the encoder of the large model can be fine-tuned with all parameters or lightweight fine-tuning techniques such as LoRA can be used. By fine-tuning on a specific power equipment dataset, the model features can be made more suitable for specific tasks, further improving segmentation accuracy. In addition, the inspection sample images are generally visible light (RGB) images. For extreme environments such as nighttime inspections, dense fog, or strong light interference, infrared (IR) thermal imaging images can be used as the primary input, and temperature thresholds can be used to locate defect areas. Alternatively, a multimodal fusion scheme can be adopted to fuse the temperature features of infrared images with the texture features of visible light images across modes, thereby solving the problem of information loss in special environments under single modes and realizing all-weather power equipment inspection.

[0032] This embodiment utilizes the collaborative mechanism of CLIP and MLLM to generate initial pseudo-labels through a complementary fusion strategy. CLIP provides high-confidence semantic seeds, while MLLM supplements details and boundary regions, effectively solving the problem in traditional weakly supervised methods where class activation maps (CAMs) only focus on local objects and ignore the complete structure. This method can generate high-quality supervision signals without any pixel-level manual annotation, greatly reducing the time and economic costs of data annotation in the power industry.

(II) Cross-modal feature alignment and feature enhancement

[0033] As shown in Figure 3, this step aims to extract complementary visual features in resource-constrained environments. Features extracted by different pre-trained models exhibit semantic gaps. A lightweight cross-modal alignment is used to align the semantic information of CLIP with the dense features of the visual model (DINO), thereby endowing DINO with linguistic attributes so that the complementary advantages of the two models are fully leveraged.

[0034] First, cross-modal alignment is achieved using a projection layer with a small number of parameters and non-linear activation. This projection layer is the only trainable layer in the entire alignment module. The obtained text semantic embedding is passed through the projection layer so that its dimensions align with the visual features of DINO.

[0035] Secondly, the inspection sample images are input into the DINO model, and the attention map of each attention head is extracted from the last layer of the DINO backbone. Each attention map identifies an image region of interest, and a fine-grained visual embedding of the corresponding region is extracted through attention-weighted averaging of the DINO features.

[0036] The cosine similarity between each fine-grained visual embedding and the projected text semantic embedding is calculated to assess their semantic association.

[0037] To optimize this cross-modal alignment, the idea of contrastive learning is used: the InfoNCE alignment loss function maximizes the similarity of matched text-visual pairs while minimizing the similarity of non-matching pairs, thereby achieving accurate cross-modal alignment.
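For reference, a minimal NumPy sketch of a standard InfoNCE loss over a batch of matched text-visual pairs follows. The batch layout (row i of each array forms a matched pair), the `temperature` value, and the one-directional (text-to-visual) form are assumptions; the source does not state which variant is used.

```python
import numpy as np

def info_nce(text_emb, vis_emb, temperature=0.07):
    """InfoNCE loss; row i of text_emb and row i of vis_emb are a matched pair."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = vis_emb / np.linalg.norm(vis_emb, axis=1, keepdims=True)
    logits = (t @ v.T) / temperature                       # (B, B) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)    # subtract row max for stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                     # matched pairs sit on the diagonal

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
loss_matched = info_nce(x, x)                              # perfectly aligned pairs
loss_shuffled = info_nce(x, np.roll(x, 1, axis=0))         # deliberately mismatched pairs
```

The loss is small when each text embedding is closest to its own visual embedding and grows when the pairing is broken, which is the signal that trains the projection layer.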

[0038] Finally, an alignment matrix is obtained by calculating the similarity between the CLIP text semantic embeddings and the DINO dense features. The alignment matrix is aggregated by average pooling along the channel dimension to generate a compressed semantic similarity matrix. The compressed semantic similarity matrix is concatenated with the DINO dense features and passed through a lightweight module (linear mapping and 1x1 convolution) to obtain the DINO enhanced features, that is, the enhanced visual features.

[0039] This embodiment constructs a CLIP and DINO dual-path frozen network and innovatively introduces a lightweight cross-modal alignment module. CLIP features provide accurate semantic discrimination, while DINO features provide clear geometric boundaries. The two are deeply fused under alignment loss constraints. This design not only ensures the accuracy of power equipment identification but also enhances the boundary perception capability of fine structures such as insulator strings and transformer bushings, effectively avoiding background noise interference and significantly reducing the missed detection and false positive rates.

(III) Generation of Transition Matrix Based on Dual-Mode Features

[0040] As shown in Figure 4, this step captures the spatial relationships between pixels, providing a foundation for static pseudo-label optimization. The core of online pseudo-label optimization lies in using the semantic relationships between pixels to propagate high-confidence predictions to uncertain regions. Therefore, by leveraging the complementary properties of CLIP and DINO features, a bimodal cooperative transition matrix is constructed to guide subsequent label optimization.

[0041] First, the CLIP features and the DINO enhanced features are each used to calculate an affinity matrix between pixels. Both affinity matrices are calculated using a cosine similarity-based method.

[0042] To eliminate the influence of differences in feature amplitudes and to make the associations more salient, the two affinity matrices are mapped to the [0,1] interval using the Sigmoid function and then fused by element-wise multiplication to obtain the final affinity graph.
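The two-branch affinity computation described above can be sketched as follows. The feature shapes and function names are illustrative assumptions, and random arrays stand in for the CLIP features and the DINO enhanced features.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cosine_affinity(feats):
    """Pairwise cosine similarity between N pixel (or patch) features of shape (N, D)."""
    f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    return f @ f.T

def fused_affinity(clip_feats, dino_feats):
    """Squash each affinity to (0, 1) with a sigmoid, then fuse element-wise."""
    a_clip = sigmoid(cosine_affinity(clip_feats))
    a_dino = sigmoid(cosine_affinity(dino_feats))
    return a_clip * a_dino        # high only where both modalities agree

rng = np.random.default_rng(0)
affinity = fused_affinity(rng.normal(size=(6, 4)), rng.normal(size=(6, 5)))
```

Because the fusion is a product, a pixel pair receives a high affinity only when it is similar in both feature spaces, which is one way to read the "cooperative" behavior the text describes.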

[0043] Secondly, to ensure the stability and global consistency of the propagation process, the final affinity graph is normalized to transform it into a probability transition matrix.

[0044] Specifically, an iterative doubly stochastic normalization strategy is applied to the final affinity graph: row normalization and column normalization are performed alternately. After several iterations, the matrix converges to a doubly stochastic matrix, that is, every row sum and every column sum equals 1.
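Alternating row and column normalization of this kind is commonly known as Sinkhorn normalization. A minimal sketch is given below; the iteration count and the small constant `eps` are illustrative choices, not values from the source.

```python
import numpy as np

def sinkhorn(affinity, n_iters=50, eps=1e-8):
    """Alternate row and column normalization of a positive matrix."""
    m = affinity.astype(float).copy()
    for _ in range(n_iters):
        m /= m.sum(axis=1, keepdims=True) + eps   # make each row sum to 1
        m /= m.sum(axis=0, keepdims=True) + eps   # make each column sum to 1
    return m

rng = np.random.default_rng(0)
A = rng.uniform(0.1, 1.0, size=(5, 5))            # stand-in for the final affinity graph
D = sinkhorn(A, n_iters=200)
```

For a matrix with strictly positive entries, as produced by the sigmoid fusion, the iteration converges so that both row sums and column sums approach 1.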

[0045] To eliminate the directional deviation caused by asymmetry, the matrix is further symmetrized, and the receptive field is expanded through a random walk operation to establish long-distance dependencies between pixels.

[0046] The symmetrization uses the matrix transpose, and the random walk is run for a set number of steps. The resulting transition matrix is ultimately used to guide the correction of the pseudo-labels: it smoothly propagates high-confidence semantic information along geometric boundaries into ambiguous areas, achieving joint optimization of semantics and structure.
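A common way to implement symmetrization followed by a multi-step random walk is to average the matrix with its transpose and raise the result to a power. The sketch below assumes that form, written as T = ((D + Dᵀ) / 2)ᵗ with t the number of steps; the source formula is not legible here, so both the form and the symbols are assumptions.

```python
import numpy as np

def random_walk_transition(doubly_stochastic, steps=2):
    """Symmetrize a doubly stochastic matrix, then take a multi-step random walk."""
    sym = 0.5 * (doubly_stochastic + doubly_stochastic.T)   # remove directional bias
    return np.linalg.matrix_power(sym, steps)               # steps-hop propagation

# Small doubly stochastic example: every row and every column sums to 1.
D = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.3, 0.2, 0.5]])
T = random_walk_transition(D, steps=2)
```

The average of a doubly stochastic matrix and its transpose is still doubly stochastic, and so is any power of it, so the result remains a valid transition matrix while reaching pixels that are several hops away.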

[0047] To address the performance bottleneck caused by erroneous activation of pseudo-labels in traditional weakly supervised segmentation, this embodiment designs a dynamic pseudo-label correction module. Using the semantic advantages of CLIP and the structural advantages of DINO, a transition matrix is constructed to correct the initial pseudo-labels online during training. This mechanism enables the model to detect and correct boundary biases and missed regions by itself, improving the intersection over union (IoU) at segmentation boundaries and effectively overcoming interference in complex field environments.

(IV) Dual-path decoding and consistency constraint training

[0048] As shown in Figure 5, this step uses the complementarity of the dual-path predictions to improve the robustness of the model through mutual constraints, while the high-quality segmentation predictions are used in turn to optimize the static pseudo-labels.

[0049] First, the CLIP features and the enhanced DINO features are used as the final image features. The two feature streams are input into the same lightweight decoder (a 3-layer Transformer structure), and a 1x1 convolution is used as the classifier to obtain the two dual-path prediction masks.

[0050] Secondly, CLIP excels at capturing high-level semantic consistency while DINO is adept at preserving low-level geometric structure, so predictions from a single branch often have limitations. Therefore, an adaptive gating fusion module is designed: the two prediction masks are concatenated along the channel dimension, and a spatial weight map is generated through a lightweight convolutional network. The contribution weights of the two branches are dynamically adjusted based on the spatial context of the features to generate a high-confidence online segmentation probability map.

Finally, although CLIP and DINO are complementary in feature representation, their feature spaces still exhibit distributional differences because of their different pre-training objectives. To prevent the decoder from learning modality-specific noise and to promote consistency between the two branch predictions, a consistency loss is introduced. This loss forces the two segmentation predictions to be close to each other in probability space, achieving implicit alignment at the feature level.
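The gating idea can be sketched in NumPy as below. Several details are assumptions: the gate is modeled as a single 1x1 convolution (a per-pixel linear map over the concatenated channels) followed by a sigmoid, the fused map is a convex combination of the two branches, and the consistency loss is taken as mean squared error between the branch probabilities. The source specifies none of these forms.

```python
import numpy as np

def softmax(x, axis=0):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_fusion(logits_clip, logits_dino, gate_w, gate_b):
    """logits_*: (C, H, W); gate_w: (2C,) weights of an assumed 1x1 gate conv."""
    p_clip, p_dino = softmax(logits_clip), softmax(logits_dino)
    stacked = np.concatenate([p_clip, p_dino], axis=0)          # (2C, H, W)
    gate_logit = np.tensordot(gate_w, stacked, axes=1) + gate_b # (H, W) spatial map
    gate = 1.0 / (1.0 + np.exp(-gate_logit))                    # per-pixel weight in (0, 1)
    fused = gate * p_clip + (1.0 - gate) * p_dino               # convex combination
    consistency = np.mean((p_clip - p_dino) ** 2)               # assumed MSE consistency loss
    return fused, consistency

rng = np.random.default_rng(0)
lc, ld = rng.normal(size=(3, 4, 4)), rng.normal(size=(3, 4, 4))
w = rng.normal(size=6)
fused, cons = gated_fusion(lc, ld, gate_w=w, gate_b=0.0)
_, cons_same = gated_fusion(lc, lc, gate_w=w, gate_b=0.0)
```

Because the gate forms a convex combination, the fused output stays a valid per-pixel class distribution, and the consistency term vanishes when the two branches agree exactly.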

(V) Dynamic pseudo-label correction and affinity supervision

[0051] This step addresses the problem that static pseudo-labels cannot be corrected once generated. Based on the segmentation map and transition matrix produced by the segmentation network, an online optimization module is designed. It dynamically integrates multi-level cues: it optimizes the static pseudo-labels using the segmentation map and the transition matrix, and it also supervises the generation of the affinity matrix using an affinity graph derived from the pseudo-labels. This promotes continuous co-evolution between the segmentation model and its pseudo-labels.

[0052] First, the final segmentation probability map and the transition matrix are used to optimize the initial static activation map. To correct deviations in the static CAM and suppress noisy activations, a fusion strategy based on element-wise multiplication (⊙) is designed, which yields the optimized activation map. The optimized activation map is then refined using post-processing techniques (e.g., CRF) to generate the final pseudo-labels.
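A possible reading of this refinement step is sketched below. The exact fusion formula is not legible in this text, so the form used here is an assumption: the static CAM is gated by the segmentation probabilities through element-wise multiplication, and the result is propagated with the transition matrix. CRF post-processing is omitted, and a plain argmax stands in for label assignment.

```python
import numpy as np

def refine_cam(static_cam, seg_prob, transition):
    """static_cam, seg_prob: (C, H, W); transition: (H*W, H*W) row-stochastic."""
    c, h, w = static_cam.shape
    gated = (static_cam * seg_prob).reshape(c, h * w)   # keep activations both sources support
    propagated = gated @ transition.T                   # pixel i gathers scores from its neighbors
    return propagated.reshape(c, h, w)

cam = np.array([[[0.9, 0.2], [0.1, 0.8]],
                [[0.1, 0.8], [0.9, 0.2]]])              # two classes on a 2x2 grid
prob = cam.copy()                                       # toy segmentation probabilities
T = np.eye(4)                                           # identity: no propagation
refined = refine_cam(cam, prob, T)
labels = refined.argmax(axis=0)                         # stand-in for CRF + label assignment
```

With an identity transition matrix the output reduces to the element-wise product, which makes the gating effect easy to inspect; a learned transition matrix would additionally spread confident scores to related pixels.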

[0053] Secondly, due to the transition matrix It is derived from the final affinity diagram The generated data is used to correct the initial activation graph, so its quality determines the upper limit of tag optimization to a certain extent. It is generated during training and is a learnable matrix; therefore, a supervision signal is constructed for it using the corrected pseudo-labels. This, in turn, guides the network to learn more robust feature associations.

[0054] By minimizing the cross-entropy loss between the final affinity map and the supervision signal, the network is forced to learn spatial relationships that conform to semantic logic, thereby generating a higher-quality transition matrix in the next iteration.
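
A minimal sketch of such an affinity supervision loss follows. It assumes that the supervision signal is a binary same-class indicator for every pixel pair, that low-confidence pixels carry a hypothetical `IGNORE` label and are excluded, and that the predicted affinities lie in [0, 1]. None of these details is stated in the text.

```python
import numpy as np

IGNORE = 255  # hypothetical label for low-confidence pixels


def affinity_targets(labels):
    """Build pairwise targets from an (H, W) pseudo-label map."""
    flat = labels.reshape(-1)
    same = (flat[:, None] == flat[None, :]).astype(np.float64)  # 1 if same class
    valid = (flat[:, None] != IGNORE) & (flat[None, :] != IGNORE)
    return same, valid


def affinity_bce(affinity, labels, eps=1e-7):
    """Binary cross-entropy between predicted affinities and pairwise targets."""
    target, valid = affinity_targets(labels)
    a = np.clip(affinity, eps, 1.0 - eps)
    bce = -(target * np.log(a) + (1.0 - target) * np.log(1.0 - a))
    return float(bce[valid].mean())


labels = np.array([[0, 0], [1, IGNORE]])
target, valid = affinity_targets(labels)
loss_uniform = affinity_bce(np.full((4, 4), 0.5), labels)  # uninformative affinities
loss_perfect = affinity_bce(target, labels)                # affinities equal to targets
```

An uninformative affinity of 0.5 everywhere yields a loss of ln 2, while affinities that match the targets drive the loss toward zero.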

[0055] After multiple iterations, the trained image segmentation network is obtained once the loss function converges. The visible light inspection image to be segmented is then input into the trained image segmentation network to obtain the segmentation result.

[0056] To address the limited computing power of edge devices such as drones and inspection robots, this embodiment employs a full-backbone freezing strategy, training only the lightweight alignment module and decoder. While retaining the powerful feature extraction capabilities of the large model, this significantly reduces the computational overhead of training and inference. This enables the high-precision, weakly supervised segmentation algorithm to be deployed at low cost on actual inspection terminals, providing technical feasibility for real-time intelligent diagnosis.

[0057] To verify the effectiveness of this embodiment, the following specific implementation method is provided.

[0058] 1. Experimental scenario and training dataset construction. This embodiment selects substations and power equipment in natural settings as typical inspection scenarios. Images are acquired under different lighting conditions (e.g., front lighting, backlighting, and shadows) and from different viewing angles using fixed high-definition cameras and airborne high-definition PTZ cameras to construct a high-resolution image dataset.

[0059] Defect categories include multi-scale and multi-form defect types such as insulator damage, cable damage, bird nests and foreign objects, electrical box door damage, and equipment aging.

[0060] Labeling: A weakly supervised learning paradigm is used, and the dataset only provides image-level labels. That is, during the training phase, the model is only informed of the type of defect present in the current image, without providing the specific geometric contours or pixel-level masks of the defect targets.

[0061] 2. Initial pseudo-label generation process. For each input inspection image, a pre-trained visual language model is first used together with text prompts such as "damaged insulator", "bird nest", and "foreign object" to generate a global activation map. Subsequently, the image patches are input into a multimodal large language model (MLLM) with a preset prompt: "You are a senior power line inspection expert. Please carefully observe the image and determine if there are any insulator damages or bird nests / foreign objects...". The MLLM outputs semantic descriptors for the image patches, which are then reconstructed into detail activation maps. Finally, a complementary fusion strategy fuses the two types of activation maps to obtain the final static activation map.
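
The activation-map construction can be sketched as follows, assuming that patch features and text embeddings have already been extracted and share one embedding space. Random arrays stand in for the CLIP and MLLM outputs, and the complementary fusion is assumed to be an element-wise maximum; the text does not specify the fusion rule.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def activation_map(patch_feats, text_feats):
    """Cosine similarity between patches (N, D) and texts (C, D), scaled to [0, 1]."""
    sim = l2_normalize(text_feats) @ l2_normalize(patch_feats).T   # (C, N)
    lo = sim.min(axis=1, keepdims=True)
    hi = sim.max(axis=1, keepdims=True)
    return (sim - lo) / np.maximum(hi - lo, 1e-8)


def fuse_maps(global_map, detail_map):
    """Assumed complementary fusion: keep the stronger response per position."""
    return np.maximum(global_map, detail_map)


rng = np.random.default_rng(0)
patch_feats = rng.normal(size=(16, 8))       # 4x4 grid of patches, 8-dim features
class_text = rng.normal(size=(3, 8))         # stand-in for class-prompt embeddings
descriptor_text = rng.normal(size=(3, 8))    # stand-in for MLLM descriptor embeddings

global_map = activation_map(patch_feats, class_text)
detail_map = activation_map(patch_feats, descriptor_text)
static_map = fuse_maps(global_map, detail_map).reshape(3, 4, 4)
```

An element-wise maximum keeps any region that either cue activates strongly, which is one simple way to make the two maps complementary rather than averaging them.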

[0062] 3. Online dynamic optimization and segmentation training process. Step S1 (Dual-path feature extraction): The image is input into the first visual model (CLIP) to extract global contextual features, and into the second visual model (DINO) to extract local texture features, with a focus on details such as crack edges.

[0063] Step S2 (Cross-modal feature enhancement): The power industry dictionary (e.g., insulator damage) is converted into text features using the CLIP text encoder. After the text features are mapped by the lightweight module, their similarity to the DINO local features is calculated to obtain the semantic association matrix. The semantic association matrix is aggregated and then fused with the DINO local features to obtain enhanced visual features, which effectively suppress background interference.
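
A minimal sketch of this enhancement step is given below. It assumes the lightweight mapping module is a single linear projection (`proj`, hypothetical) from the text space into the visual feature space, that similarity means cosine similarity, and that fusion is channel-wise concatenation. The aggregation of the association matrix is left out because its form is not specified here.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def enhance_features(dino_feats, text_feats, proj):
    """Append text-similarity channels to dense visual features.

    dino_feats: (N, Dv) local features for N patches.
    text_feats: (K, Dt) embeddings of K dictionary terms.
    proj: (Dt, Dv) hypothetical linear mapping into the visual space.
    """
    mapped = text_feats @ proj                                   # (K, Dv)
    assoc = l2_normalize(dino_feats) @ l2_normalize(mapped).T    # (N, K)
    enhanced = np.concatenate([dino_feats, assoc], axis=1)       # (N, Dv + K)
    return enhanced, assoc


rng = np.random.default_rng(0)
dino_feats = rng.normal(size=(16, 8))
text_feats = rng.normal(size=(3, 6))
proj = rng.normal(size=(6, 8))
enhanced, assoc = enhance_features(dino_feats, text_feats, proj)
```

Each patch thereby carries one extra channel per dictionary term, so background patches with low similarity to every term are easier for later stages to down-weight.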

[0064] Step S3 (Transition matrix and segmentation map generation): Affinity maps are calculated from the features obtained in Step S1 and from the features obtained in Step S2, respectively. A transition matrix reflecting the semantic association between pixels is then constructed through doubly stochastic normalization and a random walk. Simultaneously, the dual-path features are input into a lightweight segmentation network, and an online segmentation probability map is obtained using the adaptive gating fusion module.
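
The transition-matrix construction can be sketched as follows, under several assumptions: affinities are non-negative cosine similarities, the two affinity maps are fused by a weighted average (`alpha`, hypothetical), doubly stochastic normalization is performed with Sinkhorn iterations, and the random walk is a matrix power with a hypothetical number of `steps`.

```python
import numpy as np


def cosine_affinity(feats, eps=1e-8):
    """Non-negative pairwise cosine similarity for (N, D) features."""
    f = feats / np.maximum(np.linalg.norm(feats, axis=1, keepdims=True), eps)
    return np.clip(f @ f.T, 0.0, None)


def sinkhorn(a, n_iters=50, eps=1e-8):
    """Alternate row and column normalization toward a doubly stochastic matrix."""
    m = a + eps
    for _ in range(n_iters):
        m = m / m.sum(axis=1, keepdims=True)
        m = m / m.sum(axis=0, keepdims=True)
    return m


def transition_matrix(feats_a, feats_b, alpha=0.5, steps=2):
    fused = alpha * cosine_affinity(feats_a) + (1.0 - alpha) * cosine_affinity(feats_b)
    ds = sinkhorn(fused)
    sym = 0.5 * (ds + ds.T)                    # symmetrize
    return np.linalg.matrix_power(sym, steps)  # multi-step random walk


rng = np.random.default_rng(0)
image_feats = rng.random((6, 4))      # stand-in for image features of 6 pixels
enhanced_feats = rng.random((6, 5))   # stand-in for enhanced visual features
trans = transition_matrix(image_feats, enhanced_feats)
```

Symmetrizing a doubly stochastic matrix and raising it to a power both preserve unit row and column sums, so the result can be used directly as a propagation operator.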

[0065] Step S4 (Closed-loop dynamic optimization): This step consists of forward correction and backward guidance. Forward correction uses the transition matrix to perform graph-smoothing propagation on the initial pseudo-labels and combines the result with the confidence weights of the segmentation probability map to obtain optimized pseudo-labels. Backward guidance uses the calculated inter-pixel affinity loss, which dynamically adjusts the weight distribution between the segmentation network and the transition matrix through gradient backpropagation.

[0066] Step S5 (Iteration): The above process is repeated over the training epochs until the loss function converges, yielding the trained lightweight image segmentation model.

[0067] 4. Visualization of experimental results. As shown in Figure 6, the model marks the segmented target areas with standard colors to distinguish the defect categories: the "instrument damage" area is marked in green, the "rusty metal" area is marked in green, the "bird's nest" area is marked in blue, the "ground oil stain" area is marked in blue, the "abnormal door closure" area is marked in yellow, and the "insulator damage" area is marked in yellow. The characteristic area of each defect type is identified and highlighted in a bright color, intuitively presenting the model's detection and visualization results for these six typical defects.

[0068] This specific embodiment significantly improves semantic segmentation performance in power plant inspection scenarios through multimodal collaborative pseudo-label generation, complementary dual-modal feature fusion, and a dynamic correction mechanism. It overcomes four major technical bottlenecks in weakly supervised power equipment segmentation: high annotation costs, persistent pseudo-label errors, ambiguous boundary localization, and difficulties in edge deployment. It thereby helps move power plant inspection from "manual spot checks" to "fully automated real-time diagnosis." Large-scale application can significantly reduce annual power grid operation and maintenance costs, decrease downtime due to faults, and provide core technical support for smart grid construction.

[0069] Embodiment 2. This embodiment provides a multimodal driven and dynamically optimized power plant inspection image segmentation system, including: an initial pseudo-label building unit, configured to acquire inspection sample images, use a pre-trained visual language model and a multimodal large language model to obtain image features and text semantic embeddings, and generate initial pseudo-labels; a feature processing unit, configured to input the inspection sample images into the visual model to obtain dense features, and to perform cross-modal alignment between the dense features and the text semantic embeddings to obtain enhanced visual features; a probability map acquisition unit, configured to construct a transition matrix reflecting the spatial correlation between pixels based on the image features and the enhanced visual features, and to input the image features and the enhanced visual features into the decoder to obtain dual-path prediction masks, which are then fused to generate a segmentation probability map; a supervisory unit, configured to optimize the initial pseudo-labels based on the segmentation probability map and the transition matrix to obtain the final pseudo-labels, to use the final pseudo-labels to construct a supervision signal that guides the generation of the transition matrix, and to perform iterative optimization until convergence to obtain the trained image segmentation network; and a segmentation unit, configured to input the inspection image to be segmented into the trained image segmentation network to obtain the segmentation result.

[0070] Embodiment 3. This embodiment provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps in the multimodal driven and dynamically optimized power plant inspection image segmentation method described in Embodiment 1 above.

[0071] Embodiment 4. This embodiment provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, it implements the steps in the multimodal driven and dynamically optimized power plant inspection image segmentation method described in Embodiment 1 above.

[0072] The steps or modules involved in Embodiments 2 to 4 above correspond to those in Embodiment 1. For specific implementation details, please refer to the relevant description section of Embodiment 1. The term "computer-readable storage medium" should be understood as a single medium or multiple media including one or more instruction sets; it should also be understood as including any medium capable of storing, encoding, or carrying an instruction set for execution by a processor and enabling the processor to perform any of the methods in this invention.

[0073] The above description is merely a preferred embodiment of the present invention and is not intended to limit the invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A multimodal driven and dynamically optimized power plant inspection image segmentation method, characterized in that the method comprises: obtaining inspection sample images, and using a pre-trained visual language model and a multimodal large language model to obtain image features and text semantic embeddings and to generate initial pseudo-labels; inputting the inspection sample images into a visual model to obtain dense features, and aligning the dense features with the text semantic embeddings across modalities to obtain enhanced visual features; constructing a transition matrix reflecting the spatial correlation between pixels based on the image features and the enhanced visual features; inputting the image features and the enhanced visual features into a decoder to obtain dual-path prediction masks, and fusing them to generate a segmentation probability map; optimizing the initial pseudo-labels based on the segmentation probability map and the transition matrix to obtain final pseudo-labels; using the final pseudo-labels to construct a supervision signal that guides the generation of the transition matrix in reverse; performing iterative optimization until convergence to obtain a trained image segmentation network; and inputting the inspection image to be segmented into the trained image segmentation network to obtain the segmentation result.

2. The multimodal driven and dynamically optimized power plant inspection image segmentation method as described in claim 1, characterized in that, in the step of obtaining inspection sample images and generating initial pseudo-labels using a pre-trained visual language model and a multimodal large language model, the pre-trained visual language model includes a CLIP image encoder and a CLIP text encoder, and the step specifically includes: obtaining the inspection sample images and their corresponding image-level category labels; inputting the inspection sample images into the CLIP image encoder to extract image features, inputting the image-level category labels into the CLIP text encoder to extract text features, calculating the cosine similarity between the image features and the text features, and generating a global class activation map; inputting the inspection sample images into the multimodal large language model to generate detailed text descriptions of the images, encoding the detailed text descriptions using the CLIP text encoder to obtain text semantic embeddings, calculating the similarity between the image features and the text semantic embeddings, and generating a patch-level class activation map; and merging the global class activation map and the patch-level class activation map to obtain a static activation map, which serves as the initial pseudo-label.

3. The multimodal driven and dynamically optimized power plant inspection image segmentation method as described in claim 1, characterized in that the step of inputting the inspection sample images into a visual model to obtain dense features and aligning the dense features with the text semantic embeddings across modalities to obtain enhanced visual features specifically includes: inputting the inspection sample images into the visual model and obtaining dense features based on the attention mechanism; calculating alignment matrices based on the dense features and the text semantic embeddings, and aggregating them into a semantic similarity matrix; and concatenating the dense features and the semantic similarity matrix to obtain the enhanced visual features.

4. The multimodal driven and dynamically optimized power plant inspection image segmentation method as described in claim 1, characterized in that the step of constructing the transition matrix reflecting the spatial correlation between pixels based on the image features and the enhanced visual features specifically includes: calculating affinity matrices based on the image features and the enhanced visual features respectively, and fusing them into a final affinity map; normalizing the final affinity map to obtain a doubly stochastic matrix; and symmetrizing the doubly stochastic matrix to obtain the transition matrix.

5. The multimodal driven and dynamically optimized power plant inspection image segmentation method as described in claim 1, characterized in that the fusion to generate the segmentation probability map specifically includes: performing adaptive gated fusion on the dual-path prediction masks, in which a spatial weight map is generated through a convolutional network; and dynamically adjusting the contribution weights of the dual-path prediction masks using the spatial weight map to obtain the segmentation probability map.

6. The multimodal driven and dynamically optimized power plant inspection image segmentation method as described in claim 2, characterized in that the step of optimizing the initial pseudo-label based on the segmentation probability map and the transition matrix to obtain the final pseudo-label specifically includes: multiplying the initial pseudo-label element-wise with the segmentation probability map, and multiplying the patch-level class activation map element-wise with the transition matrix; and using the product of the two element-wise multiplications as the final pseudo-label.

7. The multimodal driven and dynamically optimized power plant inspection image segmentation method as described in claim 4, characterized in that the step of using the final pseudo-label to construct a supervision signal to guide the generation of the transition matrix specifically includes: minimizing the cross-entropy loss between the final affinity map and the supervision signal, so that the network is forced to learn spatial associations that conform to semantic logic, thereby guiding the generation of the transition matrix in the next iteration.

8. A multimodal driven and dynamically optimized power plant inspection image segmentation system, characterized in that the system comprises: an initial pseudo-label building unit, configured to acquire inspection sample images, use a pre-trained visual language model and a multimodal large language model to obtain image features and text semantic embeddings, and generate initial pseudo-labels; a feature processing unit, configured to input the inspection sample images into the visual model to obtain dense features, and to perform cross-modal alignment between the dense features and the text semantic embeddings to obtain enhanced visual features; a probability map acquisition unit, configured to construct a transition matrix reflecting the spatial correlation between pixels based on the image features and the enhanced visual features, and to input the image features and the enhanced visual features into the decoder to obtain dual-path prediction masks, which are then fused to generate a segmentation probability map; a supervisory unit, configured to optimize the initial pseudo-labels based on the segmentation probability map and the transition matrix to obtain the final pseudo-labels, to use the final pseudo-labels to construct a supervision signal that guides the generation of the transition matrix in reverse, and to perform iterative optimization until convergence to obtain the trained image segmentation network; and a segmentation unit, configured to input the inspection image to be segmented into the trained image segmentation network to obtain the segmentation result.

9. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, it implements the steps in the multimodal driven and dynamically optimized power plant inspection image segmentation method as described in any one of claims 1-7.

10. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the steps in the multimodal driven and dynamically optimized power plant inspection image segmentation method as described in any one of claims 1-7.