Open-vocabulary substation equipment segmentation and inspection system based on multi-modal prompt learning

Using a multimodal prompt learning system, a visual-language orthogonally coupled projection layer is constructed from structured text and image anchors. Combined with a bidirectional gradient synchronous update mechanism, this addresses feature shift and excessive memory usage under small-sample conditions in substation inspection, enabling accurate identification of equipment defects and cross-scenario knowledge reuse.

CN122156832B · Active · Publication Date: 2026-06-30 · SHENZHEN LAIDA SIWEI INFORMATION TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: SHENZHEN LAIDA SIWEI INFORMATION TECH CO LTD
Filing Date: 2026-05-09
Publication Date: 2026-06-30

IPC: G06V10/764; G06V10/26; G06V10/82; G06V10/80; G06N5/04

AI Tagging

Technology Topics

Data pack Visual technology

AI Technical Summary

Technical Problem

In substation inspections, existing visual models are prone to degradation of open vocabulary generalization ability and feature shift under small sample conditions, making it difficult to accurately identify equipment defects. Furthermore, global parameter fine-tuning leads to excessive memory usage, making it difficult to support cross-scenario knowledge reuse.

Method used

The multimodal cue learning system utilizes structured text and sample images as multimodal anchors to construct a visual-language orthogonally coupled projection layer. Combined with a bidirectional gradient synchronous update mechanism, it suppresses interference from complex background noise, achieves multi-scale feature fitting, and generates cross-modal attention heatmaps, supporting the serialization and reuse of target cue parameters.

Benefits of technology

Without altering the large-scale backbone network, the accuracy and dynamic adaptability of substation equipment image segmentation were improved, the memory usage of edge nodes was reduced, and accurate identification of equipment anomalies and generation of inspection reports were achieved.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention relates to the fields of intelligent inspection of power equipment and computer vision technology, specifically to an open-vocabulary substation equipment segmentation and inspection system based on multimodal cue learning. The system includes: an anchoring modeling module, which acquires structured text of the target scene as an initial text prefix input to a text encoder and acquires sample images as visual anchors; a collaborative adaptation module, which acquires a real-time image stream, extracts multi-scale visual features through a visual encoder, and inputs them into a visual-language orthogonal coupling projection layer to obtain target cue parameters; a fitting segmentation module, which calculates the inner product of the visual cue feature tensor and the text cue feature tensor to generate a cross-modal attention heatmap and outputs a pixel-level segmentation mask; and a closed-loop deployment module, which generates state recognition results based on the pixel-level segmentation mask and serializes and stores the target cue parameters as a parameter data package with attribute labels. This invention can achieve open-vocabulary segmentation under small sample conditions and reduce the risk of catastrophic forgetting.

Description

Technical Field

[0001] This invention relates to the fields of intelligent inspection of power equipment and computer vision technology, specifically to an open-vocabulary substation equipment segmentation and inspection system based on multimodal cue learning.

Background Technology

[0002] In the current substation inspection environment, real physical defect samples are extremely scarce, and high-frequency visual interference textures, such as reflections from insulators and diffuse reflections from metal shells, are common. Existing solutions based on global parameter fine-tuning or simple cross-modal concatenation of large-scale visual models tend to degrade the model's original open-vocabulary generalization ability when fine-tuned on the small samples available in the substation domain. At the same time, simple modal concatenation leaves the model highly susceptible to the complex background noise described above, resulting in feature shift and a mismatch between visual features and stable physical defect semantics. In addition, global parameter fine-tuning consumes excessive memory on edge nodes, making it difficult to support accurate defect localization and cross-scene knowledge reuse under small-sample conditions.

[0003] Therefore, how to suppress multimodal feature shifts and improve the accuracy and dynamic adaptability of substation equipment image segmentation without altering the large-scale backbone network has become an urgent technical problem to be solved.

Summary of the Invention

[0004] To address the aforementioned technical problems, this invention provides an open vocabulary substation equipment segmentation and inspection system based on multimodal cue learning. Specifically, the core concept of this invention lies in:

[0005] The system acquires structured text containing device entities, component entities, and physical states, and combines it with sample images as multimodal anchors to construct a visual-language orthogonal coupling projection layer while keeping the large-scale visual and text encoders frozen.

[0006] The system utilizes a bidirectional gradient synchronous update mechanism to fine-tune only a very small number of prompt parameters in an independent tensor space, thereby suppressing complex background noise and pseudo-anomaly interference while achieving accurate fitting of multi-scale visual features and text prompt features.

[0007] The system achieves pixel-level segmentation of defect regions by generating cross-modal attention heatmaps and supports the serialized closed-loop reuse of target cue parameters.

[0008] Furthermore, by constructing four core modules (anchoring modeling, collaborative adaptation, fitting segmentation, and closed-loop deployment), this invention implements a complete process: from the initial anchoring of structured text and sample images, through multi-scale feature extraction and bidirectional gradient synchronous fine-tuning on real-time image streams, to the generation of cross-modal attention heatmaps and mask decoding, and finally to the identification of abnormal states of inspected equipment, the generation of inspection reports, and the serialization and encapsulation of target prompt parameters.

[0009] Building on this foundation, the system also introduces a boundary filtering mechanism to intercept interfering image data that carries no physical semantics, and supports calling parameter data packages directly in memory according to scene attribute matching against a new scene's image stream, thereby achieving efficient zero-shot segmentation inference on edge computing nodes.

Attached Figure Description

[0010] The present invention will be further explained below with reference to the accompanying drawings and embodiments:

[0011] Figure 1 is a schematic diagram of the modules of the open vocabulary substation equipment segmentation and inspection system based on multimodal cue learning provided in the embodiments of this application.

Detailed Implementation

[0012] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to specific embodiments.

[0013] An open-vocabulary substation equipment segmentation and inspection system based on multimodal prompt learning, the system including:

[0014] The image acquisition module is used to continuously acquire on-site image data of the equipment in the target scene;

[0015] The anchoring modeling module is used to obtain structured text of the target scene. The structured text is composed of triple data containing device entities, component entities and physical states. The structured text is used as the initial text prefix input to a text encoder with preset parameters, and sample images are obtained as visual anchors.

[0016] The collaborative adaptation module, connected to the anchoring modeling module, is used to acquire image data as a real-time image stream through a data interface, extract multi-scale visual features from the real-time image stream through a visual encoder with preset parameters, and input the multi-scale visual features into a preset visual language orthogonal coupling projection layer.

[0017] The collaborative adaptation module is also used to fine-tune the parameters of the initial text prefix in the preset independent tensor space based on the bidirectional gradient synchronous update mechanism in the visual-language orthogonal coupling projection layer to obtain the target prompt parameters, and output the visual prompt feature tensor and the text prompt feature tensor.

[0018] The bidirectional gradient synchronous update mechanism constructs visual cue branches and text cue branches in the visual-language orthogonal coupling projection layer. It injects preset vectors into the visual cue branches and text cue branches respectively to generate visual cue branch perturbations and text cue branch perturbations as local input directions. It records the visual cue responses generated by multi-scale visual features to visual cue branch perturbations and the first-order changes of the text cue responses generated by the initial text prefix to text cue branch perturbations on the local input directions. It extracts the visual cue gradient vectors and text cue gradient vectors corresponding to multi-scale visual features and the initial text prefix, and forms a square matrix based on the first-order changes to describe the mutual sensitivity of the two branches. It calculates the Jacobian matrix determinant of the square matrix and compares the value of the Jacobian matrix determinant with the preset resonance threshold obtained based on the features of historical real defect samples. It adjusts the model's preset loss function according to the comparison results to constrain the update directions of the two to keep them synchronized.

[0019] The fitting segmentation module, connected to the co-adaptation module, is used to calculate the inner product of the visual cue feature tensor and the text cue feature tensor to generate a cross-modal attention heatmap, and input it into a preset mask decoder and combine it with multi-scale visual features to output a pixel-level segmentation mask.

[0020] The closed-loop deployment module, connected to the fitting segmentation module, is used to generate state recognition results based on pixel-level segmentation masks and output them to an external monitoring terminal, and serialize and store the target prompt parameters as a parameter data package with attribute labels.
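The serialization and reuse of target prompt parameters described for the closed-loop deployment module can be illustrated with a minimal sketch. It assumes the parameters fit in a nested list and that attribute labels are simple key-value pairs; the JSON layout, the label names (`voltage`, `weather`, `period`), and both function names are hypothetical and do not come from the patent.

```python
import json

def pack_prompt_parameters(params, labels):
    """Serialize prompt parameters together with their attribute labels."""
    return json.dumps({"labels": labels, "params": params}, sort_keys=True)

def find_matching_package(packages, scene_labels):
    """Return the parameters of the first package whose labels all match the scene."""
    for blob in packages:
        package = json.loads(blob)
        if all(scene_labels.get(key) == value
               for key, value in package["labels"].items()):
            return package["params"]
    return None  # no stored package fits this scene

# Hypothetical package stored after an evening, post-rain inspection.
stored = [pack_prompt_parameters(
    params=[[0.12, -0.03], [0.07, 0.41]],
    labels={"voltage": "220kV", "weather": "post-rain", "period": "evening"},
)]

# A new scene with matching attributes reuses the stored parameters directly.
reused = find_matching_package(
    stored,
    {"voltage": "220kV", "weather": "post-rain", "period": "evening",
     "area": "main transformer"},
)
```

A real deployment would more likely store tensors in a binary format; JSON is used here only to keep the sketch self-contained.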

[0021] This embodiment provides an open-vocabulary substation equipment segmentation and inspection mechanism based on multimodal cue learning, as shown in Figure 1. Specifically, this embodiment takes the evening inspection period after rainfall at a 220kV substation as the main scenario, in which a drone inspects the main transformer area, insulator string area, and circuit breaker area for defects. The system is deployed between the edge computing node and the monitoring terminal in the substation: the edge computing node is responsible for acquiring image streams, performing prompt parameter fine-tuning and segmentation inference, and the monitoring terminal is responsible for receiving equipment status results and inspection reports.

[0022] In practice, the anchoring modeling module receives structured text of the target scene. Unlike ordinary natural language, the structured text is an industrial semantic description oriented towards inspection tasks; for example, it may contain relationships among equipment category, component location, and physical state. The rationale is that defects in a substation are not isolated local pixel clusters, but physical states attached to a specific carrier.

[0023] The system uses the structured text as the initial text prefix and feeds it into the text encoder with fixed parameters. At the same time, it retrieves a preset number of sample images from the historical multimodal database as visual anchors. The configuration is frozen so that the existing parameters of the backbone encoder are not changed, and only an adaptable cue space is established around it to avoid degradation of the original generalization ability due to industrial small sample fine-tuning.

[0024] During the real-time inspection phase, the collaborative adaptation module continuously receives video frames from the drone gimbal camera through the data interface; the visual encoder extracts multi-scale visual features from the real-time image stream. The multi-scale visual features include shallow features and deep features. Shallow features are more likely to reflect edges, cracks, stains and reflective details, while deep features are more likely to reflect equipment type, component outline and spatial layout.

[0025] Multi-scale visual features are introduced into the visual-language orthogonally coupled projection layer. This projection layer does not simply concatenate image features with text features, but establishes a synchronous update channel between the visual and text sides in an independent tensor space. Orthogonal coupling means that in the initial state of the projection layer, the column vectors of the visual cue projection matrix and the text cue projection matrix are orthogonalized through orthogonal initialization to maximize the preservation of the information independence of cross-modal features in the early stage of mapping and avoid initial semantic redundancy and overlap.
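One way to read the orthogonal initialization described above is that the columns of the two projection matrices start out mutually orthonormal. The following minimal sketch makes that assumption explicit: it draws Gaussian vectors, orthonormalizes them with Gram-Schmidt, and splits the result between a visual and a text branch. The function names, dimensions, and seed are hypothetical, and the patent does not state which orthogonalization procedure is used.

```python
import random

def orthonormal_columns(dim, count, seed=0):
    """Draw `count` Gaussian vectors in R^dim and orthonormalize them (Gram-Schmidt)."""
    if count > dim:
        raise ValueError("cannot have more orthogonal columns than dimensions")
    rng = random.Random(seed)
    basis = []
    while len(basis) < count:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for b in basis:
            # Remove the component of v along each column already accepted.
            proj = sum(x * y for x, y in zip(v, b))
            v = [x - proj * y for x, y in zip(v, b)]
        norm = sum(x * x for x in v) ** 0.5
        if norm > 1e-8:  # skip a (very unlikely) degenerate draw
            basis.append([x / norm for x in v])
    return basis

def init_coupled_projections(dim, n_visual, n_text, seed=0):
    """Split one orthonormal set into visual-branch and text-branch columns."""
    cols = orthonormal_columns(dim, n_visual + n_text, seed)
    return cols[:n_visual], cols[n_visual:]

visual_cols, text_cols = init_coupled_projections(dim=8, n_visual=3, n_text=3)
```

Because every visual column is orthogonal to every text column at initialization, neither branch starts out duplicating directions already used by the other, which matches the stated goal of avoiding initial semantic overlap.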

[0026] The reason is that substation scenes often contain high-frequency textures such as reflections from porcelain insulators, diffuse reflections from oil stains, and strong highlights on metal shells. These textures generate update trends on the visual side with gradient values greater than the preset gradient threshold, whereas words in the text such as cracks and oil leaks represent stable physical semantics. If the two sides are adjusted independently and in opposite directions, the visual side is easily pulled by noise while the text side stays at abstract semantics, causing a mismatch between them.

[0027] To this end, the system extracts the gradient vectors of visual cues and text cues, and uses the determinant of the Jacobian matrix formed by their coupling relationship to characterize their synchronization in the joint space. When the two types of updates support each other on the same physical defect, it indicates that the text's interpretation of the state in the image is stable. When the deviation between the two types of update directions is greater than a preset deviation angle threshold, it often means that the system misidentifies reflections, shadows, or background textures as defect features. In this case, the loss function needs to be adjusted to bring the update directions closer together. When the deviation between the two types of update directions is less than or equal to the preset deviation angle threshold, the update directions are determined to be consistent, and parameter fine-tuning continues according to the current update direction.
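The deviation-angle comparison in this paragraph can be sketched as follows, assuming both gradient vectors have already been mapped to the same dimension. The 30-degree default and the function names are hypothetical placeholders; the patent only refers to a preset deviation angle threshold.

```python
import math

def deviation_angle_deg(grad_visual, grad_text):
    """Angle in degrees between the visual and text cue gradient vectors."""
    dot = sum(a * b for a, b in zip(grad_visual, grad_text))
    norm_v = math.sqrt(sum(a * a for a in grad_visual))
    norm_t = math.sqrt(sum(b * b for b in grad_text))
    if norm_v == 0.0 or norm_t == 0.0:
        raise ValueError("a zero gradient has no direction")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos = max(-1.0, min(1.0, dot / (norm_v * norm_t)))
    return math.degrees(math.acos(cos))

def needs_loss_adjustment(grad_visual, grad_text, max_angle_deg=30.0):
    """True when the update directions deviate by more than the preset angle."""
    return deviation_angle_deg(grad_visual, grad_text) > max_angle_deg
```

For example, `needs_loss_adjustment([1.0, 0.0], [0.0, 1.0])` flags perpendicular updates, while nearly parallel gradients pass and fine-tuning continues along the current direction.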

[0028] Furthermore, to avoid directly interpreting the two gradient vectors as matrix objects with quantifiable determinants, the Jacobian matrix determinant in this embodiment refers to the determinant of the locally coupled Jacobian matrix constructed within the visual-language orthogonal coupling projection layer. In specific processing, the system first maps the visual cue gradient vector and the text cue gradient vector to independent tensor spaces of the same dimension and performs scale normalization. Taking the visual cue branch perturbation and the text cue branch perturbation as two local input directions, the system records the first-order changes of the visual cue response and the text cue response with respect to these two directions. The system extracts the visual cue gradient vector and the text cue gradient vector corresponding to the multi-scale visual features and the initial text prefix, and forms a square matrix based on the first-order changes to describe the mutual sensitivity of the two branches.

[0029] The preset vector is a random noise vector that follows a Gaussian distribution and is consistent with the input dimension of the corresponding branch, or a random orthogonal direction vector calculated based on the forward propagation features; the preset vector is injected into the projection layer to evaluate the local sensitivity of the model to different modal input directions at the current parameter point.

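[0030] Specifically, let the visual cue branch perturbation be $\delta_v$ and the text cue branch perturbation be $\delta_t$, with the corresponding visual cue response $r_v$ and text cue response $r_t$. The square matrix describing the mutual sensitivity of the two branches is the local coupling Jacobian matrix $J$, represented as:

[0031] $$J = \begin{pmatrix} \dfrac{\partial r_v}{\partial \delta_v} & \dfrac{\partial r_v}{\partial \delta_t} \\ \dfrac{\partial r_t}{\partial \delta_v} & \dfrac{\partial r_t}{\partial \delta_t} \end{pmatrix}$$

[0032] Calculate the Jacobian matrix determinant of this square matrix. The formula is:

[0033] $$\det J = \frac{\partial r_v}{\partial \delta_v}\,\frac{\partial r_t}{\partial \delta_t} - \frac{\partial r_v}{\partial \delta_t}\,\frac{\partial r_t}{\partial \delta_v}$$

[0034] where $\partial r_v / \partial \delta_v$ represents the first-order partial derivative of the visual cue response with respect to the visual cue branch perturbation, $\partial r_t / \partial \delta_t$ represents the first-order partial derivative of the text cue response with respect to the text cue branch perturbation, and $\partial r_v / \partial \delta_t$ and $\partial r_t / \partial \delta_v$ are the cross terms, which represent the sensitivity to mutual perturbations between modalities.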

[0035] The determinant value of this matrix is used to characterize the local area change and orientation twist of the joint update space, rather than to evaluate the magnitude of a single gradient vector. If the value is greater than or equal to the preset resonance threshold, it indicates that the mutual sensitivity between the visual and text sides has a skew deviation that exceeds the preset skew range. If the value is lower than the preset resonance threshold, it indicates that the updates on both sides maintain a synchronous relationship that meets the preset synchronization conditions within the current local range. The preset resonance threshold is pre-calculated and calibrated based on the distribution of the determinant of the benchmark Jacobian matrix generated in the projection layer based on the features of historically verified real defect samples. With this setting, the calculation object, meaning, and subsequent threshold comparison logic of the Jacobian matrix determinant are all limited to the joint cue space of the coupled projection layer.
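The determinant test can be illustrated with a minimal numerical sketch. It assumes each branch response is a scalar function of two scalar step sizes, `dv` and `dt`, taken along the preset visual and text perturbation directions, and estimates the four first-order changes by central finite differences. The toy response functions, the step size, and all names are hypothetical; a real system would obtain these derivatives from the projection layer itself.

```python
def local_coupling_jacobian(respond_visual, respond_text, eps=1e-4):
    """Estimate the 2x2 matrix of first-order changes around zero perturbation."""
    def partial(f, axis):
        if axis == 0:  # derivative with respect to the visual perturbation dv
            return (f(eps, 0.0) - f(-eps, 0.0)) / (2 * eps)
        return (f(0.0, eps) - f(0.0, -eps)) / (2 * eps)  # with respect to dt
    return [[partial(respond_visual, 0), partial(respond_visual, 1)],
            [partial(respond_text, 0), partial(respond_text, 1)]]

def determinant_2x2(j):
    """Product of the diagonal terms minus product of the cross terms."""
    return j[0][0] * j[1][1] - j[0][1] * j[1][0]

def is_synchronized(det_value, resonance_threshold):
    """Below the threshold counts as synchronized; at or above it counts as skewed."""
    return det_value < resonance_threshold

# Toy linear responses standing in for the two cue branches.
def toy_visual(dv, dt):
    return 2.0 * dv + 0.5 * dt

def toy_text(dv, dt):
    return 0.25 * dv + 1.0 * dt

jacobian = local_coupling_jacobian(toy_visual, toy_text)
det_j = determinant_2x2(jacobian)  # analytically 2.0 * 1.0 - 0.5 * 0.25
```

The comparison direction in `is_synchronized` follows the paragraph above; the threshold itself would be calibrated from verified historical defect samples rather than chosen by hand.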

[0036] After completing the above synchronization constraints, the collaborative adaptation module outputs visual cue feature tensors and text cue feature tensors; the fitting segmentation module calculates the inner product of the two to form a cross-modal attention heatmap; the heatmap is used to characterize the spatial response distribution of the physical state described in the text in the image space; for example, if the text description is insulator-skirt surface-fracture, the heatmap will preferentially enhance the areas corresponding to the fracture of the skirt surface and the edge breakage, and suppress the background of the iron tower, the reflection of the conductor and the sky area;
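The inner-product heatmap can be sketched in a few lines, assuming the visual cue feature tensor is laid out as height x width x channels and the text cue feature is a single vector with the same channel dimension. The tensor layout, toy values, and function name are assumptions made for illustration.

```python
def attention_heatmap(visual_features, text_feature):
    """Per-pixel inner product between visual features and one text feature.

    visual_features: nested lists shaped H x W x D
    text_feature:    list of length D
    """
    return [[sum(v * t for v, t in zip(pixel, text_feature)) for pixel in row]
            for row in visual_features]

# 2 x 2 image with 2-channel features; the text feature favours channel 0.
visual = [[[1.0, 0.0], [0.0, 1.0]],
          [[0.5, 0.5], [0.0, 0.0]]]
text = [1.0, 0.0]
heat = attention_heatmap(visual, text)
```

Pixels whose features align with the text feature receive the largest response, which is the behaviour the paragraph describes for fracture regions versus background.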

[0037] Then, the mask decoder combines the heatmap with multi-scale visual features to output a pixel-level segmentation mask; the closed-loop deployment module then generates a state recognition result based on the mask, and packages and stores the optimized target prompt parameters for quick use in subsequent similar scenarios;

[0038] Under abnormal operating conditions, if the real-time image stream exhibits blurring exceeding the preset blur threshold, occlusion, or frame breaks, causing the visual encoder to be unable to stably extract multi-scale features, the system will maintain the backbone model frozen, only outputting low-confidence markers and requesting image re-acquisition, without forcibly updating the prompt parameters, to prevent erroneous samples from damaging the obtained industrial semantic mapping; if the structured text received on the text side lacks key equipment entities, the system will revert to the previous available prompt parameter version and prompt the monitoring terminal to only perform general equipment contour segmentation, not state segmentation; if edge computing node resources are insufficient, the system will prioritize preserving the prompt space update and mask decoding process, suspending non-critical log storage, thereby ensuring the continuity of online inspection;
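The fallback rules for abnormal operating conditions can be summarized as a small decision function. This is only a sketch of the rules as stated above; the action names, the argument names, and the choice to return a list of strings are hypothetical.

```python
def plan_degraded_actions(blur_score, blur_threshold, occluded, frame_break,
                          text_has_equipment_entity, resources_sufficient):
    """Map abnormal conditions to the fallback actions described in the text."""
    actions = []
    if blur_score > blur_threshold or occluded or frame_break:
        # Unstable visual input: never update prompt parameters on bad frames.
        actions += ["mark_low_confidence", "request_reacquisition",
                    "skip_prompt_update"]
    if not text_has_equipment_entity:
        # Incomplete structured text: fall back to contour segmentation only.
        actions += ["rollback_prompt_version", "contour_segmentation_only"]
    if not resources_sufficient:
        # Keep prompt updates and mask decoding; drop non-critical logging.
        actions.append("suspend_noncritical_logging")
    return actions
```

An empty list means the frame can proceed through normal fine-tuning and segmentation.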

[0039] As a specific application scenario, in the aforementioned 220kV substation scenario, the drone flew to the main transformer area and captured dark, irregular deposits in the area below the oil conservator; it then flew to the insulator string area and captured a set of skirts with thin, elongated cracks on their edges. The system read the structured text pre-compiled by the shift team, which included semantic items such as main transformer-flange connection-oil leakage and insulator-skirt surface-crack. The parameter-fixed text encoder kept the general semantic space unchanged, while the visual encoder continuously extracted the overall outline of the equipment and defect details from the image.

[0040] The coupled projection layer found that although the gradient of some highlighted areas was greater than the preset gradient threshold on the visual side, it was not synchronized with the update direction of the oil leak semantics, so it was suppressed; while the oil stain edges and crack sections were consistent with the update trend of the corresponding text semantics, so they were further enhanced; finally, the monitoring terminal received two types of segmentation results: one was the oil stain expansion area at the main transformer flange, and the other was the local crack area of the insulator skirt, and the parameter data packet corresponding to the current scene was stored simultaneously.

[0041] The purpose of this step is to inject the coupling relationship between equipment carriers and defect states in substations into an independent prompt space without modifying the large-scale visual and text backbone models. This will enable open vocabulary segmentation under small sample conditions, reduce the risk of degradation of the original open vocabulary capabilities, and provide a serializable knowledge carrier for subsequent cross-scenario reuse.

[0042] Furthermore, the anchoring modeling module includes: a text parsing unit, used to parse the preset inspection procedure into triplet data containing equipment entities, component entities, and physical states, and to concatenate the triplet data into structured text; a prefix injection unit, used to convert the structured text into word vectors, and to insert the word vectors as initial text prefixes into the preset deep network layer of the parameter-fixed text encoder; and an anchor point extraction unit, used to extract sample images from the multimodal database, and to use the sample images as visual anchors to initialize the weight matrix of the visual language orthogonal coupling projection layer.

[0043] This embodiment provides an anchoring modeling mechanism; specifically, based on the above-mentioned main scenario of evening inspection of 220kV substation, before the formal online inspection is carried out, the system first organizes the in-station inspection procedures, defect dictionary and historical image data into initial anchoring information that can be processed by the model.

[0044] In practice, using only the free text description in the previous embodiment may still lead to semantic feature deviation. For example, the duty officer may record that the bushing is abnormal or the insulator is suspected of being damaged. Although such expressions are in line with spoken language, they do not clearly indicate which component the defect is attached to or what physical state it presents. In order to avoid semantic ambiguity in the text causing deviation in prompt features, this embodiment sets up a text parsing unit to parse the inspection procedure into triplet data.

[0045] In a specific data processing example, the phenomenon of insulator skirt cracking can be organized into insulator-skirt-cracking, and the phenomenon of main transformer flange oil leakage can be organized into main transformer-flange connection-oil leakage; multiple triplets can be sequentially spliced into structured text fragments, so that the text encoder receives not a single tag word, but a semantic chain with industrial constraint relationships.
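The organization of inspection items into triplets and their splicing into structured text can be sketched as follows. The `Triplet` type, the hyphen within a triplet, and the `"; "` separator between triplets are assumptions for illustration; the patent does not specify the concatenation format.

```python
from typing import NamedTuple

class Triplet(NamedTuple):
    equipment: str
    component: str
    state: str

def to_structured_text(triplets, sep="; "):
    """Splice triplets, in order, into one structured text fragment."""
    return sep.join(f"{t.equipment}-{t.component}-{t.state}" for t in triplets)

triplets = [
    Triplet("insulator", "skirt", "cracking"),
    Triplet("main transformer", "flange connection", "oil leakage"),
]
structured_text = to_structured_text(triplets)
```

Keeping the three fields separate until the final join makes it easy to detect entries with a missing component before they are used as anchoring semantics.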

[0046] Furthermore, after the prefix injection unit transforms the structured text into word vectors, it does not simply append them to the beginning of the input, but injects them into the preset depth network layer of the text encoder. The reason is that shallow layers are more biased towards the surface form of words, while deep layers are more conducive to maintaining the relationship structure of device-component-state. By inserting the initial text prefix into the preset depth layer, state description words such as rupture no longer exist independently of specific components, but together with the insulator skirt, they form a set of constrained industrial semantic context.

[0047] The sample images extracted by the anchor point extraction unit are not required to cover all defect morphologies, but should include, as far as possible, the main body of the device, component boundaries, and several representative appearance states in the target scenario.

[0048] To illustrate with a specific data example, if three images of insulator cracks, four images of main transformer oil contamination, and two images of circuit breaker casing contamination are selected from the database, these images can be used to initialize the coupled projection layer. This allows the coupled projection layer to form distinguishing weights for relevant textures during the initialization phase. Textures with a higher correlation to equipment status receive a higher initial response, while textures with a higher correlation to background noise receive a lower initial response. This initialization is not for training a complete segmentation model, but rather to provide a more stable starting point for subsequent small-sample cue updates.

[0049] Under abnormal operating conditions, if the inspection procedure text is missing component entities, such as only having equipment abnormality without information on components such as bushings, skirts, and flange connections, the text parsing unit will attach a low-structure tag to the entry and will not use it as a priority anchoring semantic. If the sample images extracted from the database have annotation times exceeding a preset time span threshold, viewing angle deviations exceeding a preset deviation threshold, or image quality parameters lower than a preset quality threshold, the anchor point extraction unit will reduce its initialization weight and, if necessary, retain only text anchoring without enabling the corresponding visual anchor. If the structured text and visual anchor are significantly inconsistent in terms of equipment category, for example, the text is "insulator - broken," but the image is "cable trench water accumulation," the system will refuse to initialize the correspondence to prevent incorrect anchoring from entering subsequent processes.

[0050] As a specific application scenario, during the inspection preparation phase of this substation, the maintenance personnel imported the monthly inspection procedures, which included items such as oil leakage inspection of the main transformer flange connection and edge chipping inspection of the suspension insulator skirt; the text parsing unit organized them into several triplets and spliced them into a structured text sequence.

[0051] Meanwhile, the system retrieves representative sample images of nighttime and post-rain scenes from the past three months of the historical database to establish visual anchor points. Because surface reflection is stronger after rain, the system prioritizes images taken under the same weather conditions as the initialization basis. As a result, during subsequent drone data collection, the system exhibits higher spatial sensitivity to oil stain diffusion contours and insulator skirt crack boundaries than to background reflection.

[0052] The purpose of this step is to transform the original inspection experience into calculable, constrainable, and reusable industrial semantic anchors and visual anchors, thereby achieving active initialization of cue learning and reducing random feature shifts under small sample conditions.

[0053] Furthermore, the collaborative adaptation module includes: a feature extraction unit, used to perform convolution processing on the real-time image stream through a visual encoder to generate multi-scale visual features, and to extract the overall outline of the inspected equipment based on the multi-scale visual features;

[0054] The residual mapping unit is used to extract the visual feature residual between multi-scale visual features and the baseline visual features, which are extracted from the visual anchor points. The visual feature residual is dynamically mapped to an independent tensor space to update the internal state of the target cueing parameters.

[0055] The gradient resonance unit is used to trigger the bidirectional gradient synchronous update mechanism under preset sample input conditions. Guided by the determinant of the Jacobian matrix, it introduces into the preset loss function a cosine distance penalty term on the gradient vectors of the visual cues and text cues, so that the gradient directions are updated synchronously and the target cue parameters are generated. The ratio of the number of target cue parameters to the total number of parameters of the text encoder and the visual encoder is less than the preset parameter scale ratio threshold, and the target cue parameters reside in an independent GPU memory space.
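A minimal sketch of the cosine distance penalty and the parameter scale check follows. The penalty weight, the function names, and the way the penalty is added to the base loss are assumptions; the patent states only that a cosine distance penalty term on the two gradient vectors is introduced into the preset loss function.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; 0 when the two gradients point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("a zero gradient has no direction")
    return 1.0 - dot / (norm_u * norm_v)

def penalized_loss(base_loss, grad_visual, grad_text, weight=0.1):
    """Base loss plus a weighted penalty for diverging update directions."""
    return base_loss + weight * cosine_distance(grad_visual, grad_text)

def within_parameter_budget(n_prompt, n_text_encoder, n_visual_encoder, max_ratio):
    """Check the prompt parameter count against the frozen encoders' total."""
    return n_prompt / (n_text_encoder + n_visual_encoder) < max_ratio
```

Aligned gradients leave the loss unchanged, while perpendicular gradients add the full `weight`, which pushes the two branches back towards a common update direction.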

[0056] This embodiment provides a collaborative adaptation mechanism; specifically, after the above anchoring modeling is completed, the UAV enters the online inspection route, and the system needs to synchronously adapt to the visual and text prompts in the continuously changing industrial site.

[0057] In practice, relying solely on the initial anchor point in the previous implementation method may still result in insufficient adaptation in real-world scenarios. This is because the appearance of the same defect varies significantly depending on factors such as the angle of illumination, residual rainwater, equipment aging, and shooting distance at the substation site. For example, oil stains on the main transformer may appear as an attachment area with obvious changes in edge density at close range, while at a distance, they may appear as an irregular dark pixel area.

[0058] To this end, the feature extraction unit uses a visual encoder with fixed (frozen) parameters to perform convolution processing on the real-time image stream and form multi-scale visual features. In a specific feature parsing example, shallow features are used to extract fine line cracks and edge fractures, mid-level features are used to identify the outlines of components such as skirts, flanges, and bushings, and deep features are used to identify the overall regions of the insulator string, the main transformer body, and the circuit breaker.

[0059] Based on this, the residual mapping unit extracts visual feature residuals. The feature residuals characterize the differences between the current scene image and the initialized visual anchor points, and reflect the new information added to the scene, such as post-rain reflections, oil stains, cracks, or dirt accumulation. Specifically, the visual feature residuals are obtained as the element-wise difference between the multi-scale visual features of the current real-time image stream and the baseline visual features, which are extracted from the visual anchor points by a preset encoder and cached.
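The element-wise residual described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names, the flat-list feature layout, the toy values, and the subtraction order (current minus baseline) are all assumptions introduced here.

```python
# Hypothetical sketch: names, shapes, and the subtraction order are assumptions.

def feature_residual(current, baseline):
    """Element-wise residual for one scale; inputs are equal-length flat lists."""
    if len(current) != len(baseline):
        raise ValueError("current and baseline features must have the same length")
    return [c - b for c, b in zip(current, baseline)]


def multiscale_residuals(current_feats, baseline_feats):
    """Apply the element-wise residual independently at every scale."""
    return {
        scale: feature_residual(current_feats[scale], baseline_feats[scale])
        for scale in baseline_feats
    }


# Baseline features cached from the visual anchor (toy values).
baseline = {"shallow": [0.25, 0.5], "mid": [1.0, 1.0], "deep": [0.5, 0.5]}
# Features of the current frame: the shallow and mid scales have changed.
current = {"shallow": [0.75, 0.5], "mid": [1.0, 1.25], "deep": [0.5, 0.5]}

residuals = multiscale_residuals(current, baseline)
```

In this toy case only the shallow and mid scales carry a non-zero residual, which corresponds to new edge-level and contour-level information while the deep, whole-device features are unchanged.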

[0060] The system does not use these differences to directly reshape the backbone encoder, but instead maps them into an independent tensor space to update the internal state of the target cue parameters. The independent tensor space is specifically implemented in the network structure as a learnable cue embedding layer that is added to the backbone visual encoder and text encoder independently. The target cue parameters are the trainable weight matrix in this cue embedding layer. The main purpose of the above configuration is that the backbone model retains general recognition capabilities, while local changes in the industrial field are only injected through a small number of cue parameters, avoiding feature shift caused by local noise interference to the entire model.

[0061] The gradient resonance unit triggers a bidirectional gradient synchronization update mechanism when the preset sample input conditions are met. The preset sample input conditions can be that the image quality meets the standard, the structured text is complete, and the device target is located in the center of the field of view. The system uses the update relationship between the visual side and the text side to form a Jacobian matrix determinant to determine whether the two update trends revolve around the same physical state.

[0062] In a specific computational example, suppose the visual update direction is characterized as moving either toward a crack or toward a reflection, and the text update direction as moving either toward the rupture semantics or toward the contamination semantics. When both converge toward the rupture state, the update is considered resonant; when the visual side moves toward the reflection while the text side moves toward rupture, ambient noise is interfering with the cue learning, and direction correction should be performed.

[0063] Throughout the process, the system adjusts only the target prompt parameters, whose number, as a proportion of the total number of parameters in the text encoder and the visual encoder, is less than the preset parameter scale ratio threshold. Edge nodes therefore do not need to fine-tune the entire model, which keeps memory usage from exceeding the preset memory usage ratio. Here, the preset parameter scale ratio threshold refers to the limit on the parameter scale of the target prompt parameters in this embodiment. If memory usage changes temporarily during engineering implementation because of hardware word length, tensor alignment, or batch-processing caches, these are only changes in the runtime cache and do not alter the ratio limit of the target prompt parameters relative to the total number of parameters in the two frozen encoders.
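As an illustration of the parameter scale constraint, the following sketch checks the ratio of prompt parameters to frozen encoder parameters against a threshold. The function names, the parameter counts, and the threshold value of 0.01 are hypothetical; the source does not disclose concrete numbers.

```python
# Hypothetical sketch: all counts and the threshold value are illustrative only.

def prompt_parameter_ratio(n_prompt, n_text_encoder, n_visual_encoder):
    """Ratio of trainable prompt parameters to the frozen encoders' parameters."""
    total = n_text_encoder + n_visual_encoder
    if total <= 0:
        raise ValueError("encoder parameter count must be positive")
    return n_prompt / total


def within_parameter_budget(n_prompt, n_text_encoder, n_visual_encoder,
                            ratio_threshold):
    """True when the ratio is strictly below the preset ratio threshold."""
    ratio = prompt_parameter_ratio(n_prompt, n_text_encoder, n_visual_encoder)
    return ratio < ratio_threshold


ratio = prompt_parameter_ratio(100_000, 60_000_000, 90_000_000)
ok = within_parameter_budget(100_000, 60_000_000, 90_000_000,
                             ratio_threshold=0.01)
too_large = within_parameter_budget(2_000_000, 60_000_000, 90_000_000,
                                    ratio_threshold=0.01)
```

Because the check counts parameters rather than bytes, temporary runtime caches do not affect its result, which matches the distinction drawn in the paragraph above.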

[0064] Furthermore, in this embodiment, synchronizing the gradient update directions by introducing a cosine distance penalty term between the visual cue gradient vector and the text cue gradient vector into the preset loss function means that the cosine distance is included as a penalty term in the objective function to be minimized. The update directions of the two branches are synchronized through the descent of the objective function, so that the visual cue gradient vector and the text cue gradient vector are drawn together rather than pushed further apart. In engineering implementation, the direction constraint term can take any form equivalent to the cosine distance, for example one minus the normalized cosine similarity, or the negative cosine similarity, as the direction term to be reduced.
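The two equivalent direction terms mentioned above can be sketched as follows. The function names and the plain-list vector representation are assumptions made for illustration; the two forms differ only by the constant 1, so minimizing either one pulls the two gradient directions together.

```python
import math

# Hypothetical sketch: gradient vectors are plain lists of floats.

def cosine_similarity(u, v):
    """Normalized cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def cosine_distance(u, v):
    """Direction term, form 1: one minus the cosine similarity."""
    return 1.0 - cosine_similarity(u, v)


def negative_cosine(u, v):
    """Direction term, form 2: the negative cosine similarity."""
    return -cosine_similarity(u, v)


aligned = cosine_distance([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
opposed = cosine_distance([1.0, 0.0], [-1.0, 0.0])
```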

[0065] Under abnormal operating conditions, if the proportion of target device pixels in the real-time image stream is lower than the preset proportion threshold, the proportion of occluded area exceeds the preset occlusion threshold, or the feature difference between consecutive frames exceeds the preset jump threshold, the residual mapping unit will pause the influence of the current frame on the target prompt parameters and only maintain the stable state of the previous round. If the sample input conditions are not met, such as missing text descriptions, overexposed images, or insufficient confidence in device category recognition, the gradient resonance unit will not trigger an update and will only retain the inference output. If the update of the independent tensor space is unstable multiple times in a row, the system can roll back to the most recently verified parameter snapshot to prevent the accumulation of error increments.
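The gating and rollback behavior described above can be sketched as follows. The class and field names, the threshold values, and the choice of three consecutive unstable updates as the rollback trigger are all assumptions; the source only says "multiple times in a row".

```python
from dataclasses import dataclass

# Hypothetical sketch: thresholds and the rollback count are illustrative only.

@dataclass
class FrameStats:
    target_pixel_ratio: float  # share of pixels belonging to the target device
    occluded_ratio: float      # share of the target that is occluded
    interframe_diff: float     # feature difference to the previous frame


def residual_update_allowed(stats, min_target_ratio, max_occluded_ratio, max_jump):
    """A frame may influence the prompt parameters only if no limit is violated."""
    return (stats.target_pixel_ratio >= min_target_ratio
            and stats.occluded_ratio <= max_occluded_ratio
            and stats.interframe_diff <= max_jump)


class PromptState:
    """Prompt parameters plus the most recently verified snapshot."""

    def __init__(self, params, max_unstable_runs=3):
        self.params = list(params)
        self.verified_snapshot = list(params)
        self.unstable_runs = 0
        self.max_unstable_runs = max_unstable_runs

    def mark_verified(self):
        self.verified_snapshot = list(self.params)
        self.unstable_runs = 0

    def apply_update(self, new_params, stable):
        self.params = list(new_params)
        if stable:
            self.unstable_runs = 0
            return "updated"
        self.unstable_runs += 1
        if self.unstable_runs >= self.max_unstable_runs:
            # Too many unstable updates in a row: restore the verified snapshot.
            self.params = list(self.verified_snapshot)
            self.unstable_runs = 0
            return "rolled_back"
        return "unstable"


limits = dict(min_target_ratio=0.05, max_occluded_ratio=0.40, max_jump=0.50)
good_frame = FrameStats(0.30, 0.10, 0.05)
tiny_target = FrameStats(0.01, 0.10, 0.05)

state = PromptState([0.0, 0.0])
state.apply_update([0.1, 0.2], stable=True)
state.mark_verified()
outcomes = [state.apply_update([9.0, 9.0], stable=False) for _ in range(3)]
```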

[0066] As a specific application scenario, during the nighttime inspection of this 220kV substation, when the drone circled the insulator string, the first two image frames showed strong reflections at the edges of the skirts due to the skewed angle of the searchlight; the third frame showed clear cracks after the viewing angle was adjusted. The feature extraction unit formed multi-scale features for each of the three frames, and the residual mapping unit found that the new changes in the first two frames came mainly from reflections, while the new changes in the third frame came mainly from local edge fractures. Based on this, the gradient resonance unit suppressed the pull of the first two frames on the crack prompt and retained the update from the third frame, which was consistent with the semantics of insulator-skirt-crack. In the end, the system needs to complete only a small number of parameter adjustments within the prompt space, and the proportion of the target prompt parameters relative to the total number of parameters of the text encoder and the visual encoder is kept below the preset parameter scale ratio threshold, so that the real crack area can be stably highlighted in subsequent images of the same insulator string.

[0067] The purpose of this step is to map newly added visual information from the field to an independent cue space with a small parameter scale, and to keep industrial visual changes consistent with defect semantic changes by synchronously updating constraints, thereby achieving continuous adaptation under small sample and low memory conditions.

[0068] Furthermore, the execution logic of the loss function is adjusted according to the comparison results as follows: if the value of the determinant of the Jacobian matrix is greater than or equal to the preset resonance threshold, the gradient direction is determined to be divergent, and the cosine distance between the visual cue gradient vector and the text cue gradient vector is added to the loss function as a penalty term to force alignment of the update direction through backpropagation; if the value of the determinant of the Jacobian matrix is less than the preset resonance threshold, the gradient direction is determined to be resonant, and the initial text prefix is updated according to the current gradient direction.

[0069] This embodiment provides a gradient synchronization determination and loss adjustment mechanism; specifically, based on the previous implementation, the system further provides specific processing logic when the update relationship between the visual side and the text side deviates.

[0070] In practice, using only the synchronous update framework still has limitations, and pseudo-feature alignment problems may still occur under extreme conditions. The pseudo-feature alignment phenomenon is characterized by the visual side being pulled by strong reflections, occlusion shadows, or metallic stains, while the text side is still updating around the semantics of cracks or oil leaks. Although the feature distribution of both has changed, the feature shift does not represent the same physical state. To avoid the amplification of this situation, this embodiment compares the determinant of the Jacobian matrix with a preset resonance threshold to form two different branches.

[0071] When the comparison shows that the determinant of the Jacobian matrix reaches or exceeds the preset resonance threshold, the system determines that the current update is in a divergent state. The divergent state is characterized as follows: the visual side generates a high attention response, but the text side does not support that the position belongs to the target state; for example, a bright specular reflection in the image has a visual feature response value greater than the preset local response threshold but does not represent oil leakage or cracking.

[0072] At this point, the system adds a cosine distance penalty term between the visual cue gradient vector and the text cue gradient vector to the loss function, thereby suppressing further deviation between the two update directions; the specific calculation formula of the loss function is adjusted based on the comparison results as follows:

[0073] $$L_{total} = L_{base} + \lambda \left(1 - \frac{g_v \cdot g_t}{\lVert g_v \rVert \, \lVert g_t \rVert}\right)$$

[0074] where $L_{total}$ is the total loss function after the penalty term is added; $L_{base}$ is the original basic task loss function of the model; $\lambda$ is the preset penalty term weight coefficient; $g_v$ is the extracted visual cue gradient vector; $g_t$ is the extracted text cue gradient vector; the fraction in the latter part of the formula is the cosine similarity between the two gradient vectors, and one minus this fraction represents the cosine distance between the two vectors, which forces the update directions into alignment through backpropagation; $\lVert \cdot \rVert$ denotes the norm of a vector; and $\cdot$ denotes the dot product operation between vectors.
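The two-branch loss adjustment can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and numeric values are hypothetical, and the determinant of the Jacobian matrix is taken as a precomputed input because the source does not specify how it is formed.

```python
import math

# Hypothetical sketch: the Jacobian determinant is supplied by the caller.

def cosine_distance(g_v, g_t):
    """One minus the cosine similarity of two equal-length gradient vectors."""
    norm_v = math.sqrt(sum(x * x for x in g_v))
    norm_t = math.sqrt(sum(x * x for x in g_t))
    if norm_v == 0.0 or norm_t == 0.0:
        raise ValueError("gradient vectors must be non-zero")
    dot = sum(a * b for a, b in zip(g_v, g_t))
    return 1.0 - dot / (norm_v * norm_t)


def adjusted_loss(base_loss, g_v, g_t, jacobian_det, resonance_threshold,
                  penalty_weight):
    """Return (loss, branch) following the divergent / resonant decision."""
    if jacobian_det >= resonance_threshold:
        # Divergent: add the weighted cosine distance as a penalty term.
        penalty = penalty_weight * cosine_distance(g_v, g_t)
        return base_loss + penalty, "divergent"
    # Resonant: keep the base loss and update along the current direction.
    return base_loss, "resonant"


divergent_case = adjusted_loss(1.0, [1.0, 0.0], [0.0, 1.0],
                               jacobian_det=0.9, resonance_threshold=0.5,
                               penalty_weight=0.5)
resonant_case = adjusted_loss(1.0, [1.0, 0.0], [1.0, 0.0],
                              jacobian_det=0.1, resonance_threshold=0.5,
                              penalty_weight=0.5)
```

With orthogonal gradients the cosine distance is 1, so the divergent case adds the full penalty weight to the base loss; the resonant case leaves the loss untouched.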

[0075] In a specific data processing example, if water droplets on the flange surface in a certain image frame draw strong attention from the visual side while the text description is still flange connection - oil leakage, the penalty term weakens the tendency to treat the highlight as an anomaly, steering the cue learning back toward the direction that is genuinely related to the oil leakage diffusion boundary.

[0076] When the comparison result is lower than the preset resonance threshold, the system determines that the current visual and text updates revolve around the same physical state and can directly update the initial text prefix along the current direction. This mechanism is configured as a constrained update strategy: when a new visual feature in the field image matches an existing defect semantic, the system allows the cue space to fit and update the new visual feature. For example, a broken insulator will exhibit a different crack contrast under twilight backlight conditions than during the day. If the visual and text update directions are consistent, the system can improve the expression of the break semantic in this scenario.

[0077] Under abnormal operating conditions, if the results near the resonance threshold fluctuate frequently, the system can adopt a buffered judgment over several consecutive frames. That is, the strategy is not modified immediately because of a sudden deviation in a single frame; instead, the same judgment trend must be met continuously before the branch is switched, so as to avoid false triggering caused by image jitter. If the cosine distance penalty term continues to take effect but the directions still cannot be stably aligned, the system marks the current sample as a difficult sample and pushes it to the manual review queue. If performance degrades after the text prefix has been updated for a certain number of rounds, the system first rolls back to the most recent stable version and then re-evaluates the resonance threshold branch.
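The buffered judgment over consecutive frames can be sketched as a simple debounce. The class name, the requirement of three consecutive frames, and the determinant values are assumptions for illustration only.

```python
# Hypothetical sketch: switch branches only after several consistent frames.

class BufferedBranch:
    def __init__(self, resonance_threshold, required_frames=3,
                 initial="resonant"):
        self.resonance_threshold = resonance_threshold
        self.required_frames = required_frames
        self.state = initial
        self.streak = 0  # consecutive frames that disagree with the state

    def observe(self, jacobian_det):
        """Feed one frame's determinant and return the active branch."""
        if jacobian_det >= self.resonance_threshold:
            candidate = "divergent"
        else:
            candidate = "resonant"
        if candidate == self.state:
            self.streak = 0  # a single agreeing frame resets the buffer
        else:
            self.streak += 1
            if self.streak >= self.required_frames:
                self.state = candidate
                self.streak = 0
        return self.state


branch = BufferedBranch(resonance_threshold=0.5, required_frames=3)
history = [branch.observe(d) for d in (0.6, 0.4, 0.6, 0.6, 0.6)]
```

The isolated first spike does not switch the branch; only the three consecutive values above the threshold at the end do.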

[0078] As a specific application scenario, during the inspection of this substation, when the drone approached the oil tank of the main transformer, the residual water film after rain produced a bright band under the searchlight. The visual model determined that this banded area had a significant feature response in the initial feature extraction stage, but the text side was more inclined to the edge diffusion pattern of the flange connection regarding the semantics of oil leakage. The comparison results showed that the current update was divergent, so the system strengthened the penalty for directional deviation and finally suppressed the bright water film. When the drone adjusted its angle, it captured a dark attachment area with continuous downward drag marks at the bottom of the flange. The visual and text update directions became consistent, so the system retained and followed this gradient update direction to perform contextual enhancement on the text prefixes related to oil leakage.

[0079] The purpose of this step is to establish clear release and correction branches for cue learning, thereby achieving active suppression of industrial noise and stable absorption of real defect semantics.

[0080] Furthermore, the fitting segmentation module includes: a tensor inner product unit, used to calculate the inner product matrix of the visual cue feature tensor and the text cue feature tensor; a heatmap generation unit, used to normalize the inner product matrix to generate a cross-modal attention heatmap; and a priori decoding unit, used to input the cross-modal attention heatmap as the spatial attention weight into the mask decoder, and combine it with multi-scale visual feature decoding to output a pixel-level segmentation mask.

[0081] This embodiment provides a fitting segmentation mechanism; specifically, after completing the synchronous adaptation of the prompt parameters, the system needs to accurately map the state features described in the text to the local area of the device in the image, thereby outputting a pixel-level mask that can be used for inspection.

[0082] In practice, simply updating both the visual and textual sides does not equate to obtaining usable segmentation results. Without a spatial fitting process, the system can only detect potential cracks or leaks in a particular image frame, but cannot pinpoint the actual location of the defect within the device. This embodiment uses the tensor inner product unit to calculate the inner product matrix between the visual cue feature tensor and the text cue feature tensor. The specific formula is as follows:

[0083] $$M = F_v \times F_t^{\top}$$

[0084] where $F_v$ is the visual cue feature tensor; $F_t^{\top}$ is the transpose of the text prompt feature tensor; $M$ is the inner product matrix, used to characterize the distribution of correspondence strength between the spatial locations of the image and the text state; $F_t$ is the text prompt feature tensor; and $\times$ denotes the matrix multiplication operator.

[0085] The heatmap generation unit normalizes this inner product matrix $M$ to generate a cross-modal attention heatmap. The normalization calculation formula is:

[0086] $$A_{i,j} = \frac{\exp(M_{i,j})}{\sum_{i'} \exp(M_{i',j})}$$

[0087] where $M_{i,j}$ represents the response intensity at spatial position $i$ and text feature dimension $j$ in the inner product matrix; $A_{i,j}$ represents the cross-modal attention weight obtained after normalization; and $\exp(\cdot)$ is the exponential function with the natural constant $e$ as its base. The summation in the denominator may, for example, run over the spatial positions $i'$ for a given text feature dimension $j$. The purpose of normalization is to make the correspondence strengths of different frames and different devices comparable, and to avoid real defect parts being suppressed because the original response elsewhere is too strong.

[0088] Normalization does not simply compress numerical values, but makes the corresponding intensities between different frames and different devices comparable, avoiding the suppression of real defects by some bright or large background areas due to the original response being too strong; cross-modal attention heatmap is used to characterize the spatial response prior of state semantic features: the higher the response weight of a region, the more it conforms to the defect semantics defined in the current text.
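The inner product and normalization steps can be sketched as follows. This is an illustration under assumptions: the feature layout (one row per spatial position for the visual features, one row per text feature for the text features), the function names, and the choice of a softmax over spatial positions as the normalization are not confirmed by the source.

```python
import math

# Hypothetical sketch: F_v is N x d (spatial positions), F_t is K x d (text).

def inner_product_matrix(F_v, F_t):
    """M[i][j] is the inner product of visual row i and text row j (M = F_v F_t^T)."""
    return [[sum(a * b for a, b in zip(v, t)) for t in F_t] for v in F_v]


def spatial_softmax(M):
    """Normalize every column of M over the spatial positions (assumed form)."""
    n, k = len(M), len(M[0])
    A = [[0.0] * k for _ in range(n)]
    for j in range(k):
        column = [M[i][j] for i in range(n)]
        peak = max(column)  # subtract the maximum for numerical stability
        exps = [math.exp(x - peak) for x in column]
        total = sum(exps)
        for i in range(n):
            A[i][j] = exps[i] / total
    return A


# Three spatial positions with 2-dimensional features, one text prompt feature.
F_v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
F_t = [[1.0, 1.0]]

M = inner_product_matrix(F_v, F_t)
heatmap = spatial_softmax(M)
```

The third position has the largest inner product with the text feature, so it receives the highest attention weight, and the weights over all positions sum to one.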

[0089] The prior decoding unit uses the cross-modal attention heatmap as the spatial attention weight input to the mask decoder and combines it with multi-scale visual features to generate a pixel-level segmentation mask. The purpose of introducing the above spatial prior mechanism is that industrial defects usually have small geometric dimensions and subtle edge features. If only the general segmentation backbone is relied upon, background stains, bolt shadows, or equipment nameplates are easily mistakenly included in the target area. By introducing the heatmap prior, the decoder can prioritize refining the boundary at the position that highly matches the state semantics, thereby improving the positioning accuracy of micro-cracks and local oil stains.

[0090] Under abnormal operating conditions, if the responses of multiple regions in the inner product matrix are similar, it indicates that the current text state lacks a clear landing point in the image. The system can output multiple candidate hotspots with low confidence labels instead of forcibly generating a single mask. If the distribution of the heatmap after normalization is too scattered, it indicates that the text description may be too broad, such as only writing "device malfunction". The system will then prompt for the text semantics to be refined. If the area of the result obtained after mask decoding exceeds the preset area judgment threshold and significantly exceeds the physical boundary of the component, the system can fall back to the previous heatmap and re-constrain the boundary to prevent the anomaly from spreading.

[0091] As a specific application scenario, during the inspection of insulator strings in this substation, the text prompt is "Insulator - Skirt - Crack". The tensor inner product unit fits the corresponding text state to each local region of the current image and finds that the region near the outer edge of the third skirt has the highest response. After the heatmap is generated, this region forms a thin band of interest. The prior decoding unit then combines the shallow crack edge features and the mid-level skirt contour features to converge the band of interest into a continuous crack mask, without erroneously extending to nearby hardware connections.

[0092] Similarly, in the main transformer flange area, the text prompt is "Main transformer - flange connection - oil leak". The system distributes the high-weight areas of the heat map to the bottom of the flange and the trace area extending downwards, ultimately forming an oil trace mask that extends along the direction of gravity.

[0093] The purpose of this step is to transform cross-modal semantic correspondences into interpretable spatial priors, thereby enabling accurate depiction of equipment defect areas and effective suppression of background interference.

[0094] Furthermore, the closed-loop deployment module includes: a state recognition unit, which acquires the multi-scale visual features transmitted by the collaborative adaptation module or the fitting segmentation module, and uses them to extract the connected component pixel area value and contour pixel perimeter value of the pixel-level segmentation mask, and compares the area connectivity ratio and edge complexity with the lower limit of the area connectivity ratio and the edge complexity feature coefficient corresponding to the inspected equipment component obtained from the preset two-dimensional numerical mapping table, respectively. If the area connectivity ratio of the connected component pixel area value to the total pixel area of the inspected equipment is greater than or equal to the lower limit of the area connectivity ratio, and / or the edge complexity calculated based on the connected component pixel area value and the contour pixel perimeter value is greater than or equal to the edge complexity feature coefficient, then it is determined that the inspected equipment has an abnormal defect state, so as to generate a state recognition result and output it to the monitoring terminal.

[0095] The report generation unit is used to stitch together the pixel-level segmentation mask with the status recognition results to generate an inspection report.

[0096] The parameter serialization unit is used to extract the target prompt parameters and encapsulate them into a parameter data package with attribute tags.

[0097] This embodiment provides a closed-loop deployment mechanism; specifically, after the pixel-level segmentation mask is output, the system further maps the mask into equipment status conclusions, inspection reports and reusable parameter data packets based on the output defect area image localization results;

[0098] In practice, simply outputting the segmentation mask cannot directly meet actual operation and maintenance needs. Operation and maintenance work requires determining whether the defect has reached a preset handling threshold. Therefore, the state recognition unit extracts the connected component pixel area value and the contour pixel perimeter value from the pixel-level segmentation mask. The connected component pixel area value reflects the extent of defect expansion, while the contour pixel perimeter value reflects the complexity of the defect edge. For oil leakage defects, large connected components that extend continuously downward often indicate that leakage has developed. For rupture defects, when the defect contour is elongated and the boundary changes sharply, the contour pixel perimeter value better reflects the severity of the crack. The system compares these features with a preset defect size threshold to generate a state recognition result. Specifically, the edge complexity is evaluated by the following formula:

[0099] $$C = \frac{P^{2}}{4\pi A}$$

[0100] where $P$ is the contour pixel perimeter value extracted from the pixel-level segmentation mask, and $A$ is the connected component pixel area value. It should be noted that both the connected component pixel area value and the contour pixel perimeter value are statistical feature values based on the image pixel scale, rather than absolute physical dimensions. Specifically, the connected component pixel area value is obtained by counting the total number of pixels contained in the defect connected component within the pixel-level segmentation mask; the contour pixel perimeter value is obtained by counting the number of boundary pixels of the connected component in the mask.

[0101] This formula uses an ideal circular boundary as a benchmark: the more curved and irregular the edge, the larger the calculated value of the edge complexity $C$. When $C \geq C_{th}$, the system can determine that the equipment has an abnormal defect, where $C_{th}$ is the preset edge complexity feature coefficient and $\pi$ is pi.
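The pixel-count definitions of area and perimeter can be sketched as follows. This is a minimal illustration: the use of 4-connectivity, the definition of a boundary pixel as a mask pixel with at least one 4-neighbor outside the mask, and all names are assumptions not stated in the source.

```python
from collections import deque

# Hypothetical sketch: mask is a list of rows holding 0 (background) or 1 (defect).

NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))  # 4-connectivity (assumed)


def connected_components(mask):
    """Return a list of components, each a list of (row, col) pixels."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy, dx in NEIGHBORS:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(pixels)
    return components


def area_and_perimeter(mask, pixels):
    """Area is the pixel count; perimeter is the count of boundary pixels."""
    h, w = len(mask), len(mask[0])

    def is_boundary(y, x):
        for dy, dx in NEIGHBORS:
            ny, nx = y + dy, x + dx
            if not (0 <= ny < h and 0 <= nx < w) or not mask[ny][nx]:
                return True
        return False

    return len(pixels), sum(1 for y, x in pixels if is_boundary(y, x))


mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 1],
]
components = connected_components(mask)
stats = [area_and_perimeter(mask, pixels) for pixels in components]
```

The 3 x 3 block has nine pixels of which only the center is not on the boundary, and the isolated corner pixel forms its own component because diagonal contact does not count under 4-connectivity.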

[0102] The threshold here can come from operation and maintenance experience, equipment procedures or historical maintenance standards, and is used to distinguish different levels such as minor anomalies, defects to be reviewed, and those recommended for immediate handling. Specifically, the preset defect size threshold is instantiated in the system as a two-dimensional numerical mapping table containing the lower limit of the area connectivity ratio and the edge complexity feature coefficients corresponding to different equipment components. The area connectivity ratio refers to the ratio of the pixel area of the defect connected domain in the extracted pixel-level segmentation mask to the pixel area of the corresponding inspected equipment or component as a whole.
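The lookup in the two-dimensional numerical mapping table and the resulting decision can be sketched as follows. The table contents, the component names, the circle-normalized form of the edge complexity, and the choice to treat "and / or" as a logical "or" are assumptions for illustration; the source gives no concrete values.

```python
import math

# Hypothetical sketch: table values are illustrative, not from the source.

MAPPING_TABLE = {
    "main_transformer_flange": {"min_area_ratio": 0.02, "edge_coeff": 1.5},
    "insulator_skirt": {"min_area_ratio": 0.005, "edge_coeff": 2.0},
}


def edge_complexity(perimeter, area):
    """Circle-normalized complexity: equals 1 for an ideal circle (assumed form)."""
    if area <= 0:
        raise ValueError("area must be positive")
    return perimeter ** 2 / (4.0 * math.pi * area)


def recognize_state(component, defect_area, perimeter, equipment_area,
                    table=MAPPING_TABLE):
    """Compare both features with the limits stored for the component."""
    limits = table[component]
    area_ratio = defect_area / equipment_area
    complexity = edge_complexity(perimeter, defect_area)
    abnormal = (area_ratio >= limits["min_area_ratio"]
                or complexity >= limits["edge_coeff"])
    return {"abnormal": abnormal, "area_ratio": area_ratio,
            "edge_complexity": complexity}


circle = edge_complexity(2 * math.pi * 3.0, math.pi * 3.0 ** 2)
leak = recognize_state("main_transformer_flange", defect_area=300,
                       perimeter=120, equipment_area=10_000)
speck = recognize_state("main_transformer_flange", defect_area=50,
                        perimeter=25, equipment_area=10_000)
```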

[0103] The report generation unit overlays the segmentation mask onto the original inspection image and adds information such as equipment name, component location, defect status, acquisition time, and waypoint number to generate a composite inspection report. This type of report can both preserve the original visual evidence and directly serve the maintenance workflow.

[0104] The parameter serialization unit extracts the optimized target prompt parameters for the current scene and encapsulates them into a parameter data package with attribute tags. The attribute tags can correspond to device category, weather conditions, shooting time, lens field of view, etc., and are used to identify in which type of scene the target prompt parameters are more suitable for reuse.

[0105] In this embodiment, the parameter data package is a unified name for the serialized target prompt parameters and their attribute tags. Under abnormal working conditions, if the segmentation mask is too scattered and the number of connected components is greater than the preset upper limit of the number of connected components, the state recognition unit can determine that the result may be affected by raindrops, stains or image noise, and first output a suspected abnormality pending manual confirmation, rather than directly escalating to a defect conclusion.

[0106] If the perimeter value of the outline pixels is greater than or equal to the preset perimeter anomaly threshold, but the area value of the connected component pixels is lower than the preset lower limit of the area value of the connected component pixels, it indicates that the target may be a cluster of small noise points, and the system can add minimum connected component filtering; if the device number, timestamp, or waypoint information is missing when the report is generated, only a temporary report will be generated and it will not be included in the formal maintenance log; if the attribute label is found to be incomplete during the serialization of the parameter data packet, the system will prohibit it from entering the knowledge base to prevent subsequent mis-calls;
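The serialization step and the rejection of incomplete attribute tags can be sketched as follows. The required tag names, the JSON encoding, and the function name are assumptions chosen for illustration; the source names device category, weather conditions, shooting time, and lens field of view only as examples of possible tags.

```python
import json

# Hypothetical sketch: the tag set and the JSON format are assumptions.

REQUIRED_TAGS = ("device_category", "weather", "shooting_time", "field_of_view")


def serialize_package(prompt_params, tags):
    """Encapsulate prompt parameters with attribute tags; reject incomplete tags."""
    missing = [name for name in REQUIRED_TAGS if not tags.get(name)]
    if missing:
        raise ValueError("incomplete attribute tags: " + ", ".join(missing))
    return json.dumps({"tags": tags, "prompt_params": prompt_params},
                      sort_keys=True)


package = serialize_package(
    [0.1, 0.2, 0.3],
    {"device_category": "main transformer", "weather": "after rain",
     "shooting_time": "evening", "field_of_view": "wide"},
)
restored = json.loads(package)

# A package with missing tags is kept out of the knowledge base.
try:
    serialize_package([0.1], {"weather": "after rain"})
    rejected = False
except ValueError:
    rejected = True
```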

[0107] As a specific application scenario, after the inspection of the main transformer area of the substation was completed, the system extracted a main connected region and two secondary drag mark connected regions from the oil stain mask at the bottom of the flange. Based on the pixel area value of the connected region, it can be determined that the oil stain has formed a continuous attachment surface. Combined with the perimeter value of the contour pixel, it can be inferred that there is an irregular extension caused by gravity drag at the lower edge of the oil stain. Therefore, the status recognition unit judges it as a developing oil leak and recommends review and processing.

[0108] The report generation unit overlays the recognition conclusion on the original image and the mask image and outputs them together to the monitoring terminal; at the same time, the prompt parameters optimized in this round are encapsulated into parameter data packages and labeled with after-rain, evening, 220kV outdoor equipment, insulator rupture / main transformer oil leakage;

[0109] The purpose of this step is to transform the segmentation results into executable operation and maintenance information, and to precipitate the scenario-based experience accumulated from a single inspection into parameter data packages that can be directly called in the future.

[0110] Furthermore, the closed-loop deployment module also includes: a knowledge loading unit, used to receive new scene image streams and parse the scene attributes of the new scene image streams; a matching and calling unit, used to load the target prompt parameters in the parameter data package into the memory space of a preset edge computing node when the scene attributes match the attribute labels of the parameter data package, so as to perform a zero-shot segmentation task; wherein, the zero-shot segmentation task is completed by the fitting segmentation module calling the target prompt parameters; the matching and calling unit is also used to trigger a retraining instruction when the scene attributes do not match the attribute labels of the parameter data package.

[0111] This embodiment provides a knowledge loading and matching invocation mechanism; specifically, after forming a parameter data package in the previous embodiment, the system further supports the direct reuse of existing prompt knowledge in new scenarios, without having to retrain each time.

[0112] In practice, simply storing data packets has limitations. If there is no matching and filtering mechanism when a new scene arrives, the system may incorrectly call target prompt parameters that do not match the current working conditions, thus reducing the segmentation effect. For example, target prompt parameters formed in a sunny daytime scene may not be suitable for a nighttime scene with strong reflections after rain. Therefore, this embodiment sets up a knowledge loading unit to parse the scene attributes of the new scene image stream. Scene attributes may include device type, shooting time, weather conditions, lens angle, and background complexity. In a specific data processing example, the new scene attributes can be parsed as device = circuit breaker, weather = after rain, time period = nighttime, and defect word = oil leak.

[0113] The matching and calling unit compares the scene attributes with the attribute tags of the parameter data packages in the knowledge base. If a match is found, the corresponding target prompt parameters are loaded directly into the memory space of the target device to perform a zero-shot segmentation task. Here, zero-shot means that no new labeled samples are provided for the current device or defect combination in the real-time task, but the system can use the state semantic transfer capability accumulated in the existing parameter data packages to complete the segmentation. For example, although the historical training samples mainly come from main transformer oil leakage, as long as the attributes of the new scene are compatible with the state semantics of after rain, night, and oil leakage, the system can still directly segment oil leakage on a circuit breaker shell.

[0114] If the match fails, it means that the similarity between the current parameter data package and the field working condition attributes is lower than the preset working condition matching threshold. At this time, the system avoids calling the existing target prompt parameters and instead triggers the retraining instruction to re-enter the anchoring modeling and collaborative adaptation process. This can prevent the system from treating irrelevant textures as defects due to erroneous knowledge transfer.
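The match-or-retrain decision can be sketched as follows. The similarity measure (the fraction of attribute keys whose values agree), the threshold of 0.7, and all names are assumptions; the source does not define how the similarity to the working condition attributes is computed.

```python
# Hypothetical sketch: the similarity measure and threshold are illustrative.

def attribute_similarity(scene, tags):
    """Fraction of attribute keys on which the scene and the tags agree."""
    keys = set(scene) | set(tags)
    if not keys:
        return 0.0
    agreeing = sum(1 for key in keys if scene.get(key) == tags.get(key))
    return agreeing / len(keys)


def match_and_call(scene, packages, match_threshold):
    """Load the best matching prompt parameters, or request retraining."""
    best, best_score = None, -1.0
    for package in packages:
        score = attribute_similarity(scene, package["tags"])
        if score > best_score:
            best, best_score = package, score
    if best is None or best_score < match_threshold:
        return {"action": "retrain"}
    return {"action": "load", "prompt_params": best["prompt_params"],
            "score": best_score}


knowledge_base = [{
    "tags": {"device": "main transformer", "weather": "after rain",
             "time": "night", "defect": "oil leak"},
    "prompt_params": [0.1, 0.2, 0.3],
}]

breaker_scene = {"device": "circuit breaker", "weather": "after rain",
                 "time": "night", "defect": "oil leak"}
sunny_scene = {"device": "porcelain insulator", "weather": "sunny",
               "time": "day", "defect": "flashover"}

reuse = match_and_call(breaker_scene, knowledge_base, match_threshold=0.7)
fallback = match_and_call(sunny_scene, knowledge_base, match_threshold=0.7)
```

In this toy setup the circuit breaker scene agrees with the stored package on three of four attributes and is reused, while the sunny flashover scene agrees on none and triggers retraining.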

[0115] In this embodiment, the parameter data package and the target prompt parameters have a clear hierarchical relationship: the target prompt parameters are the parameter content that is actually loaded into the target device memory space and participates in zero-shot segmentation, and the parameter data package is the serialized encapsulation of the target prompt parameters and their attribute tags; the matching object of the matching and calling unit is the attribute tag of the parameter data package, and the loading object is the target prompt parameters in the parameter data package;

[0116] In abnormal operating conditions, if scene attributes only partially match, such as different device types but the same weather and time of day, the system can first call similar parameter data packages with low priority for trial segmentation, and output labels that require manual verification; if the confidence of the trial segmentation is low, the reuse will be terminated immediately and retraining will begin; if there are multiple matching parameter data packages in the knowledge base, the system will prioritize the version with the most recent time, the highest device similarity, and the most stable historical verification results; if the extraction of attributes from the new scene image stream fails, such as missing weather information or unclear device classification, the system will not automatically load the parameter data package by default to avoid misjudgment;

[0117] As a specific application scenario, after completing this round of inspection of the aforementioned 220kV substation, the system was dispatched the next day to another site in the same area to conduct a nighttime post-rain inspection of the circuit breaker area. The knowledge loading unit parsed the attributes of nighttime, post-rain, metal equipment casing, and suspected oil leakage from the new image stream. The matching invocation unit found that the knowledge base already contained parameter data packages associated with post-rain conditions, evening-to-nighttime lighting, and the oil leakage state. Although the training samples came mainly from oil leakage at the main transformer flange, their state semantics were highly similar to the current task, so the target prompt parameters were loaded directly into the edge node memory and zero-shot segmentation was performed.

[0118] As a result, the system can still output the oil stain mask at the lower edge of the metal casing without re-labeling the circuit breaker oil leakage samples; if the new scenario is changed to sunny day, strong sunlight, and flashover marks on porcelain insulators, the system will directly trigger a retraining command due to the obvious mismatch in attributes.

[0119] The purpose of this step is to enable contextualized prompts to be reused or rejected precisely by attributes, thereby achieving rapid deployment across sites and devices and controlling the risk of error migration.

[0120] Furthermore, the system also includes a boundary filtering module, which performs semantic attribute determination on the on-site image data before inputting it into the collaborative adaptation module as a real-time image stream; if the on-site image data contains target pixels with physical semantic features, then the on-site image data is input into the collaborative adaptation module as the real-time image stream; otherwise, the on-site image data is intercepted and discarded.

[0121] This embodiment provides a boundary filtering mechanism. Specifically, before an image enters the collaborative adaptation module, the system first determines whether the image should enter the multimodal prompt learning process, so that invalid data does not pollute the prompt space.

[0122] In practice, relying solely on the aforementioned online adaptation mechanism may still let a large number of meaningless frames through in continuous video streaming scenarios. For example, during drone relocation, the camera may capture the sky, fences, reflective standing water on the ground, pure background vegetation, or frames blurred by short-term camera shake. Even if these images contain visual edges and textures, they do not carry clear device physical semantics. If they are sent directly to the collaborative adaptation module, irrelevant features may be incorrectly introduced into the prompt space. Therefore, this embodiment sets up a boundary filtering module to determine the semantic attributes of the image data.

[0123] The purpose of this semantic attribute determination is not to identify the overall content richness of the image, but to confirm whether it contains target pixels corresponding to the inspection task; the so-called target pixel with physical semantic features refers to the pixel or pixel region that can correspond to at least one of the following: device entity, component entity, or state semantics.

[0124] In a specific data processing example, if an insulator string outline and umbrella skirt structure are detected in a frame of an image, even if no defects are found, they are target pixels with physical semantic features and should be retained; if a frame only contains the sky and light glare, without equipment outlines or related components, it should be discarded directly; this setting helps to limit the system's applicable boundaries to images with industrial semantic support, rather than forcibly learning all visual noise.

[0125] Under abnormal operating conditions, if the image is in a boundary state, for example it captures only a local corner of the device, the occlusion area ratio exceeds the preset occlusion threshold, or the target pixel ratio is below the preset lower limit, the system can place it in a candidate buffer queue instead of discarding it immediately, and decide whether to input it into the collaborative adaptation module only after joint judgment with adjacent frames. If the semantic attribute determination model itself is unstable, the system adopts a conservative strategy: it reduces the frequency of fine-tuning updates and intercepts image frames that lack physical semantics. If multiple consecutive frames are intercepted, the monitoring terminal prompts the drone to adjust its heading or lens attitude.
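The pass, buffer, and discard routing of the boundary filtering module can be sketched as below. The numeric limits, the majority vote used for the joint judgment with adjacent frames, and the consecutive-interception limit are assumptions for illustration only; the embodiment refers to preset values without stating them.

```python
from collections import deque
from typing import Deque, List

MIN_TARGET_RATIO = 0.05   # hypothetical preset lower limit on target-pixel ratio
MAX_OCCLUSION = 0.6       # hypothetical preset occlusion threshold
INTERCEPT_LIMIT = 3       # hypothetical count that triggers a drone prompt


def classify_frame(target_ratio: float, occlusion_ratio: float) -> str:
    """Return 'pass', 'buffer', or 'discard' for one frame."""
    if target_ratio == 0.0:
        return "discard"   # no device, component, or state pixels at all
    if target_ratio < MIN_TARGET_RATIO or occlusion_ratio > MAX_OCCLUSION:
        return "buffer"    # boundary state: wait for adjacent frames
    return "pass"


def resolve_buffer(buffered: Deque[dict], neighbour_decisions: List[str]) -> List[dict]:
    """Release buffered frames only if a majority of adjacent frames passed."""
    passed = sum(d == "pass" for d in neighbour_decisions)
    release = passed * 2 > len(neighbour_decisions)
    frames = list(buffered) if release else []
    buffered.clear()
    return frames


def should_prompt_drone(decisions: List[str]) -> bool:
    """True when the most recent INTERCEPT_LIMIT frames were all discarded."""
    tail = decisions[-INTERCEPT_LIMIT:]
    return len(tail) == INTERCEPT_LIMIT and all(d == "discard" for d in tail)
```

Buffering rather than discarding keeps partially visible equipment available for a later decision, at the cost of a short delay before those frames can reach the adaptation step.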

[0126] As a specific application scenario, during the nighttime inspection route of the substation, when the drone flew from the main transformer area to the insulator string area, it continuously collected three types of images: the first type was the reflection of water puddles on the ground, the second type was the shadow of the side fence and conductors of the equipment area, and the third type was a close-up view of the target insulator string.

[0127] The boundary filtering module determines that although the first two types of images contain complex textures, they do not have clear device-component-state semantics, so they are intercepted and discarded; the third type of image contains the outline of the insulator skirt and the connecting hardware, which are target pixels with physical semantic features, so they are input to the collaborative adaptation module; in this way, subsequent prompts are only based on the actual inspection objects.

[0128] The purpose of this step is to establish input boundaries for multimodal prompt learning, preventing semantically meaningless images and pure geometric noise from entering the update path, thereby achieving more stable online adaptation and a more clearly defined scope of applicability for the system.

[0129] Furthermore, the target scenario is a substation industrial scenario; the structured text includes at least one of the following: transformer equipment entity, insulator component entity, and description of equipment rupture state; the real-time image stream is acquired and transmitted by a preset edge computing node, and the edge computing node uses independently allocated video memory space to run the target prompt parameters.

[0130] This embodiment provides an edge deployment mechanism for substation industrial scenarios. Specifically, building on the aforementioned implementations, it further defines the objects to which the system applies, the typical text content, and the way edge computing power is organized.

[0131] In practical implementation, this system targets the industrial scenario of substations rather than arbitrary open image environments. This is because the structured text, the coupled projection layer, and the prompt parameter update logic of this system are all built around the strong coupling between the equipment carrier and its physical state. Therefore, the structured text contains at least one of, or a combination of, transformer equipment entities, insulator component entities, and equipment rupture state descriptions. Through these industrial entities and state semantics, the system can establish task descriptions that can actually be inspected, such as main transformer-flange connection-oil leakage and insulator-shoulder-rupture.

[0132] The real-time image stream is acquired and transmitted by an edge computing node, which can be deployed on a UAV platform, an inspection robot, or a local server within a station. Unlike traditional full-model fine-tuning methods that require large-scale GPU memory, this embodiment uses independently allocated GPU memory to run the target prompt parameters. That is, it concentrates limited computing power on prompt space updates, cross-modal fitting, and mask decoding, without modifying the parameters of the ultra-large-scale backbone network. With this configuration, the edge device can complete the closed loop of acquisition, segmentation, judgment, and return transmission on-site, without having to upload all images to the central server for offline processing.

[0133] In a specific architecture configuration, if the edge node has limited video memory, the system gives priority to keeping resident the fixed-parameter visual encoder, the fixed-parameter text encoder, and a very small number of target prompt parameters, with the target prompt parameters and their intermediate states occupying an independent video memory space. When the inspection task is switched, only this prompt data needs to be replaced rather than the entire backbone model, which reduces the bandwidth consumed by data transmission and shortens the task switching latency.
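The task-switching behavior can be sketched as a node that keeps the frozen encoders resident and replaces only a small prompt slot. The class `EdgeNode`, the counter `backbone_loads`, and the 4-bytes-per-parameter figure (32-bit floats) are assumptions for illustration, and the encoders are stand-in objects rather than real models.

```python
from typing import List, Optional

BYTES_PER_PARAM = 4  # assumes 32-bit floating-point prompt parameters


def transfer_bytes(param_count: int) -> int:
    """Bytes that must be moved to install a set of prompt parameters."""
    return param_count * BYTES_PER_PARAM


class EdgeNode:
    def __init__(self, visual_encoder: object, text_encoder: object) -> None:
        # Frozen backbones are loaded once and never replaced afterwards.
        self.visual_encoder = visual_encoder
        self.text_encoder = text_encoder
        self.backbone_loads = 1
        # Independent slot holding only the current task's prompt parameters.
        self.prompt_slot: Optional[List[float]] = None

    def switch_task(self, prompt_parameters: List[float]) -> int:
        """Swap the prompt slot and return the bytes transferred for the switch."""
        self.prompt_slot = list(prompt_parameters)
        return transfer_bytes(len(self.prompt_slot))
```

In this sketch, the cost of a switch scales with the number of prompt parameters only, which is the property the paragraph relies on for lower bandwidth use and shorter switching latency.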

[0134] Under abnormal operating conditions, if the edge nodes have insufficient instantaneous video memory, the system can temporarily store the real-time image stream and prioritize key frame segmentation to reduce the processing frequency of non-key frames; if the image acquisition link is interrupted, the system will keep the most recent stable prompt parameters unchanged and record the task to be retransmitted; if the structured text only contains the device entity and lacks a state description, the system can perform device contour-level segmentation but will not output a clear defect level; if the current scene exceeds the industrial boundary of the substation, such as capturing images of vehicles or personnel activity areas on roads outside the station, the system will only perform basic target filtering and will not call the defect prompt link.

[0135] As a specific application scenario, during the evening inspection of this 220kV substation after rain, the UAV's onboard edge node collects real-time images of the main transformer area and insulator string area, and sends the video frames to the system. The structured text contains at least industrial semantic items such as main transformer oil leakage and insulator skirt rupture. Due to the limited video memory of the edge node, the system does not attempt to fine-tune the global parameters of the complete large-scale model, but only loads the target prompt parameters and intermediate states corresponding to the current task in the independent video memory space. After the UAV completes a waypoint, it can immediately send an alarm back to the monitoring terminal based on the mask and state recognition results. When flying to the next waypoint, it only needs to continue to use or switch the prompt parameter data package, without reloading the backbone model.

[0136] The purpose of this step is to clarify the application boundaries and edge deployment methods of this system in the industrial field of substations, so as to achieve on-site inspection segmentation and status identification under low memory and low latency conditions.

[0137] It should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims

1. An open-vocabulary substation equipment segmentation and inspection system based on multimodal cue learning, characterized in that the system includes: an image acquisition module, used to acquire on-site image data of the target scene; an anchoring modeling module, connected to the image acquisition module, used to acquire structured text of the target scene as the initial text prefix input to a preset parameter-fixed text encoder, and to acquire sample images as visual anchors; a collaborative adaptation module, connected to the anchoring modeling module, used to acquire the on-site image data as a real-time image stream, extract multi-scale visual features through a preset parameter-fixed visual encoder and input them into the visual-language orthogonal coupling projection layer, and, based on the bidirectional gradient synchronous update mechanism built into the projection layer, fine-tune the initial text prefix in a preset independent tensor space to obtain target prompt parameters and output the visual prompt feature tensor and the text prompt feature tensor; a fitting segmentation module, connected to the collaborative adaptation module, used to calculate the inner product of the visual prompt feature tensor and the text prompt feature tensor to generate a cross-modal attention heatmap, input the cross-modal attention heatmap into a preset mask decoder, and output a pixel-level segmentation mask in combination with the multi-scale visual features; and a closed-loop deployment module, connected to the fitting segmentation module, used to generate state recognition results based on the pixel-level segmentation mask and to serialize and store the target prompt parameters as a parameter data package with attribute labels; wherein the bidirectional gradient synchronous update mechanism is specifically configured as follows: constructing a visual prompt branch and a text prompt branch in the visual-language orthogonal coupling projection layer; generating perturbations in the local input direction by injecting preset vectors into the visual prompt branch and the text prompt branch respectively, wherein each preset vector is specifically configured as a random noise vector that follows a Gaussian distribution and has the same dimension as the input of the corresponding branch; extracting the visual prompt gradient vector corresponding to the multi-scale visual features and the text prompt gradient vector corresponding to the initial text prefix; recording the responses of the two branches to the local input direction, forming a mutual sensitivity matrix based on the first-order changes of the responses of the two branches to the local input direction, and calculating the Jacobian matrix determinant of the mutual sensitivity matrix; and comparing the value of the determinant of the Jacobian matrix with the preset resonance threshold obtained based on the characteristics of historical real defect samples, so as to adjust the preset loss function of the model and constrain the update directions of the two branches to be synchronized according to the comparison result.

2. The open vocabulary substation equipment segmentation and inspection system based on multimodal cue learning according to claim 1, characterized in that the anchoring modeling module includes: a text parsing unit, used to parse the preset inspection procedure into triplet data containing equipment entity, component entity and physical state, and to concatenate the triplet data into the structured text; a prefix injection unit, used to convert the structured text into word vectors and insert the word vectors as the initial text prefix into a preset deep network layer of the parameter-fixed text encoder; and an anchor extraction unit, used to extract the sample images from a multimodal database and use the sample images as the visual anchors to initialize the weight matrix of the visual-language orthogonal coupling projection layer.

3. The open vocabulary substation equipment segmentation and inspection system based on multimodal cue learning according to claim 1, characterized in that the collaborative adaptation module includes: a feature extraction unit, used to perform convolution processing on the real-time image stream through the visual encoder to generate the multi-scale visual features, and to extract the overall outline of the inspected equipment based on the multi-scale visual features; a residual mapping unit, used to extract the visual feature residual between the multi-scale visual features and baseline visual features, wherein the baseline visual features are extracted from the visual anchors, and to dynamically map the visual feature residual to the independent tensor space to update the internal state of the target prompt parameters; and a gradient resonance unit, used to trigger the bidirectional gradient synchronous update mechanism when the image quality and the structured text integrity of the real-time image stream meet preset sample input conditions, and, in combination with the determinant of the Jacobian matrix, to synchronously update the gradient direction by introducing into the preset loss function a cosine distance penalty term between the visual prompt gradient vector and the text prompt gradient vector, so as to generate the target prompt parameters; wherein the number of target prompt parameters is less than the total number of parameters of the text encoder and the visual encoder, and the target prompt parameters reside in an independent video memory space.

4. The open vocabulary substation equipment segmentation and inspection system based on multimodal cue learning according to claim 1, characterized in that adjusting the preset loss function of the model according to the comparison result specifically includes: if the value of the determinant of the Jacobian matrix is greater than or equal to the preset resonance threshold, determining that the gradient directions diverge, and adding the cosine distance between the visual prompt gradient vector and the text prompt gradient vector to the loss function as a penalty term to force alignment of the update directions through backpropagation; if the value of the determinant of the Jacobian matrix is less than the preset resonance threshold, determining that the gradient directions resonate, and updating the initial text prefix according to the current gradient direction.
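For illustration only, and not as part of the claim language, the threshold comparison in claim 4 can be sketched as below. The 2x2 size of the mutual sensitivity matrix, the penalty weight, and the handling of zero-length gradients are assumptions; the claims do not fix them.

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cos(u, v); a zero-length gradient is treated as fully misaligned (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def det2(m: List[List[float]]) -> float:
    """Determinant of a 2x2 matrix (the matrix size is an assumption)."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def adjusted_loss(base_loss: float,
                  sensitivity: List[List[float]],
                  grad_visual: Sequence[float],
                  grad_text: Sequence[float],
                  resonance_threshold: float,
                  penalty_weight: float = 1.0) -> Tuple[float, str]:
    # Determinant at or above the threshold: directions diverge, add the penalty.
    if det2(sensitivity) >= resonance_threshold:
        penalty = penalty_weight * cosine_distance(grad_visual, grad_text)
        return base_loss + penalty, "divergent"
    # Below the threshold: directions resonate, the loss is left unchanged.
    return base_loss, "resonant"
```

In a real training loop the penalty term would have to be differentiable with respect to the prompt parameters; this sketch only shows the branching logic on plain numbers.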

5. The open vocabulary substation equipment segmentation and inspection system based on multimodal cue learning according to claim 1, characterized in that the fitting segmentation module includes: a tensor inner product unit, used to calculate the inner product matrix between the visual prompt feature tensor and the text prompt feature tensor; a heatmap generation unit, used to normalize the inner product matrix to generate the cross-modal attention heatmap; and a prior decoding unit, used to input the cross-modal attention heatmap as spatial attention weights into the mask decoder and, in combination with the multi-scale visual features, decode and output the pixel-level segmentation mask.

6. The open-vocabulary substation equipment segmentation and inspection system based on multimodal cue learning according to claim 1, characterized in that the closed-loop deployment module includes: a state recognition unit, used to acquire the multi-scale visual features transmitted by the collaborative adaptation module, extract the connected component pixel area value and the contour pixel perimeter value of the pixel-level segmentation mask, extract the overall contour of the inspected equipment based on the multi-scale visual features to determine the overall pixel area of the inspected equipment, calculate the area connectivity ratio of the connected component pixel area value to the overall pixel area of the inspected equipment, and calculate the edge complexity based on the connected component pixel area value and the contour pixel perimeter value; wherein the area connectivity ratio and the edge complexity are compared with the lower limit of the area connectivity ratio and the edge complexity feature coefficient corresponding to the inspected equipment component, both obtained from a preset two-dimensional numerical mapping table; if the area connectivity ratio is greater than or equal to the lower limit of the area connectivity ratio, and/or the edge complexity is greater than or equal to the edge complexity feature coefficient, it is determined that the inspected equipment has an abnormal defect state, and the state recognition result is generated and output to the monitoring terminal; if neither condition is met, it is determined that the inspected equipment is in a normal operation state, and the corresponding state recognition result is generated and output to the monitoring terminal; a report generation unit, used to stitch the pixel-level segmentation mask with the state recognition result to generate an inspection report; and a parameter serialization unit, used to extract the target prompt parameters and encapsulate the target prompt parameters into the parameter data package with attribute labels.
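For illustration only, and not as part of the claim language, the state determination in claim 6 can be sketched as below. The claim says the edge complexity is computed from area and perimeter but gives no formula, so the isoperimetric ratio P^2 / (4*pi*A) used here is an assumption, as are the table keys and threshold values. The "and/or" condition is implemented as a logical or.

```python
import math
from typing import Dict, Tuple

# Hypothetical mapping table: (device, component) -> (ratio lower limit, complexity coefficient)
THRESHOLDS: Dict[Tuple[str, str], Tuple[float, float]] = {
    ("main_transformer", "flange"): (0.02, 3.0),
}


def edge_complexity(area: float, perimeter: float) -> float:
    """Assumed measure: P^2 / (4*pi*A), about 1.0 for a circle and larger for ragged shapes."""
    if area <= 0:
        return 0.0
    return perimeter ** 2 / (4 * math.pi * area)


def recognise_state(device: str, component: str,
                    defect_area: float, defect_perimeter: float,
                    device_area: float) -> str:
    ratio_limit, complexity_coeff = THRESHOLDS[(device, component)]
    # Area connectivity ratio: defect region area over whole-device area.
    ratio = defect_area / device_area if device_area > 0 else 0.0
    complexity = edge_complexity(defect_area, defect_perimeter)
    abnormal = ratio >= ratio_limit or complexity >= complexity_coeff
    return "abnormal" if abnormal else "normal"
```

Either criterion alone is enough to flag a defect here, so a small but very ragged mask is reported as abnormal even when its area ratio is below the limit.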

7. The open-vocabulary substation equipment segmentation and inspection system based on multimodal cue learning according to claim 6, characterized in that the closed-loop deployment module also includes: a knowledge loading unit, used to receive a new scene image stream and parse the scene attributes of the new scene image stream; and a matching invocation unit, used to load the target prompt parameters in the parameter data package into the memory space of a preset edge computing node when the scene attributes match the attribute labels of the parameter data package, so as to perform a zero-shot segmentation task, wherein the zero-shot segmentation task is completed by the fitting segmentation module calling the target prompt parameters; the matching invocation unit is also used to trigger a retraining instruction when the scene attributes do not match the attribute labels of the parameter data package.

8. The open-vocabulary substation equipment segmentation and inspection system based on multimodal cue learning according to claim 1, characterized in that the system also includes: a boundary filtering module, used to determine the semantic attributes of the on-site image data before it is input into the collaborative adaptation module as the real-time image stream; if the on-site image data contains target pixels with physical semantic features, the on-site image data is input into the collaborative adaptation module as the real-time image stream; otherwise, the on-site image data is intercepted and discarded.

9. The open-vocabulary substation equipment segmentation and inspection system based on multimodal cue learning according to claim 1, characterized in that the target scenario is a substation industrial scenario; the structured text includes at least one of the following: transformer equipment entity, insulator component entity, and description of equipment rupture condition; and the real-time image stream is acquired and transmitted by a preset edge computing node, which uses independently allocated video memory to run the target prompt parameters.