A small-sample power equipment image classification method based on multimodal contrastive learning

By employing a multimodal contrastive learning method, the problems of multimodal data fusion and small-sample fault identification in power equipment image classification are solved, achieving efficient and accurate power equipment fault diagnosis and supporting the digital transformation of the power grid.

CN122244565A · Pending · Publication Date: 2026-06-19 · STATE GRID TIANJIN ELECTRIC POWER COMPANY +1


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: STATE GRID TIANJIN ELECTRIC POWER COMPANY
Filing Date: 2026-05-20
Publication Date: 2026-06-19

Application Information


IPC: G06V10/764; G06V10/82; G06F18/241; G06F18/25; G06F18/15; G06N3/0455; G06N3/0895; G06N3/0495; G06N3/082

AI Tagging

Application Domain

Biological models




AI Technical Summary

Technical Problem

Existing technologies cannot effectively handle multimodal data fusion and small-sample fault classification of power equipment images, which hinders the digital transformation of the power grid. Furthermore, existing classification schemes overfit under small-sample conditions and have insufficient recognition capabilities.

Method used

A multimodal contrastive learning approach is adopted. It classifies power equipment images efficiently by combining synchronous acquisition and preprocessing of multimodal data, multi-stage contrastive learning, and model lightweighting with edge deployment, supported by time synchronization, feature fusion, and adaptive optimization.

Benefits of technology

It improves the fault detection rate and diagnostic accuracy, adapts to the needs of power grid on-site operation and maintenance, improves the accuracy of fault classification, reduces the false alarm rate, and meets the requirements of real-time performance and reliability.



Patent Text Reader

Abstract

This invention relates to the field of image processing technology and discloses a method for classifying small-sample power equipment images based on multimodal contrastive learning. The method includes the following steps: microsecond-level synchronous acquisition of multimodal data (images, infrared, and audio) is achieved through a hybrid time synchronization protocol of NTP and PTP; multimodal feature fusion is completed via a bidirectional frequency domain cross-attention mechanism; a three-stage training framework is employed, consisting of single-modal supervised contrastive learning, cross-modal self-supervised alignment, and multimodal fusion contrastive learning, combined with physical constraint pseudo-label optimization and adaptive regularization to complete small-sample model training; the model is then deployed to edge devices after lightweight compression; and finally, fault classification results are output through feature fusion and decision calibration. This invention solves the problem of small-sample fault classification in power equipment by employing technologies such as multimodal fusion, three-stage contrastive learning, physical constraints, and lightweight deployment, significantly improving fault detection rate and diagnostic accuracy, and adapting to the needs of power grid on-site operation and maintenance.


Description

Technical Field

[0001] This invention relates to the field of image processing technology, and more specifically, to a small-sample power equipment image classification method based on multimodal contrastive learning.

Background Technology

[0002] With the accelerated digitalization of power grid operations, numerous real-time images of field equipment are collected in scenarios such as on-site equipment maintenance and automated inspections. These images are then transmitted to a sample platform. However, the current classification of power equipment images relies primarily on manual or semi-automatic annotation by professional annotators. This results in highly subjective and inefficient classifications, requiring significant manpower for tasks such as categorizing equipment types and fault anomalies in power equipment images. This not only consumes considerable time and energy but also severely hinders the digital transformation of the power grid and increases operational costs for enterprises. Power equipment faults are characterized by low incidence rates, difficulty in obtaining labeled samples, and high costs, creating a typical small-sample learning scenario.

[0003] Existing technologies still suffer from several core shortcomings: Current multimodal image classification techniques can only handle bimodal data of images and text, failing to meet the fusion requirements of heterogeneous sensor data (visible light, infrared, and audio) from power field applications; existing contrastive learning and attention fusion mechanisms are mostly based on homologous signal decomposition, resulting in natural alignment and strong physical correlation between modalities. However, directly applying these mechanisms to heterogeneous trimodal power data with significant differences in acquisition principles, time scales, and feature spaces leads to severe cross-modal semantic gaps and hinders effective feature alignment. Existing classification schemes rely heavily on a large number of labeled samples to learn subtle differences in equipment fault features. Under extremely small sample conditions (only 5-20 labeled samples per class), they are prone to overfitting and severely lack the ability to identify low-probability fault categories in power equipment, failing to meet the needs of practical production applications. Therefore, there is an urgent need to provide a small-sample power equipment image classification method based on multimodal contrastive learning to address these problems.

Summary of the Invention

[0004] To address the aforementioned technical problems, this invention provides a small-sample power equipment image classification method based on multimodal contrastive learning. By employing multimodal fusion, three-stage contrastive learning, physical constraints, and lightweight deployment techniques, it solves the challenge of classifying small-sample faults in power equipment, significantly improving fault detection rate and diagnostic accuracy, and adapting to the needs of power grid on-site operation and maintenance.

[0005] To achieve the above objectives, the technical solution of the present invention is as follows:

[0006] This invention provides a small-sample image classification method for power equipment based on multimodal contrastive learning, comprising the following steps:

[0007] S1. Multimodal data synchronous acquisition and preprocessing: An acquisition system is built from a central server, edge acquisition nodes, multimodal sensors, and a time synchronization unit composed of an NTP server and a PTP hardware clock module. The hybrid NTP and PTP time synchronization protocol gives the acquisition system microsecond-level time synchronization. Image, infrared, and audio data are acquired, each modality is preprocessed separately, and multimodal feature fusion is completed through a bidirectional frequency domain cross-attention fusion mechanism.

[0008] S2. Multimodal contrastive learning model training: Sequentially perform single-modal supervised contrastive learning, cross-modal self-supervised alignment, and multimodal fusion contrastive learning, and combine modality adaptive data augmentation, pseudo-label generation and iterative optimization, and adaptive regularization to complete small sample training optimization;

[0009] S3. Model Lightweighting and Edge Deployment: Model compression is achieved through knowledge distillation, combined with depth and breadth pruning using the ESA algorithm and with model quantization. The model is then converted to a new format and deployed to edge devices. Real-time inference optimization is used to ensure inference efficiency.

[0010] S4. Classification prediction and decision calibration: The preprocessed multimodal features are input into the trained model. Preliminary classification results are obtained through feature fusion and small-sample inference. After decision calibration, the final power equipment fault classification results are output.

[0011] As a preferred embodiment of the present invention, S1 includes the following steps:

[0012] S11. Multimodal data synchronous acquisition: A data acquisition system consisting of a central server, edge acquisition nodes, and multimodal sensors is built. The time synchronization unit is equipped with an NTP server and a PTP hardware clock module. The NTP server and PTP hybrid time synchronization protocol is used to achieve microsecond-level time synchronization of the entire system. The time alignment of image, audio, and infrared data with different acquisition frequencies is completed through an adaptive time alignment algorithm. Data acquisition, timestamp processing, and structured storage are completed according to a standardized process.

[0013] S12. Data Preprocessing: Perform multi-dimensional standardized preprocessing on visible light images, infrared thermal imaging data, and audio signals to complete size normalization, noise reduction, and enhancement processing of the infrared image data; perform frame filtering, feature extraction, and dimensionality reduction on the audio signals to output preprocessed data that meets the model input requirements.

[0014] S13. Multimodal Feature Fusion: A three-tower encoder extracts the original features of the three modalities (visible light image, infrared image, and audio), and a linear projection layer unifies their dimensions. The initial fusion of the frequency domain features of each modality is completed through frequency domain feature extraction and the bidirectional frequency domain cross-attention fusion mechanism. The base fusion weight of each modality is calculated from the homoscedastic uncertainty loss of that modality's features. After optimization by a lightweight MLP, the final fusion weights are adjusted by the operating condition adaptive module, and the optimized multimodal fusion feature is output.

[0015] As a preferred embodiment of the present invention, S2 includes the following steps:

[0016] S21. Single-modal supervised contrastive learning augmentation: Modality-adaptive data augmentation is performed on image, infrared, and audio data respectively. Single-modal features are extracted using dedicated encoders for each modality. Each encoder is independently trained and iteratively updated based on an improved supervised contrastive loss function to improve the discriminative power of single-modal features.

[0017] S22. Cross-modal self-supervised alignment: The output features of each modal encoder are mapped to a unified shared feature space through a cross-modal alignment weight matrix. A cross-modal contrastive loss function is constructed using multimodal data from the same device as positive samples and multimodal data from different devices as negative samples. The cross-modal alignment weight matrix is optimized to achieve alignment of the multimodal feature space.

[0018] S23. Multimodal fusion contrastive learning: The spatially aligned multimodal features are fused with dynamic weights, the cosine similarity between fused features is calculated, a multimodal fusion contrastive loss function is constructed from the fused features, and the fusion weights of each modality are iteratively optimized to obtain the optimal multimodal fusion feature;

[0019] S24. Small Sample Training Optimization: Through modal adaptive data augmentation, high-confidence pseudo-label generation and iterative optimization with physical consistency verification, and adaptive regularization, training configuration is performed to expand effective training data and improve the training effect and generalization ability of the model under small sample conditions.

[0020] As a preferred embodiment of the present invention, S3 includes the following steps:

[0021] S31. Lightweight Model Design: A knowledge distillation technique using the reverse KL divergence strategy is employed to compress the number of parameters in a large model;

[0022] By combining the depth and breadth pruning techniques of the ESA algorithm to remove redundant filters in each convolutional layer, the number of model parameters is compressed to less than 5% of the original number of parameters.

[0023] The model is quantized using the GPTQ approximate second-order quantization technique, reducing the computational resource requirements.

[0024] S32. Edge Device Adaptation: Convert the PyTorch format model to ONNX format and deploy it to Jetson Xavier NX or substation robot edge devices via the OpenCV DNN module;

[0025] Mixed-precision quantization technology is used to dynamically adjust the quantization bit width, balancing inference accuracy and speed;

[0026] S33. Real-time inference optimization: A fault feature priority-aware real-time inference framework is adopted. The front-end of inference is optimized through lightweight fault type prediction, dynamic routing of feature processing, and dynamic pruning of modal features. Combined with the computing resource scheduling with a built-in fault feature caching mechanism, the real-time performance of the entire inference process is improved.
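The reverse KL divergence named in S31 can be illustrated with a small calculation. The following is a minimal sketch, assuming the term refers to KL(student || teacher), where the expectation is taken under the student distribution; the function names and the logits are illustrative and do not come from the patent.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student_logits, teacher_logits, temperature=1.0):
    """KL(student || teacher): the expectation is taken under the student."""
    q = softmax(student_logits, temperature)  # student distribution
    p = softmax(teacher_logits, temperature)  # teacher distribution
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)

teacher = [4.0, 1.0, 0.5]           # hypothetical teacher logits
print(reverse_kl(teacher, teacher))          # identical distributions
print(reverse_kl([1.0, 4.0, 0.5], teacher))  # student disagrees with teacher
```

Compared with the forward direction, this form penalizes the student for placing probability mass where the teacher places little, which is one plausible reason to prefer it for a small student model.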

[0027] As a preferred embodiment of the present invention, S4 includes the following steps:

[0028] S41. Input the preprocessed image, infrared and audio modal features into the model, and dynamically allocate the weights of each modal feature according to the differences in equipment fault types through a bidirectional frequency domain cross-attention mechanism to complete feature fusion.

[0029] S42. Based on small-sample inference logic, a fault category prototype library is constructed from the support set. The cosine similarity between the query set features and the prototype library is calculated, and the weights of each modality are automatically balanced through the homoscedastic uncertainty loss, to complete the preliminary fault matching.

[0030] S43. Based on historical fault data, the classification thresholds for each category are adaptively adjusted according to the equipment type and fault characteristics. At the same time, temperature scaling technology is introduced to improve prediction reliability. The set probability threshold is used as the fault judgment standard to complete dynamic calibration of decision-making and realize efficient classification and reliable judgment of equipment faults.
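Steps S42 and S43 can be sketched as prototype matching followed by a calibrated decision. This is a minimal illustration under several assumptions: prototypes are taken as per-class means of support features, temperature scaling is applied by dividing the cosine similarities by a temperature before the softmax, and the threshold value 0.6, the feature vectors, and the class names are hypothetical.

```python
import math

def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def build_prototypes(support):
    """support maps a class name to its list of support-set feature vectors."""
    return {label: mean_vector(feats) for label, feats in support.items()}

def classify(query, prototypes, temperature=0.1, threshold=0.6):
    """Return (label, probability); the label is 'undetermined' below the threshold."""
    labels = list(prototypes)
    scaled = [cosine(query, prototypes[l]) / temperature for l in labels]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(labels)), key=probs.__getitem__)
    label = labels[best] if probs[best] >= threshold else "undetermined"
    return label, probs[best]

support = {
    "overheating": [[1.0, 0.1], [0.9, 0.2]],
    "normal": [[0.1, 1.0], [0.2, 0.9]],
}
prototypes = build_prototypes(support)
print(classify([0.95, 0.15], prototypes))  # close to the overheating prototype
print(classify([1.0, 1.0], prototypes))    # equidistant query
```

In the method described above the threshold would be adapted per category from historical fault data; here it is a single fixed constant for brevity.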

[0031] As a preferred embodiment of the present invention, the microsecond-level time synchronization is implemented with a hybrid NTP and PTP time synchronization protocol that generates microsecond-level precise timestamps. The timestamp is calculated from three quantities: the PTP-synchronized system time, the network latency compensation value, and the pre-calibrated inherent delay of the sensor. The central server broadcasts the PTP timestamp to all edge nodes. After receiving it, each edge node calculates and compensates for the network delay and, combining this with the pre-calibrated inherent delay of the sensor, generates a microsecond-level accurate timestamp through the dedicated timestamp calculation formula, achieving a time synchronization accuracy of ±10 μs for the entire system.

[0032] The specific implementation of the adaptive time alignment algorithm is as follows: the alignment points of each modality's data are calculated from the image acquisition frame rate; feature vectors matching the frame rate are generated from the audio data by interpolation and from the infrared data by inter-frame difference, completing the precise alignment of all modalities on the time axis.
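The alignment idea can be sketched as resampling a slower stream onto the image frame timestamps. This is a minimal illustration, assuming linear interpolation as a stand-in for the per-modality procedures named above, a 30 fps image stream and a 15 fps infrared stream (the rates cited in the beneficial effects), and a made-up scalar temperature series.

```python
from bisect import bisect_right

def alignment_points(frame_rate_hz, n_frames):
    """Timestamps (seconds) of image frames, used as the common time axis."""
    return [i / frame_rate_hz for i in range(n_frames)]

def interpolate_to(points, src_times, src_values):
    """Linearly interpolate a scalar stream onto the alignment points."""
    out = []
    for t in points:
        j = bisect_right(src_times, t)
        if j == 0:                      # before the first source sample
            out.append(src_values[0])
        elif j == len(src_times):       # after the last source sample
            out.append(src_values[-1])
        else:
            t0, t1 = src_times[j - 1], src_times[j]
            w = (t - t0) / (t1 - t0)
            out.append((1 - w) * src_values[j - 1] + w * src_values[j])
    return out

image_times = alignment_points(30, 30)     # one second of 30 fps frames
ir_times = [i / 15 for i in range(15)]      # one second of 15 fps frames
ir_temps = [40.0 + i for i in range(15)]    # hypothetical hotspot temperatures
aligned = interpolate_to(image_times, ir_times, ir_temps)
print(len(aligned), aligned[:3])
```

Each image frame then has an infrared value at the same timestamp, which is the property the downstream fusion step relies on.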

[0033] After completing the time alignment of image, audio, and infrared data at different acquisition frequencies, the data is encapsulated into structured data packets containing a unique data ID, modality type, microsecond-level acquisition timestamp, device ID, and acquisition location coordinates, following the process of system initialization, data acquisition, timestamp processing, time alignment, and storage, ensuring that the acquisition latency is ≤200ms.

[0034] As a preferred embodiment of the present invention, the image and infrared data preprocessing includes: performing standardized preprocessing on the acquired visible light images and infrared thermal imaging data, sequentially performing four operations: size normalization and pixel normalization, infrared image denoising and temperature correction, frequency domain anomaly enhancement, and multi-scale feature encoding preparation. Specifically, all images are uniformly scaled to 224×224 pixels and channel-level normalization is completed. Adaptive denoising of infrared images affected by environmental interference is performed using dual-tree complex wavelet transform. Anomaly spectral components are enhanced in the Fourier domain through a frequency domain mask enhancement strategy. Finally, the preprocessed image data is output.

[0035] The audio data preprocessing employs pre-emphasis. The signal is divided into frames with a frame length of 32 ms and a frame shift of 16 ms, and a Hamming window is applied to each frame. Mel spectrum or GFCC features are extracted, and key feature dimensions are selected by SVM-RFE, retaining the top 20 dimensions to reduce computational complexity.
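The framing stage of the audio preprocessing can be sketched in a few lines. This is a minimal illustration, assuming a 48 kHz sampling rate (the rate cited in the beneficial effects) and a pre-emphasis coefficient of 0.97, which is a common default and is not stated in the source; feature extraction and SVM-RFE selection are omitted.

```python
import math

def pre_emphasis(signal, coeff=0.97):
    """First-order high-pass filter: y[n] = x[n] - coeff * x[n-1]."""
    return [signal[0]] + [signal[i] - coeff * signal[i - 1]
                          for i in range(1, len(signal))]

def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def frame_signal(signal, sample_rate, frame_ms=32, shift_ms=16):
    """Split a signal into overlapping, Hamming-windowed frames."""
    frame_len = sample_rate * frame_ms // 1000
    hop = sample_rate * shift_ms // 1000
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames

sample_rate = 48000
# One second of a 1 kHz test tone stands in for recorded equipment audio.
tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(sample_rate)]
frames = frame_signal(pre_emphasis(tone), sample_rate)
print(len(frames), len(frames[0]))
```

At 48 kHz, a 32 ms frame holds 1536 samples and a 16 ms shift advances by 768 samples, so adjacent frames overlap by half.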

[0036] As a preferred embodiment of the present invention, the specific implementation process of the bidirectional frequency domain cross-attention fusion mechanism is as follows:

[0037] Multimodal feature extraction and dimensionality unification: Three encoders are used to extract the native features of visible light images, infrared images, and audio respectively. The native features of all three modalities are uniformly mapped to 768 dimensions through a linear projection layer.

[0038] Frequency domain feature extraction and bidirectional cross-attention fusion: Two-dimensional Fourier transform is performed on the dimensional unified projection features of each modality to convert them into frequency domain features; through the bidirectional frequency domain cross-attention fusion mechanism, mutual attention, dynamic interaction and preliminary fusion of frequency domain features of each modality are realized;

[0039] Dynamic weight adjustment based on homoscedastic uncertainty loss: calculate the variance and mean of each modal feature to obtain the uncertainty parameter, determine the basic weight of each modal fusion, and perform weighted integration of the preliminary fusion features to obtain the basic fusion features;

[0040] Feature optimization after fusion: The basic fusion features are input into a lightweight multilayer perceptron (MLP). The basic fusion features are processed and optimized through linear transformation of the MLP and nonlinear expression enhancement of the activation function, and finally the optimized fusion features of 768 dimensions are output.

[0041] Secondary adjustment of operating condition adaptive modal weights: The load current, ambient temperature, and historical temperature rise rate of the equipment are acquired in real time. Based on these operating condition parameters, an operating condition-modality correlation mapping function is constructed to obtain the operating condition modulation factor of each modality. The modulation factor is multiplied by the base fusion weight calculated from the homoscedastic uncertainty loss to obtain the final fusion weight of each modality. The optimized fusion features are then subjected to secondary weight adjustment.
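The two-stage weighting described above can be sketched numerically. This is a minimal illustration, assuming the base weight of each modality is proportional to the inverse of its uncertainty (variance) parameter and that the weights are renormalized after multiplying by the modulation factors; both choices, along with all numeric values, are assumptions rather than details from the patent.

```python
def base_weights(variances):
    """Inverse-variance weights, normalized to sum to 1."""
    inv = {m: 1.0 / v for m, v in variances.items()}
    total = sum(inv.values())
    return {m: x / total for m, x in inv.items()}

def final_weights(variances, modulation):
    """Multiply base weights by operating-condition factors, then renormalize."""
    base = base_weights(variances)
    raw = {m: base[m] * modulation[m] for m in base}
    total = sum(raw.values())
    return {m: x / total for m, x in raw.items()}

def fuse(features, weights):
    """Weighted sum of per-modality feature vectors of equal length."""
    dim = len(next(iter(features.values())))
    return [sum(weights[m] * features[m][i] for m in features) for i in range(dim)]

variances = {"visible": 0.5, "infrared": 0.25, "audio": 1.0}   # hypothetical
modulation = {"visible": 0.5, "infrared": 1.5, "audio": 1.0}   # e.g. low light
weights = final_weights(variances, modulation)
features = {"visible": [1.0, 0.0], "infrared": [0.0, 1.0], "audio": [0.5, 0.5]}
print(weights)
print(fuse(features, weights))
```

With these values the infrared modality receives the largest weight twice over: it has the lowest uncertainty, and the hypothetical low-light condition boosts it further.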

[0042] As a preferred embodiment of the present invention, the improved supervised contrastive loss function sets the batch size to 256 and the temperature parameter to 0.1, only treats samples with the same label as positive samples, constructs a negative sample set through an implicit method, and calculates the loss value based on cosine similarity; the cross-modal contrastive loss function uses multimodal data from the same device as positive samples and multimodal data from different devices as negative samples, and the optimization objective is to maximize the cross-modal feature similarity of the same device and minimize the cross-modal feature similarity of different devices, with a training cycle of 30 epochs and a batch size of 128.
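The single-modal loss can be illustrated with the standard supervised contrastive loss, used here as a stand-in because the source does not spell out what the "improved" variant changes. The sketch assumes L2-normalized features, the temperature of 0.1 stated above, and a tiny hand-made batch instead of the batch size of 256; negatives are implicit in the sense that every other sample in the batch enters the denominator.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def supcon_loss(features, labels, temperature=0.1):
    """Standard supervised contrastive loss averaged over anchors with positives."""
    z = [normalize(f) for f in features]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without same-label samples contributes nothing
        # Cosine similarity to every other sample, scaled by the temperature.
        logits = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
                  for j in range(n) if j != i}
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in logits.values()))
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0

features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(supcon_loss(features, [0, 0, 1, 1]))  # labels agree with the geometry
print(supcon_loss(features, [0, 1, 0, 1]))  # labels contradict the geometry
```

The loss is lower when same-label samples are already close together, which is the property training is meant to drive toward.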

[0043] As a preferred embodiment of the present invention, the multimodal fusion contrastive learning specifically includes the following steps:

[0044] Using spatially aligned image, infrared, and audio modal features as input, dynamic fusion weights are assigned to each modality to generate fusion features;

[0045] Calculate the cosine similarity between fused features, and construct a multimodal fusion contrastive loss function with fused features from the same device as positive samples and fused features from different devices as negative samples;

[0046] The fusion weights are iteratively optimized with the goal of minimizing the fusion contrastive loss to obtain the optimal multimodal fusion features.

[0047] As a preferred embodiment of the present invention, the modality-adaptive data augmentation includes:

[0048] The image modality employs random cropping, color jittering, and small-angle rotation;

[0049] The infrared mode employs temperature scaling, Gaussian noise, and horizontal flipping.

[0050] The audio modality employs time stretching, reverb addition, and pitch adjustment;

[0051] The pseudo-label generation and iterative optimization include:

[0052] Use the trained model to generate pseudo-labeled samples with a confidence level greater than 0.85;

[0053] The consistency verification module for operational status verifies the physical and logical consistency between the multimodal characteristics of the pseudo-label samples and the operating parameters, thus filtering out contradictory samples.

[0054] The total loss function is constructed by fusing supervised contrastive loss and pseudo-label cross-entropy loss, and pseudo-label accuracy is improved through up to 5 iterations of training.
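The pseudo-label selection can be sketched as two filters applied in sequence. This is a minimal illustration that uses the 0.85 confidence threshold stated above; the sample fields, the rule function, and the 80 °C cutoff are hypothetical placeholders for the consistency verification module, whose actual rules the source does not give.

```python
def select_pseudo_labels(predictions, is_physically_consistent, min_confidence=0.85):
    """Keep predictions that are confident and pass the physical check."""
    selected = []
    for sample in predictions:
        if sample["confidence"] <= min_confidence:
            continue  # below the confidence threshold
        if not is_physically_consistent(sample):
            continue  # contradicts the operating parameters
        selected.append(sample)
    return selected

def overheating_rule(sample):
    """Hypothetical rule: an 'overheating' label needs a hot measured hotspot."""
    if sample["label"] == "overheating":
        return sample["hotspot_temp_c"] >= 80.0
    return True

predictions = [
    {"id": 1, "label": "overheating", "confidence": 0.93, "hotspot_temp_c": 96.0},
    {"id": 2, "label": "overheating", "confidence": 0.91, "hotspot_temp_c": 35.0},
    {"id": 3, "label": "line_break", "confidence": 0.70, "hotspot_temp_c": 30.0},
]
kept = select_pseudo_labels(predictions, overheating_rule)
print([s["id"] for s in kept])
```

In the described procedure this selection would be repeated for up to 5 training rounds, with the retrained model producing the next round of predictions.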

[0055] The adaptive regularization includes:

[0056] Regularization terms are constructed based on the Frobenius norm of the weight matrix;

[0057] The regularization strength adaptively increases as the number of samples in each class decreases to suppress overfitting on small samples.
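The adaptive regularization can be sketched as a Frobenius-norm penalty whose coefficient grows as the class size shrinks. The inverse-proportional schedule, the base strength of 1e-3, and the reference size of 20 samples are assumptions chosen for illustration; the source states only the direction of the dependence.

```python
def frobenius_sq(matrix):
    """Squared Frobenius norm of a matrix given as a list of rows."""
    return sum(x * x for row in matrix for x in row)

def reg_strength(samples_per_class, base=1e-3, reference=20):
    """Assumed schedule: strength is inversely proportional to class size."""
    return base * reference / max(samples_per_class, 1)

def reg_term(matrix, samples_per_class):
    """Regularization term added to the training loss."""
    return reg_strength(samples_per_class) * frobenius_sq(matrix)

weights = [[0.5, -0.5], [1.0, 0.0]]   # hypothetical weight matrix
for n in (20, 10, 5):
    print(n, reg_strength(n), reg_term(weights, n))
```

Under this schedule a 5-sample class is regularized four times as strongly as a 20-sample class.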

[0058] As a preferred embodiment of the present invention, the real-time inference optimization specifically includes the following steps:

[0059] S331: Fault Feature Priority Awareness: Determines the fault type and confidence level through lightweight fault type prediction, and realizes dynamic routing of feature processing and dynamic pruning of modal features based on the prediction results;

[0060] S332: Computational resource scheduling: Real-time perception of edge device hardware status, maximizing inference efficiency through resource scheduling, computational pipeline reorganization, and fault feature caching;

[0061] S333: Physical consistency guarantee: Embed physical constraints of power equipment during the inference optimization process and establish an error feedback closed loop to ensure that the inference results conform to physical laws.

[0062] As a preferred embodiment of the present invention, the feature fusion in S41 specifically involves: using a bidirectional frequency domain cross-attention fusion mechanism to fuse the preprocessed image, infrared, and audio modal features, adaptively allocating the weights of each modal feature according to the fault type corresponding to the data to be detected, and generating fused features for subsequent fault classification.

[0063] As a preferred embodiment of the present invention, in S331, the lightweight fault type prediction specifically refers to:

[0064] A lightweight fault type prediction module with less than 50K parameters and based on MobileNetV2 is integrated at the front end of the inference process. This module takes the input data as input, outputs the possible fault types and their confidence levels within 15ms, and consumes less than 5% of the computing resources, providing a basis for decision-making for subsequent inference optimization.

[0065] The feature processing dynamic routing specifically involves: allocating computing resources based on the fault type prediction results, and adopting dedicated routing strategies for different fault types.

[0066] When transformer overheating is predicted, 85% of the computing resources are prioritized for infrared features, 10% are used for simplified image processing, and the remaining 5% is assigned to audio, whose processing is skipped.

[0067] When a line break is predicted, 80% of the computing resources are prioritized for image features, 15% are used for simplified infrared processing, and the remaining 5% is assigned to audio, whose processing is skipped.

[0068] When a motor bearing failure is predicted, 75% of the computing resources are prioritized for audio features, 20% are used for simplified infrared processing, and 5% are used for simplified image processing.
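The routing rules above amount to a lookup table from predicted fault type to per-modality resource shares. The sketch below encodes the stated percentages; the fault type keys, the millisecond budget, and the equal-split fallback for unlisted fault types are assumptions.

```python
# Resource shares per predicted fault type, taken from the percentages above.
ROUTING = {
    "transformer_overheating": {"infrared": 0.85, "image": 0.10, "audio": 0.05},
    "line_break": {"image": 0.80, "infrared": 0.15, "audio": 0.05},
    "motor_bearing_failure": {"audio": 0.75, "infrared": 0.20, "image": 0.05},
}

def allocate(fault_type, budget_ms):
    """Split a compute budget across modalities for the predicted fault type."""
    shares = ROUTING.get(fault_type)
    if shares is None:
        # Assumption: unknown fault types fall back to an equal split.
        shares = {m: 1.0 / 3 for m in ("image", "infrared", "audio")}
    return {m: s * budget_ms for m, s in shares.items()}

print(allocate("transformer_overheating", 100.0))
print(allocate("unknown_fault", 90.0))
```

Keeping the table as data rather than branching logic makes it straightforward to add a routing strategy for another fault type.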

[0069] The dynamic pruning of modal features specifically involves: cropping each modal feature map based on fault type and physical characteristics while ensuring that key information is not lost, specifically including:

[0070] Infrared features retain temperature hotspot areas for overheating faults, and the feature map size is cropped from 224×224 to 128×128.

[0071] The image features are focused on the fracture area of the line fault, and the feature map size is cropped from 224×224 to 160×160.

[0072] The audio features are focused on a specific frequency range for bearing failures, and the feature map length is cropped from 128 to 64.

[0073] As a preferred embodiment of the present invention, the computing resource scheduling in S332 includes a resource-aware scheduler, which monitors the CPU load and network latency of edge devices in real time and dynamically adjusts the inference strategy:

[0074] When CPU load is less than 60%, enable the full inference process;

[0075] When the CPU load is ≥60%, it automatically switches to lightweight inference mode, only handling the two fault types with the highest confidence.

[0076] When network latency is greater than 50ms, the feature resolution is automatically reduced to 160×160.
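The three scheduling rules above can be sketched as a small decision function. The thresholds (60% CPU load, 50 ms latency, 160×160 reduced resolution, two fault types in lightweight mode) come from the text; the default feature size of 224 is taken from the preprocessing step, and the `InferencePlan` structure is a hypothetical container.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InferencePlan:
    mode: str                        # "full" or "lightweight"
    max_fault_types: Optional[int]   # None means no limit
    feature_size: int                # side length of the square feature input

def plan_inference(cpu_load, network_latency_ms):
    """Choose an inference strategy from the current edge device status."""
    if cpu_load < 0.60:
        mode, max_types = "full", None
    else:
        # Lightweight mode handles only the two most confident fault types.
        mode, max_types = "lightweight", 2
    feature_size = 160 if network_latency_ms > 50 else 224
    return InferencePlan(mode, max_types, feature_size)

print(plan_inference(cpu_load=0.35, network_latency_ms=12))
print(plan_inference(cpu_load=0.72, network_latency_ms=80))
```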

[0077] The computing resource scheduling includes lossless computing pipeline reorganization, specifically: reconstructing the inference computing pipeline, executing fault type prediction and feature extraction in parallel, optimizing the execution order of high-frequency computing operations, reducing pipeline blockage through computing dependency analysis, and achieving lossless acceleration of inference;

[0078] The computing resource scheduling includes a fault feature caching mechanism, specifically: establishing a fault feature cache pool to store recently detected key fault features; when the same or a similar fault is detected, that is, when the cosine similarity between the fault features is >0.95, the features are obtained directly from the cache and part of the calculation is skipped, with a cache hit rate of ≥40%.
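The caching mechanism can be sketched as a bounded pool searched by cosine similarity. This is a minimal illustration that uses the 0.95 threshold stated above; the capacity of 32 entries, the oldest-first eviction, and the class and method names are assumptions.

```python
import math
from collections import deque

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class FaultFeatureCache:
    def __init__(self, threshold=0.95, capacity=32):
        self.threshold = threshold
        self.entries = deque(maxlen=capacity)  # oldest entries drop out first
        self.hits = 0
        self.lookups = 0

    def store(self, feature, result):
        self.entries.append((feature, result))

    def lookup(self, feature):
        """Return a cached result for a sufficiently similar feature, else None."""
        self.lookups += 1
        for cached, result in self.entries:
            if cosine(feature, cached) > self.threshold:
                self.hits += 1
                return result
        return None

    def hit_rate(self):
        return self.hits / self.lookups if self.lookups else 0.0

cache = FaultFeatureCache()
cache.store([1.0, 0.0], "overheating")
print(cache.lookup([0.99, 0.05]))  # similar feature
print(cache.lookup([0.0, 1.0]))    # dissimilar feature
print(cache.hit_rate())
```

A linear scan is adequate for a pool this small; a larger pool would call for an approximate nearest-neighbor index.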

[0079] As a preferred embodiment of the present invention, the embedding of power equipment-specific physical constraints in each stage of inference optimization in S333 specifically refers to:

[0080] During infrared feature processing, thermodynamic formulas are applied to constrain the temperature change rate to <5 °C/min;

[0081] During image feature processing, device structure constraints are applied to ensure that the fracture area is within the device outline;

[0082] When processing audio features, acoustic propagation model constraints are applied, limiting the frequency range to 500-5000Hz;

[0083] The error feedback closed loop specifically involves: establishing an error feedback mechanism between the diagnostic results and the actual state of the equipment; automatically adjusting the feature priority weights when the diagnostic results do not match the actual state of the equipment; and continuously optimizing the feature priority strategy through learning to make the inference process conform to physical reality.

[0084] When the diagnosis indicates that the transformer is overheating but the actual temperature of the equipment is normal, the weight of the image features is automatically increased.
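Two of the physical constraints above lend themselves to simple checks. The sketch below uses the stated limits (temperature change rate below 5 °C/min, audio band of 500 to 5000 Hz); treating the constraints as validity checks on measured data, as well as the function names and sample values, are assumptions about how they might be applied.

```python
def temperature_rate_ok(temps_c, interval_s, max_rate_c_per_min=5.0):
    """True if every consecutive temperature change stays below the limit."""
    for prev, cur in zip(temps_c, temps_c[1:]):
        rate = abs(cur - prev) / interval_s * 60.0
        if rate >= max_rate_c_per_min:
            return False
    return True

def band_limit(spectrum, low_hz=500.0, high_hz=5000.0):
    """Keep only (frequency_hz, magnitude) pairs inside the allowed band."""
    return [(f, m) for f, m in spectrum if low_hz <= f <= high_hz]

print(temperature_rate_ok([40.0, 40.5, 41.0], interval_s=10))  # 3 °C/min steps
print(temperature_rate_ok([40.0, 42.0], interval_s=10))        # 12 °C/min step
print(band_limit([(120.0, 0.9), (800.0, 0.4), (6000.0, 0.2)]))
```

A failed rate check would be one possible trigger for the error feedback loop described above, since it signals that the infrared reading is physically implausible.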

[0085] The present invention also provides a small-sample power equipment image classification device based on multimodal contrastive learning, comprising:

[0086] The multimodal data acquisition and preprocessing module is used to build an acquisition system through a central server, edge acquisition nodes and multimodal sensors. Relying on the time synchronization unit composed of NTP server and PTP hardware clock module, it adopts a hybrid time synchronization protocol of NTP and PTP to achieve microsecond-level time synchronization of the acquisition system. It completes the synchronous acquisition of three types of data: visible light image, infrared thermal imaging and audio of power equipment. It performs modality-specific preprocessing on each type of data and completes multimodal feature fusion through a bidirectional frequency domain cross-attention fusion mechanism.

[0087] The multimodal contrastive learning model training module is connected to the multimodal data acquisition and preprocessing module. It is used to sequentially perform single-modal supervised contrastive learning, cross-modal self-supervised alignment, and multimodal fusion contrastive learning to complete the basic training of the classification model. It also combines modality adaptive data augmentation, pseudo-label generation and iterative optimization, and adaptive regularization mechanism to complete the training optimization of the model in extreme small sample scenarios.

[0088] The model lightweighting and edge deployment module is connected to the multimodal contrastive learning model training module. It is used to complete model lightweighting and compression through knowledge distillation, joint depth and breadth pruning with the ESA algorithm, and model quantization technology. The compressed model is then converted to a new format and deployed to the power inspection edge device. At the same time, a real-time inference optimization strategy is adopted to ensure the inference efficiency of the model at the edge.

[0089] The classification prediction and decision calibration module is connected to the multimodal data acquisition and preprocessing module and the model lightweighting and edge deployment module, respectively. It is used to input the preprocessed multimodal features into the trained model, obtain the preliminary classification result of power equipment faults through feature fusion and small-sample inference, and output the final power equipment fault classification result after decision calibration.

[0090] The beneficial technical effects of this invention are:

[0091] Employing a hybrid NTP and PTP time synchronization protocol, combined with a dedicated timestamp calculation and compensation mechanism, the system achieves a microsecond-level time synchronization accuracy of ±10μs, far exceeding the industry average of ±100μs. Through an adaptive time alignment algorithm, it achieves precise alignment of 30fps image, 15fps infrared, and 48kHz audio data on the time axis, with data acquisition and processing latency controlled within 200ms. This ensures the spatiotemporal consistency of multimodal data from the source, laying a reliable data foundation for subsequent cross-modal learning and feature fusion.

[0092] A three-stage progressive contrastive learning framework is proposed, consisting of single-modal supervised contrastive learning, cross-modal self-supervised alignment, and multimodal fusion contrastive learning. This framework effectively decouples intramodal discriminative learning from intermodal semantic alignment tasks, overcoming the instability of end-to-end training under small sample conditions. With only 10 labeled samples per class, the fault classification accuracy reaches 89.7%, a 15.2% improvement over the dual-modal baseline scheme. In the extreme small sample scenario of 5-shot, the fault classification F1-score is improved by 7% compared to the traditional fixed-sequence training scheme.

[0093] Bi-FCFM, a bidirectional frequency domain cross-attention fusion mechanism, is designed. It calculates the basic weights for modality fusion from the homoscedastic uncertainty loss. At the same time, it constructs a mapping function based on operating parameters such as equipment load current, ambient temperature, and historical temperature rise rate to complete the secondary dynamic adjustment of the fusion weights, enabling the model to automatically focus on the most reliable modality under different operating conditions. This mechanism reduces the model's fault false alarm rate by 37% and improves the fault detection accuracy from 65.2% to 74.9% under extreme operating conditions such as high temperature, high humidity, and low illumination.

[0094] A pseudo-label self-optimization mechanism with physical consistency constraints was designed. By verifying the correlation between equipment states and filtering with physical rules such as thermodynamics and acoustics, only pseudo-label samples with a confidence level > 0.85 and conforming to the operating rules of power equipment are retained. Combined with up to 5 iterations of optimization, the pseudo-label accuracy is improved from 82% to 94%. In the 10-shot scenario, the model false alarm rate is reduced by 38.7%, which effectively suppresses the accumulation of errors in small sample training.
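The filtering rule described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `select_pseudo_labels` is hypothetical, and the physical-consistency check is assumed to be supplied by the caller as a boolean flag per sample. Only the 0.85 confidence threshold comes from the text.

```python
import numpy as np

CONFIDENCE_THRESHOLD = 0.85  # threshold stated in the text


def select_pseudo_labels(probs, physically_consistent,
                         threshold=CONFIDENCE_THRESHOLD):
    """Keep samples that are confident AND pass the physical-rule check.

    probs: (n_samples, n_classes) predicted class probabilities.
    physically_consistent: (n_samples,) booleans from a rule check that is
        assumed to exist elsewhere (thermodynamic / acoustic rules).
    Returns the kept sample indices and their pseudo-labels.
    """
    probs = np.asarray(probs, dtype=float)
    labels = probs.argmax(axis=1)
    confidence = probs.max(axis=1)
    keep = (confidence > threshold) & np.asarray(physically_consistent, dtype=bool)
    return np.flatnonzero(keep), labels[keep]


probs = [[0.90, 0.10], [0.60, 0.40], [0.05, 0.95], [0.86, 0.14]]
consistent = [True, True, False, True]
kept_idx, kept_labels = select_pseudo_labels(probs, consistent)
```

In an iterative scheme, this selection would be repeated after each retraining round, up to the limit of 5 rounds mentioned in the text.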

[0095] By employing knowledge distillation with a reverse KL divergence strategy and combining joint depth and width pruning with the ESA algorithm, the number of model parameters is compressed to less than 5% of the original number. Combined with GPTQ approximate second-order quantization and mixed-precision quantization, adaptation to edge devices such as Jetson Xavier NX and substation robots is achieved. Through a fault feature priority-aware real-time inference framework, combined with lightweight fault prediction, dynamic feature routing and pruning, intelligent resource scheduling, and a fault feature caching mechanism, a cache hit rate of ≥40% is achieved, enabling millisecond-level fault diagnosis while ensuring classification accuracy and meeting the real-time requirements of power inspection sites. Simultaneously, dedicated physical constraints and error feedback loops for power equipment are embedded throughout the entire inference process to ensure that the inference results conform to the physical operating laws of the equipment, further improving the reliability of the diagnostic results.
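As an illustration of the distillation objective, the sketch below computes a reverse KL divergence, that is, KL(student ‖ teacher), from raw logits. It assumes that the "inverse KL divergence" named in the text is the reverse KL commonly used in distillation. The function names and the temperature argument are hypothetical.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    return logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))


def reverse_kl(student_logits, teacher_logits, temperature=1.0):
    """Mean KL(student || teacher) over a batch of logit vectors."""
    log_q = log_softmax(np.asarray(student_logits, dtype=float) / temperature)
    log_p = log_softmax(np.asarray(teacher_logits, dtype=float) / temperature)
    return float(np.mean(np.sum(np.exp(log_q) * (log_q - log_p), axis=-1)))


teacher = [[2.0, 0.0, -1.0]]
student = [[0.0, 2.0, -1.0]]
loss = reverse_kl(student, teacher)
```

Because the expectation is taken under the student distribution, this form penalizes the student for placing probability mass where the teacher assigns little.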

[0096] This solution deeply integrates trimodal data fusion, three-stage progressive contrastive learning, adaptive weight adjustment based on operating conditions, physical consistency constraints, and lightweight edge deployment. These technologies work together and reinforce each other, achieving unexpected technical results that far exceed the sum of the effects of applying each technology individually. In tests on real power grid field data, this solution improved the detection rate of low-probability faults in power equipment, such as minor oil leakage in bushings and micro-cracks in insulators, by 2.3 times. This provides reliable technical support for intelligent operation and maintenance and fault early warning of power equipment.

Description of the Drawings

[0097] Figure 1 is a schematic diagram of the overall process of the present invention.

[0098] Figure 2 is a flowchart of the multimodal feature fusion technique of the present invention.

[0099] Figure 3 is a flowchart of the single-modal supervised contrastive learning technique of the present invention.

[0100] Figure 4 is a flowchart of the cross-modal self-supervised alignment technique of the present invention.

[0101] Figure 5 is a flowchart of the multimodal fusion contrastive learning technique of the present invention.

[0102] Figure 6 is a flowchart of the small-sample optimization technique of the present invention.

Detailed Implementation

[0103] In order to better understand the technical means of the present invention and to implement it in accordance with the contents of the specification, the specific embodiments of the present invention will be further described in detail below with reference to the accompanying drawings and examples. The following examples are used to illustrate the present invention, but are not intended to limit the scope of the present invention.

[0104] With reference to Figures 1-6, the present invention provides the following embodiments:

[0105] Example 1:

[0106] A few-sample power equipment image classification method based on multimodal contrastive learning includes the following steps:

[0107] S1. Multimodal data synchronous acquisition and preprocessing: An acquisition system is constructed from a central server, edge acquisition nodes, multimodal sensors, and a time synchronization unit composed of an NTP server and a PTP hardware clock module. The acquisition system achieves microsecond-level time synchronization by using a hybrid NTP and PTP time synchronization protocol. Image, infrared, and audio data are acquired, and modality-specific preprocessing is performed on each. Multimodal feature fusion is completed through the bidirectional frequency domain cross-attention fusion mechanism Bi-FCFM.

[0108] S2. Multimodal contrastive learning model training: Sequentially perform single-modal supervised contrastive learning, cross-modal self-supervised alignment, and multimodal fusion contrastive learning, and combine modality adaptive data augmentation, pseudo-label generation and iterative optimization, and adaptive regularization to complete small sample training optimization;

[0109] S3. Model lightweighting and edge deployment: Model compression is achieved through knowledge distillation, joint depth and width pruning combined with the ESA algorithm, and model quantization. The model is then converted to a new format and deployed to edge devices. Real-time inference optimization is used to ensure inference efficiency.

[0110] S4. Classification prediction and decision calibration: The preprocessed multimodal features are input into the trained model. Preliminary classification results are obtained through feature fusion and few-sample inference. After decision calibration, the final power equipment fault classification results are output.

[0111] Furthermore, S1 includes the following steps:

[0112] S11. Multimodal data synchronous acquisition: A data acquisition system consisting of a central server, edge acquisition nodes, and multimodal sensors is built. The time synchronization unit is equipped with an NTP server and a PTP hardware clock module. The system adopts a hybrid time synchronization protocol of NTP server and PTP (IEEE1588) to achieve microsecond-level time synchronization. The time alignment of image, audio, and infrared data with different acquisition frequencies is completed through an adaptive time alignment algorithm. Data acquisition, timestamp processing, and structured storage are completed according to a standardized process.

[0113] S12. Data preprocessing: Perform multi-dimensional standardized preprocessing on the visible light images, infrared thermal imaging data, and audio signals. Size normalization, noise reduction, and enhancement processing are applied to the image and infrared data; framing and windowing, feature extraction, and dimensionality reduction are applied to the audio signals. The output is preprocessed data that meets the model input requirements.

[0114] S13. Multimodal Feature Fusion: A three-tower encoder is used to extract the original features of the visible light image, infrared image, and audio three modalities respectively, and the dimension is unified by a linear projection layer. The initial fusion of frequency domain features of each modality is completed based on frequency domain feature extraction and bidirectional frequency domain cross-attention fusion mechanism Bi-FCFM. The basic fusion weight of each modality is calculated by combining the homoscedasticity uncertainty loss of each modality feature. After optimization by a lightweight MLP, the final fusion weight is adjusted by the working condition adaptive module, and the optimized multimodal fusion feature is output.

[0115] A closed-loop processing system is constructed, which integrates spatiotemporal synchronous acquisition, standardized preprocessing, and adaptive feature fusion. First, a hybrid time synchronization protocol and time alignment algorithm are used to achieve microsecond-level spatiotemporal synchronization across the entire system, eliminating time misalignment between modalities. Then, modality-specific preprocessing is used to unify model input specifications, enhance core fault features, and suppress environmental interference. Finally, a progressive fusion logic is used to solve the problem of feature space mismatch between heterogeneous modalities, outputting highly discriminative multimodal fusion features, laying a core data foundation for subsequent model training and inference.

[0116] The hardware configuration for this embodiment is as follows:

[0117] Image acquisition unit: Equipped with a 12-megapixel industrial camera, 30fps frame rate, supports 1 / 250s shutter speed, and features an infrared filter to avoid spectral interference.

[0118] Audio acquisition unit: adopts a 4-channel microphone array, sampling rate of 48kHz, signal-to-noise ratio >70dB, and supports 360° sound source localization.

[0119] Infrared acquisition unit: uses a thermal imaging camera, frame rate 15fps, temperature range -20℃ to +150℃, accuracy ±2℃.

[0120] Time synchronization unit: Equipped with an NTP server and a PTP hardware clock module to ensure that the time synchronization accuracy of the entire system reaches ±10μs.

[0121] Furthermore, the microsecond-level time synchronization is achieved by using a hybrid time synchronization protocol of NTP and PTP (IEEE 1588) to generate a precise microsecond-level timestamp. The timestamp is calculated from three quantities: the PTP-synchronized system time, the network delay compensation value, and the pre-calibrated inherent delay of the sensor. The central server broadcasts the PTP timestamp to all edge nodes. After receiving it, each edge node calculates and compensates for the network delay and, combining this with the pre-calibrated inherent delay of the sensor, generates a microsecond-level timestamp through the dedicated timestamp calculation formula, achieving a time synchronization accuracy of ±10μs for the entire system.

[0122] The specific implementation of the adaptive time alignment algorithm is as follows: taking the image acquisition frame rate of 30fps as the benchmark, calculate the alignment point of each modal data, use interpolation to generate a feature vector matching the frame rate for the 48kHz audio data, and use the inter-frame difference method for the 15fps infrared data to complete the precise alignment of each modal data on the time axis.

[0123] After completing the time alignment of image, audio, and infrared data at different acquisition frequencies, the data is encapsulated into structured data packets containing a unique data ID, modality type, microsecond-level acquisition timestamp, device ID, and acquisition location coordinates, following the process of system initialization, data acquisition, timestamp processing, time alignment, and storage, ensuring that the acquisition latency is ≤200ms.
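The structured data packet described above could be represented as in the following sketch. The class name `AcquisitionPacket`, the field names, the modality strings, and the `payload` field are illustrative assumptions. Only the listed fields (unique data ID, modality type, microsecond timestamp, device ID, and acquisition location coordinates) come from the text.

```python
from dataclasses import dataclass, asdict
import uuid

MODALITIES = {"visible", "infrared", "audio"}  # assumed modality labels


@dataclass(frozen=True)
class AcquisitionPacket:
    data_id: str        # unique data ID
    modality: str       # modality type
    timestamp_us: int   # microsecond-level acquisition timestamp
    device_id: str      # device ID
    location: tuple     # acquisition location coordinates
    payload: bytes = b""  # raw sample data (hypothetical field)


def make_packet(modality, timestamp_us, device_id, location, payload=b""):
    """Build a packet with a freshly generated unique ID."""
    if modality not in MODALITIES:
        raise ValueError(f"unknown modality: {modality}")
    return AcquisitionPacket(uuid.uuid4().hex, modality, int(timestamp_us),
                             device_id, tuple(location), payload)


packet = make_packet("infrared", 1_700_000_000_000_000, "T-01", (39.1, 117.2))
record = asdict(packet)  # plain dict, ready for structured storage
```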

[0124] The formula for aligning images and audio is:

[0125]

[0126] In this formula, the quantities are the aligned audio features, the original audio features, the image frame rate (30fps), and the audio sampling rate (48kHz). The time domain is defined as the total acquisition time of a single set of multimodal data, which ensures that signal alignment is completed within one complete acquisition cycle. For infrared data, the inter-frame difference method is used to complete time alignment, ultimately achieving accurate matching of the three types of modal data on the time axis.

[0127] By combining NTP and PTP hybrid protocols with dual-compensation timestamp calculation, a unified microsecond-level clock reference is achieved across the entire system. Using the image acquisition frame rate as a benchmark, rigid alignment of the time axis for data with different sampling rates is achieved, ensuring that multimodal data with the same timestamp corresponds to the same operating state of the corresponding devices. Furthermore, data traceability is achieved through structured data encapsulation, providing a high-quality data foundation with spatiotemporal consistency for subsequent multimodal fusion.
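A minimal sketch of frame-rate alignment follows, using the 30fps image stream as the reference. The text specifies interpolation for the audio features. For the infrared stream it names an inter-frame difference method without giving details, so linear interpolation between neighbouring frames is used here purely as a stand-in assumption. All function names are hypothetical.

```python
import numpy as np

IMAGE_FPS = 30.0  # reference frame rate from the embodiment


def alignment_points(duration_s, fps=IMAGE_FPS):
    """Timestamps (in seconds) of the image frames used as alignment points."""
    n_frames = int(round(duration_s * fps))
    return np.arange(n_frames) / fps


def align_to_frames(src_times, src_features, frame_times):
    """Linearly interpolate a (T, D) feature sequence onto frame_times."""
    src_features = np.asarray(src_features, dtype=float)
    columns = [np.interp(frame_times, src_times, src_features[:, d])
               for d in range(src_features.shape[1])]
    return np.stack(columns, axis=1)


# Example: 1 s of data; audio features every 16 ms, infrared at 15 fps.
frame_times = alignment_points(1.0)
audio_times = np.arange(0.0, 1.0, 0.016)
audio_feats = np.random.default_rng(0).normal(size=(audio_times.size, 20))
ir_times = np.arange(15) / 15.0
ir_feats = np.linspace(20.0, 35.0, 15)[:, None]  # e.g. mean temperature per frame

audio_aligned = align_to_frames(audio_times, audio_feats, frame_times)
ir_aligned = align_to_frames(ir_times, ir_feats, frame_times)
```

After this step, every modality has exactly one feature row per image frame, so rows with the same index describe the same instant.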

[0128] Furthermore, the specific process of the data preprocessing is as follows:

[0129] The image and infrared data preprocessing involves performing standardized preprocessing on the acquired visible light images and infrared thermal imaging data. This includes four operations: size normalization and pixel normalization, infrared image denoising and temperature correction, frequency domain anomaly enhancement, and multi-scale feature encoding preparation. Specifically, all images are uniformly scaled to 224×224 pixels and channel-level normalization is performed. Adaptive denoising of infrared images affected by environmental interference is performed using dual-tree complex wavelet transform (DT-CWT). Anomaly spectral components are enhanced in the Fourier domain using a frequency domain mask enhancement strategy. Finally, the preprocessed image data is output.
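Two of the operations above, channel-level normalization and frequency domain mask enhancement, can be sketched as follows. The text does not specify the shape of the mask, so a radial mask that amplifies components beyond a chosen radius from the DC term is assumed here. The `radius` and `gain` parameters and both function names are hypothetical. Resizing to 224×224 and DT-CWT denoising are omitted.

```python
import numpy as np


def channel_normalize(image, mean, std):
    """Channel-level normalization of an (H, W, C) image scaled to [0, 1]."""
    return (image - np.asarray(mean)) / np.asarray(std)


def frequency_mask_enhance(image, radius=20, gain=1.5):
    """Amplify Fourier components farther than `radius` from the DC term.

    image: 2-D array (one channel). The radial mask is an assumed design.
    """
    h, w = image.shape
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    yy, xx = np.ogrid[:h, :w]
    dist = np.hypot(yy - h // 2, xx - w // 2)
    mask = np.where(dist > radius, gain, 1.0)
    enhanced = np.fft.ifft2(np.fft.ifftshift(spectrum * mask))
    return enhanced.real


rng = np.random.default_rng(0)
infrared = rng.random((224, 224))
enhanced = frequency_mask_enhance(infrared)
```

With `gain=1.0` the function returns the input unchanged, which is a convenient sanity check on the transform round trip.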

[0130] The audio data preprocessing first applies pre-emphasis. The signal is then split into frames with a frame length of 32ms and a frame shift of 16ms, and a Hamming window is applied to each frame. Mel spectrum or GFCC features are extracted, and the key feature dimensions are selected by SVM-RFE, retaining the top 20 dimensions to reduce computational complexity.
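The pre-emphasis, framing, and windowing steps can be sketched as follows for the 48kHz stream. The pre-emphasis coefficient of 0.97 is an assumed common default that is not stated in the text, and the function names are hypothetical. Mel or GFCC extraction and SVM-RFE selection are omitted.

```python
import numpy as np

SAMPLE_RATE = 48_000
FRAME_LEN = SAMPLE_RATE * 32 // 1000    # 32 ms -> 1536 samples
FRAME_SHIFT = SAMPLE_RATE * 16 // 1000  # 16 ms -> 768 samples


def pre_emphasis(signal, coeff=0.97):
    """y[n] = x[n] - coeff * x[n-1]; coeff is an assumed default."""
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])


def frame_and_window(signal, frame_len=FRAME_LEN, frame_shift=FRAME_SHIFT):
    """Split into overlapping frames and apply a Hamming window to each."""
    if signal.size < frame_len:
        raise ValueError("signal shorter than one frame")
    n_frames = 1 + (signal.size - frame_len) // frame_shift
    idx = (np.arange(frame_len)[None, :]
           + frame_shift * np.arange(n_frames)[:, None])
    return signal[idx] * np.hamming(frame_len)


one_second = np.sin(2 * np.pi * 100 * np.arange(SAMPLE_RATE) / SAMPLE_RATE)
frames = frame_and_window(pre_emphasis(one_second))
```

With these settings, adjacent frames overlap by 50%, and one second of audio yields 61 frames of 1536 samples each.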

[0131] To address the challenges of field-acquired data being susceptible to environmental interference, exhibiting weak fault characteristics, and lacking standardized model input specifications, a standardized preprocessing system tailored to each modality is constructed. For image and infrared data, input specifications are unified through normalization, the true temperature distribution of the equipment is restored through wavelet denoising, and overheating fault-related features are amplified through frequency domain enhancement. For audio data, preprocessing ensures short-term signal stability, extracts acoustic features strongly correlated with mechanical faults, and filters core dimensions, thereby reducing computational load while improving the data signal-to-noise ratio and the identifiability of fault characteristics.

[0132] Furthermore, the specific implementation process of the bidirectional frequency domain cross-attention fusion mechanism Bi-FCFM is as follows:

[0133] Multimodal feature extraction and dimensionality unification: Three encoders are used to extract the native features of the visible light images, infrared images, and audio respectively. The native features of all three modalities are uniformly mapped to 768 dimensions through a linear projection layer.

[0134] Frequency domain feature extraction and bidirectional cross-attention fusion: Two-dimensional Fourier transform is performed on the dimensional unified projection features of each modality to convert them into frequency domain features; through the bidirectional cross-attention mechanism, mutual attention, dynamic interaction and preliminary fusion of frequency domain features of each modality are realized;

[0135] Dynamic weight adjustment based on homoscedastic uncertainty loss: calculate the variance and mean of each modal feature to obtain the uncertainty parameter, determine the basic weight of each modal fusion, and perform weighted integration of the preliminary fusion features to obtain the basic fusion features;

[0136] Feature optimization after fusion: The basic fusion features are input into a lightweight multilayer perceptron (MLP). The basic fusion features are processed and optimized through linear transformation of the MLP and nonlinear expression enhancement of the activation function, and finally the optimized fusion features of 768 dimensions are output.

[0137] Secondary adjustment of operating condition adaptive modal weights: Real-time acquisition of operating condition parameters such as load current, ambient temperature, and historical temperature rise rate of the equipment; construction of an operating condition-modal correlation mapping function based on the operating condition parameters to obtain the operating condition modulation factor of each mode; multiplication of the operating condition modulation factor with the fusion basis weight calculated by homoscedasticity uncertainty loss to obtain the final fusion weight of each mode; and secondary weight adjustment of the optimized fusion features.

[0138] The original features of three modalities—image, infrared, and audio—are extracted using a three-tower encoder, respectively:

[0139] Image modality: features are extracted from the preprocessed, normalized 224×224-pixel visible light image using a ResNet-50 encoder. The image feature extraction formula is as follows:

[0140] $$F_{img} = \mathrm{ResNet50}(I_{img}), \quad F_{img} \in \mathbb{R}^{H \times W \times C}$$

[0141] where $I_{img}$ is the normalized visible light image with a size of 224×224 pixels and $F_{img}$ is the extracted image feature, with H=7, W=7, and C=2048, representing a 7×7 spatial dimension and 2048 channels;

[0142] Infrared modality: features are extracted from the preprocessed infrared image using a Freq-ViT encoder. The infrared feature extraction formula is as follows:

[0143] $$F_{ir} = \text{Freq-ViT}(I_{ir}), \quad F_{ir} \in \mathbb{R}^{H \times W \times C}$$

[0144] where $I_{ir}$ is the preprocessed infrared image and $F_{ir}$ is the extracted infrared feature, with H=14, W=14, and C=768, representing a spatial dimension of 14×14 and 768 channels;

[0145] Audio modality: features are extracted from the preprocessed audio features using a HuBERT encoder. The audio feature extraction formula is as follows:

[0146] $$F_{aud} = \mathrm{HuBERT}(X_{aud}), \quad F_{aud} \in \mathbb{R}^{T \times C}$$

[0147] where $X_{aud}$ denotes the preprocessed original audio features and $F_{aud}$ the extracted audio features, with T=128 and C=768, representing 128 time steps and 768 channels.

[0148] The features of the three modalities are uniformly mapped to d=768 dimensions through a linear projection layer. The linear projection formula is as follows:

[0149] $$F_i' = W_i F_i + b_i$$

[0150] where $F_i'$ is the 768-dimensional feature of the i-th modality after projection, $F_i$ is the raw feature output by the encoder of the i-th modality, $W_i$ is the linear projection weight matrix of the corresponding modality, and $b_i$ is the projection bias term of the corresponding modality.

[0151] A two-dimensional Fourier transform is performed on the 768-dimensional projected features of each modality to extract the frequency domain feature representation of each modality. The transform formula is as follows:

[0152] $$\hat{F}_i = \mathcal{F}_{2D}(F_i')$$

[0153] where $\mathcal{F}_{2D}$ represents the two-dimensional Fourier transform operator, $F_i'$ is the 768-dimensional projected feature of the i-th modality, and $\hat{F}_i$ is the frequency domain feature of the i-th modality. Dynamic interaction of frequency domain features between modalities is achieved through a bidirectional cross-attention mechanism. First, the query vector, key vector, and value vector of each modality's frequency domain features are calculated using the following formula:

[0154] $$Q_i = \hat{F}_i W_Q, \quad K_j = \hat{F}_j W_K, \quad V_j = \hat{F}_j W_V$$

[0155] where $i$ and $j$ represent different modalities and $W_Q$, $W_K$, and $W_V$ are learnable weight matrices; then the attention weight matrix between modalities is calculated using the following formula:

[0156] $$A_{ij} = \mathrm{softmax}\left(\frac{Q_i K_j^{\top}}{\sqrt{d_k}}\right)$$

[0157] where $d_k$ is the feature dimension. Finally, cross-modal feature aggregation is performed based on the attention weights to generate the cross-modal interaction features of each modality, as shown in the formula: $C_i = \sum_{j \neq i} A_{ij} V_j$,

[0158] where $C_i$ is the cross-modal interaction feature of the i-th modality, $A_{ij}$ is the inter-modal attention weight, and $V_j$ is the value vector of the j-th modality.

[0159] The basic fusion weights for each modality are calculated based on the homoscedastic uncertainty loss. First, the variance of each modality's features is calculated using the following formula:

[0160] $$\sigma_i^2 = \frac{1}{N} \sum_{n=1}^{N} \left(f_{i,n} - \mu_i\right)^2$$

[0161] where $\sigma_i^2$ is the feature variance of the i-th modality, $N$ is the sample size, $f_{i,n}$ is the feature of the n-th sample of the i-th modality, and $\mu_i$ is the feature mean of the i-th modality; then the uncertainty parameters and basic fusion weights of each modality are calculated using the following formula:

[0162]

[0163] where the formula defines the uncertainty parameter of the i-th modality and the basic fusion weight $w_i$ of the i-th modality, and $i$ is the modality index.

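[0164] The formula for calculating the fusion features is $F_{fuse} = \sum_{i} w_i C_i$, where $F_{fuse}$ is the initial fusion feature with 768 dimensions, $w_i$ is the basic fusion weight of the i-th modality, and $C_i$ is the cross-modal interaction feature of the i-th modality.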

[0165] The fused features are further processed using a lightweight MLP to improve feature representation capabilities, as shown in the formula: $F_{out} = \mathrm{ReLU}(W_f F_{fuse} + b_f)$;

[0166] where $F_{out}$ is the final 768-dimensional standardized multimodal fusion feature, $F_{fuse}$ is the initial fusion feature, $W_f$ is the fusion post-processing weight matrix of size 768×768, $b_f$ is the fusion post-processing bias term with dimension 768, and ReLU is a non-linear activation function.

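[0167] The operating parameters of the equipment (load current, ambient temperature, and historical temperature rise rate) are acquired in real time. The fusion weights are then re-weighted through an operating condition-modality correlation mapping function to obtain the final fusion weight of each modality. The mapping function is as follows:

[0168] Here the mapping yields the operating condition modulation factor. It is defined piecewise: a first case applies when a single condition on the operating parameters holds, and a second case applies when two conditions hold jointly; each case assigns its own modulation factor values. The rated load current of the equipment is one of the quantities appearing in the mapping.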

[0169] First, the feature dimensions are unified by the three-tower encoder and the linear projection layer. Then, bidirectional cross-attention fusion is carried out based on frequency domain features to capture the periodic features of faults and achieve intermodal information complementarity. Adaptive allocation of fusion weights is achieved through homoscedastic uncertainty loss. The weights are adjusted in the second step by combining the real-time operating parameters of the equipment, so that the fusion strategy can be adapted to different operating scenarios, which greatly improves the fault discrimination capability and environmental robustness of the fusion features.
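The fusion pipeline summarized above can be sketched end to end in NumPy. This is an illustrative approximation under several assumptions that the text does not confirm: the magnitude of the 2-D FFT is used so that attention operates on real values, the interaction features are mean-pooled over tokens before fusion, and the basic weights are taken as normalized inverse variances. The operating-condition adjustment is omitted because its mapping function is not recoverable here. All names are hypothetical, and random matrices stand in for trained parameters.

```python
import numpy as np

D = 768  # unified feature dimension from the text
rng = np.random.default_rng(0)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def freq_features(tokens):
    """2-D FFT over (token, channel); magnitude keeps values real (assumption)."""
    return np.abs(np.fft.fft2(tokens))


def cross_attend(query_feat, other_feats, w_q, w_k, w_v):
    """Aggregate value vectors of the other modalities for one query modality."""
    q = query_feat @ w_q
    out = np.zeros_like(q)
    for feat in other_feats:
        k, v = feat @ w_k, feat @ w_v
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        out += attn @ v
    return out


def uncertainty_weights(feats):
    """Inverse-variance weights normalized to sum to 1 (assumed form)."""
    inv_var = np.array([1.0 / f.var() for f in feats])
    return inv_var / inv_var.sum()


def bi_fcfm(raw_feats, proj, w_q, w_k, w_v, w_f, b_f):
    projected = [f @ w + b for f, (w, b) in zip(raw_feats, proj)]
    freq = [freq_features(p) for p in projected]
    interacted = []
    for i, f in enumerate(freq):
        others = [g for j, g in enumerate(freq) if j != i]
        # Mean-pool tokens so every modality yields one D-dimensional vector.
        interacted.append(cross_attend(f, others, w_q, w_k, w_v).mean(axis=0))
    weights = uncertainty_weights(projected)
    fused = sum(w * c for w, c in zip(weights, interacted))
    return np.maximum(w_f @ fused + b_f, 0.0), weights  # single-layer MLP + ReLU


# Token layouts follow the encoder shapes in the text: 7x7x2048, 14x14x768, 128x768.
raw = [rng.normal(size=(49, 2048)), rng.normal(size=(196, 768)),
       rng.normal(size=(128, 768))]
proj = [(rng.normal(size=(f.shape[1], D)) / np.sqrt(f.shape[1]), np.zeros(D))
        for f in raw]
w_q, w_k, w_v, w_f = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(4))
fused, weights = bi_fcfm(raw, proj, w_q, w_k, w_v, w_f, np.zeros(D))
```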

[0170] Furthermore, S2 includes the following steps:

[0171] S21. Single-modal supervised contrastive learning augmentation: Modality-adaptive data augmentation is performed on image, infrared, and audio data respectively. Single-modal features are extracted using dedicated encoders for each modality. Each encoder is independently trained and iteratively updated based on an improved supervised contrastive loss function to improve the discriminative power of single-modal features.

[0172] S22. Cross-modal self-supervised alignment: The output features of each modal encoder are mapped to a unified shared feature space through a cross-modal alignment weight matrix. A cross-modal contrastive loss function is constructed using multimodal data from the same device as positive samples and multimodal data from different devices as negative samples. The cross-modal alignment weight matrix is optimized to achieve alignment of the multimodal feature space.

[0173] S23. Multimodal fusion contrastive learning: Perform dynamically weighted fusion of the spatially aligned multimodal features, calculate the cosine similarity between fused features, construct a multimodal fusion contrastive loss function based on the fused features, and iteratively optimize the fusion weights of each modality to obtain the optimal multimodal fusion feature;

[0174] S24. Small sample optimization: Through modal adaptive data augmentation, high-confidence pseudo-label generation and iterative optimization with physical consistency verification, and adaptive regularization, the training configuration is improved to expand the effective training data and enhance the training effect and generalization ability of the model under small sample conditions.

[0175] First, we enhance the fault feature extraction capability of each modality-specific encoder through single-modal independent supervised contrastive learning. Then, we bridge the semantic gap between heterogeneous modalities by leveraging the natural physical correlation of multimodal data from the device through cross-modal self-supervised alignment. Subsequently, we guide the model to learn the optimal modality fusion strategy through multimodal fusion contrastive learning. Finally, we expand the effective training data and suppress model overfitting through multi-dimensional few-sample optimization strategies, thus solving the core problem of insufficient model generalization ability in few-sample scenarios.

[0176] Furthermore, the improved supervised contrastive loss function sets the batch size to 256 and the temperature parameter to 0.1, treats only samples with the same label as positive samples, constructs a negative sample set through an implicit method, and calculates the loss value based on cosine similarity; the cross-modal contrastive loss function uses multimodal data from the same device as positive samples and multimodal data from different devices as negative samples, with the optimization objective being to maximize the cross-modal feature similarity of the same device and minimize the cross-modal feature similarity of different devices, with a training cycle of 30 epochs and a batch size of 128.

[0177] The improved supervised contrastive loss function is the core optimization objective for training a single-modal encoder, and its complete formula is:

[0178] $$\mathcal{L}_{sup} = \sum_{i=1}^{N} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp\left(\mathrm{sim}(z_i, z_p)/\tau\right)}{\sum_{a=1,\, a \neq i}^{N} \exp\left(\mathrm{sim}(z_i, z_a)/\tau\right)}$$

[0179] In the formula, $N$ represents the batch size, with a value of 256; $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity operator; $\tau$ is the temperature parameter, with a value of 0.1; $P(i)$ is the set of positive samples of sample $i$, containing only samples with the same label as sample $i$; the negative sample set is implicitly constructed from all dissimilar samples, without the need for explicit screening; $z_i$ is the feature vector of the i-th sample; and $z_p$ is the feature vector of any sample within the positive sample set $P(i)$.
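A minimal NumPy sketch of a supervised contrastive loss with these settings follows. It uses the standard supervised contrastive form averaged over anchors, which is an assumption. The "improved" variant in the text may differ in detail. The function name is hypothetical.

```python
import numpy as np


def supervised_contrastive_loss(features, labels, temperature=0.1):
    """Supervised contrastive loss on cosine similarities.

    Positives are the other samples sharing the anchor's label; all
    remaining samples in the batch act as implicit negatives.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    logits = z @ z.T / temperature
    not_self = ~np.eye(len(labels), dtype=bool)
    # Log-softmax over every other sample in the batch.
    logits = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(logits) * not_self
    log_prob = logits - np.log(exp.sum(axis=1, keepdims=True))
    positives = (labels[:, None] == labels[None, :]) & not_self
    n_pos = positives.sum(axis=1)
    valid = n_pos > 0  # anchors without any positive are skipped
    per_anchor = -(log_prob * positives).sum(axis=1)[valid] / n_pos[valid]
    return float(per_anchor.mean())


feats = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]]
loss_good = supervised_contrastive_loss(feats, [0, 0, 1, 1])
loss_bad = supervised_contrastive_loss(feats, [0, 1, 0, 1])
```

When the labels match the geometric clusters the loss is small; when they cut across the clusters it is much larger, which is the behaviour the encoder is trained to exploit.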

[0180] First, a linear projection layer is used to uniformly map the native features extracted by each modal encoder to the same dimension. Let the features of the three modalities obtained after processing by the single-modality encoder and the projection layer be $F_1'$, $F_2'$, and $F_3'$.

[0181] Subsequently, a cross-modal alignment weight matrix is constructed. The features of each modality are mapped to a unified shared feature space through matrix multiplication.

[0182] The formula for cross-modal feature space mapping is:

[0183] $$Z_i = W_a F_i'$$

[0184] In the formula, $W_a$ is the cross-modal alignment weight matrix, $Z_i$ is the aligned feature within the shared feature space, and $F_i'$ is the 768-dimensional feature of the i-th modality after projection;

[0185] This alignment process optimizes cross-modal contrastive loss by using multimodal data collected from the same device at the same time as positive sample pairs and multimodal data from different devices as negative sample pairs. The weight matrix is iteratively optimized by maximizing the cosine similarity of cross-modal features from the same device and minimizing the similarity of cross-modal features from different devices. This enables semantic alignment of heterogeneous modal features in a shared space.

[0186] The formula for cross-modal similarity measurement is:

[0187] $$\mathrm{sim}(Z_i, Z_j) = \frac{Z_i^{\top} Z_j}{\lVert Z_i \rVert_2 \, \lVert Z_j \rVert_2}, \quad i \neq j$$

[0188] where $\mathrm{sim}(Z_i, Z_j)$ is the cross-modal similarity function, with an output value range of [-1, 1]. The closer the value is to 1, the higher the similarity between the feature vectors of the two different modalities and the better the feature alignment effect. $Z_i$ and $Z_j$ are the 768-dimensional feature vectors of the i-th and j-th modalities after mapping into the shared feature space, where $i, j \in \{1, 2, 3\}$ correspond to the three modalities: image, infrared, and audio. $Z_i^{\top}$ is the transpose of the feature vector $Z_i$, used to perform the dot product of the two vectors and to ensure dimension matching in the matrix multiplication. $\lVert Z_i \rVert_2$ and $\lVert Z_j \rVert_2$ are the L2 norms of the feature vectors $Z_i$ and $Z_j$, used to normalize the vectors, eliminating the effect of vector magnitude on the similarity calculation and retaining only the directional information. The constraint $i \neq j$ makes clear that this formula is used only to calculate the feature similarity between different modalities, not within the same modality, in line with the core goal of cross-modal feature alignment.

[0189] The rules for constructing positive and negative samples are as follows: multimodal data from the same device at the same time are used as positive samples, and multimodal data from different devices are used as negative samples.

[0190] The optimization objective is $W_a^{*} = \arg\min_{W_a} \mathcal{L}_{cm}$, which maximizes the cross-modal feature similarity for the same device and minimizes the cross-modal feature similarity for different devices. Here $W_a$, the cross-modal alignment weight matrix, is the parameter to be optimized, and $\mathcal{L}_{cm}$ is the cross-modal contrastive loss;
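The positive and negative pairing rule can be illustrated with an InfoNCE-style loss between two modalities. This is a hedged sketch: the text does not give the exact form of the cross-modal contrastive loss or its temperature, so the softmax form and `temperature=0.1` are assumptions, and the function names are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def cross_modal_contrastive_loss(z_a, z_b, temperature=0.1):
    """Row k of z_a and row k of z_b come from the same device at the same time.

    Matching rows are positives; every other row pairing is a negative.
    """
    logits = cosine_similarity_matrix(z_a, z_b) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


# Four devices; aligned features agree across modalities, shuffled ones do not.
z_image = np.eye(4) + 0.01
loss_aligned = cross_modal_contrastive_loss(z_image, z_image.copy())
loss_shuffled = cross_modal_contrastive_loss(z_image, np.roll(z_image, 1, axis=0))
```

Minimizing this loss with respect to the alignment matrix pulls same-device features together in the shared space and pushes different-device features apart.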

[0191] The overall optimization objective for small sample training is:

[0192]

[0193] To use model parameters To optimize the variables and minimize the total loss ; The total loss function is composed of a weighted sum of supervised contrastive loss, pseudo-label loss, and adaptive regularization term. The range of values for the monitoring loss weighting coefficient is as follows: The typical value of this scheme is This is used to balance the contributions of supervision loss and pseudo-label loss; To supervise the contrastive loss, it is calculated based on labeled samples to enhance the class discriminativeness of features; The pseudo-label loss is calculated based on the high-confidence pseudo-labels generated by the model for unlabeled samples, and is used to augment small sample training data. To adapt the regularization strength coefficient, the number of fault categories and the number of samples in each category are dynamically adjusted to avoid overfitting during small sample training.

[0194] The regularization term constrains model complexity and improves generalization ability.

[0195] The training hyperparameters are set as follows: 30 training epochs, batch size 128, the AdamW optimizer, and the specified learning rate and weight decay.

[0196] By precisely defining the core parameters of the loss function and the rules for constructing positive and negative samples, stable convergence of contrastive learning and maximization of feature learning effectiveness are achieved. The improved supervised contrastive loss can amplify the commonalities of similar features and the differences of dissimilar features, strengthening the discriminative power of single-modal features; the cross-modal contrastive loss utilizes the natural correlation of multimodal data from devices to achieve self-supervised alignment, guiding the model to learn cross-modal semantic associations and bridging the differences in feature space distribution between heterogeneous modalities.

[0197] Furthermore, the multimodal fusion contrastive learning specifically includes the following steps:

[0198] Using spatially aligned trimodal features as input, dynamic fusion weights are assigned to each modality to generate fusion features;

[0199] Calculate the cosine similarity between fused features, and construct a multimodal fusion contrast loss function with fused features from the same device as positive samples and fused features from different devices as negative samples;

[0200] The fusion weights are iteratively optimized with the goal of minimizing the fusion contrast loss to obtain the optimal multimodal fusion features.

[0201] Using the spatially aligned trimodal features as input, the formula for generating the fused features is:

[0202]

[0203] In the formula, the inputs are the trimodal features aligned in the shared modal space, and the coefficients are learnable dynamic fusion weights.
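A weighted sum with learnable weights is one straightforward reading of the fusion step. The sketch below assumes that the weights are obtained by a softmax over learnable logits so that they stay positive and sum to 1; that normalization, and the names `fuse` and `weight_logits`, are assumptions and are not stated in the text.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(features, weight_logits):
    """Fuse aligned per-modality feature vectors with dynamic weights.

    features:      dict modality -> feature vector (all the same length)
    weight_logits: dict modality -> learnable scalar (hypothetical parameter)
    Returns the fused vector and the normalized weight per modality.
    """
    names = sorted(features)
    weights = softmax([weight_logits[n] for n in names])
    dim = len(features[names[0]])
    fused = [0.0] * dim
    for name, w in zip(names, weights):
        for i, value in enumerate(features[name]):
            fused[i] += w * value
    return fused, dict(zip(names, weights))
```

With equal logits every modality contributes one third; during training the logits would be updated by backpropagation together with the rest of the model.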

[0204] The similarity between fused features is calculated using cosine similarity, and the formula is:

[0205]

[0206] where the two arguments are the fused feature vectors of two devices (or samples), the numerator is the dot product of the two vectors, and the denominator is the product of their Euclidean norms. The similarity takes values in [-1, 1]; the closer it is to 1, the more similar the two fused features are.

[0207] The multimodal fusion contrast loss function is constructed using fusion features from the same device as positive samples and fusion features from different devices as negative samples. Its core form is consistent with the supervised contrast loss, and the temperature parameter is set to 0.1.

[0208] The iterative optimization objective of the fusion weights is:

[0209]

[0210] In the formula, the loss is the multimodal fusion contrastive loss. The fusion weights of each modality are iteratively optimized through backpropagation so that the fused features retain as much fault discrimination information as possible, finally yielding the optimal multimodal fusion strategy for power equipment fault classification. The remaining symbols denote the total number of samples in the training batch, the positive-sample fused feature of a given sample, the negative-sample fused feature of that sample, and the temperature hyperparameter.
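The text states only that the fusion contrastive loss follows the form of the supervised contrastive loss with a temperature of 0.1. The sketch below assumes the common InfoNCE-style form, in which each positive pair is scored against all other samples in the batch; the exact expression in the filing may differ, and `fusion_contrastive_loss` is a hypothetical name.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def fusion_contrastive_loss(features, device_ids, temperature=0.1):
    """Average contrastive loss over all positive pairs in a batch.

    Fused features from the same device are positives; fused features
    from different devices are negatives.
    """
    n = len(features)
    total, pairs = 0.0, 0
    for i in range(n):
        # Temperature-scaled similarity of anchor i to every other sample.
        logits = {k: cosine(features[i], features[k]) / temperature
                  for k in range(n) if k != i}
        log_denominator = math.log(sum(math.exp(v) for v in logits.values()))
        for k, logit in logits.items():
            if device_ids[k] == device_ids[i]:
                total += log_denominator - logit  # equals -log softmax(logit)
                pairs += 1
    return total / pairs if pairs else 0.0
```

The loss is small when fused features from the same device point in the same direction and large when they are confused with features from other devices, which is the behavior the optimization of the fusion weights relies on.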

[0211] Using the spatially aligned trimodal features as input, fused features are generated through dynamic weights. A fusion contrastive loss function is then constructed with fault classification as the core objective. This guides the model to learn a weight allocation scheme that maximizes the intra-class compactness and inter-class separation of the fused features, enabling them to integrate the fault discrimination advantages of each modality and improving the model's adaptability to different fault types.

[0212] Furthermore, the modality-adaptive data augmentation includes:

[0213] The image modality employs random cropping, color jitter, and small-angle rotation;

[0214] The infrared modality employs temperature scaling, Gaussian noise, and horizontal flipping;

[0215] The audio modality employs time stretching, reverb addition, and pitch adjustment;

[0216] The pseudo-label generation and iterative optimization include:

[0217] Use the trained model to generate pseudo-labeled samples with a confidence level greater than 0.85;

[0218] The consistency verification module for operational status verifies the physical and logical consistency between the multimodal characteristics of the pseudo-label samples and the operating parameters, thus filtering out contradictory samples.

[0219] The total loss function is constructed by fusing supervised contrastive loss and pseudo-label cross-entropy loss, and pseudo-label accuracy is improved through up to 5 iterations of training.

[0220] The adaptive regularization includes:

[0221] Regularization terms are constructed based on the Frobenius norm of the weight matrix;

[0222] The regularization strength adaptively increases as the number of samples in each class decreases to suppress overfitting on small samples;

[0223] The training configuration is as follows: the AdamW optimizer is used, the batch size is 32, the training epochs are 30, and the parameters are optimized using the overall objective function that integrates supervised contrastive loss, pseudo-label loss and adaptive regularization term.

[0224] Image modality augmentation algorithm: the output is the augmented visible light image matrix, used to expand the fault classification training dataset. The function composition operator fixes the execution order of the image transformations from right to left: random rotation first, then color jitter, and finally random cropping. The random rotation transformation rotates the input image by a random angle to simulate different shooting angles during drone inspection, enhancing the model's robustness to angle. The color jitter transformation randomly adjusts image brightness, contrast, saturation, and hue to simulate imaging under different lighting and weather conditions, enhancing the model's environmental adaptability. The random cropping transformation crops a random region of the image and scales it back to the original size to simulate local device features at different shooting distances, enhancing the model's ability to extract fault details. The input is the original visible light image after normalization preprocessing, taken from visible light data of power equipment collected by drone inspection.

[0225] The random cropping ratio is 0.7-1.0, and the brightness, contrast, and saturation are all ±0.3 during color jitter, with a random rotation angle of ±15°.

[0226] Infrared modality augmentation algorithm: the output is the augmented infrared image matrix, used to expand the training dataset for power equipment fault classification. The function composition operator fixes the execution order of the image transformations from right to left: random flipping first, then Gaussian noise injection, and finally temperature scaling. The random flip transformation flips the image horizontally or vertically at random to simulate different shooting directions during drone inspection, enhancing the model's adaptability to changes in device attitude. The Gaussian noise injection transformation adds random Gaussian noise to the infrared image to simulate the native imaging noise of the infrared sensor, improving the model's anti-interference capability.

[0227] The temperature scaling transformation randomly adjusts the grayscale value distribution range of the infrared image to simulate differences in thermal imaging under different ambient temperatures and equipment loads, enhancing the model's ability to detect abnormal equipment temperatures. The input is the original infrared image after non-uniformity correction and normalization preprocessing, taken from infrared thermal imaging data of power equipment collected by UAV inspection.

[0228] The parameters are: temperature scaling of ±10%, Gaussian noise with the specified standard deviation, and random horizontal flipping;

[0229] Audio modality augmentation algorithm: the output is the augmented audio time-domain signal, used to expand the training dataset for power equipment fault classification. The function composition operator fixes the execution order of the audio transformations from right to left: pitch shift first, then reverberation superposition, and finally time stretching. The pitch shift transformation randomly adjusts the audio pitch without changing its duration to simulate the frequency differences of abnormal noises across device models and fault severities, enhancing the model's ability to extract frequency features of abnormal audio. The reverberation superposition transformation adds random environmental reverberation to the audio to simulate the acoustic differences between inspection scenarios in substations and on transmission lines, improving the model's environmental adaptability. The time stretching transformation randomly adjusts the audio playback speed without changing the pitch to simulate the differing durations of abnormal noises at different stages of fault development, enhancing the model's robustness to the temporal characteristics of abnormal audio. The input is the original equipment operation audio signal after sampling rate alignment and noise reduction preprocessing, taken from audio data such as power equipment discharge and abnormal mechanical noise collected during inspection.

[0230] The parameters are: time stretching of ±15%, the specified reverberation time, and pitch adjustment of ±2 semitones.
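The three augmentation pipelines and their parameter ranges can be summarized as a parameter sampler. This sketch only draws the random parameters in the stated execution order and does not transform any data. The noise standard deviation and reverberation time are not recoverable from the text, so `noise_std` and `reverb_time_s` are hypothetical arguments with placeholder defaults, and the 0.5 flip probability is likewise an assumption.

```python
import random

def sample_augmentation(modality, rng, noise_std=0.01, reverb_time_s=0.3):
    """Return the ordered (transform, parameters) list for one sample.

    The list order is the execution order, i.e. the right-to-left
    reading of the function composition described in the text.
    """
    if modality == "image":
        return [
            ("random_rotation", {"degrees": rng.uniform(-15.0, 15.0)}),
            ("color_jitter", {
                "brightness": rng.uniform(-0.3, 0.3),
                "contrast": rng.uniform(-0.3, 0.3),
                "saturation": rng.uniform(-0.3, 0.3),
            }),
            ("random_crop", {"ratio": rng.uniform(0.7, 1.0)}),
        ]
    if modality == "infrared":
        return [
            ("horizontal_flip", {"apply": rng.random() < 0.5}),   # assumed probability
            ("gaussian_noise", {"std": noise_std}),               # placeholder value
            ("temperature_scale", {"factor": 1.0 + rng.uniform(-0.10, 0.10)}),
        ]
    if modality == "audio":
        return [
            ("pitch_shift", {"semitones": rng.uniform(-2.0, 2.0)}),
            ("reverb", {"reverb_time_s": reverb_time_s}),         # placeholder value
            ("time_stretch", {"rate": 1.0 + rng.uniform(-0.15, 0.15)}),
        ]
    raise ValueError(f"unknown modality: {modality}")
```

Passing a seeded `random.Random` instance as `rng` makes the sampled parameters reproducible, which helps when comparing training runs.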

[0231] The formula for generating pseudo-labels is:

[0232]

[0233] In the formula, the output is the pseudo-label of a given sample, taking a value among the fault categories; the classification head is a fully connected layer that outputs class scores from the fused features of that sample; and the normalization function maps these scores to a probability distribution. This generates high-quality pseudo-labels for small-sample data, expanding the effective training data.

[0234] The pseudo-label filtering rule is as follows: only samples with a model output confidence score > 0.85 are retained. Simultaneously, the physical consistency verification module verifies the physical and logical consistency between the multimodal features of the samples and the operating parameters, filtering out contradictory samples. Typical verification rules are as follows:

[0235] When the pseudo-label indicates connector overheating, the maximum infrared temperature and the load must both meet their specified thresholds;

[0236] When the pseudo-label indicates mechanical loosening, the audio energy must be concentrated in the 1-3 kHz range and the image must show vibration blur.
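The filtering described above combines a confidence threshold with per-label consistency rules. The sketch below keeps the 0.85 threshold from the text and takes the rules as caller-supplied predicates, because the temperature and load thresholds are missing from the source. The name `select_pseudo_labels` and the record fields (`logits`, `params`) are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def select_pseudo_labels(samples, classes, rules, threshold=0.85):
    """Keep samples whose top class probability exceeds the threshold
    and whose operating parameters pass the rule for that class.

    samples: list of dicts with 'id', 'logits', and 'params'
    rules:   dict class name -> predicate over 'params' (optional per class)
    """
    kept = []
    for sample in samples:
        probs = softmax(sample["logits"])
        confidence = max(probs)
        label = classes[probs.index(confidence)]
        if confidence <= threshold:
            continue  # low-confidence prediction, not used as a pseudo-label
        rule = rules.get(label)
        if rule is not None and not rule(sample["params"]):
            continue  # contradicts the physical consistency check
        kept.append((sample["id"], label, confidence))
    return kept
```

A rule for connector overheating might, for example, require `params["max_ir_temp_c"]` and `params["load_ratio"]` to exceed chosen limits; those field names and limits are illustrative only.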

[0237] The formula for the total loss function during iterative training is:

[0238]

[0239] In the formula, the balance coefficient weights the two loss terms, which are the supervised contrastive loss and the pseudo-label cross-entropy loss; the maximum number of training iterations is 5.

[0240] The regularization term is constructed based on the Frobenius norm of the weight matrix, and the formula is as follows:

[0241]

[0242] In the formula, the norm is the Frobenius norm, used to constrain the complexity of the model weights; the left-hand side is the regularization term, and each weight matrix is that of a given layer.

[0243] The formula for calculating the adaptive regularization strength is:

[0244]

[0245] In the formula, the symbols denote the total number of fault categories, the number of labeled samples per class, and the regularization strength. The strength increases adaptively as the number of samples in each class decreases, specifically suppressing model overfitting in small-sample scenarios.
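The exact expressions for the regularization term and its strength are not recoverable from the text. The sketch below assumes a sum of squared Frobenius norms over the layer weight matrices and a strength proportional to the number of classes divided by the number of samples per class, which satisfies the stated property that the strength grows as per-class samples shrink. Both forms, and the `base` coefficient, are assumptions.

```python
def frobenius_squared(matrix):
    """Squared Frobenius norm of a matrix given as a list of rows."""
    return sum(value * value for row in matrix for value in row)

def adaptive_strength(num_classes, samples_per_class, base=1e-4):
    """Assumed form: strength rises as samples per class fall."""
    if samples_per_class <= 0:
        raise ValueError("samples_per_class must be positive")
    return base * num_classes / samples_per_class

def regularization(weight_matrices, num_classes, samples_per_class, base=1e-4):
    """Adaptive strength times the summed squared Frobenius norms."""
    strength = adaptive_strength(num_classes, samples_per_class, base)
    return strength * sum(frobenius_squared(w) for w in weight_matrices)
```

Under this assumed form, a 10-class task with 5 labeled samples per class is regularized ten times more strongly than the same task with 50 samples per class.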

[0246] Modality-specific data augmentation increases the diversity of training samples while keeping the augmented samples physically plausible. Iterative pseudo-label optimization with physical consistency verification filters out erroneous labels and expands the effective training set, avoiding error accumulation. An adaptive regularization mechanism whose strength is negatively correlated with the number of samples dynamically constrains model complexity, specifically addressing overfitting in small-sample training and improving the model's generalization ability.

[0247] Furthermore, S3 includes the following steps:

[0248] S31. Lightweight model design: Knowledge distillation with a reverse KL divergence strategy is employed to compress the number of parameters of the large model;

[0249] Depth and breadth pruning based on the ESA algorithm removes redundant filters in each convolutional layer, compressing the number of model parameters to less than 5% of the original;

[0250] The model is quantized using the GPTQ approximate second-order quantization technique, reducing the computational resource requirements.

[0251] S32. Edge device adaptation: Convert the PyTorch format model to ONNX format and deploy it to Jetson Xavier NX or substation robot edge devices via the OpenCV DNN module;

[0252] Mixed-precision quantization is used to dynamically adjust the quantization bit width, balancing inference accuracy and speed;

[0253] S33. Real-time inference optimization: A fault feature priority-aware real-time inference framework is adopted. The front-end of inference is optimized through lightweight fault type prediction, dynamic routing of feature processing, and dynamic pruning of modal features. Combined with the computing resource scheduling with a built-in fault feature caching mechanism, the real-time performance of the entire inference process is improved.

[0254] Knowledge distillation employs a reverse KL divergence strategy to transfer knowledge from the teacher model to the student model. Structural pruning uses joint depth and breadth pruning based on the ESA algorithm to remove redundant filters in the convolutional layers, ultimately compressing the number of model parameters to less than 5% of the original. Model quantization uses GPTQ approximate second-order quantization for the weights, combined with mixed-precision quantization that dynamically adjusts the quantization bit width, balancing inference accuracy and speed.

[0255] The model conversion process is as follows: the trained model in PyTorch format is converted into the general-purpose ONNX format, and deployment on edge devices such as the Jetson Xavier NX and substation inspection robots is completed through the OpenCV DNN module.

[0256] By employing a progressive lightweight strategy, the model achieves extreme compression while minimizing accuracy loss, significantly reducing hardware computing power requirements. Furthermore, through general format conversion and mixed precision quantization, cross-platform adaptation of the model with mainstream inspection edge hardware is achieved. Finally, through a fault feature priority-aware inference framework, dynamic scheduling of computing power and redundant computation pruning are realized, improving the real-time performance of edge inference while ensuring classification accuracy, thus supporting the engineering implementation of the solution.

[0257] Furthermore, the real-time inference optimization specifically includes the following steps:

[0258] S331: Fault Feature Priority Awareness: Determines the fault type and confidence level through lightweight fault type prediction, and realizes dynamic routing of feature processing and dynamic pruning of modal features based on the prediction results;

[0259] S332: Computational resource scheduling: Real-time perception of edge device hardware status, maximizing inference efficiency through resource scheduling, computational pipeline reorganization, and fault feature caching;

[0260] S333: Physical consistency guarantee: Embed physical constraints of power equipment during the inference optimization process and establish an error feedback closed loop to ensure that the inference results conform to physical laws.

[0261] To address the pain points of limited computing power on edge devices and excessive redundant computations in traditional static inference, a three-pronged real-time inference optimization system is constructed. Through a fault feature priority awareness mechanism, precise allocation of computing power and reduction of redundant computations are achieved; through computing resource scheduling, adaptive matching between inference strategies and hardware operating states is realized, reducing pipeline congestion and redundant computations; and through a physical consistency guarantee mechanism, physical constraints of devices are embedded throughout the entire inference process, establishing an error feedback closed loop to achieve dual guarantees of inference speed and result reliability.

[0262] Furthermore, in S331, the lightweight fault type prediction specifically refers to:

[0263] A lightweight fault type prediction module with less than 50K parameters and based on MobileNetV2 is integrated at the front end of the inference process. This module takes three types of raw modal data—preprocessed visible light images, infrared thermal imaging, and audio—as input, outputs the possible fault types and their confidence levels within 15ms, and consumes less than 5% of computing resources, providing a basis for decision-making in subsequent inference optimization.

[0264] The feature processing dynamic routing specifically involves: allocating computing resources based on the fault type prediction results, and adopting dedicated routing strategies for different fault types.

[0265] When transformer overheating is predicted, 85% of computational resources are prioritized for infrared features, running the complete infrared encoder and frequency domain attention mechanism. 10% is allocated to simplified image processing, which downsamples the visible light image to 64×64 pixels and uses ResNet-18 to extract only shallow features. The remaining 5% corresponds to the audio modality, for which processing is skipped: the HuBERT encoder is not run, an all-zero vector is used directly as the placeholder feature for the audio modality, and this modality receives zero weight in subsequent fusion stages.

[0266] When a line break is predicted, 80% of computational resources are prioritized for image features. 15% is allocated to simplified infrared processing, which downsamples the infrared thermal image to 112×112 pixels, skips the frequency domain anomaly enhancement step, and extracts only temperature field statistical features. The remaining 5% corresponds to the audio modality, for which processing is skipped in the same way as above.

[0267] When a motor bearing failure is predicted, 75% of computational resources are prioritized for audio features. 20% is allocated to simplified infrared processing. 5% is allocated to simplified image processing, which reduces the image feature extraction network to the shallow stages of ResNet-18 and attenuates the image input weight to 10% of its original value during multimodal fusion.
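The three routing strategies above amount to a lookup table from predicted fault type to per-modality resource share and processing mode. A minimal sketch follows; the dictionary keys and the `route` function are hypothetical names, while the shares and the skipped modalities come from the text.

```python
# Resource share per modality for each predicted fault type (from the text).
ROUTING = {
    "transformer_overheating": {"infrared": 0.85, "image": 0.10, "audio": 0.05},
    "line_break": {"image": 0.80, "infrared": 0.15, "audio": 0.05},
    "motor_bearing_failure": {"audio": 0.75, "infrared": 0.20, "image": 0.05},
}

# Modalities whose encoder is not run and is replaced by a zero placeholder.
SKIPPED = {
    "transformer_overheating": {"audio"},
    "line_break": {"audio"},
    "motor_bearing_failure": set(),
}

def route(fault_type):
    """Return share and processing mode per modality for a predicted fault."""
    shares = ROUTING[fault_type]
    priority = max(shares, key=shares.get)  # modality with the largest share
    plan = {}
    for modality, share in shares.items():
        if modality == priority:
            mode = "full"
        elif modality in SKIPPED[fault_type]:
            mode = "skipped"
        else:
            mode = "simplified"
        plan[modality] = {"share": share, "mode": mode}
    return plan
```

For example, `route("line_break")` marks the image modality as `"full"`, infrared as `"simplified"`, and audio as `"skipped"`.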

[0268] The dynamic pruning of modal features specifically involves intelligently cropping each modal feature map based on fault type and physical characteristics while ensuring that key information is not lost, specifically including:

[0269] Infrared features retain temperature hotspot areas for overheating faults, and the feature map size is cropped from 224×224 to 128×128.

[0270] The image features are focused on the fracture area of the line fault, and the feature map size is cropped from 224×224 to 160×160.

[0271] The audio features are focused on a specific frequency range for bearing failures, and the feature map length is cropped from 128 to 64.

[0272] The ultra-lightweight fault prediction module completes coarse fault classification with minimal computing power and time delay, providing decision guidance for subsequent inference. Based on the prediction results, a fault-specific routing strategy is designed to achieve precise on-demand allocation of computing power. The feature map is intelligently cropped by combining fault physical characteristics, reducing the amount of computation without losing key fault information, significantly reducing redundant inference calculations, and accelerating inference at the edge.

[0273] Furthermore, in S332, the computing resource scheduling includes a resource-aware scheduler, which monitors the CPU load and network latency of edge devices in real time and dynamically adjusts the inference strategy.

[0274] When CPU load is <60%, enable the full inference process, including:

[0275] 1. Full-modal parallel coding: Simultaneously perform full modal-specific encoder calculations on three types of preprocessed raw data: visible light images, infrared thermal imaging, and audio. Specifically, the image modality fully runs ResNet-50 up to stage 5, the infrared modality fully runs the Freq-ViT encoder, and the audio modality fully runs the HuBERT encoder.

[0276] 2. Bidirectional frequency domain cross-attention full fusion: The three-modal coding features are uniformly projected to 768 dimensions, and after performing a two-dimensional Fourier transform, a complete six-directional cross-attention calculation is performed to generate cross-modal interaction features of each modality;

[0277] 3. Full-category prototype matching: Calculate the cosine similarity between the fused features of the query set and the fault prototype vectors of all C classes constructed from the support set to obtain the matching score for each class;

[0278] 4. Temperature scaling and adaptive threshold decision: Perform temperature scaling calibration on the scores of all categories, and output the final fault classification result after adaptive threshold determination.

[0279] When the CPU load is ≥60%, it automatically switches to lightweight inference mode, handling only the two fault types with the highest confidence. Lightweight inference mode specifically includes:

[0280] 1. On-demand single-modal or dual-modal encoding: Based on the highest confidence fault type output by the lightweight fault type prediction module, only the priority mode encoder strongly associated with that fault type is activated. Specifically: if the prediction is transformer overheating, only the infrared mode Freq-ViT encoder is fully run; if the prediction is line breakage, only the image mode ResNet-50 encoder is fully run; if the prediction is motor bearing fault, only the audio mode HuBERT encoder is fully run; other non-priority modes are simplified or skipped, generating placeholder feature vectors.

[0281] 2. Unidirectional frequency domain attention or skip fusion: The cross-attention calculation is simplified from six-way bidirectional interaction to a unidirectional attention transfer only from the preferred modality to other modalities, or the frequency domain cross-attention module is skipped entirely, and features are directly fused based on the features of the preferred modality;

[0282] 3. Targeted Prototype Matching: Only the cosine similarity between the query set features and the prototype vectors of the two fault categories with the highest predicted confidence and their similar categories is calculated, reducing the number of matching categories from all C categories to at most 4 categories;

[0283] 4. Quick Threshold Determination: Skip the temperature scaling calibration step and directly compare the original matching score with a fixed threshold (e.g., 0.85). If the score is greater than the threshold, output the corresponding category; otherwise, classify it as "Other Faults" or "Requires Manual Review".

[0284] When the resource-aware scheduler detects that the CPU load exceeds the 60% threshold, the inference main control module sends a mode switching command to each computing unit:

[0285] Image encoder: If the current task is not the priority mode, ResNet-50 is bypassed and replaced with the first 3 stages of ResNet-18, and the input resolution is downsampled from 224×224 to 112×112;

[0286] Infrared encoder: If the current task is not the priority mode, the number of frequency domain attention iterations of Freq-ViT is reduced from 12 to 4.

[0287] Audio encoder: If the current task is not the priority mode, the HuBERT encoder is skipped and a 768-dimensional all-zero placeholder vector is output directly;

[0288] Attention fusion module: The cross-attention matrix dimension is reduced from 3×3 to 1×1 (only the preferred modality self-attention is retained) or 2×2 (bimodal interaction).

[0289] Prototype matching module: The number of fault category prototype vectors involved in the calculation is reduced from a total of C to a maximum of 4;

[0290] Decision module: Skip temperature scaling and switch to fixed threshold binary classification.

[0291] When network latency is greater than 50ms, the feature resolution is automatically reduced to 160×160.
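The scheduler's switching rules can be condensed into a single decision function. The sketch below uses the 60% CPU load threshold, the limit of 4 prototypes, and the 50 ms latency rule from the text; the `InferencePlan` structure, the function name, and the default resolution of 224 are assumptions made for illustration.

```python
from dataclasses import dataclass

CPU_LOAD_THRESHOLD = 0.60      # at or above this, switch to lightweight mode
LATENCY_THRESHOLD_MS = 50.0    # above this, reduce the feature resolution

@dataclass(frozen=True)
class InferencePlan:
    mode: str                  # "full" or "lightweight"
    prototype_count: int       # number of class prototypes to match against
    temperature_scaling: bool  # False means fixed-threshold decision
    feature_resolution: int    # side length of the feature map in pixels

def plan_inference(cpu_load, network_latency_ms, num_classes, default_resolution=224):
    """Choose the inference strategy from the current hardware state."""
    lightweight = cpu_load >= CPU_LOAD_THRESHOLD
    reduced = network_latency_ms > LATENCY_THRESHOLD_MS
    return InferencePlan(
        mode="lightweight" if lightweight else "full",
        prototype_count=min(4, num_classes) if lightweight else num_classes,
        temperature_scaling=not lightweight,
        feature_resolution=160 if reduced else default_resolution,
    )
```

The two conditions are independent here: a lightly loaded device on a slow link still runs the full pipeline, only at 160×160 resolution.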

[0292] The computing resource scheduling includes lossless computing pipeline reorganization, specifically: reconstructing the inference computing pipeline, executing fault type prediction and feature extraction in parallel, optimizing the execution order of high-frequency computing operations, reducing pipeline blockage through computing dependency analysis, and achieving lossless acceleration of inference;

[0293] The computing resource scheduling includes a fault feature caching mechanism, specifically: a fault feature cache pool is established to store recently detected initial fusion features. When an identical or similar fault is detected, that is, when the cosine similarity between features is >0.95, the features are retrieved directly from the cache and part of the computation is skipped, giving a cache hit rate ≥40%. The complete workflow of the fault feature caching mechanism is as follows:

[0294] 1. Cache construction: When a complete fault diagnosis inference process finishes successfully and the confidence of the final classification result reaches the required threshold, the 768-dimensional fused feature generated during this inference, together with its fault type label and key operating parameters (such as load current and ambient temperature), is encapsulated into a single record and stored in the fault feature cache pool. The cache pool maintains at least 50 of the most recent inference records using a first-in, first-out (FIFO) strategy.

[0295] 2. Cache matching: For a new input test sample, lightweight fault type prediction is first performed to obtain a coarse-grained fault category direction. Preliminary feature extraction is then performed using only this lightweight module to generate a low-cost query feature vector. Next, the cosine similarity is calculated between the query feature vector and the fused features of all records in the cache pool that share the same coarse-grained fault direction.

[0296] 3. Cache hit and reuse: When there exists a cache record whose fused feature has a cosine similarity with the query feature vector greater than 0.95, this is considered a cache hit. The computationally intensive process, including encoder calculation and frequency domain cross-attention fusion, is then skipped entirely, and the fused feature stored in the cache record is used directly as the fused feature output for this inference, entering the subsequent classification and decision calibration stage. At the same time, the timestamp of this cache record is updated.

[0297] 4. Cache update: When the target device's operating condition changes significantly during continuous operation, or multiple consecutive samples miss the cache, the system automatically triggers a complete inference process to calibrate the model and stores or updates the cache pool with new high-confidence fusion features.
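The four cache steps above map naturally onto a bounded FIFO queue with a similarity lookup. The sketch below is an illustration under stated assumptions: the class name `FaultFeatureCache`, the capacity of exactly 50, and the `min_confidence` value for admission are hypothetical (the admission threshold is missing from the text), while the 0.95 hit threshold comes from the text.

```python
import math
from collections import deque

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

class FaultFeatureCache:
    def __init__(self, capacity=50, min_confidence=0.9, hit_threshold=0.95):
        self.records = deque(maxlen=capacity)  # oldest record is evicted first
        self.min_confidence = min_confidence   # assumed admission threshold
        self.hit_threshold = hit_threshold
        self.clock = 0                         # logical timestamp

    def store(self, fused, fault_type, confidence, operating_params=None):
        """Cache a fused feature if the diagnosis was confident enough."""
        if confidence < self.min_confidence:
            return False
        self.clock += 1
        self.records.append({"fused": list(fused), "fault_type": fault_type,
                             "params": dict(operating_params or {}),
                             "timestamp": self.clock})
        return True

    def lookup(self, query, coarse_type):
        """Return the most similar record of the same coarse type, or None."""
        best, best_sim = None, self.hit_threshold
        for record in self.records:
            if record["fault_type"] != coarse_type:
                continue
            sim = cosine(query, record["fused"])
            if sim > best_sim:
                best, best_sim = record, sim
        if best is not None:
            self.clock += 1
            best["timestamp"] = self.clock  # refresh on hit
        return best
```

A `None` result corresponds to a cache miss, after which the complete inference process would run and its result could be offered to `store`.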

[0298] By using a resource-aware scheduler, the inference strategy is dynamically switched based on edge hardware load and network status to ensure a stable and smooth inference process. By reorganizing the computation pipeline, core operations are executed in parallel, reducing pipeline blockage and achieving lossless acceleration of inference. By using a fault feature caching mechanism to reuse high-frequency fault features, the overhead of repeated computation is reduced, inference efficiency is improved, and the adaptability of the model on different edge hardware is ensured.

[0299] Furthermore, in S333, the physical constraint reasoning specifically refers to:

[0300] Embed power equipment-specific physical constraints in each stage of the reasoning optimization process:

[0301] During infrared feature processing, thermodynamic formulas are applied to constrain the temperature change rate to <5℃ / min;

[0302] During image feature processing, device structure constraints are applied to ensure that the fracture area is within the device outline;

[0303] When processing audio features, acoustic propagation model constraints are applied, limiting the frequency range to 500-5000Hz;
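The three constraints above can be expressed as simple checks on intermediate results. How the constraints are enforced inside the model is not specified, so this sketch treats them as validation and filtering helpers; the function names and the axis-aligned box representation of regions are assumptions.

```python
def temperature_rate_ok(temps_c, interval_min, max_rate_c_per_min=5.0):
    """True if every consecutive temperature change stays below the limit."""
    return all(abs(b - a) / interval_min < max_rate_c_per_min
               for a, b in zip(temps_c, temps_c[1:]))

def band_limit(spectrum, low_hz=500.0, high_hz=5000.0):
    """Keep only spectral components inside the allowed frequency range.

    spectrum: dict frequency in Hz -> amplitude
    """
    return {f: a for f, a in spectrum.items() if low_hz <= f <= high_hz}

def region_inside(region, outline):
    """True if box `region` lies within box `outline`.

    Boxes are (x_min, y_min, x_max, y_max) in pixel coordinates.
    """
    return (outline[0] <= region[0] and outline[1] <= region[1]
            and region[2] <= outline[2] and region[3] <= outline[3])
```

A feature or diagnosis that fails one of these checks would be treated as physically implausible and filtered out or passed to the error feedback loop.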

[0304] The error feedback closed loop specifically involves: establishing an error feedback mechanism between the diagnostic results and the actual state of the equipment; automatically adjusting the feature priority weights when the diagnostic results do not match the actual state of the equipment; and continuously optimizing the feature priority strategy through learning to make the reasoning process conform to physical reality.

[0305] When the diagnosis indicates that the transformer is overheating but the actual temperature of the equipment is normal, the weight of the image features is automatically increased.

[0306] By embedding specific physical constraints of power equipment's thermodynamics, structure, and acoustics into the entire reasoning process, abnormal features and diagnostic results that violate operating rules are filtered out, eliminating misjudgments due to physical contradictions from the front end. An error feedback closed loop between diagnostic results and the actual state of the equipment is established, and feature priority weights are adjusted in reverse based on diagnostic errors to continuously optimize the reasoning strategy and ensure the physical rationality and field applicability of the diagnostic results.

[0307] Furthermore, S4 includes the following steps:

[0308] S41. Input the preprocessed three-modal features into the model, and use the bidirectional frequency domain attention Bi-FCFM mechanism to dynamically allocate the weights of each modal feature according to the differences in equipment fault types to complete feature fusion.

[0309] The feature fusion in S41 specifically involves: using a bidirectional frequency domain cross-attention fusion mechanism to fuse the preprocessed image, infrared, and audio modal features, adaptively allocating the weight of each modal feature according to the fault type corresponding to the data to be detected, and generating fused features for subsequent fault classification.

[0310] During the inference phase, the bidirectional frequency domain attention module is reused. For the preprocessed three-modal features of the input, they are first converted to frequency domain features via a two-dimensional Fourier transform, as shown in the formula:

[0311]

[0312] In the formula, the operator is the two-dimensional Fourier transform; intermodal information interaction is then achieved through a bidirectional cross-attention mechanism, with the formulas:

[0313]

[0314]

[0315]

[0316] Based on the fault type corresponding to the data to be detected, the adaptive weights of each mode are calculated using the homoscedastic uncertainty loss, as shown in the formula:

[0317]

[0318]

[0319] The final fault-adaptive fusion feature is generated using the following formula:

[0320]

[0321] The fused features are directly used for subsequent few-sample inference and fault classification.

[0322] During the inference phase, the bidirectional frequency domain attention module is reused to dynamically allocate modal fusion weights based on the fault type of the data to be detected. This enables the model to focus on the modal information most relevant to the current fault type during inference, suppresses interference from irrelevant modalities, improves the ability of fused features to discriminate the current fault, and ensures the accuracy of fault identification for different types during the inference phase.
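The frequency domain conversion and bidirectional cross-attention described above can be illustrated with a minimal NumPy sketch. It is not the patented Bi-FCFM module: the function names are hypothetical, only two modalities are paired, the magnitude spectrum is used to keep the features real-valued, and no learned projection matrices or fault-adaptive weights are included. All of these are simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def to_frequency(x):
    """2-D FFT of a (tokens, dim) feature map; the magnitude keeps it real-valued (assumption)."""
    return np.abs(np.fft.fft2(x))

def cross_attend(query_feat, context_feat):
    """Scaled dot-product attention of one modality over another, without learned projections."""
    d = query_feat.shape[-1]
    attn = softmax(query_feat @ context_feat.T / np.sqrt(d), axis=-1)
    return attn @ context_feat

def bidirectional_fuse(a, b):
    """Each modality attends to the other; the residual sum keeps its own content."""
    fa, fb = to_frequency(a), to_frequency(b)
    return fa + cross_attend(fa, fb), fb + cross_attend(fb, fa)
```

For three modalities, one possible extension is to apply `bidirectional_fuse` to each pair and combine the results with the fault-adaptive weights; how the original method combines them is not recoverable from the text.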

[0323] S42. Based on few-sample inference logic, a fault category prototype library is constructed using the support set. The cosine similarity between the query set features and the prototype library is calculated, and the homoscedastic uncertainty loss is used to automatically balance the weights of each modality, completing the preliminary fault matching. The specific process of constructing the fault category prototype library is as follows: for each fault category, using the support set samples labeled during the training phase, fusion features are extracted through the trained three-tower encoder and the bidirectional frequency domain cross-attention fusion mechanism. The prototype vector for that category is calculated using the following formula:

[0324] $$p_c = \frac{1}{|S_c|} \sum_{x_i \in S_c} f(x_i)$$

[0325] where $S_c$ is the support set of samples for the $c$-th fault class, $|S_c|$ is the number of samples of that class in the support set, and $f(x_i)$ is the 768-dimensional fusion feature vector extracted by the model for sample $x_i$.

[0326] The matching process between query set features and the prototype library is as follows: for a sample to be classified $x_q$, extract its fusion feature $f(x_q)$ and calculate the cosine similarity with the prototype vector $p_c$ of each category $c$ to obtain the preliminary matching score, using the following formula:

[0327] $$s_c = \frac{f(x_q) \cdot p_c}{\lVert f(x_q) \rVert \, \lVert p_c \rVert}$$

[0328] The preliminary classification result is the fault category with the highest similarity, and the formula is:

[0329] $$\hat{y} = \arg\max_{c} s_c$$
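The prototype construction and cosine matching steps can be sketched as follows. This is a minimal illustration that assumes the fused feature vectors are already available as NumPy arrays; the function names and the 4-dimensional toy vectors are hypothetical stand-ins for the 768-dimensional fusion features.

```python
import numpy as np

def build_prototypes(features, labels):
    """Average the fused feature vectors of each class in the support set."""
    classes = sorted(set(labels))
    labels = np.asarray(labels)
    protos = np.stack([features[labels == c].mean(axis=0) for c in classes])
    return classes, protos

def cosine_scores(query, protos, eps=1e-12):
    """Cosine similarity between one query vector and every class prototype."""
    q = query / (np.linalg.norm(query) + eps)
    p = protos / (np.linalg.norm(protos, axis=1, keepdims=True) + eps)
    return p @ q

def classify(query, classes, protos):
    """Return the class whose prototype is most similar to the query, plus all scores."""
    scores = cosine_scores(query, protos)
    return classes[int(np.argmax(scores))], scores

# Toy support set: two samples per fault class (labels are illustrative).
support = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.8, 0.2, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0],
                    [0.0, 0.9, 0.1, 0.0]])
support_labels = ["overheating", "overheating", "loose_connection", "loose_connection"]
classes, protos = build_prototypes(support, support_labels)
prediction, scores = classify(np.array([0.9, 0.1, 0.0, 0.0]), classes, protos)
```

Because prototypes are simple class means, adding a new fault class only requires a handful of labeled support samples and no retraining of the encoder, which is what makes this matching scheme suitable for few-sample settings.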

[0330] The specific implementation of the automatic balancing of modality weights using homoscedastic uncertainty loss is as follows: During prototype matching, matching scores with the corresponding modality prototype library are calculated for different modalities. The matching confidence weights for each modality are calculated using homoscedastic uncertainty loss, and the multimodal matching scores are then weighted and fused. The modality matching score calculation formula is as follows:

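[0331] $$s_c^{(m)} = \frac{f_m(x_q) \cdot p_c^{(m)}}{\lVert f_m(x_q) \rVert \, \lVert p_c^{(m)} \rVert}$$

[0332] where $f_m(x_q)$ is the encoded feature of the $m$-th modality of the query sample $x_q$, and $p_c^{(m)}$ is the prototype vector of the $m$-th modality for class $c$. The simplified form of the homoscedastic uncertainty loss is:

[0333] $$\mathcal{L} = \sum_{m} \left( \frac{1}{2\sigma_m^2} \mathcal{L}_m + \log \sigma_m \right)$$

[0334] where $\sigma_m^2$ is the prediction variance of the $m$-th modality and $\mathcal{L}_m$ is the matching loss of the $m$-th modality. The category corresponding to the highest weighted score is taken as the preliminary fault matching result.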
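A minimal sketch of uncertainty-based modality balancing follows. It uses the commonly cited homoscedastic uncertainty formulation with a log-variance parameter per modality, and it assumes that the matching confidence weights are the normalized inverse variances. That weighting rule, the function names, and the example numbers are assumptions for illustration, not details confirmed by the text.

```python
import numpy as np

def uncertainty_loss(modality_losses, log_vars):
    """Sum over modalities of L_m / (2 * sigma_m^2) + log(sigma_m), with log_vars = log(sigma_m^2)."""
    losses = np.asarray(modality_losses, dtype=float)
    s = np.asarray(log_vars, dtype=float)
    return float(np.sum(0.5 * np.exp(-s) * losses + 0.5 * s))

def precision_weights(log_vars):
    """Normalized inverse variances: a less certain modality gets a smaller weight (assumption)."""
    precision = np.exp(-np.asarray(log_vars, dtype=float))
    return precision / precision.sum()

def fuse_scores(per_modality_scores, log_vars):
    """Weighted sum of per-modality matching scores, shape (modalities, classes) -> (classes,)."""
    w = precision_weights(log_vars)
    return w @ np.asarray(per_modality_scores, dtype=float)

# Rows: image, infrared, audio; columns: two fault classes (illustrative values).
scores = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]]
fused = fuse_scores(scores, log_vars=[0.0, 0.0, 0.0])
```

In training, the log-variances would be learnable parameters optimized jointly with the encoders; here they are passed in as fixed numbers to keep the example self-contained.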

[0335] S43. Based on historical fault data, and considering equipment type and fault characteristics, adaptively adjust the classification thresholds for each category. Simultaneously, introduce temperature scaling to improve prediction reliability. Use a set probability threshold as the fault determination standard to complete dynamic decision calibration, achieving efficient equipment fault classification and reliable determination. The specific implementation of adaptively adjusting the classification thresholds for each category is as follows: using a confidence score as the core indicator, a differentiated judgment strategy is adopted for different types of faults. The confidence score is defined as:

[0336]

[0337] where the formula uses the highest matching score and $C$, the total number of fault categories.

[0338] For progressive faults such as transformer overheating and loose connections, an equalization threshold strategy is adopted, and the threshold is determined as follows:

[0339]

[0340] where the count term is the number of historical occurrences of class-$c$ faults. The larger this count, the richer the empirical data for this type of fault, and the threshold is appropriately lowered to improve detection sensitivity.

[0341] For sudden faults such as insulation breakdown and conductor breakage, a high-confidence threshold strategy is adopted, with the judgment threshold fixed at 0.92 to minimize the risk of false alarms.

[0342] The temperature scaling technique improves prediction reliability by learning the optimal temperature parameter T on the calibration set and scaling the classifier's output logits to make the predicted probability distribution more accurately reflect the true confidence level. The formula is as follows:

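[0343] $$\hat{p}_c = \frac{\exp(z_c / T)}{\sum_{j=1}^{C} \exp(z_j / T)}$$

[0344] where $z_c$ is the raw logit output for class $c$ and $\hat{p}_c$ is the calibrated probability. The temperature parameter $T$ is optimized by minimizing the negative log-likelihood loss on the calibration set:

[0345] $$\mathcal{L}_{\mathrm{NLL}}(T) = -\frac{1}{N} \sum_{i=1}^{N} \log \hat{p}_{i, y_i}$$

[0346] where $N$ is the number of calibration set samples, $y_i$ is the true category label of sample $i$, and $\hat{p}_{i, y_i}$ is the calibrated probability assigned to the true category of sample $i$. The typical range of the optimal temperature parameter is [1.2, 2.8].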
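Temperature scaling can be illustrated with a short NumPy sketch. The temperature is fitted here by a simple grid search over candidate values rather than by a gradient-based optimizer; the grid range, the function names, and the toy logits are assumptions made for illustration only.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Softmax of logits divided by temperature T (numerically stable)."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def nll(logits, labels, T):
    """Mean negative log-likelihood of the true labels at temperature T."""
    probs = softmax(logits, T)
    idx = np.arange(len(labels))
    return float(-np.mean(np.log(probs[idx, labels] + 1e-12)))

def fit_temperature(logits, labels, grid=None):
    """Pick the temperature on the grid that minimizes calibration-set NLL."""
    if grid is None:
        grid = np.linspace(0.5, 5.0, 91)  # hypothetical search range
    losses = [nll(logits, labels, T) for T in grid]
    return float(grid[int(np.argmin(losses))])

# Overconfident toy classifier: very sharp logits, but one of four predictions is wrong.
cal_logits = np.array([[5.0, 0.0], [5.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
cal_labels = np.array([0, 0, 1, 1])
T_opt = fit_temperature(cal_logits, cal_labels)
calibrated = softmax(cal_logits, T_opt)
```

A fitted temperature above 1 flattens the probability distribution, which is the expected correction when the raw classifier is more confident than its accuracy justifies. Scaling by a single positive scalar does not change the predicted class.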

[0347] The fault determination process is as follows: calculate the calibrated probability $\hat{p}$ of the sample to be classified after temperature scaling, and compare it with the adaptive threshold $\tau_c$:

[0348] When $\hat{p} \ge \tau_c$, the corresponding fault category is directly output as the classification result.

[0349] When $0.55 \le \hat{p} < \tau_c$, multimodal cross-validation is initiated, and the independent inference results of each modality are calculated separately. If the classification results of at least two of the three modalities are consistent, the consistent result is output; otherwise, the sample is judged as "suspected fault, manual review is recommended".

[0350] When $\hat{p} < 0.55$, it is determined that "the fault characteristics are not significant" and the normal status is output.
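The three-level determination process can be sketched as a small decision function. This is an illustration under stated assumptions: the first branch is taken when the calibrated probability reaches the adaptive threshold, the middle band runs from 0.55 up to that threshold, and the function name, argument names, and class labels are hypothetical.

```python
from collections import Counter

LOW_THRESHOLD = 0.55  # lower bound of the cross-validation band stated in the text
REVIEW = "suspected fault, manual review is recommended"

def decide(calibrated_prob, fused_label, adaptive_threshold, modality_labels):
    """Map a calibrated probability to a final decision using three confidence bands."""
    if calibrated_prob >= adaptive_threshold:
        # High confidence: output the fused prediction directly.
        return fused_label
    if calibrated_prob >= LOW_THRESHOLD:
        # Medium confidence: require agreement of at least two of the three modalities.
        label, votes = Counter(modality_labels).most_common(1)[0]
        return label if votes >= 2 else REVIEW
    # Low confidence: fault characteristics are not significant.
    return "normal"

# A sudden-fault class with the fixed 0.92 threshold mentioned in the text.
result = decide(0.70, "insulation_breakdown", 0.92,
                ["insulation_breakdown", "insulation_breakdown", "normal"])
```

Routing medium-confidence samples through a per-modality vote trades some latency for fewer false alarms, since a fault is only reported when independent sensors agree.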

[0351] By leveraging the synergistic effect of adaptive threshold adjustment, temperature scaling calibration, and multi-level decision-making strategies, the impact of model prediction uncertainty on classification results in small sample scenarios is effectively reduced, achieving highly reliable output of fault diagnosis results.

[0352] During the inference phase, a bidirectional frequency domain cross-attention mechanism is used to achieve feature fusion and enhance the feature representation of the fault to be detected. Then, based on prototype learning, few-sample faults are matched accurately, and the contribution of each modality is balanced by the homoscedastic uncertainty loss. Finally, adaptive classification threshold adjustment and temperature scaling address the misjudgments caused by differences in the sample distribution of different faults and by inaccurate prediction confidence, so that highly reliable power equipment fault classification results are output.

[0353] Example 2:

[0354] The present invention also provides a few-sample power equipment image classification device based on multimodal contrastive learning, comprising:

[0355] The multimodal data acquisition and preprocessing module is used to build an acquisition system through a central server, edge acquisition nodes and multimodal sensors. Relying on the time synchronization unit composed of NTP server and PTP hardware clock module, it adopts a hybrid time synchronization protocol of NTP and PTP to achieve microsecond-level time synchronization of the acquisition system. It completes the synchronous acquisition of three types of data: visible light image, infrared thermal imaging and audio of power equipment. It performs modality-specific preprocessing on each type of data and completes multimodal feature fusion through the bidirectional frequency domain cross-attention fusion mechanism Bi-FCFM.

[0356] The multimodal contrastive learning model training module is connected to the multimodal data acquisition and preprocessing module. It is used to sequentially perform single-modal supervised contrastive learning, cross-modal self-supervised alignment, and multimodal fusion contrastive learning to complete the basic training of the classification model. It also combines modality adaptive data augmentation, pseudo-label generation and iterative optimization, and adaptive regularization mechanism to complete the training optimization of the model in extreme small sample scenarios.

[0357] The model lightweighting and edge deployment module is connected to the multimodal contrastive learning model training module. It is used to complete model lightweighting and compression through knowledge distillation, joint depth and breadth pruning combined with the ESA algorithm, and model quantization. The compressed model is then converted to a new format and deployed to the power inspection edge device. At the same time, a real-time inference optimization strategy is adopted to ensure the inference efficiency of the model at the edge.

[0358] The classification prediction and decision calibration module is connected to the multimodal data acquisition and preprocessing module and the model lightweighting and edge deployment module, respectively. It is used to input the preprocessed multimodal features into the trained model, obtain the preliminary classification result of power equipment faults through feature fusion and few-sample inference, and output the final power equipment fault classification result after decision calibration.

[0359] The above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention. It should be noted that for those skilled in the art, several improvements and modifications can be made without departing from the technical principles of the present invention, and these improvements and modifications should also be considered within the scope of protection of the present invention.

Claims

1. A few-sample power equipment image classification method based on multimodal contrastive learning, characterized in that it includes the following steps: S1. Multimodal data synchronous acquisition and preprocessing: an acquisition system is constructed from a central server, edge acquisition nodes, and multimodal sensors, together with a time synchronization unit composed of an NTP server and a PTP hardware clock module; microsecond-level time synchronization of the acquisition system is achieved through the hybrid NTP and PTP time synchronization protocol; image, infrared, and audio data are acquired, modality-specific preprocessing is performed on each, and multimodal feature fusion is completed through a bidirectional frequency domain cross-attention fusion mechanism; S2. Multimodal contrastive learning model training: sequentially perform single-modal supervised contrastive learning, cross-modal self-supervised alignment, and multimodal fusion contrastive learning, and combine modality-adaptive data augmentation, pseudo-label generation and iterative optimization, and adaptive regularization to complete few-sample training optimization; S3. Model lightweighting and edge deployment: model compression is achieved through knowledge distillation, joint depth and breadth pruning combined with the ESA algorithm, and model quantization; the model is then converted to a new format and deployed to edge devices, and real-time inference optimization is used to ensure inference efficiency; S4. Classification prediction and decision calibration: the preprocessed multimodal features are input into the trained model, preliminary classification results are obtained through feature fusion and few-sample inference, and after decision calibration, the final power equipment fault classification results are output.

2. The few-sample power equipment image classification method based on multimodal contrastive learning according to claim 1, characterized in that, S1 includes the following steps: S11. Multimodal data synchronous acquisition: A data acquisition system consisting of a central server, edge acquisition nodes, and multimodal sensors is built. The time synchronization unit is equipped with an NTP server and a PTP hardware clock module. The NTP server and PTP hybrid time synchronization protocol is used to achieve microsecond-level time synchronization of the entire system. The time alignment of image, audio, and infrared data with different acquisition frequencies is completed through an adaptive time alignment algorithm. Data acquisition, timestamp processing, and structured storage are completed according to a standardized process. S12. Data Preprocessing: Perform multi-dimensional standardized preprocessing on visible light images, infrared thermal imaging data, and audio signals to complete size normalization, noise reduction, and enhancement processing of the infrared image data; perform frame filtering, feature extraction, and dimensionality reduction on the audio signals to output preprocessed data that meets the model input requirements. S13. Multimodal Feature Fusion: A three-tower encoder is used to extract the original features of the visible light image, infrared image, and audio three modalities respectively, and the dimension is unified by a linear projection layer. The initial fusion of frequency domain features of each modality is completed based on frequency domain feature extraction and bidirectional frequency domain cross-attention fusion mechanism. The basic fusion weight of each modality is calculated by combining the homoscedasticity uncertainty loss of each modality feature. After optimization by a lightweight MLP, the final fusion weight is adjusted by the working condition adaptive module, and the optimized multimodal fusion feature is output.

3. The few-sample power equipment image classification method based on multimodal contrastive learning according to claim 1, characterized in that, S2 includes the following steps: S21. Single-modal supervised contrastive learning augmentation: Modality-adaptive data augmentation is performed on image, infrared, and audio data respectively. Then, the corresponding single-modal features are extracted from the augmented data using the ResNet-50 encoder for the image modality, the Freq-ViT encoder for the infrared modality, and the HuBERT encoder for the audio modality. Each encoder is independently trained and iteratively updated based on the improved supervised contrastive loss function to improve the discriminativeness of single-modal features. S22, Cross-modal self-supervised alignment: The output features of each modal encoder are mapped to a unified shared feature space through a cross-modal alignment weight matrix. A cross-modal contrastive loss function is constructed using multimodal data from the same device as positive samples and multimodal data from different devices as negative samples. The cross-modal alignment weight matrix is optimized to achieve alignment of the multimodal feature space. S23. Multimodal fusion contrast learning: Dynamically weighted fusion of spatially aligned multimodal features, calculation of cosine similarity between fused features, construction of multimodal fusion contrast loss function based on fused features, iterative optimization of fusion weights of each modality, and obtaining the optimal multimodal fusion feature; S24. Small Sample Training Optimization: Through modal adaptive data augmentation, high-confidence pseudo-label generation and iterative optimization with physical consistency verification, and adaptive regularization, training configuration is performed to expand effective training data and improve the training effect and generalization ability of the model under small sample conditions.

4. The few-sample power equipment image classification method based on multimodal contrastive learning according to claim 1, characterized in that, S3 includes the following steps: S31. Lightweight Model Design: Knowledge distillation technique using the inverse KL divergence strategy is employed to compress the number of parameters in a large model; By combining the depth and breadth pruning techniques of the ESA algorithm to remove redundant filters in each convolutional layer, the number of model parameters is compressed to less than 5% of the original number of parameters. The model is quantized using the GPTQ approximate second-order quantization technique, reducing the computational resource requirements. S32, Edge Device Adaptation: Convert the PyTorch format model to ONNX format and deploy it to Jetson Xavier NX or substation robot edge devices via the OpenCVDNN module; Hybrid precision quantization technology is used to dynamically adjust the quantization bit width, balancing inference accuracy and speed; S33. Real-time inference optimization: The inference front-end is optimized by lightweight prediction of fault types, dynamic routing of feature processing, and dynamic pruning of modal features. Combined with the computing resource scheduling with a built-in fault feature caching mechanism, the real-time performance of the entire inference process is improved.

5. The few-sample power equipment image classification method based on multimodal contrastive learning according to claim 1, characterized in that S4 includes the following steps: S41. Input the preprocessed image, infrared, and audio modal features into the model, and dynamically allocate the weights of each modal feature according to the differences in equipment fault types through a bidirectional frequency domain cross-attention mechanism to complete feature fusion. S42. Based on few-sample inference logic, a fault category prototype library is constructed using the support set. The cosine similarity between the query set features and the prototype library is calculated, and the weights of each modality are automatically balanced using the homoscedastic uncertainty loss, completing the preliminary fault matching. S43. Based on historical fault data, the classification thresholds for each category are adaptively adjusted according to the equipment type and fault characteristics. At the same time, temperature scaling is introduced to improve prediction reliability. The set probability threshold is used as the fault judgment standard to complete dynamic decision calibration and realize efficient classification and reliable judgment of equipment faults.

6. The few-sample power equipment image classification method based on multimodal contrastive learning according to claim 2, characterized in that the microsecond-level time synchronization uses a hybrid NTP and PTP time synchronization protocol to generate a precise microsecond-level timestamp. The timestamp is calculated from the PTP-synchronized system time, the network latency compensation value, and the pre-calibrated inherent delay of the sensor. The central server broadcasts the PTP timestamp to all edge nodes. After receiving the PTP timestamp, the edge nodes calculate and compensate for the network delay. Combined with the pre-calibrated inherent delay of the sensor, a microsecond-level accurate timestamp is generated through the timestamp calculation formula to achieve a time synchronization accuracy of ±10μs for the entire system. The specific implementation of the adaptive time alignment algorithm is as follows: based on the image acquisition frame rate, the alignment point of each modal data is calculated; feature vectors matching the frame rate are generated from the audio data by interpolation and from the infrared data by inter-frame difference, completing the precise alignment of each modal data on the time axis. After completing the time alignment of image, audio, and infrared data at different acquisition frequencies, the data is encapsulated into structured data packets containing a unique data ID, modality type, microsecond-level acquisition timestamp, device ID, and acquisition location coordinates, following the process of system initialization, data acquisition, timestamp processing, time alignment, and storage, ensuring that the acquisition latency is ≤200ms.

7. The few-sample power equipment image classification method based on multimodal contrastive learning according to claim 1 or 2, characterized in that, The image and infrared data preprocessing involves performing standardized preprocessing on the acquired visible light images and infrared thermal imaging data. This includes four operations: size normalization and pixel normalization, infrared image denoising and temperature correction, frequency domain anomaly enhancement, and multi-scale feature encoding preparation. Specifically, all images are uniformly scaled to 224×224 pixels and channel-level normalization is performed. Adaptive denoising of infrared images affected by environmental interference is performed using dual-tree complex wavelet transform. Anomaly spectral components are enhanced in the Fourier domain through a frequency domain mask enhancement strategy. Finally, the preprocessed image data is output. The audio data preprocessing employs pre-emphasis processing. The system is framed with a frame length of 32ms and a frame shift of 16ms, and Hamming window filtering is applied. Mel spectrum or GFCC features are extracted, and key feature dimensions are filtered by SVM-RFE, retaining the first 20 dimensions to reduce computational complexity.

8. The few-sample power equipment image classification method based on multimodal contrastive learning according to claim 1 or 2, characterized in that, The specific implementation process of the bidirectional frequency domain cross-attention fusion mechanism is as follows: Multimodal feature extraction and dimensionality unification: Three encoders are used to extract the native features of visible light images, infrared images, and audio respectively. The native features of all modal visible light images, infrared images, and audio are uniformly mapped to 768 dimensions through a linear projection layer. Frequency domain feature extraction and bidirectional cross-attention fusion: Two-dimensional Fourier transform is performed on the dimensional unified projection features of each modality to convert them into frequency domain features; through the bidirectional frequency domain cross-attention fusion mechanism, mutual attention, dynamic interaction and preliminary fusion of frequency domain features of each modality are realized; Dynamic weight adjustment based on homoscedastic uncertainty loss: calculate the variance and mean of each modal feature to obtain the uncertainty parameter, determine the basic weights for each modal fusion, and perform weighted integration of the preliminary fusion features to obtain the basic fusion features; Feature optimization after fusion: The basic fusion features are input into a lightweight multilayer perceptron (MLP). The basic fusion features are processed and optimized through linear transformation of the MLP and nonlinear expression enhancement of the activation function, and finally the optimized fusion features of 768 dimensions are output. 
Secondary adjustment of operating condition adaptive modal weights: Real-time acquisition of operating condition parameters such as load current, ambient temperature, and historical temperature rise rate of the equipment; construction of an operating condition-modal correlation mapping function based on the operating condition parameters to obtain the operating condition modulation factor for each mode; The operating condition modulation factor is multiplied by the fusion basis weights calculated from the homoscedasticity uncertainty loss to obtain the final fusion weights for each mode, and the optimized fusion features are then subjected to a secondary weight adjustment.

9. The few-sample power equipment image classification method based on multimodal contrastive learning according to claim 3, characterized in that, The improved supervised contrastive loss function sets the batch size to 256 and the temperature parameter to 0.1. It only uses samples with the same label as positive samples and constructs a negative sample set using all non-same-label samples within the same batch. The loss value is calculated based on cosine similarity. The cross-modal contrastive loss function uses multimodal data from the same device as positive samples and multimodal data from different devices as negative samples. The optimization objective is to maximize the cross-modal feature similarity of the same device and minimize the cross-modal feature similarity of different devices. The training epochs are 30 and the batch size is 128.

10. The few-sample power equipment image classification method based on multimodal contrastive learning according to claim 1 or 3, characterized in that, The aforementioned multimodal fusion contrastive learning specifically includes the following steps: Using spatially aligned image, infrared, and audio modal features as input, dynamic fusion weights are assigned to each modality to generate fusion features; Calculate the cosine similarity between fused features, and construct a multimodal fusion contrast loss function with fused features from the same device as positive samples and fused features from different devices as negative samples; The fusion weights are iteratively optimized with the goal of minimizing the fusion contrast loss to obtain the optimal multimodal fusion features.

11. The few-sample power equipment image classification method based on multimodal contrastive learning according to claim 1 or 3, characterized in that, The modality-adaptive data augmentation includes: The image modality employs random cropping, color dithering, and small-angle rotation; The infrared mode employs temperature scaling, Gaussian noise, and horizontal flipping. The audio modality employs time stretching, reverb addition, and pitch adjustment; The pseudo-label generation and iterative optimization include: Use the trained model to generate pseudo-labeled samples with a confidence level greater than 0.85; The consistency verification module for operational status verifies the physical and logical consistency between the multimodal characteristics of the pseudo-label samples and the operating parameters, thus filtering out contradictory samples. The total loss function is constructed by fusing supervised contrastive loss and pseudo-label cross-entropy loss, and pseudo-label accuracy is improved through up to 5 iterations of training. The adaptive regularization includes: Regularization terms are constructed based on the Frobenius norm of the weight matrix; The regularization strength adaptively increases as the number of samples in each class decreases to suppress overfitting on small samples.

12. The few-sample power equipment image classification method based on multimodal contrastive learning according to claim 1 or 4, characterized in that, The real-time inference optimization specifically includes the following steps: S331: Fault Feature Priority Awareness: Determines the fault type and confidence level through lightweight fault type prediction, and realizes dynamic routing of feature processing and dynamic pruning of modal features based on the prediction results; S332: Computational resource scheduling: Real-time perception of edge device hardware status, maximizing inference efficiency through resource scheduling, computational pipeline reorganization, and fault feature caching; S333: Physical consistency guarantee: Embed physical constraints of power equipment during the inference optimization process and establish an error feedback closed loop to ensure that the inference results conform to physical laws.

13. The few-sample power equipment image classification method based on multimodal contrastive learning according to claim 5, characterized in that, The feature fusion in S41 specifically involves: using a bidirectional frequency domain cross-attention fusion mechanism to fuse the preprocessed image, infrared, and audio modal features, adaptively allocating the weights of each modal feature according to the fault type corresponding to the data to be detected, and generating fused features for subsequent fault classification.

14. The few-sample power equipment image classification method based on multimodal contrastive learning according to claim 12, characterized in that, In S331, the lightweight fault type prediction specifically refers to: A lightweight fault type prediction module with less than 50K parameters and based on MobileNetV2 is integrated at the front end of the inference process. This module takes the input data as input, outputs the possible fault types and their confidence levels within 15ms, and consumes less than 5% of the computing resources, providing a basis for decision-making for subsequent inference optimization. The feature processing dynamic routing specifically involves: allocating computing resources based on the fault type prediction results, and adopting dedicated routing strategies for different fault types. When the transformer is predicted to be overheating, 85% of the computing resources are prioritized for infrared features, 10% are used to simplify image processing, and 5% are skipped for audio processing. When a line break is predicted, 80% of computing resources are prioritized for image features, 15% are used to simplify infrared processing, and 5% are skipped for audio processing. When the fault is predicted to be a motor bearing failure, 75% of the computing resources are prioritized for audio features, 20% for simplifying infrared processing, and 5% for simplifying image processing. The dynamic trimming of modal features specifically involves: intelligently trimming each modal feature map based on fault type and physical characteristics to ensure that key information is not lost, specifically including: Infrared features retain temperature hotspot areas for overheating faults, and the feature map size is cropped from 224×224 to 128×128. The image features are focused on the fracture area of the line fault, and the feature map size is cropped from 224×224 to 160×160. 
The audio features are focused on a specific frequency range for bearing failures, and the feature map length is cropped from 128 to 64.

15. The few-sample power equipment image classification method based on multimodal contrastive learning according to claim 12, characterized in that, The computing resource scheduling in S332 includes a resource-aware scheduler that monitors the CPU load and network latency of edge devices in real time and dynamically adjusts the inference strategy. When CPU load is less than 60%, enable the full inference process; When the CPU load is ≥60%, it automatically switches to lightweight inference mode, only handling the two fault types with the highest confidence. When network latency is greater than 50ms, the feature resolution is automatically reduced to 160×160. The computing resource scheduling includes lossless computing pipeline reorganization, specifically: reconstructing the inference computing pipeline, executing fault type prediction and feature extraction in parallel, optimizing the execution order of high-frequency computing operations, reducing pipeline blockage through computing dependency analysis, and achieving lossless acceleration of inference; The computing resource scheduling includes a fault feature caching mechanism, specifically: establishing a fault feature cache pool to store recently detected key fault features; when the same or similar faults are detected, and the cosine similarity of the fault similarity is >0.95, the features are directly obtained from the cache, skipping part of the calculation, and the cache hit rate is ≥40%.

16. The few-sample power equipment image classification method based on multimodal contrastive learning according to claim 12, characterized in that, embedding power-equipment-specific physical constraints in each stage of inference optimization in S333 specifically comprises: during infrared feature processing, applying thermodynamic formulas to constrain the temperature change rate to <5℃/min; during image feature processing, applying device structure constraints to ensure that the fracture area lies within the device outline; during audio feature processing, applying acoustic propagation model constraints to limit the frequency range to 500-5000 Hz; the error feedback closed loop specifically comprises: establishing an error feedback mechanism between the diagnostic results and the actual state of the equipment; automatically adjusting the feature priority weights when a diagnostic result does not match the actual state of the equipment; and continuously optimizing the feature priority strategy through learning so that the inference process conforms to physical reality; for example, when the diagnosis indicates transformer overheating but the actual temperature of the equipment is normal, the weight of the image features is automatically increased.
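The three physical constraints and the feedback adjustment in claim 16 can be expressed as simple checks. The Python sketch below is illustrative only and makes several assumptions not found in the claims: temperatures are sampled at a fixed interval, regions are axis-aligned bounding boxes `(x0, y0, x1, y1)`, the audio spectrum is a list of `(frequency_hz, magnitude)` pairs, and the feedback step size of 0.1 with renormalization is a placeholder for whatever learned update the method actually uses. All function names are hypothetical.

```python
MAX_TEMP_RATE_C_PER_MIN = 5.0    # claim 16: temperature change rate < 5 degC/min
AUDIO_BAND_HZ = (500.0, 5000.0)  # claim 16: audio limited to 500-5000 Hz


def temperature_rate_ok(temps_c, interval_min):
    """True if every consecutive temperature change is below 5 degC/min."""
    return all(
        abs(later - earlier) / interval_min < MAX_TEMP_RATE_C_PER_MIN
        for earlier, later in zip(temps_c, temps_c[1:])
    )


def fracture_inside_outline(fracture_box, outline_box):
    """True if the fracture box lies entirely within the device outline box."""
    fx0, fy0, fx1, fy1 = fracture_box
    ox0, oy0, ox1, oy1 = outline_box
    return ox0 <= fx0 and oy0 <= fy0 and fx1 <= ox1 and fy1 <= oy1


def band_limit(spectrum):
    """Keep only (frequency_hz, magnitude) pairs inside the 500-5000 Hz band."""
    low, high = AUDIO_BAND_HZ
    return [(f, m) for f, m in spectrum if low <= f <= high]


def feedback_update(weights, boosted, step=0.1):
    """Raise one modality's priority weight and renormalize to sum to 1.

    The additive step and the renormalization are assumptions; the claim only
    states that the weight is adjusted when diagnosis and reality disagree.
    """
    updated = dict(weights)
    updated[boosted] += step
    total = sum(updated.values())
    return {name: value / total for name, value in updated.items()}


if __name__ == "__main__":
    # Diagnosis said "transformer overheating" but the measured temperature is
    # normal, so the image-feature weight is increased (the case in claim 16).
    weights = {"infrared": 0.85, "image": 0.10, "audio": 0.05}
    print(feedback_update(weights, "image"))
```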

17. A small-sample power equipment image classification device based on multimodal contrastive learning, characterized by comprising: a multimodal data acquisition and preprocessing module, configured to build an acquisition system from a central server, edge acquisition nodes and multimodal sensors; to achieve microsecond-level time synchronization of the acquisition system by means of a time synchronization unit composed of an NTP server and a PTP hardware clock module, using a hybrid NTP and PTP time synchronization protocol; to synchronously acquire three types of data of the power equipment, namely visible light images, infrared thermal images and audio; to perform modality-specific preprocessing on each type of data; and to complete multimodal feature fusion through a bidirectional frequency domain cross-attention fusion mechanism; a multimodal contrastive learning model training module, connected to the multimodal data acquisition and preprocessing module and configured to sequentially perform single-modal supervised contrastive learning, cross-modal self-supervised alignment, and multimodal fusion contrastive learning to complete the basic training of the classification model, and to combine modality-adaptive data augmentation, pseudo-label generation with iterative optimization, and an adaptive regularization mechanism to complete the training optimization of the model in extreme small-sample scenarios; a model lightweighting and edge deployment module, connected to the multimodal contrastive learning model training module and configured to lightweight and compress the model through knowledge distillation, joint depth and width pruning combined with the ESA algorithm, and model quantization, the compressed model then being format-converted and deployed to the power inspection edge device, with a real-time inference optimization strategy adopted to ensure the inference efficiency of the model at the edge; and a classification prediction and decision calibration module, connected to the multimodal data acquisition and preprocessing module and to the model lightweighting and edge deployment module respectively, and configured to input the preprocessed multimodal features into the trained model, obtain a preliminary classification result of power equipment faults through feature fusion and few-sample inference, and output the final power equipment fault classification result after decision calibration.