A method and system for intelligent monitoring and early warning of power plant equipment defects based on a multi-modal large model

By fusing image, sound, sensor, and text data through a multimodal Transformer model, intelligent monitoring and early warning of thermal power plant equipment can be achieved. This solves the limitations and poor adaptability of single-modal monitoring, improves monitoring accuracy and early warning capabilities, and adapts to different equipment and environments.

CN120449035B | Active | Publication Date: 2026-06-30 | ANHUI ELECTRIC POWER DESIGN INST CEEC


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: ANHUI ELECTRIC POWER DESIGN INST CEEC
Filing Date: 2025-04-25
Publication Date: 2026-06-30

Patent Timeline

25 Apr 2025: Application
30 Jun 2026: Publication (CN120449035B)

IPC: G06F18/2433; G06F18/2415; G06F18/25; G06F18/15; G06N3/0455; G06N3/0895; G06N3/096; G06T7/00; G06F11/07; G06Q10/20; G06Q50/06

Technology Topics: Scale model, Transformer



Abstract

This invention relates to an intelligent monitoring and early warning method for equipment defects in thermal power plants based on a multimodal large-scale model, comprising: real-time collection of equipment operating data, i.e., multimodal data; construction of a multimodal Transformer model; cross-modal data fusion and analysis; calculation of equipment anomaly degree; prediction of future equipment status and the probability of potential failures; and optimization of the multimodal Transformer model. This invention also discloses an intelligent monitoring and early warning system for equipment defects in thermal power plants based on a multimodal large-scale model. This invention employs a multimodal Transformer model, combining image, sound, sensor, and text data for cross-modal feature fusion, resulting in higher accuracy and lower false alarm and false negative rates. Trend analysis of equipment status can predict the time of failure occurrence, significantly improving prediction lead time. By combining anomaly detection and prediction, the probability of sudden failures is reduced. It also exhibits strong adaptability across different thermal power plants and different equipment.


Description

Technical Field

[0001] This invention relates to the field of intelligent monitoring and early warning technology for power equipment, and in particular to an intelligent monitoring and early warning method and system for equipment defects in thermal power plants based on a multimodal large model.

Background Technology

[0002] Equipment monitoring in thermal power plants involves real-time collection and analysis of operational data from key equipment such as instrument transformers, energy meters, and sensors to assess equipment status, detect anomalies, and predict potential failures. Its goal is to improve equipment reliability, reduce downtime, and optimize operational efficiency. Currently, equipment monitoring methods in thermal power plants mainly suffer from the following problems:

[0003] First, traditional monitoring methods for thermal power plant equipment mainly rely on single data modalities, such as sensor data for temperature, vibration, and current, and lack the ability to fuse multimodal information such as images, sound, and text, resulting in incomplete monitoring information and affecting the accuracy of fault identification.

[0004] Second, traditional shallow machine learning methods and rule-based expert systems struggle to cope with the complex operating environment of thermal power plants, especially hidden faults, nonlinear fault modes, and equipment aging.

[0005] Third, existing monitoring methods often can only detect problems after a failure has occurred, making it difficult to provide sufficient early warning.

[0006] Fourth, there are significant differences in equipment types, environments, and working conditions among different thermal power plants. When traditional fault monitoring methods are migrated to different equipment or new environments, they require a large amount of manual adjustment and labeling, as well as a large number of professional engineers to conduct data analysis and equipment inspection. This is time-consuming and labor-intensive, and is subject to strong subjective factors, making it difficult to generalize and perform adaptive optimization.

Summary of the Invention

[0007] Existing monitoring methods for thermal power plant equipment rely on single-modal monitoring and therefore suffer from limited coverage, poor adaptability, and low monitoring accuracy. To address these problems, the primary objective of this invention is to provide an intelligent monitoring and early warning method for thermal power plant equipment defects based on a multimodal large model. By fusing cross-modal features, the method improves monitoring accuracy, reduces false alarm and false negative rates, lowers the probability of sudden failures, and adapts well to different equipment and environments.

[0008] To achieve the above objectives, the present invention adopts the following technical solution: a method for intelligent monitoring and early warning of equipment defects in thermal power plants based on a multimodal large model, the method comprising the following sequential steps:

[0009] (1) Collect the equipment's operating data, i.e., multimodal data, in real time. The multimodal data includes image data, sound data, sensor data, and text data. Transmit the multimodal data to the multimodal Transformer model;

[0010] (2) Construct a multimodal Transformer model: the BEiT-3 model, Whisper model, Informer model, T5 model, Flamingo model, and autoencoder together form the multimodal Transformer model, which encodes the different types of multimodal data and generates a unified multimodal representation;

[0011] (3) Perform cross-modal data fusion and analysis on the unified multimodal representation to obtain unified fusion features;

[0012] (4) Based on the unified fusion features, calculate the anomaly degree of the equipment using contrastive learning and the autoencoder to determine whether the equipment is abnormal;

[0013] (5) Based on the unified fusion features, predict the future equipment status and the probability of potential failures, and provide operation and maintenance suggestions and early warnings;

[0014] (6) Optimize the multimodal Transformer model through incremental learning and knowledge distillation.

[0015] Step (1) specifically includes the following steps:

[0016] (1a) Image acquisition: High-definition industrial cameras installed where the electrical equipment is visible acquire images of the equipment's exterior to detect physical defects. The acquired images are denoised with Gaussian filtering to minimize the impact of noise. Gaussian filtering is applied to the pixel at position $(x,y)$ of the image captured at time $t_i$:

[0017] $$I_{\text{filtered}}(x,y,t_i)=\sum_{u}\sum_{v}\frac{1}{2\pi\sigma^{2}}\exp\!\left(-\frac{(x-u)^{2}+(y-v)^{2}}{2\sigma^{2}}\right)I(u,v,t_i)$$

[0018] where $\sigma$ is the standard deviation of the Gaussian kernel; $I_{\text{filtered}}(x,y,t_i)$ is the Gaussian-filtered value of the pixel at position $(x,y)$ at time $t_i$; and $I(u,v,t_i)$ is the pixel value at position $(u,v)$ of the image captured at time $t_i$;
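The Gaussian denoising of step (1a) can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: the function names, the truncation radius, the renormalization of the truncated kernel, and the edge-clamping border rule are all assumptions made for the example.

```python
import math

def gaussian_kernel(radius, sigma):
    """Return a (2r+1) x (2r+1) Gaussian kernel, renormalized to sum to 1."""
    raw = [[math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
            for dx in range(-radius, radius + 1)]
           for dy in range(-radius, radius + 1)]
    total = sum(sum(row) for row in raw)
    return [[w / total for w in row] for row in raw]

def gaussian_filter(image, sigma=1.0, radius=2):
    """Denoise a grayscale image (list of rows) with a truncated Gaussian kernel.

    Border pixels are handled by clamping coordinates to the image edge
    (an assumption; the source does not specify border handling).
    """
    h, w = len(image), len(image[0])
    kernel = gaussian_kernel(radius, sigma)
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    v = min(max(y + dy, 0), h - 1)  # clamp row index
                    u = min(max(x + dx, 0), w - 1)  # clamp column index
                    acc += kernel[dy + radius][dx + radius] * image[v][u]
            out[y][x] = acc
    return out

# A flat image with one noisy pixel: filtering spreads and attenuates the spike.
noisy = [[10.0] * 5 for _ in range(5)]
noisy[2][2] = 100.0
smoothed = gaussian_filter(noisy, sigma=1.0, radius=2)
```

In practice an optimized library routine would replace the nested loops; the sketch only makes the weighted-sum structure of the filter explicit.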

[0019] (1b) Sound acquisition: High-sensitivity microphones placed in noise-sensitive areas of the electrical equipment acquire operating sounds so that abnormal noises can be identified. Spectral subtraction is applied to the sound signal $S(f,t_j)$ collected at time $t_j$ for noise reduction:

[0020] $$S_{\text{clean}}(f,t_j)=\max\bigl(S(f,t_j)-\alpha_f N(f),\,0\bigr)$$

[0021] where $N(f)$ is the noise estimate; $\alpha_f$ is the noise reduction coefficient, set to 1.5; and $S_{\text{clean}}(f,t_j)$ is the sound signal at time $t_j$ after spectral-subtraction noise reduction;

[0022] The noise estimate $N(f)$ is obtained using the minimum statistics method:

[0023] $$N(f)=\beta\cdot\min_{t\in W_t}S(f,t)$$

[0024] where $W_t$ is the time window and $\beta$ is the correction factor, set to 1.2.
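A small sketch of the noise reduction in step (1b): a minimum-statistics noise estimate followed by spectral subtraction. The function names and the toy spectrogram are hypothetical, and the sketch assumes the whole input list is the time window and that the spectrum is given as non-negative magnitudes per frequency bin.

```python
def estimate_noise(spectrogram, beta=1.2):
    """Minimum-statistics noise estimate per frequency bin.

    spectrogram[t][f] is the magnitude at frame t and bin f; the whole
    list is treated as the time window (an assumption of this sketch).
    """
    n_bins = len(spectrogram[0])
    return [beta * min(frame[f] for frame in spectrogram) for f in range(n_bins)]

def spectral_subtract(frame, noise, alpha=1.5):
    """Subtract the scaled noise estimate and floor the result at zero."""
    return [max(s - alpha * n, 0.0) for s, n in zip(frame, noise)]

# Three frames, three frequency bins (toy values).
window = [
    [0.5, 2.0, 0.4],
    [0.6, 0.3, 0.5],
    [0.5, 0.4, 3.0],
]
noise = estimate_noise(window)
clean = spectral_subtract(window[0], noise)
```

Bins whose magnitude stays near the tracked minimum are zeroed, while the bin carrying a strong component in the first frame survives with reduced magnitude.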

[0025] (1c) Sensor data acquisition: Current, voltage, temperature, and vibration sensors acquire electrical operating parameters to monitor equipment status. Low-pass filtering is applied to the sensor data $X(t)$ acquired at time $t$:

[0026] $$X_{\text{filtered}}(t)=\int_{-\infty}^{\infty}X(\tau)\,h(t-\tau)\,d\tau$$

[0027] where $X(\tau)$ is the raw sensor data acquired at time $\tau$ and $h(t)$ is the impulse response of the filter, calculated as follows:

[0028] $$h(t)=\omega_c e^{-\omega_c t}\,u(t)$$

[0029] where $\omega_c=2\pi f_c$ and $f_c$ is the cutoff frequency; $u(t)$ is the unit step function, equal to 1 when $t\ge 0$ and 0 otherwise.
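For sampled sensor data, the low-pass filtering of step (1c) is usually applied as a recursive update rather than an explicit convolution. The sketch below assumes a first-order low-pass response with cutoff frequency $f_c$ and a backward-Euler discretization; the function name, the sample rate parameter, and the initialization from the first reading are choices made for this example only.

```python
import math

def low_pass(samples, cutoff_hz, sample_rate_hz):
    """First-order low-pass filter (backward-Euler discretization)."""
    omega_c = 2.0 * math.pi * cutoff_hz
    dt = 1.0 / sample_rate_hz
    a = omega_c * dt / (1.0 + omega_c * dt)  # smoothing factor in (0, 1)
    out = []
    y = samples[0]  # start from the first reading to avoid a start-up jump
    for x in samples:
        y = y + a * (x - y)  # move a fraction of the way toward the new sample
        out.append(y)
    return out

# A rapidly alternating signal is strongly attenuated by a 5 Hz cutoff at 1 kHz.
jitter = [1.0 if i % 2 == 0 else -1.0 for i in range(100)]
filtered = low_pass(jitter, cutoff_hz=5.0, sample_rate_hz=1000.0)
```

A constant input passes through unchanged, while components far above the cutoff are suppressed, which is the behavior wanted for slowly varying quantities such as temperature.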

[0030] (1d) Text data acquisition: Operation logs, electricity meter readings, and maintenance records provide historical data and contextual information; the text is word-segmented and converted to a standardized format.

[0031] (1e) Data synchronization and transmission: The data are transmitted synchronously in real time and aligned by timestamp to obtain synchronized multimodal data:

[0032] $$D_{\text{sync}}(t)=\{I_{\text{filtered}}(t_i),\,S_{\text{clean}}(t_j),\,X_{\text{filtered}}(t_k),\,D_{\text{text}}(t_m)\}$$

[0033] where $D_{\text{sync}}(t)$ is the synchronized multimodal data at time $t$; $I_{\text{filtered}}(t_i)$ is the filtered image data at time $t_i$; $S_{\text{clean}}(t_j)$ is the noise-reduced sound data at time $t_j$; $X_{\text{filtered}}(t_k)$ is the low-pass-filtered sensor data at time $t_k$; $D_{\text{text}}(t_m)=\mathrm{Tokenize}(T(t_m))$ is the word-segmented text data at time $t_m$; $T(t_m)$ is the original text data at time $t_m$; and Tokenize denotes the word segmentation operation;

[0034] The timestamps satisfy $|t_i-t_j|<\epsilon$, $|t_i-t_k|<\epsilon$, and $|t_i-t_m|<\epsilon$, where $\epsilon$ is the allowed time deviation.
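The timestamp alignment of step (1e) can be illustrated as follows. This is a sketch under stated assumptions: records are `(timestamp, payload)` tuples, the image stream serves as the reference clock (matching the constraints, which are all written relative to $t_i$), and frames without a match in every modality are dropped. The function names and the drop policy are hypothetical.

```python
def nearest(records, t):
    """Return the (timestamp, payload) record closest in time to t."""
    return min(records, key=lambda r: abs(r[0] - t))

def synchronize(images, sounds, sensors, texts, eps):
    """Group records whose timestamps lie within eps of each image timestamp.

    Each input is a list of (timestamp, payload) tuples. Image frames with
    no match inside the tolerance in some other modality are skipped.
    """
    synced = []
    for t_i, image in images:
        group = {"t": t_i, "image": image}
        complete = True
        for name, stream in (("sound", sounds), ("sensor", sensors), ("text", texts)):
            t_other, payload = nearest(stream, t_i)
            if abs(t_i - t_other) < eps:
                group[name] = payload
            else:
                complete = False
                break
        if complete:
            synced.append(group)
    return synced

images = [(0.00, "img0"), (1.00, "img1")]
sounds = [(0.01, "snd0"), (1.40, "snd1")]
sensors = [(0.02, "sen0"), (1.01, "sen1")]
texts = [(0.00, "log0"), (1.00, "log1")]
aligned = synchronize(images, sounds, sensors, texts, eps=0.05)
```

Here only the first image frame yields a complete group, because the second sound record is 0.4 s away from the second image and falls outside the tolerance.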

[0035] Step (2) specifically includes the following steps:

[0036] (2a) Encode the image using the BEiT-3 model:

[0037] $$v_I=\mathrm{BEiT}(x_I)$$

[0038] where $v_I\in\mathbb{R}^{d_I}$ is the output image feature vector, representing the visual features of the equipment's appearance and thermal imaging, with $d_I=1024$; $x_I$ is the input image, whose height $H$ and width $W$ are both 224;

[0039] (2b) Encode the sound signal using the encoder of the Whisper model:

[0040] $$v_A=\mathrm{Whisper}(x_A)$$

[0041] where $v_A\in\mathbb{R}^{d_A}$ is the output sound feature vector, reflecting abnormal noise patterns during equipment operation, with $d_A=512$; $x_A$ is the input sound spectrum;

[0042] (2c) Encode the sensor data using the encoder of the Informer model:

[0043] $$v_S=\mathrm{Informer}(x_S)$$

[0044] where $v_S\in\mathbb{R}^{d_S}$ is the output sensor feature vector, capturing dynamic changes in the equipment's state, with $d_S=512$; $x_S$ is the input sensor time series;

[0045] (2d) Encode the text using the T5 model:

[0046] $$v_T=\mathrm{T5}(x_T)$$

[0047] where $v_T\in\mathbb{R}^{d_T}$ is the output text feature vector, containing semantic information about operation and maintenance logs and historical faults, with $d_T=1024$; $x_T$ is the input text sequence;

[0048] (2e) A linear projection layer maps the feature vector of each modality to a unified dimension $D=1024$ to facilitate subsequent fusion:

[0049] $$m_i=W_i\cdot v_i+b_i$$

[0050] where $m_i\in\mathbb{R}^{D}$ is the unified representation of the $i$-th modality, $i\in\{I,A,S,T\}$; $W_i\in\mathbb{R}^{D\times d_i}$ is the projection matrix; $d_i$ is the dimension of the feature vector of the $i$-th modality; $b_i\in\mathbb{R}^{D}$ is the bias vector; and $v_i$ is the feature vector of the $i$-th modality;

[0051] The final output is the unified set of multimodal representation vectors $V$:

[0052] $$V=[m_I,\,m_A,\,m_S,\,m_T]$$

[0053] where $m_I$, $m_A$, $m_S$, and $m_T$ are the representation vectors of the image, sound, sensor, and text modalities, respectively.

[0054] Step (3) specifically refers to: using the Flamingo model to perform cross-modal fusion and obtain the unified fusion feature $v_{\text{multimodal}}$:

[0055] $$v_{\text{multimodal}}=\mathrm{Flamingo}(m_I,\,m_A,\,m_S,\,m_T)$$

[0056] where $v_{\text{multimodal}}$ is the unified fusion feature, and $m_I$, $m_A$, $m_S$, and $m_T$ are the representation vectors of the image, sound, sensor, and text modalities, respectively.

[0057] Step (4) specifically includes the following steps:

[0058] (4a) The input unified fusion feature $v_{\text{multimodal}}$ is normalized to obtain the normalized feature $v_{\text{norm}}$:

[0059] $$v_{\text{norm}}=\frac{v_{\text{multimodal}}-\mu_v}{\sigma_v}$$

[0060] where $\mu_v$ and $\sigma_v$ are the mean and standard deviation of all $v_{\text{multimodal}}$ in the training set; $v_{\text{norm}}\in\mathbb{R}^{D}$ is the normalized feature, and $D$ is the unified dimension of the modality representation vectors.

[0061] (4b) Contrastive learning detection: A normal sample library collects a large set of feature vectors under normal operating conditions, $\{v_{\text{normal},1},v_{\text{normal},2},\dots,v_{\text{normal},M}\}$, where $M$ is the number of normal samples; a negative sample library collects a set of known abnormal samples, $\{v_{\text{abnormal},1},\dots,v_{\text{abnormal},N}\}$, where $N$ is the number of abnormal samples;

[0062] Cosine similarity measures the similarity $\mathrm{sim}(v_{\text{norm}},v_{\text{normal},i})$ between the normalized feature $v_{\text{norm}}$ and the normal samples:

[0063] $$\mathrm{sim}(v_{\text{norm}},v_{\text{normal},i})=\frac{v_{\text{norm}}\cdot v_{\text{normal},i}}{\lVert v_{\text{norm}}\rVert\,\lVert v_{\text{normal},i}\rVert}$$

[0064] where $v_{\text{normal},i}$ is the feature of the $i$-th normal sample;

[0065] A contrastive learning loss function is defined to encourage $v_{\text{norm}}$ to be closer to normal samples and farther from abnormal samples:

[0066]

[0067] where $v_j$ is the feature of the $j$-th sample in the combined set of normal and abnormal samples, and $\tau$ is the temperature coefficient, set to 0.1.

[0068] Finally, the average similarity $S_{\text{contrastive}}$ between $v_{\text{norm}}$ and the normal samples is calculated:

[0069] $$S_{\text{contrastive}}=\frac{1}{M}\sum_{i=1}^{M}\mathrm{sim}(v_{\text{norm}},v_{\text{normal},i})$$

[0070] An $S_{\text{contrastive}}$ value close to 1 indicates that the current equipment status is normal, while a value close to 0 indicates an abnormality.
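The similarity score of step (4b) is simple enough to sketch directly. The snippet below computes the average cosine similarity between a feature vector and a library of normal samples; the function names and the two-dimensional toy vectors are illustrative assumptions, and the contrastive training loss itself is not covered.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def contrastive_score(v_norm, normal_samples):
    """Average cosine similarity to the normal sample library."""
    sims = [cosine_similarity(v_norm, v) for v in normal_samples]
    return sum(sims) / len(sims)

normal_library = [[1.0, 0.0], [1.0, 0.1]]
score_normal = contrastive_score([2.0, 0.0], normal_library)    # same direction
score_abnormal = contrastive_score([0.0, 1.0], normal_library)  # nearly orthogonal
```

Because cosine similarity ignores vector length, the score reflects the direction of the fused feature relative to the normal library rather than its magnitude.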

[0071] (4c) Autoencoder detection: The autoencoder consists of an encoder and a decoder. The encoder compresses $v_{\text{norm}}$ into a low-dimensional latent space:

[0072] $$z=f_{\text{enc}}(v_{\text{norm}})=W_{\text{enc}}\cdot\mathrm{ReLU}(W_{\text{in}}v_{\text{norm}}+b_{\text{in}})+b_{\text{enc}}$$

[0073] where $z\in\mathbb{R}^{d_z}$ is the low-dimensional feature vector produced by the encoder, with $d_z<D$ and $d_z$ set to 256; $D$ is the dimension of the unified representation vector of each modality, set to 1024; $W_{\text{enc}}$ and $W_{\text{in}}$ are weight matrices; $b_{\text{in}}$ and $b_{\text{enc}}$ are biases; and ReLU is the activation function;

[0074] The decoder reconstructs the original dimension from $z$:

[0075] $$\hat{v}_{\text{norm}}=f_{\text{dec}}(z)=W_{\text{dec}}\cdot\mathrm{ReLU}(W_{\text{hidden}}z+b_{\text{hidden}})+b_{\text{dec}}$$

[0076] where $W_{\text{dec}}$ and $W_{\text{hidden}}$ are weight matrices; $b_{\text{hidden}}$ and $b_{\text{dec}}$ are biases; $\hat{v}_{\text{norm}}$ is the reconstructed feature; and $f_{\text{dec}}(z)$ denotes decoding the low-dimensional feature vector $z$;

[0077] The autoencoder is trained on normal data, and a reconstruction loss is defined to minimize the reconstruction error of normal data:

[0078]

[0079] After training is complete, the reconstruction error $S_{\text{recon}}$ is calculated on the test data:

[0080]

[0081] A larger $S_{\text{recon}}$ indicates that the current equipment status is abnormal;

[0082] $S_{\text{recon}}$ is normalized using the statistics of the training set:

[0083] $$S'_{\text{recon}}=\frac{S_{\text{recon}}-\mu_{\text{recon}}}{\sigma_{\text{recon}}}$$

[0084] where $\mu_{\text{recon}}$ and $\sigma_{\text{recon}}$ are the mean and standard deviation of all $S_{\text{recon}}$ in the training set, and $S'_{\text{recon}}$ is the normalized reconstruction error;

[0085] (4d) The results of contrastive learning and the autoencoder are fused to generate the final anomaly score $S_{\text{anomaly}}$:

[0086] $$S_{\text{anomaly}}=\alpha(1-S_{\text{contrastive}})+(1-\alpha)S'_{\text{recon}}$$

[0087] where $\alpha\in[0,1]$ is the weight parameter, set to 0.5, and $1-S_{\text{contrastive}}$ converts the similarity into an anomaly degree.

[0088] A threshold $T_{\text{ano}}$ is set; when $S_{\text{anomaly}}>T_{\text{ano}}$, the equipment is judged to be abnormal;

[0089] If an anomaly is detected, the fault type is further classified by a fully connected neural network:

[0090] $$P(\text{fault})=\mathrm{softmax}(W_c\cdot v_{\text{norm}}+b_c)$$

[0091] where $W_c\in\mathbb{R}^{K\times D}$ and $b_c\in\mathbb{R}^{K}$ are the weights and biases, respectively; $K$ is the number of fault categories; $P(\text{fault})$ is the probability distribution over the fault categories; and softmax is the activation function.

[0092] During training, labeled fault data is needed to optimize the cross-entropy loss:

[0093] $$\mathcal{L}_{\text{CE}}=-\sum_{k=1}^{K}y_k\log P_k$$

[0094] where $y_k$ is the true label, $P_k$ is the $k$-th term of the probability distribution output by the model, and $\mathcal{L}_{\text{CE}}$ is the cross-entropy loss used for training the model.
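The decision logic of step (4d) can be sketched as a score fusion followed by a softmax classifier. The values below are toy numbers: the threshold `T_ANO`, the two-dimensional feature, and the classifier weights are hypothetical, since the source gives no concrete values for them.

```python
import math

def anomaly_score(s_contrastive, s_recon_norm, alpha=0.5):
    """Fuse the similarity-based and reconstruction-based anomaly evidence."""
    return alpha * (1.0 - s_contrastive) + (1.0 - alpha) * s_recon_norm

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_fault(v_norm, weights, biases):
    """Single fully connected layer followed by softmax (K rows in weights)."""
    logits = [sum(w * x for w, x in zip(row, v_norm)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)

def cross_entropy(probs, one_hot):
    """Cross-entropy between a predicted distribution and a one-hot label."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y > 0)

T_ANO = 0.6  # hypothetical threshold, not taken from the source
score = anomaly_score(s_contrastive=0.2, s_recon_norm=1.5)
is_abnormal = score > T_ANO

# Toy classifier with K = 3 fault categories on a 2-dimensional feature.
weights = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
biases = [0.0, 0.0, 0.0]
probs = classify_fault([1.0, -1.0], weights, biases) if is_abnormal else None
```

The classifier only runs once the fused score crosses the threshold, mirroring the two-stage flow described above: detect first, then diagnose.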

[0095] Step (5) specifically includes the following steps:

[0096] (5a) Time series data are constructed from the unified fusion feature $v_{\text{multimodal}}$ and normalized. Feature vectors for $T$ consecutive time steps are extracted from historical data, $V_t=[v_{\text{multimodal},t-T+1},v_{\text{multimodal},t-T+2},\dots,v_{\text{multimodal},t}]$, where $V_t\in\mathbb{R}^{T\times D_m}$ and the feature dimension $D_m$ equals the unified dimension $D$ of the modality representation vectors, i.e., 1024; each dimension is then normalized:

[0097] $$V'_t=\frac{V_t-\mu_t}{\sigma_t}$$

[0098] where $\mu_t$ and $\sigma_t$ are the time series mean and standard deviation of the training set, and $V'_t$ is the normalized time series;

[0099] (5b) The Informer model is used to predict the future state features of the equipment. Placeholders for the future time steps are input and first passed through the Transformer encoding layer of the Informer decoder; the result is then fed, together with the output of the Informer encoder, into the multi-head self-attention layer to output the predicted value:

[0100] $$\hat{v}_{\text{multimodal},t+1}=\mathrm{Informer}(V'_t)$$

[0101] where $\hat{v}_{\text{multimodal},t+1}$ is the predicted value for the next step;

[0102] During training, the mean squared error between the predicted values and the true future states is minimized:

[0103]

[0104] where $v_{\text{multimodal},t+1}$ is the actual state at the next step, i.e., the true future state; $\hat{v}_{\text{multimodal},t+1}$ is the predicted value for the next step; and the loss is the mean squared error between the predicted value and the true future state.

[0105] (5c) Based on the predicted state $\hat{v}_{\text{multimodal},t+1}$, early warning is achieved through deviation analysis and failure probability assessment:

[0106] (5c1) Deviation analysis: A normal state baseline is defined by calculating the average feature vector $v_{\text{normal\_benchmark}}$ from normal operation data:

[0107] $$v_{\text{normal\_benchmark}}=\frac{1}{M}\sum_{i=1}^{M}v_{\text{normal},i}$$

[0108] where $M$ is the number of normal samples and $v_{\text{normal},i}$ is the feature vector of the $i$-th normal sample;

[0109] The Euclidean distance between the predicted state $\hat{v}_{\text{multimodal},t+1}$ and the normal state baseline $v_{\text{normal\_benchmark}}$ is then calculated as the deviation measure $D_{\text{deviation}}$:

[0110] $$D_{\text{deviation}}=\lVert\hat{v}_{\text{multimodal},t+1}-v_{\text{normal\_benchmark}}\rVert_2$$

[0111] where $\hat{v}_{\text{multimodal},t+1}$ is the predicted state and $D_{\text{deviation}}$ measures how far the predicted state deviates from the normal state baseline;

[0112] Finally, a deviation threshold $T_{\text{dev}}$ is set; when $D_{\text{deviation}}>T_{\text{dev}}$, an alert is triggered;
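The deviation analysis of step (5c1) reduces to a mean vector and a Euclidean distance. In the sketch below the function names, the toy two-dimensional vectors, and the threshold `T_DEV` are hypothetical; the source does not state a value for the deviation threshold.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def deviation(predicted, baseline):
    """Euclidean distance between the predicted state and the baseline."""
    return math.sqrt(sum((p - b) ** 2 for p, b in zip(predicted, baseline)))

normal_samples = [[1.0, 2.0], [3.0, 2.0]]
baseline = mean_vector(normal_samples)
d_deviation = deviation([5.0, 6.0], baseline)

T_DEV = 3.0  # hypothetical threshold, not taken from the source
alert = d_deviation > T_DEV
```

With these toy values the baseline is the midpoint of the two normal samples, and the predicted state lies far enough from it to raise an alert.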

[0113] (5c2) Failure probability assessment:

[0114] First, a failure mode library $\{v_{\text{fault},1},\dots,v_{\text{fault},K}\}$ is constructed, where $K$ is the number of fault categories. Cosine similarity is then used to evaluate the proximity $S_k$ of the predicted state $\hat{v}_{\text{multimodal},t+1}$ to each failure mode:

[0115] $$S_k=\frac{\hat{v}_{\text{multimodal},t+1}\cdot v_{\text{fault},k}}{\lVert\hat{v}_{\text{multimodal},t+1}\rVert\,\lVert v_{\text{fault},k}\rVert}$$

[0116] where $v_{\text{fault},k}$ is the $k$-th failure mode. The result is normalized to a probability:

[0117] $$P(\text{fault}_k)=\frac{\exp(S_k/\tau_p)}{\sum_{j=1}^{K}\exp(S_j/\tau_p)}$$

[0118] where $\tau_p$ is the temperature coefficient, set to 0.1, and $P(\text{fault}_k)$ is the probability of the $k$-th failure mode occurring;

[0119] When $\max_k P(\text{fault}_k)>T_{\text{prob}}$, where $T_{\text{prob}}$ is a threshold set to 0.7 and $\max_k P(\text{fault}_k)$ is the probability of the most likely failure mode, the fault with the highest probability of occurrence is determined to be a high-risk fault type;

[0120] $k^{*}=\arg\max_k P(\text{fault}_k)$ denotes the index of the failure mode with the highest probability of occurrence, i.e., the entry of $P(\text{fault}_k)$ with the largest value, where each entry corresponds to the probability of one fault category; corresponding suggestions are provided for this highest-probability category.
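The failure probability assessment of step (5c2) can be sketched as cosine similarities passed through a temperature-scaled softmax, followed by a threshold check. The function names and the three-entry toy failure mode library are assumptions for illustration; the temperature 0.1 and threshold 0.7 are the values stated in the text.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def failure_probabilities(predicted, fault_library, tau_p=0.1):
    """Temperature-scaled softmax over similarities to each failure mode."""
    sims = [cosine_similarity(predicted, v) for v in fault_library]
    top = max(sims)  # subtract the maximum for numerical stability
    exps = [math.exp((s - top) / tau_p) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

def high_risk_fault(probs, t_prob=0.7):
    """Return the index of the most likely failure mode, or None below threshold."""
    k_star = max(range(len(probs)), key=lambda k: probs[k])
    return k_star if probs[k_star] > t_prob else None

library = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
clear_case = high_risk_fault(failure_probabilities([1.0, 0.1], library))
ambiguous_case = high_risk_fault(failure_probabilities([1.0, 1.0], library))
```

The low temperature sharpens the distribution, so a predicted state that clearly resembles one failure mode passes the threshold, while a state equally close to two modes does not.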

[0121] Step (6) specifically includes the following steps:

[0122] (6a) New data $X_{\text{new}}$ undergoes multimodal encoding and multimodal fusion to obtain the fused feature vector $v_{\text{multimodal,new}}$:

[0123]-[0124] $$v_{\text{multimodal,new}}=\mathrm{Flamingo}\bigl(\mathrm{BEiT}(x_{I,\text{new}}),\,\mathrm{Whisper}(x_{A,\text{new}}),\,\mathrm{Informer}(x_{S,\text{new}}),\,\mathrm{T5}(x_{T,\text{new}})\bigr)$$

[0125] where $x_{I,\text{new}}$, $x_{A,\text{new}}$, $x_{S,\text{new}}$, and $x_{T,\text{new}}$ are the image data, sound signal, sensor data, and text data in the new data, respectively, and $v_{\text{multimodal,new}}$ is the fused feature vector of the new data;

[0126] Then $v_{\text{multimodal,new}}$ is normalized:

[0127] $$v_{\text{norm,new}}=\frac{v_{\text{multimodal,new}}-\mu}{\sigma}$$

[0128] where $\mu$ and $\sigma$ are the mean and standard deviation calculated from historical data, respectively, and $v_{\text{norm,new}}$ is the normalized fused feature vector;

[0129] For trend prediction, the new data is organized into a time series with a window length of $T$: $V_{\text{new},t}=[v_{\text{multimodal,new},t-T+1},\dots,v_{\text{multimodal,new},t}]$;

[0130] (6b) An incremental learning mechanism is used to learn new data step by step. The main steps are as follows:

[0131] (6b1) Data buffering and selection:

[0132] First, a fixed-size buffer pool with a capacity of 1000 is maintained, storing historical data samples $\{(v_{\text{norm},i},y_i)\}$ and new data $\{(v_{\text{norm,new}},y_{\text{new}})\}$, where $y_i$ and $y_{\text{new}}$ are status labels;

[0133] Then, the buffer pool is updated using random replacement to keep a balance between old and new data:

[0134]

[0135] where the formula depends on the number of samples already in the buffer pool and on $\mathrm{size}(v_{\text{new}})$, the number of newly added data samples; $P(\text{keep})$ is the probability that each old sample in the buffer pool is retained;

[0136] (6b2) Model update:

[0137] The model is fine-tuned on new data while preserving its performance on old data, using the data in the buffer pool. For the equipment anomaly detection module, the loss function is:

[0138]

[0139] where $\tau_c$ is the temperature coefficient, set to 0.1; $v_{\text{normal}}$ is a normal sample; and $v_j$ is the $j$-th sample in the buffer pool;

[0140] For the equipment status trend prediction module, the loss function is:

[0141]

[0142] where the predicted value is the next-step prediction for the new data, i.e., the predicted state, and $v_{\text{multimodal,new},t+1}$ is the actual next-step state for the new data, i.e., the true future state;

[0143] The model parameters are updated using mini-batch gradient descent;
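One simple way to realize a fixed-capacity buffer with random replacement is sketched below. It is an illustration only: the retention probability formula of step (6b1) is not reproduced here, and this sketch instead assumes that once the pool is full, each incoming sample overwrites one uniformly random slot (which may hold an old sample or a recently added one). The function name and the `rng` parameter are hypothetical.

```python
import random

def update_buffer(pool, new_samples, capacity=1000, rng=random):
    """Insert new samples into a fixed-capacity pool with random replacement.

    While there is free space, samples are appended. Once the pool is full,
    each new sample overwrites a uniformly random slot, so old and new data
    coexist and the pool never grows beyond its capacity.
    """
    pool = list(pool)  # work on a copy so the caller's list is untouched
    for sample in new_samples:
        if len(pool) < capacity:
            pool.append(sample)
        else:
            pool[rng.randrange(len(pool))] = sample
    return pool

# Five historical samples, three new ones, capacity five.
history = [("old", i) for i in range(5)]
incoming = [("new", i) for i in range(3)]
updated = update_buffer(history, incoming, capacity=5, rng=random.Random(0))
```

Passing a seeded `random.Random` instance makes the replacement reproducible for testing; in deployment the module-level generator would normally be used.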

[0144] (6c) Knowledge distillation optimization:

[0145] First, the teacher model's output on the new data is obtained:

[0146]

[0147] where the formula applies the teacher model to $v_{\text{norm,new},i}$, the normalized feature of the $i$-th new data sample, and yields the output of the teacher model;

[0148] Then, the student model's output on the new data is obtained:

[0149]

[0150] where the formula applies the student model to the same input and yields the output of the student model;

[0151] Finally, the KL divergence loss is calculated:

[0152]

[0153] where $D_{\text{KL}}$ denotes the KL divergence;

[0154] The total loss is calculated:

[0155]

[0156] where $\theta$ is the balance factor, set to 0.7. The task-specific loss is given by:

[0157]

[0158] where the two terms are the detection loss and the prediction loss;

[0159] The student model is updated by optimizing the total loss, and the updated student model replaces the old parameters of each module in the system. An update is triggered each time a certain amount of new data has been received, and the system rolls back to the old model if the performance degradation after the update exceeds a certain percentage.
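Two pieces of step (6c) lend themselves to a short sketch: the KL divergence between teacher and student output distributions, and the rollback check after an update. Several details are assumptions made for this example: the direction of the divergence (teacher relative to student), the use of a single scalar performance metric where higher is better, and the 5% relative drop limit. None of these values is given in the source.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL divergence D_KL(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def accept_update(old_metric, new_metric, max_relative_drop=0.05):
    """Keep the updated model unless the metric fell by more than the limit.

    Metrics are assumed to be 'higher is better' (e.g. detection accuracy).
    """
    if old_metric <= 0:
        return True  # nothing meaningful to compare against
    return (old_metric - new_metric) / old_metric <= max_relative_drop

teacher_out = [0.9, 0.1]
student_out = [0.5, 0.5]
distill_loss = kl_divergence(teacher_out, student_out)

keep_small_drop = accept_update(old_metric=0.90, new_metric=0.89)
keep_large_drop = accept_update(old_metric=0.90, new_metric=0.80)
```

When `accept_update` returns False, the deployment would restore the previous parameters, which is the rollback behavior described in the paragraph above.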

[0160] Another objective of this invention is to provide an intelligent monitoring and early warning system for equipment defects in thermal power plants based on a multimodal large model, comprising:

[0161] A multimodal data acquisition module, which collects the equipment's operating data, i.e., multimodal data, in real time and transmits the multimodal data to the multimodal Transformer model;

[0162] A multimodal data encoding module, which uses the BEiT-3 model, Whisper model, Informer model, and T5 model to encode the different types of multimodal data and generate a unified multimodal representation;

[0163] A multimodal fusion analysis module, which uses the Flamingo model to perform cross-modal data fusion and analysis on the unified multimodal representation to obtain unified fusion features;

[0164] An equipment anomaly detection and fault diagnosis module, which, based on the unified fusion features, uses contrastive learning and an autoencoder to calculate the anomaly degree of the equipment and thereby determine whether the equipment is abnormal;

[0165] An equipment status trend prediction and early warning module, which uses the Informer model to predict the future equipment status and the probability of potential failures based on the unified fusion features, and provides operation and maintenance suggestions and early warnings;

[0166] An online learning and model optimization module, which optimizes the multimodal Transformer model through incremental learning and knowledge distillation.

[0167] As can be seen from the above technical solution, the beneficial effects of this invention are as follows. First, this invention adopts a multimodal Transformer model that combines image, sound, sensor, and text data and uses Flamingo for cross-modal feature fusion; compared with traditional methods, it achieves higher accuracy and lower false alarm and false negative rates. Second, this invention uses the Informer time series prediction model to perform trend analysis on equipment status, which can predict the time of failure and greatly improve the prediction lead time, giving maintenance personnel more time for maintenance; by combining anomaly detection and prediction, the probability of sudden failures is reduced. Third, this invention uses an online learning mechanism to automatically learn the operating characteristics of different thermal power plants and different equipment; when migrating to new equipment, only a small amount of new data is needed for fine-tuning to achieve accurate monitoring. At the same time, knowledge distillation is used to update the model incrementally, enabling it to continuously adapt to new equipment operating modes without affecting its original performance. Fourth, this invention employs the T5 text analysis model, combined with historical maintenance logs, to automatically generate fault cause descriptions and provide maintenance suggestions; through BEiT image analysis and Whisper sound diagnosis, it can generate fault diagnosis reports with visualizations and audio playback, helping engineers quickly locate the source of faults. Fifth, through intelligent automated monitoring, this invention reduces manual inspection time, provides automatic alarms and remote diagnosis, improves fault handling efficiency, reduces unplanned downtime, increases equipment availability, and lowers the average annual maintenance cost of thermal power plants. Sixth, this invention adopts a pre-training and fine-tuning mechanism for the multimodal Transformer model, which transfers well between different thermal power plants and different equipment, reduces training costs, and allows rapid deployment to different plants at lower deployment cost.

Attached Figure Description

[0168] Figure 1 is a flowchart of the method of the present invention;

[0169] Figure 2 is a system framework diagram of the present invention;

[0170] Figure 3 is a flowchart of the multimodal data encoding module in this invention;

[0171] Figure 4 is an architecture diagram of the BEiT model in this invention;

[0172] Figure 5 is a structural diagram of the Whisper model encoder in this invention;

[0173] Figure 6 is a structural diagram of the encoder of the Informer model in this invention;

[0174] Figure 7 is a structural diagram of the decoder of the Informer model in this invention.

Detailed Implementation

[0175] As shown in Figure 1, a method for intelligent monitoring and early warning of equipment defects in thermal power plants based on a multimodal large model includes the following sequential steps:

[0176] (1) Collect the equipment's operating data, i.e., multimodal data, in real time. The multimodal data includes image data, sound data, sensor data, and text data. Transmit the multimodal data to the multimodal Transformer model;

[0177] (2) Construct a multimodal Transformer model: the BEiT-3 model, Whisper model, Informer model, T5 model, Flamingo model, and autoencoder together form the multimodal Transformer model, which encodes the different types of multimodal data and generates a unified multimodal representation;

[0178] (3) Perform cross-modal data fusion and analysis on the unified multimodal representation to obtain unified fusion features;

[0179] (4) Based on the unified fusion features, calculate the anomaly degree of the equipment using contrastive learning and the autoencoder to determine whether the equipment is abnormal;

[0180] (5) Based on the unified fusion features, predict the future equipment status and the probability of potential failures, and provide operation and maintenance suggestions and early warnings;

[0181] (6) Optimize the multimodal Transformer model through incremental learning and knowledge distillation.

[0182] Step (1) specifically includes the following steps:

[0183] (1a) Image acquisition: High-definition industrial cameras installed in the visible area of the electrical equipment acquire images of the equipment's exterior to detect physical defects. To reduce the influence of noise, Gaussian filtering is applied to the pixel at position $(x,y)$ of the image captured at time $t_i$:

[0184] $$I_{\text{filtered}}(x,y,t_i)=\sum_{u}\sum_{v} I(u,v,t_i)\,\frac{1}{2\pi\sigma^{2}}\exp\!\left(-\frac{(x-u)^{2}+(y-v)^{2}}{2\sigma^{2}}\right)$$

[0185] where $\sigma$ is the standard deviation of the Gaussian kernel; $I_{\text{filtered}}(x,y,t_i)$ is the Gaussian-filtered value of the pixel at position $(x,y)$ at time $t_i$; and $I(u,v,t_i)$ is the value of the pixel at position $(u,v)$ in the image captured at time $t_i$;
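The Gaussian denoising of step (1a) can be illustrated with a minimal NumPy sketch. It assumes a single-channel image, a kernel truncated at a hypothetical `radius`, and edge padding at the borders; none of these choices is specified in the text, and the function names are illustrative only.

```python
import numpy as np

def gaussian_kernel(sigma: float, radius: int) -> np.ndarray:
    """Return a normalized (2*radius+1) x (2*radius+1) Gaussian kernel."""
    ax = np.arange(-radius, radius + 1)
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return kernel / kernel.sum()  # normalize so overall brightness is preserved

def gaussian_filter_image(image: np.ndarray, sigma: float = 1.0, radius: int = 2) -> np.ndarray:
    """Denoise a 2-D grayscale image with a truncated Gaussian kernel."""
    kernel = gaussian_kernel(sigma, radius)
    # Edge padding is an assumption; the text does not say how borders are handled.
    padded = np.pad(image.astype(float), radius, mode="edge")
    out = np.zeros(image.shape, dtype=float)
    size = 2 * radius + 1
    rows, cols = image.shape
    for dy in range(size):
        for dx in range(size):
            # Accumulate the weighted, shifted copies of the image.
            out += kernel[dy, dx] * padded[dy:dy + rows, dx:dx + cols]
    return out
```

Because the kernel is normalized, a uniform image passes through unchanged, while an isolated noisy pixel is spread over its neighbours.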

[0186] (1b) Sound acquisition: High-sensitivity microphones placed in noise-sensitive areas of the electrical equipment acquire operating sounds to identify abnormal noises. Spectral subtraction is applied to the sound signal $S(f,t_j)$ collected at time $t_j$ for noise reduction:

[0187] $$S_{\text{clean}}(f,t_j)=\max\big(S(f,t_j)-\alpha_f N(f),\,0\big)$$

[0188] where $N(f)$ is the noise estimate; $\alpha_f$ is the noise reduction coefficient, set to 1.5; and $S_{\text{clean}}(f,t_j)$ is the sound signal at time $t_j$ after spectral subtraction;

[0189] In monitoring the electrical equipment of a thermal power plant, noise (such as discharge noise or environmental interference) may change dynamically over time, so the noise is estimated with the minimum statistics method. This method relies on the statistical characteristics of the signal power spectrum and assumes that, over a short period, the noise power corresponds to a local minimum of the spectrum. The minimum of the spectrum is tracked within a time window and a bias correction is applied to obtain the noise estimate:

[0190] $$N(f)=\beta\cdot\min_{t\in W_t} S(f,t)$$

[0191] where $W_t$ is the time window and $\beta$ is the correction factor, set to 1.2.
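As a rough illustration of step (1b), the sketch below applies spectral subtraction with a minimum-statistics noise estimate. It assumes the spectrogram is a non-negative array of shape (frequencies, frames) covering one time window, and that the noise estimate is the per-frequency minimum over that window scaled by the correction factor; that exact form is an assumption, not something the text confirms.

```python
import numpy as np

ALPHA_F = 1.5  # noise reduction coefficient given in the text
BETA = 1.2     # bias-correction factor given in the text

def estimate_noise(spectrogram: np.ndarray, beta: float = BETA) -> np.ndarray:
    """Minimum-statistics estimate (assumed form): bias-corrected minimum
    of each frequency bin over the frames of the time window."""
    return beta * spectrogram.min(axis=1)

def spectral_subtract(spectrogram: np.ndarray, noise: np.ndarray,
                      alpha: float = ALPHA_F) -> np.ndarray:
    """S_clean(f, t) = max(S(f, t) - alpha * N(f), 0)."""
    return np.maximum(spectrogram - alpha * noise[:, None], 0.0)

# Tiny example: 2 frequency bins, 3 frames.
S = np.array([[1.0, 2.0, 5.0],
              [0.5, 0.5, 0.5]])
N = estimate_noise(S)
S_clean = spectral_subtract(S, N)
```

A stationary bin (the second row) is removed entirely, while transient energy above the noise floor survives in the first row.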

[0192] (1c) Sensor data acquisition: Electrical operating parameters are acquired through current sensors, voltage sensors, temperature sensors, and vibration sensors to monitor equipment status. Low-pass filtering is applied to the sensor data $X(t)$ acquired at time $t$:

[0193] $$X_{\text{filtered}}(t)=\int_{-\infty}^{\infty} X(\tau)\,h(t-\tau)\,d\tau$$

[0194] where $X(\tau)$ is the raw sensor data acquired at time $\tau$, and $h(t)$ is the impulse response of the filter, calculated as follows:

[0195] $$h(t)=\omega_c\,e^{-\omega_c t}\,u(t)$$

[0196] where $\omega_c=2\pi f_c$ and $f_c$ is the cutoff frequency; $u(t)$ is the unit step function, equal to 1 when $t\geq 0$ and 0 otherwise.

[0197] (1d) Collect text data: Historical data and contextual information are provided through operation logs, electricity meter readings, and maintenance records. The text is then segmented and standardized in format. Segmentation breaks down continuous text into meaningful words or phrases to facilitate subsequent analysis. Standardization mainly unifies the format to ensure data consistency, such as unit conversion and time format unification.

[0198] (1e) Data synchronization and transmission: The data streams are transmitted synchronously in real time and aligned by timestamp to obtain synchronized multimodal data. The synchronization formula is:

[0199] $$D_{\text{sync}}(t)=\{I_{\text{filtered}}(t_i),\,S_{\text{clean}}(t_j),\,X_{\text{filtered}}(t_k),\,D_{\text{text}}(t_m)\}$$

[0200] where $D_{\text{sync}}(t)$ is the synchronized multimodal data at time $t$; $I_{\text{filtered}}(t_i)$ is the filtered image data at time $t_i$; $S_{\text{clean}}(t_j)$ is the noise-reduced sound data at time $t_j$; $X_{\text{filtered}}(t_k)$ is the low-pass-filtered sensor data at time $t_k$; $D_{\text{text}}(t_m)=\text{Tokenize}(T(t_m))$ is the segmented text data at time $t_m$; $T(t_m)$ is the original text data at time $t_m$; and Tokenize denotes the word segmentation operation;

[0201] The timestamps satisfy $|t_i-t_j|<\epsilon$, $|t_i-t_k|<\epsilon$, and $|t_i-t_m|<\epsilon$, where $\epsilon$ is the permitted time deviation.
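The timestamp alignment of step (1e) might be sketched as follows. The sketch takes the image timestamps as the reference, picks the nearest entry in each other stream, and keeps a tuple only if every stream has an entry within the permitted deviation. The stream names, the nearest-neighbour strategy, and the choice to drop incomplete tuples are assumptions made for illustration.

```python
from bisect import bisect_left

def nearest(timestamps, t):
    """Index of the entry in the sorted list `timestamps` that is closest to t."""
    pos = bisect_left(timestamps, t)
    candidates = [i for i in (pos - 1, pos) if 0 <= i < len(timestamps)]
    return min(candidates, key=lambda i: abs(timestamps[i] - t))

def synchronize(image_ts, other_streams, epsilon):
    """For each image timestamp t_i, find one index per stream with
    |t_i - t| < epsilon; tuples with a missing stream are dropped."""
    aligned = []
    for i, t_i in enumerate(image_ts):
        match = {"image": i}
        for name, ts in other_streams.items():
            if not ts:
                break  # empty stream: nothing can be aligned
            j = nearest(ts, t_i)
            if abs(ts[j] - t_i) >= epsilon:
                break  # closest sample is too far away
            match[name] = j
        else:
            aligned.append(match)  # reached only if no stream failed
    return aligned

streams = {
    "sound": [0.02, 1.30, 2.01],
    "sensor": [0.01, 0.99, 1.98],
    "text": [0.00, 1.00, 2.04],
}
result = synchronize([0.0, 1.0, 2.0], streams, epsilon=0.05)
```

In this example the image at time 1.0 is discarded because the closest sound frame is 0.3 away, which exceeds the permitted deviation.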

[0202] As shown in Figure 3, step (2) specifically includes the following steps:

[0203] (2a) Images are encoded using the BEiT-3 model (a masked image modeling Transformer), as shown in Figure 4:

[0204] $$v_I=\text{BEiT}(x_I)$$

[0205] where $v_I\in\mathbb{R}^{d_I}$ is the output image feature vector, representing the visual features of the equipment appearance and thermal imaging, with $d_I=1024$; $x_I$ is the input image, whose height $H$ and width $W$ are both 224;

[0206] (2b) The sound signal is encoded using the encoder of the Whisper model (OpenAI speech model), as shown in Figure 5:

[0207] $$v_A=\text{Whisper}(x_A)$$

[0208] where $v_A\in\mathbb{R}^{d_A}$ is the output sound feature vector, reflecting the abnormal noise patterns during equipment operation, with $d_A=512$; $x_A$ is the input sound spectrum;

[0209] (2c) The sensor data is encoded using the Informer model (a self-attention time series prediction model), as shown in Figure 6:

[0210] $$v_S=\text{Informer}(x_S)$$

[0211] where $v_S\in\mathbb{R}^{d_S}$ is the output sensor feature vector, capturing dynamic changes in the equipment state, with $d_S=512$; $x_S$ is the input sensor time series;

[0212] (2d) Text is encoded using the T5 model (Text-to-Text Transfer Transformer):

[0213] $$v_T=\text{T5}(x_T)$$

[0214] where $v_T\in\mathbb{R}^{d_T}$ is the output text feature vector, containing semantic information about operation and maintenance logs and historical faults, with $d_T=1024$; $x_T$ is the input text sequence;

[0215] (2e) The feature vector of each modality is mapped by a linear projection layer to the unified dimension $D=1024$ of the modality representation vectors, to facilitate subsequent fusion:

[0216] $$m_i=W_i\cdot v_i+b_i$$

[0217] where $m_i\in\mathbb{R}^{D}$ is the unified representation of the $i$-th modality, with $i\in\{I,A,S,T\}$; $W_i\in\mathbb{R}^{D\times d_i}$ is the projection matrix; $d_i$ is the dimension of the feature vector of the $i$-th modality; $b_i\in\mathbb{R}^{D}$ is the bias vector; and $v_i$ is the feature vector of the $i$-th modality;

[0218] The final output is the set of unified multimodal representation vectors $V$:

[0219] $$V=[m_I,\,m_A,\,m_S,\,m_T]$$

[0220] where $m_I$ is the representation vector of the image modality; $m_A$ is the representation vector of the sound modality; $m_S$ is the representation vector of the sensor modality; and $m_T$ is the representation vector of the text modality.
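The projection of step (2e) is a per-modality affine map. The sketch below uses randomly initialized weights and random stand-ins for the encoder outputs purely to show the shapes involved; in the described system the encoder outputs would come from BEiT-3, Whisper, Informer, and T5, and the weights would be learned. The helper names are hypothetical.

```python
import numpy as np

UNIFIED_DIM = 1024                                         # D in the text
ENCODER_DIMS = {"I": 1024, "A": 512, "S": 512, "T": 1024}  # d_i in the text

def init_projections(rng: np.random.Generator) -> dict:
    """Create one (W_i, b_i) pair per modality; random values stand in for trained ones."""
    return {
        key: (rng.normal(scale=dim ** -0.5, size=(UNIFIED_DIM, dim)),
              np.zeros(UNIFIED_DIM))
        for key, dim in ENCODER_DIMS.items()
    }

def project(features: dict, projections: dict) -> np.ndarray:
    """Compute m_i = W_i v_i + b_i for each modality and stack them as V."""
    rows = []
    for key in ("I", "A", "S", "T"):
        W, b = projections[key]
        rows.append(W @ features[key] + b)
    return np.stack(rows)

rng = np.random.default_rng(0)
features = {key: rng.normal(size=dim) for key, dim in ENCODER_DIMS.items()}
V = project(features, init_projections(rng))
```

After projection all four modalities share the same dimension, so `V` has shape (4, 1024) and can be passed to a fusion model as a sequence of four tokens.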

[0221] Step (3) specifically refers to: using Flamingo (DeepMind multimodal Transformer) to perform extended modal fusion and obtain the unified fusion feature $v_{\text{multimodal}}$:

[0222] $$v_{\text{multimodal}}=\text{Flamingo}(m_I,\,m_A,\,m_S,\,m_T)$$

[0223] where $v_{\text{multimodal}}$ is the unified fusion feature; $m_I$ is the representation vector of the image modality; $m_A$ is the representation vector of the sound modality; $m_S$ is the representation vector of the sensor modality; and $m_T$ is the representation vector of the text modality. $v_{\text{multimodal}}$ integrates features from images, sound, sensors, and text to capture comprehensive information about the equipment status. The Flamingo model is a large-scale visual language model that processes interleaved images and text.

[0224] Step (4) specifically includes the following steps:

[0225] (4a) The input unified fusion feature $v_{\text{multimodal}}$ is normalized to obtain the normalized feature $v_{\text{norm}}$:

[0226] $$v_{\text{norm}}=\frac{v_{\text{multimodal}}-\mu_v}{\sigma_v}$$

[0227] where $\mu_v$ is the mean of all $v_{\text{multimodal}}$ in the training set; $\sigma_v$ is the standard deviation of all $v_{\text{multimodal}}$ in the training set; and $v_{\text{norm}}\in\mathbb{R}^{D}$ is the normalized feature, with $D$ the unified dimension of the modality representation vectors. Model training requires a dataset consisting of a series of {image, sound, sensor, text} tuples. The dataset is divided into a training set and a test set according to a fixed ratio; the training set is used to train the model, and the test set is used to evaluate how well the model generalizes.

[0228] (4b) Contrastive learning detection: The normal sample library collects a large set of feature vectors under normal operating conditions, $\{v_{\text{normal},1},v_{\text{normal},2},\ldots,v_{\text{normal},M}\}$, where $M$ is the number of normal samples; the negative sample library collects a set of known abnormal samples, $\{v_{\text{abnormal},1},\ldots,v_{\text{abnormal},N}\}$, where $N$ is the number of abnormal samples;

[0229] Cosine similarity is used to measure the similarity $\text{sim}(v_{\text{norm}},v_{\text{normal},i})$ between the normalized feature $v_{\text{norm}}$ and each normal sample:

[0230] $$\text{sim}(v_{\text{norm}},v_{\text{normal},i})=\frac{v_{\text{norm}}\cdot v_{\text{normal},i}}{\lVert v_{\text{norm}}\rVert\,\lVert v_{\text{normal},i}\rVert}$$

[0231] where $v_{\text{normal},i}$ is the feature of the $i$-th normal sample;

[0232] A contrastive learning loss function $\mathcal{L}_{\text{contrastive}}$ is defined to encourage $v_{\text{norm}}$ to be closer to the normal samples and farther from the abnormal samples:

[0233]

[0234] where $v_j$ is the feature of the $j$-th sample in the combined set of normal and abnormal samples, and $\tau$ is the temperature coefficient, set to 0.1.

[0235] Finally, the average similarity $S_{\text{contrastive}}$ between $v_{\text{norm}}$ and the normal samples is calculated:

[0236] $$S_{\text{contrastive}}=\frac{1}{M}\sum_{i=1}^{M}\text{sim}(v_{\text{norm}},v_{\text{normal},i})$$

[0237] A value of $S_{\text{contrastive}}$ close to 1 indicates that the current equipment status is normal, while a value close to 0 indicates an abnormality.
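The average-similarity score of step (4b) reduces to a few lines of NumPy. The sketch below covers only the scoring side (cosine similarity against a bank of normal samples, then the mean); training with the contrastive loss is not shown, and the function names are illustrative.

```python
import numpy as np

def cosine_similarity(v: np.ndarray, bank: np.ndarray) -> np.ndarray:
    """Cosine similarity between vector v and every row of `bank`."""
    v_unit = v / np.linalg.norm(v)
    bank_unit = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    return bank_unit @ v_unit

def contrastive_score(v_norm: np.ndarray, normal_bank: np.ndarray) -> float:
    """S_contrastive: mean cosine similarity to the M normal samples."""
    return float(cosine_similarity(v_norm, normal_bank).mean())

normal_bank = np.array([[2.0, 0.0], [1.0, 0.0]])  # toy normal sample library
score_normal = contrastive_score(np.array([1.0, 0.0]), normal_bank)
score_odd = contrastive_score(np.array([0.0, 1.0]), normal_bank)
```

A feature aligned with the normal library scores 1, whereas an orthogonal feature scores 0, matching the interpretation given in the text.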

[0238] (4c) Autoencoder detection: The autoencoder detects anomalies from the reconstruction error, on the assumption that normal data is reconstructed with a relatively small error. The autoencoder consists of an encoder and a decoder; the encoder compresses $v_{\text{norm}}$ into a low-dimensional latent space:

[0239] $$z=f_{\text{enc}}(v_{\text{norm}})=W_{\text{enc}}\cdot\text{ReLU}(W_{\text{in}}v_{\text{norm}}+b_{\text{in}})+b_{\text{enc}}$$

[0240] where $z\in\mathbb{R}^{d_z}$ with $d_z<D$; $d_z$ is set to 256, and $D$, the dimension of the unified representation vector of each modality, is set to 1024; $W_{\text{enc}}$ and $W_{\text{in}}$ are weight matrices; $b_{\text{in}}$ and $b_{\text{enc}}$ are bias vectors; ReLU is the activation function; and $z$ is the low-dimensional feature vector produced by the encoder;

[0241] The decoder reconstructs the original dimension from $z$:

[0242] $$\hat{v}=f_{\text{dec}}(z)=W_{\text{dec}}\cdot\text{ReLU}(W_{\text{hidden}}z+b_{\text{hidden}})+b_{\text{dec}}$$

[0243] where $W_{\text{dec}}$ and $W_{\text{hidden}}$ are weight matrices; $b_{\text{hidden}}$ and $b_{\text{dec}}$ are bias vectors; $\hat{v}\in\mathbb{R}^{D}$ is the reconstructed feature; and $f_{\text{dec}}(z)$ denotes decoding the low-dimensional feature vector $z$;

[0244] The autoencoder is trained on normal data, that is, {image, sound, sensor, text} data from equipment in a normal state. A reconstruction loss $\mathcal{L}_{\text{recon}}$ is defined to minimize the reconstruction error of normal data:

[0245] $$\mathcal{L}_{\text{recon}}=\lVert v_{\text{norm}}-\hat{v}\rVert_2^2$$

[0246] After training is complete, the reconstruction error $S_{\text{recon}}$ is calculated on the test data:

[0247] $$S_{\text{recon}}=\lVert v_{\text{norm}}-\hat{v}\rVert_2^2$$

[0248] A larger $S_{\text{recon}}$ indicates that the current equipment status is abnormal;

[0249] $S_{\text{recon}}$ is normalized using the statistics of the training set:

[0250] $$S'_{\text{recon}}=\frac{S_{\text{recon}}-\mu_{\text{recon}}}{\sigma_{\text{recon}}}$$

[0251] where $\mu_{\text{recon}}$ is the mean of all $S_{\text{recon}}$ in the training set; $\sigma_{\text{recon}}$ is the standard deviation of all $S_{\text{recon}}$ in the training set; and $S'_{\text{recon}}$ is the normalized reconstruction error;

[0252] (4d) The results of contrastive learning and the autoencoder are fused to generate the final anomaly score $S_{\text{anomaly}}$:

[0253] $$S_{\text{anomaly}}=\alpha\,(1-S_{\text{contrastive}})+(1-\alpha)\,S'_{\text{recon}}$$

[0254] where $\alpha\in[0,1]$ is the weight parameter, set to 0.5, and the term $1-S_{\text{contrastive}}$ converts similarity into an anomaly degree.

[0255] A threshold $T_{\text{ano}}$ is set; when $S_{\text{anomaly}}>T_{\text{ano}}$, the equipment is judged to be abnormal;
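Steps (4c) and (4d) can be combined in a short sketch: a two-layer encoder and decoder produce a reconstruction error, which is normalized and fused with the contrastive similarity. The decoder structure mirrors the encoder and the error is taken as a squared Euclidean distance; both are assumptions, as are the parameter dictionary layout and the hidden-layer width, which the text does not specify.

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(x, 0.0)

def reconstruction_error(v_norm: np.ndarray, p: dict) -> float:
    """S_recon: squared distance between v_norm and its autoencoder reconstruction."""
    z = p["W_enc"] @ relu(p["W_in"] @ v_norm + p["b_in"]) + p["b_enc"]       # encoder
    v_hat = p["W_dec"] @ relu(p["W_hidden"] @ z + p["b_hidden"]) + p["b_dec"]  # assumed decoder
    return float(np.sum((v_norm - v_hat) ** 2))

def anomaly_score(s_contrastive: float, s_recon: float,
                  mu_recon: float, sigma_recon: float, alpha: float = 0.5) -> float:
    """S_anomaly = alpha * (1 - S_contrastive) + (1 - alpha) * S'_recon."""
    s_recon_norm = (s_recon - mu_recon) / sigma_recon
    return alpha * (1.0 - s_contrastive) + (1.0 - alpha) * s_recon_norm

def is_abnormal(score: float, t_ano: float) -> bool:
    """Threshold decision; T_ano must be chosen by the operator."""
    return score > t_ano

score = anomaly_score(s_contrastive=0.9, s_recon=3.0, mu_recon=1.0, sigma_recon=2.0)
```

With these toy numbers the similarity term contributes 0.05 and the reconstruction term 0.5, so the fused score is 0.55; whether that counts as abnormal depends entirely on the chosen threshold.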

[0256] If an anomaly is detected, the fault type is further classified by a fully connected neural network:

[0257] $$P(\text{fault})=\text{softmax}(W_c\cdot v_{\text{norm}}+b_c)$$

[0258] where $W_c\in\mathbb{R}^{K\times D}$ and $b_c\in\mathbb{R}^{K}$ are the weights and biases, respectively; $K$ is the number of fault categories, such as "normal", "overheating", and "abnormal vibration"; $P(\text{fault})$ is the probability distribution over the fault categories; and softmax is the activation function.

[0259] During training, labeled fault data is required to optimize the cross-entropy loss:

[0260] $$\mathcal{L}_{\text{CE}}=-\sum_{k=1}^{K}y_k\log P_k$$

[0261] where $y_k$ is the true label; $P_k$ is the $k$-th term of the probability distribution output by the model; and $\mathcal{L}_{\text{CE}}$ is the cross-entropy loss used to train the model.
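The fault classifier and its training loss can be written out directly. This sketch assumes a single fully connected layer followed by softmax and a one-hot label vector; the small `eps` term is added here only for numerical safety and is not part of the description.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax."""
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def classify_fault(v_norm: np.ndarray, W_c: np.ndarray, b_c: np.ndarray) -> np.ndarray:
    """P(fault) = softmax(W_c v_norm + b_c)."""
    return softmax(W_c @ v_norm + b_c)

def cross_entropy(y_true: np.ndarray, p: np.ndarray, eps: float = 1e-12) -> float:
    """L_CE = -sum_k y_k log P_k for a one-hot label y_true."""
    return float(-np.sum(y_true * np.log(p + eps)))

# Untrained (all-zero) weights give a uniform distribution over K = 3 classes.
p = classify_fault(np.ones(4), np.zeros((3, 4)), np.zeros(3))
loss = cross_entropy(np.array([0.0, 1.0, 0.0]), p)
```

For a uniform prediction over three classes the loss equals log 3, which is the usual starting point before training lowers it.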

[0262] Step (5) specifically includes the following steps:

[0263] (5a) The future operating status of the equipment is predicted, and early warnings are issued based on the prediction results. The Informer model (a self-attention time series model) processes the time series data of the equipment status and, combined with historical failure modes and normal-state benchmarks, predicts potential failures and provides operation and maintenance suggestions.

[0264] Using time series data of the fused features, the system predicts future equipment status trends and identifies potential failure risks in advance by analyzing deviation from normal conditions. The final output includes future status predictions and early warning signals to guide intelligent operation and maintenance in thermal power plants.

[0265] The unified fusion features $v_{\text{multimodal}}$ are organized into a time series and normalized. Feature vectors for $T$ consecutive time steps are extracted from historical data, $V_t=[v_{\text{multimodal},t-T+1},v_{\text{multimodal},t-T+2},\ldots,v_{\text{multimodal},t}]$, where $V_t\in\mathbb{R}^{T\times D_m}$ and the feature dimension $D_m$ equals the unified dimension $D$ of the modality representation vectors, i.e., 1024. Each dimension is then normalized:

[0266] $$V'_t=\frac{V_t-\mu_t}{\sigma_t}$$

[0267] where $\mu_t$ and $\sigma_t$ are the time series mean and standard deviation of the training set, and $V'_t$ is the normalized time series;

[0268] (5b) The Informer model is used to predict the future state features of the equipment. Placeholders for the future time steps are input; they first pass through the Transformer encoding layer of the Informer decoder and are then fed, together with the output of the Informer encoder, into a multi-head self-attention layer, which outputs the predicted value, as shown in Figure 7:

[0269] $$\hat{v}_{t+1}=\text{Informer}(V'_t)$$

[0270] where $\hat{v}_{t+1}$ is the predicted value for the next step;

[0271] During training, the mean squared error $\mathcal{L}_{\text{pred}}$ between the predicted value and the true future state is minimized:

[0272] $$\mathcal{L}_{\text{pred}}=\lVert\hat{v}_{t+1}-v_{\text{multimodal},t+1}\rVert_2^2$$

[0273] where $v_{\text{multimodal},t+1}$ is the actual state at the next step, i.e., the actual future state; $\hat{v}_{t+1}$ is the predicted value for the next step; and $\mathcal{L}_{\text{pred}}$ is the mean squared error between the predicted value and the actual future state.

[0274] (5c) Based on the prediction result $\hat{v}_{t+1}$, early warning is achieved through deviation analysis and failure probability assessment:

[0275] (5c1) Deviation analysis: A normal-state baseline is defined by using normal operation data to calculate the average feature vector $v_{\text{normal\_benchmark}}$:

[0276] $$v_{\text{normal\_benchmark}}=\frac{1}{M}\sum_{i=1}^{M}v_{\text{normal},i}$$

[0277] where $M$ is the number of normal samples and $v_{\text{normal},i}$ is the feature vector of the $i$-th normal sample;

[0278] The Euclidean distance between the predicted state $\hat{v}_{t+1}$ and the normal-state baseline $v_{\text{normal\_benchmark}}$ is then calculated as the deviation measure $D_{\text{deviation}}$:

[0279] $$D_{\text{deviation}}=\lVert\hat{v}_{t+1}-v_{\text{normal\_benchmark}}\rVert_2$$

[0280] where $\hat{v}_{t+1}$ is the predicted state, and $D_{\text{deviation}}$ measures the degree of deviation between the predicted state and the normal-state baseline;

[0281] Finally, a deviation threshold $T_{\text{dev}}$ is set; when $D_{\text{deviation}}>T_{\text{dev}}$, an alert is triggered;
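The deviation analysis of step (5c1) is a mean followed by a Euclidean distance and a threshold test. A minimal sketch, with hypothetical function names and a toy two-dimensional feature space standing in for the 1024-dimensional one:

```python
import numpy as np

def normal_benchmark(normal_features: np.ndarray) -> np.ndarray:
    """Average feature vector over the M normal samples (rows)."""
    return normal_features.mean(axis=0)

def deviation(v_pred: np.ndarray, benchmark: np.ndarray) -> float:
    """D_deviation: Euclidean distance between prediction and baseline."""
    return float(np.linalg.norm(v_pred - benchmark))

def deviation_alert(v_pred: np.ndarray, benchmark: np.ndarray, t_dev: float) -> bool:
    """Trigger an alert when the deviation exceeds the threshold T_dev."""
    return deviation(v_pred, benchmark) > t_dev

benchmark = normal_benchmark(np.array([[0.0, 0.0], [2.0, 0.0]]))
d = deviation(np.array([4.0, 4.0]), benchmark)
```

Here the baseline is (1, 0) and the predicted state (4, 4) lies at distance 5 from it, so an alert fires for any threshold below 5.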

[0282] (5c2) Failure probability assessment:

[0283] First, a failure mode library $\{v_{\text{fault},1},\ldots,v_{\text{fault},K}\}$ is constructed, where $K$ is the number of fault categories. Cosine similarity is then used to evaluate the proximity $S_k$ of the predicted state $\hat{v}_{t+1}$ to each failure mode:

[0284] $$S_k=\frac{\hat{v}_{t+1}\cdot v_{\text{fault},k}}{\lVert\hat{v}_{t+1}\rVert\,\lVert v_{\text{fault},k}\rVert}$$

[0285] where $v_{\text{fault},k}$ is the $k$-th failure mode. The result is normalized to a probability:

[0286] $$P(\text{fault}_k)=\frac{\exp(S_k/\tau_p)}{\sum_{j=1}^{K}\exp(S_j/\tau_p)}$$

[0287] where $\tau_p$ is the temperature coefficient, set to 0.1, and $P(\text{fault}_k)$ is the probability that the $k$-th failure mode occurs;

[0288] When $\max_k P(\text{fault}_k)>T_{\text{prob}}$, where $T_{\text{prob}}$ is a threshold set to 0.7, the failure mode with the highest probability of occurrence is determined to be a high-risk fault type. In other words, as soon as the highest failure mode probability exceeds the threshold $T_{\text{prob}}$, the corresponding fault is flagged as high risk;

[0289] The index $k^{*}=\arg\max_k P(\text{fault}_k)$ denotes the failure mode with the highest probability of occurrence, and a suggestion corresponding to that category is provided. For example, if the prediction result is "overheating", the suggestion is to "check the cooling system". Here $\arg\max_k$ returns the index of the largest entry of $P(\text{fault}_k)$, each entry being the probability of one fault category.
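The failure probability assessment of step (5c2) can be sketched as cosine similarity against a fault library, a temperature softmax, and a threshold on the top class. The temperature 0.1 and threshold 0.7 come from the text; the softmax form, the function names, and the toy fault library are assumptions for illustration. Only the "overheating" suggestion is taken from the text.

```python
import numpy as np

FAULT_ADVICE = {"overheating": "check the cooling system"}  # example from the text

def failure_probabilities(v_pred: np.ndarray, fault_library: np.ndarray,
                          tau_p: float = 0.1) -> np.ndarray:
    """Temperature softmax over cosine similarities to each failure mode."""
    sims = (fault_library @ v_pred) / (
        np.linalg.norm(fault_library, axis=1) * np.linalg.norm(v_pred))
    logits = sims / tau_p
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exp / exp.sum()

def high_risk_fault(probs: np.ndarray, names: list, t_prob: float = 0.7):
    """Return the most probable fault name if its probability exceeds T_prob, else None."""
    k_star = int(np.argmax(probs))
    return names[k_star] if probs[k_star] > t_prob else None

names = ["overheating", "abnormal vibration"]
library = np.array([[1.0, 0.0], [0.0, 1.0]])   # toy failure mode vectors
probs = failure_probabilities(np.array([1.0, 0.0]), library)
fault = high_risk_fault(probs, names)
advice = FAULT_ADVICE.get(fault)
```

The low temperature sharpens the distribution, so a prediction aligned with one failure mode yields a probability close to 1 for that mode, while an ambiguous prediction stays below the threshold and raises no high-risk flag.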

[0290] Step (6) specifically includes the following steps:

[0291] (6a) The new data $X_{\text{new}}$ undergoes multimodal encoding and multimodal fusion to obtain the fused feature vector $v_{\text{multimodal,new}}$:

[0292]-[0293] $$v_{\text{multimodal,new}}=\text{Flamingo}\big(\text{BEiT}(x_{I,\text{new}}),\,\text{Whisper}(x_{A,\text{new}}),\,\text{Informer}(x_{S,\text{new}}),\,\text{T5}(x_{T,\text{new}})\big)$$

[0294] where $x_{I,\text{new}}$ is the image data in the new data; $x_{A,\text{new}}$ is the sound signal in the new data; $x_{S,\text{new}}$ is the sensor data in the new data; $x_{T,\text{new}}$ is the text data in the new data; and $v_{\text{multimodal,new}}$ is the fused feature vector of the new data;

[0295] $v_{\text{multimodal,new}}$ is then normalized:

[0296] $$v_{\text{norm,new}}=\frac{v_{\text{multimodal,new}}-\mu}{\sigma}$$

[0297] where $\mu$ and $\sigma$ are the mean and standard deviation calculated from historical data, respectively, and $v_{\text{norm,new}}$ is the normalized fused feature vector;

[0298] For trend prediction, the new data is organized into a time series with a window length of $T$: $V_{\text{new},t}=[v_{\text{multimodal,new},t-T+1},\ldots,v_{\text{multimodal,new},t}]$;

[0299] (6b) The incremental learning mechanism is used to learn new data step by step. The main steps are as follows:

[0300] (6b1) Data Buffering and Selection:

[0301] First, a buffer pool $\mathcal{B}$ of fixed size is maintained, with a capacity of 1000. It stores historical data samples $\{(v_{\text{norm},i},y_i)\}$ and new data samples $\{(v_{\text{norm,new}},y_{\text{new}})\}$, where $y_i$ and $y_{\text{new}}$ are status labels;

[0302] Then, the buffer pool is updated by random replacement to keep a balance between old and new data:

[0303]

[0304] where $|\mathcal{B}|$ is the number of samples already in the buffer pool; $\text{size}(v_{\text{new}})$ is the number of newly added data samples; and $P(\text{keep})$ is the probability that each old sample in the buffer pool is retained;

[0305] (6b2) Model Update:

[0306] The model is fine-tuned on the data in the buffer pool so that it learns from the new data while preserving its performance on the old data. For the equipment anomaly detection module, the loss function $\mathcal{L}_{\text{detect}}$ is:

[0307]

[0308] where $\tau_c$ is the temperature coefficient, set to 0.1; $v_{\text{normal}}$ is a normal sample; and $v_j$ is the $j$-th sample in the buffer pool;

[0309] For the equipment status trend prediction module, the loss function $\mathcal{L}_{\text{pred}}$ is:

[0310] $$\mathcal{L}_{\text{pred}}=\lVert\hat{v}_{\text{new},t+1}-v_{\text{multimodal,new},t+1}\rVert_2^2$$

[0311] where $\hat{v}_{\text{new},t+1}$ is the predicted value for the next step corresponding to the new data, i.e., the predicted state, and $v_{\text{multimodal,new},t+1}$ is the actual state at the next step corresponding to the new data, i.e., the actual future state;

[0312] The model parameters are updated using mini-batch gradient descent;

[0313] (6c) Knowledge distillation optimization:

[0314] First, the output of the teacher model on the new data is obtained:

[0315] $$\hat{y}_{\text{teacher},i}=f_{\text{teacher}}(v_{\text{norm,new},i})$$

[0316] where $f_{\text{teacher}}$ is the teacher model; $v_{\text{norm,new},i}$ is the normalized feature of the $i$-th new data sample; and $\hat{y}_{\text{teacher},i}$ is the output of the teacher model;

[0317] Then, the output of the student model on the new data is obtained:

[0318] $$\hat{y}_{\text{student},i}=f_{\text{student}}(v_{\text{norm,new},i})$$

[0319] where $f_{\text{student}}$ is the student model and $\hat{y}_{\text{student},i}$ is the output of the student model;

[0320] Finally, the KL divergence loss $\mathcal{L}_{\text{KD}}$ is calculated:

[0321] $$\mathcal{L}_{\text{KD}}=D_{\text{KL}}\big(\hat{y}_{\text{teacher}}\,\Vert\,\hat{y}_{\text{student}}\big)$$

[0322] where $D_{\text{KL}}$ denotes the KL divergence;

[0323] The total loss $\mathcal{L}_{\text{total}}$ is then calculated:

[0324]

[0325] where $\theta$ is the balance factor, set to 0.7, and the task-specific loss $\mathcal{L}_{\text{task}}$ is:

[0326] $$\mathcal{L}_{\text{task}}=\mathcal{L}_{\text{detect}}+\mathcal{L}_{\text{pred}}$$

[0327] where $\mathcal{L}_{\text{detect}}$ is the detection loss and $\mathcal{L}_{\text{pred}}$ is the prediction loss;

[0328] The student model is updated by optimizing $\mathcal{L}_{\text{total}}$, and the updated model replaces the old parameters in each module of the system. An update is triggered each time a certain amount of new data has been received, namely 70 to 100 data entries, and the system rolls back to the old model if performance after the update degrades by more than 3%.
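The update-and-rollback policy at the end of step (6c) can be sketched as a small guard around the training call. The trigger count and the 3% limit come from the text; treating the 3% as a relative drop in a single scalar metric, and the `train_fn` and `eval_fn` callables, are assumptions introduced for illustration.

```python
def maybe_update(model, new_samples, train_fn, eval_fn,
                 trigger_count: int = 70, max_degradation: float = 0.03):
    """Return (model_to_deploy, updated_flag).

    train_fn(model, samples) -> candidate model (hypothetical distillation step)
    eval_fn(model) -> scalar score where higher is better (hypothetical metric)
    """
    if len(new_samples) < trigger_count:
        return model, False                  # not enough new data yet
    baseline = eval_fn(model)
    candidate = train_fn(model, new_samples)
    score = eval_fn(candidate)
    # Relative degradation is an assumed reading of "exceeding 3%".
    if baseline > 0 and (baseline - score) / baseline > max_degradation:
        return model, False                  # roll back to the old model
    return candidate, True

# Toy usage: a "model" is represented by its own validation score.
identity_eval = lambda m: m
kept, updated = maybe_update(0.90, list(range(80)), lambda m, s: 0.80, identity_eval)
```

In the toy call the candidate drops from 0.90 to 0.80, a relative degradation of about 11%, so the old model is kept and the update is rejected.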

[0329] As shown in Figure 2, the system includes:

[0330] The multimodal data acquisition module, which collects the equipment's operating data, i.e., multimodal data, in real time and transmits the multimodal data to the multimodal Transformer model;

[0331] The multimodal data encoding module, which uses the BEiT-3 model, Whisper model, Informer model, and T5 model to encode the different types of multimodal data and generate a unified multimodal representation;

[0332] The multimodal fusion analysis module, which uses the Flamingo model to perform cross-modal data fusion and analysis on the unified multimodal representation to obtain unified fusion features;

[0333] The equipment anomaly detection and fault diagnosis module, which, based on the unified fusion features, uses contrastive learning and an autoencoder to calculate the anomaly degree of the equipment and thereby determine whether the equipment is abnormal;

[0334] The equipment status trend prediction and early warning module, which uses the Informer model to predict the future equipment status and the probability of potential failures based on the unified fusion features, and provides operation and maintenance suggestions and early warnings;

[0335] The online learning and model optimization module, which optimizes the multimodal Transformer model through incremental learning and knowledge distillation.

[0336] In summary, this invention proposes to improve fault detection accuracy by employing multimodal fusion: traditional thermal power plant equipment monitoring systems mainly rely on single sensor data, resulting in low fault detection accuracy; while this invention uses a multimodal large model, combining image, sound, sensor and text data, and utilizes Flamingo for cross-modal feature fusion, which achieves higher accuracy and lower false alarm and false negative rates compared to traditional methods.

[0337] This invention significantly improves the lead time for early warning: Traditional monitoring systems typically use fixed thresholds or rule-based methods for equipment early warning, which can only issue alarms after a fault occurs or when it is about to occur; while this invention uses the Informer time series prediction model to perform trend analysis on equipment status, which can predict the time of fault occurrence, greatly improving the lead time for prediction and providing maintenance personnel with more time for repairs. By combining anomaly detection and prediction, the probability of sudden failures is reduced.

[0338] The present invention enhances the intelligence and adaptability of equipment fault diagnosis: Traditional systems require experts to manually set thresholds or maintain them based on rule bases, resulting in weak adaptability to different equipment types and operating environments. In contrast, the present invention, through an incremental learning mechanism, automatically learns the operating characteristics of different thermal power plants and equipment. When migrating to new equipment, only minor adjustments with a small amount of new data are needed to achieve accurate monitoring. Simultaneously, knowledge distillation is employed to achieve incremental updates to the model, allowing it to continuously adapt to new equipment operating modes without affecting its original performance.

[0339] This invention enhances the interpretability of fault causes: traditional AI systems often lack interpretability in their fault detection results, hindering rapid decision-making by maintenance personnel; while this invention uses a T5 model, combined with historical maintenance logs, to automatically generate fault cause descriptions and provide maintenance suggestions. Through BEiT image analysis and Whisper sound diagnostics, it can generate visual and audio playback fault diagnosis reports, helping engineers quickly locate the source of the fault.

[0340] This invention reduces manual operation and maintenance costs and improves operation and maintenance efficiency: Traditional thermal power plant equipment monitoring requires a large number of operation and maintenance personnel to conduct inspections, manual data analysis, and manual early warning, resulting in high operation and maintenance costs and slow response speed; This invention reduces manual inspection time, provides automatic alarms and remote diagnosis through intelligent automated monitoring, improves fault handling efficiency, reduces unplanned downtime, increases equipment availability, and reduces the average annual maintenance cost of thermal power plants.

[0341] This invention is applicable to different thermal power plant equipment and has strong versatility: traditional systems often require building separate detection models for different equipment, which is difficult to reuse; this invention adopts a multimodal large model pre-training-fine-tuning mechanism, which has strong transfer adaptability between different thermal power plants and different equipment, and reduces training costs, and can be quickly deployed to different plants, reducing deployment costs.

[0342] The foregoing has shown and described the basic principles, main features, and advantages of the present invention. Those skilled in the art should understand that the present invention is not limited to the above embodiments; the embodiments and the description in the specification merely illustrate the principles of the invention. Various changes and modifications can be made without departing from the spirit and scope of the invention, and all such changes and modifications fall within the scope of the claimed invention. The scope of protection is defined by the appended claims and their equivalents.

Claims

1. A method for intelligent monitoring and early warning of equipment defects in thermal power plants based on a multimodal large model, characterized in that the method includes the following steps in sequence:
(1) collecting the equipment's operating data, i.e., multimodal data, in real time, the multimodal data including image data, sound data, sensor data, and text data, and transmitting the multimodal data to the multimodal Transformer model;
(2) constructing a multimodal Transformer model: the BEiT-3 model, Whisper model, Informer model, T5 model, Flamingo model, and autoencoder together form the multimodal Transformer model, which encodes the different types of multimodal data and generates a unified multimodal representation;
(3) performing cross-modal data fusion and analysis on the unified multimodal representation to obtain unified fusion features;
(4) based on the unified fusion features, calculating the anomaly degree of the equipment using contrastive learning and an autoencoder to determine whether the equipment is abnormal;
(5) based on the unified fusion features, predicting the future equipment status and the probability of potential failures, and providing operation and maintenance suggestions and early warnings;
(6) optimizing the multimodal Transformer model through incremental learning and knowledge distillation;
step (2) specifically includes the following steps:
(2a) encoding images using the BEiT-3 model: $v_I=\text{BEiT}(x_I)$, where $v_I\in\mathbb{R}^{d_I}$ is the output image feature vector, representing the visual features of the equipment appearance and thermal imaging, with $d_I=1024$; $x_I$ is the input image, whose height $H$ and width $W$ are both 224;
(2b) encoding the sound signal using the encoder of the Whisper model: $v_A=\text{Whisper}(x_A)$, where $v_A\in\mathbb{R}^{d_A}$ is the output sound feature vector, reflecting the abnormal noise patterns during equipment operation, with $d_A=512$; $x_A$ is the input sound spectrum;
(2c) encoding the sensor data using the encoder of the Informer model: $v_S=\text{Informer}(x_S)$, where $v_S\in\mathbb{R}^{d_S}$ is the output sensor feature vector, capturing dynamic changes in the equipment state, with $d_S=512$; $x_S$ is the input sensor time series;
(2d) encoding text using the T5 model: $v_T=\text{T5}(x_T)$, where $v_T\in\mathbb{R}^{d_T}$ is the output text feature vector, containing semantic information about operation and maintenance logs and historical faults, with $d_T=1024$; $x_T$ is the input text sequence;
(2e) mapping the feature vector of each modality, through a linear projection layer, to the unified dimension $D$ of the modality representation vectors, i.e., 1024, for subsequent fusion: $m_i=W_i\cdot v_i+b_i$, where $m_i$ is the unified representation of the $i$-th modality, with $i\in\{I,A,S,T\}$; $W_i\in\mathbb{R}^{D\times d_i}$ is the projection matrix; $d_i$ is the dimension of the feature vector of the $i$-th modality; $b_i$ is the bias vector; and $v_i$ is the feature vector of the $i$-th modality; the final output is the set of unified multimodal representation vectors $V=[m_I,m_A,m_S,m_T]$, where $m_I$ is the representation vector of the image modality, $m_A$ is the representation vector of the sound modality, $m_S$ is the representation vector of the sensor modality, and $m_T$ is the representation vector of the text modality;
step (3) specifically refers to: using Flamingo to perform extended modal fusion and obtain the unified fusion feature $v_{\text{multimodal}}=\text{Flamingo}(m_I,m_A,m_S,m_T)$, where $v_{\text{multimodal}}$ is the unified fusion feature, $m_I$ is the representation vector of the image modality, $m_A$ is the representation vector of the sound modality, $m_S$ is the representation vector of the sensor modality, and $m_T$ is the representation vector of the text modality.

2. The intelligent monitoring and early warning method for equipment defects in thermal power plants based on a multimodal large model according to claim 1, characterized in that step (1) specifically includes the following steps:
(1a) image acquisition: high-definition industrial cameras installed in the visible area of the electrical equipment acquire images of the equipment's exterior to detect physical defects; to reduce the influence of noise, Gaussian filtering is applied to the pixel at position $(x,y)$ of the image captured at time $t_i$: $I_{\text{filtered}}(x,y,t_i)=\sum_{u}\sum_{v}I(u,v,t_i)\,\frac{1}{2\pi\sigma^{2}}\exp\!\left(-\frac{(x-u)^{2}+(y-v)^{2}}{2\sigma^{2}}\right)$, where $\sigma$ is the standard deviation of the Gaussian kernel; $I_{\text{filtered}}(x,y,t_i)$ is the Gaussian-filtered value of the pixel at position $(x,y)$ at time $t_i$; and $I(u,v,t_i)$ is the value of the pixel at position $(u,v)$ in the image captured at time $t_i$;
(1b) sound acquisition: high-sensitivity microphones placed in noise-sensitive areas of the electrical equipment acquire operating sounds to identify abnormal noises; spectral subtraction is applied to the sound signal $S(f,t_j)$ collected at time $t_j$ for noise reduction: $S_{\text{clean}}(f,t_j)=\max\big(S(f,t_j)-\alpha_f N(f),\,0\big)$, where $N(f)$ is the noise estimate; $\alpha_f$ is the noise reduction coefficient, set to 1.5; and $S_{\text{clean}}(f,t_j)$ is the sound signal at time $t_j$ after spectral subtraction; the noise estimate $N(f)$ is obtained using the minimum statistics method: $N(f)=\beta\cdot\min_{t\in W_t}S(f,t)$, where $W_t$ is the time window and $\beta$ is the correction factor, set to 1.2;
(1c) sensor data acquisition: electrical operating parameters are acquired through current sensors, voltage sensors, temperature sensors, and vibration sensors to monitor equipment status; low-pass filtering is applied to the sensor data $X(t)$ acquired at time $t$: $X_{\text{filtered}}(t)=\int_{-\infty}^{\infty}X(\tau)\,h(t-\tau)\,d\tau$, where $X(\tau)$ is the raw sensor data acquired at time $\tau$ and $h(t)$ is the impulse response of the filter, calculated as $h(t)=\omega_c\,e^{-\omega_c t}\,u(t)$, where $\omega_c=2\pi f_c$, $f_c$ is the cutoff frequency, and $u(t)$ is the unit step function, equal to 1 when $t\geq 0$ and 0 otherwise;
(1d) text data collection: historical data and contextual information are provided through operation logs, electricity meter readings, and maintenance records; the text is segmented into words and its format is standardized;
(1e) data synchronization and transmission: the data streams are transmitted synchronously in real time and aligned by timestamp to obtain synchronized multimodal data, the synchronization formula being $D_{\text{sync}}(t)=\{I_{\text{filtered}}(t_i),\,S_{\text{clean}}(t_j),\,X_{\text{filtered}}(t_k),\,D_{\text{text}}(t_m)\}$, where $D_{\text{sync}}(t)$ is the synchronized multimodal data at time $t$; $I_{\text{filtered}}(t_i)$ is the filtered image data at time $t_i$; $S_{\text{clean}}(t_j)$ is the noise-reduced sound data at time $t_j$; $X_{\text{filtered}}(t_k)$ is the low-pass-filtered sensor data at time $t_k$; $D_{\text{text}}(t_m)=\text{Tokenize}(T(t_m))$ is the segmented text data at time $t_m$; $T(t_m)$ is the original text data at time $t_m$; Tokenize denotes the word segmentation operation; and $|t_i-t_j|<\epsilon$, $|t_i-t_k|<\epsilon$, $|t_i-t_m|<\epsilon$, where $\epsilon$ is the permitted time deviation.

3. The intelligent monitoring and early warning method for equipment defects in thermal power plants based on a multimodal large model according to claim 1, characterized in that step (4) specifically includes the following steps:
(4a) Normalize the input unified fusion features using the mean and standard deviation of the unified fusion features over the training set, obtaining normalized features whose dimension equals the unified dimension of the representation vectors of each modality;
(4b) Contrastive learning detection: the normal sample library holds a large set of feature vectors collected under normal operating conditions, and the negative sample library holds a set of known abnormal samples. Cosine similarity measures how close the normalized features are to each normal sample. A contrastive learning loss function is defined that pulls the normalized features toward the normal samples and pushes them away from the abnormal samples; it is computed over the combined set of normal and abnormal samples with a temperature coefficient of 0.1. Finally, the average similarity between the normalized features and the normal samples is calculated; a value close to 1 indicates that the current equipment status is normal, while a value close to 0 indicates an abnormality;
(4c) Autoencoder detection: the autoencoder consists of an encoder and a decoder. The encoder compresses the normalized features into a low-dimensional latent space of dimension 256, where D, the dimension of the unified representation vector of each modality, is set to 1024; the encoder is parameterized by weight matrices, biases, and an activation function, and outputs the encoded low-dimensional feature vector. The decoder, likewise parameterized by weight matrices and biases, reconstructs the low-dimensional feature vector back to the original dimension to give the reconstructed features. The autoencoder is trained on normal data with a reconstruction loss that minimizes the reconstruction error of normal data. After training, the reconstruction error is calculated on the test data; a larger value indicates that the current equipment status is abnormal. The reconstruction error is then normalized using the mean and standard deviation of the reconstruction errors over the training set, giving the normalized reconstruction error;
(4d) Fuse the results of contrastive learning and the autoencoder to generate the final anomaly score, using a weighting parameter set to 0.5; in this fusion the similarity is converted into a degree of anomaly. Set a threshold; when the anomaly score exceeds the threshold, the equipment is judged abnormal. If an anomaly is detected, the fault type is further classified by a fully connected neural network, defined by its weights, biases, and activation function, which outputs a probability distribution over the fault categories. Training requires labeled fault data and optimizes the cross-entropy loss between the true labels and the probability distribution output by the model.
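
The score fusion in step (4d) can be illustrated with a minimal Python sketch. The source formula is not reproduced in this text, so the form used here, `alpha * (1 - mean_similarity) + (1 - alpha) * normalized_error`, is an assumption inferred from the statement that similarity is converted into anomaly and that the weighting parameter is 0.5. All function names are hypothetical, and the threshold is left to the caller because the source does not state its value.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_similarity(feature, normal_samples):
    """Average cosine similarity between a feature and the normal samples."""
    total = sum(cosine_similarity(feature, sample) for sample in normal_samples)
    return total / len(normal_samples)


def anomaly_score(mean_sim, normalized_error, alpha=0.5):
    """Assumed fusion: (1 - similarity) turns similarity into anomaly,
    then it is blended with the normalized reconstruction error."""
    return alpha * (1.0 - mean_sim) + (1.0 - alpha) * normalized_error


def is_abnormal(score, threshold):
    """Flag the equipment as abnormal when the score exceeds the threshold."""
    return score > threshold


if __name__ == "__main__":
    normal_library = [[1.0, 0.0], [0.9, 0.1]]
    current = [0.2, 1.0]
    sim = mean_similarity(current, normal_library)
    print(is_abnormal(anomaly_score(sim, normalized_error=0.6), threshold=0.5))
```

In this sketch the normalized reconstruction error is passed in as a plain number; in the claimed method it would come from the trained autoencoder of step (4c).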

4. The intelligent monitoring and early warning method for equipment defects in thermal power plants based on a multimodal large model according to claim 1, characterized in that step (5) specifically includes the following steps:
(5a) Perform time series construction and normalization on the unified fusion features: extract the feature vectors of consecutive time steps from the historical data, where the feature dimension is the same as the dimension of the representation vectors of each modality, i.e., 1024; then normalize each dimension using the time series mean and standard deviation of the training set to obtain the normalized time series;
(5b) Use the Informer model to predict the future state features of the equipment: the placeholders for the future time steps are input, pass first through the Transformer encoding layer of the Informer model decoder, and are then fed together with the output of the Informer model encoder into the multi-head self-attention layer, which outputs the predicted value for the next step. During training, the mean squared error between the predicted value for the next step and the actual future state is minimized;
(5c) Based on the prediction results, issue early warnings through deviation analysis and failure probability assessment.
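
The per-dimension normalization of step (5a) and the training objective of step (5b) can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, a zero standard deviation is replaced by 1.0 to avoid division by zero (a safeguard not mentioned in the source), and the Informer model itself is not implemented here.

```python
def normalize_series(series, means, stds):
    """Z-score each dimension of every time step using training-set statistics."""
    normalized = []
    for step in series:
        normalized.append([
            (value - mean) / (std if std > 0 else 1.0)  # assumed zero-std guard
            for value, mean, std in zip(step, means, stds)
        ])
    return normalized


def mean_squared_error(predicted, actual):
    """Mean squared error between a predicted and an actual state vector."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)


if __name__ == "__main__":
    # Two time steps with 2 features each (the claim uses 1024 features).
    history = [[1.0, 10.0], [3.0, 30.0]]
    print(normalize_series(history, means=[2.0, 20.0], stds=[1.0, 10.0]))
    print(mean_squared_error([1.0, 2.0], [1.0, 4.0]))
```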

5. The intelligent monitoring and early warning method for equipment defects in thermal power plants based on a multimodal large model according to claim 1, characterized in that step (6) specifically includes the following steps:
(6a) Pass the new data, consisting of image data, sound signals, sensor data, and text data, through multimodal encoding and multimodal fusion to obtain the fused feature vector of the new data; then normalize it using the mean and standard deviation calculated from historical data to obtain the normalized fused feature vector. For trend prediction, the new data is organized into a time series of windows with a set length;
(6b) Employ an incremental learning mechanism to learn the new data step by step;
(6c) Knowledge distillation optimization: first, obtain the teacher model's output on the normalized features of the new data; then obtain the student model's output on the same features; finally, calculate the KL divergence loss between the two outputs. The total loss combines the KL divergence loss and the task-specific loss using a balance factor set to 0.7, where the task-specific loss is computed from the detection loss and the prediction loss. The student model is updated by optimizing the total loss, and the updated student model replaces the old parameters of each module in the system. An update is triggered each time a certain amount of new data has been received, and the system rolls back to the old model if the performance degradation after the update exceeds a certain percentage.
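
The distillation loss of step (6c) can be sketched in Python as below. Several points are assumptions, since the source formulas are not reproduced in this text: the model outputs are treated as logits converted by a softmax, the KL divergence is taken from teacher to student, the task-specific loss is taken as the plain sum of the detection and prediction losses, and the balance factor 0.7 is applied to the task-specific loss with 0.3 on the distillation term. All function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def total_loss(detect_loss, predict_loss, kd_loss, balance=0.7):
    """Assumed combination: balance * task loss + (1 - balance) * KD loss."""
    task_loss = detect_loss + predict_loss  # assumed to be a plain sum
    return balance * task_loss + (1.0 - balance) * kd_loss


if __name__ == "__main__":
    teacher = softmax([2.0, 1.0, 0.1])
    student = softmax([1.5, 1.2, 0.3])
    kd = kl_divergence(teacher, student)
    print(total_loss(detect_loss=0.4, predict_loss=0.2, kd_loss=kd))
```

If the weighting in the granted claim is the other way round, only the two coefficients in `total_loss` need to be swapped.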

6. The intelligent monitoring and early warning method for equipment defects in thermal power plants based on a multimodal large model according to claim 4, characterized in that step (5c) specifically includes the following steps:
(5c1) Deviation analysis: define a normal-state baseline by calculating the average feature vector of the normal operation data, i.e., the mean of the feature vectors of all normal samples; then calculate the Euclidean distance between the predicted state and the normal-state baseline as the measure of deviation; finally, set a deviation threshold, and trigger an alert when the deviation exceeds it;
(5c2) Failure probability assessment: first, build a fault mode library covering all fault categories; then use cosine similarity to evaluate how close the predicted state is to each fault mode, and normalize the similarities into probabilities with a temperature coefficient of 0.1, giving the probability of each fault mode occurring. When the highest of these probabilities exceeds a threshold of 0.7, the corresponding fault mode is determined to be a high-risk fault type. The index of that fault mode is obtained by taking the position of the maximum value among the fault mode probabilities, and corresponding suggestions are provided for this highest-probability category.
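
Steps (5c1) and (5c2) can be sketched in Python as follows. The sketch assumes that the normalization to probabilities is a softmax over the cosine similarities divided by the temperature 0.1, which matches the description but is not confirmed by a reproduced formula. The function names and the toy two-dimensional vectors are hypothetical.

```python
import math


def mean_vector(samples):
    """Element-wise mean of normal-state feature vectors (the baseline)."""
    count = len(samples)
    return [sum(column) / count for column in zip(*samples)]


def euclidean_distance(a, b):
    """Deviation between a predicted state and the baseline."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def failure_probabilities(predicted, fault_modes, temperature=0.1):
    """Assumed softmax over cosine similarities scaled by the temperature."""
    scaled = [cosine_similarity(predicted, mode) / temperature for mode in fault_modes]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def high_risk_mode(probabilities, threshold=0.7):
    """Index of the most likely fault mode, or None if it stays under the threshold."""
    best = max(range(len(probabilities)), key=lambda k: probabilities[k])
    return best if probabilities[best] > threshold else None


if __name__ == "__main__":
    baseline = mean_vector([[0.0, 0.0], [2.0, 2.0]])
    print(euclidean_distance([4.0, 5.0], baseline))
    probs = failure_probabilities([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
    print(high_risk_mode(probs))
```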

7. The intelligent monitoring and early warning method for equipment defects in thermal power plants based on a multimodal large model according to claim 5, characterized in that step (6b) specifically includes the following steps:
(6b1) Data buffering and selection: first, maintain a buffer pool of fixed size, with a capacity of 1000, that stores historical data samples and new data together with their status labels; then update the buffer pool by random replacement to keep a balance between old and new data, where the probability that each old sample in the buffer pool is retained is determined by the number of samples already in the buffer pool and the number of newly added data samples;
(6b2) Model update: fine-tune the model using data from the buffer pool, so that it adapts to the new data while preserving performance on the old data. For the equipment anomaly detection module, the loss function is a contrastive loss over the normal samples and the samples in the buffer pool, with a temperature coefficient of 0.1. For the equipment status trend prediction module, the loss function measures the error between the predicted value for the next step corresponding to the new data, i.e., the predicted state, and the actual state of the next step corresponding to the new data, i.e., the actual future state. The model parameters are updated using mini-batch gradient descent.
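
The buffer pool update of step (6b1) can be sketched in Python as below. The retention formula is not reproduced in this text, so the form `n_old / (n_old + n_new)` is purely an assumption chosen because it depends on the same two quantities the claim names. The function names are hypothetical, and the final down-sampling to the capacity of 1000 is an added safeguard rather than something the source states.

```python
import random


def retention_probability(n_old, n_new):
    """Assumed retention probability for each old sample (not from the source)."""
    if n_old + n_new == 0:
        return 1.0
    return n_old / (n_old + n_new)


def update_buffer(buffer, new_samples, capacity=1000, rng=None):
    """Randomly keep old samples, append new ones, and cap the pool size."""
    rng = rng or random.Random()
    keep_prob = retention_probability(len(buffer), len(new_samples))
    kept = [sample for sample in buffer if rng.random() < keep_prob]
    merged = kept + list(new_samples)
    if len(merged) > capacity:
        # Assumed safeguard: uniformly down-sample to the fixed capacity.
        merged = rng.sample(merged, capacity)
    return merged


if __name__ == "__main__":
    pool = [("old", i) for i in range(1000)]
    fresh = [("new", i) for i in range(200)]
    pool = update_buffer(pool, fresh, rng=random.Random(0))
    print(len(pool))
```

Passing a seeded `random.Random` makes the replacement reproducible, which is convenient when comparing model performance before and after an update.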

8. A system for implementing the intelligent monitoring and early warning method for equipment defects in thermal power plants based on a multimodal large model as described in any one of claims 1 to 7, characterized in that it includes:
a multimodal data acquisition module, which collects the operating data of the equipment, i.e., the multimodal data, in real time and transmits it to the multimodal Transformer model;
a multimodal data encoding module, which uses the BEiT-3 model, the Whisper model, the Informer model, and the T5 model to encode the different types of multimodal data and generate a unified multimodal representation;
a multimodal fusion analysis module, which uses the Flamingo model to perform cross-modal fusion and analysis on the unified multimodal representation, producing unified fusion features;
an equipment anomaly detection and fault diagnosis module, which uses contrastive learning and an autoencoder on the unified fusion features to calculate the degree of anomaly of the equipment and thereby determine whether the equipment is abnormal;
an equipment status trend prediction and early warning module, which uses the Informer model to predict the future equipment status from the unified fusion features, estimates the probability of potential failures, and provides operation and maintenance suggestions and early warnings;
an online learning and module optimization module, which optimizes the multimodal Transformer model through incremental learning and knowledge distillation.