An improved transformer model-based industrial process fault detection method

By combining convolutional neural networks and sparse matrix Transformer models, the problem of fault detection in high-dimensional nonlinear industrial process data is solved, enabling in-depth mining of local information and elimination of noise, thereby improving the accuracy of fault detection.

CN116382231B · Active · Publication Date: 2026-06-23 · SHENYANG INSTITUTE OF CHEMICAL TECHNOLOGY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHENYANG INSTITUTE OF CHEMICAL TECHNOLOGY
Filing Date
2023-03-07
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Traditional data-driven methods struggle to effectively extract crucial information from high-dimensional, nonlinear industrial process data for fault detection, and the Transformer model has shortcomings in local information extraction and noise processing.

Method used

By combining convolutional neural networks and sparse matrix Transformer models, local information is extracted through multi-layer convolutional neural networks, positional encoding is added to capture global information, and sparse functions are used to eliminate noise, thereby improving the accuracy of fault detection.

Benefits of technology

It effectively compensates for the insufficient local information extraction capability of the Transformer model, reduces the impact of noise, and significantly improves the accuracy of fault detection in industrial processes.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses an industrial process fault detection method based on an improved Transformer model. The method comprises the following steps: first, collecting data from an industrial production process and standardizing it; second, feeding the data that captures local information into a Transformer model and using the Transformer's ability to model long-distance dependencies to extract global information from the data; third, sparsifying the attention matrix with a sparse function so that low-relevance entries of the attention matrix no longer participate in subsequent calculations, thereby eliminating noise; then, updating the network parameters with the Adam algorithm and saving the model that reaches the expected detection effect; finally, in the fault detection stage, standardizing newly collected data with the mean and variance of the training data and detecting faults in it with the model saved in the training stage. The method detects industrial process faults effectively and monitors the running state of the industrial process well.

Description

Technical Field

[0001] This invention relates to an industrial process fault detection method, and more particularly to an industrial process fault detection method based on an improved Transformer model.

Background Technology

[0002] In recent years, the widespread application of distributed control systems has enabled the collection and storage of vast amounts of production process data, providing a solid data foundation for deep learning-based process monitoring technologies. In large-scale, complex industrial processes, production typically involves multiple production units, each with numerous sensors and controllers. Due to the high dimensionality and nonlinearity of process data in such industrial production, traditional data-driven process monitoring methods struggle to achieve satisfactory monitoring results. How to extract crucial information from hundreds or thousands of dimensions of process data and monitor the process's operational status is a pressing research topic.

[0003] With the development of deep learning, some deep learning-based process monitoring technologies have been used to solve high-dimensional complexity problems in large-scale industrial production processes. Commonly used deep learning neural networks mainly include convolutional neural networks (CNNs) and recurrent neural networks (RNNs). CNNs, with their local connectivity and weight sharing characteristics, have significant advantages in capturing local information and reducing network parameters. However, due to their small kernel size, their receptive field is small, making it difficult to capture global information effectively. Therefore, the Transformer model emerged. It is a fully connected attention mechanism model that captures global dependencies in data by calculating the correlation between any two items. Initially applied to machine translation tasks with good results, this model has led to increased research into its operating mechanisms and principles, and its improvements to adapt it to various research fields. Numerous studies have demonstrated that the Transformer model has not only achieved good results in machine translation but has also achieved state-of-the-art experimental results in image, audio, and video processing.

[0004] While the Transformer model achieves good results in handling long-distance dependencies in time-series data, it still has some shortcomings. For example, the Transformer model ignores local information and has poor ability to extract local information. Secondly, because the Transformer is a model based on a fully connected self-attention mechanism, it calculates the correlation between any two items. Not all correlations between pairs of samples are useful information; this information may be noise, contributing nothing to the calculation results and potentially having a negative impact. When dealing with large-scale, high-dimensional, nonlinear, and other complex data characteristics, traditional data-driven process monitoring methods struggle to achieve good fault detection results. Therefore, there is an urgent need for an effective deep learning-based method to uniformly model and detect faults in the large-scale data collected from modern industrial production processes.

Summary of the Invention

[0005] The purpose of this invention is to provide an industrial process fault detection method based on an improved Transformer model. This invention addresses the time-varying and nonlinear problems of data in industrial processes and the fault detection problem. It utilizes the powerful feature extraction and filtering capabilities of deep learning to extract features from industrial data even when the data is highly nonlinear and time-varying, thereby improving the accuracy of fault detection.

[0006] The technical solution adopted in this invention is:

[0007] An industrial process fault detection method based on an improved Transformer model, wherein the method is based on a convolutional neural network and a sparse matrix Transformer model, and applies the model to industrial process fault detection, including the following steps:

[0008] Step 1: Collect N samples $X \in \mathbb{R}^{N \times m}$, $X = [X_1, X_2, \ldots, X_i, \ldots, X_N]^T$, from the industrial process, where m is the number of variables in the data, N is the total number of samples collected, and $X_i$ is the i-th sample in X, $i \in [1, N]$. X is then subjected to Z-score standardization.

[0009] Step 2: Use a fully connected neural network (FC) to map the m-dimensional X to $d_{model}$ dimensions, giving data in $\mathbb{R}^{N \times d_{model}}$, where $d_{model}$ is the dimension of the hidden layer of the neural network. The data then enters a multi-layer convolutional neural network; by setting a different convolutional kernel size for each layer, local information in the data is mined at depth and at multiple scales.

[0010] Step 3: Feed the data $X_{Conv}$, which captures local information, into the Transformer model. To enable the Transformer model to use the positional information of the sequence, positional encoding (PE) is added; positions are marked using sine and cosine functions of different frequencies.

[0011] Step 4: Use the sparse function $S_\theta$ to sparsify the attention matrix AM: values of the attention matrix whose relevance scores are less than the adaptive parameter θ (θ = mean(AM)) are set to zero so that they are no longer involved in subsequent calculations, thereby eliminating noise.

[0012] Step 5: Update the network parameters using the Adam algorithm and save the model that achieves the expected detection effect;

[0013] Step 6: Fault detection. The process is as follows:

[0014] 1) Collect new data $X_{test} \in \mathbb{R}^{N \times m}$ from the industrial process and standardize it using the mean and the variance $S_{train}$ of the training data to obtain the standardized data $X_{Z\text{-}score}$.

[0015] 2) Use a fully connected neural network to map the standardized data $X_{Z\text{-}score} \in \mathbb{R}^{N \times m}$ to a higher-dimensional space. To keep the dimensions consistent before and after the data enters the convolutional layer, zeros are padded at both ends of the data, $X_{Conv} \leftarrow \mathrm{Padding}(X_{Conv})$. Formula (5) is then used to capture local information and obtain the output of the convolution. Next, batch normalization, the ReLU activation function, and a residual connection are applied to obtain the result of the convolutional layer.

[0016] 3) Positional encoding is used to mark position information in the data $X_{Conv}$ that captured local information. The parameter matrices $W^Q$, $W^K$, $W^V$ then map $X_{Conv}$ to the three matrices Q, K, and V, and the attention matrix $AM \in \mathbb{R}^{N \times N}$ is obtained using formula (10).

[0017] 4) Sparsify the attention matrix AM using formula (11), then concatenate the attention results computed by the multiple heads and map them with the parameter matrix $W^O$. The result is processed sequentially by a residual connection and layer normalization, a feedforward neural network, and another residual connection and layer normalization to obtain the encoder layer result.

[0018] 5) Use a fully connected layer to map the encoder result $X_{trm}$ to $X_{trm} \in \mathbb{R}^{N \times 1}$. Finally, the Sigmoid function converts the values in $X_{trm}$ to values in [0, 1]. If a value is greater than or equal to 0.5, it is set to 1, indicating that the model has detected the sample as a faulty sample; if the value is less than 0.5, it is set to 0, indicating that the model has detected the sample as a normal sample. In this way the model is used to detect faults in new data.

[0019] The aforementioned industrial process fault detection method based on an improved Transformer model, wherein the standardization process in step 1 is shown in formulas (1)-(3):

[0020]

[0021]

[0022]

[0023] Formulas (1) and (2) yield the mean and variance of the training dataset, respectively. Formula (3) uses the mean and the variance $S_{train}$ of the training dataset to standardize X, obtaining $X_{Z\text{-}score} \in \mathbb{R}^{N \times m}$.

[0024] The beneficial effects of this invention are as follows:

[0025] (1) This method utilizes the local connectivity characteristics of convolutional neural networks. By stacking multiple layers of convolutional neural networks and setting convolutional kernels of different sizes in each layer, it extracts local information between adjacent data from multiple perspectives at a deep level, thus making up for the problem of insufficient local information extraction capability of the Transformer model.

[0026] (2) This method utilizes the Transformer model, which is good at capturing long-distance dependencies, to extract global information from industrial process data.

[0027] (3) This method reduces the negative impact of irrelevant information on the calculation results. By introducing a sparse function to zero out the values with low correlation scores in the attention matrix, the values no longer participate in subsequent calculations, thus eliminating noise. This invention greatly improves the accuracy of industrial process fault detection.

Attached Figure Description

[0028] Figure 1 is a schematic diagram of the penicillin fermentation process.

[0029] Figure 2 is a flowchart of the Transformer model based on convolutional neural networks and sparse matrices in this invention.

[0030] Figure 3 is a heatmap of the original data.

[0031] Figure 4 is a heatmap of the original data after passing through the fourth layer of the convolutional neural network.

[0032] Figure 5 is a heatmap of the attention matrix.

[0033] Figure 6 is a heatmap of the attention matrix after processing with the sparse function.

Detailed Implementation

[0034] The specific embodiments of the present invention will be described in further detail below with reference to the accompanying drawings and examples. The following examples are for illustrative purposes only and are not intended to limit the scope of the invention.

[0035] The technical solution adopted by this invention is described in detail below:

[0036] A Transformer model based on convolutional neural networks and sparse matrices is applied to industrial process fault detection, including the following steps:

[0037] Step 1: Collect N samples $X \in \mathbb{R}^{N \times m}$, $X = [X_1, X_2, \ldots, X_i, \ldots, X_N]^T$, from the industrial process, where m is the number of variables in the data, N is the total number of samples collected, and $X_i$ is the i-th sample in X, $i \in [1, N]$. X is subjected to Z-score standardization, and the standardization process is shown in formulas (1)-(3):

[0038]

[0039]

[0040]

[0041] Formulas (1) and (2) yield the mean and variance of the training dataset, respectively. Formula (3) uses the mean and the variance $S_{train}$ of the training dataset to standardize X, obtaining $X_{Z\text{-}score} \in \mathbb{R}^{N \times m}$.
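The standardization in Step 1 can be sketched in a few lines of NumPy. This is a minimal illustration, not the patent's implementation: the function names `fit_zscore` and `apply_zscore` are hypothetical, the random arrays stand in for process data, and the population (1/N) form of the variance is an assumption because the images of formulas (1)-(3) are not available. Z-score standardization divides by the standard deviation, i.e. the square root of the variance.

```python
import numpy as np

def fit_zscore(x_train):
    """Return the per-variable mean and standard deviation of the training data."""
    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0)               # population (1/N) form: an assumption
    std = np.where(std == 0.0, 1.0, std)    # guard against constant variables
    return mean, std

def apply_zscore(x, mean, std):
    """Standardize any data set with statistics taken from the training data."""
    return (x - mean) / std

# Stand-in data with the shapes used in Example 1 (2000 x 17 and 800 x 17).
rng = np.random.default_rng(0)
x_train = rng.normal(loc=5.0, scale=2.0, size=(2000, 17))
x_test = rng.normal(loc=5.0, scale=2.0, size=(800, 17))

mean, std = fit_zscore(x_train)
z_train = apply_zscore(x_train, mean, std)
z_test = apply_zscore(x_test, mean, std)    # test data reuses the training statistics
```

Reusing `mean` and `std` from the training set for the test set mirrors the fault detection stage, where new data is standardized with the training statistics rather than its own.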

[0042] Step 2: Use a fully connected neural network (FC) to map the m-dimensional X to $d_{model}$ dimensions, giving data in $\mathbb{R}^{N \times d_{model}}$, where $d_{model}$ is the dimension of the hidden layer of the neural network. The data then enters a multi-layer convolutional neural network; by setting a different convolutional kernel size for each layer, local information in the data is mined at depth and at multiple scales. A convolution normally changes the length of its input, so the ends of the sequence are padded with zeros before the convolution to keep the input and output dimensions consistent, as shown in formula (4), where Padding(*) denotes padding the ends of the sequence with zeros. The multi-layer convolution operation is given by formula (5), where n is the number of layers in the convolutional neural network and the result is the output of the n-th layer of the convolutional neural network.

[0043] $\mathrm{Padding}(X_{Conv}) \rightarrow X_{Conv}$ (4)

[0044]

[0045] To avoid vanishing gradients as the network deepens and to keep the distribution of intermediate-layer results consistent, this invention adds batch normalization (BN) and the ReLU activation function after the multiple convolution operations. The convolutional neural network, batch normalization, and ReLU activation function together constitute the convolutional layer. In addition, to let the model focus on the part that differs from its input and to mitigate network degradation, a residual connection is applied after the convolutional layer, and the result of the residual connection, $X_{Conv}$, is used as input to the encoder part of the Transformer model to capture global information. This process is shown in formula (6):

[0046]
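The convolutional stage of Step 2 (zero padding, stacked convolutions with different kernel sizes, batch normalization, ReLU, and a residual connection) can be sketched as follows. This is an illustrative NumPy sketch under several assumptions: the kernel sizes `(3, 5, 7, 9)`, the random weights, the stride of 1, and the parameter-free batch normalization are all hypothetical choices, since the text fixes only the number of layers in the example (n = 4) and the order of the operations.

```python
import numpy as np

def pad_same(x, kernel_size):
    """Zero-pad both ends of the sequence axis so a stride-1 convolution keeps length N."""
    left = (kernel_size - 1) // 2
    right = kernel_size - 1 - left
    return np.pad(x, ((left, right), (0, 0)))

def conv1d(x, weight, bias):
    """1-D convolution over the sequence axis. x: (N, d_in), weight: (k, d_in, d_out)."""
    k = weight.shape[0]
    n = x.shape[0]
    x_pad = pad_same(x, k)
    out = np.zeros((n, weight.shape[2]))
    for j in range(k):                      # accumulate one kernel tap at a time
        out += x_pad[j:j + n] @ weight[j]
    return out + bias

def batch_norm(x, eps=1e-5):
    """Normalize each channel over the sequence axis (no learned scale or shift)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def conv_block(x, kernel_sizes, rng):
    """Stacked convolutions, then batch normalization, ReLU, and a residual connection."""
    d = x.shape[1]
    h = x
    for k in kernel_sizes:                  # a different kernel size per layer
        w = rng.normal(scale=1.0 / np.sqrt(k * d), size=(k, d, d))
        h = conv1d(h, w, np.zeros(d))
    h = np.maximum(batch_norm(h), 0.0)      # BN followed by ReLU
    return x + h                            # residual connection

rng = np.random.default_rng(0)
x0 = rng.normal(size=(50, 8))               # 50 samples, 8 hidden dimensions (toy sizes)
x_conv = conv_block(x0, kernel_sizes=(3, 5, 7, 9), rng=rng)
```

Because every layer is padded to keep the sequence length, the output has the same shape as the input, which is what makes the residual addition possible.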

[0047] Step 3: Feed the data $X_{Conv}$, which captures local information, into the Transformer model. To enable the model to use the positional information of the sequence, positional encoding (PE) is added, which marks positions using sine and cosine functions of different frequencies. The positional encoding is shown in formulas (7) and (8), where pos is the position index of a time-series sample in the sequence and i indexes a dimension of the vector.

[0048]

[0049]
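A sinusoidal positional encoding of the kind described in Step 3 can be sketched as below. The images of formulas (7) and (8) are not available, so this sketch assumes the widely used form with sine on even dimensions, cosine on odd dimensions, and a frequency base of 10000; the function name `positional_encoding` and the `base` parameter are illustrative, and `d_model` is assumed to be even.

```python
import numpy as np

def positional_encoding(n_positions, d_model, base=10000.0):
    """Return an (n_positions, d_model) matrix of sine/cosine position codes."""
    pos = np.arange(n_positions)[:, None]        # position index, shape (N, 1)
    two_i = np.arange(0, d_model, 2)[None, :]    # even dimension indices 2i
    angle = pos / np.power(base, two_i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angle)                  # even dimensions use sine
    pe[:, 1::2] = np.cos(angle)                  # odd dimensions use cosine
    return pe

pe = positional_encoding(2000, 256)              # sizes from Example 1
```

The encoding is added element-wise to the convolutional output, so it must have the same shape as that output.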

[0050] After positional encoding, the data keeps its shape, $X_{Conv} \in \mathbb{R}^{N \times d_{model}}$. The data marked with position information enters the multi-head attention mechanism to capture global information. First, $X_{Conv}$ is mapped by the parameter matrices $W^Q$, $W^K$, $W^V$ to the matrices Q, K, and V. These three matrices are split into h subspaces (h is the number of heads in the multi-head attention mechanism), so that each head attends only to $d_{model}/h = d_k$ dimensions of Q, K, and V; the query, key, and value matrices of each head therefore have $d_k$ dimensions. The attention matrix of each head, $AM_i \in \mathbb{R}^{N \times N}$, is obtained through the dot-product self-attention mechanism. This process is described by formulas (9) and (10), where $d_k$ and $d_v$ are the dimensions of the key vector and the value vector.

[0051] $Q = X_{Conv} \cdot W^Q, \quad K = X_{Conv} \cdot W^K, \quad V = X_{Conv} \cdot W^V$ (9)

[0052]
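The per-head attention matrices can be sketched as follows. Formula (10) is not recoverable from the text, so the scaling by $\sqrt{d_k}$ is an assumption borrowed from standard scaled dot-product attention; the function name `attention_matrices` and the random weights are illustrative. The result is the score matrix before Softmax, which is the quantity the sparse function of Step 4 operates on.

```python
import numpy as np

def attention_matrices(x, w_q, w_k, n_heads):
    """Return one (N, N) score matrix per head, stacked as (h, N, N)."""
    n, d_model = x.shape
    d_k = d_model // n_heads
    # Project, then split the d_model columns into h contiguous blocks of d_k.
    q = (x @ w_q).reshape(n, n_heads, d_k).transpose(1, 0, 2)   # (h, N, d_k)
    k = (x @ w_k).reshape(n, n_heads, d_k).transpose(1, 0, 2)   # (h, N, d_k)
    return q @ k.transpose(0, 2, 1) / np.sqrt(d_k)              # (h, N, N)

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 256))                  # 100 samples, d_model = 256
w_q = rng.normal(scale=256 ** -0.5, size=(256, 256))
w_k = rng.normal(scale=256 ** -0.5, size=(256, 256))
am = attention_matrices(x, w_q, w_k, n_heads=4)  # d_k = 256 / 4 = 64
```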

[0053] Step 4: Use the sparse function $S_\theta$ to sparsify the attention matrix AM: values whose relevance scores are less than the adaptive parameter θ (θ = mean(AM)) are set to zero so that they no longer participate in subsequent calculations, thereby eliminating noise. The sparse function works as follows. First, compute the average value θ of the attention matrix AM. Then subtract θ from each value in AM, and set results greater than 0 to 1 and results less than or equal to 0 to 0. Next, in sequence, subtract 1, multiply by ∞, and add AM. These steps retain the values in AM that are greater than θ and set the remaining values to -∞. Finally, after the Softmax function, the values that were set to -∞ become 0, which eliminates the noise. This process is shown in formulas (11) and (12), where INF denotes ∞; sign(x) denotes the sign function, with sign(x) = 1 when x > 0 and sign(x) = 0 when x ≤ 0; σ denotes the Softmax function, whose specific form is formula (12); $S_\theta$ denotes the sparse function; and $AM_{ij}$ denotes the correlation score between the i-th sample and the j-th sample.

[0054] $S_\theta(AM) = AM + (\mathrm{sign}(AM - \theta) - 1) \cdot \mathrm{INF}$ (11)

[0055]

[0056]
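The sparse function and the Softmax that follows it can be sketched in NumPy as below. The sketch uses a boolean mask rather than evaluating formula (11) literally, because in IEEE floating point the retained entries would compute 0 · ∞, which is NaN; the mask yields the intended result (retained values unchanged, the rest at -∞). The function name `sparse_softmax` and the handling of rows with no retained entry (returned as all zeros) are assumptions, since the text does not discuss that case.

```python
import numpy as np

def sparse_softmax(am):
    """Mask entries not greater than mean(AM), then apply a row-wise Softmax."""
    theta = am.mean()                              # adaptive threshold, theta = mean(AM)
    keep = am > theta                              # sign(AM - theta) == 1
    row_max = np.where(keep, am, -np.inf).max(axis=-1, keepdims=True)
    row_max = np.where(np.isfinite(row_max), row_max, 0.0)   # rows with nothing kept
    shifted = np.where(keep, am - row_max, -np.inf)          # masked entries at -inf
    weights = np.exp(shifted)                      # exp(-inf) == 0
    denom = weights.sum(axis=-1, keepdims=True)
    denom = np.where(denom == 0.0, 1.0, denom)     # avoid 0/0 for empty rows
    return weights / denom

am = np.array([[1.0, 3.0, 4.0],
               [0.0, 5.0, 1.0],
               [4.0, 0.0, 0.0]])                   # mean is 2.0
out = sparse_softmax(am)
```

In this toy matrix the first row keeps only the scores 3 and 4, so its Softmax is taken over those two entries and the masked entry receives weight 0.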

[0057] After processing with the sparse function, invalid information is removed while important information is retained. Multiplying the result by the value matrix V then yields the output of the dot-product attention mechanism, where $d_v$ is the dimension of the value vector and satisfies $d_v = d_k$. The Concat function then concatenates the results of all heads, and the parameter matrix $W^O$ maps the concatenation to the output of the multi-head attention mechanism, as shown in formulas (14)-(16), where $Q_i$, $K_i$, $V_i$ are the dimensions that the i-th head attends to, $i \in [1, h]$, and $head_i$ is the output of the i-th head.

[0058]

[0059]

[0060] $H = \mathrm{Concat}(head_1, head_2, \ldots, head_i, \ldots, head_h) \cdot W^O$ (16)

[0061] After the multi-head attention mechanism, the Transformer model applies a residual connection and layer normalization to address vanishing gradients and weight matrix degradation. The data then enters the feedforward neural network (FFN) layer, which first raises the dimension of the data from $d_{model}$ to $D_{ff}$, where $D_{ff}$ is the dimension of the FFN hidden layer, uses the ReLU function to select information beneficial for classification, and then reduces the dimension back to $d_{model}$. Finally, another residual connection and layer normalization are applied. This process is shown in formulas (17)-(19), where $W_1$, $W_2$, $b_1$, and $b_2$ are the weights and biases of the two linear transformations of the feedforward neural network, and $X_{trm}$ is the final output of the encoder.

[0062] $H' = \mathrm{LayerNorm}(X_{Conv} + H)$ (17)

[0063] $\mathrm{FFN}(H') = \max(0, H' W_1 + b_1) W_2 + b_2$ (18)

[0064] $X_{trm} = \mathrm{LayerNorm}(H' + \mathrm{FFN}(H'))$ (19)
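Formulas (17)-(19) translate almost directly into code. The following NumPy sketch is illustrative only: `layer_norm` omits the learnable gain and bias that a trained layer normalization would normally carry (an assumption), the function name `encoder_tail` is hypothetical, and the toy sizes and random weights stand in for learned parameters.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row over the feature axis (no learned gain or bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def ffn(h, w1, b1, w2, b2):
    """Formula (18): expand to D_ff, apply ReLU, project back to d_model."""
    return np.maximum(0.0, h @ w1 + b1) @ w2 + b2

def encoder_tail(x_conv, h, w1, b1, w2, b2):
    """Formulas (17) and (19): two residual connections with layer normalization."""
    h_prime = layer_norm(x_conv + h)                            # (17)
    return layer_norm(h_prime + ffn(h_prime, w1, b1, w2, b2))   # (19)

rng = np.random.default_rng(0)
n, d_model, d_ff = 10, 16, 64                   # toy sizes, not the patent's values
x_conv = rng.normal(size=(n, d_model))
h = rng.normal(size=(n, d_model))               # stands in for the attention output H
w1 = rng.normal(scale=0.1, size=(d_model, d_ff))
w2 = rng.normal(scale=0.1, size=(d_ff, d_model))
x_trm = encoder_tail(x_conv, h, w1, np.zeros(d_ff), w2, np.zeros(d_model))
```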

[0065] Use a fully connected layer to map $X_{trm}$ to $X_{trm} \in \mathbb{R}^{N \times 1}$, then use the Sigmoid function to convert the values in $X_{trm}$ to values in [0, 1], as shown in formula (20). If a value is greater than or equal to 0.5 after the Sigmoid function, the final detection result of the model is 1, indicating that the model detected the sample as a faulty sample. If the value is less than 0.5, the final detection result of the model is 0, indicating that the model detected the sample as a normal sample.

[0066]

[0067]
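The decision rule described above is a Sigmoid followed by a 0.5 threshold. A minimal sketch, assuming the fully connected layer has already produced one score per sample (the name `detect` and the example scores are hypothetical):

```python
import numpy as np

def detect(scores, threshold=0.5):
    """Map raw scores to [0, 1] with the Sigmoid and threshold them at 0.5."""
    prob = 1.0 / (1.0 + np.exp(-scores))
    label = (prob >= threshold).astype(int)     # 1 = faulty sample, 0 = normal sample
    return prob, label

prob, label = detect(np.array([-2.0, 0.0, 3.0]))
```

A score of exactly 0 maps to a probability of 0.5 and is therefore classified as faulty, matching the "greater than or equal to 0.5" rule.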

[0068] Step 5: Update the network parameters using the Adam algorithm and save the model that achieves the expected detection results.

[0069] Step 6: Fault detection phase. The process is as follows:

[0070]-[0071] 1) Collect new data $X_{test} \in \mathbb{R}^{N \times m}$ from the industrial process and standardize it using the mean and the variance $S_{train}$ of the training data.

[0072] 2) Use a fully connected neural network to map the standardized data $X_{Z\text{-}score} \in \mathbb{R}^{N \times m}$ to a higher-dimensional space. To keep the dimensions consistent before and after the data enters the convolutional layer, zeros are padded at both ends of the data, $X_{Conv} \leftarrow \mathrm{Padding}(X_{Conv})$. Formula (5) is then used to capture local information and obtain the output of the convolution. Next, batch normalization, the ReLU activation function, and a residual connection are applied to obtain the result of the convolutional layer.

[0073] 3) Positional encoding is used to mark position information in the data $X_{Conv}$ that captured local information. The parameter matrices $W^Q$, $W^K$, $W^V$ then map $X_{Conv}$ to the three matrices Q, K, and V, and the attention matrix $AM \in \mathbb{R}^{N \times N}$ is obtained using formula (10).

[0074] 4) Sparsify the attention matrix AM using formula (11), then concatenate the attention results computed by the multiple heads and map them with the parameter matrix $W^O$. The result is processed sequentially by a residual connection and layer normalization, a feedforward neural network, and another residual connection and layer normalization to obtain the encoder layer result.

[0075] 5) Use a fully connected layer to map the encoder result $X_{trm}$ to $X_{trm} \in \mathbb{R}^{N \times 1}$. Finally, the Sigmoid function converts the values in $X_{trm}$ to values in [0, 1]. If a value is greater than or equal to 0.5, it is set to 1, indicating that the model has detected the sample as a faulty sample; if the value is less than 0.5, it is set to 0, indicating that the model has detected the sample as a normal sample.

[0076] This model is used to detect faults in new data.

[0077] Example 1

[0078] This invention takes the penicillin fermentation process as an example. A schematic diagram of the penicillin fermentation production process is shown in Figure 1, and the industrial process fault detection method based on the improved Transformer model is shown in Figure 2.

[0079] Specific experimental steps:

[0080] Step 1: Collect sample data from the industrial process. The data used in this simulation experiment was generated by Pensim V2.0. The penicillin fermentation process has a total of 17 variables: aeration rate, agitator power, substrate feed rate, substrate feed temperature, substrate concentration, DO concentration, biomass concentration, penicillin concentration, culture medium volume, carbon dioxide concentration, pH, temperature, heat of reaction, acid flow rate, alkali flow rate, cold water flow rate, and hot water flow rate. The initial conditions, setpoint, and temperature controller type were set to default for this simulation. The variable causing the fault was the aeration rate, and the fault type was step perturbation. A 6% step perturbation was added to the training dataset at hour 100, and the simulation platform ran for 400 hours. A 6% step perturbation was added to the test dataset at hour 85, and the platform ran for 160 hours. The sampling interval was 0.2 hours. That is, the collected training dataset $X_{train}$ contains 2000 samples and the test dataset $X_{test}$ contains 800 samples. The data $X_{train} \in \mathbb{R}^{2000 \times 17}$ is then standardized.
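The sample counts follow from the run time divided by the sampling interval (400 h / 0.2 h = 2000 and 160 h / 0.2 h = 800). The short sketch below reproduces that arithmetic and builds a label vector; the labeling rule (every sample from the fault onset onward is marked faulty) and the indexing convention are assumptions, since the text does not state how labels were assigned, and the function names are hypothetical.

```python
def n_samples(run_hours, interval_hours):
    """Number of samples collected over a run at a fixed sampling interval."""
    return round(run_hours / interval_hours)

def fault_labels(run_hours, fault_start_hour, interval_hours):
    """0 before the fault onset, 1 from the onset onward (assumed labeling rule)."""
    n = n_samples(run_hours, interval_hours)
    onset = round(fault_start_hour / interval_hours)
    return [0] * onset + [1] * (n - onset)

y_train = fault_labels(run_hours=400, fault_start_hour=100, interval_hours=0.2)
y_test = fault_labels(run_hours=160, fault_start_hour=85, interval_hours=0.2)
```

Under this assumed rule the training set would hold 500 normal and 1500 faulty samples, and the test set 425 normal and 375 faulty samples.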

[0081]

[0082]

[0083]

[0084] Step 2: Use a fully connected neural network to map the standardized data $X_{Z\text{-}score} \in \mathbb{R}^{2000 \times 17}$ to a higher-dimensional space, giving $X_{Z\text{-}score} \in \mathbb{R}^{2000 \times 256}$. The data then enters a multi-layer convolutional neural network to mine local information at depth and at multiple scales. Zeros are padded at both ends of the sequence before the convolution operation to keep the input and output dimensions consistent, as shown in formula (25). The multi-layer convolution operation is given by formula (26), whose result is the output of the n-th layer of the convolutional neural network. In this example, n = 4.

[0085] $\mathrm{Padding}(X_{Conv}) \rightarrow X_{Conv}$ (25)

[0086]

[0087] To avoid vanishing gradients as the network deepens and to keep the distribution of intermediate-layer results consistent, this invention adds batch normalization and the ReLU activation function after the multiple convolution operations. In addition, to let the model focus on the part that differs from its input and to mitigate network degradation, a residual connection is applied after the convolutional layer, and the result of the residual connection, $X_{Conv} \in \mathbb{R}^{2000 \times 256}$, is used as input to the encoder part of the Transformer model to capture global information. This process is shown in formula (27):

[0088]

[0089] Step 3: Feed the data $X_{Conv}$, which captures local information, into the Transformer model. To enable the model to use the positional information of the sequence, positional encoding is added. The positional encoding is shown in formulas (28) and (29), where pos is the position index of a time-series sample in the sequence and i indexes a dimension of the vector.

[0090]

[0091]

[0092] After positional encoding, the data keeps its shape, $X_{Conv} \in \mathbb{R}^{2000 \times 256}$. The data marked with position information enters the multi-head attention mechanism to capture global information. First, $X_{Conv}$ is mapped by the parameter matrices $W^Q \in \mathbb{R}^{256 \times 256}$, $W^K \in \mathbb{R}^{256 \times 256}$, $W^V \in \mathbb{R}^{256 \times 256}$ to $Q \in \mathbb{R}^{2000 \times 256}$, $K \in \mathbb{R}^{2000 \times 256}$, $V \in \mathbb{R}^{2000 \times 256}$. These three matrices are split into h subspaces by the multi-head attention mechanism, so that each head attends only to $d_{model}/h = d_k$ dimensions of Q, K, and V; the query, key, and value matrices of each head therefore have 64 dimensions. The attention matrix of each head, $AM_i \in \mathbb{R}^{N \times N}$, $i \in [1, h]$, is obtained through the dot-product self-attention mechanism. This process is described by formulas (30) and (31), where $d_k$ and $d_v$ are the dimensions of the key vector and the value vector. In this embodiment, $d_{model} = 256$, $d_k = d_v = 64$, and h = 4.

[0093] $Q = X_{Conv} \cdot W^Q, \quad K = X_{Conv} \cdot W^K, \quad V = X_{Conv} \cdot W^V$ (30)

[0094]

[0095] Step 4: Use the sparse function $S_\theta$ to sparsify the attention matrix AM so that values whose correlation scores are less than the adaptive parameter θ end up as zero: after the Softmax function, these smaller values become 0, thereby eliminating noise. This process is shown in formulas (32) and (33).

[0096] $S_\theta(AM) = AM + (\mathrm{sign}(AM - \theta) - 1) \cdot \mathrm{INF}$ (32)

[0097]

[0098]

[0099] After processing with the sparse function, invalid information is removed while important information is retained. Multiplying the result by the value matrix V then yields the output of the dot-product attention mechanism, $head_i \in \mathbb{R}^{2000 \times 64}$. The Concat function then concatenates the results of all heads, and $W^O \in \mathbb{R}^{256 \times 256}$ maps the concatenation to the multi-head attention output $H \in \mathbb{R}^{2000 \times 256}$, as shown in formulas (35)-(37), where $Q_i$, $K_i$, $V_i$ are the dimensions that the i-th head attends to, $i \in [1, 4]$, and $head_i$ is the output of the i-th head.

[0100]

[0101]

[0102] $H = \mathrm{Concat}(head_1, head_2, \ldots, head_i, \ldots, head_h) \cdot W^O$ (37)

[0103] After the multi-head attention mechanism, the Transformer model applies a residual connection and layer normalization to address vanishing gradients and weight matrix degradation. The data then enters the feedforward neural network layer, which first raises the dimension of the data from $d_{model}$ to $D_{ff}$ (in this embodiment, $D_{ff} = 1024$), uses the ReLU function to select information beneficial for classification, and then reduces the dimension back to $d_{model}$. Finally, another residual connection and layer normalization are applied. This process is shown in formulas (38)-(40), where $W_1$, $W_2$, $b_1$, and $b_2$ are the weights and biases of the two linear transformations of the feedforward neural network, and $X_{trm} \in \mathbb{R}^{2000 \times 256}$ is the final output of the encoder.

[0104] $H' = \mathrm{LayerNorm}(X_{Conv} + H)$ (38)

[0105] $\mathrm{FFN}(H') = \max(0, H' W_1 + b_1) W_2 + b_2$ (39)

[0106] $X_{trm} = \mathrm{LayerNorm}(H' + \mathrm{FFN}(H'))$ (40)

[0107] Use a fully connected layer to map $X_{trm} \in \mathbb{R}^{2000 \times 256}$ to $X_{trm} \in \mathbb{R}^{2000 \times 1}$, then use the Sigmoid function to convert the values in $X_{trm}$ to values in [0, 1], as shown in formula (41). If a value is greater than or equal to 0.5 after the Sigmoid function, the final detection result of the model is 1, indicating that the model detected the sample as a faulty sample; if the value is less than 0.5, the final detection result of the model is 0, indicating that the model detected the sample as a normal sample.

[0108]

[0109]

[0110] Step 5: Update the network parameters using the Adam algorithm and save the model that achieves the expected detection results.

[0111] Step 6: Fault detection phase. The process is as follows:

[0112] 1) Collect new data $X_{test} \in \mathbb{R}^{800 \times 17}$ from the industrial process and standardize it using the mean and the variance $S_{train}$ of the training data.

[0113] 2) Use a fully connected neural network to map the standardized data $X_{Z\text{-}score} \in \mathbb{R}^{800 \times 17}$ to the higher-dimensional space $X_{Z\text{-}score} \in \mathbb{R}^{800 \times 256}$. To keep the dimensions consistent before and after the data enters the convolutional layer, zeros are padded at both ends of the data, $X_{Conv} \leftarrow \mathrm{Padding}(X_{Conv})$. The output of the convolution is obtained by capturing local information using formula (27). Next, batch normalization, the ReLU activation function, and a residual connection are applied to obtain the final output $X_{Conv}$ of the convolutional layer.

[0114] 3) Positional encoding is used to mark position information in the data $X_{Conv}$ that captured local information. The parameter matrices $W^Q$, $W^K$, $W^V$ then map $X_{Conv}$ to the three matrices Q, K, and V, and the attention matrix $AM \in \mathbb{R}^{800 \times 800}$ is obtained using formula (10).

[0115] 4) Sparsify the attention matrix AM using formula (32), then concatenate the attention results computed by the multiple heads and map them with the parameter matrix $W^O$ to obtain $H \in \mathbb{R}^{800 \times 256}$. The result is processed sequentially by a residual connection and layer normalization, a feedforward neural network, and another residual connection and layer normalization to obtain the encoder layer result $X_{trm} \in \mathbb{R}^{800 \times 256}$.

[0116] 5) Use a fully connected layer to map the encoder result $X_{trm}$ to $X_{trm} \in \mathbb{R}^{800 \times 1}$. Finally, the Sigmoid function converts the values in $X_{trm}$ to values in [0, 1]. If a value is greater than or equal to 0.5, it is set to 1, indicating that the model has detected the sample as a faulty sample; if the value is less than 0.5, it is set to 0, indicating that the model has detected the sample as a normal sample.

[0117] Analysis of simulation experiment results:

[0118] In this embodiment, the fault detection accuracy rate is 98.75%. The results show that the process monitoring method of this invention demonstrates good fault detection performance.
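The source does not state how this figure was computed. Assuming accuracy is the fraction of correctly classified samples and that it was measured on the 800 test samples, 98.75% would correspond to 790 correct detections out of 800. The labels below are hypothetical and only reproduce that arithmetic.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical labels: 800 samples, 10 of them misclassified.
y_true = [1] * 400 + [0] * 400
y_pred = [1] * 395 + [0] * 5 + [0] * 395 + [1] * 5
print(f"{accuracy(y_true, y_pred):.2%}")  # 98.75%
```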

[0119] Next, the fault detection performance of this invention will be analyzed. This invention uses heatmaps to visualize and analyze the operations of convolutional layers and sparse functions. Heatmaps visualize data using special highlighting, with the intensity of the color representing the magnitude of the value; darker colors indicate larger values.

[0120] In Figures 3 and 4, the horizontal axis represents the samples and the vertical axis represents the features. Reading along the horizontal axis shows the information that each sample carries in each feature dimension. Figure 3 is the heatmap of the data $X_{test}$ before it enters the convolutional layers. The image is mainly light-colored, indicating that the values are small and carry very little local or global information. Figure 4 is the heatmap of the data after it passes through four layers of a one-dimensional convolutional neural network. The colors become darker and the dark areas cover a larger region, indicating that the values increase after the convolutional layers and that the relationships along the horizontal axis become more pronounced. In other words, the convolutional layers extract local information between adjacent samples. Furthermore, increasing the number of convolutional layers expands the receptive field of the network, so that each output draws on a larger area and the extracted feature information becomes more comprehensive, which helps improve the diagnostic rate of the model.
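The growth of the receptive field with depth can be made concrete with a short calculation. For stacked 1-D convolutions with stride 1 and no dilation (an assumption; the source does not give these settings), each layer with kernel size k widens the receptive field by k - 1 samples. The kernel sizes below are hypothetical, chosen only because the method sets a different size for each of the four layers.

```python
def receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1, undilated 1-D convolutions."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1          # each layer sees k - 1 additional neighbouring samples
    return rf

kernel_sizes = [3, 5, 7, 9]  # hypothetical sizes for the four layers
for depth in range(1, len(kernel_sizes) + 1):
    print(depth, receptive_field(kernel_sizes[:depth]))
# 1 3
# 2 7
# 3 13
# 4 21
```

Under these assumptions, one output of the fourth layer depends on 21 adjacent samples, compared with 3 after the first layer.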

[0121] Figure 5 and Figure 6 show, respectively, the heatmap of the attention matrix for the first 100 samples in the test set and the heatmap of the same attention matrix after processing with the sparse function. The attention matrix uses numerical values to represent the correlation between samples: the higher the correlation, the larger the value. In the heatmaps, darker colors indicate higher correlation; this is important information for the model and should be retained. Lighter colors indicate lower correlation, which may contribute nothing to the calculation results and could even degrade them. This information should be removed by setting the values in the lighter-colored areas to zero so that they do not participate in subsequent calculations. Comparing the two figures, the dark areas are the same, indicating that the sparse matrix retains the important information. The difference is that the light-colored areas in Figure 6 have become even lighter, indicating that the sparse function has zeroed out this part of the information so that it no longer participates in subsequent calculations, thereby removing noise and improving the detection performance of the model.
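The before-and-after comparison described here can be checked on a small example. The 4×4 matrix and the fixed threshold `tau` are made up for illustration; in the method the threshold is an adaptive parameter whose definition (formula (32)) is not reproduced in this section.

```python
import numpy as np

def sparsify(am, tau):
    """Zero every attention score below tau; leave the others unchanged."""
    return np.where(am >= tau, am, 0.0)

# Hypothetical attention matrix: each row holds one sample's scores.
am = np.array([
    [0.70, 0.20, 0.05, 0.05],
    [0.10, 0.60, 0.25, 0.05],
    [0.02, 0.08, 0.80, 0.10],
    [0.30, 0.05, 0.05, 0.60],
])
tau = 0.15
sparse = sparsify(am, tau)
kept = am >= tau

print(np.array_equal(sparse[kept], am[kept]))  # True: large ("dark") scores are unchanged
print(int((sparse == 0).sum()))                # 9: small ("light") scores are now zero
```

The retained entries are identical in both matrices, which matches the observation that the dark areas of the two heatmaps are the same, while the 9 entries below the threshold drop out of later calculations.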

[0122] In summary, the simulation results show that the process monitoring method of this invention effectively improves fault detection performance, which verifies its effectiveness and feasibility.

[0123] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope defined by the claims of the present invention.

Claims

1. A method for detecting industrial process faults based on an improved Transformer model, characterized in that the method is based on a Transformer model that uses convolutional neural networks and sparse matrices, applies this model to industrial process fault detection, and includes the following steps:
Step 1: Collect N samples of data from the industrial process, where m is the number of variables in the data and N is the total number of samples collected, and perform Z-score standardization on the data;
Step 2: Use a fully connected neural network (FC) to map the standardized data from m dimensions to the dimension of the hidden layer of the neural network; then feed the result into a multi-layer convolutional neural network in which a different convolution kernel size is set for each layer, so that local information in the data is mined in depth and at multiple scales;
Step 3: To enable the Transformer model to use the positional information of the sequence, add positional encoding (PE) to the data that has captured local information, marking positional information with sine and cosine functions of different frequencies;
Step 4: Use a sparse function to sparsify the attention matrix AM by setting to zero the elements of the attention matrix whose relevance scores are less than the adaptive parameter, so that noise does not participate in subsequent calculations and is thereby eliminated;
Step 5: Update the network parameters using the Adam algorithm and save the model that achieves the expected detection effect;
Step 6: Fault detection, the process being as follows:
1) Collect new data $X_{test}$ from the industrial process and standardize it using the mean and variance of the training data to obtain $X_{Z\text{-}score}$;
2) Map the standardized data $X_{Z\text{-}score}$ to a higher-dimensional space using a fully connected neural network; to keep the dimensions consistent before and after the data enters the convolutional layer, pad zeros at both ends of the data; capture local information with the convolution operation to obtain the output of the convolutional layer; then apply batch normalization, the ReLU activation function, and a residual connection to obtain the result $X_{Conv}$ of the convolutional layer, where n is the number of layers in the convolutional neural network and the result is the output of the n-th layer;
3) Mark the data $X_{Conv}$, which has captured local information, with positional information using positional encoding; then use the parameter matrices $W_Q$, $W_K$, and $W_V$ to map $X_{Conv}$ to the three matrices Q, K, and V, and derive the attention matrix AM;
4) Sparsify the attention matrix AM using the sparse function, then concatenate the attention matrices computed by the multiple heads and map them with the parameter matrix $W_O$; process the result sequentially with a residual connection, layer normalization, a feedforward neural network, a residual connection, and layer normalization to obtain the encoder layer result $X_{trm}$;
5) Map the encoder result $X_{trm}$ using a fully connected layer; finally, use the Sigmoid function to convert each value to a value in [0,1]; if the value is greater than or equal to 0.5, set it to 1, indicating that the model has detected the sample as a faulty sample; if the value is less than 0.5, set it to 0, indicating that the model has detected the sample as a normal sample; that is, the model is used to detect faults in new data.

2. The industrial process fault detection method based on an improved Transformer model according to claim 1, characterized in that the standardization process in Step 1 is given by formulas (1)-(3): formulas (1) and (2) give the mean and variance of the training data set, respectively, and formula (3) uses this mean and variance to standardize the data and obtain the standardized data.