Industrial bearing vibration time-series signal fault prediction method and system fusing attention mechanism and LSTM

By integrating an attention mechanism with an LSTM deep learning model, the invention addresses the challenges of feature extraction and quantitative fault assessment in industrial bearing fault diagnosis. It identifies the fault type and assesses its severity simultaneously, improves the model's generalization ability and reliability, and provides an intelligent diagnostic and early warning system.

CN121144702B · Active · Publication Date: 2026-06-23 · ZHONGXIN HANCHUANG BEIJING TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
ZHONGXIN HANCHUANG BEIJING TECH CO LTD
Filing Date
2025-09-24
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing technologies struggle to automatically extract deep temporal features in industrial bearing fault diagnosis, fail to focus accurately on key fault information, and struggle to identify the fault type and quantify its severity at the same time. Furthermore, model training must balance the classification and regression tasks in multi-task learning.

Method used

A deep learning model integrating an attention mechanism and LSTM is adopted. A bidirectional LSTM layer extracts bidirectional temporal features from the vibration time-series data, and a coordinate attention mechanism layer weights the hidden states. The model is trained with the Adam optimizer and a cross-entropy error function, and an early stopping mechanism is introduced to prevent overfitting. The model outputs fault type identification results and fault severity assessment values.

Benefits of technology

It improves the ability to extract and focus fault features, achieves simultaneous and accurate diagnosis of fault type and severity, enhances the generalization and reliability of the model, constructs an intelligent and automated diagnostic and early warning closed loop, and improves the efficiency and intelligence level of equipment health management.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN121144702B_ABST
Patent Text Reader

Abstract

The application discloses an industrial bearing vibration time-series signal fault prediction method and system fusing an attention mechanism and LSTM. The method includes: collecting bearing vibration signals and preprocessing them by filtering, noise reduction, and normalization; constructing a deep learning model that combines BiLSTM with a coordinate attention mechanism to extract bidirectional temporal features and enhance key fault features; training the model with a multi-objective composite loss function and the Adam optimizer, with an early stopping mechanism to prevent overfitting; applying the trained model to real-time vibration signals to identify the fault type and assess the fault severity, with quantitative analysis based on multi-scale spectral kurtosis features and nonlinear dynamics parameters; and finally outputting a fault diagnosis report and triggering multi-level early warnings based on an adaptive threshold. The application can achieve high-precision, high-reliability bearing fault prediction and health state evaluation, and is suitable for intelligent operation and maintenance of industrial equipment.

Description

Technical Field

[0001] This invention relates to the field of industrial equipment condition monitoring and fault prediction technology, and in particular to an intelligent fault prediction method and system for industrial bearing vibration signals based on deep learning.

Background Technology

[0002] Industrial bearings are a core component of rotating machinery, and their operating status directly affects the safety and reliability of the entire equipment and even the production line. Therefore, real-time and accurate prognostics and health management (PHM) of bearings has significant economic and safety value. Vibration signal analysis is the most commonly used and effective method for bearing condition monitoring because the occurrence and development of a fault directly changes the bearing's vibration characteristics, producing specific impact responses.

[0003] Traditional bearing fault diagnosis methods heavily rely on signal processing techniques and manual feature extraction. For example, they extract time-domain and frequency-domain features of signals using Fast Fourier Transform (FFT), wavelet transform, and envelope spectrum analysis, then combine these with expert experience or machine learning classifiers (such as Support Vector Machines, SVM) for fault identification. However, these methods have significant limitations: First, manual feature extraction heavily relies on domain expertise, and the effectiveness and universality of the features are difficult to guarantee; second, traditional methods struggle to capture the complex nonlinear and non-stationary characteristics of vibration signals, as well as long-range time-series dependencies, and are insufficiently sensitive to early, subtle faults and complex faults.

[0004] In recent years, deep learning technology, especially Long Short-Term Memory (LSTM) networks, has shown great potential in the field of fault prediction due to its powerful temporal modeling capabilities. LSTM can automatically learn the temporal patterns and dependencies in vibration signals. However, standard LSTM models may suffer from diluted key information or difficulty in focusing on fault-sensitive periods when processing long sequences. The introduction of attention mechanisms can effectively alleviate this problem. By assigning different weights to different parts of the sequence, it enables the model to focus on the features most relevant to the fault. However, existing methods mostly employ conventional attention mechanisms, and their ability to coordinate attention to the spatiotemporal features of vibration signals still has room for improvement.

[0005] Furthermore, most existing intelligent diagnostic models focus only on fault type classification and neglect the equally important quantitative assessment of fault severity, so they cannot meet the precision that predictive maintenance requires for judging fault evolution trends. At the same time, the model training process faces challenges such as balancing classification and regression in multi-task learning, overfitting, and ensuring consistency between prediction results and physical mechanisms.

[0006] In summary, developing an intelligent diagnostic method that can automatically extract deep temporal features, accurately focus on key fault information, and simultaneously identify fault types and quantify severity assessments has become a critical issue that the industry urgently needs to address.

Summary of the Invention

[0007] The purpose of this invention is to overcome the shortcomings of the prior art and provide a bearing fault prediction method and system that can automatically extract key features, simultaneously identify fault types and quantify severity assessments, and has high reliability.

[0008] In a first aspect, embodiments of this application provide a method for predicting faults in industrial bearing vibration timing signals by integrating an attention mechanism and LSTM, the method comprising:

[0009] S1. Use vibration sensors to collect vibration time-domain signals during the operation of industrial bearings. Then, filter, reduce noise, and normalize the collected signals to obtain standardized vibration time-series data.

[0010] S2. Construct a deep learning model that includes a bidirectional LSTM layer and a coordinate attention mechanism layer. The bidirectional LSTM layer is used to extract bidirectional temporal features from vibration time series data, and the coordinate attention mechanism layer is used to perform time and feature dimension weighting on the hidden state output by the bidirectional LSTM layer to enhance the attention to key fault features.

[0011] S3. The deep learning model is trained using the labeled bearing vibration dataset. The parameters are optimized using the Adam optimizer, the loss function is the cross-entropy error function, and an early stopping mechanism is introduced to prevent overfitting.

[0012] S4. Input the real-time collected vibration data into the trained deep learning model, and output the fault type identification result and fault severity assessment value simultaneously. The fault severity assessment is based on the fault impact characteristics learned by the model and is quantitatively calculated.

[0013] S5. Generate a diagnostic report that includes both the fault type and the fault severity. When the fault severity assessment value exceeds the adaptive threshold set based on historical data, a multi-level early warning mechanism is triggered.
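The early stopping mechanism named in step S3 can be illustrated with a minimal sketch. The patent text does not state the stopping criterion, so the `patience` and `min_delta` parameters and the class name `EarlyStopping` are assumptions chosen for illustration: training stops once the validation loss has failed to improve for `patience` consecutive epochs.

```python
class EarlyStopping:
    """Stop training when validation loss has not improved for `patience` epochs.

    `patience` and `min_delta` are illustrative assumptions, not values from the patent.
    """

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss   # improvement: remember it and reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1        # no improvement this epoch
        return self.bad_epochs >= self.patience


def run_with_early_stopping(val_losses, patience=3):
    """Return (1-based epoch at which training stops, best validation loss seen)."""
    stopper = EarlyStopping(patience=patience)
    for epoch, loss in enumerate(val_losses, start=1):
        if stopper.step(loss):
            return epoch, stopper.best_loss
    return len(val_losses), stopper.best_loss
```

In a real training loop, `step` would be called once per epoch with the loss measured on a held-out validation set, and the model weights from the best epoch would typically be restored after stopping.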

[0014] In a second aspect, embodiments of this application provide a fault prediction system for industrial bearing vibration timing signals that integrates an attention mechanism and LSTM, and that applies the fault prediction method described in the first aspect. The system includes:

[0015] The signal acquisition and preprocessing module is used to acquire the vibration time-domain signal of the industrial bearing during operation using vibration sensors, and to preprocess the acquired signal, including filtering and noise reduction and normalization, to obtain standardized vibration time series data.

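[0016] The model building module is used to construct a deep learning model that includes a bidirectional LSTM layer and a coordinate attention mechanism layer. The bidirectional LSTM layer is used to extract bidirectional temporal features from the vibration time-series data, and the coordinate attention mechanism layer is used to weight the hidden states output by the bidirectional LSTM layer along the time and feature dimensions to enhance the attention to key fault features.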

[0017] The model training module is used to train the deep learning model using a labeled bearing vibration dataset. The Adam optimizer is used for parameter optimization, the cross-entropy error function is used as the loss function, and an early stopping mechanism is introduced to prevent overfitting.

[0018] The real-time fault diagnosis module is used to input the real-time collected vibration data into the trained deep learning model and simultaneously output the fault type identification result and the fault severity assessment value. The fault severity assessment is based on the fault impact characteristics learned by the model for quantitative calculation.

[0019] The diagnostic report and early warning module is used to generate a diagnostic report that includes both the fault type and the fault severity. When the fault severity assessment value exceeds the adaptive threshold set based on historical data, a multi-level early warning mechanism is triggered.
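The adaptive threshold and multi-level warning behaviour of the diagnostic report and early warning module can be sketched as follows. This section of the patent does not say how the threshold is derived from historical data, so the rule used here (historical mean plus a multiple of the standard deviation, with one multiplier per warning level) is purely an assumption, as are the function names `adaptive_thresholds` and `warning_level`.

```python
import statistics


def adaptive_thresholds(history, multipliers=(1.0, 2.0, 3.0)):
    """Derive one threshold per warning level from historical severity values.

    Assumed rule: mean + k * standard deviation, with ascending multipliers k.
    """
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)
    return [mean + k * std for k in multipliers]


def warning_level(severity, thresholds):
    """Return 0 (normal) up to len(thresholds) (most severe warning).

    Assumes `thresholds` is sorted in ascending order.
    """
    return sum(1 for t in thresholds if severity > t)
```

Because the thresholds are recomputed from the history, they shift as more severity values are recorded, which is one simple way to make the warning levels adapt to each machine's normal operating range.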

[0020] In a third aspect, embodiments of this application provide an electronic device, including:

[0021] A processor;

[0022] A memory used to store processor-executable instructions;

[0023] wherein the processor is configured to implement the industrial bearing vibration timing signal fault prediction method described in the first aspect when executing the instructions.

[0024] In a fourth aspect, embodiments of this application provide a computer-readable storage medium storing a program that instructs a device to execute the industrial bearing vibration timing signal fault prediction method that combines an attention mechanism and LSTM as described in the first aspect.

[0025] The present invention provides a method and system for predicting faults in industrial bearing vibration timing signals by fusing attention mechanisms and LSTM, which has the following significant advantages:

[0026] 1. Improved ability to extract and focus on fault features: By combining a bidirectional LSTM (BiLSTM) with the coordinate attention (CA) mechanism, the model can not only capture the bidirectional long-range temporal dependencies in vibration signals, but also adaptively enhance the attention to fault-sensitive key information in both the time and feature channel dimensions. This effectively suppresses irrelevant noise interference and significantly improves the ability to extract early, weak fault features.

[0027] 2. Achieved simultaneous and accurate diagnosis of fault type and severity: The model adopts a multi-task learning architecture, which can output high-precision fault type classification results and continuous fault severity regression estimates in parallel. This solves the problem that traditional methods are mostly limited to fault classification and cannot quantify the severity assessment, providing a more comprehensive decision-making basis for predictive maintenance.

[0028] 3. Enhanced model generalization and reliability: By using a dynamically weighted multi-objective loss function, adaptive optimization strategy, and KL divergence constraint combined with physical mechanisms for training, the model training process is more stable, and the prediction results are not only data-driven but also conform to the physical characteristics of fault impact, significantly improving the model's generalization ability and the reliability of diagnostic results.

[0029] 4. A closed loop of intelligent and automated diagnosis and early warning has been constructed: The system can automatically generate structured diagnostic reports and adaptively set thresholds based on historical data to provide multi-level early warnings. At the same time, it provides specific maintenance decision suggestions, realizing a complete automated process from status perception and intelligent diagnosis to decision support, which greatly improves the efficiency and intelligence level of equipment health management.

Attached Figure Description

[0030] Figure 1 is a schematic diagram of the process for predicting faults in industrial bearing vibration timing signals by fusing an attention mechanism and LSTM, provided in an embodiment of this application.

[0031] Figure 2 is the system architecture diagram of the industrial bearing vibration timing signal fault prediction system fusing an attention mechanism and LSTM, provided in this application.

[0032] Figure 3 is a schematic diagram of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0033] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of them.

[0034] It should be noted that in the embodiments of this application, "at least one" refers to one or more, and "more than one" refers to two or more. Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the specification of this application is for the purpose of describing particular embodiments only and is not intended to be limiting of this application.

[0035] Based on the embodiments described in this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0036] Example 1

[0037] Figure 1 is a schematic diagram of a method for predicting faults in industrial bearing vibration timing signals by fusing an attention mechanism and LSTM, provided as an embodiment of this application. As shown in Figure 1, the method includes:

[0038] S1. Vibration time-domain signals of industrial bearings during operation are collected using vibration sensors. The collected signals are then filtered, denoised, and normalized in sequence to obtain standardized vibration time-series data. Filtering preserves the fault characteristic frequency bands and suppresses noise, and normalization unifies the data scale, so the raw vibration signals become high-quality standardized time-series data that serve as effective input for the subsequent model.

[0039] Specifically, in this embodiment, the preprocessing in step S1 includes the following sub-steps:

[0040] S1.1 Signal Filtering: An adaptive bandpass filter is used, with a passband of 0.5 to 3 times the bearing's characteristic frequency. This retains the key frequency components related to the bearing fault and suppresses irrelevant frequency interference. For example, if the bearing's characteristic frequency is 100 Hz, the filter passes only signals between 50 Hz and 300 Hz, effectively highlighting the fault impact signal.
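The passband rule in S1.1 can be illustrated with a short sketch. The patent does not specify the filter design, so the brick-wall DFT filter below is a stand-in chosen only because it needs nothing beyond the standard library; a production system would more likely use a proper IIR or FIR design. The function names `passband` and `dft_bandpass` are hypothetical.

```python
import cmath
import math


def passband(char_freq_hz, low_ratio=0.5, high_ratio=3.0):
    """Passband edges as multiples of the bearing's characteristic frequency (S1.1)."""
    return low_ratio * char_freq_hz, high_ratio * char_freq_hz


def dft_bandpass(signal, fs_hz, low_hz, high_hz):
    """Brick-wall band-pass: zero every DFT bin outside [low_hz, high_hz].

    Uses a naive O(n^2) DFT, which is adequate only for short demo signals.
    """
    n = len(signal)
    spectrum = [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]
    for k in range(n):
        freq = min(k, n - k) * fs_hz / n  # frequency magnitude represented by bin k
        if not (low_hz <= freq <= high_hz):
            spectrum[k] = 0.0
    # Inverse DFT; the imaginary part is numerical noise for a real input signal
    return [
        (sum(s * cmath.exp(2j * math.pi * k * t / n) for k, s in enumerate(spectrum)) / n).real
        for t in range(n)
    ]
```

With a characteristic frequency of 100 Hz, `passband(100.0)` gives the 50 Hz to 300 Hz band from the example in the text, and `dft_bandpass` then keeps a 100 Hz component while removing components at, say, 10 Hz or 400 Hz.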

[0041] S1.2 Noise Suppression Processing: A wavelet packet thresholding algorithm is adopted. The db4 wavelet basis is used for multi-level wavelet packet decomposition, and an adaptive threshold function applies soft thresholding to the detail coefficients. This reduces random noise in the signal and improves the signal-to-noise ratio. Removing small noise fluctuations in this way, much like removing stains, makes the fault characteristics clearer.
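The soft thresholding step in S1.2 can be sketched on its own, independent of the wavelet packet decomposition (which would normally come from a wavelet library and is not reproduced here). The patent does not define its adaptive threshold function, so the universal threshold with a median-based noise estimate shown below is an assumed, commonly used choice rather than the patent's rule.

```python
import math
import statistics


def soft_threshold(coeffs, thr):
    """Shrink each coefficient toward zero by `thr`; values within [-thr, thr] become 0."""
    return [math.copysign(max(abs(c) - thr, 0.0), c) for c in coeffs]


def universal_threshold(coeffs):
    """Assumed threshold rule: sigma * sqrt(2 ln N), with sigma = median(|c|) / 0.6745."""
    sigma = statistics.median([abs(c) for c in coeffs]) / 0.6745
    return sigma * math.sqrt(2.0 * math.log(len(coeffs)))
```

In use, each set of detail coefficients from the decomposition would be passed through `soft_threshold` before the signal is reconstructed, so small coefficients attributed to noise are zeroed and larger ones are shrunk.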

[0042] S1.3 Signal Normalization Processing: The min-max normalization method is used to scale the signal amplitude to a preset numerical range, unifying the data scale and accelerating convergence of model training. All vibration signal amplitudes are compressed to the range [0, 1] or [-1, 1] so that excessively large or small values do not distort model learning.
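Min-max normalization as described in S1.3 is simple enough to show directly. The sketch below maps a signal to a chosen range; the function name and the handling of a constant signal (mapping every sample to the lower bound) are assumptions made for the example.

```python
def min_max_normalize(signal, lo=0.0, hi=1.0):
    """Linearly rescale `signal` so its minimum maps to `lo` and its maximum to `hi`."""
    s_min, s_max = min(signal), max(signal)
    if s_max == s_min:
        # Constant signal: no range to scale, so map every sample to the lower bound
        return [lo for _ in signal]
    scale = (hi - lo) / (s_max - s_min)
    return [lo + (x - s_min) * scale for x in signal]
```

Calling `min_max_normalize(x)` gives the [0, 1] range and `min_max_normalize(x, -1.0, 1.0)` gives the [-1, 1] range mentioned in the text. In practice the minimum and maximum would be taken from the training data and reused for real-time signals so that both are scaled consistently.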

[0043] S1.4 Data Augmentation Processing: The training dataset is expanded using time-series data augmentation techniques, including one or more of time warping, amplitude scaling, and adding Gaussian noise. This increases the amount of training data and improves the model's generalization ability. For example, stretching or compressing the original signal, or adding slight noise, generates additional samples from which the model can learn more fully.
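The three augmentation techniques listed in S1.4 can be sketched as small standalone functions. The patent gives no parameter values, so the scaling factor, noise level, and warp rate are left as arguments, and time warping is implemented here as uniform resampling by linear interpolation, which is one simple interpretation among several.

```python
import random


def amplitude_scale(signal, factor):
    """Multiply every sample by a constant factor."""
    return [x * factor for x in signal]


def add_gaussian_noise(signal, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation `sigma`."""
    return [x + rng.gauss(0.0, sigma) for x in signal]


def time_warp(signal, rate):
    """Resample by linear interpolation; rate > 1 compresses, rate < 1 stretches.

    Assumes the signal has at least two samples.
    """
    n_in = len(signal)
    n_out = max(2, int(round(n_in / rate)))
    out = []
    for i in range(n_out):
        pos = i * (n_in - 1) / (n_out - 1)   # fractional index into the input
        left = int(pos)
        right = min(left + 1, n_in - 1)
        frac = pos - left
        out.append(signal[left] * (1.0 - frac) + signal[right] * frac)
    return out
```

Passing a seeded generator such as `random.Random(0)` to `add_gaussian_noise` keeps the augmented dataset reproducible between runs.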

[0044] S1.5 Feature Extraction Preprocessing: The time-domain feature parameters of the vibration signal are calculated, including one or more of the root mean square value, kurtosis, peak factor (crest factor), and impulse index. These traditional time-domain features are fused with the preprocessed time-series data and used together as input to the fault feature extraction model. For example, the peakedness of the signal (kurtosis) and its impact intensity (impulse index) are calculated to provide the model with richer fault information.
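The four time-domain features named in S1.5 have standard textbook definitions, which the sketch below follows: kurtosis as the fourth central moment divided by the squared variance, the peak (crest) factor as peak over RMS, and the impulse index as peak over mean absolute value. The patent does not spell out its formulas, so treating these standard definitions as the intended ones is an assumption.

```python
import math


def time_domain_features(x):
    """Compute RMS, kurtosis, crest factor, and impulse factor of a signal.

    Assumes a non-constant signal with at least one non-zero sample.
    """
    n = len(x)
    mean = sum(x) / n
    rms = math.sqrt(sum(v * v for v in x) / n)
    peak = max(abs(v) for v in x)
    mean_abs = sum(abs(v) for v in x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    kurtosis = (sum((v - mean) ** 4 for v in x) / n) / (var ** 2)
    return {
        "rms": rms,
        "kurtosis": kurtosis,
        "crest_factor": peak / rms,
        "impulse_factor": peak / mean_abs,
    }
```

Impulsive fault signatures tend to raise kurtosis and the impulse factor well above their values for a healthy bearing, which is why these statistics are useful companions to the raw time series.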

[0045] These five sub-steps work in a progressive manner to ensure that the data input to the model is clean, standardized, and rich, laying a solid foundation for subsequent accurate diagnosis. It's like preparing ingredients for a chef: first screening (filtering), then washing (noise reduction), cutting and preparing (normalization), adding ingredients (data augmentation), and finally simple preprocessing (feature extraction), so that a good dish (accurate diagnosis) can be cooked.

[0046] S2. Construct a deep learning model containing a bidirectional LSTM (BiLSTM) layer and a coordinate attention mechanism layer, combining feature extraction with key information enhancement. The BiLSTM layer captures the bidirectional long-term temporal dependencies in the vibration time-series data, and the coordinate attention mechanism layer weights the hidden states output by the BiLSTM layer along the time and feature dimensions, strengthening the expression of fault-sensitive features and suppressing interference from irrelevant information.

[0047] Specifically, in this embodiment, the deep learning model constructed in step S2 is a hierarchical feature extraction and fusion network, specifically including:

[0048] S2.1, Bidirectional Temporal Feature Extraction Layer: A multi-layer bidirectional long short-term memory network (BiLSTM) is used as the core temporal feature extractor. The hidden state at each time step captures the long-range bidirectional dependencies related to the fault in the vibration signal. Like reading an article from both ends, this layer extracts past and future contextual information from the vibration signal. For example, the BiLSTM network analyzes the segments before and after each point of the vibration signal at the same time to identify abnormal periodic impact patterns (such as an impact occurring once per revolution).

[0049] The bidirectional temporal feature extraction layer in step S2.1 satisfies the following conditions:

[0050] The described multi-layer bidirectional long short-term memory (BiLSTM) network has 2 to 4 layers, with each layer containing 64 to 256 hidden units. This structural configuration clearly defines the network depth and capacity range, balancing model expressiveness and computational efficiency. Specifically, the BiLSTM layer count is 2 to 4 layers to avoid insufficient feature extraction due to shallow layers or overfitting due to excessive layers. The number of hidden units is 64 to 256, providing sufficient feature representation capabilities.

[0051] The BiLSTM layer receives the standardized vibration time-series data preprocessed in step S1. Its forward and backward LSTM units process the input sequence in forward and reverse time order, respectively, and concatenate the hidden states in both directions at each time step to form a complete hidden state output that integrates bidirectional contextual information. This bidirectional feature fusion mechanism fully captures temporal contextual information, enhancing the ability to express fault features. The forward LSTM processes the sequence in forward time order to capture historical dependencies; the backward LSTM processes the sequence in reverse time order to capture future dependencies; and the bidirectional hidden states are concatenated to form a complete feature representation that integrates bidirectional information.
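The concatenation described above can be written compactly. The notation below is an assumption for illustration ($x_t$ for the input at time step $t$, arrows for the two processing directions, $u$ for the number of hidden units per direction); the patent text itself does not fix these symbols.

```latex
\overrightarrow{h}_t = \mathrm{LSTM}_{\mathrm{fwd}}\left(x_t, \overrightarrow{h}_{t-1}\right), \qquad
\overleftarrow{h}_t = \mathrm{LSTM}_{\mathrm{bwd}}\left(x_t, \overleftarrow{h}_{t+1}\right), \qquad
h_t = \left[\overrightarrow{h}_t \,;\, \overleftarrow{h}_t\right] \in \mathbb{R}^{2u}, \quad t = 1, \dots, T
```

Stacking the concatenated states $h_1, \dots, h_T$ row by row gives the hidden state sequence that is passed to the coordinate attention mechanism layer; with $u$ units per direction, each row has $2u$ features.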

[0052] The BiLSTM network employs weight normalization combined with activation value scaling during training to address the internal covariate shift problem in deep network training. Weight normalization reparameterizes the weight matrix to a zero-mean, unit-variance distribution, and activation value scaling introduces a learnable scaling parameter before the activation function. Together they stabilize the activation distribution during training and accelerate convergence.
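The reparameterization described here can be sketched in a few lines. The patent does not say along which axis the weights are standardized or where exactly the scale is applied, so standardizing each row of the weight matrix and multiplying the pre-activation by a single `gain` value are assumptions, as are the function names.

```python
import math


def standardize_weights(weight_rows, eps=1e-5):
    """Reparameterize each row of a weight matrix to zero mean and unit variance."""
    out = []
    for row in weight_rows:
        mean = sum(row) / len(row)
        var = sum((w - mean) ** 2 for w in row) / len(row)
        out.append([(w - mean) / math.sqrt(var + eps) for w in row])
    return out


def scaled_pre_activation(weight_rows, x, gain):
    """Compute gain * (W_hat @ x): the learnable scale is applied before the activation."""
    w_hat = standardize_weights(weight_rows)
    return [gain * sum(w * v for w, v in zip(row, x)) for row in w_hat]
```

During training, `gain` would be a learnable parameter updated by the optimizer alongside the raw weights, while the standardization is recomputed from the current weights on every forward pass.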

[0053] The complete hidden state output is passed to the subsequent coordinate attention mechanism layer, and its output at the last time step is additionally used as a global temporal context feature. This feature is concatenated with the attention-weighted feature vector, and the result is input into the classification and regression output layer. This global context enhancement strategy strengthens the model's perception of the overall sequence state and prevents the attention mechanism from becoming excessively local, so the output layer uses both local key information and global state information.

[0054] Suppose we are analyzing a bearing vibration signal: A bidirectional LSTM simultaneously observes the vibration pattern 0.5 seconds before and 0.5 seconds after the current moment to determine if a fault impact has occurred. Weight standardization, similar to standardizing food processing, ensures the stable distribution of processing results for each layer, avoiding learning difficulties for subsequent layers. Global context concatenation focuses on both the most abnormal waveform segment (attention focus) and the overall state of the entire signal segment (last time step output) when judging faults, avoiding misjudgments. Through specific structural design and technical means, we ensure that the bidirectional temporal feature extraction layer achieves optimal performance and stability in industrial vibration signal scenarios.

[0055] S2.2, Coordinate Attention Mechanism Layer: This layer receives the hidden state sequence from all time steps of the BiLSTM layer. It calculates attention weights along both the time and feature dimensions using a coordinate attention mechanism, generating a two-dimensional attention weight matrix. This matrix recalibrates and weights the BiLSTM output features to enhance the representation of fault-sensitive periods and feature channels. Like a searchlight, the coordinate attention mechanism automatically focuses on the most critical fault signal segments and feature channels. For example, the model pays special attention to the time point with the highest impact energy in the signal (when) and the feature channel that best represents an inner ring fault (where), and strengthens their weights.

[0056] Specifically, the implementation process of the coordinate attention mechanism layer in step S2.2 is as follows:

[0057] The layer receives the complete hidden state sequence output from the BiLSTM layer and first performs one-dimensional global pooling along both the time and feature dimensions to generate a global context descriptor for each. This compresses the feature sequence output from the BiLSTM and captures the global statistical characteristics of the entire sequence in both dimensions, providing a basis for the subsequent weight calculation. Global pooling along the time dimension averages each feature channel across all time steps to generate a feature-dimension descriptor, which characterizes the global importance of each feature channel. Global pooling along the feature dimension averages all feature channels at each time step to generate a time-dimension descriptor, which represents the global importance of each time step.

[0058] The obtained two-dimensional descriptors are concatenated and fused, and then intermediate feature maps are generated through convolutional layers with shared weights and nonlinear activation functions to promote the interaction of temporal and channel information, and richer intermediate features are generated through nonlinear transformation.

[0059] The intermediate feature maps are separated along the original dimensions to obtain the temporal attention weight vector and the channel attention weight vector, respectively. Attention weights that specifically act on the temporal dimension and the feature dimension are decoupled from the fused features.

[0060] The temporal attention weight vector and the channel attention weight vector are combined by an outer product to generate a two-dimensional attention weight matrix. This matrix is then multiplied element-wise with the complete hidden state sequence output by the BiLSTM to enhance fault-sensitive periods and key feature channels while suppressing irrelevant information. In this way, the one-dimensional temporal weights and channel weights are combined into a two-dimensional attention map for precise calibration of the original features.

[0061] Specifically, the coordinate attention mechanism layer in step S2.2 includes the following processing steps. Given the complete hidden state sequence $H \in \mathbb{R}^{T \times D}$ output by the BiLSTM layer, where $T$ is the total number of time steps and $D$ is the feature dimension, one-dimensional global average pooling is performed along both the time dimension and the feature dimension to generate a time-dimension global context descriptor $z^{t}$ and a feature-dimension global context descriptor $z^{c}$:

[0062] $z^{t} = \frac{1}{D} \sum_{j=1}^{D} H_{:,j}$, $\quad z^{c} = \frac{1}{T} \sum_{i=1}^{T} H_{i,:}$

[0063] where $H_{:,j}$ is the j-th column of the feature matrix $H$, a vector of length $T$ holding the values of the j-th feature channel at all time steps (the colon denotes all time steps). $\sum_{j=1}^{D}$ sums along the feature dimension, that is, over all $D$ columns. $z^{t}$ is the global context descriptor for the time dimension: a vector of length $T$ whose element $z^{t}_{i}$ is the average response intensity across all feature channels at the i-th time step, encoding the global importance of each time step. $H_{i,:}$ is the i-th row of $H$, a vector of length $D$ holding the values of all $D$ feature channels at the i-th time step. $\sum_{i=1}^{T}$ sums along the time dimension, that is, over all $T$ rows. $z^{c}$ is the global context descriptor for the feature dimension: a vector of length $D$ whose element $z^{c}_{j}$ is the average response intensity of the j-th feature channel across all time steps, encoding the global importance of each feature channel.

[0064] The two-dimensional spatiotemporal feature $H$ is thus compressed into two one-dimensional global description vectors, one capturing global information in the time dimension ($z^{t}$) and one capturing global information in the feature channel dimension ($z^{c}$). This prepares for the subsequent calculation of attention weights.

[0065] The time-dimension descriptor $z^{t}$ and the feature-dimension descriptor $z^{c}$ are concatenated, and an intermediate feature map $F$ is then generated by a 1x1 convolution with shared weights, batch normalization, and a non-linear activation function:

[0066] $F = \sigma\left(\mathrm{BN}\left(\mathrm{Conv}_{1 \times 1}\left([z^{t}; z^{c}]\right)\right)\right)$

[0067] where $\sigma$ is the Sigmoid non-linear activation function, which compresses the output value to between 0 and 1 and introduces non-linearity. $\mathrm{BN}$ is the batch normalization operation, which standardizes the output of the convolution, accelerating training and improving stability. $\mathrm{Conv}_{1 \times 1}$ is a convolution with a kernel size of 1; a single such layer processes the concatenated vector. Its purpose is cross-feature interaction and fusion, and it can also reduce or increase dimensionality. $[\cdot\,;\cdot]$ is the concatenation operation: the time descriptor $z^{t}$ (length $T$) and the channel descriptor $z^{c}$ (length $D$) are concatenated along the same dimension to form a longer vector of length $T + D$. $F$ is the intermediate feature map, a transformed and activated feature whose dimension is determined by the number of output channels $C$ of the 1x1 convolution; its shape is [C, T+D], and it integrates global information from time and channels.

[0068] The intermediate feature map $F$ is decomposed along the original dimensions into a temporal part $F^{t}$ and a channel part $F^{c}$, and each part is normalized with the Sigmoid function $\sigma$ to give the temporal attention weights $a^{t}$ and the channel attention weights $a^{c}$:

[0069] $a^{t} = \sigma\left(F^{t}\right)$, $\quad a^{c} = \sigma\left(F^{c}\right)$

[0070] where $F^{t}$ is the time-dimension part of the intermediate feature map, obtained by extracting the first $T$ elements of $F$, and $F^{c}$ is the feature-dimension part, obtained by extracting the last $D$ elements of $F$. $a^{t}$ is the temporal attention weight vector of length $T$; each element $a^{t}_{i}$ is a weight between 0 and 1 representing the importance of the i-th time step for fault diagnosis, and the closer the value is to 1, the more important the time step. $a^{c}$ is the channel attention weight vector of length $D$; each element $a^{c}_{j}$ is a weight between 0 and 1 representing the importance of the j-th feature channel for fault diagnosis, and the closer the value is to 1, the more important the channel.

[0071] In this way, temporal and channel information are decoupled from the fused feature $F$, and attention weights are generated that act on the two dimensions respectively.

[0072] A two-dimensional attention weight matrix M is generated through an outer product operation:

[0073] $M = a^{t} \otimes a^{c}$,

[0074] where $\otimes$ is the outer product operation, which multiplies the temporal attention weight vector $a^{t}$ (Tx1) by the channel attention weight vector $a^{c}$ (1xD) to produce the matrix M (TxD). M is the two-dimensional attention weight matrix; each element $M_{ij} = a^{t}_{i}\,a^{c}_{j}$ represents the overall importance weight of the i-th time step and the j-th feature channel.

[0075] The weight matrix M is multiplied element-wise with the original hidden state H to obtain the calibrated weighted feature representation $\tilde{H}$:

[0076] $\tilde{H} = M \odot H$,

[0077] where $\odot$ represents element-wise multiplication (Hadamard product), which multiplies the original feature matrix H (TxD) and the attention weight matrix M (TxD) at corresponding positions. $\tilde{H}$ is the weighted and calibrated feature sequence, and its shape is the same as that of H, [T, D]. After attention weighting, the values of important time steps and important feature channels are enhanced, while the values of insignificant time steps and feature channels are suppressed.

[0078] Finally, the weighted features (the element-wise product of M and H) are used for subsequent feature fusion and fault diagnosis output. The result is a feature representation focused on key fault information, which is input into subsequent network layers for fault classification and severity regression, thereby improving the model's diagnostic performance and interpretability.

[0079] For example, assume the BiLSTM outputs a sequence with 100 time steps (T=100) and 64-dimensional features (D=64). Step 1: generate a 100-dimensional time descriptor and a 64-dimensional channel descriptor. Steps 2 and 3: after fusing and transforming these two vectors, split the result into a new 100-dimensional temporal weight vector (the refined time importance) and a new 64-dimensional channel weight vector (the refined channel importance). For example, the 25th time step might receive the highest weight, indicating that this moment is highly likely to be the fault impact point. Step 4: combine the temporal and channel weight vectors into a 100x64 matrix M, and use this matrix to amplify the values at the 25th time step and on the feature channels identified as important, yielding the enhanced features.
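
The split, outer-product, and re-weighting steps of [0068] to [0077] can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `coordinate_attention`, the variable names `a_t` and `a_c`, and the simplification that the fused feature F is a single vector of length T+D (one output channel) are all choices made for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def coordinate_attention(H, F):
    """Re-weight hidden states H (T x D) using a fused vector F of length T + D.

    Assumption: F has a single channel, so it is a flat list.
    """
    T, D = len(H), len(H[0])
    if len(F) != T + D:
        raise ValueError("F must have length T + D")
    a_t = [sigmoid(f) for f in F[:T]]   # temporal weights from the first T elements
    a_c = [sigmoid(f) for f in F[T:]]   # channel weights from the last D elements
    # Outer product: M[i][j] = a_t[i] * a_c[j]
    M = [[a_t[i] * a_c[j] for j in range(D)] for i in range(T)]
    # Hadamard product with the original hidden states
    H_weighted = [[M[i][j] * H[i][j] for j in range(D)] for i in range(T)]
    return H_weighted, M

# Toy input with T=3 time steps and D=2 channels (values are illustrative only)
H = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
F = [0.0, 4.0, -4.0, 0.0, 2.0]
H_weighted, M = coordinate_attention(H, F)
```

In this toy input the second time step has the largest fused value, so its row of M carries the largest weights and the corresponding hidden states are suppressed the least.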

[0080] The core value of this scheme lies in defining an efficient and accurate two-dimensional attention mechanism computation process, which is a key technology for improving the performance and interpretability of model fault diagnosis.

[0081] S2.3 Feature Fusion and Compression Layer: The feature sequence weighted by coordinate attention is weighted and summed along the time dimension, compressed into a fixed-dimensional feature vector. This vector integrates the key information most relevant to the fault from the entire sequence. The feature fusion and compression layer condenses and refines the rich information of the entire time series into a fixed-length feature vector containing the essence. For example, the analysis results of a vibration signal lasting 10 seconds can be compressed into a core feature vector containing the fault type of inner ring crack and the severity of 0.7.

[0082] The feature fusion and compression layer in step S2.3 is implemented in the following manner:

[0083] Receive the coordinate-attention-weighted feature sequence $\tilde{H} = [\tilde{h}_{1}, \ldots, \tilde{h}_{T}] \in \mathbb{R}^{T \times D}$, where T is the total number of time steps and D is the feature dimension. The input to this step is the feature sequence already calibrated by the coordinate attention mechanism, in which unimportant time-step and feature-channel information has been suppressed and important information has been enhanced.

[0084] A temporal fusion method based on attention weights compresses the weighted feature sequence along the time dimension to generate a fixed-dimensional global feature vector v:

[0085] $v = \sum_{t=1}^{T} \alpha_{t}\,\tilde{h}_{t}$,

[0086] where $\tilde{h}_{t}$ is the weighted feature vector at the t-th time step and $\alpha_{t}$ is the fusion weight coefficient for the t-th time step. The information from the entire time series is compressed into a single vector adaptively rather than by simple averaging, so that different levels of attention are assigned according to the importance of each time step to the current fault diagnosis task. The coefficient $\alpha_{t}$ is not preset but is calculated using a learnable scoring function, so that the model concentrates on the time steps containing key fault information.

[0087] The fusion weight coefficient $\alpha_{t}$ is calculated from the importance score $e_{t}$ of time step t and normalized using the Softmax function:

[0088] $e_{t} = u^{\top} \tanh\left(W_{a}\,\tilde{h}_{t} + b_{a}\right)$,

[0089] $\alpha_{t} = \dfrac{\exp(e_{t})}{\sum_{k=1}^{T} \exp(e_{k})}$,

[0090] where $\tilde{h}_{t}$ is the weighted feature vector at time step t, $W_{a}$ is the weight matrix, $b_{a}$ and $u$ are learnable parameters, and $\tanh$ is the hyperbolic tangent activation function. An importance score is calculated for each time step and converted into a weight. This is a learnable attention process that enables the model to learn which parts of the sequence are most critical to the final decision.

[0091] The final global feature vector v integrates the key information most relevant to the fault from the entire time series and serves as the input to the subsequent classification and regression output layers. It is no longer time-series data but a fixed-dimensional feature sample that condenses the entire sequence, and it can be fed directly into the fully connected layers for classification and regression.
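
The attention-based temporal compression of step S2.3 can be sketched as below. The additive scoring form (a tanh layer followed by a projection vector), the function name `attention_pool`, and the parameter names `W`, `b`, and `u` are assumptions made for illustration; the source only states that a learnable scoring function with a tanh activation and Softmax normalization is used.

```python
import math

def attention_pool(H_weighted, W, b, u):
    """Compress a T x D sequence into one D-dimensional vector.

    Scores each time step with u . tanh(W h + b) (assumed form),
    normalizes the scores with Softmax, then takes the weighted sum.
    """
    scores = []
    for h in H_weighted:
        hidden = [math.tanh(sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + b_i)
                  for row, b_i in zip(W, b)]
        scores.append(sum(u_i * z_i for u_i, z_i in zip(u, hidden)))
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]        # fusion weights, sum to 1
    D = len(H_weighted[0])
    v = [sum(alpha[t] * H_weighted[t][j] for t in range(len(alpha)))
         for j in range(D)]
    return v, alpha

# Toy parameters (illustrative values, not trained weights)
seq = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
u = [1.0, 1.0]
v, alpha = attention_pool(seq, W, b, u)
```

The output `v` has the feature dimension D regardless of the sequence length T, which is what lets it feed fixed-size fully connected layers.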

[0092] S2.4 Classification and Regression Output Layer: This layer receives the fixed-dimensional feature vector, outputs the classification probability of the fault type through a fully connected layer and a Softmax function, and simultaneously outputs a regression estimate of the fault severity through another fully connected layer and a linear activation function. Like an expert consultation, the classification and regression output layer provides both a qualitative judgment (what kind of fault it is) and a quantitative assessment (how severe it is). For example, one branch outputs the fault type probability (e.g., inner ring crack: 85%, ball bearing spalling: 10%, normal: 5%), while another branch directly outputs the severity score (e.g., 0.72 / 1.0).

[0093] The classification and regression output layer in step S2.4 is implemented in the following way:

[0094] Receive the fixed-dimensional global feature vector v, which is simultaneously input into two independent fully connected network branches.

[0095] Fault type classification branch: the feature vector v is mapped by a fully connected layer to a space whose dimension equals the number of fault categories C, and after processing with the Softmax function, the predicted probability distribution P over the fault categories is output:

[0096] $z = W_{c}\,v + b_{c}$,

[0097] $p_{i} = \dfrac{\exp(z_{i})}{\sum_{j=1}^{C} \exp(z_{j})}$, $P = [p_{1}, p_{2}, \ldots, p_{C}]$,

[0098] where $W_{c}$ and $b_{c}$ are the trainable parameters of the classification branch. $W_{c}$ is the weight matrix of the classification layer, which maps the feature vector v of dimension D to a space of dimension C (one score per category). $b_{c}$ is the bias vector of the classification layer, with one bias value per class. $z$ is the raw output (logits) of the classification layer, a vector of length C in which each element $z_{i}$ is the raw (unnormalized) score for the input sample belonging to the i-th category. $p_{i}$ is the predicted probability of the i-th fault category, with a value between 0 and 1. $P$ is the classification probability distribution of the fault type, a vector of length C in which $p_{i}$ is the probability that the sample belongs to the i-th class. The global feature v is thus transformed into a probability distribution that expresses the model's judgment of the fault type and its confidence level.

[0099] Fault severity regression branch: the feature vector v is mapped by another fully connected layer to a one-dimensional space and, using a linear activation function, a continuous regression estimate $\hat{y}$ of the fault severity is output:

[0100] $\hat{y} = w_{r}^{\top} v + b_{r}$,

[0101] where $w_{r}$ and $b_{r}$ are the trainable parameters of the regression branch. $w_{r}$ is the weight vector of the regression layer, whose function is to extract the information most relevant to the severity of the fault from the feature vector v of dimension D. v is the global feature vector, the same input that the classification branch receives. $b_{r}$ is the bias term of the regression layer, a trainable scalar. $\hat{y}$ is the regression estimate of the fault severity, a continuous scalar whose range and meaning need to be defined before training (e.g., 0 for normal, 1 for severe fault). The branch transforms the global feature v into a continuous numerical value that quantifies the severity of the fault.

[0102] Finally, the model synchronously outputs the classification probability distribution of the fault type and the regression estimate of the fault severity. This design allows the model to perform the two core tasks of fault diagnosis at the same time: what the fault is, and how severe it is.
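
The two output branches of step S2.4 amount to one Softmax head and one linear head sharing the same input vector. The sketch below shows this in plain Python; the names `output_heads`, `W_c`, `b_c`, `w_r`, and `b_r`, as well as all numeric values, are illustrative assumptions rather than trained parameters.

```python
import math

def softmax(z):
    m = max(z)                               # shift for numerical stability
    exps = [math.exp(x - m) for x in z]
    s = sum(exps)
    return [e / s for e in exps]

def output_heads(v, W_c, b_c, w_r, b_r):
    """Return (class probabilities, severity estimate) for feature vector v."""
    # Classification branch: fully connected layer followed by Softmax
    logits = [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(W_c, b_c)]
    probs = softmax(logits)
    # Regression branch: fully connected layer with linear activation
    severity = sum(w * x for w, x in zip(w_r, v)) + b_r
    return probs, severity

v = [0.5, -1.0]                              # D = 2
W_c = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # C = 3 fault categories
b_c = [0.0, 0.0, 0.0]
w_r, b_r = [0.4, -0.2], 0.1
probs, severity = output_heads(v, W_c, b_c, w_r, b_r)
```

Because the regression head is linear, its output is not confined to [0, 1]; any bounding of the severity scale has to come from the training targets or a later stage.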

[0103] These four sub-steps constitute the core processing pipeline of the model: first, comprehensive reading (BiLSTM extraction) → then, key annotation (attention focusing) → then, summarizing the essence (feature compression) → finally, providing qualitative and quantitative conclusions (classification + regression). It's like an experienced engineer first listening to an entire recording of bearing sounds, then repeatedly listening to key abnormal segments, and finally comprehensively judging the type and severity of the fault.

[0104] S3. The deep learning model is trained using a labeled bearing vibration dataset. The Adam optimizer dynamically adjusts the parameters, the cross-entropy error function serves as the main loss function, and an early stopping mechanism is introduced to prevent overfitting, so that the model converges stably and generalizes well.

[0105] Specifically, in this embodiment, the model training and optimization process in step S3 includes:

[0106] (1) Design of multi-objective composite loss function:

[0107] The model training uses a classification-regression joint loss function composed of the classification loss $L_{CE}$, calculated from the cross-entropy error (CEE), and the regression loss $L_{reg}$, based on fault impact characteristics:

[0108] $L_{total} = \lambda(t)\,L_{reg} + \left(1 - \lambda(t)\right) L_{CE}$,

[0109] where the weighting coefficient $\lambda(t)$ is dynamically adjusted according to the training epoch, so that the initial stage focuses on the regression task and the later stage focuses on the classification task:

[0110] $\lambda(t) = \lambda_{\min} + \left(\lambda_{\max} - \lambda_{\min}\right) e^{-\gamma t}$,

[0111] where $\lambda(t)$ is the weight value in the t-th training epoch; $\lambda_{\max}$ and $\lambda_{\min}$ are the maximum and minimum values of the weight, respectively; $\gamma$ is the decay rate, a hyperparameter greater than 0 that controls the speed at which the weight decays from $\lambda_{\max}$ to $\lambda_{\min}$; t is the index of the training epoch; and $e^{-\gamma t}$ is the exponentially decaying term, which gradually approaches 0 as t increases.

[0112] In the early stages of training (when t is small), $\lambda(t)$ is close to $\lambda_{\max}$ and the total loss is dominated by $L_{reg}$, so the model first learns how to assess the severity of faults. As training progresses, $\lambda(t)$ gradually decreases and the weight $1 - \lambda(t)$ increases, so the model focuses more on learning the fault types. This is a curriculum learning strategy.
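
A small sketch of the dynamic task weighting described in [0107] to [0112] follows. The exponential interpolation between a maximum and minimum weight and the convex combination of the two losses are assumptions reconstructed from the verbal description; the function names and default values (`lam_max=0.9`, `lam_min=0.1`, `gamma=0.1`) are hypothetical.

```python
import math

def regression_weight(epoch, lam_max=0.9, lam_min=0.1, gamma=0.1):
    """Weight on the regression loss; decays from lam_max towards lam_min."""
    return lam_min + (lam_max - lam_min) * math.exp(-gamma * epoch)

def total_loss(loss_cls, loss_reg, epoch, **schedule):
    """Joint loss: regression dominates early, classification dominates late."""
    lam = regression_weight(epoch, **schedule)
    return lam * loss_reg + (1.0 - lam) * loss_cls

# Weight on the regression term at a few epochs
weights = [regression_weight(t) for t in (0, 10, 50)]
```

With these defaults the regression weight starts at 0.9 and approaches 0.1, so the classification term ends up carrying most of the gradient signal.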

[0113] The classification loss uses the cross-entropy error function:

[0114] $L_{CE} = -\dfrac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log p_{i,c}$,

[0115] where N is the number of samples and C is the number of fault categories; $y_{i,c}$ is the true label of sample i for category c, and $p_{i,c}$ is the corresponding probability predicted by the model.
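
For reference, the cross-entropy error over one-hot labels can be computed as in the sketch below. Averaging over the N samples and the small constant `eps` inside the logarithm are assumptions added for the example.

```python
import math

def cross_entropy(y_onehot, probs, eps=1e-12):
    """Mean cross-entropy over N samples with one-hot labels.

    eps guards against log(0); it is an implementation detail of this sketch.
    """
    n = len(y_onehot)
    total = sum(y * math.log(p + eps)
                for ys, ps in zip(y_onehot, probs)
                for y, p in zip(ys, ps))
    return -total / n

loss_uniform = cross_entropy([[0, 1]], [[0.5, 0.5]])   # equals ln 2
```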

[0116] The regression loss combines the mean squared error with a consistency constraint on the fault impact characteristics:

[0117] $L_{reg} = \dfrac{1}{N} \sum_{i=1}^{N} \left(y_{i} - \hat{y}_{i}\right)^{2} + \beta\, D_{KL}\left(P_{true} \,\|\, P_{pred}\right)$,

[0118] where $y_{i}$ is the true fault severity and $\hat{y}_{i}$ is the predicted fault severity. $D_{KL}$ is the KL divergence, which measures the difference between two probability distributions and keeps the predicted fault impact characteristic distribution $P_{pred}$ consistent with the true distribution $P_{true}$. $\beta$ is the regularization strength hyperparameter, which controls the proportion of the second term (the KL divergence) in the regression loss. $P_{pred}$ is the predicted fault impact characteristic distribution, derived from the signal or state predicted by the model. $P_{true}$ is the distribution of true fault impact characteristics, that is, the distribution followed by features extracted from real signals (such as kurtosis and impulse indices).

[0119] The regression loss requires not only accurate predicted values (the first term) but also that the predictions be statistically consistent with the characteristics of the real fault signal (the second term). This introduces physical prior knowledge, which can make the prediction more reliable.
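
The regression loss of [0116] to [0119] can be illustrated as mean squared error plus a weighted KL divergence between two discrete distributions. How the two impact-characteristic distributions are obtained is not specified in the source, so the sketch simply takes them as inputs; the direction of the KL divergence (true relative to predicted), the names `p_true` and `q_pred`, and the default `beta=0.1` are assumptions.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def regression_loss(y_true, y_pred, p_true, q_pred, beta=0.1):
    """MSE on severity plus a beta-weighted distribution consistency term."""
    mse = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / len(y_true)
    return mse + beta * kl_divergence(p_true, q_pred)

# Identical distributions add no penalty, so only the MSE remains
loss_mse_only = regression_loss([0.0, 1.0], [0.0, 0.0], [1.0, 0.0], [1.0, 0.0])
```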

[0120] (2) Adaptive optimization strategy: the Adam optimizer is used for parameter optimization, with the following update rule:

[0121] $m_{t} = \beta_{1} m_{t-1} + (1 - \beta_{1})\, g_{t}$,

[0122] $v_{t} = \beta_{2} v_{t-1} + (1 - \beta_{2})\, g_{t}^{2}$,

[0123] $\hat{m}_{t} = \dfrac{m_{t}}{1 - \beta_{1}^{t}}$, $\hat{v}_{t} = \dfrac{v_{t}}{1 - \beta_{2}^{t}}$,

[0124] $\theta_{t} = \theta_{t-1} - \eta\, \dfrac{\hat{m}_{t}}{\sqrt{\hat{v}_{t}} + \epsilon}$,

[0125] where $g_{t}$ is the gradient of the loss function with respect to the parameters at step t. $m_{t}$ is the first-moment estimate of the gradient (momentum), similar to the velocity term in SGD with momentum. $v_{t}$ is the second-moment estimate of the gradient, used to adapt the learning rate of each parameter. $\beta_{1}$ and $\beta_{2}$ are the decay rates of the first and second moments, usually set to 0.9 and 0.999, respectively. $\hat{m}_{t}$ and $\hat{v}_{t}$ are the bias-corrected first and second moment estimates, which correct the bias caused by initializing $m_{0}$ and $v_{0}$ to 0. $\theta_{t}$ denotes the model parameters updated at step t. $\eta$ is the global learning rate. $\epsilon$ is a very small constant that prevents the denominator from being zero.
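
The standard Adam update can be written out for a flat list of parameters as below. This is a didactic sketch in plain Python, not a replacement for a framework optimizer; the function name `adam_step` and the default learning rate are assumptions, while the decay rates 0.9 and 0.999 are the values the text cites as typical.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. t is the 1-based step index; returns (theta, m, v)."""
    new_theta, new_m, new_v = [], [], []
    for th, g, m_i, v_i in zip(theta, grad, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # first-moment estimate
        v_i = beta2 * v_i + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m_i / (1 - beta1 ** t)             # bias correction
        v_hat = v_i / (1 - beta2 ** t)
        new_theta.append(th - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v

theta, m, v = adam_step([1.0], [2.0], [0.0], [0.0], t=1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly one learning rate regardless of the gradient's magnitude.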

[0126] The learning rate follows an exponential decay strategy:

[0127] $\eta_{t} = \eta_{0}\, e^{-k t}$,

[0128] where $\eta_{t}$ is the actual learning rate at step (or epoch) t, $\eta_{0}$ is the initial learning rate, k is the decay rate, and t is the number of training steps or epochs.

[0129] (3) Regularization and early stopping mechanism:

[0130] A regularization strategy combining Dropout and weight decay is adopted, and the Dropout rate is dynamically adjusted as training progresses.

[0131] The early stopping mechanism is based on the change in cross-entropy loss on the validation set:

[0132] Training stops at epoch t when the validation set loss has not decreased for E consecutive epochs, that is, when

[0133] $\min_{t-E < s \leq t} L_{val}^{(s)} \geq \min_{s \leq t-E} L_{val}^{(s)}$,

[0134] where $L_{val}^{(s)}$ is the loss calculated on the validation set at the s-th epoch and E is the patience value, the number of consecutive epochs without improvement that is tolerated before training stops.
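
The patience rule can be sketched as a small helper that is called once per epoch with the validation loss. The class name `EarlyStopping` and the strict "less than" comparison for an improvement are assumptions of this example; the loss values are made up.

```python
class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=3)
val_losses = [0.9, 0.7, 0.71, 0.72, 0.70, 0.69]   # illustrative values
stop_epoch = next((i for i, loss in enumerate(val_losses) if stopper.step(loss)), None)
```

In practice the parameters from the best epoch are usually saved and restored when the stop is triggered; that bookkeeping is omitted here.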

[0135] (4) Gradient clipping and monitoring:

[0136] Global gradient clipping is used to prevent gradient explosion:

[0137] $\tilde{g}_{t} = g_{t} \cdot \min\left(1, \dfrac{c}{\lVert g_{t} \rVert_{2}}\right)$,

[0138] where $g_{t}$ is the original gradient vector calculated at step t (or the t-th mini-batch), which may be the concatenation of the gradients of all parameters or the gradient of a single layer; $\lVert g_{t} \rVert_{2}$ is the L2-norm (Euclidean norm) of the gradient, which measures its overall magnitude; $\tilde{g}_{t}$ is the clipped gradient; and c is the clipping threshold, usually set empirically, for example to 1.0, 5.0, or 10.0.
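
Clipping by global L2-norm rescales the whole gradient vector when its norm exceeds the threshold and leaves it untouched otherwise. A minimal sketch, with the function name `clip_by_global_norm` chosen for the example:

```python
import math

def clip_by_global_norm(grads, threshold):
    """Rescale grads so their L2-norm does not exceed threshold.

    Returns the (possibly rescaled) gradients and the original norm.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= threshold:
        return list(grads), norm
    scale = threshold / norm
    return [g * scale for g in grads], norm

clipped, norm = clip_by_global_norm([3.0, 4.0], threshold=1.0)
```

Because every component is multiplied by the same factor, the direction of the update is preserved and only its length is limited.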

[0139] The training monitoring system is used to track the changing trends of cross-entropy loss and accuracy in real time.

[0140] Through a dynamically weighted multi-task loss function, an adaptive optimizer, and regularization strategies, this scheme addresses the core challenges in bearing fault prediction. Task balancing: classification and regression tasks of different natures are coordinated through dynamic weights. Training stability: the Adam optimizer, learning rate decay, and gradient clipping are employed. Generalization ability: overfitting is prevented through Dropout, weight decay, and early stopping. Physical consistency: a KL divergence term is introduced into the regression loss so that the prediction results conform to the physical characteristics of the fault. Together, these strategies allow the model to be trained efficiently and stably and to produce accurate and reliable predictions.

[0141] S4. The vibration data collected in real time is input into the trained deep learning model, which simultaneously outputs the fault type identification result (e.g., inner ring crack, ball bearing spalling) and the fault severity assessment value (e.g., minor, moderate, severe). The fault severity assessment is calculated quantitatively from the fault impact characteristics learned by the model, so that fault type and severity are diagnosed at the same time.

[0142] Specifically, in this embodiment, the fault severity assessment in step S4 is based on a quantitative calculation method that fuses multi-scale spectral kurtosis features with nonlinear dynamic parameters. The specific implementation process is as follows:

[0143] (1) Extracting multi-scale spectral kurtosis features from the preprocessed real-time vibration signal: the signal is decomposed into J scales using the maximal overlap discrete wavelet packet transform (MODWPT), and the spectral kurtosis value of each sub-band signal is calculated:

[0144] $SK_{j} = \dfrac{\left\langle \lvert w_{j}(t) \rvert^{4} \right\rangle}{\left\langle \lvert w_{j}(t) \rvert^{2} \right\rangle^{2}}$,

[0145] where $w_{j}(t)$ is the wavelet coefficient of the j-th sub-band, obtained by decomposing the original vibration signal with the MODWPT and representing the component of the signal in frequency band j at time t. $\langle \cdot \rangle$ denotes time averaging, that is, the average of the enclosed quantity over the entire time axis. $\langle \lvert w_{j}(t) \rvert^{4} \rangle$ is the fourth moment of the wavelet coefficients, which reflects the intensity of the impulsive component of the signal in that sub-band. $\langle \lvert w_{j}(t) \rvert^{2} \rangle$ is the second moment of the wavelet coefficients, which approximates the power of the signal in that sub-band.

[0146] The multi-scale spectral kurtosis feature vector $K = [SK_{1}, SK_{2}, \ldots, SK_{J}]$ is then constructed. It consists of the spectral kurtosis values of the J sub-bands and describes the distribution of fault impact characteristics across the entire frequency band. $SK_{j}$ is the spectral kurtosis value at the center frequency $f_{j}$ of the j-th sub-band and indicates the significance of impulsive fault characteristics within that band: a higher value indicates a stronger impulsive component and a greater likelihood of a fault in that band.
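
Given sub-band coefficients that have already been produced by a wavelet packet decomposition, the per-band kurtosis ratio and the feature vector can be computed as below. The MODWPT itself is not implemented here, and the sketch uses the plain ratio of the fourth moment to the squared second moment described in the text, without any constant offset; the function names and toy coefficients are assumptions.

```python
def spectral_kurtosis(coeffs):
    """Ratio of the time-averaged 4th moment to the squared 2nd moment."""
    n = len(coeffs)
    m2 = sum(abs(c) ** 2 for c in coeffs) / n
    m4 = sum(abs(c) ** 4 for c in coeffs) / n
    if m2 == 0:
        raise ValueError("sub-band has zero energy")
    return m4 / (m2 ** 2)

def kurtosis_vector(subbands):
    """Multi-scale feature vector: one kurtosis value per sub-band."""
    return [spectral_kurtosis(band) for band in subbands]

# Toy sub-bands: a constant-amplitude band and a band with a single impulse
K = kurtosis_vector([[1.0, -1.0, 1.0, -1.0], [0.0, 0.0, 0.0, 4.0]])
```

The impulsive band scores four times higher than the constant-amplitude band, which is the behaviour the feature relies on to localize impact-type faults in frequency.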

[0147] (2) Calculate nonlinear dynamic parameters: using phase space reconstruction, the optimal embedding dimension m and time delay τ are determined through the C-C method, and the correlation dimension $D_{2}$ and the maximum Lyapunov exponent $\lambda_{L}$ are calculated:

[0148] $D_{2} = \lim_{r \to 0} \dfrac{\ln C(r)}{\ln r}$, $\lambda_{L} = \lim_{t \to \infty} \dfrac{1}{t} \ln \dfrac{\lVert \delta(t) \rVert}{\lVert \delta(0) \rVert}$,

[0149] where $C(r)$ is the correlation integral, which measures the proportion of all pairs of points in phase space whose distance is less than radius r and thus describes the degree of clustering of the system's orbits; r is the radius of the hypersphere in phase space. $D_{2}$ is the correlation dimension, which quantifies the complexity and degrees of freedom of the system's dynamic behavior; when a system transitions from a normal state to a fault state, its dynamic characteristics change, and $D_{2}$ can capture this change. $\delta(0)$ is the distance vector between two adjacent points in phase space at the initial moment, and $\delta(t)$ is the distance vector between these two points after time t; $\lVert \cdot \rVert$ is the norm of a vector, usually the Euclidean distance. $\lambda_{L}$ is the maximum Lyapunov exponent, which measures the system's sensitivity to initial conditions (i.e., the degree of chaos). $\lambda_{L} > 0$ usually implies that the system is chaotic, and the magnitude of its value reflects the degree of disorder in the system, which is related to the development of faults. These two quantities describe the changes in the nonlinear dynamic characteristics of the bearing-rotor system as a whole, and these changes are deep indicators of the occurrence and development of faults.
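
The phase-space quantities can be illustrated with a time-delay embedding and a brute-force correlation integral. The sketch below estimates the correlation dimension as a finite-difference slope between two radii instead of taking the limit, does not implement the C-C method for choosing m and τ, and leaves out the Lyapunov exponent; all function names are assumptions. It needs Python 3.8 or later for `math.dist`.

```python
import math

def delay_embed(x, m, tau):
    """Time-delay embedding: rows are [x[i], x[i+tau], ..., x[i+(m-1)tau]]."""
    n = len(x) - (m - 1) * tau
    return [[x[i + k * tau] for k in range(m)] for i in range(n)]

def correlation_integral(points, r):
    """Fraction of distinct point pairs closer than r (O(n^2) brute force)."""
    n = len(points)
    close = sum(1 for i in range(n) for j in range(i + 1, n)
                if math.dist(points[i], points[j]) < r)
    return 2.0 * close / (n * (n - 1))

def correlation_dimension(points, r1, r2):
    """Slope of ln C(r) against ln r between two radii (both C values must be > 0)."""
    c1 = correlation_integral(points, r1)
    c2 = correlation_integral(points, r2)
    return (math.log(c2) - math.log(c1)) / (math.log(r2) - math.log(r1))

embedded = delay_embed([0, 1, 2, 3, 4], m=2, tau=2)
line = [[float(i)] for i in range(100)]          # points on a 1-D line
dim_line = correlation_dimension(line, 5.5, 10.5)
```

For points spread along a line the estimate comes out close to 1, as expected for a one-dimensional set; real vibration data would need many more points and a careful choice of the scaling region.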

[0150] (3) Establish a deep feature fusion evaluation model: the preliminary estimate $\hat{y}$ output by the regression branch, the multi-scale spectral kurtosis feature vector K, and the nonlinear dynamic parameters $D_{2}$ and $\lambda_{L}$ are input to a lightweight feature fusion network:

[0151] $h_{f} = \mathrm{GeLU}\left(W_{f}\,[\hat{y};\, K;\, D_{2};\, \lambda_{L}] + b_{f}\right)$,

[0152] $S = \sigma\left(w_{o}^{\top} h_{f} + b_{o}\right)$,

[0153] where $W_{f}$ and $b_{f}$ are the parameters of the fusion layer; GeLU is the Gaussian error linear unit activation function; and $\sigma$ is the Sigmoid function, which compresses the output to the [0,1] interval. $[\cdot\,;\cdot]$ is the concatenation operation, which joins all of the above features into one longer, comprehensive feature vector. $h_{f}$ is the high-level feature after fusion. $w_{o}$ and $b_{o}$ are the weights and bias of the output layer and are trainable parameters. $S$ is the final comprehensive fault severity score, a scalar between 0 (normal) and 1 (severe fault) that quantifies the severity of the fault. The formulas fuse the initial prediction of the deep learning model with physical features based on signal processing and dynamics to obtain a more accurate and reliable comprehensive fault severity score.
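
The fusion step is a single hidden layer with a GeLU activation followed by a Sigmoid output. The sketch below writes it out in plain Python using the exact GeLU (via `math.erf`); the function name `fuse_severity`, the hidden width of 2, and every numeric value are illustrative assumptions, not trained parameters.

```python
import math

def gelu(x):
    """Exact Gaussian error linear unit."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fuse_severity(y_hat, K, d2, lyap, W_f, b_f, w_o, b_o):
    """Fuse the regression estimate with physical features into a score in (0, 1)."""
    x = [y_hat, *K, d2, lyap]                       # concatenated input vector
    h = [gelu(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W_f, b_f)]               # fused hidden features
    return sigmoid(sum(w * hi for w, hi in zip(w_o, h)) + b_o)

# Input of length 5: one estimate, two kurtosis values, D2, Lyapunov exponent
W_f = [[0.5, 0.1, 0.1, 0.2, 0.3], [-0.2, 0.3, 0.0, 0.1, 0.4]]
b_f = [0.0, 0.1]
w_o, b_o = [0.8, 0.5], -0.5
S = fuse_severity(0.6, [3.2, 4.1], 1.8, 0.05, W_f, b_f, w_o, b_o)
```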

[0154] (4) Fault severity classification based on adaptive thresholds: the threshold boundaries are dynamically adjusted according to the historical operating data of the equipment to establish a fault severity classification function:

[0155] $G(S) = k$, if $\theta_{k-1} \leq S < \theta_{k}$,

[0156] where the threshold parameters $\theta_{k}$ are adaptively determined from historical data using kernel density estimation:

[0157] $\theta_{k} = \hat{F}_{S}^{-1}(q_{k})$,

[0158] where $\hat{F}_{S}^{-1}$ is the inverse of the cumulative distribution function of historical fault severity evaluation values; $q_{k}$ are the preset quantile values; and $\hat{F}_{S}$ is the cumulative distribution function of the fault severity evaluation value S in the historical data. $\theta_{k}$ is an adaptive threshold calculated from the device's historical operating data and used to separate the fault levels, so that the grading criteria match the specific condition of the device. S is the comprehensive fault severity score of the current sample, and $G(S)$ is the final output, the severity level of the fault, with k indexing the levels in order of increasing severity.
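
Threshold selection from historical scores can be sketched as follows. Note the substitution: the text derives the cumulative distribution from kernel density estimation, whereas this sketch uses the empirical quantile with linear interpolation as a simpler stand-in. The quantile values (0.5, 0.8, 0.95), the four level names, and the synthetic history are hypothetical.

```python
from bisect import bisect_right

def empirical_quantile(history, q):
    """Inverse empirical CDF with linear interpolation (stand-in for a KDE-based CDF)."""
    s = sorted(history)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

def severity_level(score, thresholds, labels):
    """Map a score to a level; a score equal to a threshold falls in the higher level."""
    return labels[bisect_right(thresholds, score)]

history = [i / 100 for i in range(101)]             # synthetic historical scores
thresholds = [empirical_quantile(history, q) for q in (0.5, 0.8, 0.95)]
labels = ["normal", "minor", "moderate", "severe"]  # hypothetical level names
level = severity_level(0.6, thresholds, labels)
```

Because the thresholds are quantiles of each machine's own history, the same score can map to different levels on machines with different operating baselines.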

[0159] This method fuses three sources of information: the deep learning prediction provides an initial end-to-end estimate, the signal processing features capture the impact characteristics of faults, and the dynamic system features reveal the intrinsic state changes of the system. By fusing these three types of information and employing adaptive thresholds based on historical data, the method improves the accuracy and reliability of fault severity assessment, making the result not only a mathematical prediction but also a diagnostic indicator with clear physical meaning and practical value. The fault level classification thresholds are determined adaptively from the equipment's historical performance, giving a health status assessment tailored to the individual machine.

[0160] S5. Generate a structured diagnostic report that includes both the fault type and the fault severity assessment value. When the fault severity assessment value exceeds an adaptive threshold set based on historical data, a multi-level early warning mechanism (such as general, advanced, and emergency early warning) is triggered dynamically, achieving closed-loop management from diagnosis to decision-making.

[0161] Specifically, in this embodiment, the diagnostic report generation and early warning mechanism in step S5 is implemented in the following manner:

[0162] (1) Multimodal diagnostic report generation: based on the fault type classification probability distribution P and the comprehensive fault severity assessment value S, a structured multimodal diagnostic report is generated, including the following. Fault type confidence analysis: output the fault class with the highest probability, $c^{*} = \arg\max_{i} p_{i}$, and its confidence level $p_{c^{*}}$.

[0163] Quantitative description of fault severity: based on the value of S, output the severity level of the fault.

[0164] Fault evolution trend analysis: the trend indicator $\rho$ is calculated from the S values of the most recent K time windows:

[0165] $\rho = \dfrac{1}{K-1} \sum_{i=2}^{K} \mathrm{sgn}\left(S_{i} - S_{i-1}\right)$,

[0166] where $\mathrm{sgn}$ is the sign function, which outputs +1 for positive inputs, -1 for negative inputs, and 0 for zero inputs. K is the time window size, the number of recent consecutive time periods (e.g., the last 10 data collection cycles) used to calculate the trend. $S_{i}$ is the fault severity score at the i-th time point, and $\rho$ is the fault evolution trend indicator.
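
A sign-based trend indicator over the most recent severity scores can be sketched as below. Averaging the signs of consecutive differences, which bounds the result to [-1, 1], is an assumption about the exact form of the formula; the function names are chosen for the example.

```python
def sign(x):
    """Return +1, -1, or 0 according to the sign of x."""
    return (x > 0) - (x < 0)

def trend_indicator(scores):
    """Mean sign of consecutive differences over the most recent K scores."""
    if len(scores) < 2:
        return 0.0
    steps = [sign(b - a) for a, b in zip(scores, scores[1:])]
    return sum(steps) / len(steps)

rising = trend_indicator([0.10, 0.20, 0.30])
mixed = trend_indicator([0.10, 0.20, 0.20, 0.10])
```

Using only the signs makes the indicator insensitive to the size of individual jumps, so a single noisy spike cannot dominate the trend.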

[0167] (2) Adaptive early warning triggering mechanism: construct an adaptive early warning threshold based on the historical operating status, $\theta_{w} = \mu_{0} + \kappa\,\sigma_{0}$, where $\mu_{0}$ and $\sigma_{0}$ are the mean and standard deviation of the fault severity assessment values in the historical normal state, $\kappa$ is an adjustable sensitivity coefficient, and $\theta_{w}$ is the adaptive warning threshold. The threshold is not a fixed value but is calculated from the device's historical health data, making it personalized and adaptable. Setting the warning threshold from the device's normal operating baseline avoids the false alarms or missed alarms that fixed thresholds cause when devices differ individually or operating conditions change.
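
The baseline-plus-margin threshold can be computed directly from severity scores recorded while the machine was known to be healthy. In the sketch below the function name `warning_threshold`, the default sensitivity `k=3.0`, and the use of the sample standard deviation are assumptions.

```python
from statistics import mean, stdev

def warning_threshold(normal_scores, k=3.0):
    """Mean plus k sample standard deviations of healthy-state scores."""
    return mean(normal_scores) + k * stdev(normal_scores)

baseline = [0.08, 0.10, 0.12, 0.09, 0.11]   # illustrative healthy-state scores
threshold = warning_threshold(baseline)
```

A larger `k` trades fewer false alarms for later detection, which is why the text treats it as an adjustable sensitivity coefficient.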

[0168] A multi-level alert is triggered when any of the following conditions is met:

[0169] Emergency warning (a severity condition and a trend condition are both met): the fault is very serious and is worsening, requiring immediate attention.

[0170] Advanced warning (a severity condition and a trend condition are both met): the fault is serious and worsening, requiring close monitoring and planned repairs.

[0171] General warning (a severity condition and a trend condition are both met): the fault has just deviated from the normal range and shows a tendency to worsen, indicating an initial warning.

[0172] A positive trend indicator means that the overall fault is worsening, a negative trend indicator means that it is easing, and a trend indicator close to zero means that the fault level is basically stable. The mechanism therefore considers not only the current severity of the fault but also its dynamic development trend, providing a basis for predictive maintenance decisions.

[0173] (3) Maintenance decision support: based on the fault type and severity, specific maintenance recommendations are generated from a pre-set maintenance knowledge base as $A = f_{rule}(c^{*}, S, \rho)$, where $f_{rule}$ is a rule-based decision function that outputs specific maintenance instructions, including continued monitoring, planned maintenance, and emergency shutdown; it is a predefined logical mapping rule rather than a set of trainable parameters. $c^{*}$ is the most likely fault type, S is the current fault severity score, $\rho$ is the fault evolution trend indicator, and A is the output maintenance instruction. Based on the input fault information, specific suggestions are matched from the predefined maintenance knowledge base, for example: (inner ring crack, 0.15, 0.1) -> continue monitoring and pay attention to lubrication; (ball spalling, 0.65, 0.8) -> plan a shutdown for maintenance next week; (cage breakage, 0.92, 0.9) -> emergency shutdown and immediate bearing replacement. The model's predictions are thus automatically converted into executable maintenance operations, closing the loop of predictive maintenance.

[0174] By combining the predictions of the artificial intelligence model with operations and maintenance expertise, the scheme achieves status awareness, intelligent early warning, and decision support. System integration is achieved through a standardized data format (JSON) and a real-time communication protocol (MQTT), which integrate the diagnostic results into the existing equipment management system. These formulas and logic together constitute a complete, automated, and intelligent fault prediction and health management (PHM) process.

[0175] (4) Report visualization and push:

[0176] The diagnostic results are stored in a structured JSON format and displayed in real time via a web interface. At the same time, the warning information is pushed to the mobile terminals of equipment managers via the MQTT protocol.
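
A structured JSON report of the kind described can be assembled with the standard library as shown below. The field names, the rounding to four decimals, and the level label are hypothetical choices for this sketch; the source does not define a schema. Publishing the payload over MQTT would require a client library and broker details that are not given, so that step is left out.

```python
import json

def build_report(fault_type, confidence, severity, level, trend):
    """Serialize one diagnostic result as a JSON string (hypothetical schema)."""
    report = {
        "fault_type": fault_type,
        "confidence": round(confidence, 4),
        "severity_score": round(severity, 4),
        "severity_level": level,
        "trend_indicator": round(trend, 4),
    }
    return json.dumps(report, ensure_ascii=False, sort_keys=True)

payload = build_report("inner ring crack", 0.85, 0.72, "moderate", 0.8)
```

The resulting string can be stored as is, rendered by a web front end, or used as the message body for a push notification.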

Embodiment 2

[0178] As shown in Figure 2, this application provides an industrial bearing vibration time-series signal fault prediction system that integrates attention mechanism and LSTM. The system applies the industrial bearing vibration time-series signal fault prediction method that integrates attention mechanism and LSTM as described in Embodiment 1. It includes a signal acquisition and preprocessing module 11, a model building module 12, a model training module 13, a real-time fault diagnosis module 14, and a diagnosis report and early warning module 15.

[0179] The signal acquisition and preprocessing module 11 is used to acquire the vibration time-domain signal of the industrial bearing during operation using a vibration sensor, and to preprocess the acquired signal, including filtering, noise reduction and normalization, to obtain standardized vibration time series data.

[0180] The model building module 12 is used to build the deep learning model that fuses the attention mechanism and LSTM, the model comprising a bidirectional LSTM layer, a coordinate attention mechanism layer, a feature fusion and compression layer, and a classification and regression output layer.

[0181] The model training module 13 is used to train the deep learning model using a labeled bearing vibration dataset. The Adam optimizer is used for parameter optimization, the cross-entropy error function is used for the loss function, and an early stopping mechanism is introduced to prevent overfitting.

[0182] The real-time fault diagnosis module 14 is used to input the real-time collected vibration data into the trained deep learning model and simultaneously output the fault type identification result and the fault severity assessment value. The fault severity assessment is based on the fault impact characteristics learned by the model for quantitative calculation.

[0183] The diagnostic report and early warning module 15 is used to generate a diagnostic report that includes both the fault type and the fault severity. When the fault severity assessment value exceeds the adaptive threshold set based on historical data, a multi-level early warning mechanism is triggered.

[0184] Figure 3 This is an electronic device provided in one embodiment of this application. For example... Figure 3 As shown, the electronic device includes at least the following components: processor 101 and memory 100, communication interface 103, and bus 102.

[0185] In this embodiment of the application, memory 100 is used to store executable instructions of processor 101, which, when configured to execute instructions, implements the method as described in the first aspect.

[0186] In embodiments of this application, a computer-readable storage medium includes instructions that instruct a device to perform the method as described in the first aspect. For example, the instructions instruct the device to perform the process steps of the method shown in Figure 1.

[0187] In one embodiment of this application, the program running on the electronic device may be a program that controls a central processing unit (CPU) or similar device so as to achieve the functions of the above-described embodiments (that is, a program that causes a computer to function). Information handled by these systems is temporarily stored in random access memory (RAM) during processing, then stored in various ROMs such as flash ROM or in hard disk drives (HDDs), and is read, modified, and written by the CPU as needed.

[0188] It should be noted that a portion of the electronic device described above can also be implemented using a computer. In this case, the program for implementing the control function can be recorded on a computer-readable recording medium, and the program recorded on the recording medium can be read into the computer and executed.

[0189] It should be noted that the computer mentioned here refers to a computer system built into the electronic device, including an operating system and hardware such as peripheral devices. Furthermore, computer-readable recording media refers to removable media such as floppy disks, magneto-optical disks, ROMs, and CD-ROMs, as well as storage systems such as hard drives built into the computer.

[0190] Furthermore, computer-readable recording media can include: media that dynamically stores programs for short periods of time, such as communication lines used when transmitting programs via networks like the Internet or communication lines like telephone lines; and media that store programs for fixed periods of time, such as volatile memory inside a computer that serves as a server or client in this case. In addition, the aforementioned program can be a program used to implement the above-mentioned functions, or it can be a program that can implement the above-mentioned functions by combining them with programs already recorded in the computer.

[0191] Furthermore, the electronic device in the above embodiments can also be implemented as an assembly (system group) composed of multiple systems. Each system constituting the system group can possess some or all of the functions or functional blocks of the electronic device in the above embodiments. As a system group, it is sufficient to have all the functions or functional blocks of the electronic device.

[0192] Those skilled in the art should recognize that the above embodiments are only used to illustrate this application and are not intended to limit this application. Any appropriate changes and variations made to the above embodiments within the essential spirit and scope of this application fall within the scope of protection claimed in this application.

Claims

1. A method for predicting faults in industrial bearing vibration time-series signals by fusing an attention mechanism and LSTM, characterized in that the method includes the following steps:

S1. Use vibration sensors to collect vibration time-domain signals during the operation of industrial bearings, then filter, denoise, and normalize the collected signals in sequence to obtain standardized vibration time-series data.

S2. Construct a deep learning model that includes a bidirectional LSTM layer and a coordinate attention mechanism layer. The bidirectional LSTM layer is used to extract bidirectional temporal features from the vibration time-series data, and the coordinate attention mechanism layer is used to weight the hidden states output by the bidirectional LSTM layer along the time and feature dimensions to strengthen attention to key fault features. The deep learning model constructed in step S2 is a hierarchical feature extraction and fusion network, specifically including:

S2.1, Bidirectional temporal feature extraction layer: a multi-layer bidirectional long short-term memory network (BiLSTM) is used as the core temporal feature extractor to capture the long-range bidirectional dependencies related to faults in the vibration signal.

S2.2, Coordinate attention mechanism layer: receives the hidden state sequence from all time steps of the BiLSTM layer, calculates attention weights along both the time and feature dimensions using the coordinate attention mechanism, generates a two-dimensional attention weight matrix, and recalibrates and weights the BiLSTM output features, including:

For the complete hidden state sequence $H \in \mathbb{R}^{T \times D}$ output by the BiLSTM layer, where $T$ is the total number of time steps and $D$ is the feature dimension, one-dimensional global average pooling is performed to generate a time-dimension global context descriptor $z^{t}$ and a feature-dimension global context descriptor $z^{c}$:

$$z^{t}_{i} = \frac{1}{D}\sum_{j=1}^{D} H_{i,j}, \qquad z^{c}_{j} = \frac{1}{T}\sum_{i=1}^{T} H_{i,j},$$

where $H_{i,:}$ is the i-th row of the feature matrix $H$, representing the values of all $D$ feature channels at the i-th time step, a vector of length $D$; $\sum_{j=1}^{D} H_{i,j}$ is the sum over all $D$ elements of the i-th row, summed along the feature dimension; $z^{t}$ is the time-dimension global context descriptor, a vector of length $T$ in which each element $z^{t}_{i}$ represents the average response strength of the i-th time step over all feature channels, encoding the global importance of each time step; $H_{:,j}$ is the j-th column of the feature matrix $H$, representing the values of the j-th feature channel over all time steps, a vector of length $T$; $\sum_{i=1}^{T} H_{i,j}$ is the sum over all $T$ elements of the j-th column, summed along the time dimension; $z^{c}$ is the feature-dimension global context descriptor, a vector of length $D$ in which each element $z^{c}_{j}$ represents the average response strength of the j-th feature channel over all time steps, encoding the global importance of each feature channel. The two-dimensional time-feature representation $H$ is thus compressed into two one-dimensional global description vectors, one capturing global information of the time dimension ($z^{t}$) and the other capturing global information of the feature channel dimension ($z^{c}$), in preparation for the subsequent calculation of attention weights.

The two descriptors are concatenated and passed through a 1x1 convolution with shared weights, batch normalization, and a nonlinear activation function to generate the intermediate feature map $F$:

$$F = \sigma\big(\mathrm{BN}\big(\mathrm{Conv}_{1\times 1}\big([z^{t}; z^{c}]\big)\big)\big),$$

where $\sigma$ is the Sigmoid function, a nonlinear activation function that compresses the output value to between 0 and 1 and introduces nonlinearity; $\mathrm{BN}$ is the batch normalization operation, which standardizes the output of the convolution to accelerate training and improve stability; $\mathrm{Conv}_{1\times 1}$ is a one-dimensional convolution with a kernel size of 1, which processes the concatenated vector, performs cross-feature interaction and fusion, and can reduce or increase dimensionality; $[\cdot\,;\cdot]$ is the concatenation operation, in which the time descriptor $z^{t}$ and the feature channel descriptor $z^{c}$ are concatenated along the same dimension to form a longer vector of length $T+D$; $F$ is the intermediate feature map, a transformed and activated feature representation whose dimension is determined by the number of output channels $C_{\mathrm{out}}$ of the 1x1 convolution, with shape $[C_{\mathrm{out}}, T+D]$, integrating the global information of time and channels.

$F$ is decomposed along the original dimensions into a time part $F^{t}$ and a channel part $F^{c}$, each of which is normalized using the Sigmoid function to give the time attention weights $a^{t}$ and the channel attention weights $a^{c}$:

$$a^{t} = \sigma(F^{t}), \qquad a^{c} = \sigma(F^{c}),$$

where $F^{t}$ is the time-dimension part of the intermediate feature map $F$, obtained by splitting out the first $T$ elements of $F$; $F^{c}$ is the feature-dimension part of $F$, obtained by extracting the last $D$ elements of $F$; $a^{t}$ is the temporal attention weight vector, a vector of length $T$ in which each element $a^{t}_{i}$ is a weight between 0 and 1 representing the importance of the i-th time step for fault diagnosis, with values closer to 1 indicating greater importance; $a^{c}$ is the channel attention weight vector, a vector of length $D$ in which each element $a^{c}_{j}$ is a weight between 0 and 1 representing the importance of the j-th feature channel for fault diagnosis, with values closer to 1 indicating greater importance. Temporal and channel information are thereby decoupled from the fused feature $F$, and attention weights acting on each of the two dimensions are generated.

A two-dimensional attention weight matrix $M$ is generated through an outer product operation:

$$M = a^{t} \otimes a^{c}, \qquad M_{i,j} = a^{t}_{i}\, a^{c}_{j},$$

where $\otimes$ is the outer product operation, which multiplies the vector $a^{t}$ and the vector $a^{c}$ to generate the matrix $M$; $M \in \mathbb{R}^{T \times D}$ is the two-dimensional attention weight matrix, in which each element $M_{i,j}$ represents the combined importance weight for the i-th time step and the j-th feature channel.

The weight matrix $M$ is multiplied element-wise with the original hidden state $H$ to obtain the calibrated weighted feature representation $\tilde{H}$:

$$\tilde{H} = M \odot H,$$

where $\odot$ denotes element-wise multiplication, which multiplies the original feature matrix $H$ and the attention weight matrix $M$ at corresponding positions; $\tilde{H}$ is the weighted and calibrated feature sequence, with the same shape $[T, D]$ as $H$. After attention weighting, the values of important time steps and important feature channels are enhanced, while the values of unimportant time steps and feature channels are suppressed. The weighted features $\tilde{H}$ are used for subsequent feature fusion and fault diagnosis output; the result is a feature representation focused on key fault information, which is input into subsequent network layers for fault classification and severity regression, thereby improving the model's diagnostic performance and interpretability.

S2.3, Feature fusion and compression layer: the feature sequence weighted by coordinate attention is weighted and summed along the time dimension to compress it into a feature vector of fixed dimension. The feature fusion and compression layer in step S2.3 is implemented as follows:

Receive the coordinate-attention-weighted feature sequence $\tilde{H} \in \mathbb{R}^{T \times D}$, where $T$ is the total number of time steps and $D$ is the feature dimension. The input to this step is the feature sequence weighted by the coordinate attention mechanism, in which unimportant time and feature channel information has been suppressed and important information has been enhanced.

A temporal fusion method based on attention weights is used to compress the weighted feature sequence $\tilde{H}$ along the time dimension, generating a global feature vector $v$ of fixed dimension:

$$v = \sum_{t=1}^{T} \alpha_{t}\, \tilde{h}_{t},$$

where $\tilde{h}_{t}$ denotes the weighted feature vector at the t-th time step and $\alpha_{t}$ is the fusion weight coefficient of the t-th time step. This adaptively compresses the information of the entire time series into a single vector while giving each time step a different level of attention based on its importance to the current fault diagnosis task. The coefficient $\alpha_{t}$ is not preset; it is calculated through a learnable scoring function so that the model focuses only on those time steps that contain key fault information.

The fusion weight coefficient $\alpha_{t}$ is calculated from the importance score $e_{t}$ of time step $t$ and normalized using the Softmax function:

$$e_{t} = u^{\top} \tanh\big(W \tilde{h}_{t} + b\big), \qquad \alpha_{t} = \frac{\exp(e_{t})}{\sum_{k=1}^{T} \exp(e_{k})},$$

where $W$ is the weight matrix, $u$ and $b$ are learnable parameters, and $\tanh$ is the hyperbolic tangent activation function. The importance score at each time step is calculated and converted into a weight; this is a learnable attention process that enables the model to learn which parts of the sequence are most critical to the final decision.

The final global feature vector $v$ integrates the key information most relevant to the fault from the entire time series and serves as input to the subsequent classification and regression output layers. The output is a fixed-dimensional global feature vector $v$; it is no longer time-series data, but a feature sample that condenses the entire sequence and can be directly input into a fully connected layer for classification and regression.

S2.4, Classification and regression output layer: receives the fixed-dimensional feature vector, outputs the classification probability of the fault type through a fully connected layer and a Softmax function, and simultaneously outputs a regression estimate of the fault severity through another fully connected layer and a linear activation function. The classification and regression output layer in step S2.4 is implemented as follows:

Receive the fixed-dimensional global feature vector $v$ and input it simultaneously into two independent fully connected network branches.

Fault type classification branch: the feature vector $v$ is mapped through a fully connected layer to a space whose dimension equals the number of fault categories $C$, and after processing with the Softmax function, the predicted probability distribution $p$ over the fault categories is output:

$$o = W_{c}\, v + b_{c}, \qquad p_{i} = \frac{\exp(o_{i})}{\sum_{k=1}^{C} \exp(o_{k})}, \qquad p = [p_{1}, p_{2}, \ldots, p_{C}],$$

where $W_{c}$ and $b_{c}$ are the trainable parameters of the classification branch; $o$ is the raw output of the classification layer, in which each element $o_{i}$ represents the raw score by which the model believes the input sample belongs to the i-th category; $p_{i}$ represents the predicted probability of the i-th fault category; and $p$ is the classification probability distribution of the fault type.

Fault severity regression branch: the feature vector $v$ is mapped through another fully connected layer to a one-dimensional space, and using a linear activation function, a continuous regression estimate $\hat{s}$ of the fault severity is output:

$$\hat{s} = W_{r}\, v + b_{r},$$

where $W_{r}$ and $b_{r}$ are the trainable parameters of the regression branch.

Finally, the model synchronously outputs the classification probability distribution $p$ of the fault type and the regression estimate $\hat{s}$ of the fault severity;

S3.
The deep learning model is trained using the labeled bearing vibration dataset. The parameters are optimized using the Adam optimizer, the loss function is the cross-entropy error function, and an early stopping mechanism is introduced to prevent overfitting. The model training and optimization process specifically includes:

(1) Multi-objective composite loss function design: model training adopts a classification-regression joint loss function composed of the cross-entropy error (CEE) loss $L_{\mathrm{CEE}}$ and a regression loss $L_{\mathrm{reg}}$ based on fault impact characteristics:

$$L_{\mathrm{total}} = \lambda_{t}\, L_{\mathrm{CEE}} + (1 - \lambda_{t})\, L_{\mathrm{reg}},$$

where the weighting coefficient $\lambda_{t}$ is dynamically adjusted according to the training cycle, with the initial stage focusing on the regression task and the later stage focusing on the classification task:

$$\lambda_{t} = \lambda_{\max} - (\lambda_{\max} - \lambda_{\min})\, e^{-\gamma t},$$

where $\lambda_{t}$ is the weight value in the t-th training cycle, $\lambda_{\max}$ and $\lambda_{\min}$ are the maximum and minimum values of the weight respectively, $\gamma$ is the decay rate, $t$ is the index of the training cycle, and $e^{-\gamma t}$ is the exponential decay term;

$$L_{\mathrm{CEE}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log p_{i,c},$$

where $N$ is the number of samples, $C$ is the number of fault categories, $y_{i,c}$ is the true label of sample $i$ for category $c$, and $p_{i,c}$ is the probability predicted by the model;

$$L_{\mathrm{reg}} = \frac{1}{N} \sum_{i=1}^{N} \big(s_{i} - \hat{s}_{i}\big)^{2} + \beta\, \mathrm{KL}\big(P \,\|\, \hat{P}\big),$$

where $s_{i}$ is the true fault severity, $\hat{s}_{i}$ is the predicted fault severity, $\mathrm{KL}$ denotes the KL divergence, $\beta$ is the hyperparameter for regularization strength, $\hat{P}$ is the predicted distribution of fault impact characteristics, and $P$ is the true distribution of fault impact characteristics.

(2) Adaptive optimization strategy: the Adam optimizer is used for parameter optimization, and the learning rate follows an exponential decay strategy.

(3) Regularization and early stopping mechanism: a regularization strategy combining Dropout and weight decay is adopted, and the Dropout rate is dynamically adjusted with training progress. The early stopping mechanism is based on the change in cross-entropy loss on the validation set: training is stopped when the validation set loss does not decrease for $E$ consecutive cycles.

(4) Gradient clipping and monitoring: global gradient clipping is used to prevent gradient explosion:

$$\hat{g}_{t} = g_{t} \cdot \min\!\left(1, \frac{\theta}{\lVert g_{t} \rVert_{2}}\right),$$

where $g_{t}$ is the original gradient vector calculated at step $t$, $\lVert g_{t} \rVert_{2}$ is the L2 norm of the gradient, and $\theta$ is the set threshold. A training monitoring system is used to track the trends of cross-entropy loss and accuracy in real time.

S4. Input the real-time collected vibration data into the trained deep learning model, and output the fault type identification result and the fault severity assessment value simultaneously. The fault severity assessment is quantitatively calculated based on the fault impact characteristics learned by the model.

S5. Generate a diagnostic report that includes both the fault type and the fault severity. When the fault severity assessment value exceeds the adaptive threshold set based on historical data, a multi-level early warning mechanism is triggered.
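The coordinate attention weighting of step S2.2 can be illustrated with a minimal sketch in plain Python. It is not the claimed implementation: the shared 1x1 convolution, batch normalization, and activation are replaced by a pluggable `transform` argument that defaults to the identity, which is an assumption made only to keep the example self-contained. The function and variable names are likewise hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def coordinate_attention(H, transform=lambda z: z):
    """Reweight a T x D hidden-state matrix H (a list of T rows of length D).

    `transform` stands in for the shared 1x1 convolution, batch normalization,
    and activation; the identity default is an assumption of this sketch.
    Returns (weighted features, attention matrix M), both of shape T x D.
    """
    T, D = len(H), len(H[0])
    # Time descriptor: average over feature channels for each time step.
    z_time = [sum(row) / D for row in H]
    # Channel descriptor: average over time steps for each feature channel.
    z_chan = [sum(H[i][j] for i in range(T)) / T for j in range(D)]
    # Concatenate to length T + D, transform, then split back.
    F = transform(z_time + z_chan)
    a_time = [sigmoid(f) for f in F[:T]]
    a_chan = [sigmoid(f) for f in F[T:]]
    # Outer product gives one weight per (time step, channel) pair.
    M = [[a_time[i] * a_chan[j] for j in range(D)] for i in range(T)]
    # Element-wise recalibration of the hidden states.
    H_weighted = [[M[i][j] * H[i][j] for j in range(D)] for i in range(T)]
    return H_weighted, M
```

With the identity transform, a time step or channel with a larger average response receives a larger weight; in a trained model the learned transform determines which time steps and channels are emphasized.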

2. The method for predicting faults in industrial bearing vibration time-series signals by fusing an attention mechanism and LSTM as described in claim 1, characterized in that the preprocessing in step S1 includes the following sub-steps: S1.1 Signal filtering: an adaptive bandpass filter is used, with a passband frequency range of 0.5 to 3 times the bearing characteristic frequency, in order to retain the fault characteristic frequency components and suppress interference from irrelevant frequency bands. S1.2 Noise suppression: a wavelet-packet-based threshold denoising algorithm is adopted, which uses the db4 wavelet basis for multi-level wavelet packet decomposition and an adaptive threshold function to perform soft thresholding on the detail coefficients. S1.3 Signal normalization: the min-max normalization method is used to normalize the signal amplitude to a preset numerical range. S1.4 Data augmentation: the training dataset is expanded using time-series data augmentation techniques, including one or more of time warping, amplitude scaling, and adding Gaussian noise. S1.5 Feature extraction preprocessing: the time-domain feature parameters of the vibration signal are calculated, including one or more of the root mean square value, kurtosis, crest factor, and impulse factor, and these features are used together with the preprocessed time-series data as input to the fault feature extraction model.
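Sub-steps S1.3 and S1.5 can be illustrated with a short sketch that applies min-max normalization and computes four time-domain indicators. The definitions below are the commonly used ones and are assumptions of this sketch, since the claim does not spell them out: non-excess kurtosis (fourth central moment divided by squared variance), crest factor as peak divided by RMS, and impulse factor as peak divided by mean absolute value. The function names and the default range [0, 1] are hypothetical.

```python
import math


def min_max_normalize(signal, low=0.0, high=1.0):
    """Scale samples linearly so the minimum maps to `low` and the maximum to `high`."""
    s_min, s_max = min(signal), max(signal)
    if s_max == s_min:
        # A constant signal carries no amplitude information.
        return [low for _ in signal]
    scale = (high - low) / (s_max - s_min)
    return [low + (x - s_min) * scale for x in signal]


def time_domain_features(signal):
    """Compute RMS, kurtosis, crest factor, and impulse factor.

    Assumes a non-constant signal, so that variance and RMS are non-zero.
    """
    n = len(signal)
    mean = sum(signal) / n
    rms = math.sqrt(sum(x * x for x in signal) / n)
    var = sum((x - mean) ** 2 for x in signal) / n
    kurtosis = (sum((x - mean) ** 4 for x in signal) / n) / (var ** 2)
    peak = max(abs(x) for x in signal)
    mean_abs = sum(abs(x) for x in signal) / n
    return {
        "rms": rms,
        "kurtosis": kurtosis,
        "crest_factor": peak / rms,
        "impulse_factor": peak / mean_abs,
    }
```

Impulsive bearing faults tend to raise kurtosis, crest factor, and impulse factor relative to a healthy signal, which is why such indicators are often supplied alongside the raw sequence.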

3. The method for predicting faults in industrial bearing vibration time-series signals by fusing an attention mechanism and LSTM as described in claim 1, characterized in that the bidirectional temporal feature extraction layer in step S2.1 satisfies the following conditions: the multi-layer bidirectional long short-term memory network (BiLSTM) has 2 to 4 layers, and the number of hidden units in each layer is configurable between 64 and 256. The BiLSTM layer receives the standardized vibration time-series data preprocessed in step S1. Its forward and backward LSTM units process the input sequence in forward and reverse time order, respectively, and the hidden states of the two directions at each time step are concatenated to form a complete hidden state output that integrates bidirectional context information. During training, the BiLSTM network employs weight normalization combined with activation scaling: the convolutional weight matrix is reparameterized so that it satisfies a zero-mean, unit-variance distribution, and a learnable scaling parameter is introduced before the activation function to stabilize the activation distribution during network training, accelerate convergence, and alleviate internal covariate shift. The complete hidden state output is not only passed to the subsequent coordinate attention mechanism layer; its output at the last time step is also used as a global temporal context feature, which is concatenated with the attention-weighted and summed feature vector and then input into the classification and regression output layer to enhance the model's ability to perceive the overall state of the sequence.

4. The method for predicting faults in industrial bearing vibration time-series signals by fusing an attention mechanism and LSTM according to claim 1, characterized in that the fault severity assessment in step S4 is based on a quantitative calculation method that fuses multi-scale spectral kurtosis features with nonlinear dynamic parameters. The specific implementation process is as follows: (1) Extract multi-scale spectral kurtosis features from the preprocessed real-time vibration signal: decompose the signal into J scales using the maximum overlap discrete wavelet packet transform, calculate the spectral kurtosis value of each sub-band signal, and construct a multi-scale spectral kurtosis feature vector K. (2) Calculate nonlinear dynamic parameters: using phase space reconstruction, the optimal embedding dimension m and time delay τ are determined through the C-C method, and the correlation dimension and the maximum Lyapunov exponent are calculated. (3) Establish a deep feature fusion evaluation model: the preliminary estimate output by the regression branch, the multi-scale spectral kurtosis feature vector K, and the nonlinear dynamic parameters are input into a lightweight feature fusion network to obtain the final comprehensive fault severity score. (4) Fault severity classification based on adaptive thresholds: the threshold boundaries are dynamically adjusted according to the historical operating data of the equipment, a fault severity classification function is established, and the classification threshold parameters are adaptively determined from the historical data through kernel density estimation.

5. A fault prediction system for industrial bearing vibration time-series signals integrating an attention mechanism and LSTM, applied to the fault prediction method for industrial bearing vibration time-series signals integrating an attention mechanism and LSTM as described in any one of claims 1 to 4, characterized in that the system includes: a signal acquisition and preprocessing module, used to acquire the vibration time-domain signal of the industrial bearing during operation using vibration sensors, and to preprocess the acquired signal, including filtering, noise reduction, and normalization, to obtain standardized vibration time-series data; a model building module, used to construct a deep learning model that includes a bidirectional LSTM layer and a coordinate attention mechanism layer, where the bidirectional LSTM layer extracts bidirectional temporal features from the vibration time-series data and the coordinate attention mechanism layer weights the hidden states output by the bidirectional LSTM layer along the time and feature dimensions; a model training module, used to train the deep learning model using a labeled bearing vibration dataset, where the Adam optimizer is used for parameter optimization, the cross-entropy error function is used as the loss function, and an early stopping mechanism is introduced to prevent overfitting; a real-time fault diagnosis module, used to input the real-time collected vibration data into the trained deep learning model and simultaneously output the fault type identification result and the fault severity assessment value, where the fault severity assessment is quantitatively calculated based on the fault impact characteristics learned by the model; and a diagnostic report and early warning module, used to generate a diagnostic report that includes both the fault type and the fault severity, where a multi-level early warning mechanism is triggered when the fault severity assessment value exceeds the adaptive threshold set based on historical data.