A deep learning-based multi-channel time series data fault diagnosis method

CN122196416A · Pending · Publication Date: 2026-06-12 · ZHEJIANG XINYUE DIGITAL TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
ZHEJIANG XINYUE DIGITAL TECH CO LTD
Filing Date
2026-03-11
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing technologies suffer from insufficient multi-channel data processing capabilities, low efficiency in processing long-sequence data, difficulty in handling non-exclusive multiple fault types, and a lack of targeted data augmentation techniques when dealing with multi-channel, long-sequence, and complex fault types in subway train transmission systems, resulting in low diagnostic efficiency and accuracy.

Method used

A deep learning-based multi-channel time-series data fault diagnosis method is adopted. Through multi-channel data merging, data augmentation and slicing, combined with the input layer, encoding layer and output layer design of the Transformer model, the self-attention mechanism is used to capture the relationship between channels, the multi-label binary cross-entropy loss function and AdamW optimizer are used for model training, and the voting mechanism is used for fault identification.

Benefits of technology

It achieves efficient integration of multi-channel time-series data, accurately identifies complex fault types, improves diagnostic accuracy and response speed, and is applicable to fault diagnosis and early warning of subway train transmission systems and other complex multi-channel industrial systems.

✦ Generated by Eureka AI based on patent content.

Abstract

This application discloses a deep learning-based fault diagnosis method for multi-channel time-series data, comprising the following steps. Step 1: data acquisition and processing, including multi-channel data merging, data slicing, and data augmentation. Step 2: model building, which adopts a Transformer model with an input layer, an encoding layer, and an output layer. Step 3: model training, including loss function design, optimizer selection, learning rate scheduling, batch size setting, and multi-epoch training. Step 4: model inference, including data slicing, independent prediction, and a voting mechanism. Through multi-channel expansion and a full attention mechanism, the application can capture complex patterns in the data, improving the accuracy of fault diagnosis and of fault classification for key components of subway trains.

Description

Technical Field

[0001] This application relates to the field of multi-channel time-series data fault diagnosis technology, and more specifically, to a multi-channel time-series data fault diagnosis method based on deep learning.

Background Technology

[0002] With the rapid development of urban rail transit, subway trains, as an important component of urban public transportation, have received widespread attention for their operational safety and reliability. The subway train transmission system, as a core component of train operation, is crucial for ensuring normal train operation through fault diagnosis and maintenance. Traditional fault diagnosis methods mainly rely on human experience and simple signal processing techniques. These methods often suffer from low diagnostic efficiency, low accuracy, and limited ability to identify fault modes when faced with complex multi-channel time-series data.

[0003] In recent years, machine learning, and deep learning in particular, has developed rapidly and shown great potential in the field of fault diagnosis. Deep learning can improve the accuracy and robustness of fault diagnosis by automatically extracting data features and modeling complex nonlinear relationships. However, existing deep learning-based fault diagnosis methods mostly focus on the analysis of single-channel or few-channel data. Classic models such as Convolutional Neural Networks (CNN), Long Short-Term Memory Networks (LSTM), and Transformers do not by themselves provide an efficient and accurate way to diagnose the multi-channel, long-sequence time-series data with complex fault types found in subway train transmission systems. Therefore, developing a deep learning-based fault diagnosis method for multi-channel time-series data has significant theoretical and practical value for improving the fault diagnosis capabilities of subway train transmission systems.

[0004] The closest technical solution to this application is a multi-channel time-series data classification method based on iTransformer. This method uses iTransformer's self-attention mechanism to capture the relationships between channels and improves representation capability through a multi-head attention mechanism. Specifically, it maps multi-channel time-series data to multiple fixed-size temporal embeddings, then uses channel self-attention to extract features, and finally uses a classification layer for category mapping.

[0005] The main shortcomings of existing technologies include the following aspects:

[0006] 1. Insufficient multi-channel data processing capabilities: Existing methods typically involve simply overlaying or serializing multi-channel data, failing to fully explore the complex interrelationships and mutual influences between channels. This approach ignores the potential collaborative features within multi-channel data, resulting in the ineffective utilization of critical information and thus reducing the accuracy and reliability of fault diagnosis.

[0007] 2. Low efficiency in processing long data sequences: As the length of time-series data increases, the computational complexity of many existing methods grows rapidly (quadratically with sequence length in the case of standard self-attention). Particularly in subway drive systems, the signal acquisition frequency reaches 64 kHz, meaning there are 64,000 time steps per second. This not only significantly increases the consumption of computational resources but also prolongs the response time for fault diagnosis, limiting its effectiveness in real-time monitoring and rapid response scenarios.

[0008] 3. Difficulty in handling multiple non-mutually exclusive fault types: Most existing methods assume that different fault types are mutually exclusive, meaning that only a single fault will occur in each diagnosis. However, in actual operation, multiple faults often coexist or influence each other, which makes traditional methods perform poorly in handling compound faults and unable to accurately distinguish and identify multiple superimposed fault types.

[0009] 4. Lack of targeted data augmentation techniques: Existing methods typically employ general data augmentation techniques, such as noise addition and time window sliding, but these techniques fail to fully consider the specific needs of fault diagnosis tasks. For example, certain fault features may be blurred or distorted during the augmentation process, resulting in generated data that cannot accurately reflect fault characteristics, thereby affecting the model's training effect and diagnostic performance.

[0010] While existing technologies, such as iTransformer, attempt to process multi-channel data using self-attention mechanisms, they treat channels as independent sequences for feature extraction, failing to effectively capture cross-channel temporal dependencies and exhibiting low computational efficiency in long-sequence scenarios. Furthermore, the single-classification-layer design limits the ability to distinguish multi-label fault types, making these technologies unsuitable for meeting the complex-fault diagnostic and early-warning needs of subway transmission systems and other complex multi-channel industrial systems, such as power systems, manufacturing equipment, and transportation vehicles. Therefore, there is an urgent need to develop a fault diagnosis method that can efficiently integrate multi-channel long-sequence data, accurately identify non-mutually exclusive fault combinations, and possess task-adaptive data augmentation mechanisms to overcome the bottlenecks of existing technologies.

Summary of the Invention

[0011] The purpose of this application is to provide a multi-channel time-series data fault diagnosis method based on deep learning, which has the advantages of efficiently integrating multi-channel time-series data, accurately identifying complex fault types, and improving diagnostic accuracy and response speed.

[0012] This application provides a method for fault diagnosis of multi-channel time-series data based on deep learning, including the following steps:

[0013] Step 1: Data Acquisition and Processing

[0014] I. Multi-channel data merging: Collect signals from multiple measuring points of the monitoring equipment, align them according to the same time step, and merge the signals from multiple channels into the same file to form the original sample;

[0015] II. Data Augmentation: Randomly select some channels for linear combination to merge signals, increasing data diversity;

[0016] III. Data slicing: The data that has not undergone signal merging is combined with the augmented data to form augmented samples. The augmented samples are then cut into fixed-length segments to form signal slices.

[0017] Step 2: Model Building

[0018] It is constructed using a Transformer model with an input layer, an encoding layer, and an output layer.

[0019] I. At the input layer: The multiple input channel signals are flattened and expanded to obtain a time-series signal. The expanded time-series signal is then divided into blocks, which serve as the input units of the Transformer. Finally, the blocks of each channel are used to form a sequence of blocks for position encoding.

[0020] II. At the encoding layer: The Transformer encoder layer is the base network, taking sequence blocks from all channels as input. Self-attention is used to calculate the relationships between these blocks. Through self-attention, the Transformer captures the dependencies and associated features between these blocks, thereby enhancing the representational ability of multi-channel data. The self-attention calculation formula is: $\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$, where Q, K, and V represent the query, key, and value matrices, respectively, and $d_k$ is the dimension of the key vectors.

[0021] III. At the output layer: N binary classification layers are initialized based on the total number of fault categories N, using sigmoid as the activation function. Finally, the results of the N binary classification layers are combined to discriminate among 2^N fault types (every combination of the N fault categories), and a multi-channel time-series diagnostic model is established;

[0022] Step 3: Model Training

[0023] Input the signal slice data obtained in step 1 into the multi-channel time-series diagnostic model established in step 2, and train it according to the following conditions.

[0024] I. Loss Function Design: A multi-label binary cross-entropy loss function is adopted.

[0025] II. Optimization: The model parameters are learned using an optimizer;

[0026] III. Training Rounds: The number of training rounds is determined based on the optimizer's learning and validation results;

[0027] Step 4: Fault Diagnosis and Identification

[0028] I. Independent Diagnosis: Perform independent diagnosis for each signal slice;

[0029] II. Voting Mechanism: A vote is taken over the diagnostic results of all signal slices, the label that appears most frequently is selected as the final diagnostic result, and a composite fault type combination is output. Through multi-channel expansion and a full attention mechanism, this invention can capture complex patterns in the data, improving the accuracy of fault diagnosis and the classification accuracy for key components of subway trains.

[0030] In Step 3, the signal slice data obtained in Step 1 is divided into two parts: training data and test data. Each part is input into the multi-channel time-series diagnostic model established in Step 2, forming a training-data multi-channel time-series diagnostic model and a test-data multi-channel time-series diagnostic model. After the data from the two are combined, Step 4 outputs the combination of composite fault types.

[0031] The signals from multiple channels are merged into a single file in CSV format. Adjacent signal slices overlap by 50% to obtain more training data.

[0032] The time-series signal is divided into blocks of size 16, 32, 64, etc., and the position of each block is encoded as 0, 1, 2, etc. to represent its relative position.

[0033] The optimizer used is the AdamW optimizer, with an initial learning rate of 1e-4. The learning rate is dynamically adjusted using a cosine annealing strategy. The batch size is set to 64, and the training epochs are 50-100.
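As an illustrative, non-limiting sketch of the cosine annealing strategy mentioned above, the learning rate at each epoch may be computed as follows. The function name `cosine_annealing_lr`, the minimum learning rate of 0, the absence of warm restarts, and the choice of 100 total epochs are assumptions made for illustration; only the initial learning rate of 1e-4 and the 50-100 epoch range come from the text.

```python
import math


def cosine_annealing_lr(epoch, total_epochs, base_lr=1e-4, min_lr=0.0):
    """Learning rate at `epoch` under cosine annealing without restarts.

    base_lr follows the stated initial learning rate (1e-4);
    min_lr = 0.0 is an assumption, not specified in the text.
    """
    progress = epoch / total_epochs  # 0.0 at the start, 1.0 at the end
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Schedule for an assumed 100-epoch run (upper end of the stated range).
schedule = [cosine_annealing_lr(e, total_epochs=100) for e in range(101)]
```

Under these assumptions the rate starts at 1e-4, reaches half that value at the midpoint, and decays smoothly toward the minimum at the final epoch.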

[0034] The position encoding preserves the time step information of the time-series data, and the position information and signal features are input together into the multi-channel time-series diagnostic model.

[0035] The self-attention calculation employs a multi-head self-attention mechanism to capture long-range dependencies between different channels, enhancing the model's comprehensive processing capability for multi-channel signals. By recombining multi-channel signals and applying the multi-head self-attention mechanism, this method effectively captures the correlation characteristics and potential fault modes between different devices, significantly improving the accuracy and stability of fault diagnosis.

[0036] The fault categories can be arbitrarily combined, with each fault type having its own independent fault prediction head to prevent interference between model parameters. This independent fault prediction head design allows each fault type to be identified independently and accurately, avoiding interference between categories and improving the overall reliability of the diagnostic system.

[0037] In practical applications, this invention can not only promptly detect and warn of potential faults in subway transmission systems, reducing equipment downtime and ensuring the safety and stability of subway operations, but also provide strong support for subsequent maintenance strategy optimization and equipment improvement through the accumulation and analysis of fault data. Furthermore, this method has strong versatility and is applicable to fault diagnosis and early warning of other complex multi-channel industrial systems, such as power systems, manufacturing equipment, and transportation vehicles, demonstrating broad application prospects and significant economic benefits.

[0038] Each enhanced sample is 64,000 in length, and the signal slice size is 6,400 with an overlap rate of 50% to obtain 19 sample slices. Fault labels are obtained from the fault diagnosis and identification of the 19 sample slices. The final fault category label is determined by majority voting to obtain the combination of composite fault types.

[0039] The calculation process for the 19 sample slices is as follows:

[0040] Step 1: Calculate the signal slice step size

[0041] A 50% overlap rate means that half of the sampling points of two adjacent signal slices overlap. Therefore: Step size = signal slice size × (1 - overlap rate). Substituting the values: Step size = 6400 × (1 - 50%) = 6400 × 0.5 = 3200

[0042] Step 2: Calculate the total number of slices

[0043] The formula for calculating the total number of fixed-length and overlapping slices is: Total number of slices = 1 + (enhanced sample length - signal slice size) divided by the step size. The logic of this formula is:

[0044] “1” represents the first slice (starting from the 0th sampling point);

[0045] "(Total length - signal slice size) / step size" represents the number of moves required after the first signal slice;

[0046] Number of moves + 1 = Total number of slices.

[0047] Substituting the values for verification: Total number of slices = 1 + (64000 - 6400) / 3200 = 1 + 18 = 19
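The calculation above can be checked with a short sketch that enumerates the start index of every slice. The function name `slice_starts` is hypothetical, and the sketch assumes that a trailing remainder shorter than one slice would simply be discarded (no remainder arises with the stated values).

```python
def slice_starts(total_length, slice_size, overlap):
    """Start indices of fixed-length slices with a given overlap rate."""
    step = round(slice_size * (1 - overlap))  # step size = slice size x (1 - overlap rate)
    if step <= 0:
        raise ValueError("overlap must be smaller than 1")
    # The last valid start is total_length - slice_size.
    return list(range(0, total_length - slice_size + 1, step))


starts = slice_starts(total_length=64000, slice_size=6400, overlap=0.5)
num_slices = len(starts)
```

With the stated values the step is 3200, the first slice starts at sampling point 0, the last starts at 57600 and ends exactly at 64000, and `num_slices` is 19.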

[0048] This slicing strategy has an overlap rate of 50%, which avoids the loss of information between slices caused by "no overlap" (for example, the fault feature is exactly in the gap between two slices), and also avoids the excessive number of slices (sharp increase in computation) caused by "excessive overlap" (such as 90%).

[0049] The voting strategy for 19 slices: 19 is an odd number, which can avoid the situation of "tie votes" (such as 10:10). The majority vote result is more stable and effectively offsets the misclassification caused by noise and local anomalies in a single slice, thus effectively improving the stability of fault diagnosis.

[0050] As can be seen from the above, the multi-channel time-series data fault diagnosis method provided in this application achieves efficient multi-channel time-series data integration and accurate identification of complex faults through multi-channel merging, enhancement and slicing steps in data acquisition and processing, combined with the input layer, encoding layer and output layer design of the Transformer model, as well as the model training and diagnostic identification process. It has the advantages of efficient integration of multi-channel time-series data, accurate identification of complex fault types, and improved diagnostic accuracy and response speed.

Attached Figure Description

[0051] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the following description of the embodiments will be briefly introduced. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0052] Figure 1 is a flowchart of a multi-channel time-series data fault diagnosis method based on deep learning according to the present invention.

[0053] Figure 2 is a structural diagram of the multi-channel time-series diagnostic model of the multi-channel time-series data fault diagnosis method based on deep learning described in this invention.

[0054] Figure 3 is a schematic diagram of the data augmentation method of the multi-channel time-series data fault diagnosis method based on deep learning described in this invention.

Detailed Implementation

[0055] The technical solutions of this application will now be clearly and completely described with reference to the accompanying drawings. Obviously, the described embodiments are merely some embodiments of this application, and not all embodiments. The components of this application described and shown in the accompanying drawings can generally be arranged and designed in various different configurations. Therefore, the following detailed description of the embodiments of this application provided in the accompanying drawings is not intended to limit the scope of the claimed application, but merely to illustrate selected embodiments of this application. All other embodiments obtained by those skilled in the art based on the embodiments of this application without inventive effort are within the scope of protection of this application.

[0056] It should be noted that similar reference numerals and letters in the following figures indicate similar items; therefore, once an item is defined in one figure, it does not need to be further defined and explained in subsequent figures. Furthermore, in the description of this application, terms such as "first," "second," etc., are used only to distinguish descriptions and should not be construed as indicating or implying relative importance.

[0057] Traditional fault diagnosis methods and existing deep learning methods suffer from problems when processing time-series data of multi-channel, long-sequence, and complex fault types in subway train transmission systems. These problems include insufficient multi-channel data processing capabilities, low efficiency in processing long-sequence data, difficulty in handling multiple non-exclusive fault types, and a lack of targeted data augmentation techniques. As a result, the diagnostic efficiency and accuracy are not high, and the ability to identify complex fault modes is limited.

[0058] In response, this application proposes a deep learning-based method for fault diagnosis of multi-channel time-series data, comprising the following steps:

[0059] Step 1: Data Acquisition and Processing

[0060] I. Multi-channel data merging: Collect signals from multiple measuring points of the monitoring equipment, align them according to the same time step, and merge the signals from multiple channels into the same file to form the original sample;

[0061] II. Data Augmentation: Randomly select some channels for linear combination to merge signals, increasing data diversity;

[0062] III. Data slicing: The data that has not undergone signal merging is combined with the augmented data to form augmented samples. The augmented samples are then cut into fixed-length segments to form signal slices.

[0063] Step 2: Model Building

[0064] It is constructed using a Transformer model with an input layer, an encoding layer, and an output layer.

[0065] I. At the input layer: The multiple input channel signals are flattened and expanded to obtain a time-series signal. The expanded time-series signal is then divided into blocks, which serve as the input units of the Transformer. Finally, the blocks of each channel are used to form a sequence of blocks for position encoding.

[0066] II. At the encoding layer: The Transformer encoder layer is the base network, which takes sequence blocks from all channels as input, and uses self-attention to calculate the interrelationships between sequence blocks. Through self-attention, the Transformer captures the dependencies and correlation features between sequence blocks, thereby enhancing the representation capability of multi-channel data.

[0067] III. At the output layer: N binary classification layers are initialized according to the total number of fault categories N, and sigmoid is used as the activation function. Finally, the results of the N binary classification layers are combined to discriminate among 2^N fault types and establish a multi-channel time-series diagnostic model.

[0068] Step 3: Model Training

[0069] Input the signal slice data obtained in step 1 into the multi-channel time-series diagnostic model established in step 2, and train it according to the following conditions.

[0070] I. Loss Function Design: A multi-label binary cross-entropy loss function is adopted.

[0071] II. Optimization: The model parameters are learned using an optimizer;

[0072] III. Training Rounds: The number of training rounds is determined based on the optimizer's learning and validation results;

[0073] Step 4: Fault Diagnosis and Identification

[0074] I. Independent Diagnosis: Perform independent diagnosis for each signal slice;

[0075] II. Voting Mechanism: Based on the diagnostic results of all signal slices, a vote is taken, and the label that appears most frequently is selected as the final diagnostic result and the combination of composite fault types is output.

[0076] For ease of understanding, the following explains some key terms in this embodiment:

[0077] Multi-channel time-series data refers to a collection of data continuously acquired from multiple sensors or measuring points at different times. This data is time-dependent, and there may be correlations between different channels, such as various signals like vibration, current, and temperature in a subway train drive system.

[0078] The Transformer model is a deep learning model architecture based on a self-attention mechanism. Originally applied to natural language processing, it has since been extended to time-series data processing. This model can capture long-range dependencies in sequential data and improve processing efficiency through parallel computation.

[0079] Self-attention is the core mechanism of the Transformer model, allowing the model to consider the importance of all other elements in a sequence while processing a particular element. By calculating the similarity between queries, keys, and values, the self-attention mechanism can dynamically assign weights to different elements, thereby extracting representative features.

[0080] The multi-label binary cross-entropy loss function is a loss function suitable for multi-label classification tasks. In multi-label classification, a sample can belong to multiple categories simultaneously. This loss function calculates the binary cross-entropy independently for each category, and then sums or averages the losses for all categories to measure the difference between the model's predictions and the true labels.

[0081] An optimizer is an algorithm used during the training of a deep learning model to adjust the model parameters to minimize the loss function. By iteratively updating the model weights, the optimizer guides the model to learn patterns in the data, such as gradient descent and its variants.

[0082] Voting mechanisms are a decision fusion strategy commonly used in ensemble learning or multi-model prediction scenarios. In fault diagnosis, when multiple diagnostic results exist, the result with the highest frequency of occurrence is selected as the final decision by statistically analyzing the frequency of different diagnostic results, thereby improving the robustness and accuracy of the diagnosis.

[0083] This application provides a method for fault diagnosis of multi-channel time-series data based on deep learning, the specific implementation of which may include the following steps:

[0084] During the data acquisition and processing phase, multi-channel data merging is performed first. This process involves acquiring signals from multiple measuring points of the monitoring equipment and summarizing them according to the same time step. Subsequently, the signals from multiple channels are merged into a single file to form the original sample. For example, sensor data from different measuring points can be simply stitched together by timestamp, or they can be stored in different columns to form a unified data table.
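As an illustrative sketch of storing time-aligned channels in separate columns of one table, the merge may look as follows. The function name `merge_channels_to_csv`, the channel names, the `time_step` column, and the sample values are hypothetical; the sketch assumes all channels have already been sampled on the same time steps.

```python
import csv
import io


def merge_channels_to_csv(channels):
    """Merge equally long channel signals into one CSV table, one column per channel.

    `channels` maps a channel name to its list of samples.
    """
    names = list(channels)
    if len({len(channels[name]) for name in names}) != 1:
        raise ValueError("all channels must cover the same time steps")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["time_step"] + names)  # header row
    for t, row in enumerate(zip(*(channels[name] for name in names))):
        writer.writerow([t, *row])  # one row per shared time step
    return buffer.getvalue()


# Hypothetical two-channel sample with three time steps.
csv_text = merge_channels_to_csv({
    "vibration": [0.1, 0.2, 0.3],
    "current": [1.0, 1.1, 1.2],
})
```

Writing to an in-memory buffer keeps the sketch self-contained; in practice the same rows could be written to a file on disk.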

[0085] Furthermore, data augmentation is performed. This operation involves linearly combining randomly selected channels to achieve signal merging, thereby increasing data diversity. For example, two channels can be randomly selected, their signals can be simply weighted and summed to generate a new synthetic channel signal, which can then be added to the original dataset.
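A minimal sketch of this weighted-sum augmentation is given below. The function name `augment_by_linear_combination`, the choice of two channels, and the use of random positive weights normalized to sum to 1 are assumptions for illustration; the text only requires a linear combination of randomly selected channels.

```python
import random


def augment_by_linear_combination(channels, num_selected=2, rng=None):
    """Build one synthetic channel as a weighted sum of randomly chosen channels.

    `channels` maps a channel name to its list of samples (equal lengths).
    Returns the chosen names, the weights used, and the combined signal.
    """
    rng = rng or random.Random()
    names = rng.sample(sorted(channels), num_selected)
    raw = [rng.uniform(0.1, 1.0) for _ in names]
    weights = [w / sum(raw) for w in raw]  # assumption: weights normalized to sum to 1
    length = len(channels[names[0]])
    combined = [
        sum(w * channels[name][t] for w, name in zip(weights, names))
        for t in range(length)
    ]
    return names, weights, combined


# Hypothetical three-channel sample.
sample = {
    "vibration": [0.0, 1.0, 2.0],
    "current": [10.0, 10.0, 10.0],
    "temperature": [5.0, 4.0, 3.0],
}
names, weights, combined = augment_by_linear_combination(sample, rng=random.Random(0))
```

The synthetic signal has the same length as the source channels, so it can be appended to the original sample as an additional channel.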

[0086] Based on this, data slicing is performed. This step combines the data that has not undergone signal combining with the data augmentation data to form augmented samples. Subsequently, these augmented samples are cut into fixed-length segments to form signal slices. For example, a long sequence of augmented samples can be cut into equal intervals of fixed length, and each cut segment is a signal slice.

[0087] During the model building phase, a Transformer model with an input layer, encoding layer, and output layer is used. This model serves as the core diagnostic tool and is capable of processing time-series data.

[0088] Specifically, at the input layer, the input signals from multiple channels are flattened and expanded to obtain a timing signal. Then, the expanded timing signal is divided into blocks, and each block is used as the input unit of the Transformer. Finally, the blocks of each channel are used to form sequence blocks for position encoding. For example, the multi-channel signals can be arranged into a one-dimensional vector according to time steps, and then this vector can be divided into several fixed-size subsequences, with each subsequence assigned a value representing its relative position in the original sequence.
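The input-layer processing may be sketched as follows. The flattening order is not fixed by the text; this sketch assumes the channels are concatenated one after another so that every block belongs to a single channel, and it assumes the flattened length is a multiple of the block size. The function name `flatten_and_patch` and the toy signals are hypothetical.

```python
def flatten_and_patch(channels, block_size):
    """Flatten channel signals into one sequence and cut it into indexed blocks.

    `channels` is a list of equally long sample lists.
    Returns a list of (position, block) pairs, with positions 0, 1, 2, ...
    """
    # Assumption: channel-by-channel concatenation.
    flat = [value for channel in channels for value in channel]
    if len(flat) % block_size != 0:
        raise ValueError("flattened length must be a multiple of block_size")
    blocks = [flat[i:i + block_size] for i in range(0, len(flat), block_size)]
    return list(enumerate(blocks))


# Two hypothetical channels of 64 samples each, block size 16.
toy_channels = [
    [float(t) for t in range(64)],
    [float(-t) for t in range(64)],
]
tokens = flatten_and_patch(toy_channels, block_size=16)
```

Here the two channels yield 8 blocks with positions 0 through 7; blocks 0 to 3 come from the first channel and blocks 4 to 7 from the second, which is what lets attention relate blocks across channels.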

[0089] In the encoding layer, the Transformer encoder layer is used as the base network. This layer receives sequence blocks from all channels as input and uses self-attention computation to evaluate the relationships between the sequence blocks. Through self-attention computation, the Transformer can capture the dependencies and associated features between sequence blocks, thereby enhancing the representational power of multi-channel data. For example, the encoder can employ a single-head self-attention mechanism to compute attention weights between each sequence block and all other sequence blocks, thus aggregating contextual information.
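The single-head case can be illustrated with a small sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V. For brevity the sketch assumes identity projections, so the block embeddings are used directly as queries, keys, and values; a trained encoder would apply learned projection matrices first. The function names and the toy embeddings are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(Q, K, V):
    """Scaled dot-product attention over lists of vectors.

    Returns the attended outputs and the attention weight matrix.
    """
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)  # weights of this block over all blocks
        weights.append(w)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights


# Three hypothetical 2-dimensional block embeddings (identity projections assumed).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, attn_weights = self_attention(X, X, X)
```

Each row of `attn_weights` sums to 1 and describes how strongly one block attends to every block, including blocks that originate from other channels.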

[0090] In the output layer, N binary classification layers are initialized based on the total number of fault categories N. Each binary classification layer uses a sigmoid function as its activation function. Finally, the results of the N binary classification layers are combined to discriminate among 2^N fault types, thereby establishing a multi-channel time-series diagnostic model. For example, for N possible fault types, an independent binary classifier can be set up for each fault type. Each classifier determines whether the specific fault exists, and then the outputs of all classifiers are simply concatenated to form the final fault discrimination vector.
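A minimal sketch of N independent sigmoid heads is shown below for N = 3. The head weights, the feature vector, the 0.5 decision threshold, and the function names are hypothetical values chosen for illustration; in the described method the heads would be learned during training.

```python
import math
from itertools import product


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_faults(features, heads, threshold=0.5):
    """Apply one independent binary head per fault type.

    `heads` is a list of (weights, bias) pairs, one per fault category.
    Returns per-fault probabilities and the concatenated 0/1 vector.
    """
    probs = [
        sigmoid(sum(w * x for w, x in zip(ws, features)) + b)
        for ws, b in heads
    ]
    return probs, tuple(int(p >= threshold) for p in probs)


# Hypothetical encoder output and three hypothetical heads (N = 3).
features = [1.0, -2.0]
heads = [([2.0, 0.0], 0.0), ([0.0, 1.0], 0.0), ([1.0, 1.0], 1.5)]
probs, fault_vector = predict_faults(features, heads)

# All possible outcomes of N binary heads: 2^N combinations.
all_combinations = list(product((0, 1), repeat=len(heads)))
```

In this example the first and third faults are flagged while the second is not, giving the vector (1, 0, 1), which is one of the 2^3 = 8 possible combinations.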

[0091] During the model training phase, the signal slice data obtained in step 1 is input into the multi-channel time-series diagnostic model established in step 2, and training is performed according to the conditions.

[0092] The loss function designed uses a multi-label binary cross-entropy loss function. This loss function can handle situations where a sample may have multiple fault labels simultaneously.
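For a single sample, the multi-label binary cross-entropy may be sketched as follows. The sketch averages the per-category losses (summing is the other option), and the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math


def multilabel_bce(probs, targets, eps=1e-12):
    """Mean binary cross-entropy over all fault categories of one sample.

    `probs` are sigmoid outputs in (0, 1); `targets` are 0/1 labels.
    """
    losses = []
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1.0 - p)))
    return sum(losses) / len(losses)


# A sample carrying faults 1 and 3 at the same time (hypothetical values).
good = multilabel_bce([0.9, 0.1, 0.8], [1, 0, 1])
bad = multilabel_bce([0.2, 0.7, 0.3], [1, 0, 1])
```

Because each category contributes its own term, a sample with several simultaneous faults is penalized per fault rather than being forced into a single class; here `good` is smaller than `bad`.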

[0093] Furthermore, an optimizer is used for learning. This optimizer is responsible for adjusting the model's weights and biases based on the gradient information of the loss function to gradually reduce the model's prediction error. For example, a stochastic gradient descent optimizer can be used to update the model parameters by calculating the gradient for each batch of data.

[0094] Furthermore, the number of training epochs is determined based on the optimizer's learning and validation results. For example, a single training epoch can be set, or the model's performance can be monitored on the validation set, stopping training when performance no longer improves.
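One way to stop training when validation performance no longer improves is a patience-based rule, sketched below. The `patience` parameter, the use of validation loss as the monitored quantity, the function name, and the example loss values are assumptions for illustration and are not specified in the text.

```python
def early_stopping_epoch(val_losses, patience=5, max_epochs=100):
    """Return the epoch (1-based) at which training would stop.

    Training stops once the validation loss has not improved for
    `patience` consecutive epochs, or when the losses run out.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return min(len(val_losses), max_epochs)


# Hypothetical validation losses: the best value occurs at epoch 3.
stop_epoch = early_stopping_epoch([1.0, 0.8, 0.7, 0.71, 0.72, 0.73], patience=3)
```

With a patience of 3, training in this example stops at epoch 6, three epochs after the last improvement.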

[0095] In the fault diagnosis and identification phase, independent diagnosis is performed first. This process involves diagnosing faults individually for each signal slice. For example, each signal slice is input into a trained model, and the model outputs a fault prediction result for each slice.

[0096] Finally, a voting mechanism is employed. This mechanism votes on the diagnostic results of all signal slices, selecting the label that appears most frequently as the final diagnostic result, thus outputting a combination of composite fault types. For example, if an enhanced sample is divided into multiple signal slices, and each slice provides a fault prediction, then the final fault type will be determined by the fault label that appears most frequently among these slice prediction results.
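The voting step may be sketched as below. This sketch interprets "the label that appears most frequently" as the most frequent complete fault vector across slices; voting on each fault category separately would be an alternative reading. The function name and the 19 example predictions are hypothetical.

```python
from collections import Counter


def majority_vote(slice_predictions):
    """Pick the fault vector predicted by the largest number of slices.

    Returns the winning vector and the number of slices that voted for it.
    """
    counts = Counter(tuple(p) for p in slice_predictions)
    winner, votes = counts.most_common(1)[0]
    return winner, votes


# Hypothetical predictions for 19 slices of one sample (N = 3 fault types).
predictions = [(1, 0, 1)] * 12 + [(1, 0, 0)] * 5 + [(0, 0, 0)] * 2
final_label, votes = majority_vote(predictions)
```

In this example 12 of the 19 slices agree on (1, 0, 1), so the isolated disagreements from the remaining slices do not change the final composite fault combination.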

[0097] This application proposes a deep learning-based fault diagnosis method for multi-channel time-series data. Through data processing, Transformer model construction, and a multi-label diagnostic strategy, it addresses the shortcomings of existing methods for handling multi-channel, long-sequence, and complex fault data in subway train transmission systems. These methods suffer from insufficient processing power, low efficiency, difficulty in identifying multiple fault types, and inadequate data augmentation. This method improves the accuracy of fault diagnosis and the ability to identify complex fault modes, thus ensuring the safe operation of subway trains.

[0098] In some embodiments described above in this application, signal slice data is input into a multi-channel time-series diagnostic model for training and fault diagnosis. However, in practical applications, if training and test data are not clearly distinguished, the model may perform well on the training set but have insufficient generalization ability on unseen data, thereby affecting the accuracy and reliability of fault diagnosis.

[0099] In response, this application further proposes that in step 3, the signal slice data obtained in step 1 is divided into two parts, training data and test data, and each part is input into the multi-channel time-series diagnostic model established in step 2, forming a multi-channel time-series diagnostic model for training data and a multi-channel time-series diagnostic model for test data. After the data from the multi-channel time-series diagnostic model for training data and the multi-channel time-series diagnostic model for test data are combined, the composite fault type combination is output through step 4.

[0100] Specifically, the signal slice data obtained in step 1 is processed through data acquisition and includes both raw and enhanced time-series signal segments. Dividing it into training and testing data is a crucial step in training and evaluating machine learning models. Training data is used to learn and optimize model parameters, while testing data is used to evaluate the model's generalization ability on unseen data. This division typically employs random sampling, such as a 7:3 or 9:1 ratio, to ensure data representativeness and independence.
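
A random 7:3 division of the kind mentioned above can be sketched as follows. The function name `split_slices`, the fixed seed, and the placeholder slice identifiers are assumptions made for illustration and do not come from the application.

```python
import random


def split_slices(slices, train_ratio=0.7, seed=0):
    """Randomly divide signal slices into training and test subsets."""
    indices = list(range(len(slices)))
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(indices)
    n_train = int(round(train_ratio * len(slices)))
    train = [slices[i] for i in indices[:n_train]]
    test = [slices[i] for i in indices[n_train:]]
    return train, test


# Placeholder identifiers standing in for real signal slices.
slices = [f"slice_{i}" for i in range(10)]
train, test = split_slices(slices, train_ratio=0.7)
print(len(train), len(test))
```

Because adjacent slices may overlap, splitting at the level of whole recordings rather than individual slices is one way to keep the test data entirely unseen. The application does not state which granularity is used.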

[0101] During the model training phase, training data is input into the multi-channel time-series diagnostic model for parameter learning. During the model evaluation or validation phase, test data is input into the same or another instantiated multi-channel time-series diagnostic model to evaluate its performance. This data partitioning and input method aims to create a "training data multi-channel time-series diagnostic model" that has fully learned and optimized its parameters on the training data, and a "test data multi-channel time-series diagnostic model" that has been independently validated on the test data and can accurately assess its generalization ability. Here, "creating" does not refer to creating two independent physical model instances, but rather to ensuring, through the training and testing process, that the constructed multi-channel time-series diagnostic model possesses the ability to reliably diagnose faults in practical applications.

[0102] After the above training and testing processes are completed, that is, after the multi-channel time-series diagnostic model has learned from the training data and its performance has been verified by the test data, the model can be used for actual fault diagnosis and identification. At this time, the model will receive new signal slice data and, according to the independent diagnosis and voting mechanism described in step 4, finally output a combination of composite fault types. Here, "after data combination" means that after the model has completed training and testing and its performance has been confirmed, it can be put into use to process new data for fault diagnosis.

[0103] By explicitly dividing signal slice data into training and testing data, and using them separately for model training and evaluation, this application allows overfitting to be detected and helps ensure that the multi-channel time-series diagnostic model maintains good generalization ability on unseen data. Training data is used to optimize model parameters, enabling the model to fully learn fault characteristics; testing data provides an independent performance evaluation criterion, so that the model's actual diagnostic capability can be assessed. This strategy of separating training and testing data improves the reliability of fault diagnosis results, making the final output of composite fault type combinations more credible, and providing solid technical support for equipment maintenance and fault early warning.

[0104] In some of the embodiments described above in this application, it is proposed to merge and slice multi-channel data to form training samples. However, in the implementation process, if the data storage format is not uniform or the signal slicing method is too simple, it may lead to low data processing efficiency and failure to make full use of the limited original data to generate sufficiently diverse training samples, thereby affecting the training effect and generalization ability of the model.

[0105] In this regard, this application further proposes that, in the above data acquisition and processing steps, the signals from multiple channels are merged into the same file, which is in CSV format, and that adjacent signal slices are overlapped by 50% to obtain more training data.

[0106] Specifically, CSV format is used to store the combined signal data from multiple channels, aiming to ensure standardized and universal data storage. CSV (Comma Separated Values) is a widely used plain text file format where data items are typically separated by commas, making it easy to read, write, and parse. This format has good compatibility, allowing it to be directly loaded and processed by various programming languages and data processing tools, thus simplifying data management and exchange processes and reducing the complexity of the data preprocessing stage. For example, in a CSV file, the signal data for each channel can be a separate column, or all channel data can be laid out in the same row according to time steps, to accommodate different data organization needs.
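
The one-column-per-channel layout can be sketched with Python's standard `csv` module. The channel names, the sample values, and the function name `merge_channels_to_csv` are hypothetical. The sketch writes to an in-memory buffer, whereas a real pipeline would write to a file.

```python
import csv
import io


def merge_channels_to_csv(channels):
    """Merge equal-length channels into CSV text, one column per channel.

    `channels` maps a channel name to its list of samples.
    """
    names = list(channels)
    lengths = {len(channels[name]) for name in names}
    if len(lengths) != 1:
        raise ValueError("all channels must have the same number of samples")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(names)  # header row with the channel names
    for row in zip(*(channels[name] for name in names)):
        writer.writerow(row)  # one row per time step
    return buffer.getvalue()


# Made-up channel names and values, for illustration only.
channels = {
    "vibration_x": [0.1, 0.2, 0.3],
    "vibration_y": [1.0, 1.1, 1.2],
    "current": [5, 6, 7],
}
csv_text = merge_channels_to_csv(channels)
rows = list(csv.reader(io.StringIO(csv_text)))
print(csv_text)
```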

[0107] Meanwhile, signal slicing involves dividing a continuous time-series signal into short segments of fixed length, while the overlap rate refers to the proportion of shared data between adjacent slices. Using a 50% overlap rate for signal slicing means that the latter half of the data in the current slice is identical to the first half of the data in the next slice. This slicing strategy can significantly increase the number of training samples without increasing the amount of original data collected. For example, for an original signal of length L, if the slice length is S, non-overlapping slicing will yield L / S samples, while a 50% overlap rate will cause each slice to move by a step size of S / 2, thus generating approximately 2L / S samples (exactly 2L / S - 1 when L is a multiple of S). This method can more densely cover the information in the original signal, capturing subtle changes and local features that may exist in the signal, especially at fault boundaries or transition regions, providing richer contextual information.
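
The effect of the overlap rate on the number of slices can be sketched on a short toy signal. The function name `slice_signal` and the toy lengths (L = 16, S = 4) are chosen for illustration only.

```python
def slice_signal(signal, slice_len, overlap=0.5):
    """Cut `signal` into fixed-length slices with the given overlap rate."""
    step = int(slice_len * (1 - overlap))
    if step <= 0:
        raise ValueError("overlap must be smaller than 1")
    last_start = len(signal) - slice_len
    # Only full-length slices are kept; any incomplete tail is dropped.
    return [signal[s:s + slice_len] for s in range(0, last_start + 1, step)]


signal = list(range(16))  # toy signal, L = 16
no_overlap = slice_signal(signal, slice_len=4, overlap=0.0)
half_overlap = slice_signal(signal, slice_len=4, overlap=0.5)
print(len(no_overlap), len(half_overlap))
```

With L = 16 and S = 4, non-overlapping slicing gives L / S = 4 slices and 50% overlap gives 2L / S - 1 = 7 slices, and the second half of each slice equals the first half of the next.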

[0108] The above technical solution merges signals from multiple channels into a CSV file, achieving standardized and efficient data storage management, facilitating subsequent data loading and processing, and improving the efficiency of the data preprocessing stage. Simultaneously, by employing a 50% overlap rate in signal slicing, the scale and diversity of the training dataset are effectively expanded without increasing the cost of original data acquisition. This overlapping slicing method allows the model to learn signal features from a denser perspective, capturing more potential fault modes and state transition information, especially at fault edges or transition regions, providing richer contextual information. This not only enhances the model's ability to identify complex fault modes but also significantly improves its generalization performance and robustness, enabling it to maintain high diagnostic accuracy even when facing unknown or variant faults.

[0109] In some embodiments described above in this application, when processing multi-channel time-series data, the Transformer model needs to flatten and expand the input signals from multiple channels to obtain a single time-series signal, and then divide it into blocks. However, if the block division strategy is inappropriate or the position encoding method is not precise enough, the model may have difficulty effectively capturing the local time-series features and accurate position information contained in the time-series signal, thereby affecting the model's ability to identify fault modes.

[0110] In this regard, this application further proposes to divide the time-series signal into blocks of size 16, 32, 64, etc., and to use the integers 0, 1, 2, etc. to represent the relative position of the blocks.

[0111] Specifically, time series signal partitioning involves dividing a continuous time series signal into several fixed-length subsequences or "blocks." This partitioning operation is crucial for Transformer models because the Transformer's self-attention mechanism performs computations on a finite sequence of elements, and its cost grows with the square of the sequence length, so it cannot be applied directly to an arbitrarily long signal sample by sample. Choosing an appropriate block size balances computational efficiency and information capture capability. For example, for some rapidly changing fault signals, a smaller block size (such as 16 or 32) may be needed to capture transient features, while for faults with longer durations, a larger block size (such as 64) may be more suitable. These block sizes are typically powers of 2 to facilitate computer processing and optimization. During the data preprocessing stage, the flattened time series signal is partitioned into sliding windows or non-overlapping blocks according to the preset block size.

[0112] Meanwhile, positional encoding of blocks is an indispensable part of the Transformer model's processing of sequence data. It provides the model with the sequential information of each element in the sequence (in this case, each block). Since the Transformer's self-attention mechanism is parallel and does not inherently handle sequence order, positional encoding is needed to inject this information. Using integers such as 0, 1, and 2 to represent relative positions is a concise and intuitive positional encoding method that directly reflects the sequential order of blocks within the entire temporal signal. For each block, an integer starting from 0 and incrementing can be assigned as its positional code. For example, the positional code for the first block is 0, the second is 1, and so on. These integer codes can be converted into high-dimensional vectors and then added to the feature vector of the block as input to the Transformer encoder. This relative positional encoding method helps the model understand the temporal relationships between different blocks, such as which block occurs first and which occurs later, thereby better capturing dynamic patterns and causal relationships in temporal data.
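
Block division and integer position indexing can be sketched as follows. Several details are assumptions, since the application does not fix them: channels are flattened by simple concatenation, blocks do not overlap, and samples that do not fill a final block are dropped. The names `flatten_channels` and `make_blocks` are hypothetical.

```python
def flatten_channels(channels):
    """Concatenate the channels into one sequence (assumed flattening order)."""
    flat = []
    for channel in channels:
        flat.extend(channel)
    return flat


def make_blocks(signal, block_size):
    """Split `signal` into non-overlapping blocks, each paired with its index."""
    n_blocks = len(signal) // block_size  # incomplete tail is dropped
    return [
        (position, signal[position * block_size:(position + 1) * block_size])
        for position in range(n_blocks)
    ]


# Two toy channels of 32 samples each.
channels = [
    [float(i) for i in range(32)],
    [float(i) for i in range(100, 132)],
]
flat = flatten_channels(channels)
blocks = make_blocks(flat, block_size=16)
positions = [position for position, _ in blocks]
print(positions)
```

In a Transformer, each integer in `positions` would typically index a learned or fixed embedding vector that is added to the block's feature vector.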

[0113] By dividing the time-series signal into blocks of various sizes (16, 32, 64) and encoding these blocks using relative positions represented by integers (0, 1, 2), this application effectively addresses the problem of insufficient diagnostic accuracy in traditional Transformer models when processing long-series data, caused by a lack of fine-grained capture of local temporal features and effective injection of positional information. Specifically, choosing different block sizes allows the model to understand the characteristics of the time-series signal at multiple scales. For example, smaller block sizes help capture instantaneous or short-term fault features, while larger block sizes better identify long-term trends or periodic patterns. Simultaneously, the concise and intuitive relative positional encoding method ensures that the Transformer model can accurately perceive the relative order of each block in the original time-series signal when processing blocks in parallel, thus avoiding semantic confusion caused by missing positional information. This combination of multi-scale block division and precise positional encoding significantly enhances the model's ability to represent complex temporal dependencies in multi-channel time-series data, enabling the model to more accurately identify and diagnose complex fault types, thereby improving the accuracy and robustness of fault diagnosis.

[0114] In some of the embodiments described above in this application, a fault diagnosis method for multi-channel time-series data based on the Transformer model is proposed, and the model is learned through a model training step. However, if appropriate optimization strategies and parameter configurations are not adopted during the model training process, the model may experience slow convergence, unstable training, and be prone to getting stuck in local optima, or even overfitting, thereby affecting the model's diagnostic accuracy and generalization ability for faults in complex multi-channel time-series data.

[0115] In this regard, this application further proposes that during the model training process, the optimizer used is the AdamW optimizer, the initial learning rate is set to 1e-4, the learning rate is dynamically adjusted using a cosine annealing strategy, the batch size is set to 64, and the training epochs are 50-100.

[0116] Specifically, the AdamW optimizer is an improved variant of the Adam optimizer. Its core lies in decoupling weight decay from the gradient-based update: instead of adding an L2 penalty to the loss, where the penalty would pass through Adam's adaptive scaling, the decay is applied directly to the weight parameters. This decoupling regularizes the weights more consistently and helps prevent overfitting, especially when training deep neural networks, providing a more stable training process and better generalization performance. By using the AdamW optimizer, the model can better balance fitting the training data against overlearning noise. The initial learning rate is the step size at which the optimizer updates the model parameters at the beginning of training. A value of 1e-4 is a common choice when training Transformer models with Adam-type optimizers. It is large enough to avoid inefficient training due to an excessively small learning rate, and small enough to prevent training oscillations or divergence caused by an excessively large learning rate, thus laying a stable foundation for subsequent dynamic adjustments to the learning rate. Cosine annealing is a learning rate scheduling method that adjusts the learning rate along a cosine curve. Specifically, the learning rate gradually decreases from a maximum value to a minimum value and, when warm restarts are used, may then increase again, forming one or more "annealing" cycles. This dynamic adjustment mechanism helps the model quickly explore the solution space in the early stages of training and fine-tune parameters in the later stages to achieve better convergence. With restarts, it can also help the model escape local optima and improve its generalization ability. Batch size refers to the number of samples used in a single model parameter update. Setting the batch size to 64 aims at a balance between training efficiency and model generalization ability. Smaller batch sizes yield noisier gradient estimates and lower hardware throughput, although the noise can act as a regularizer; larger batch sizes yield more accurate gradient estimates and faster epochs but may lead to decreased model generalization ability. As a medium batch size, 64 keeps the training process relatively stable and efficient while making effective use of computational resources. Training epochs refer to the number of times the model traverses the entire training dataset. Setting the training epochs to 50-100 provides the model with ample learning opportunities, enabling it to extract complex features and patterns from the data. This range allows for flexible adjustments based on performance on the validation set during actual training. For example, an early stopping mechanism can be used to terminate training when model performance no longer improves, avoiding overfitting while ensuring the model is adequately trained.
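
The cosine annealing schedule with the stated initial learning rate can be sketched in closed form. The minimum learning rate of 0, the absence of warm restarts, and the choice of 50 epochs from the 50-100 range are assumptions, and the function name `cosine_annealing_lr` is hypothetical.

```python
import math


def cosine_annealing_lr(epoch, total_epochs, lr_max=1e-4, lr_min=0.0):
    """Learning rate at `epoch` for a single cosine cycle without restarts."""
    progress = epoch / total_epochs
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))


total_epochs = 50  # lower end of the 50-100 range
schedule = [cosine_annealing_lr(e, total_epochs) for e in range(total_epochs + 1)]
print(schedule[0], schedule[25], schedule[50])
```

The rate starts at 1e-4, reaches half of that at the midpoint, and decays to the minimum at the final epoch. In PyTorch, the corresponding building blocks would be `torch.optim.AdamW` and `torch.optim.lr_scheduler.CosineAnnealingLR`. The application does not name a framework.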

[0117] Through the above technical solutions, the AdamW optimizer effectively decouples weight decay during model training, thus more stably preventing overfitting and improving the model's generalization ability. An initial learning rate of 1e-4 provides a suitable exploration step size, avoiding instability or slow convergence in the early stages of training. Combined with a cosine annealing strategy to dynamically adjust the learning rate, the model can update parameters adaptively at different training stages, achieving both rapid convergence and effective escape from local optima, further improving training efficiency and final performance. A batch size of 64 ensures training efficiency while also considering the accuracy of gradient estimation and the model's generalization ability. 50-100 training epochs ensure the model has sufficient opportunities to learn deep features from multi-channel time-series data, enabling the constructed multi-channel time-series diagnostic model to more accurately and stably identify combinations of complex fault types, significantly improving the accuracy and robustness of fault diagnosis.

[0118] In some embodiments described above in this application, the input signals from multiple channels are divided into blocks, and each channel block is used to form a sequence block for position encoding to help the Transformer model understand the positional relationships of elements in the sequence. However, when processing multi-channel time-series data in complex industrial equipment or systems, relying solely on conventional position encoding may not be sufficient to capture the inherent time step information in the time-series data. This could lead to the model's insufficient understanding of the dynamic changes and time dependencies of the signal when identifying fault modes closely related to time evolution, thus affecting the accuracy of fault diagnosis.

[0119] In this regard, this application further proposes that the positional encoding retains the time step information of the time-series data, and that the position information and signal features are jointly input into the multi-channel time-series diagnostic model.

[0120] Specifically, positional encoding in Transformer models is typically used to inject sequential information into a sequence to compensate for the lack of sequence order awareness in self-attention mechanisms. Here, positional encoding is designed not only to encode the relative or absolute position of elements in the sequence, but more importantly, it explicitly preserves the time step information of the time-series data. This means that the generation of positional encoding considers time-scale-related information such as time intervals, sampling frequencies, or timestamps. For example, this can be achieved by incorporating time step information into the calculation formula of positional encoding, or by using specially designed time-aware positional encoding (such as relative time encoding, periodic time encoding, etc.). This design ensures that the model, when processing time-series data, can perceive the time intervals and data change rates between different time points, thereby better understanding the dynamic characteristics of time-series data.

[0121] Before inputting signal features into the multi-channel time-series diagnostic model, positional encoding containing time-step information is fused with the signal features. This fusion can be achieved in various ways, such as concatenating the positional encoding vector with the signal feature vector or performing element-wise addition. Through this common input, the multi-channel time-series diagnostic model can not only obtain the amplitude, frequency, and other features of the original signal, but also simultaneously obtain the precise position and time-step information of these features in the time dimension. This allows the Transformer's self-attention mechanism to consider both the signal content and its temporal context when calculating the relationships between sequence blocks, thus providing a more comprehensive understanding of the inherent patterns in multi-channel time-series data.
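
One way to realize a positional encoding that carries time-step information, fused by element-wise addition, is sketched below. The sinusoidal form, the use of each block's start time in seconds as the encoded quantity, the sampling rate of 100 Hz, and the names `time_encoding` and `fuse` are all assumptions. The application does not specify a formula.

```python
import math


def time_encoding(timestamp, dim, base=10000.0):
    """Sinusoidal encoding of a real-valued timestamp (assumed form)."""
    encoding = []
    for i in range(0, dim, 2):
        frequency = 1.0 / (base ** (i / dim))
        encoding.append(math.sin(timestamp * frequency))
        if i + 1 < dim:
            encoding.append(math.cos(timestamp * frequency))
    return encoding


def fuse(features, timestamps):
    """Add the time encoding to each block's feature vector element-wise."""
    fused = []
    for feature, timestamp in zip(features, timestamps):
        encoding = time_encoding(timestamp, len(feature))
        fused.append([f + e for f, e in zip(feature, encoding)])
    return fused


sampling_rate_hz = 100.0  # hypothetical sampling rate
block_size = 16
features = [[0.0] * 4 for _ in range(3)]  # placeholder block features
# Start time of each block in seconds, derived from the sampling rate.
timestamps = [k * block_size / sampling_rate_hz for k in range(3)]
fused = fuse(features, timestamps)
print(fused[0])
```

Because the encoding depends on elapsed time rather than on the block index alone, the same block index maps to a different encoding when the sampling rate changes.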

[0122] Through the aforementioned technical solution, positional encoding is no longer merely a simple position index, but explicitly incorporates time step information from the time-series data. When these positional encodings containing time step information are input together with the original signal features into the multi-channel time-series diagnostic model, the Transformer model can gain a deeper understanding of the temporal dynamics and inherent temporal dependencies of the multi-channel time-series data. This allows the model to not only focus on the signal features themselves during self-attention calculations, but also to perceive the time intervals and trends between different time points, thereby enhancing the model's ability to capture patterns in time-series data. Ultimately, this refined processing of temporal information helps improve the accuracy and robustness of fault diagnosis, especially in scenarios requiring precise identification of time-related fault modes.

[0123] In some embodiments described above in this application, the Transformer model uses self-attention to calculate the interrelationships between sequence blocks in order to capture the dependencies and associated features between sequence blocks, thereby enhancing the representation capability of multi-channel data. However, when processing complex multi-channel time-series data, a single self-attention mechanism may be insufficient to fully capture the diverse long-range dependencies between different channels that span long time steps, which may limit the model's ability to comprehensively understand and extract features from multi-channel signals.

[0124] In response, this application further proposes that the self-attention calculation adopts a multi-head self-attention mechanism to capture the long-range dependencies between different channels, thereby enhancing the model's comprehensive processing capability for multi-channel signals.

[0125] Specifically, the multi-head self-attention mechanism processes input information by running multiple self-attention computation units (i.e., "attention heads") in parallel. Each attention head independently learns different linear projections of queries, keys, and values, thus enabling it to focus on different aspects of the input sequence in different representation subspaces. The same input sequence is fed to every attention head, each of which applies its own projections and independently computes its self-attention output. These independent outputs are then concatenated and passed through a final linear transformation layer to form the final output of the multi-head self-attention mechanism. This parallel processing and information integration approach allows the model to capture complex patterns and relationships in the data from multiple perspectives and at various granularities.
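
The project, attend, concatenate, and transform sequence described above can be sketched in plain Python. The projection matrices are random and untrained, and the sizes (5 blocks, model width 8, 2 heads) and all helper names are chosen for illustration only. A trained model would learn these matrices.

```python
import math
import random


def matmul(a, b):
    """Multiply matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def multi_head_self_attention(x, num_heads, rng):
    """Return (output, per-head attention weights) for input `x`."""
    d_model = len(x[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    head_outputs, head_weights = [], []
    for _ in range(num_heads):
        # Each head has its own query, key, and value projections.
        wq, wk, wv = (random_matrix(d_model, d_head, rng) for _ in range(3))
        q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
        k_transposed = [list(col) for col in zip(*k)]
        scores = matmul(q, k_transposed)
        weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
        head_outputs.append(matmul(weights, v))
        head_weights.append(weights)
    # Concatenate the heads block by block, then apply the output projection.
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(x))]
    wo = random_matrix(d_model, d_model, rng)
    return matmul(concat, wo), head_weights


rng = random.Random(0)
x = random_matrix(5, 8, rng)  # 5 sequence blocks, model width 8
output, attention = multi_head_self_attention(x, num_heads=2, rng=rng)
print(len(output), len(output[0]), len(attention))
```

Each row of a head's attention matrix sums to 1 and describes how strongly one block attends to every other block.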

[0126] Given the parallel processing nature of multi-head self-attention mechanisms, different attention heads can be trained to focus on capturing different types of inter-channel dependencies. For example, one head might focus on capturing short-term, direct inter-channel linkages, while another head might focus on identifying indirect inter-channel effects spanning longer time series. This capability is particularly important for multi-channel time-series data, as failure modes often involve coordinated changes in multiple channels at different time scales. In this way, the model can more effectively identify long-range correlations hidden in complex multi-channel signals that are not easily detected by a single attention mechanism.

[0127] Multi-head self-attention mechanisms integrate different perspectives and information from multiple attention heads, providing the model with richer and more comprehensive multi-channel signal representations. Each attention head contributes its understanding of specific aspects of the input data, and these understandings are fused in the final output, enabling the model to perform a more refined and in-depth comprehensive analysis of multi-channel signals. This enhanced processing capability helps the model extract key features relevant to fault diagnosis more accurately when facing varied and complex fault scenarios, thereby improving overall diagnostic performance and robustness.

[0128] Through the above technical solutions, the method proposed in this application can effectively overcome the limitations of a single self-attention mechanism in capturing diverse long-range dependencies in multi-channel time-series data. The multi-head self-attention mechanism allows the model to learn and focus on various aspects of the input signal in parallel from multiple different representation subspaces, thereby simultaneously capturing complex short-term and long-term, direct and indirect correlations between different channels. This mechanism significantly enhances the model's comprehensive processing capability for multi-channel signals, enabling the model to more comprehensively and deeply understand the fault characteristics inherent in multi-channel data, thus improving the accuracy and robustness of fault diagnosis. Its advantages are particularly evident when dealing with complex fault types involving the coordinated effects of multiple channels.

[0129] In some embodiments described above in this application, N binary classification layers are initialized based on the total number of fault categories N, and their results are combined to achieve the discrimination of 2^N fault types (for example, N = 3 binary outputs distinguish 2^3 = 8 combinations, including the fault-free state). However, in practical applications, when fault categories can be arbitrarily combined to form composite faults, if all fault types share or are tightly coupled with the parameters of the prediction layer, the learning processes between different fault types may interfere with each other, making it difficult for the model to accurately distinguish and identify multiple composite faults, thereby affecting the accuracy and robustness of the diagnosis.

[0130] In this regard, this application further proposes that the fault categories can be combined arbitrarily, with each fault type having an independent fault prediction head to prevent interference between model parameters.

[0131] Specifically, the arbitrary combination of fault categories refers to the fact that in actual equipment operation, a single fault may exist, or two or more faults may occur simultaneously. These fault types are not mutually exclusive but can be superimposed in any way to form multiple compound fault modes. For example, a piece of equipment may simultaneously experience both bearing wear and motor overheating faults. This "arbitrary combination" characteristic requires the diagnostic model to be able to independently identify each potential fault type, rather than just identifying predefined compound fault modes.

[0132] To address this complexity, this application proposes setting up an independent fault prediction head for each fault type. This means that for each fault type requiring diagnosis, an independent sub-network or module specifically designed for classifying that fault type is configured at the model's output. For example, if faults A, B, and C need to be diagnosed, the model will have a prediction head for fault A, a prediction head for fault B, and a prediction head for fault C, respectively. Each prediction head is typically a simple fully connected layer, whose output is processed by an activation function (such as sigmoid) to obtain the probability of the fault type's existence. This independent setup ensures that each prediction head can focus on learning and identifying its corresponding fault features without being directly affected by the learning process of other fault types.
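
The independent prediction heads can be sketched as one small linear layer plus sigmoid per fault type. The head weights below are hand-picked for illustration and are not trained values. The class name `FaultHead`, the decision threshold of 0.5, and the two-dimensional feature vector are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class FaultHead:
    """One fully connected layer with a sigmoid output for a single fault type."""

    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def probability(self, features):
        z = sum(w * f for w, f in zip(self.weights, features)) + self.bias
        return sigmoid(z)


def predict_faults(heads, features, threshold=0.5):
    """Evaluate every head independently on the shared feature vector."""
    return {
        name: head.probability(features) >= threshold
        for name, head in heads.items()
    }


# Hand-picked parameters, for illustration only.
heads = {
    "fault_A": FaultHead([2.0, 0.0], -1.0),
    "fault_B": FaultHead([0.0, 2.0], -1.0),
    "fault_C": FaultHead([-1.0, -1.0], 0.0),
}
features = [1.0, 1.0]  # placeholder encoder output
result = predict_faults(heads, features)
print(result)
```

Since each head makes its own yes-or-no decision, three heads can express any of the 2^3 = 8 fault combinations. In this example faults A and B are reported together.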

[0133] By setting independent prediction heads for each fault type, the influence between model parameters can be effectively prevented. This means that during model training, the parameters of the corresponding prediction heads for different fault types can be updated and optimized independently. Adjusting the parameters of one fault prediction head will not directly change or interfere with the parameters of other fault prediction heads, thus avoiding confusion or competition in the learned features between different fault types. For example, during training, even if a certain fault type has fewer samples, the learning of its prediction head will not be dominated by other fault types with larger sample sizes, ensuring that each fault type can be learned fully and independently.

[0134] By employing the aforementioned technical solution, and by setting an independent fault prediction head for each fault type, the model can effectively decouple the learning tasks between different fault types when dealing with complex scenarios involving arbitrary combinations of fault categories. Each prediction head focuses on identifying its corresponding fault characteristics, avoiding mutual interference and influence between parameters of different fault types. This ensures that the model can independently and accurately learn and identify each potential fault, providing accurate diagnostic results even when multiple faults occur simultaneously to form compound faults. Therefore, this application significantly improves the accuracy and robustness of multi-channel time-series data fault diagnosis models in identifying compound faults, making the diagnostic results more reliable and detailed.

[0135] In some of the embodiments described above in this application, a method is proposed to independently diagnose signal slices and vote based on the diagnostic results of all signal slices to select the label that appears most frequently as the final diagnostic result and output a combination of composite fault types. However, in its implementation, if the length of the enhanced sample, the size of the signal slice, and the slice overlap rate are not precisely set, it may lead to information loss and significant boundary effects during the data slicing process, or the voting mechanism may not be able to fully exert its advantages due to insufficient sample slices, thereby affecting the accuracy and robustness of the final fault diagnosis.

[0136] To address this, this application further proposes slicing each enhanced sample with a length of 64,000, a signal slice size of 6,400, and an overlap rate of 50% to obtain 19 sample slices. Fault labels are obtained from the fault diagnosis and identification of the 19 sample slices, and the final fault category label is determined by majority voting to obtain a combination of composite fault types.

[0137] Specifically, the length of each enhanced sample is 64,000, which refers to the length of a single complete data segment used for subsequent slicing processing after multi-channel data merging and data enhancement in the data acquisition and processing steps. Setting the enhanced sample length to 64,000 aims to ensure that each sample contains sufficient time-series information to capture potential fault characteristics, especially for fault modes that are long-lasting or require extensive context for identification. This length balances the needs of data integrity and computational efficiency.

[0138] The signal slice size of 6400 refers to further subdividing the aforementioned enhanced sample of length 64000 into fixed-length segments, each segment being a signal slice of length 6400. Using fixed-size signal slices helps standardize the model's input, enabling the Transformer model to process data efficiently. At the same time, the smaller slice size allows the model to focus on local features, while multiple slices collectively cover the entire enhanced sample.

[0139] The 50% overlap slicing refers to a 50% data overlap between adjacent slices when generating signal slices from the enhanced samples. For example, if the signal slice size is 6400, the step size for each slice movement is 3200. This overlapping slicing strategy reduces information loss caused by fault features being located precisely at slice boundaries: every time point except those in the first and last 3200 samples is analyzed in two slices, thereby enhancing the robustness of feature extraction.

[0140] Using the aforementioned augmented sample of length 64000, signal slice size of 6400, and 50% overlap, 19 sample slices can be precisely calculated. For example, for an augmented sample of length 64000, when the slice size is 6400 and the overlap rate is 50%, the first slice starts from 0, the second slice starts from 3200, and so on, until the last slice covers the end of the augmented sample, ultimately generating 19 independent but overlapping signal slices.
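The slice count stated above can be checked with a short calculation: with a step of 6400 × (1 − 0.5) = 3200, the number of full-length slices is (64000 − 6400) / 3200 + 1 = 19. The following is a minimal sketch of that arithmetic; the function name `slice_starts` is hypothetical and is not taken from this application.

```python
def slice_starts(sample_len, slice_len, overlap):
    """Return the start index of every full-length slice of a sample."""
    step = int(slice_len * (1 - overlap))  # 6400 * 0.5 = 3200
    if step <= 0:
        raise ValueError("overlap must be below 1.0")
    # The last slice must end at or before the end of the sample.
    return list(range(0, sample_len - slice_len + 1, step))


starts = slice_starts(64_000, 6_400, 0.5)
print(len(starts))             # 19
print(starts[:3], starts[-1])  # [0, 3200, 6400] 57600
```

The last slice starts at 57600 and ends exactly at 64000, so no padding is needed for these particular values.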

[0141] The fault labeling process for identifying faults in 19 sample slices involves inputting each of the 19 generated signal slices into a pre-trained multi-channel time-series diagnostic model. Each slice independently outputs its corresponding fault diagnosis result, i.e., a set of fault labels. This process provides rich diagnostic evidence for the subsequent voting mechanism.

[0142] The majority voting method for determining the final fault category label involves statistically analyzing the independent fault labels obtained from 19 signal slices. For each possible fault type, the frequency of its occurrence across the 19 slice diagnostic results is counted, and the fault type with the highest frequency is selected as the final diagnostic result for that enhanced sample. This majority voting mechanism effectively integrates diagnostic information from multiple overlapping slices, filtering out potential misjudgments or uncertainties from individual slices, thereby improving the accuracy and reliability of the final diagnosis.

[0143] The above technical solution precisely defines the length of the enhanced sample, the size of the signal slice, and the slice overlap rate, ensuring the scientific rigor and effectiveness of the data slicing process. By generating 19 overlapping sample slices, the diagnostic information of each enhanced sample can be captured from multiple angles and levels. Based on this, independent diagnosis of each slice combined with a majority voting mechanism effectively aggregates diagnostic evidence from different slices, significantly reducing the misdiagnosis rate caused by local noise or boundary effects. This method, by integrating multiple diagnostic results, greatly enhances the robustness and accuracy of fault diagnosis, especially when identifying combinations of complex fault types, providing more stable and reliable diagnostic conclusions, thereby improving the ability to identify equipment faults and the confidence level of diagnosis in complex industrial scenarios.

[0144] The following example will provide a more detailed explanation of the above technical solution:

[0145] In a fault diagnosis scenario for a subway train transmission system in urban rail transit, real-time monitoring of key components such as the train's motor, gearbox, and bearings is required to identify potential complex faults. The system deploys multiple sensors, including vibration sensors, current sensors, temperature sensors, and speed sensors, totaling 10 channels. These sensors continuously collect data at a sampling frequency of 64 kHz.

[0146] First, in the data acquisition and processing stage:

[0147] I. Multi-channel data merging: The system will simultaneously acquire signals from these 10 channels. At the same time step, the signal data from these channels will be aggregated and merged into a single CSV file to form the original sample. For example, a data sample containing 10 channels and 64,000 time steps will be generated every second.

[0148] II. Data Augmentation: To increase data diversity and model generalization ability, the system randomly selects some channels (e.g., vibration channel 1, current channel 2, and temperature channel 3) for linear combination to generate new signals. This targeted data augmentation method can simulate the complex superposition effect of signals under different fault modes, making up for the shortcomings of traditional general data augmentation techniques that may blur or distort fault characteristics.
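The channel-combination augmentation described above might be sketched as follows. This application does not state how the combination coefficients are chosen, so the random weights normalised to sum to 1 are an assumption made only for illustration, and the function name `augment_by_linear_combination` is hypothetical.

```python
import random


def augment_by_linear_combination(channels, k, rng):
    """Combine k randomly chosen channels into one new signal.

    channels: list of equal-length lists, one per sensor channel.
    The weights are random and normalised to sum to 1 (an assumption;
    the source does not specify the coefficients).
    """
    chosen = rng.sample(range(len(channels)), k)
    raw = [rng.random() + 1e-9 for _ in chosen]  # keep weights strictly positive
    total = sum(raw)
    weights = [w / total for w in raw]
    length = len(channels[0])
    mixed = [
        sum(w * channels[c][t] for w, c in zip(weights, chosen))
        for t in range(length)
    ]
    return chosen, weights, mixed


rng = random.Random(0)
# Ten toy channels of eight time steps each; channel c holds the constant c.
channels = [[float(c)] * 8 for c in range(10)]
chosen, weights, mixed = augment_by_linear_combination(channels, 3, rng)
print(chosen, [round(w, 3) for w in weights])
```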

[0149] III. Data Slicing: The raw, uncombined multi-channel data is combined with the augmented data to form augmented samples. Each augmented sample is set to a length of 64,000 time steps. To extract more training information from long-sequence data and improve the ability to capture local fault features, these augmented samples are cut into fixed-length signal slices, each 6,400 time steps long. Adjacent signal slices have a 50% overlap rate; for example, 19 signal slices can be obtained from an augmented sample of 64,000 time steps. This slicing method effectively solves the problem of low processing efficiency for long-sequence data and provides richer training data.

[0150] Secondly, in the model building phase, a Transformer model with an input layer, an encoding layer, and an output layer is used:

[0151] I. At the input layer: For each signal slice (e.g., a slice containing 10 channels and 6400 time steps), the signals from these 10 channels are first flattened and expanded to obtain a time-series signal of length 64000 (10 × 6400). Next, this expanded time-series signal is divided into blocks; for example, using a block size of 64, the 64000 data points are divided into 1000 sequence blocks. Each sequence block serves as an input unit for the Transformer. To preserve the time step information of the time-series data, the sequence blocks formed by each channel are positionally encoded, using numbers such as 0, 1, and 2 to represent their relative positions. This positional information, along with the signal features, is input into the model. This processing method enables the model to effectively handle multi-channel data while maintaining the temporal dependencies of long-sequence data.
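The flattening and blocking arithmetic of the input layer can be illustrated with a minimal sketch. It assumes the slice is flattened channel by channel and that the position index counts blocks within each channel; both are one possible reading of the description rather than a confirmed detail, and the function name `to_sequence_blocks` is hypothetical.

```python
def to_sequence_blocks(slice_channels, block_size):
    """Flatten a multi-channel slice channel by channel and cut it into blocks.

    Returns a list of (channel_index, position_index, block) tuples, where
    position_index is the relative position 0, 1, 2, ... within the channel.
    """
    blocks = []
    for ch, signal in enumerate(slice_channels):
        if len(signal) % block_size != 0:
            raise ValueError("channel length must be a multiple of block_size")
        for pos in range(len(signal) // block_size):
            start = pos * block_size
            blocks.append((ch, pos, signal[start:start + block_size]))
    return blocks


# One signal slice: 10 channels x 6400 time steps (placeholder zeros).
slice_channels = [[0.0] * 6_400 for _ in range(10)]
blocks = to_sequence_blocks(slice_channels, 64)
print(len(blocks))  # 1000
```

With 10 channels of 6400 points and a block size of 64, each channel contributes 100 blocks, giving the 1000 sequence blocks mentioned above.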

[0152] II. At the Encoding Layer: The Transformer encoder layer serves as the foundational network, receiving sequence blocks from all channels. This layer employs a multi-head self-attention mechanism to compute the interrelationships between sequence blocks. Through multi-head self-attention computation, the Transformer can capture complex dependencies and correlations between different channels and within long sequences, thereby enhancing its representational capabilities for multi-channel data. This significantly improves the model's ability to uncover complex intrinsic relationships in multi-channel data, overcoming the shortcomings of existing methods in multi-channel data processing.
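To make the interrelationship computation concrete, the following is a minimal single-head sketch of scaled dot-product self-attention over a handful of toy sequence blocks. It deliberately omits the learned query, key, and value projections and the multiple heads of a real Transformer encoder, so it is an illustration of the mechanism only and not the model of this application; the function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def self_attention(blocks):
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    d = len(blocks[0])
    outputs, weights = [], []
    for q in blocks:
        # Similarity of this block to every block, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in blocks]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted mix of all blocks.
        outputs.append(
            [sum(w[j] * blocks[j][i] for j in range(len(blocks))) for i in range(d)]
        )
    return outputs, weights


toy_blocks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(toy_blocks)
print([round(x, 3) for x in weights[0]])
```

Each row of `weights` sums to 1 and expresses how strongly one sequence block attends to every other block, regardless of which channel the blocks came from.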

[0153] III. At the output layer: Assuming the total number of fault categories N to be diagnosed is 5 (e.g., bearing wear, gear damage, motor winding fault, sensor fault, normal), five independent binary classification layers are initialized. Each binary classification layer is specifically responsible for determining whether one fault type is present, using the sigmoid function as the activation function. Finally, the results of these five binary classification layers are combined to discriminate 2^5 = 32 combinations of fault types, thus establishing a multi-channel time-series diagnostic model. This design, by setting an independent fault prediction head for each fault type, effectively prevents interference between model parameters, enabling the model to accurately handle non-mutually exclusive composite fault types and solving the problem that existing methods struggle to handle multiple fault types.
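The way five independent sigmoid outputs yield 2^5 = 32 possible label combinations can be sketched as follows. The decision threshold of 0.5 and the example logits are assumptions for illustration; this application does not specify them, and the function names are hypothetical.

```python
import math
from itertools import product

FAULT_TYPES = [
    "bearing wear",
    "gear damage",
    "motor winding fault",
    "sensor fault",
    "normal",
]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode(logits, threshold=0.5):
    """Turn one logit per binary head into a list of predicted fault types."""
    probs = [sigmoid(z) for z in logits]
    return [name for name, p in zip(FAULT_TYPES, probs) if p >= threshold]


# Every on/off pattern of the N heads is a distinct combination.
combos = list(product([0, 1], repeat=len(FAULT_TYPES)))
print(len(combos))  # 32

# Hypothetical logits from the five heads for one signal slice.
print(decode([2.0, -1.5, 0.3, -4.0, -3.0]))
# ['bearing wear', 'motor winding fault']
```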

[0154] Next, in the model training phase:

[0155] The signal slice data obtained in step 1 is divided into two parts: training data and test data. These two parts are then input into the multi-channel time-series diagnostic model established in step 2, forming the training data model and the test data model, respectively.

[0156] I. Loss Function Design: The multi-label binary cross-entropy loss function is adopted, which can accurately measure the model's prediction accuracy for multiple fault labels that may exist simultaneously.
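As a minimal sketch, the multi-label binary cross-entropy for one sample with N labels is the mean over labels of −[y·log(p) + (1 − y)·log(1 − p)], where p is the sigmoid output of a head and y is its 0/1 target. The clipping constant `eps` and the function name `multilabel_bce` are assumptions introduced for this illustration.

```python
import math


def multilabel_bce(probs, targets, eps=1e-12):
    """Mean binary cross-entropy over N independent fault labels."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


targets = [1, 0, 1, 0, 0]  # bearing wear and motor winding fault both present
good = multilabel_bce([0.9, 0.1, 0.8, 0.05, 0.1], targets)
bad = multilabel_bce([0.2, 0.7, 0.3, 0.6, 0.5], targets)
print(good < bad)  # True
```

Because each label is scored independently, a sample may carry several positive labels at once, which is what distinguishes this loss from a softmax cross-entropy over mutually exclusive classes.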

[0157] II. Learning using an optimizer: The AdamW optimizer is used, with an initial learning rate set to 1e-4. Cosine annealing is used to dynamically adjust the learning rate to ensure that the model can converge stably and escape local optima during training. The batch size is set to 64.
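One common form of cosine annealing is sketched below in plain Python to show how the learning rate decays from the initial value of 1e-4. The minimum learning rate of 0 and the per-epoch (rather than per-step) schedule are assumptions; this application does not state them, and the function name `cosine_annealed_lr` is hypothetical.

```python
import math


def cosine_annealed_lr(epoch, total_epochs, base_lr=1e-4, min_lr=0.0):
    """Learning rate at a given epoch under a single cosine decay cycle."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)


total_epochs = 50
schedule = [cosine_annealed_lr(e, total_epochs) for e in range(total_epochs + 1)]
print(schedule[0], schedule[25], schedule[50])
```

The rate starts at the initial value, reaches half of it at the midpoint, and falls to the minimum at the final epoch.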

[0158] III. Determine the number of training rounds based on the optimizer's learning and validation results: The number of model training rounds is set between 50 and 100 rounds, and the specific number of rounds is dynamically adjusted based on the validation results on the test data to achieve the best performance.

[0159] Finally, in the fault diagnosis and identification phase:

[0160] I. Independent Diagnosis: For an enhanced sample to be diagnosed (length 64000), it is first divided into 19 signal slices. Then, independent diagnosis is performed on these 19 signal slices, and each slice will output a fault label combination (for example, slice 1 is diagnosed as [bearing wear, motor failure], slice 2 is diagnosed as [bearing wear]).

[0161] II. Voting Mechanism: Diagnostic results from all 19 signal slices are collected. For each fault type, a majority vote is used to determine the final diagnostic result. For example, if 15 out of the 19 slices diagnose "bearing wear," the final diagnostic result will include "bearing wear." This voting mechanism effectively improves the robustness and accuracy of the diagnosis, ultimately outputting a combination of complex fault types, thereby achieving precise identification of complex faults in the subway train transmission system.
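The per-fault-type voting described above can be sketched as follows. Reading "majority" as "predicted by more than half of the slices" is an interpretation made for this illustration, and the function name `majority_vote` and the example slice results are hypothetical.

```python
from collections import Counter


def majority_vote(slice_labels, fault_types):
    """Keep every fault type predicted by more than half of the slices."""
    counts = Counter(
        label for labels in slice_labels for label in set(labels)
    )
    n = len(slice_labels)
    return [fault for fault in fault_types if counts[fault] > n / 2]


FAULT_TYPES = ["bearing wear", "gear damage", "motor winding fault", "sensor fault"]

# 19 hypothetical slice results: 15 report bearing wear, 4 of those also
# report a motor winding fault, and 4 slices report nothing.
slice_labels = (
    [["bearing wear", "motor winding fault"]] * 4
    + [["bearing wear"]] * 11
    + [[]] * 4
)
print(majority_vote(slice_labels, FAULT_TYPES))  # ['bearing wear']
```

Since each fault type is voted on separately, the final result can contain several fault types at once when more than one of them clears the majority threshold.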

[0162] The above descriptions are merely embodiments of this application and are not intended to limit the scope of protection of this application. Various modifications and variations can be made to this application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the scope of protection of this application.

Claims

1. A method for fault diagnosis of multi-channel time-series data based on deep learning, comprising the following steps:

Step 1: Data Acquisition and Processing

I. Multi-channel data merging: Collect signals from multiple measuring points of the monitoring equipment, aggregate them according to the same time step, and merge the signals from multiple channels into the same file to form the original sample;

II. Data augmentation: Randomly select some channels for linear combination to merge signals, increasing data diversity;

III. Data slicing: The data that has not undergone signal merging is combined with the augmented data to form augmented samples, and the augmented samples are then cut into fixed-length segments to form signal slices;

Step 2: Model Building

The model is constructed using a Transformer model with an input layer, an encoding layer, and an output layer:

I. At the input layer: The multiple input channel signals are flattened and expanded to obtain a time-series signal. The expanded time-series signal is then divided into blocks, which are used as the input units of the Transformer. Finally, the sequence blocks formed from each channel are position-encoded;

II. At the encoding layer: The Transformer encoder layer is the base network, which takes sequence blocks from all channels as input and uses self-attention to calculate the interrelationships between sequence blocks. Through self-attention, the Transformer captures the dependencies and correlation features between sequence blocks, thereby enhancing the representation capability for multi-channel data;

III. At the output layer: Initialize N binary classification layers based on the total number of fault categories N, using sigmoid as the activation function. Finally, combine the results of the N binary classification layers to discriminate 2^N combinations of fault types, thereby establishing a multi-channel time-series diagnostic model;

Step 3: Model Training

Input the signal slice data obtained in Step 1 into the multi-channel time-series diagnostic model established in Step 2, and train it according to the following conditions:

I. Loss function design: A multi-label binary cross-entropy loss function is adopted;

II. Learning using an optimizer;

III. Determine the number of training rounds based on the optimizer's learning and validation results;

Step 4: Fault Diagnosis and Identification

I. Independent diagnosis: Perform independent diagnosis for each signal slice;

II. Voting mechanism: Based on the diagnostic results of all signal slices, a vote is taken, the label that appears most frequently is selected as the final diagnostic result, and the combination of composite fault types is output.

2. The method for fault diagnosis of multi-channel time-series data based on deep learning according to claim 1, characterized in that: In Step 3, the signal slice data obtained in Step 1 is divided into two parts, training data and test data, which are input into the multi-channel time-series diagnostic model established in Step 2 to form a training data multi-channel time-series diagnostic model and a test data multi-channel time-series diagnostic model. After the data from these two models are combined, a combination of composite fault types is output through Step 4.

3. The method for fault diagnosis of multi-channel time-series data based on deep learning according to claim 1, characterized in that: The signals from multiple channels are merged into a single file in CSV format. Adjacent signal slices overlap by 50% to obtain more training data.

4. The method for fault diagnosis of multi-channel time-series data based on deep learning according to claim 1, characterized in that: The time-series signal is divided into blocks of size 16, 32, 64, etc., and the position code of each block is represented by 0, 1, 2, etc. to indicate its relative position.

5. The method for fault diagnosis of multi-channel time-series data based on deep learning according to claim 1, characterized in that: The optimizer used is the AdamW optimizer, with an initial learning rate of 1e-4. The learning rate is dynamically adjusted using a cosine annealing strategy. The batch size is set to 64, and the training epochs are 50-100.

6. The method for fault diagnosis of multi-channel time-series data based on deep learning according to claim 1, characterized in that: The position encoding preserves the time step information of the time-series data, and the position information and signal features are input together into the multi-channel time-series diagnostic model.

7. The method for fault diagnosis of multi-channel time-series data based on deep learning according to claim 1, characterized in that: The self-attention calculation employs a multi-head self-attention mechanism to capture long-range dependencies between different channels, thereby enhancing the model's comprehensive processing capability for multi-channel signals.

8. The method for fault diagnosis of multi-channel time-series data based on deep learning according to claim 1, characterized in that: The fault categories can be combined arbitrarily, with each fault type having an independent fault prediction head to prevent interference between model parameters.

9. The method for fault diagnosis of multi-channel time-series data based on deep learning according to claim 1, characterized in that: Each enhanced sample is 64,000 in length, and the signal slice size is 6,400 with an overlap rate of 50% to obtain 19 sample slices. Fault labels are obtained from the fault diagnosis and identification of the 19 sample slices. The final fault category label is determined by majority voting to obtain the combination of composite fault types.