A fault diagnosis model training method and device
By introducing attention weights derived from feature encoding vectors, together with contrastive learning, into the fault diagnosis model, dynamically partitioning samples, and regularizing the loss function, the method addresses the overfitting caused by label noise, improves the diagnostic accuracy and robustness of the model, and achieves efficient fault diagnosis.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHINA TOBACCO ZHEJIANG IND CO LTD
- Filing Date
- 2023-07-24
- Publication Date
- 2026-06-12
AI Technical Summary
Existing data-driven fault diagnosis methods tend to overfit noisy labeled data when faced with label noise, leading to decreased diagnostic accuracy and insufficient feature extraction capabilities, making it difficult to effectively uncover subtle features in fault data.
By introducing attention weights based on feature encoding vector representations, dynamically partitioning samples and regularizing the loss function, and combining contrastive learning and label correction, the model's feature representation ability is optimized, and the impact of label noise on the model is reduced.
It improves the diagnostic accuracy and feature representation ability of the model, enhances the robustness and generalization ability of the model, and can maintain good diagnostic performance even in high noise environment, with an average diagnostic accuracy of 98.0% to 98.2%.
Figure CN116861250B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of fault diagnosis technology, and more specifically, to a fault diagnosis model training method and apparatus.
Background Technology
[0002] Bearings, as key components in mechanical transmission, are widely used in various mechanical equipment, and their health significantly impacts the safety and stability of these devices. However, when equipment operates under harsh environments such as high speed and heavy loads for extended periods, bearings inevitably degrade, developing cracks and wear. Failures directly affect the normal operation of the entire equipment, causing economic losses to the company or even leading to accidents and threatening lives. Therefore, monitoring the health of bearings and promptly eliminating potential safety hazards is of great engineering significance in ensuring the normal operation of mechanical equipment.
[0003] Currently, fault diagnosis methods for rolling bearings are mainly divided into two categories: analytical model-based and data-driven methods. Analytical model-based methods require an analytical expression of the fault diagnosis problem, which is difficult to model for complex systems, and the resulting models have low universality across other systems, limiting their practical application. Data-driven fault diagnosis methods, on the other hand, suffer from insufficient feature extraction capabilities, making it difficult to uncover deeper, subtle features from fault data, thus limiting the improvement of diagnostic accuracy.
[0004] With the rapid rise and widespread adoption of the internet and the Internet of Things (IoT), the growth rate of social data is faster than ever before. Big data provides ample training "raw materials" for deep neural networks, offering new opportunities for in-depth research and application of data-driven intelligent mechanical fault diagnosis. Currently, deep learning-based fault diagnosis methods are widely used in the field of fault diagnosis because they can effectively represent fault information. However, in actual industrial activities, workers lacking professional knowledge are prone to assigning incorrect labels to fault modes. Therefore, labeling errors (i.e., label noise) are unavoidable in real industrial datasets.
[0005] However, most current data-driven fault diagnosis methods rely too heavily on well-labeled datasets. When label noise is present, the model may overfit to the noisy labeled data, resulting in insufficient model feature representation and affecting diagnostic accuracy.
Summary of the Invention
[0006] This application provides a fault diagnosis model training method and apparatus. By introducing attention weights based on the feature encoding vector representation, samples are dynamically partitioned and the loss function is regularized. The method is robust to samples with noisy labels: it reduces the gradient contribution of such samples, prevents the model from overfitting to noisy labels, and improves the model's feature expressiveness and diagnostic accuracy.
[0007] This application provides a method for training a fault diagnosis model, including:
[0008] The input is a first training dataset, in which all training samples have labels, some of which have correct labels and others have incorrect labels.
[0009] Repeat the following steps until the current model training epoch reaches the maximum number of training epochs:
[0010] Transform all training samples in the first training dataset into corresponding first feature encoding vectors;
[0011] Perform label classification on the first feature encoding vectors to obtain the probability that each training sample belongs to each health state;
[0012] Obtain the attention weights of the training samples based on the first feature encoding vectors;
[0013] Calculate the noise attention loss function for the current training round based on the probabilities, attention weights, and labels of all training samples;
[0014] The model parameters are updated based on the noise attention loss function.
[0015] Preferably, the fault diagnosis model training method further includes: performing contrastive learning using the first training dataset to obtain a contrastive loss function;
[0016] The comprehensive loss function for the current training round is obtained based on the noise attention loss function and the contrastive loss function;
[0017] and,
[0018] The model parameters are updated based on the comprehensive loss function.
[0019] Preferably, performing contrastive learning using the first training dataset to obtain a contrastive loss function specifically includes:
[0020] Two different data augmentations are applied to the training samples in the first training dataset to form a data augmentation sample set;
[0021] Contrastive learning is performed on any two data-augmented samples in the data augmentation sample set to obtain the contrastive loss function.
[0022] Preferably, performing contrastive learning on any two data-augmented samples in the data augmentation sample set to obtain the contrastive loss function specifically includes:
[0023] For any sample pair, first, the two data-augmented samples in the pair are transformed into a first feature encoding vector and a second feature encoding vector, respectively; then, the first feature encoding vector and the second feature encoding vector are each mapped to a spatial representation vector; finally, the similarity between the two spatial representation vectors is calculated.
[0024] The mutual information between samples is calculated using the similarity of all sample pairs, and the contrastive loss function is calculated based on the mutual information between all samples.
[0025] Preferably, before calculating the noise attention loss function, the method further includes:
[0026] Determine whether the current model training epoch has reached the epoch for enabling label correction, where the epoch for enabling label correction is less than the maximum number of model training epochs.
[0027] If so, the labels are corrected to form corrected labels, and all training samples with corrected labels form the second training dataset.
[0028] The noise attention loss function for the current training round is calculated based on the probability of all training samples, the attention weights, and the corrected labels corresponding to the training samples.
[0029] Furthermore, the first training dataset is updated to the second training dataset, and the second training dataset is used for subsequent training rounds.
[0030] This application also provides a fault diagnosis model training device, including a training data receiving module, a first transformation module, a classification module, a weight acquisition module, a first loss function calculation module, and a parameter update module;
[0031] The training data receiving module is used to receive the input first training dataset. All training samples in the first training dataset have labels, of which some training samples have correct labels and others have incorrect labels.
[0032] The first transformation module is used to transform all training samples in the first training dataset into corresponding first feature encoding vectors;
[0033] The classification module is used to perform label classification on the first feature encoding vectors and obtain the probability that each training sample belongs to each health state.
[0034] The weight acquisition module is used to obtain the attention weights of the training samples based on the first feature encoding vector;
[0035] The first loss function calculation module is used to calculate the noise attention loss function for the current training round based on the probabilities, attention weights, and labels of all training samples.
[0036] The parameter update module is used to update the model parameters based on the noise attention loss function.
[0037] Preferably, the fault diagnosis model training device further includes a contrastive learning module and a second loss function calculation module;
[0038] The contrastive learning module is used to perform contrastive learning using the first training dataset to obtain a contrastive loss function;
[0039] The second loss function calculation module is used to obtain the comprehensive loss function for the current training round based on the noise attention loss function and the contrastive loss function;
[0040] Furthermore, the parameter update module is used to update the model parameters based on the comprehensive loss function.
[0041] Preferably, the contrastive learning module includes a data augmentation module and a sample pair contrastive learning module;
[0042] The data augmentation module is used to perform two different data augmentations on all training samples in the first training dataset to form a data augmentation sample set;
[0043] The sample pair contrastive learning module is used to perform contrastive learning on any two data augmented samples within the data augmentation sample set to obtain the contrastive loss function.
[0044] Preferably, the sample pair contrastive learning module includes a second transformation module, a mapping module, a similarity calculation module, and a contrastive loss function calculation module;
[0045] The second transformation module is used to transform the two data augmented samples in the sample pair into the first feature encoding vector and the second feature encoding vector, respectively.
[0046] The mapping module is used to map the first feature encoding vector and the second feature encoding vector into spatial representation vectors, respectively;
[0047] The similarity calculation module is used to calculate the similarity between two spatial representation vectors;
[0048] The contrastive loss function calculation module is used to calculate the mutual information between samples using the similarity of all sample pairs, and to calculate the contrastive loss function based on the mutual information between all samples.
[0049] Preferably, the fault diagnosis model training device further includes a judgment module, a label correction module, and a dataset update module;
[0050] The judgment module is used to determine whether the current model training round has reached the round for enabling label correction;
[0051] The label correction module is used to correct the labels when the current model training round reaches the label correction activation round, forming corrected labels. All training samples with corrected labels form the second training dataset.
[0052] The dataset update module is used to update the first training dataset to the second training dataset, and then use the second training dataset for subsequent training rounds.
[0053] The first loss function calculation module is used to calculate the noise attention loss function for the current training round based on the probabilities and attention weights of all training samples and the corrected labels corresponding to the training samples.
[0054] Other features and advantages of this application will become clear from the following detailed description of exemplary embodiments with reference to the accompanying drawings.
Description of the Drawings
[0055] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments of the present application and, together with their description, serve to explain the principles of the present application.
[0056] Figure 1 is a flowchart of a preferred embodiment of the fault diagnosis model training method provided in this application;
[0057] Figure 2 is a structural diagram of one embodiment of the feature encoding module provided in this application;
[0058] Figure 3 is a flowchart of the contrastive learning process provided in this application;
[0059] Figure 4 is a structural diagram of an embodiment of the fault diagnosis model training system provided in this application;
[0060] Figure 5 is a comparison chart of label noise rates before and after label correction provided in this application;
[0061] Figure 6 is a comparison chart of the classification performance of the model training method of this application and other model training methods;
[0062] Figure 7 is a structural diagram of the fault diagnosis model training device provided in this application.
Detailed Implementation
[0063] Various exemplary embodiments of the present application will now be described in detail with reference to the accompanying drawings. It should be noted that, unless otherwise specifically stated, the relative arrangement, numerical expressions, and values of the components and steps set forth in these embodiments do not limit the scope of the present application.
[0064] The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit the scope of this application and its application or use.
[0065] Techniques, methods, and equipment known to those skilled in the art may not be discussed in detail, but where appropriate, they should be considered part of the specification.
[0066] In all the examples shown and discussed herein, any specific values should be interpreted as merely exemplary and not as limitations. Therefore, other examples of exemplary embodiments may have different values.
[0067] This application provides a fault diagnosis model training method and apparatus. By introducing attention weights based on the feature encoding vector representation, dynamic weight allocation and sample partitioning are achieved to regularize the loss function, reducing the gradient contribution of samples with noisy labels and preventing the model from overfitting to noisy labels. This effectively addresses fault diagnosis with mislabeled data and improves the model's feature expressiveness and diagnostic accuracy. Furthermore, this application optimizes the feature space through contrastive learning, narrowing the distance between the mapped features of positive sample pairs and widening the feature distance between negative sample pairs, which further enhances the model's feature expressiveness and reduces the negative impact of label noise. Additionally, in the later stages of training, this application integrates model predictions with the original noisy labels to perform label correction, constructing a training dataset with a low noise rate, which further improves the model's generalization ability and yields better feature representations.
[0068] As shown in Figure 1, the fault diagnosis model training method provided in this application includes:
[0069] S110: Receive the first training dataset as input. All training samples (fault signal samples) in the first training dataset are labeled, with some training samples having correct labels and others having incorrect labels. As an example, the first training dataset consists of vibration signals of rolling bearings in different states under the same operating condition, some of which are incorrectly labeled.
[0070] The first training dataset is represented as $D = \{(x^{[i]}, \tilde{y}^{[i]})\}$, where $x^{[i]}$ is a training sample and $\tilde{y}^{[i]}$ is the label of training sample $x^{[i]}$. Note that $\tilde{y}^{[i]}$ may be a correct label or an incorrect label (i.e., label noise).
[0071] S120: Transform all training samples in the first training dataset into the corresponding first feature encoding vector.
[0072] Specifically, the training samples are transformed into the corresponding first feature encoding vectors by the feature encoding module $f(\cdot;\theta_f)$, where $\theta_f$ is its parameter set (see Figure 4). $f(\cdot;\theta_f)$ maps the information sequence of an input training sample (e.g., the vibration signal sequence of a rolling bearing) to a high-dimensional feature embedding space, i.e., $f: x \to \mathbb{R}^{f}$. The first feature encoding vector corresponding to training sample $x^{[i]}$ is denoted $h^{[i]}$.
[0073] As an example, the feature encoding module adopts a ResNet architecture, as shown in Figure 2. The ResNet architecture adds residual blocks to a conventional deep neural network to simplify network complexity and address the problem of network degradation. Residual blocks make it possible to train deep networks more efficiently, because inputs can propagate forward more quickly through the identity mapping connections within the residual blocks.
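The identity mapping connection mentioned above can be illustrated with a minimal sketch. The names `residual_block` and `zero_branch` are hypothetical, and the learned branch is replaced by an arbitrary function; the actual layers of Figure 2 are not specified here.

```python
def residual_block(x, transform):
    """Return x + transform(x), i.e. the identity shortcut of a residual block."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("the identity shortcut requires matching lengths")
    return [xi + fi for xi, fi in zip(x, fx)]


def zero_branch(x):
    """Stand-in for a learned branch that currently contributes nothing."""
    return [0.0 for _ in x]


signal = [0.5, -1.0, 2.0]
# With a zero branch, the input passes through the block unchanged.
out = residual_block(signal, zero_branch)
```

When the learned branch outputs zeros, the block reduces to the identity, which is why stacking such blocks does not by itself degrade the signal.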
[0074] S130: Perform label classification on the first feature encoding vector to obtain the probability that the training sample belongs to each health state.
[0075] Specifically, label classification of the first feature encoding vector is performed by the label classification decoding module $c(\cdot;\theta_c)$ (see Figure 4). $c(\cdot;\theta_c)$ takes the first feature encoding vector $h^{[i]}$ obtained from the above high-dimensional embedding space as input and maps it to the equipment health state space, outputting the probability distribution over health states $p = \mathrm{softmax}(v)$, i.e., $c: f(x) \to \mathbb{R}^{M}$, where $v$ is the classification value output by the label classification decoding module. For a given training sample $x^{[i]}$, the probability $p_k^{[i]}$ that it belongs to the $k$-th health state is:
[0076] $p_k^{[i]} = \dfrac{\exp\left(v_k^{[i]}\right)}{\sum_{j=1}^{M} \exp\left(v_j^{[i]}\right)}$ (1)
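The softmax mapping from classification values to health state probabilities can be sketched as follows. The function name `softmax` and the subtraction of the maximum value for numerical stability are illustrative choices, not details taken from this application.

```python
import math


def softmax(logits):
    """Map classification values v to a probability distribution p."""
    peak = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Three hypothetical health states; the largest value gets the largest probability.
p = softmax([1.0, 2.0, 3.0])
```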
[0077] S140: Obtain the attention weights of the training samples based on the first feature encoding vector.
[0078] The early-learning phenomenon indicates that deep learning models tend to memorize correctly labeled samples before fitting mislabeled samples. Therefore, in the early learning stages, correctly labeled samples are more likely to have better learned feature representations than mislabeled samples. To enable the deep neural network to generate attention weights that reflect the quality of the learned representations, this application introduces an attention weight branch after the feature encoding module $f(\cdot;\theta_f)$ (see Figure 4). The attention weight branch consists of a fully connected layer that outputs a scalar $e^{[i]} = W h^{[i]} + b$, which is then scaled to between 0 and 1 by the sigmoid function, where $W$ denotes the branch weight and $b$ the corresponding bias.
[0079] For each training sample $x^{[i]}$, the output attention weight $\omega^{[i]}$ is:
[0080] $\omega^{[i]} = \mathrm{sigmoid}\left(e^{[i]}\right) = \dfrac{1}{1 + \exp\left(-e^{[i]}\right)}$ (2)
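A minimal sketch of the attention weight branch, assuming a single fully connected layer with weight vector `W` and bias `b` acting on a feature encoding vector `h`. The function names and the example values are hypothetical.

```python
import math


def sigmoid(e):
    """Numerically stable logistic function."""
    if e >= 0:
        return 1.0 / (1.0 + math.exp(-e))
    z = math.exp(e)
    return z / (1.0 + z)


def attention_weight(h, W, b):
    """Scalar e = W.h + b squashed into the interval (0, 1)."""
    e = sum(w_j * h_j for w_j, h_j in zip(W, h)) + b
    return sigmoid(e)


# Hypothetical 2-dimensional feature encoding vector and branch parameters.
w = attention_weight([1.0, 2.0], [0.5, -0.25], 0.1)
```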
[0081] S150: Calculate the noise attention loss function for the current training round.
[0082] As an example, the noise attention loss function for the current training round is calculated based on the probabilities, attention weights, and labels of all training samples.
[0083] In classification tasks, the cross-entropy $L_{ce}$ is commonly used as the empirical risk loss function. It measures how well the model predictions fit the labels and is used to update and optimize the model parameters during backpropagation:
[0084] $L_{ce} = -\dfrac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{M} \tilde{y}_k^{[i]} \log p_k^{[i]}$ (3)
[0085] In the formula, $N$ is the batch size for model training, and $\tilde{y}_k^{[i]}$ is the $k$-th component of the one-hot label of training sample $x^{[i]}$.
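For one-hot labels, the cross-entropy reduces to the negative log-probability of the labeled class, averaged over the batch. The sketch below assumes labels are given as class indices; the function name is illustrative.

```python
import math


def cross_entropy(probs, labels):
    """probs: N probability vectors; labels: N class indices (one-hot positions)."""
    n = len(probs)
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / n


# A maximally uncertain two-class prediction costs log(2).
loss = cross_entropy([[0.5, 0.5]], [0])
```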
[0086] To enable the attention weights to automatically capture these differences in representation, this application introduces them into equation (3) as a regularization and proposes the noise attention loss function $L_{NAL}$, composed of an attention term $L_a$ and an enhancement term $L_b$:
[0087]
[0088] In the formula, λ is an adjustable hyperparameter. The prediction term inside the attention term $L_a$ is a weighted combination of the model prediction and the original sample label. For correctly labeled samples, the attention weight output by the model tends towards 1, and $L_a$ degenerates into the cross-entropy loss function of equation (3). For mislabeled samples, the attention weight tends towards 0, reducing the gradient contribution they cause.
[0089] On the one hand, in the early learning stage, the model has not yet overfitted to mislabeled samples (i.e., samples with noisy labels). Compared with correctly labeled samples, it therefore cannot yet represent mislabeled samples effectively, so the predicted health state fits the corresponding labeled health state poorly. Minimizing $L_a$ thus drives the attention weights that the model outputs for mislabeled samples towards 0. On the other hand, because the model fits the correctly labeled samples first, the difference between their predictions and their labels tends to 0, at which point the attention weights have no influence on minimizing $L_a$. Therefore, this application introduces the enhancement term $L_b$ to prevent the model from outputting weights of 0 for all samples. $L_b$ can be viewed as a binary cross-entropy loss function in which the target value is 1 for all inputs, so that the weights of correctly labeled samples can effectively approach 1.
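The expression of equation (4) is not reproduced in the text, so the sketch below assumes one form that matches the description: the attention term takes the logarithm of a weighted mix of prediction and one-hot label, `w * (p - y) + y`, and the enhancement term is a binary cross-entropy with target 1 on the attention weight `w`. The function name `nal_loss`, the symbol `w`, and the default `lam=0.5` are assumptions.

```python
import math


def nal_loss(probs, labels, weights, lam=0.5):
    """Assumed form: L_NAL = L_a + lam * L_b, averaged over the batch.

    probs: N probability vectors; labels: N class indices;
    weights: N attention weights in (0, 1].
    """
    n = len(probs)
    l_a = 0.0
    l_b = 0.0
    for p, y, w in zip(probs, labels, weights):
        # For the labeled class the one-hot label is 1, so the mix is w*(p-1)+1.
        l_a -= math.log(w * (p[y] - 1.0) + 1.0)
        # Binary cross-entropy with target 1 pushes the weight towards 1.
        l_b -= math.log(w)
    return l_a / n + lam * l_b / n
```

Under this assumption, a weight of 1 recovers the cross-entropy of equation (3), while a weight near 0 makes the attention term vanish regardless of how poorly the prediction fits the label.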
[0090] For simplicity, this application rewrites the noise attention loss function $L_{NAL}$ proposed in equation (4) in a simplified form and performs gradient analysis on it to further demonstrate its effectiveness:
[0091]
[0092] Here, the scaling factor is set.
[0093] Compared with the cross-entropy loss function $L_{ce}$, $L_{NAL}$ performs gradient reweighting by introducing a scaling factor, which reduces the impact of samples with noisy labels. The scaling factor increases monotonically with the attention weight. For correctly labeled samples, the cross-entropy gradient term $(p_j - 1)$ tends to 0 after the early learning phase, so the gradients of mislabeled samples come to dominate, which can easily lead to overfitting of the model to mislabeled samples. Introducing the scaling factor addresses this: for mislabeled samples, the scaling factor tends to 0 under the influence of the attention weights, which effectively reduces the gradient contribution of mislabeled samples and prevents them from dominating gradient updates.
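The gradient reweighting can be checked numerically under an assumed form of the attention term, `-log(w * (p_y - 1) + 1)`, for a sample labeled `y` with attention weight `w`. Differentiating this form with respect to the classification values gives the cross-entropy gradient `(p_k - y_k)` multiplied by a scaling factor `w * p_y / (w * (p_y - 1) + 1)`. Both the assumed loss form and the helper names are illustrative, not taken from the equations of this application.

```python
import math


def softmax(v):
    peak = max(v)
    exps = [math.exp(x - peak) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def attention_term(v, y, w):
    """Assumed per-sample attention term for classification values v."""
    p = softmax(v)
    return -math.log(w * (p[y] - 1.0) + 1.0)


def scaling_factor(w, p_y):
    """Factor that rescales the cross-entropy gradient under the assumed form."""
    return w * p_y / (w * (p_y - 1.0) + 1.0)


def numeric_grad(v, y, w, k, eps=1e-6):
    """Central finite difference of the attention term with respect to v[k]."""
    up = list(v)
    up[k] += eps
    down = list(v)
    down[k] -= eps
    return (attention_term(up, y, w) - attention_term(down, y, w)) / (2 * eps)
```

The factor is 0 when the weight is 0 and approaches 1 when the weight is 1, so samples that receive low attention weights contribute little to the parameter update.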
[0094] S1100: Update model parameters based on the comprehensive loss function. If the current model training epoch has not reached the maximum number of training epochs, return to S120 after executing S1100; otherwise, end training.
[0095] As an example, the noise attention loss function is used as the comprehensive loss function, and the model parameters are updated based on the comprehensive loss function.
[0096] Based on the above, preferably, the fault diagnosis model training method further includes:
[0097] S160: Perform contrastive learning using the first training dataset to obtain the contrastive loss function.
[0098] S170: Obtain the comprehensive loss function for the current training round based on the noise attention loss function and the contrastive loss function.
[0099] Furthermore, in step S1100, the model parameters are updated based on the comprehensive loss function.
[0100] As an example, in S160, contrastive learning is performed using the first training dataset to obtain a contrastive loss function, specifically including:
[0101] S1601: Perform two different data augmentations on the training samples in the first training dataset to form a data augmentation sample set.
[0102] As an example, a batch of N training samples is randomly sampled from the first training dataset. As shown in Figure 4, two different data augmentation methods $t_a$ and $t_b$ are applied to each training sample in the batch to obtain a data augmentation sample set (2N samples in total), comprising the first data augmentation sample set obtained with the first data augmentation method $t_a$ and the second data augmentation sample set obtained with the second data augmentation method $t_b$.
[0103] As an example, as shown in Figure 3, the data-augmented samples obtained by applying the two different data augmentation methods $t_a$ and $t_b$ to a training sample $x$ are $x_a$ and $x_b$.
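The construction of the 2N-sample set can be sketched as follows. The two augmentations used here, amplitude scaling and a circular time shift, are hypothetical stand-ins for $t_a$ and $t_b$, since the specific augmentation methods are not named; the layout that places the two views of sample k at positions k and k + N is likewise an assumption.

```python
def scale_amplitude(signal, factor=1.1):
    """Hypothetical augmentation t_a: rescale the signal amplitude."""
    return [factor * s for s in signal]


def circular_shift(signal, steps=2):
    """Hypothetical augmentation t_b: rotate the signal in time."""
    steps %= len(signal)
    return signal[-steps:] + signal[:-steps]


def augment_batch(batch, t_a, t_b):
    """Return 2N views: views[k] and views[k + N] come from the same sample."""
    return [t_a(x) for x in batch] + [t_b(x) for x in batch]


batch = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, -1.0]]
views = augment_batch(batch, scale_amplitude, circular_shift)
```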
[0104] S1602: Perform contrastive learning on any two data augmentation samples in the data augmentation sample set to obtain the contrastive loss function.
[0105] For a given data-augmented sample $x_a^{[k]}$ (obtained from training sample $x^{[k]}$ through data augmentation method $t_a$), sample pairs can be formed with the remaining 2N-1 samples in the data augmentation sample set, where $(x_a^{[k]}, x_b^{[k]})$ is the positive sample pair ($x_b^{[k]}$ being obtained from training sample $x^{[k]}$ through data augmentation method $t_b$), and the pairs formed with the remaining 2N-2 samples (obtained from training samples other than $x^{[k]}$ through data augmentation method $t_a$ or $t_b$) are negative sample pairs.
[0106] As shown in Figure 4, performing contrastive learning on any two data-augmented samples in the data augmentation sample set to obtain the contrastive loss function specifically includes:
[0107] P1: For any sample pair, first, the two data-augmented samples in the pair are transformed into a first feature encoding vector and a second feature encoding vector, respectively; then, the first feature encoding vector and the second feature encoding vector are each mapped to a spatial representation vector; finally, the similarity between the two spatial representation vectors is calculated.
[0108] Taking a negative sample pair as an example, first, with reference to S120, the feature encoding module $f(\cdot;\theta_f)$ extracts feature representations from the two data-augmented samples to obtain the corresponding feature encoding vectors:
[0109]
[0110] Then, the projection layer $g(\cdot;\theta_g)$ maps the first and second feature encoding vectors to the unit hypersphere space to obtain the corresponding spatial representation vectors:
[0111]
[0112] Finally, in the unit hypersphere vector space, cosine similarity is used to measure the similarity between two spatial representation vectors. For each pair of features (k, j), where k∈{1,2,3,...,N} and j∈{1,2,3,...,N}, the cosine similarity is calculated as:
[0113] $\mathrm{sim}\left(z^{[k]}, z^{[j]}\right) = \dfrac{z^{[k]\top} z^{[j]}}{\left\lVert z^{[k]} \right\rVert \left\lVert z^{[j]} \right\rVert}$ (8)
[0114] As an example, as shown in Figure 3, the positive sample pair $(x_a, x_b)$ passes through the feature encoding module $f(\cdot;\theta_f)$ to obtain the first feature encoding vector $v_a$ and the second feature encoding vector $v_b$, which then pass through the projection layer $g(\cdot;\theta_g)$ to obtain the spatial representation vectors $z_a$ and $z_b$, respectively. Finally, the similarity between the two is calculated.
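The similarity step can be sketched as follows, assuming the projection output is L2-normalized onto the unit hypersphere before the dot product is taken. The function names are illustrative.

```python
import math


def l2_normalize(z):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(c * c for c in z))
    return [c / norm for c in z]


def cosine_similarity(z_a, z_b):
    """Dot product of the two unit-normalized representation vectors."""
    u, v = l2_normalize(z_a), l2_normalize(z_b)
    return sum(a * b for a, b in zip(u, v))
```

Parallel vectors score 1, orthogonal vectors score 0, and opposite vectors score -1, independent of their original lengths.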
[0115] P2: Calculate the mutual information between samples using the similarity of all sample pairs, and calculate the contrastive loss function based on the mutual information between all samples.
[0116] As an example, for any first data-augmented sample, the mutual information between samples is:
[0117]
[0118] For any second data-augmented sample, the mutual information between samples is:
[0119]
[0120] In equations (9) and (10), $\mathbb{1}_{[j \neq i]}$ is an indicator function, taking a value of 1 when j≠i and 0 otherwise, and τ is the temperature coefficient of the contrastive loss.
[0121] As an example, InfoNCE is used as the contrastive loss function $L_c$:
[0122]
[0123] Therefore, in S170, the comprehensive loss function L is:
[0124] $L = L_{NAL} + \lambda_c L_c$ (12)
[0125] In the formula, $\lambda_c$ is the balance coefficient of the contrastive loss.
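A minimal sketch of an InfoNCE-style contrastive loss and of the combination in equation (12). It assumes the common formulation in which each of the 2N unit-normalized views is scored against all other views with a temperature-scaled dot product, and that views k and k + N belong to the same training sample. The function names and the default `temperature=0.5` are assumptions, not values from this application.

```python
import math


def info_nce(z, temperature=0.5):
    """z: 2N unit vectors; z[k] and z[k + N] are the two views of sample k."""
    two_n = len(z)
    n = two_n // 2
    total = 0.0
    for i in range(two_n):
        sims = [sum(a * b for a, b in zip(z[i], z[j])) / temperature
                for j in range(two_n)]
        positive = (i + n) % two_n
        # The indicator j != i removes the view's similarity with itself.
        denom = sum(math.exp(sims[j]) for j in range(two_n) if j != i)
        total -= math.log(math.exp(sims[positive]) / denom)
    return total / two_n


def comprehensive_loss(l_nal, l_c, lambda_c):
    """Equation (12): L = L_NAL + lambda_c * L_c."""
    return l_nal + lambda_c * l_c


aligned = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
mismatched = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
```

The loss is lower for `aligned`, where the two views of each sample coincide, than for `mismatched`, where each view is closest to a view of a different sample.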
[0126] In the fault diagnosis model training method of this application, the overall training objective is to minimize the loss function (e.g., the comprehensive loss function L here) using gradient descent. The Adam optimizer can be used for gradient descent.
[0127] By using contrastive learning loss, it is possible to make samples with the same health status in the high-dimensional feature embedding space more closely distributed in space, and further widen the distance between samples with different health statuses, thereby achieving discriminative feature enhancement and improving the accuracy of the integrated pseudo-labels in label correction (see the following description).
[0128] Based on the above, preferably, before S150, the following steps are also included:
[0129] S180: Determine whether the current model training epoch has reached the epoch where label correction is enabled. If yes, execute S190 to perform label correction in each epoch; otherwise, execute S150.
[0130] S190: Correct the labels of the training samples to form corrected labels. All training samples with corrected labels form the second training dataset. Then, execute S150, in which the noise attention loss function for the current training round is calculated based on the probabilities of all training samples, attention weights, and the corrected labels corresponding to the training samples.
[0131] Specifically, the labels are corrected using the Label Correction (LC) module to obtain a second training dataset with lower label noise rate:
[0132]
[0133] Where $\hat{y}^{[t]}$ is the label of the training sample in the $t$-th model training epoch; $\hat{y}^{[t-1]}$ is the label of the training sample in the $(t-1)$-th model training epoch; $\tilde{y}$ is the label of the training sample in the first training dataset; $E_s$ is the epoch at which label correction is enabled; $m$ is the difference between the current model training epoch and the epoch at which label correction is enabled; and α is the pseudo-label update momentum, with a value in [0, 1). When $t \ge E_s$, the first term of the formula is the original noisy label weighted by the exponentially decaying factor $\alpha^m$, which allows the model's label data to be iterated more smoothly and mitigates cognitive bias. The second term is the ensemble prediction term, an exponential moving average of the model's predicted values up to epoch $E_s + m$. As the number of model iterations increases, $\alpha^m$ gradually approaches 0 and the model's prediction target ultimately depends on the ensemble prediction term, so that a more complete training label set is constructed and introduced into $L_{NAL}$, replacing the original noisy labels for model training.
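The label correction described above can be sketched as a momentum update applied once per epoch after correction is enabled. The recursion used here, new target = α × previous target + (1 - α) × current prediction, is an assumption chosen because unrolling it yields the original label weighted by α^m plus an exponential moving average of predictions; equation (13) itself is not reproduced in the text. The function name and `alpha=0.9` are hypothetical.

```python
def correct_label(prev_target, prediction, alpha=0.9):
    """One momentum update of the soft training target."""
    return [alpha * t + (1.0 - alpha) * p
            for t, p in zip(prev_target, prediction)]


noisy_label = [1.0, 0.0]   # original (possibly wrong) one-hot label
prediction = [0.0, 1.0]    # model prediction, held fixed here for clarity

target = noisy_label
history = []
for _ in range(7):          # seven epochs after correction is enabled
    target = correct_label(target, prediction)
    history.append(target)
```

With a fixed prediction, the weight remaining on the original label after m updates is α^m, so the target gradually moves from the noisy label to the ensemble prediction.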
[0134] In this preferred embodiment, the fault diagnosis model training method further includes:
[0135] S1110: Update the first training dataset to the second training dataset. Then return to S120 (in the preferred embodiment of contrastive learning, also return to S160) to perform subsequent rounds of training using the second training dataset.
[0136] It should be noted that this application does not restrict the order of S1100 and S1110, and the two steps can be performed simultaneously.
[0137] Figure 5 A comparison plot of the label noise rates is shown for the first training dataset (a) with a 90% label noise rate and the second training dataset (b) obtained after label correction, where the values on the diagonal represent scores with correct labels. Figure 5 It can be seen that the label noise rate is significantly reduced after label correction.
[0138] The classification performance obtained with the model training method of this application, which combines weight allocation, label correction, and contrastive learning, is shown in Figure 6(d); the comparison with other training methods is shown in Figure 6. Figure 6 takes the five-class classification of fault signals as an example. Figure 6(a) shows the classification result using the cross-entropy loss function (CE). As can be seen from the figure, the classification results of multiple fault types overlap and no clear boundary is obtained. Figure 6(b) shows the classification result using the symmetric cross-entropy loss function (SCE), which combines cross-entropy and reverse cross-entropy. As can be seen from the figure, Class 0 and Class 4 faults can be distinguished from the other three classes, but the other three classes overlap. Figure 6(c) shows the classification result using early-learning regularization (ELR), in which the model integrates the predictions from multiple previous iterations as a regularization term and introduces it into the loss function for training. As can be seen from the figure, the five classes generally have their own boundaries, but there is still a small amount of overlap between Class 2 and Class 3. Figure 6(d) shows that, with the model training method of this application, the five classes of faults have clear boundaries.
[0139] It should be noted that after the model training is completed, the fault diagnosis model includes the aforementioned feature encoding module and label classification decoding module. After the fault signal is input into the fault diagnosis model, the model first converts the fault signal into a feature encoding vector, and then inputs the feature encoding vector into the label classification decoding module to obtain the probability that the fault signal belongs to each health state.
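The two-stage inference described above can be sketched as follows. The trained feature encoding module is a learned network; here it is replaced by a hypothetical stand-in (`encode`) that computes three simple statistics, and the label classification decoding module is assumed to be a linear layer followed by softmax. All names, weights, and the input signal are illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax returning probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def encode(signal):
    """Hypothetical stand-in for the feature encoding module: mean, RMS, peak."""
    n = len(signal)
    mean = sum(signal) / n
    rms = math.sqrt(sum(v * v for v in signal) / n)
    peak = max(abs(v) for v in signal)
    return [mean, rms, peak]


def classify(features, weights, biases):
    """Assumed decoding module: one linear layer, then softmax over health states."""
    logits = [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(logits)


signal = [0.0, 1.0, 0.0, -1.0]
weights = [[0.0, 1.0, 1.0], [0.0, -1.0, -1.0]]  # two health states, three features
biases = [0.0, 0.0]

probabilities = classify(encode(signal), weights, biases)
print(probabilities)
```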
[0140] Based on the above-mentioned fault diagnosis model training method, this application also provides a fault diagnosis model training device. As shown in Figure 7, the fault diagnosis model training device includes a training data receiving module 710, a first transformation module 720, a classification module 730, a weight acquisition module 740, a first loss function calculation module 750, and a parameter update module 760.
[0141] The training data receiving module 710 is used to receive the input first training dataset. All training samples in the first training dataset have labels, of which some training samples have correct labels and others have incorrect labels.
[0142] The first transformation module 720 is used to transform all training samples in the first training dataset into corresponding first feature encoding vectors.
[0143] The classification module 730 is used to classify the first feature encoding vector by label to obtain the probability that the training sample belongs to each health state.
[0144] The weight acquisition module 740 is used to obtain the attention weights of the training samples based on the first feature encoding vector.
[0145] The first loss function calculation module 750 is used to calculate the noise attention loss function for the current training round based on the probabilities of all training samples, the attention weights, and the labels of the training samples.
[0146] The parameter update module 760 is used to update the model parameters based on the noise attention loss function.
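The interaction between modules 740 and 750 can be illustrated with a simplified sketch. It assumes that the attention weight is the sigmoid of a linear score on the feature encoding vector and that the loss is a cross-entropy scaled per sample by that weight. This is an illustrative simplification and not the noise attention loss formula of this application; the function names and values are hypothetical.

```python
import math


def attention_weight(features, w, b):
    """Hypothetical attention branch: sigmoid of a linear score, value in (0, 1)."""
    score = sum(wi * fi for wi, fi in zip(w, features)) + b
    return 1.0 / (1.0 + math.exp(-score))


def weighted_cross_entropy(probs, labels, weights, eps=1e-12):
    """Mean cross-entropy in which each sample is scaled by its attention weight."""
    total = 0.0
    for p, y, a in zip(probs, labels, weights):
        total += a * -math.log(p[y] + eps)
    return total / len(probs)


# Two samples, two health states. The second sample carries a wrong label (0),
# and the model already predicts class 1 for it with probability 0.8.
probs = [[0.9, 0.1], [0.2, 0.8]]
labels = [0, 0]

loss_uniform = weighted_cross_entropy(probs, labels, [1.0, 1.0])
loss_attended = weighted_cross_entropy(probs, labels, [1.0, 0.1])
print(loss_uniform, loss_attended)
```

Down-weighting the suspect sample shrinks its contribution to the loss and therefore to the gradient, which is the effect attributed to the attention weights.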
[0147] Preferably, the fault diagnosis model training device further includes a contrastive learning module 770 and a second loss function calculation module 780.
[0148] The contrastive learning module 770 is used to perform contrastive learning using the first training dataset to obtain a contrastive loss function.
[0149] The second loss function calculation module 780 is used to obtain the comprehensive loss function of the current training round based on the noise attention loss function and the contrastive loss function.
[0150] Furthermore, the parameter update module 760 is used to update the model parameters based on the comprehensive loss function.
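For illustration only, one common way to combine two such terms is a weighted sum. The symbols $L$, $L_{CL}$, and $\lambda$ below are assumed here, with $\lambda$ standing for the balance coefficient of the contrastive loss; the exact form used by this application is defined by its own formula.

```latex
L = L_{NAL} + \lambda \, L_{CL}, \qquad \lambda \ge 0
```

Under this assumed form, $\lambda = 0$ reduces the update to the noise attention loss alone, and larger values of $\lambda$ give the contrastive term more influence on the parameter update.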
[0151] Preferably, the contrastive learning module 770 includes a data augmentation module 7701 and a sample pair contrastive learning module 7702.
[0152] The data augmentation module 7701 is used to perform two different data augmentations on all training samples in the first training dataset to form a data augmentation sample set.
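A minimal sketch of this step is given below. The two augmentation methods used here, additive Gaussian noise and random amplitude scaling, are assumptions chosen for illustration, since this paragraph does not name the methods. The function names, the noise level, and the scaling range are hypothetical.

```python
import random


def add_gaussian_noise(signal, sigma, rng):
    """First assumed augmentation: add zero-mean Gaussian noise to every point."""
    return [v + rng.gauss(0.0, sigma) for v in signal]


def scale_amplitude(signal, low, high, rng):
    """Second assumed augmentation: multiply the signal by one random factor."""
    factor = rng.uniform(low, high)
    return [v * factor for v in signal]


def augment_batch(batch, rng):
    """Return 2N augmented samples; entries i and i + N come from the same sample."""
    view_a = [add_gaussian_noise(x, 0.05, rng) for x in batch]
    view_b = [scale_amplitude(x, 0.8, 1.2, rng) for x in batch]
    return view_a + view_b


rng = random.Random(0)  # fixed seed so the sketch is repeatable
batch = [[1.0, 0.5, -0.5], [2.0, -1.0, 0.0]]
augmented = augment_batch(batch, rng)
print(len(augmented))
```

Keeping the two views of a sample at indices `i` and `i + N` makes it easy to identify the positive pair in a later contrastive step.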
[0153] The sample pair contrastive learning module 7702 is used to perform contrastive learning on any two data augmented samples in the data augmentation sample set to obtain the contrastive loss function.
[0154] Preferably, the sample pair contrastive learning module 7702 includes a second transformation module, a mapping module, a similarity calculation module, and a contrastive loss function calculation module.
[0155] The second transformation module is used to transform the two data augmentation samples in the sample pair into the first feature encoding vector and the second feature encoding vector, respectively.
[0156] The mapping module is used to map the first feature encoding vector and the second feature encoding vector into spatial representation vectors, respectively.
[0157] The similarity calculation module is used to calculate the similarity between two spatial representation vectors.
[0158] The contrastive loss function calculation module is used to calculate the mutual information between samples using the similarity of all sample pairs, and to calculate the contrastive loss function based on the mutual information between all samples.
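The similarity and loss calculation can be sketched as follows. The claims name cosine similarity on the unit hypersphere, a temperature coefficient, and InfoNCE; the specific normalized form below, in which each sample is contrasted against all other samples in the set, and the convention that `z[i]` and `z[i + N]` form the positive pair are assumptions. The function names and the example vectors are hypothetical, and the input vectors are assumed to be non-zero.

```python
import math


def l2_normalize(v):
    """Project a non-zero vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(u, v):
    """Cosine similarity, computed as the dot product of the normalized vectors."""
    u, v = l2_normalize(u), l2_normalize(v)
    return sum(a * b for a, b in zip(u, v))


def info_nce(z, temperature):
    """InfoNCE over 2N projected vectors; z[i] and z[(i + N) % 2N] are positives."""
    total = len(z)
    n = total // 2
    loss = 0.0
    for i in range(total):
        pos = (i + n) % total
        sims = [cosine_similarity(z[i], z[k]) / temperature for k in range(total)]
        # The indicator function excludes the sample itself from the denominator.
        denom = sum(math.exp(sims[k]) for k in range(total) if k != i)
        loss += -math.log(math.exp(sims[pos]) / denom)
    return loss / total


# N = 2: in `aligned`, each positive pair points in nearly the same direction.
aligned = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0]]
swapped = [[1.0, 0.0], [0.0, 1.0], [0.1, 1.0], [1.0, 0.1]]

loss_aligned = info_nce(aligned, temperature=0.5)
loss_swapped = info_nce(swapped, temperature=0.5)
print(loss_aligned, loss_swapped)
```

The loss is lower when positive pairs are close and negative pairs are far apart, which is the property the contrastive loss function is meant to reward.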
[0159] Preferably, the fault diagnosis model training device further includes a judgment module 790, a label correction module 7100, and a dataset update module 7110.
[0160] The judgment module 790 is used to determine whether the current model training round has reached the label correction activation round.
[0161] The label correction module 7100 is used to correct the labels when the current model training round reaches the label correction activation round, forming corrected labels. All training samples with corrected labels form a second training dataset.
[0162] The dataset update module 7110 is used to update the first training dataset to the second training dataset, and to use the second training dataset for subsequent training rounds.
[0163] In this case, the first loss function calculation module 750 is used to calculate the noise attention loss function for the current training round based on the probabilities of all training samples, the attention weights, and the corrected labels corresponding to the training samples.
[0164] This application performs weight allocation and label correction based on the early-learning characteristics of the model, and introduces contrastive learning to enhance the representational ability of the model. No additional training subset is required, and the model maintains good generalization performance even at high noise rates. First, exploiting the fact that the model fits correctly labeled samples first, so that the feature representation ability of the samples stays consistent with the accuracy of their labels, an attention weight branch is designed and introduced into the loss function to partition the samples for regularization. This assigns greater weight to correctly labeled samples and less weight to incorrectly labeled samples, so that correctly labeled samples remain dominant in the gradient updates of the model. Second, the label correction module corrects the labels by combining the ensemble prediction term with the original label data, constructing a more complete training dataset. Finally, a contrastive learning module is designed which, together with the classification branch of the model, imposes constraints of feature similarity and structural similarity on a shared feature extraction network, fully mining the inherent discriminative information of fault signals.
[0165] This application enhances the robustness of the model by introducing attention weight branches and improves the discriminative ability by label correction and contrastive learning. It increases the number of available samples for the model and optimizes the discrimination boundary of various health state samples in the high-dimensional embedding space, thereby further improving the generalization ability of the model. The average diagnostic accuracy reached 98.0% and 98.2% under various noise levels.
[0166] While specific embodiments of this application have been described in detail by way of examples, those skilled in the art should understand that the above examples are for illustrative purposes only and are not intended to limit the scope of this application. Those skilled in the art should understand that modifications can be made to the above embodiments without departing from the scope and spirit of this application. The scope of this application is defined by the appended claims.
Claims
1. A fault diagnosis model training method, characterized in that it comprises:
receiving an input first training dataset, wherein all training samples in the first training dataset have labels, some of the training samples having correct labels and the others having incorrect labels;
repeating the following steps until the current model training epoch reaches the maximum number of model training epochs:
transforming all training samples in the first training dataset into corresponding first feature encoding vectors;
performing label classification on the first feature encoding vectors to obtain the probability that each training sample belongs to each health state;
obtaining the attention weights of the training samples based on the first feature encoding vectors;
calculating the noise attention loss function for the current training round based on the probabilities of all training samples, the attention weights, and the labels of the training samples;
updating the model parameters based on the noise attention loss function;
the method further comprising: performing contrastive learning using the first training dataset to obtain a contrastive loss function;
obtaining the comprehensive loss function for the current training round based on the noise attention loss function and the contrastive loss function;
and updating the model parameters based on the comprehensive loss function;
wherein performing contrastive learning using the first training dataset to obtain the contrastive loss function comprises:
performing two different data augmentations on the training samples in the first training dataset to form a data augmentation sample set; wherein a batch of training samples is randomly sampled from the first training dataset, the batch size being [missing information], and two different data augmentation methods are applied to each training sample in the batch to obtain the data augmentation sample set, which comprises a first data augmentation sample set obtained by the first data augmentation method and a second data augmentation sample set obtained by the second data augmentation method;
performing contrastive learning on sample pairs consisting of any two data augmentation samples in the data augmentation sample set to obtain the contrastive loss function; wherein a given data augmentation sample forms sample pairs with the remaining samples in the data augmentation sample set, one of which is a positive sample pair, and each of the other remaining samples forms a negative sample pair with it;
wherein performing contrastive learning on sample pairs consisting of any two data augmentation samples in the data augmentation sample set to obtain the contrastive loss function comprises:
for any sample pair, first transforming the two data augmentation samples in the sample pair into a first feature encoding vector and a second feature encoding vector, respectively; then mapping the first feature encoding vector and the second feature encoding vector into spatial representation vectors, respectively; and finally calculating the similarity between the two spatial representation vectors;
wherein the feature encoding module extracts feature representations from the two data augmentation samples to obtain the corresponding feature encoding vectors; the projection layer maps the first feature encoding vector and the second feature encoding vector to the unit hypersphere space to obtain the corresponding spatial representation vectors; in the unit hypersphere vector space, cosine similarity is used to measure the similarity between two spatial representation vectors, and for each feature pair the cosine similarity is calculated by the formula: [missing information];
for a positive sample pair, the first feature encoding vector and the second feature encoding vector are obtained through the feature encoding module, the corresponding spatial representation vectors are then obtained through the projection layer, and finally the similarity between the two is calculated;
calculating the mutual information between samples using the similarity of all sample pairs, and calculating the contrastive loss function based on the mutual information between all samples; wherein, for any first data augmentation sample, the mutual information between samples is: [missing information]; for any second data augmentation sample, the mutual information between samples is: [missing information]; wherein the indicator function takes the value 1 when its condition is satisfied and 0 otherwise, and the temperature coefficient is that of the contrastive loss;
InfoNCE is used as the contrastive loss function: [missing information];
the comprehensive loss function is: [missing information]; wherein the coefficient is the balance coefficient of the contrastive loss.
2. The fault diagnosis model training method according to claim 1, characterized in that, before calculating the noise attention loss function, the method further comprises:
determining whether the current model training epoch has reached the label correction activation epoch, wherein the label correction activation epoch is less than the maximum number of model training epochs;
if so, correcting the labels to form corrected labels, wherein all training samples with corrected labels form a second training dataset;
calculating the noise attention loss function for the current training round based on the probabilities of all training samples, the attention weights, and the corrected labels corresponding to the training samples;
and updating the first training dataset to the second training dataset, and using the second training dataset for subsequent training rounds.
3. A fault diagnosis model training device, characterized in that it comprises a training data receiving module, a first transformation module, a classification module, a weight acquisition module, a first loss function calculation module, and a parameter update module;
the training data receiving module is used to receive an input first training dataset, wherein all training samples in the first training dataset have labels, some of the training samples having correct labels and the others having incorrect labels;
the first transformation module is used to transform all training samples in the first training dataset into corresponding first feature encoding vectors;
the classification module is used to perform label classification on the first feature encoding vectors to obtain the probability that each training sample belongs to each health state;
the weight acquisition module is used to obtain the attention weights of the training samples based on the first feature encoding vectors;
the first loss function calculation module is used to calculate the noise attention loss function for the current training round based on the probabilities of all training samples, the attention weights, and the labels of the training samples;
the parameter update module is used to update the model parameters based on the noise attention loss function;
the device further comprises a contrastive learning module and a second loss function calculation module;
the contrastive learning module is used to perform contrastive learning using the first training dataset to obtain a contrastive loss function;
the second loss function calculation module is used to obtain the comprehensive loss function for the current training round based on the noise attention loss function and the contrastive loss function;
and the parameter update module is used to update the model parameters based on the comprehensive loss function;
the contrastive learning module comprises a data augmentation module and a sample pair contrastive learning module;
the data augmentation module is used to perform two different data augmentations on all training samples in the first training dataset to form a data augmentation sample set; wherein a batch of training samples is randomly sampled from the first training dataset, the batch size being [missing information], and two different data augmentation methods are applied to each training sample in the batch to obtain the data augmentation sample set, which comprises a first data augmentation sample set obtained by the first data augmentation method and a second data augmentation sample set obtained by the second data augmentation method;
the sample pair contrastive learning module is used to perform contrastive learning on any two data augmentation samples in the data augmentation sample set to obtain the contrastive loss function; wherein a given data augmentation sample forms sample pairs with the remaining samples in the data augmentation sample set, one of which is a positive sample pair, and each of the other remaining samples forms a negative sample pair with it;
the sample pair contrastive learning module comprises a second transformation module, a mapping module, a similarity calculation module, and a contrastive loss function calculation module;
the second transformation module is used to transform the two data augmentation samples in the sample pair into a first feature encoding vector and a second feature encoding vector, respectively;
the mapping module is used to map the first feature encoding vector and the second feature encoding vector into spatial representation vectors, respectively;
the similarity calculation module is used to calculate the similarity between the two spatial representation vectors;
wherein the feature encoding module extracts feature representations from the two data augmentation samples to obtain the corresponding feature encoding vectors; the projection layer maps the first feature encoding vector and the second feature encoding vector to the unit hypersphere space to obtain the corresponding spatial representation vectors; in the unit hypersphere vector space, cosine similarity is used to measure the similarity between two spatial representation vectors, and for each feature pair the cosine similarity is calculated by the formula: [missing information];
for a positive sample pair, the first feature encoding vector and the second feature encoding vector are obtained through the feature encoding module, the corresponding spatial representation vectors are then obtained through the projection layer, and finally the similarity between the two is calculated;
the contrastive loss function calculation module is used to calculate the mutual information between samples using the similarity of all sample pairs, and to calculate the contrastive loss function based on the mutual information between all samples; wherein, for any first data augmentation sample, the mutual information between samples is: [missing information]; for any second data augmentation sample, the mutual information between samples is: [missing information]; wherein the indicator function takes the value 1 when its condition is satisfied and 0 otherwise, and the temperature coefficient is that of the contrastive loss;
InfoNCE is used as the contrastive loss function: [missing information];
the comprehensive loss function is: [missing information]; wherein the coefficient is the balance coefficient of the contrastive loss.
4. The fault diagnosis model training device according to claim 3, characterized in that it further comprises a judgment module, a label correction module, and a dataset update module;
the judgment module is used to determine whether the current model training round has reached the label correction activation round;
the label correction module is used to correct the labels when the current model training round reaches the label correction activation round, forming corrected labels, wherein all training samples with corrected labels form a second training dataset;
the dataset update module is used to update the first training dataset to the second training dataset, and to use the second training dataset for subsequent training rounds;
the first loss function calculation module is used to calculate the noise attention loss function for the current training round based on the probabilities of all training samples, the attention weights, and the corrected labels corresponding to the training samples.