A cross-domain small sample fault diagnosis method and system based on gated fusion and covariance distillation
By employing gated fusion and covariance distillation, the method addresses the overfitting and insufficient generalization ability of wind turbine gearbox fault diagnosis models in data-scarce and cross-domain scenarios. It enables high-precision fault identification and stable generalization under varying operating conditions, thereby improving the operation and maintenance level of wind power equipment.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- DALIAN UNIV
- Filing Date
- 2026-01-29
- Publication Date
- 2026-06-19
AI Technical Summary
Existing wind turbine gearbox fault diagnosis models face overfitting and insufficient generalization capabilities in scenarios with scarce data and cross-domain diagnosis. They struggle to effectively capture second-order correlations of features and ensure semantic integrity. In particular, the large differences in feature distribution under varying operating conditions lead to insufficient diagnostic accuracy and adaptability.
A cross-domain small-sample fault diagnosis method based on gated fusion and covariance distillation is adopted. By constructing teacher and student networks, updating teacher network parameters using exponential moving average, performing multi-level feature mapping and adaptive aggregation, and combining decoupling adapter and subspace covariance distillation, feature subspace alignment and loss function optimization are achieved, thereby improving cross-domain adaptability and diagnostic accuracy.
It significantly improves the cross-domain adaptability and fault identification accuracy of wind turbine gearboxes under varying operating conditions, reduces the dependence on the number of samples in the target domain, maintains high diagnostic accuracy and stable generalization performance, and can effectively cope with complex fault modes.
Figure CN122241334A_ABST
Description
Technical Field
[0001] This invention relates to the fields of industrial intelligent operation and maintenance and deep learning technology, specifically to a cross-domain small-sample fault diagnosis method and system based on gated fusion and covariance distillation.
Background Technology
[0002] With the continuous improvement of the level of intelligent manufacturing and the increasing complexity of industrial equipment, wind power generation equipment, as a core piece of equipment in the clean energy field, has placed more stringent demands on online monitoring, accurate identification, and robust fault diagnosis technologies. As the core component of power transmission in wind turbine generator sets, the stability and reliability of the wind turbine gearbox directly determine the overall power generation efficiency and profoundly affect the service life and maintenance costs of the unit. However, wind turbine generator sets are mostly deployed in remote onshore or offshore areas, exposed to harsh conditions such as drastic wind speed changes, frequent load fluctuations, and strong random impacts. This results in a high failure rate for key components such as gears and bearings inside the gearbox, with failure modes exhibiting significant diversity and complexity, posing a huge challenge to fault diagnosis.
[0003] In recent years, deep learning has demonstrated groundbreaking application potential in the field of intelligent fault diagnosis due to its powerful automatic representation learning capabilities. Unlike traditional fault diagnosis methods that rely on manually designed feature extractors, deep learning models can directly and automatically mine and extract multi-level, abstract feature representations highly correlated with the health status of equipment from high-dimensional, nonlinear raw sensor signals in an "end-to-end" learning manner, effectively overcoming the limitations of manual feature design. However, in the actual application scenario of wind turbine gearbox fault diagnosis, this powerful representation learning capability still faces two core challenges: First, the collection of fault data is limited by factors such as equipment operating safety and the difficulty of fault reproduction, making it difficult to obtain massive, balanced, and accurately labeled fault samples, resulting in scarce model training data; second, changes in operating conditions in the actual test environment can cause data distribution shifts, resulting in significant differences between the dataset in the offline training stage and the data distribution in actual applications, leading to insufficient model generalization ability.
[0004] To address the core challenges of small-sample and cross-domain diagnostics, various solutions have emerged in existing technologies, with data augmentation, transfer learning, and meta-learning being the three most widely applied strategies. Wang et al. proposed an intelligent fault diagnosis model based on deep neural networks (FCDNN), which uses a Siamese network architecture to recast fault diagnosis as a sample-pair similarity measurement task, providing an effective approach for few-sample fault diagnosis. Lin et al. proposed a prototype-matching-based meta-learning (PMML) fault diagnosis framework that integrates Model-Agnostic Meta-Learning (MAML), Prototypical Networks, and Bidirectional Long Short-Term Memory (BiLSTM) networks, achieving fault diagnosis under small-sample and even zero-sample conditions. These results indicate that integrating meta-learning, transfer strategies, and new network architectures is a key development direction for overcoming the technological bottlenecks of small-sample and cross-domain diagnostics.
[0005] While the aforementioned deep learning frameworks based on transfer learning and meta-learning alleviate the overfitting pressure caused by sample scarcity to some extent, existing fault representation models still face deep-seated bottlenecks when dealing with the extremely strong nonlinear coupling characteristics and highly complex dynamic behavior of wind turbine gearboxes. Traditional methods are mostly limited to linear representation paradigms, making it difficult to deeply mine the second-order geometric relationships between features. Furthermore, in the process of aligning with the intrinsic semantic structure of the domain, they are prone to destroying the semantic integrity of the backbone network. Therefore, exploring a diagnostic architecture that can effectively capture second-order relationships between features, ensure semantic integrity, and achieve cross-domain structural alignment has become an urgent need for intelligent fault diagnosis algorithms for wind power equipment. In particular, for the feature shift problem of wind turbine gearboxes under dynamic and varying operating conditions, constructing a small-sample intelligent fault diagnosis model with strong generalization ability has significant practical and engineering value for reducing unit failure risks, ensuring the safety of unit operation throughout its entire life cycle, and improving the overall operation and maintenance level of the wind power industry.
Summary of the Invention
[0006] The purpose of this invention is to propose a cross-domain small-sample fault diagnosis method and system based on gated fusion and covariance distillation. This aims to solve the overfitting problem caused by the large difference in feature distribution between the source domain and the target domain in existing wind turbine gearbox fault diagnosis under varying operating conditions and data-scarce scenarios, thereby improving the fault identification accuracy and generalization ability of the model under cross-domain small-sample conditions.
[0007] According to a first aspect of the embodiments of this disclosure, a cross-domain small-sample fault diagnosis method based on gated fusion and covariance distillation is provided, comprising the following steps:
- Obtain the source domain fault dataset and the target domain small-sample fault dataset, apply continuous wavelet transform processing to the vibration signals, and use the resulting time-frequency images as the meta-task input data.
- Construct a teacher network and a student network with identical structures as the feature extraction backbone. The teacher network does not perform gradient backpropagation; instead, its parameters are updated from the student network parameters by exponential moving average (EMA), so that it continuously provides stable intrinsic semantic anchors.
- Extract multi-level features of the meta-task input data with the teacher network and the student network respectively, and map the multi-level features extracted by each network to the hyperbolic tangent space. Then construct a space-channel joint gating module to adaptively aggregate the deep multi-level features in the hyperbolic tangent space, obtaining the teacher fusion features and the student fusion features respectively.
- Configure a decoupled teacher bypass adapter and a decoupled student bypass adapter at the fusion feature outputs of the teacher network and the student network respectively. The teacher and student fusion features are processed by the adapters, and the shrinkage covariance of each is computed.
- Based on the teacher-side shrinkage covariance, extract the principal feature directions to construct a feature subspace. Project the shrinkage covariances of both the teacher and student sides into this subspace and perform subspace covariance distillation. By minimizing the difference between the teacher and student shrinkage covariance distributions in the subspace, the distillation structure alignment loss is obtained.
- Construct a total loss function that includes the fault classification loss, the teacher-student prototype semantic alignment loss, and the distillation structure alignment loss. The student network is optimized with the total loss to achieve fault category prediction.
[0008] In one embodiment, the vibration signal is processed by continuous wavelet transform as follows. Using the Morlet wavelet as the basis function, the continuous wavelet transform (CWT) is applied to the original one-dimensional vibration signals in the source and target domains, converting each one-dimensional time-domain signal into a two-dimensional time-frequency matrix. The two-dimensional time-frequency matrix is normalized to the [0,1] interval and then resized to a fixed-size RGB three-channel image using a bicubic interpolation algorithm. This constructs meta-task input data containing support set samples and query set samples under different operating conditions:
$$\Phi:\ \mathcal{X} \rightarrow \{(\mathcal{S}_i, \mathcal{Q}_i)\}_{i=1}^{N_T}, \qquad \mathcal{S}_i,\ \mathcal{Q}_i \subset \mathcal{I} \times \mathcal{Y}$$
where $\Phi$ denotes the mapping from the original one-dimensional signal to the meta-task dataset, $\mathcal{X}$ denotes the original one-dimensional signal space, $\mathcal{S}_i$ denotes the support set of the $i$-th meta-task, $\mathcal{Q}_i$ denotes the query set of the $i$-th meta-task, $N_T$ denotes the total number of meta-tasks, $\mathcal{I}$ denotes the feature space of time-frequency images, and $\mathcal{Y}$ denotes the label space of the different fault categories of the device.
The original one-dimensional vibration signal in the source or target domain is $x(t) \in L^2(\mathbb{R})$, $t \in [0, T]$, where $L^2(\mathbb{R})$ denotes the space of square-integrable signals, $t$ denotes the sampling time variable, and $T$ denotes the total signal duration, with the number of sampling points being [number missing].
The label space $\mathcal{Y}$ is defined as a finite discrete set:
$$\mathcal{Y} = \{c_1, c_2, \dots, c_N\}$$
where $c_n$ denotes a specific type of equipment fault.
Meta-tasks are constructed on the time-frequency image feature space $\mathcal{I}$; each meta-task contains a support set $\mathcal{S}_i$ and a query set $\mathcal{Q}_i$.
The support set $\mathcal{S}_i$ is defined as:
$$\mathcal{S}_i = \{(I_j, y_j)\}_{j=1}^{N \times K}$$
where $I_j$ denotes the $j$-th RGB time-frequency image sample generated by the continuous wavelet transform, $y_j$ denotes the fault category label of that sample, $N$ denotes the number of fault categories, and $K$ denotes the number of labeled samples provided for each class.
The query set $\mathcal{Q}_i$ is defined as:
$$\mathcal{Q}_i = \{(I_q, y_q)\}_{q=1}^{N \times M}$$
where $I_q$ denotes the $q$-th test sample image to be classified, $y_q$ denotes its true label, and $M$ denotes the number of samples used for testing in each category.
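The preprocessing described above can be illustrated with a minimal NumPy sketch. It is not the patent's implementation: the function names `morlet`, `cwt_morlet`, and `to_image`, the wavelet parameter `w0 = 6`, the frequency grid, and the 64 x 64 output size are all assumptions chosen for illustration. Two simplifications are made on purpose: nearest-neighbor resampling stands in for the bicubic interpolation named in the text, and the three RGB channels are filled by replicating the normalized matrix rather than by applying a colormap.

```python
import numpy as np

def morlet(t, w0=6.0):
    """Complex Morlet wavelet (small admissibility correction term omitted)."""
    return np.pi ** -0.25 * np.exp(1j * w0 * t) * np.exp(-0.5 * t ** 2)

def cwt_morlet(signal, scales, fs):
    """Return |CWT| as a (len(scales), len(signal)) time-frequency matrix."""
    n = len(signal)
    out = np.empty((len(scales), n))
    for i, s in enumerate(scales):
        # Sample the scaled wavelet on a window of about +/- 4 scale widths.
        half = int(min(4 * s * fs, (n - 1) // 2))
        t = np.arange(-half, half + 1) / fs
        psi = morlet(t / s) / np.sqrt(s)
        # Correlation with the wavelet, written as a convolution with the
        # reversed conjugate; dividing by fs approximates the integral.
        coef = np.convolve(signal, np.conj(psi)[::-1], mode="same") / fs
        out[i] = np.abs(coef)
    return out

def to_image(tf, size=64):
    """Normalize to [0, 1], resize, and replicate into three channels."""
    tf = (tf - tf.min()) / (tf.max() - tf.min() + 1e-12)
    # Nearest-neighbor resampling (assumption: stand-in for bicubic).
    rows = np.linspace(0, tf.shape[0] - 1, size).round().astype(int)
    cols = np.linspace(0, tf.shape[1] - 1, size).round().astype(int)
    small = tf[np.ix_(rows, cols)]
    return np.stack([small, small, small], axis=-1)

# Toy signal: a 200 Hz tone, 512 samples at 2.56 kHz (hypothetical values).
fs = 2560.0
t = np.arange(512) / fs
x = np.sin(2 * np.pi * 200.0 * t)

freqs = np.linspace(50.0, 800.0, 32)      # analysis frequencies in Hz
scales = 6.0 / (2 * np.pi * freqs)        # Morlet scale for each frequency
tf = cwt_morlet(x, scales, fs)
img = to_image(tf)
peak_freq = freqs[np.argmax(tf.mean(axis=1))]
```

In this toy run the row with the largest mean magnitude should sit close to the 200 Hz tone, which is a quick sanity check that the scale-to-frequency mapping is wired correctly. A production pipeline would more likely rely on an established wavelet library and a real bicubic resize.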
[0009] In one embodiment, both the teacher network and the student network use ResNet-18 as the backbone architecture for feature extraction, and the teacher network is updated dynamically through parameter evolution based on exponential moving average in order to provide stable intrinsic semantic anchors. Under this scheme the teacher network does not take part in gradient backpropagation training; instead, it follows the student network through smooth parameter evolution. At iteration $t$, the teacher network parameters are updated as follows:
$$\theta_T^{(t)} \leftarrow m_t\,\theta_T^{(t-1)} + (1 - m_t)\,\theta_S^{(t)}$$
where $\leftarrow$ means that the left-hand variable is updated with the right-hand result, $t$ denotes the current training iteration number, $\theta_S^{(t)}$ denotes the student network parameters updated via backpropagation in the current iteration step, $\theta_T^{(t-1)}$ denotes the teacher network parameters at the previous moment, and $m_t$ denotes the smoothing coefficient.
The teacher network performs no gradient backpropagation so that its gradient flow is explicitly truncated; that is, its parameters $\theta_T$ are formed by smoothly accumulating the historical states of the student network parameters $\theta_S$.
The dynamic update of the teacher network employs a dynamic smoothing strategy to accelerate the adaptation of the teacher network to the feature distribution in the early stages of training and to maintain the stability of the teacher features in the later stages of training. The smoothing coefficient $m_t$ is not a fixed value; this invention sets it as a dynamic variable that gradually increases with the number of training iterations. The dynamic variable satisfies the following condition: [condition missing], where $t$ denotes the current training iteration number. The coefficient is updated with a linear growth model that adjusts it with the number of iterations: [formula missing], where $T_{\max}$ denotes the preset total number of training iterations.
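The EMA teacher update with a linearly growing smoothing coefficient can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's code: the function names `momentum` and `ema_update` are hypothetical, parameters are plain floats in a dictionary rather than network tensors, and the endpoints `m_start = 0.9` and `m_end = 0.999` are invented because the source does not preserve the actual values.

```python
def momentum(step, total_steps, m_start=0.9, m_end=0.999):
    """Linearly grow the smoothing coefficient from m_start to m_end.

    m_start and m_end are illustrative assumptions, not values from the source.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return m_start + (m_end - m_start) * frac

def ema_update(teacher, student, m):
    """Teacher <- m * teacher + (1 - m) * student, with no gradient involved."""
    return {name: m * teacher[name] + (1.0 - m) * student[name] for name in teacher}

# Toy run: the student parameter is held at 1.0, the teacher starts at 0.0.
teacher = {"w": 0.0}
student = {"w": 1.0}
total_steps = 100
for step in range(1, total_steps + 1):
    teacher = ema_update(teacher, student, momentum(step, total_steps))
```

A small coefficient early in training lets the teacher track the student quickly, while a coefficient close to 1 later on makes the teacher change slowly, which matches the stated goal of fast early adaptation and stable late features.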
[0010] In one embodiment, the multi-level features extracted by the two networks are each mapped to the hyperbolic tangent space as follows: using the hyperbolic tangent function $\tanh(\cdot)$ as the multi-level feature activation function, hyperbolic tangent space mapping is applied to the high-level features of the teacher network and the student network and to the intermediate-layer features after downsampling adjustment. The mapped intermediate-layer features are then weighted and modulated by the space-channel joint gating signal.
[0011] The multi-level features are defined as follows. For any input time-frequency image sample, the outputs of Layer 3 and Layer 4 of the ResNet-18 network are extracted as key feature maps, where Layer 3 and Layer 4 denote the third and fourth residual block levels of the feature extraction backbone network, respectively.
The intermediate-layer feature $F_3$ of the teacher network or student network is defined as the output of Layer 3 and contains local structure and texture information:
$$F_3 \in \mathbb{R}^{C_3 \times H_3 \times W_3}$$
where $C_3$, $H_3$, and $W_3$ denote the number of channels, height, and width of the intermediate-layer feature map, respectively.
The high-level feature $F_4$ of the teacher network or student network is defined as the output of Layer 4 and contains highly abstract category semantic information:
$$F_4 \in \mathbb{R}^{C_4 \times H_4 \times W_4}$$
where $C_4$, $H_4$, and $W_4$ denote the number of channels, height, and width of the high-level feature map, respectively.
The downsampling adjustment applies only to the intermediate-layer feature maps of the teacher and student networks. A downsampling convolution with a kernel of size [size missing] adjusts the spatial and channel dimensions of the intermediate-layer feature $F_3$ to match those of the high-level feature $F_4$:
$$\tilde{F}_3 = \mathrm{Conv}_{\downarrow}(F_3) \in \mathbb{R}^{C_4 \times H_4 \times W_4}$$
In the hyperbolic tangent space mapping, the downsampled intermediate-layer features and the high-level features are passed through the hyperbolic tangent function for nonlinear mapping, so that the multi-level features fall into a unified hyperbolic tangent space coordinate system:
$$H_3 = \tanh(\tilde{F}_3), \qquad H_4 = \tanh(F_4)$$
The weighted modulation by the space-channel joint gating signal proceeds as follows:
S1. Obtain the downsampled, adjusted intermediate-layer feature $\tilde{F}_3$ of the student or teacher network.
S2. Compress the spatial dimensions with a global average pooling layer $\mathrm{GAP}$ to generate a channel descriptor $z$ with a global receptive field:
$$z_c = \frac{1}{H_4 W_4} \sum_{i=1}^{H_4} \sum_{j=1}^{W_4} \tilde{F}_3(c, i, j)$$
where $(i, j)$ denote the spatial coordinates of the feature map.
S3. Apply a point convolution layer with a $1 \times 1$ kernel for channel dimensionality reduction and interaction.
S4. Cascade a depthwise convolution layer with a kernel of size [size missing] to capture local spatial correlations.
S5. Map the feature values to the $(0, 1)$ interval through the Sigmoid activation function, generating the space-channel joint gating signal $G$:
$$G = \sigma\big(\mathrm{DWConv}(\mathrm{PWConv}(\mathrm{GAP}(\tilde{F}_3)))\big)$$
where $\sigma$ denotes the Sigmoid activation function, $\mathrm{DWConv}$ denotes depthwise convolution, $\mathrm{PWConv}$ denotes point convolution, and $\mathrm{GAP}$ denotes global average pooling.
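The gating chain in steps S1 to S5 (global average pooling, point convolution, depthwise convolution, then an activation that squashes values into a gate) can be sketched in NumPy as below. Several points are assumptions rather than facts from the source: the gate activation is taken to be a sigmoid, the point convolution is assumed to keep the channel count so the gate can broadcast over the feature map, the depthwise kernel is 3 x 3 with zero padding, and all weights are random. The names `pointwise_conv`, `depthwise_conv`, and `joint_gate` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pointwise_conv(x, weight):
    """1x1 convolution: mixes channels at each position. weight: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def depthwise_conv(x, kernels):
    """Per-channel k x k cross-correlation, stride 1, zero 'same' padding.

    x: (C, H, W); kernels: (C, k, k) with odd k.
    """
    _, h, w = x.shape
    k = kernels.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros(x.shape, dtype=float)
    for i in range(k):
        for j in range(k):
            out += kernels[:, i:i + 1, j:j + 1] * xp[:, i:i + h, j:j + w]
    return out

def joint_gate(feat, pw_weight, dw_kernels):
    """GAP -> point conv -> depthwise conv -> sigmoid; returns a (C, 1, 1) gate."""
    z = feat.mean(axis=(1, 2), keepdims=True)   # channel descriptor, (C, 1, 1)
    return sigmoid(depthwise_conv(pointwise_conv(z, pw_weight), dw_kernels))

rng = np.random.default_rng(0)
C, H, W = 4, 5, 5
feat = rng.normal(size=(C, H, W))               # stand-in for the adjusted mid-level feature
pw = rng.normal(size=(C, C)) * 0.5              # point-conv weights (random, illustrative)
dw = rng.normal(size=(C, 3, 3)) * 0.5           # depthwise kernels (random, illustrative)

gate = joint_gate(feat, pw, dw)
gated = gate * np.tanh(feat)                    # gate applied to the tanh-mapped feature
```

Because the pooled descriptor has a 1 x 1 spatial extent, a zero-padded depthwise kernel only contributes its center tap here; the generic implementation is kept so the same function also works on full feature maps.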
[0012] In one embodiment, adaptive aggregation is performed on the deep multi-level features within the hyperbolic tangent space as follows: [formula missing], where the result is the final teacher fusion feature or student fusion feature. The Hadamard (element-wise) product applies the gating signal to the intermediate-layer features, weighting them linearly across spatial positions and channels. The hierarchical fusion coefficient controls the proportion of low-level detail information injected into the high-level semantic stream, and the hyperbolic scaling factor adjusts the modulus distribution of the features in the hyperbolic tangent space. Both learnable coefficients follow a dynamic adjustment strategy: the hierarchical fusion coefficient is initialized to 0 in the initial training phase, and as the number of training iterations increases, both coefficients are updated adaptively by the backpropagation algorithm. Through the fusion coefficient the network learns how much intermediate-layer detail to introduce, and through the scaling factor it learns how to adjust the radius of curvature of the feature space. This gives a smooth transition from coarse-grained feature learning to fine-grained feature fusion, and the resulting fused features, which contain rich multi-scale information, serve as the input to the subsequent stages.
[0013] In one embodiment, a feature subspace is constructed by extracting the principal feature directions based on the teacher-side shrinkage covariance. The specific process is as follows:
S1. Obtain the shrinkage covariance matrix, including:
S1.1. Map the teacher and student fusion features to the common feature dimension $d$ through the bypass adapters to obtain the adapted features:
$$Z_T = A_T(F_T), \qquad Z_S = A_S(F_S)$$
where $A_T$ and $A_S$ denote the teacher bypass adapter and the student bypass adapter, and $Z_T$ and $Z_S$ denote the feature embeddings obtained by mapping the fused features through the teacher bypass adapter and the student bypass adapter, respectively.
S1.2. Expand $Z_T$ and $Z_S$ into matrices along the dimension $d$ [matrix dimensions missing].
S1.3. Perform batch centering:
$$\bar{Z} = Z - \mathbf{1}\mu^\top$$
where $\mu$ denotes the mean vector and $\bar{Z}$ denotes the centered matrix used for the subsequent covariance calculation.
S1.4. Calculate the shrinkage covariance matrix:
$$\hat{\Sigma} = (1 - \gamma)\,\Sigma + \gamma\,\mathrm{diag}(\Sigma) + \epsilon I$$
where $\Sigma$ denotes the original sample covariance matrix, $\gamma$ denotes a shrinkage coefficient between 0 and 1 that determines the trade-off between the original covariance and the diagonal matrix, $\mathrm{diag}(\cdot)$ denotes diagonalization of the covariance, and $\epsilon$ denotes a regularization constant.
S2. Perform eigenvalue decomposition of the teacher-side shrinkage covariance matrix $\hat{\Sigma}_T$ to obtain the eigenvector matrix and the eigenvalue matrix:
$$\hat{\Sigma}_T = U \Lambda U^\top$$
where $\Lambda$ denotes the diagonal matrix of eigenvalues arranged in order and $U$ denotes the eigenvector matrix corresponding to the eigenvalues.
S3. Sort the eigenvalues in $\Lambda$ from largest to smallest and select the eigenvectors corresponding to the $k$ largest eigenvalues to form the projection matrix $P$:
$$P = [u_1, u_2, \dots, u_k] \in \mathbb{R}^{d \times k}$$
where $k$ denotes the preset subspace dimension and satisfies [condition missing]. The projection matrix $P$ captures the principal directions of the second-order statistical structure of the teacher network and serves as the geometric reference for subsequent structural distillation.
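The shrinkage covariance and teacher-defined subspace from steps S1 to S3 can be sketched in NumPy. This is an illustrative reading, with assumptions stated plainly: the sample covariance uses the unbiased `1 / (n - 1)` factor, the shrinkage target is the diagonal of the sample covariance, and the values `gamma = 0.1`, `eps = 1e-5`, and `k = 3` are made up. The function names `shrinkage_cov` and `principal_subspace` are hypothetical.

```python
import numpy as np

def shrinkage_cov(z, gamma=0.1, eps=1e-5):
    """Shrink the sample covariance toward its own diagonal.

    z: (n_samples, d) adapted features. gamma and eps are illustrative values.
    """
    zc = z - z.mean(axis=0, keepdims=True)               # batch centering
    cov = zc.T @ zc / (z.shape[0] - 1)                   # sample covariance (assumed unbiased)
    d = cov.shape[0]
    return (1.0 - gamma) * cov + gamma * np.diag(np.diag(cov)) + eps * np.eye(d)

def principal_subspace(cov, k):
    """Return the top-k eigenvectors (d, k) and eigenvalues, largest first."""
    vals, vecs = np.linalg.eigh(cov)                     # eigh returns ascending eigenvalues
    order = np.argsort(vals)[::-1][:k]
    return vecs[:, order], vals[order]

# Toy teacher features: 40 samples, 8 dimensions, two dominant directions.
rng = np.random.default_rng(0)
z_teacher = rng.normal(size=(40, 8)) * np.array([5.0, 3.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])

cov_t = shrinkage_cov(z_teacher)
P, top_vals = principal_subspace(cov_t, k=3)
cov_t_sub = P.T @ cov_t @ P                              # teacher covariance inside the subspace
```

Since `P` is built from the teacher's own eigenvectors, the projected teacher covariance is diagonal, with the selected eigenvalues on its diagonal. The student covariance projected with the same `P` generally is not, and that mismatch is what the later distillation step penalizes.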
[0014] In one embodiment, both the teacher bypass adapter and the student bypass adapter comprise a gradient blocking layer, a feature transformation layer, and a distribution normalization layer connected in sequence.
The gradient blocking layer contains no learnable parameters. It is a logical operation layer that receives the deep fusion feature $F$ from the teacher or student network as input and sets the gradient to zero during backpropagation, so that gradient updates affect only the internal parameters of the adapter and do not adjust the weights of the student network backbone, thereby achieving decoupling.
The feature transformation layer is denoted $W$ [dimensions missing]. It is a learnable weight matrix implemented as a point convolution with a $1 \times 1$ kernel, and it applies a linear mapping and rotation to the student features after the gradient is blocked.
The distribution normalization layer uses one-dimensional batch normalization to standardize the linearly transformed features and eliminate the scale differences between features.
The bypass adapter is expressed as:
$$A(F) = \mathrm{BN}\big(W \cdot \mathrm{sg}(F)\big)$$
where $\mathrm{sg}(\cdot)$ denotes the gradient blocking operation, $\cdot$ denotes matrix multiplication, and $\mathrm{BN}(\cdot)$ denotes batch normalization.
[0015] In one embodiment, the shrinkage covariances on both the teacher and student sides are projected onto the feature subspace, and subspace covariance distillation is performed. By minimizing the difference between the teacher and student shrinkage covariance distributions within the subspace, the distillation structure alignment loss is obtained, specifically as follows:
The teacher-side and student-side shrinkage covariances are each projected onto the feature subspace using the projection matrix $P$, giving the subspace covariance representations:
$$C_T = P^\top \hat{\Sigma}_T P, \qquad C_S = P^\top \hat{\Sigma}_S P$$
where $C_T$ and $C_S$ denote the second-order statistics within the subspace, of dimension $k \times k$, which characterize the correlation structure of the teacher and student features in the teacher's principal-direction subspace, and $\hat{\Sigma}_T$ and $\hat{\Sigma}_S$ denote the covariance matrices computed with the shrinkage coefficient.
The goal of the subspace covariance distillation is to make the student's second-order structural statistics approximate the teacher's within the teacher-defined principal structure subspace, that is, to align $C_S$ with $C_T$ in both numerical values and distribution patterns. This alignment does not require the student to replicate the teacher's correlations in every dimension; it emphasizes structural transfer along the most discriminative principal directions, thereby improving generalization ability under cross-domain, small-sample conditions. The distillation structure alignment loss is as follows: [formula missing].
In one embodiment, a total loss function is constructed that includes the fault classification loss, the teacher-student prototype semantic alignment loss, and the distillation structure alignment loss, specifically as follows:
The fault classification loss is:
$$\mathcal{L}_{cls} = -\frac{1}{|\mathcal{Q}|} \sum_{(x_q, y_q) \in \mathcal{Q}} \log p(y = y_q \mid x_q)$$
where $\mathcal{Q}$ denotes the query set, $p(y = y_q \mid x_q)$ denotes the computed classification probability of the query sample, $x_q$ denotes a query sample, and $y_q$ denotes the actual category label of that query sample. The classification probability is:
$$p(y = c \mid x_q) = \frac{\exp\big(-d(f_S(x_q), p_c^S)\big)}{\sum_{c'} \exp\big(-d(f_S(x_q), p_{c'}^S)\big)}$$
where $d(\cdot,\cdot)$ denotes a distance metric function used to measure the similarity between the query embedding and the category prototype, $c'$ denotes the category index in the normalizing summation, $f_S(\cdot)$ denotes the embedding of the student network, and $p_c^S$ denotes the student category prototype constructed for each class.
The prototype classification head comprises an adaptive global average pooling layer, a Flatten layer, a Linear1 layer, a normalization layer, a GELU layer, a Dropout layer, a Linear2 layer, and an L2 normalization layer connected in sequence. It maps the fused features of the student network into normalized embedding vectors to support the subsequent distance-metric classification based on the category prototypes.
The category prototypes are constructed per class: for each category $c$, the student prototype $p_c^S$ is built from the support-set embeddings:
$$p_c^S = \frac{1}{|\mathcal{S}_c|} \sum_{x \in \mathcal{S}_c} f_S(x)$$
where $\mathcal{S}_c$ denotes the support set of category $c$, $|\mathcal{S}_c|$ denotes the number of samples in that set, and $f_S(\cdot)$ denotes the embedding of the student network.
The teacher-student prototype semantic alignment loss is as follows: [formula missing], where $N$ denotes the number of categories within the meta-task and $\|\cdot\|_2$ denotes the L2 norm of a vector, used to normalize the prototype vectors.
The distillation structure alignment loss is combined with the fault classification loss and the prototype semantic alignment loss into the total loss: [formula missing], where the coefficients denote the loss weights.
The optimizer uses the total loss $\mathcal{L}_{total}$ to update the student network parameters, the student bypass adapter parameters, and the prototype classification head parameters, jointly denoted $\theta$, via backpropagation:
$$\theta \leftarrow \theta - \eta\,\nabla_\theta \mathcal{L}_{total}$$
where $\eta$ denotes the learning rate.
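The three loss terms and their weighted combination can be sketched in NumPy as follows. Only the prototype construction and the distance-based softmax follow directly from the definitions in the text. The remaining choices are assumptions, because the exact formulas are not preserved in the source: squared Euclidean distance as the metric, a mean squared difference between L2-normalized prototypes for the semantic alignment term, a mean squared (Frobenius-style) difference for the subspace covariance term, and the weights `(1.0, 0.5, 0.5)`. All function names are hypothetical.

```python
import numpy as np

def prototypes(emb, labels, n_classes):
    """Class prototype = mean embedding of that class's support samples."""
    return np.stack([emb[labels == c].mean(axis=0) for c in range(n_classes)])

def classification_loss(query_emb, query_labels, protos):
    """Cross-entropy over a softmax of negative squared Euclidean distances."""
    d = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    logits = -d
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(query_labels)), query_labels].mean()

def prototype_alignment_loss(p_student, p_teacher):
    """Assumed form: mean squared distance between L2-normalized prototypes."""
    ns = p_student / np.linalg.norm(p_student, axis=1, keepdims=True)
    nt = p_teacher / np.linalg.norm(p_teacher, axis=1, keepdims=True)
    return ((ns - nt) ** 2).sum(axis=1).mean()

def subspace_distill_loss(cov_s, cov_t, P):
    """Assumed form: mean squared difference of covariances projected by P."""
    diff = P.T @ cov_s @ P - P.T @ cov_t @ P
    return (diff ** 2).sum() / P.shape[1] ** 2

def total_loss(l_cls, l_proto, l_dist, weights=(1.0, 0.5, 0.5)):
    """Weighted sum of the three terms; the weights are illustrative."""
    return weights[0] * l_cls + weights[1] * l_proto + weights[2] * l_dist

# Toy 2-way, 2-shot episode with 2-D embeddings.
support_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
support_lab = np.array([0, 0, 1, 1])
query_emb = np.array([[1.0, 0.1], [0.0, 0.8]])
query_lab = np.array([0, 1])

p_s = prototypes(support_emb, support_lab, n_classes=2)
l_cls = classification_loss(query_emb, query_lab, p_s)
```

In this toy episode both queries lie near their own class prototype, so the classification loss stays well below the chance-level value of `log(2)` for two classes. The alignment and distillation terms both vanish when the student matches the teacher exactly, which is the behavior any concrete choice of these losses should share.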
[0016] According to a second aspect of the present disclosure, a cross-domain small-sample fault diagnosis system based on gated fusion and covariance distillation is provided, comprising:
- A data preprocessing module, which obtains the source domain fault dataset and the target domain small-sample fault dataset, applies continuous wavelet transform processing to the vibration signals, and uses the resulting time-frequency images as the meta-task input data.
- A dual-branch network module, which constructs a teacher network and a student network with the same structure as the feature extraction backbone. The teacher network does not perform gradient backpropagation; instead, its parameters are updated from the student network parameters by exponential moving average (EMA), so that it continuously provides stable intrinsic semantic anchors.
- A hyperbolic tangent space mapping module, which uses the teacher network and the student network to extract multi-level features from the meta-task input data and maps the multi-level features extracted by each network to the hyperbolic tangent space. A space-channel joint gating module is then constructed to adaptively aggregate the deep multi-level features in the hyperbolic tangent space, obtaining the teacher fusion features and the student fusion features respectively.
- A bypass adaptation and subspace covariance distillation module, which configures a decoupled teacher bypass adapter and a decoupled student bypass adapter at the fusion feature outputs of the teacher network and the student network, respectively. The adapters process the teacher and student fusion features, and the shrinkage covariance of each is computed. Based on the teacher-side shrinkage covariance, the principal feature directions are extracted to construct a feature subspace. The shrinkage covariances of both the teacher and student sides are projected into this subspace, and subspace covariance distillation is performed. By minimizing the difference between the teacher and student shrinkage covariance distributions in the subspace, the distillation structure alignment loss is obtained.
- A multi-loss fusion and network optimization module, which constructs a total loss function that includes the fault classification loss, the teacher-student prototype semantic alignment loss, and the distillation structure alignment loss. The student network is optimized using the total loss to achieve fault category prediction.
[0017] According to a third aspect of the present disclosure, an electronic device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the cross-domain small-sample fault diagnosis method based on gated fusion and covariance distillation.
[0018] According to a fourth aspect of the present disclosure, a computer-readable storage medium is provided having a computer program stored thereon that, when executed by a processor, implements the aforementioned method for cross-domain small-sample fault diagnosis based on gated fusion and covariance distillation.
[0019] The advantages of the above technical solutions adopted in this invention compared with the prior art are as follows: 1. By using space-channel joint gating to adaptively weight and fuse mid-to-high-level features in hyperbolic tangent space, the essential correlation of fault features under different working conditions can be accurately captured, effectively weakening the distribution difference between the source domain and the target domain, and significantly improving the cross-domain adaptability of the model in variable working conditions.
[0020] 2. By leveraging the second-order structure distillation mechanism, the second-order structure knowledge of the teacher network is transferred to the student network through subspace covariance distillation. This allows the student network to learn robust fault feature representations without the need for massive labeled samples, significantly reducing the dependence on the number of samples in the target domain and maintaining high diagnostic accuracy even in data-scarce scenarios.
[0021] 3. Hyperbolic tangent space mapping provides a more suitable expression space for the nonlinear characteristics of fault data. Combined with a gating mechanism to dynamically adjust the feature fusion ratio, the fused features contain both high-level abstract semantic information and retain mid-level local detail features, thereby improving the recognizability of fault features.
[0022] 4. Through multi-objective collaborative optimization of fault classification loss, teacher-student prototype semantic alignment loss and distillation structure alignment loss, dual alignment at the semantic and structural levels is achieved, enabling the model to learn general fault features independent of operating conditions, effectively addressing the challenges of diversified and complex fault modes in wind turbine gearboxes, and ensuring stable and reliable generalization performance.
[0023] 5. The teacher network is dynamically updated through exponential moving average, providing stable intrinsic semantic anchors for the student network. At the same time, the decoupling of the bypass adapter and gradient blocking mechanism avoids gradient interference during training, ensuring the convergence stability and efficiency of model training.
Attached Figure Description
[0024] The accompanying drawings, which form part of this application, are used to provide a further understanding of this application. The illustrative embodiments of this application and their descriptions are used to explain this application and do not constitute an undue limitation of this application.
[0025] Figure 1 is a flowchart of the cross-domain small-sample fault diagnosis method based on gated fusion and covariance distillation; Figure 2 is a diagram illustrating the overall framework of the cross-domain small-sample fault diagnosis network.
Detailed Implementation
[0026] The present disclosure will be further described below with reference to the accompanying drawings and embodiments.
[0027] It should be noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the exemplary embodiments according to this application. As used herein, the singular form is intended to include the plural form as well, unless the context clearly indicates otherwise. Furthermore, it should be understood that when the terms "comprising" and / or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components, and / or combinations thereof.
[0028] It should be noted that the flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of methods and systems according to various embodiments of this disclosure. It should be noted that each block in a flowchart or block diagram may represent a module, segment, or portion of code, which may include one or more executable instructions for implementing the logical functions specified in the various embodiments. It should also be noted that in some alternative implementations, the functions marked in the blocks may occur in a different order than that shown in the drawings. For example, two consecutively represented blocks may actually be executed substantially in parallel, or they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the flowcharts and / or block diagrams, and combinations of blocks in the flowcharts and / or block diagrams, may be implemented using a dedicated hardware-based system that performs the specified functions or operations, or using a combination of dedicated hardware and computer instructions.
[0029] Embodiment 1: As shown in Figure 1, this embodiment provides a cross-domain small-sample fault diagnosis method based on gated fusion and covariance distillation, including the following steps:
Step 1: Obtain the source domain fault dataset and the target domain small-sample fault dataset, perform continuous wavelet transform processing on the vibration signal, and obtain the time-frequency image as the input data for the meta-task.
Specifically, using the wind turbine gearbox as the monitoring object, vibration signals were collected using accelerometers installed inside the nacelle. The sampling frequency was 25.6 kHz and each signal segment was 1.0 second long, so each raw record contained 25,600 sampling points. The source domain dataset was collected under multiple different speed and load conditions, totaling 5,000 samples; the target domain dataset was collected under different operating conditions at another wind farm, with only 5 samples for each fault type.
[0030] Each original signal is divided into frames of 2048 sampling points with 50% overlap (a frame shift of 1024 points), resulting in short-time signal segments of uniform length. Each segment undergoes a continuous wavelet transform using Morlet wavelets: 128 scales are selected uniformly from the scale-factor range [1, 128], and the translation step size is equal to the frame shift. The time-frequency coefficient matrix is obtained as $$W(a,b) = \frac{1}{\sqrt{a}} \int x(t)\, \psi^{*}\!\left(\frac{t-b}{a}\right) dt,$$ where $x(t)$ is the signal segment, $a$ is the scale factor, $b$ is the translation, and $\psi$ is the Morlet mother wavelet ($\psi^{*}$ denotes its complex conjugate). The resulting two-dimensional time-frequency matrix is 128×128 in size, and each element represents the energy of the signal at a given scale and time window.
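The framing arithmetic above can be checked with a short sketch. It only covers the segmentation step; the function name `frame_signal` and the all-zero placeholder record are illustrative assumptions, not part of the original disclosure.

```python
def frame_signal(signal, frame_len=2048, hop=1024):
    """Split a 1-D sequence into overlapping frames.

    Trailing samples that do not fill a whole frame are dropped.
    """
    if len(signal) < frame_len:
        return []
    n_frames = (len(signal) - frame_len) // hop + 1
    return [signal[i * hop: i * hop + frame_len] for i in range(n_frames)]


fs = 25_600          # sampling frequency in Hz, as stated in the embodiment
duration_s = 1.0     # record length in seconds, as stated in the embodiment
record = [0.0] * int(fs * duration_s)   # placeholder record (assumption)

frames = frame_signal(record)
print(len(frames), len(frames[0]))
```

With a 25,600-point record, a 2048-point frame, and a 1024-point shift, each record yields 24 frames, and the last frame ends exactly at the final sample.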
[0031] Each two-dimensional time-frequency matrix is normalized to the [0,1] interval, copied three times to form RGB channels, and resized by bicubic interpolation into a 224×224 three-channel image. All images form a source domain image set and a target domain sample image set.
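As a rough illustration of the normalization and channel replication, the sketch below assumes min-max scaling, since the normalization formula did not survive in the source text. The bicubic resize to 224×224 is omitted because it would normally be delegated to an image library. Both function names are hypothetical.

```python
def minmax_normalize(matrix):
    """Scale a 2-D list of numbers to [0, 1] (min-max scaling is an assumption)."""
    lo = min(min(row) for row in matrix)
    hi = max(max(row) for row in matrix)
    span = hi - lo
    if span == 0:
        # A constant matrix carries no contrast; map it to all zeros.
        return [[0.0 for _ in row] for row in matrix]
    return [[(v - lo) / span for v in row] for row in matrix]


def to_three_channels(matrix):
    """Copy a single-channel matrix three times to form an RGB-like stack."""
    return [[row[:] for row in matrix] for _ in range(3)]


tf_matrix = [[1.0, 3.0], [5.0, 9.0]]     # toy stand-in for a 128x128 matrix
norm = minmax_normalize(tf_matrix)
image = to_three_channels(norm)
```

Each of the three channels holds an independent copy of the same normalized matrix, so later in-place edits to one channel do not affect the others.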
[0032] Meta-tasks are randomly constructed from the preprocessed image sets. Each meta-task contains [number missing] categories; the support set contains [number missing] samples per category, and the query set contains [number missing] samples per category. The support set is used for prototype extraction and loss calculation, and the query set is used for model evaluation.
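A minimal sketch of meta-task (episode) construction follows. The function name, the class and sample names, and the query size of 10 are illustrative assumptions; the 5-way 5-shot setting mirrors the evaluation setting reported later in the description.

```python
import random


def sample_episode(samples_by_class, n_way, k_shot, n_query, rng):
    """Draw one meta-task: n_way classes, k_shot support and n_query query samples each."""
    classes = rng.sample(sorted(samples_by_class), n_way)
    support, query = [], []
    for label in classes:
        # Sample without replacement so support and query never overlap.
        picked = rng.sample(samples_by_class[label], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query


rng = random.Random(0)
# Hypothetical image identifiers: 6 fault classes with 20 images each.
data = {f"fault_{c}": [f"img_{c}_{i}" for i in range(20)] for c in range(6)}
support, query = sample_episode(data, n_way=5, k_shot=5, n_query=10, rng=rng)
```

Sampling the support and query items for a class in a single draw keeps the two sets disjoint within each meta-task.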
[0033] Step 2: Construct teacher and student networks with identical structures as the feature extraction backbone. The teacher network does not perform gradient backpropagation; instead, its parameters are updated as an exponential moving average (EMA) of the student network parameters, so that it continuously provides stable intrinsic semantic anchors. This embodiment uses a ResNet18 network for both the teacher and the student, with RGB time-frequency images as input and multi-layer features as output.
[0034] The specific structures of Layer 3 and Layer 4 are as follows. Layer 3 contains two residual units. In the first residual unit, the first layer is a 3×3 convolution with a stride of 2, which reduces the feature map from 28×28 to 14×14 with 256 output channels; the second layer is a 3×3 convolution with a stride of 1. A residual connection is used, with a 1×1 convolution adjusting the dimensions. The second residual unit has the same structure. The output of this block is treated as the mid-level feature.
[0035] Layer 4 contains two residual units. In the first residual unit, the first layer is a 3×3 convolution with a stride of 2, which shrinks the feature map to 7×7 with 512 output channels; the second layer is a 3×3 convolution with a stride of 1. A residual connection is used, with a 1×1 convolution adjusting the dimensions. The second residual unit has the same structure. The output of this block is treated as the high-level feature.
[0036] The teacher network is trained end-to-end on the source domain dataset to learn robust fault characteristics. Training uses the [optimizer name missing] optimizer with an initial learning rate of 0.01, a batch size of 64, and a cross-entropy loss function, and runs for 100 epochs. A student network with the same structure as the teacher network is then constructed, and its parameters are randomly initialized.
[0037] During training, the teacher network does not participate in backpropagation; its parameters are updated using the exponential moving average (EMA) of the student network parameters: $$\theta_T^{(t)} \leftarrow \alpha\, \theta_T^{(t-1)} + (1-\alpha)\, \theta_S^{(t)}.$$
[0038] Here $t$ is the current training iteration number, $\theta_S^{(t)}$ denotes the student network parameters updated via backpropagation in the current iteration step, $\theta_T^{(t-1)}$ denotes the teacher network parameters at the previous step, and $\alpha$ is the smoothing coefficient, whose value in this embodiment is [value missing]. The smoothing coefficient gradually increases with the training iterations to ensure the stability of the teacher features.
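The EMA teacher update can be sketched as follows. Parameters are represented as a plain dictionary of scalars for brevity. The linear ramp in `momentum_schedule` and its endpoint values 0.99 and 0.999 are assumptions: the description only says that the smoothing coefficient gradually increases.

```python
def ema_update(teacher, student, momentum):
    """Return teacher parameters after one EMA step.

    new_teacher = momentum * teacher + (1 - momentum) * student
    """
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


def momentum_schedule(step, total_steps, start=0.99, end=0.999):
    """Hypothetical linear ramp of the smoothing coefficient over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


teacher = {"w": 1.0}
student = {"w": 0.0}
teacher = ema_update(teacher, student, momentum=0.9)
print(teacher["w"])
```

No gradient is involved in this update: the teacher only tracks a smoothed history of the student, which is why it can serve as a slowly moving reference.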
[0039] Step 3: Extract multi-level features from the meta-task input data using the teacher network and student network respectively. Map the multi-level features extracted by the two networks to the hyperbolic tangent space. Then construct a space-channel joint gating module to adaptively aggregate the deep multi-level features in the hyperbolic tangent space, obtaining teacher fusion features and student fusion features respectively.
Specifically, mid-level features and high-level features are obtained from both the teacher network and the student network. A convolution kernel is first used to downsample the mid-level features so that their size aligns with that of the high-level features. The hyperbolic tangent (tanh) activation is then applied separately to the aligned mid-level features and to the high-level features, mapping them to the hyperbolic tangent space.
[0040] Gating signals are generated in the following manner:
(1) A global average pooling layer averages the downsampled mid-level features along the spatial dimensions to obtain a channel descriptor.
[0041] (2) A pointwise convolution with 256 kernels and a stride of 1 is applied, with the number of output channels determined by the channel compression ratio [value missing]; ReLU is used as the activation function, yielding the compressed features.
[0042] (3) A depthwise convolution with a stride of 1 and a padding of 1 is applied, convolving each channel independently while keeping the number of channels unchanged, which yields the depthwise features.
[0043] (4) An activation function maps the depthwise features to a bounded interval, generating the spatial-channel joint gating signal. The gating generation process can be represented as $$G = \sigma\big(\mathrm{DWConv}(\mathrm{PWConv}(\mathrm{GAP}(\tilde{F}_{mid})))\big),$$ where $\tilde{F}_{mid}$ is the downsampled mid-level feature, $\sigma$ denotes the activation function, DWConv denotes depthwise convolution, PWConv denotes pointwise convolution (followed by ReLU), and GAP denotes global average pooling.
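The four gating steps can be traced literally in a small NumPy sketch. Several choices here are assumptions rather than details from the description: a sigmoid as the final activation, 512 input channels with a compression ratio of 4, and random weights. On a 1×1 pooled map, a 3×3 depthwise convolution with padding 1 only sees its centre tap, so the sketch keeps just that tap per channel.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gating_signal(feature, w_point, w_depth_center):
    """Literal trace of steps (1)-(4) on one feature map.

    feature:        (C, H, W) downsampled mid-level feature
    w_point:        (C_out, C) pointwise (1x1) convolution weights
    w_depth_center: (C_out,) centre taps of the per-channel depthwise kernels
    """
    descriptor = feature.mean(axis=(1, 2))             # (1) global average pooling
    squeezed = np.maximum(w_point @ descriptor, 0.0)   # (2) pointwise conv + ReLU
    mixed = w_depth_center * squeezed                  # (3) depthwise conv on a 1x1 map
    return sigmoid(mixed)[:, None, None]               # (4) gate values in (0, 1)


rng = np.random.default_rng(0)
feature = rng.standard_normal((512, 7, 7))          # hypothetical sizes
w_point = 0.01 * rng.standard_normal((128, 512))    # compression ratio 4 (assumption)
w_depth_center = rng.standard_normal(128)
gate = gating_signal(feature, w_point, w_depth_center)
print(gate.shape)
```

The sketch stops at the gate itself; how the gate is broadcast against the mid-level features during fusion depends on details that are not recoverable from the text.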
[0044] Step 4: At the fusion feature output ends of the teacher network and the student network, configure a decoupled teacher bypass adapter and student bypass adapter, respectively. Process the teacher and student fusion features through the adapters and compute their shrinkage covariances. Based on the teacher-side shrinkage covariance, extract the principal feature directions to construct a feature subspace. Project the shrinkage covariances of both the teacher and student sides onto this subspace and perform subspace covariance distillation. By minimizing the difference between the teacher and student shrinkage covariances within the subspace, obtain the distillation structure alignment loss.
Specifically, the fusion feature output ends of the teacher network and the student network are connected to the teacher bypass adapter and the student bypass adapter, respectively. Each adapter includes a gradient blocking layer, a feature transformation layer, and a normalization layer. The feature transformation layer uses a convolution kernel of size [value missing] with 512 input channels, 512 output channels, a stride of 1, and a padding of 0. It is followed by Batch Norm and ReLU activation to obtain the adapted features.
[0045] The adapted features are flattened along the spatial dimensions into a matrix. The matrix is centered within the batch, and the unbiased covariance matrix is calculated. To reduce estimation error, a shrinkage estimation method is applied to this covariance matrix.
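A sketch of the flatten, centre, and shrink sequence is given below. The shrinkage target (a scaled identity with the same trace) and the fixed coefficient are assumptions, because the exact estimator was lost from the source text; treating every spatial position of every sample as one observation is likewise an assumption.

```python
import numpy as np


def shrinkage_covariance(features, shrink):
    """Channel covariance of a (B, C, H, W) feature batch with shrinkage.

    Each spatial position of each sample is one observation of a
    C-dimensional vector (assumption).
    """
    b, c, h, w = features.shape
    x = features.transpose(0, 2, 3, 1).reshape(-1, c)   # (B*H*W, C)
    x = x - x.mean(axis=0, keepdims=True)               # in-batch centering
    cov = x.T @ x / (x.shape[0] - 1)                    # unbiased estimate
    target = np.trace(cov) / c * np.eye(c)              # scaled identity (assumption)
    return (1.0 - shrink) * cov + shrink * target


rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8, 3, 3))   # hypothetical small batch
sigma = shrinkage_covariance(feats, shrink=0.1)
print(sigma.shape)
```

Shrinking toward a scaled identity keeps the trace unchanged while pulling the eigenvalues toward their mean, which stabilizes the estimate when only a few samples are available.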
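[0046] Eigenvalue decomposition is performed on the teacher-side shrinkage covariance matrix: $\hat{\Sigma}_T = U \Lambda U^{\top}$, where $\Lambda$ is the diagonal matrix of eigenvalues. The first $k$ eigenvectors, which correspond to the principal feature directions, are selected to constitute the projection matrix $P$.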
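[0047] The shrinkage covariances of the teacher and the student are projected onto the subspace: $\hat{\Sigma}_T^{sub} = P^{\top} \hat{\Sigma}_T P$ and $\hat{\Sigma}_S^{sub} = P^{\top} \hat{\Sigma}_S P$, where $P$ is the projection matrix formed from the teacher-side principal eigenvectors. The subspace covariance distillation loss measures the difference between these two projected matrices. When calculating the loss, gradient blocking is applied to the teacher adapter, so that the loss is used only to update the parameters of the student network and student adapter.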
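The subspace construction and projection can be sketched with NumPy. The squared Frobenius distance used as the loss is an assumption, since the loss formula did not survive in the source; the function names and the toy covariance matrices are also illustrative.

```python
import numpy as np


def principal_subspace(cov_teacher, k):
    """Eigenvectors of the teacher covariance for its k largest eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(cov_teacher)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]
    return eigvecs[:, order]                         # shape (C, k)


def subspace_distill_loss(cov_teacher, cov_student, k):
    """Squared Frobenius distance between projected covariances (assumption)."""
    p = principal_subspace(cov_teacher, k)
    sub_t = p.T @ cov_teacher @ p
    sub_s = p.T @ cov_student @ p
    return float(np.sum((sub_t - sub_s) ** 2))


cov_t = np.diag([3.0, 2.0, 1.0])   # toy teacher covariance
cov_s = np.diag([1.0, 2.0, 1.0])   # toy student covariance
loss = subspace_distill_loss(cov_t, cov_s, k=2)
print(loss)
```

In an automatic-differentiation framework the teacher-side quantities would be detached from the graph so that only the student receives gradients; plain NumPy has no gradient graph, so that step is not shown.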
[0048] Step 5: Construct a total loss function that includes fault classification loss, teacher-student prototype semantic alignment loss, and distillation structure alignment loss. Optimize the student network with the total loss to achieve fault category prediction.
[0049] Specifically, a prototype classification head is connected at the fusion feature output end of the student network. It consists of a global average pooling layer, a Flatten layer, a fully connected layer FC1, a Dropout layer, a fully connected layer FC2, and an L2 normalization layer. FC1 has a 512-dimensional input and a 128-dimensional output, giving a weight matrix with 512 × 128 entries, and uses GELU as its activation function. FC2 has a 128-dimensional input, and its output dimension equals the number of fault categories. The dropout rate of the Dropout layer is [value missing].
[0050] For each category, a teacher prototype and a student prototype are calculated from the embedding vectors of the support set samples in the teacher network and the student network, respectively. The semantic alignment loss is calculated from these two sets of prototypes. For each query sample, the Euclidean distance between its student network embedding and the prototype of each category is computed, and the distances are normalized over all categories to obtain a probability distribution, from which the fault classification loss is calculated.
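Prototype-based classification of a query embedding can be sketched as follows. Two details are assumptions borrowed from common prototypical-network practice rather than confirmed by the source: prototypes are class means of the support embeddings, and probabilities are a softmax over negative Euclidean distances. The toy two-dimensional embeddings are illustrative.

```python
import math


def class_prototypes(embeddings, labels):
    """Mean embedding per class, computed from support-set samples."""
    sums, counts = {}, {}
    for vec, label in zip(embeddings, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def class_probabilities(query, prototypes):
    """Softmax over negative Euclidean distances to each prototype (assumption)."""
    logits = {label: -math.dist(query, proto) for label, proto in prototypes.items()}
    peak = max(logits.values())                       # subtract max for stability
    exps = {label: math.exp(z - peak) for label, z in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def classification_loss(query, true_label, prototypes):
    """Cross-entropy of one query sample against its true label."""
    return -math.log(class_probabilities(query, prototypes)[true_label])


support = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]]
labels = ["a", "a", "b", "b"]
protos = class_prototypes(support, labels)
probs = class_probabilities([1.0, 1.0], protos)
```

The query at (1, 1) lies at distance 1 from prototype "a" and distance 3 from prototype "b", so most of the probability mass goes to "a".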
[0051] The total loss combines the fault classification loss, the prototype semantic alignment loss, and the covariance distillation loss as a weighted sum: $$L_{total} = \lambda_1 L_{cls} + \lambda_2 L_{proto} + \lambda_3 L_{distill},$$ where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the loss weights; their values in this embodiment are [values missing]. The student network and student adapter are trained using the Adam optimizer with an initial learning rate of [value missing]. The learning rate decays by a factor of 0.5 every 10 epochs, and training runs for a total of 100 epochs. Teacher network parameters are dynamically updated via EMA and do not participate in backpropagation.
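The weighted loss and the step decay schedule can be written in a few lines. The loss weights and the initial learning rate are not recoverable from the source, so the values below are placeholders chosen only to make the arithmetic visible.

```python
def total_loss(l_cls, l_proto, l_distill, w_cls, w_proto, w_distill):
    """Weighted sum of the three loss terms."""
    return w_cls * l_cls + w_proto * l_proto + w_distill * l_distill


def step_decay_lr(initial_lr, epoch, step_size=10, gamma=0.5):
    """Learning rate after halving once every `step_size` epochs."""
    return initial_lr * gamma ** (epoch // step_size)


# Placeholder initial learning rate of 1.0, so the values read as decay ratios.
schedule = [step_decay_lr(1.0, epoch) for epoch in (0, 9, 10, 25, 99)]
combined = total_loss(1.0, 2.0, 3.0, w_cls=1.0, w_proto=0.5, w_distill=0.5)
print(schedule, combined)
```

Over 100 epochs the rate is halved nine times, so the final learning rate is 1/512 of the initial value.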
[0052] To demonstrate the effectiveness of the method of this invention, experimental verification was conducted on the Case Western Reserve University (CWRU) bearing dataset and the Paderborn University (PU) bearing dataset. The CWRU dataset, with its abundant data and broad coverage of operating conditions, was used as a pre-training dataset for the teacher network; the PU dataset, containing real accelerated life test data with more complex fault characteristics and significant non-stationarity, was used for cross-domain small sample training and testing of the student network.
[0053] The specific experimental setup is as follows: 1. A prototype auxiliary head is introduced into the teacher network, and it is pre-trained in full supervision using the CWRU dataset to learn a robust fault feature space with intra-class similarity and inter-class separation; 2. The prototype auxiliary head is frozen, the teacher backbone network is updated using EMA, and the student network is trained using the PU dataset for meta-learning based on the method of this invention. To verify the model's generalization ability under "unseen conditions," a strict cross-condition evaluation protocol is adopted: when constructing the meta-task, the support set samples are sampled only from two specific conditions in the PU dataset, while the query set samples are sampled from a completely different third condition. This means that the domain distributions of the support set and the query set have no intersection, forcing the model to learn essential fault features independent of the condition.
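The cross-condition protocol described above can be sketched as an episode sampler in which the support and query sets come from disjoint operating conditions. The condition names, class names, and sample identifiers are hypothetical placeholders, and the sketch assumes every class is available under every condition.

```python
import random


def cross_condition_episode(data, support_conditions, query_condition,
                            k_shot, n_query, rng):
    """Build one meta-task whose support and query sets share no condition.

    data[condition][label] is a list of samples.
    """
    if query_condition in support_conditions:
        raise ValueError("query condition must differ from support conditions")
    support, query = [], []
    for label in sorted(data[query_condition]):
        pool = [x for cond in support_conditions for x in data[cond][label]]
        support += [(x, label) for x in rng.sample(pool, k_shot)]
        query += [(x, label)
                  for x in rng.sample(data[query_condition][label], n_query)]
    return support, query


rng = random.Random(0)
conditions = ["cond_A", "cond_B", "cond_C"]          # hypothetical names
data = {
    cond: {f"fault_{c}": [f"{cond}/fault_{c}/{i}" for i in range(10)]
           for c in range(5)}
    for cond in conditions
}
support, query = cross_condition_episode(
    data, ["cond_A", "cond_B"], "cond_C", k_shot=1, n_query=3, rng=rng)
```

Because the query samples are drawn only from the held-out condition, the accuracy measured on them reflects generalization to an operating condition that never appears in the support set.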
[0054] The experiment used both 5-way 1-shot and 5-way 5-shot settings for evaluation, with Top-1 classification accuracy as the evaluation metric. To verify the effectiveness of the gated fusion and covariance distillation mechanism proposed in this invention, it was compared with the direct transfer performance of MatchNet, ProtoNet, and the teacher network. The experimental results are shown in Table 1. As shown in Table 1, with prior knowledge from the teacher, the gated fusion and covariance distillation mechanism proposed in this invention achieved significant improvements in both 1-shot and 5-shot tasks, indicating that simple model transfer is insufficient to cope with complex cross-condition scenarios. The results show that the model constructed in this invention learns more fine-grained general fault features. The covariance distillation strategy transfers the second-order structure information learned by the teacher network on CWRU to the student network, effectively compensating for the problem of insufficient target domain samples. At the same time, the gated mechanism effectively suppresses background noise interference caused by cross-conditions.
[0055] Embodiment 2: As shown in Figure 2, this embodiment provides a cross-domain small-sample fault diagnosis system based on gated fusion and covariance distillation, including:
The data preprocessing module, which takes the source domain fault dataset and the target domain small-sample fault dataset, performs continuous wavelet transform processing on the vibration signal, and obtains the time-frequency image as the meta-task input data.
The dual-branch network module, which constructs a teacher network and a student network with the same structure as the feature extraction backbone. The teacher network does not perform gradient backpropagation; instead, its parameters are updated as an exponential moving average (EMA) of the student network parameters, so that it continuously provides stable intrinsic semantic anchors.
The hyperbolic tangent space mapping module, which uses the teacher network and student network to extract multi-level features from the meta-task input data, respectively, and maps the multi-level features extracted by the two networks to the hyperbolic tangent space. A space-channel joint gating module is then constructed to adaptively aggregate the deep multi-level features in the hyperbolic tangent space, obtaining teacher fusion features and student fusion features respectively.
The bypass adaptation and subspace covariance distillation module, which configures a decoupled teacher bypass adapter and student bypass adapter at the fusion feature output ends of the teacher network and student network, respectively. The adapters process the teacher and student fusion features, and the shrinkage covariances are computed. Based on the teacher-side shrinkage covariance, the principal feature directions are extracted to construct a feature subspace. The shrinkage covariances of both the teacher and student sides are projected onto this subspace, and subspace covariance distillation is performed. By minimizing the difference between the teacher and student shrinkage covariances within the subspace, the distillation structure alignment loss is obtained.
The multi-loss fusion and network optimization module, which constructs a total loss function that includes the fault classification loss, the teacher-student prototype semantic alignment loss, and the distillation structure alignment loss. The student network is optimized using the total loss to achieve fault category prediction.
[0056] The above modules can be deployed on the same device or distributed devices; the division of modules is only a functional logic description and does not limit the specific physical boundaries or implementation order.
[0057] Embodiment 3: An electronic device is provided for running the aforementioned cross-domain small-sample fault diagnosis method based on gated fusion and covariance distillation. The electronic device includes a processor, a memory, and optional communication interfaces, display devices, input devices, and the like. The memory stores a computer program that can run on the processor, and when the processor executes the program, it implements steps one through five of the method described in Embodiment 1, specifically including but not limited to:
Step 1: Obtain the source domain fault dataset and the target domain small-sample fault dataset, perform continuous wavelet transform processing on the vibration signal, and obtain the time-frequency image as the input data for the meta-task.
Step 2: Construct teacher and student networks with identical structures as the feature extraction backbone. The teacher network does not perform gradient backpropagation; instead, its parameters are updated as an exponential moving average (EMA) of the student network parameters, so that it continuously provides stable intrinsic semantic anchors.
Step 3: Extract multi-level features from the meta-task input data using the teacher network and student network respectively. Map the multi-level features extracted by the two networks to the hyperbolic tangent space. Then construct a space-channel joint gating module to adaptively aggregate the deep multi-level features in the hyperbolic tangent space, obtaining teacher fusion features and student fusion features respectively.
Step 4: At the fusion feature output ends of the teacher network and student network, configure a decoupled teacher bypass adapter and student bypass adapter, respectively. Process the teacher and student fusion features through the adapters and compute their shrinkage covariances. Based on the teacher-side shrinkage covariance, extract the principal feature directions to construct a feature subspace. Project the shrinkage covariances of both the teacher and student sides onto this subspace and perform subspace covariance distillation. By minimizing the difference between the teacher and student shrinkage covariances within the subspace, obtain the distillation structure alignment loss.
Step 5: Construct a total loss function that includes the fault classification loss, the teacher-student prototype semantic alignment loss, and the distillation structure alignment loss. Optimize the student network with the total loss to achieve fault category prediction.
[0058] The electronic device hardware can be a server, personal computer, workstation, industrial controller, edge computing device, or mobile terminal; the processor can be a general-purpose CPU, GPU, NPU, FPGA, or a combination thereof; the memory can be RAM, ROM, flash memory, or a disk array. The device can interact with local or remote data storage (acquiring vibration data and outputting diagnosis results) through a communication interface. The above hardware configuration does not constitute a limitation of the present invention.
[0059] Embodiment 4: A computer-readable storage medium storing a computer program which, when run on a processor of an electronic device, causes the processor to perform steps one through five of the method described in Embodiment 1. The storage medium may be a disk, optical disk, flash memory, solid-state drive, read-only memory, random access memory, or any combination of the above media.
[0060] Those skilled in the art will understand that the modules or steps described above can be implemented using general-purpose computer devices. Optionally, they can be implemented using computer-executable program code, which can then be stored in a storage device for execution by a computer device. Alternatively, they can be fabricated as separate integrated circuit modules, or multiple modules or steps can be fabricated as a single integrated circuit module. This disclosure is not limited to any particular combination of hardware and software.
[0061] The above description is merely a preferred embodiment of this application and is not intended to limit this application. Various modifications and variations can be made to this application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the protection scope of this application.
Claims
1. A cross-domain small-sample fault diagnosis method based on gated fusion and covariance distillation, characterized in that it includes the following steps:
obtaining the source domain fault dataset and the target domain small-sample fault dataset, performing continuous wavelet transform processing on the vibration signal, and obtaining the time-frequency image as the meta-task input data;
constructing teacher and student networks with identical structures as the feature extraction backbone, wherein the teacher network does not perform gradient backpropagation and its parameters are instead updated as an exponential moving average of the student network parameters, so that it continuously provides stable intrinsic semantic anchors;
extracting multi-level features of the meta-task input data using the teacher and student networks respectively, mapping the multi-level features extracted by the two networks to the hyperbolic tangent space respectively, and then constructing a space-channel joint gating module to adaptively aggregate the deep multi-level features in the hyperbolic tangent space, obtaining teacher fusion features and student fusion features respectively;
configuring a decoupled teacher bypass adapter and student bypass adapter at the fusion feature output ends of the teacher network and student network, respectively; processing the teacher and student fusion features by the adapters and computing their shrinkage covariances; extracting the principal feature directions based on the teacher-side shrinkage covariance to construct a feature subspace; projecting the shrinkage covariances of both the teacher and student sides onto this subspace and performing subspace covariance distillation; and obtaining the distillation structure alignment loss by minimizing the difference between the teacher and student shrinkage covariances within the subspace;
constructing a total loss function that includes the fault classification loss, the teacher-student prototype semantic alignment loss, and the distillation structure alignment loss, and optimizing the student network with the total loss to achieve fault category prediction.
2. The cross-domain small-sample fault diagnosis method based on gated fusion and covariance distillation according to claim 1, characterized in that the vibration signal is processed by continuous wavelet transform, specifically: using Morlet wavelets as basis functions, a continuous wavelet transform is performed on the original one-dimensional vibration signals in the source and target domains to convert each one-dimensional time-domain signal into a two-dimensional time-frequency matrix; the two-dimensional time-frequency matrix is normalized to the [0,1] interval and then adjusted into a fixed-size RGB three-channel image using a bicubic interpolation algorithm; meta-task input data containing support set samples and query set samples under different working conditions are thereby constructed. This construction is a mapping from the original one-dimensional signal space to the meta-task dataset, defined in terms of: the support set and the query set of each meta-task; the total number of meta-tasks; the feature space of time-frequency images; the label space representing the different fault categories of the device; the original one-dimensional vibration signal in the source or target domain, which belongs to the space of square-integrable signals; the sampling time variable; and the total signal duration, with the number of sampling points being [number missing].
3. The cross-domain small-sample fault diagnosis method based on gated fusion and covariance distillation according to claim 1, characterized in that both the teacher network and the student network use ResNet-18 as the backbone architecture for feature extraction, and the teacher network follows the student network through smooth parameter evolution. In the $t$-th iteration, the teacher network parameters are updated as follows: $$\theta_T^{(t)} \leftarrow \alpha\, \theta_T^{(t-1)} + (1-\alpha)\, \theta_S^{(t)},$$ where $\leftarrow$ means that the left-hand variable is updated with the right-hand result, $t$ is the current training iteration number, $\theta_S^{(t)}$ denotes the student network parameters updated via backpropagation in the current iteration step, $\theta_T^{(t-1)}$ denotes the teacher network parameters at the previous moment, and $\alpha$ is the smoothing coefficient. The teacher network does not perform gradient backpropagation, which explicitly truncates the gradient flow of the teacher network; that is, its parameters $\theta_T$ are the result of the smooth accumulation of the historical states of the student network parameters $\theta_S$.
4. The cross-domain small-sample fault diagnosis method based on gated fusion and covariance distillation according to claim 1, characterized in that the multi-level features extracted by the two networks are each mapped to the hyperbolic tangent space, specifically: the hyperbolic tangent function is used as the multi-level feature activation function to perform hyperbolic tangent space mapping on the high-level features of the teacher network and the student network, as well as on the intermediate layer features after downsampling adjustment; the mapped intermediate layer features are then weighted and modulated by the space-channel joint gating signal.
The multi-level features are defined as follows: for any input time-frequency image sample, the outputs of Layer 3 and Layer 4 in the ResNet-18 network are extracted as key feature maps, where Layer 3 and Layer 4 denote the third and fourth residual block levels of the feature extraction backbone network, respectively. The intermediate layer features of the teacher network and the student network are defined as the output of Layer 3, contain local structure and texture information, and are characterized by the number of channels, height, and width of the intermediate layer feature map. The high-level features of the teacher network and the student network are defined as the output of Layer 4, contain highly abstract category semantic information, and are characterized by the number of channels, height, and width of the high-level feature map.
The downsampling adjustment applies only to the intermediate layer feature maps of the teacher and student networks: a downsampling convolution kernel adjusts the spatial dimensions and channel dimensions of the intermediate layer features to be consistent with those of the high-level features.
The hyperbolic tangent space mapping inputs the downsampled intermediate layer features and the high-level features into the hyperbolic tangent function for nonlinear mapping, so that the multi-level features fall into a unified hyperbolic tangent space coordinate system.
5. The cross-domain small-sample fault diagnosis method based on gated fusion and covariance distillation according to claim 1 or 4, characterized in that the adaptive aggregation of deep multi-level features within the hyperbolic tangent space is defined in terms of the following quantities: the final teacher fusion feature or student fusion feature; the Hadamard product (element-wise multiplication), through which the gating signal applies a linear weighting over space and channels to the intermediate layer features; a hierarchical fusion coefficient, used to control the proportion of low-level detail information injected into the high-level semantic stream; and a hyperbolic scaling factor, used to adjust the modulus distribution of features in the hyperbolic tangent space. The hierarchical fusion coefficient and the hyperbolic scaling factor are learnable and follow a dynamic adjustment strategy: the hierarchical fusion coefficient is initialized to 0 in the initial training phase, and as the number of training iterations increases, both values are adaptively updated by the backpropagation algorithm. Under the control of the hierarchical fusion coefficient, the network learns to what extent the intermediate layer detail features are introduced; under the control of the hyperbolic scaling factor, it learns how to adjust the radius of curvature of the feature space. This achieves a smooth transition from coarse-grained feature learning to fine-grained feature fusion, ultimately yielding fused features containing rich multi-scale information as the subsequent input.
6. The cross-domain small-sample fault diagnosis method based on gated fusion and covariance distillation according to claim 1, characterized in that the shrinkage covariances on both the teacher and student sides are projected onto the feature subspace, and subspace covariance distillation is performed; by minimizing the difference between the teacher and student shrinkage covariances within the subspace, the distillation structure alignment loss is obtained, specifically: a projection matrix $P$ is used to project the teacher-side and student-side shrinkage covariances onto the feature subspace, giving the subspace covariance representations $\hat{\Sigma}_T^{sub} = P^{\top} \hat{\Sigma}_T P$ and $\hat{\Sigma}_S^{sub} = P^{\top} \hat{\Sigma}_S P$, where $\hat{\Sigma}_T^{sub}$ and $\hat{\Sigma}_S^{sub}$ denote the second-order statistics within the subspace, of dimension [value missing], and characterize the correlation structure of the teacher and student features in the teacher's principal-direction subspace; $\hat{\Sigma}$ denotes the covariance matrix with a shrinkage coefficient applied. The goal of the subspace covariance distillation is to make the student's second-order structure statistics approximate the teacher's within the teacher-defined principal structure subspace, that is, to align the two in numerical values and distribution patterns; the distillation structure alignment loss quantifies this difference.
7. The cross-domain small-sample fault diagnosis method based on gated fusion and covariance distillation according to claim 6, characterized in that a total loss function is constructed that includes the fault classification loss, the teacher-student prototype semantic alignment loss, and the distillation structure alignment loss, as follows:
the fault classification loss is computed over the query set from the classification probability of each query sample and the actual category label of that query sample; the classification probability is obtained from a distance metric function that measures the similarity between the query embedding produced by the student network and the student category prototype constructed for each class, normalized by a summation over the category index;
the prototype classification head includes an adaptive global average pooling layer, a Flatten layer, a Linear1 layer, a normalization layer, a GELU layer, a Dropout layer, a Linear2 layer, and an L2 normalization layer connected in sequence, which map the fused features of the student network into normalized embedding vectors to support subsequent distance metric classification based on the category prototypes;
the teacher-student prototype semantic alignment loss is computed over the number of categories within the meta-task, with the L2 norm of a vector used to normalize the prototype vectors;
the fault classification loss, the teacher-student prototype semantic alignment loss, and the distillation structure alignment loss are combined into the total loss using three loss weights;
using an optimizer, the total loss updates the student network parameters, the student bypass adapter parameters, and the prototype classification head parameters via backpropagation, with the update step scaled by the learning rate.
8. A cross-domain small-sample fault diagnosis system based on gated fusion and covariance distillation, characterized in that it comprises: a data preprocessing module, configured to acquire a source-domain fault dataset and a target-domain small-sample fault dataset, apply the continuous wavelet transform to the vibration signals, and obtain time-frequency images as the meta-task input data; a dual-branch network module, configured to construct a teacher network and a student network that share the same structure as feature extraction backbones, wherein the teacher network does not take part in gradient backpropagation and its parameters are instead updated as an exponential moving average of the student network parameters, so as to continuously provide stable intrinsic semantic anchors; a hyperbolic tangent space mapping module, configured to extract multi-level features from the meta-task input data with the teacher network and the student network respectively, map the multi-level features extracted by the two networks into the hyperbolic tangent space, and construct a spatial-channel joint gating module that adaptively aggregates the deep multi-level features in the hyperbolic tangent space to obtain the teacher fusion features and the student fusion features respectively; a bypass adaptation and subspace covariance distillation module, configured to provide a decoupled teacher bypass adapter and a decoupled student bypass adapter at the fusion feature outputs of the teacher network and the student network respectively, wherein the adapters process the teacher and student fusion features to obtain the contracted covariances, the principal feature directions are extracted from the contracted covariance of the teacher side to construct a feature subspace, the contracted covariances of both the teacher side and the student side are projected onto this subspace, subspace covariance distillation is performed, and the distillation structure alignment loss is obtained by minimizing the difference between the contracted teacher and student covariance distributions within the subspace; and a multi-loss fusion and network optimization module, configured to construct a total loss function that includes the fault classification loss, the teacher-student prototype semantic alignment loss, and the distillation structure alignment loss, and to optimize the student network with the total loss so as to predict the fault category.
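The exponential-moving-average relationship between the two branches in claim 8 can be sketched as follows. This is a minimal illustration under stated assumptions: parameters are held in plain dictionaries of NumPy arrays, and the function name `ema_update` and the `momentum` value are hypothetical rather than taken from the claims.

```python
import numpy as np

def ema_update(teacher, student, momentum=0.99):
    """Move each teacher parameter toward the matching student parameter.

    teacher, student: dicts mapping parameter names to arrays of equal shape.
    momentum: assumed smoothing factor; values near 1 make the teacher change slowly.
    The student is only read here; it is trained separately by backpropagation.
    """
    for name, s_val in student.items():
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * s_val
    return teacher
```

Called once after each student optimization step, this keeps the teacher a slowly varying average of recent student states, which is what lets it serve as a stable reference without receiving gradients of its own.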
9. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the cross-domain small-sample fault diagnosis method based on gated fusion and covariance distillation according to any one of claims 1-7.
10. A computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the cross-domain small-sample fault diagnosis method based on gated fusion and covariance distillation.