Voltage sag source classification method based on multi-modal feature extraction and double attention mechanism
The voltage sag source classification method using multimodal feature extraction and dual attention mechanism solves the problems of insufficient feature fusion and poor model interpretability in existing technologies, achieving high-precision and interpretable voltage sag source classification and improving the intelligent operation and maintenance capabilities of power systems.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- STATE GRID SHANGHAI MUNICIPAL ELECTRIC POWER CO
- Filing Date
- 2026-05-14
- Publication Date
- 2026-06-12
AI Technical Summary
Existing technologies struggle to effectively integrate the multimodal features of voltage sag signals, resulting in poor classification accuracy and interpretability. Furthermore, deep learning models lack interpretability and struggle to simultaneously capture key information in both time and channel dimensions.
A voltage sag source classification model is constructed by employing multimodal feature extraction and a dual attention mechanism. This model extracts features in the time-frequency domain, depth space, and deep temporal domain, and combines modal attention and feature attention mechanisms to perform feature fusion and enhancement.
It significantly improves the accuracy and interpretability of voltage sag source classification, achieves high-precision classification in complex scenarios, enhances the system's engineering robustness and adaptability, and supports intelligent operation and maintenance of power systems.
Figure CN122196507A_ABST
Description
Technical Field
[0001] This invention relates to the field of voltage sag source classification technology, and in particular to a voltage sag source classification method based on multimodal feature extraction and a dual attention mechanism.
Background Technology
[0002] Voltage sags are one of the most common power quality problems in power systems and pose the greatest threat to sensitive industrial users. Their sources are diverse, generally categorized as short-circuit faults, large motor starting, and transformer commissioning. Accurately locating and identifying voltage sag sources is a crucial prerequisite for improving power quality, assigning responsibility, and enhancing power supply reliability.
[0003] Traditional identification methods rely primarily on expert experience to analyze steady-state characteristics of voltage sag events, such as voltage amplitude, duration, and phase transitions, and make judgments based on thresholds. However, with the increasing complexity of power system structures and the intertwined causes of sag events, relying solely on limited steady-state characteristics is insufficient to effectively distinguish between sag types with similar features, leading to a bottleneck in classification accuracy and inefficiency due to heavy reliance on manual operation.
[0004] With the development of artificial intelligence technology, data-driven automatic classification methods for voltage sag sources have become a mainstream research direction. Many scholars have attempted to use machine learning algorithms such as support vector machines and random forests, as well as deep learning models such as convolutional neural networks and long short-term memory networks, to mine classification patterns from massive amounts of waveform data. These methods have improved the ability to extract high-dimensional and nonlinear features to some extent, but still have significant limitations.
[0005] On the one hand, most studies focus only on single-modal features (such as using only time-domain waveforms or frequency-domain components), failing to fully integrate the complementary multimodal information of voltage sags in the time, frequency, and time-series trajectories. This results in incomplete feature representations, and the model easily overlooks key discriminative details. On the other hand, while mainstream deep learning models possess powerful feature learning capabilities, their internal decision-making processes are often like "black boxes," lacking interpretability. This makes it difficult for maintenance personnel to understand and trust the model's classification results, and even more difficult to obtain valuable tracing evidence such as "what is the key data segment that led to this classification." Furthermore, even though some studies attempt to introduce attention mechanisms to improve model performance, they usually only use a single type of attention (such as temporal attention), which is insufficient to simultaneously address the complexity of voltage sag signals in both the time and channel dimensions. Temporal attention can focus on key time points, but the contribution of different feature channels (such as harmonic components and d-q axis components) to classification also varies; ignoring any dimension will result in information loss.
[0006] Therefore, how to collaboratively integrate multimodal features in the model, design an effective mechanism to simultaneously capture key information in the time and channel dimensions, break the "black box" limitation, and build a high-precision and interpretable voltage sag source classification model has become a core problem that urgently needs to be solved in this field.
Summary of the Invention
[0007] Based on the above analysis, the embodiments of the present invention aim to provide a voltage sag source classification method based on multimodal feature extraction and dual attention mechanism, in order to solve the problems of incomplete voltage sag source classification representation, poor model interpretability, and difficulty in simultaneously capturing key information in time and channel dimensions in the prior art.
[0008] This invention provides a voltage sag source classification method based on multimodal feature extraction and a dual attention mechanism, the method comprising: constructing a voltage sag training set from the feature enhancement representation vectors of all voltage sag waveform sample data and their corresponding voltage sag source types; training a voltage sag source classifier on the voltage sag training set to obtain a trained voltage sag source classifier; and inputting the feature enhancement representation vector of real-time voltage sag waveform data into the trained voltage sag source classifier, which predicts the voltage sag source type corresponding to the real-time voltage sag waveform data. Here, both the voltage sag waveform sample data and the real-time voltage sag waveform data are treated as voltage transient waveform data, and the corresponding feature enhancement representation vector is obtained by performing multimodal feature extraction on the voltage transient waveform data, followed by feature fusion and enhancement based on the dual attention mechanism.
[0009] Based on the above solution, the present invention also makes the following improvements. Furthermore, performing multimodal feature extraction and dual-attention-based feature fusion and enhancement on the voltage transient waveform data comprises: sequentially performing preprocessing, multimodal feature extraction, and feature standardization on the voltage sag waveform data to obtain the corresponding standard multimodal feature extraction results; and performing feature fusion and enhancement based on the dual attention mechanism on the standard multimodal feature extraction results to obtain the corresponding feature enhancement representation vector.
[0010] Furthermore, multimodal feature extraction on the preprocessed voltage sag waveform data comprises: performing time-frequency domain feature extraction, CNN feature extraction based on a time-frequency image, and LSTM feature extraction on the preprocessed voltage sag waveform data to obtain the corresponding time-frequency domain feature vector, depth space feature vector, and depth time-series feature vector, which together form the multimodal feature extraction results.
[0011] Furthermore, feature standardization of the multimodal feature extraction results comprises: standardizing the time-frequency domain feature vector, the depth space feature vector, and the depth time-series feature vector respectively to obtain the corresponding standard time-frequency domain feature vector, standard depth space feature vector, and standard depth time-series feature vector, which serve as the feature vectors of the three modalities in the standard multimodal feature extraction results.
[0012] Furthermore, feature fusion and enhancement based on the dual attention mechanism on the standard multimodal feature extraction results comprises: mapping the feature vector of each modality in the standard multimodal feature extraction results to the same feature dimension space, and concatenating the mapped feature vectors of all modalities to obtain the initial fused feature vector; fusing the initial fused feature vector using a modal attention network to obtain a weighted fused feature vector; and inputting the weighted fused feature vector into the feature attention enhancement network to obtain the corresponding feature enhancement representation vector.
[0013] Furthermore, the weighted fused feature vector is obtained as follows: the initial fused feature vector is input into the modal attention network to obtain an attention score for the feature vector of each modality; the attention scores are normalized to obtain the attention weight of each modality's feature vector; and the feature vectors of the modalities, as mapped to the same feature dimension space, are weighted by their attention weights and fused to obtain the weighted fused feature vector.
[0014] Furthermore, the corresponding feature enhancement representation vector is obtained as follows: the weighted fused feature vector is input into the feature attention enhancement network, which performs global information embedding, adaptive weight learning, and feature enhancement on it in sequence to produce the feature enhancement representation vector.
[0015] Furthermore, time-frequency domain feature extraction on the preprocessed voltage sag waveform data comprises: extracting time-domain features and frequency-domain features from the preprocessed voltage sag waveform data, and concatenating the two sets of features to form the corresponding time-frequency domain feature vector.
[0016] Furthermore, training the voltage sag source classifier on the voltage sag training set comprises: using the feature enhancement representation vector of each voltage sag waveform sample as the input and the corresponding voltage sag source type as the label, training the voltage sag source classifier to obtain a trained voltage sag source classifier.
[0017] Furthermore, the voltage sag source classifier is implemented using a deep neural network classification model.
[0018] Compared with the prior art, the present invention can achieve at least one of the following beneficial effects: (1) This invention significantly improves the accuracy and reliability of classification in complex scenarios. Traditional methods face bottlenecks due to their reliance on single features. This invention integrates features from three dimensions, namely time-frequency domain features, deep spatial features of the time-frequency image, and time-series features, to construct a comprehensive feature profile. By combining a CNN and an LSTM to extract high-order abstract features, and using a dual attention mechanism to dynamically focus on key information and suppress noise, it achieves high-precision differentiation of sag types with similar features.
[0019] (2) This invention significantly enhances the engineering robustness and adaptability of the system. Deep networks can learn more essential and interference-resistant patterns than hand-crafted features. The core of this invention is that the modal attention mechanism dynamically adjusts the weight of each feature source according to the signal quality, forming an adaptive weighted fusion strategy that keeps the output stable and reliable under different operating environments and data quality.
[0020] (3) This invention provides efficient technical support for intelligent operation and maintenance of power systems and has outstanding practical value. It realizes end-to-end automation from data to diagnosis, eliminating the reliance on expert experience. This invention can be integrated into existing systems to realize real-time automatic analysis of massive monitoring points, which can quickly locate sag sources and clarify responsibilities, and also discover weak links in the power grid through long-term statistics, greatly improving the efficiency and intelligence level of power supply quality management.
[0021] In this invention, the above-described technical solutions can be combined with each other to achieve more preferred combinations. Other features and advantages of this invention will be set forth in the following description, and some advantages may become apparent from the description or be learned by practicing the invention. The objects and other advantages of this invention can be realized and obtained from what is particularly pointed out in the description and drawings.
Attached Figure Description
[0022] The accompanying drawings are for illustrative purposes only and are not intended to limit the invention. Throughout the drawings, the same reference numerals denote the same parts. Figure 1 is a flowchart of a voltage sag source classification method based on multimodal feature extraction and a dual attention mechanism provided in an embodiment of the present invention; Figure 2 is a schematic diagram of the multimodal feature extraction process for voltage sag waveform data.
Detailed Implementation
[0023] Preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings, which form part of this application and are used together with the embodiments of the present invention to illustrate the principles of the present invention, but are not intended to limit the scope of the present invention.
[0024] A specific embodiment of this invention discloses a voltage sag source classification method based on multimodal feature extraction and a dual attention mechanism, which improves the accuracy and interpretability of voltage sag source classification and enables the analysis and management of sag events in complex power system environments. The flowchart of this method is shown in Figure 1 and is explained in detail below.
[0025] Step S1: Construct a voltage sag training set based on the feature enhancement representation vectors of all voltage sag waveform sample data and their corresponding voltage sag source types.
[0026] Step S2: Train the voltage sag source classifier using the voltage sag training set to obtain a successfully trained voltage sag source classifier.
[0027] Step S3: Input the feature enhancement representation vector of the real-time voltage sag waveform data into the trained voltage sag source classifier, and let the voltage sag source classifier predict the voltage sag source type corresponding to the real-time voltage sag waveform data.
[0028] In this embodiment, both voltage sag waveform sample data and voltage sag waveform real-time data are used as voltage transient waveform data. By performing multimodal feature extraction and feature fusion and enhancement based on a dual attention mechanism on the voltage transient waveform data, a corresponding feature enhancement representation vector is obtained.
[0029] The process of obtaining the feature enhancement representation vector from voltage transient waveform data is explained in detail below.
[0030] Step C1: Perform preprocessing, multimodal feature extraction, and feature standardization on the voltage sag waveform data in sequence to obtain the corresponding standard multimodal feature extraction results.
[0031] Preferably, in this embodiment, the multimodal feature extraction results include standard time-frequency domain feature vectors, standard depth space feature vectors, and standard depth temporal feature vectors. The specific implementation process is described below.
[0032] Step C11: Preprocess the voltage sag waveform data.
[0033] In practice, voltage sag waveform data can be collected using sensors or monitoring devices. In this embodiment, the voltage sag waveform sample data is a one-dimensional time-series signal with $N$ sampling points, where $N$ represents the total number of sampling points in the voltage sag waveform data. Preferably, in this embodiment, to eliminate the interference of noise and baseline drift, the voltage sag waveform data is preprocessed. The preprocessing may include noise reduction filtering followed by normalization.
[0034] For example, wavelet thresholding can be used for noise reduction filtering so that the edge features of the voltage sag waveform data are preserved as far as possible. Normalization scales the original signal to a standard range and removes the influence of the amplitude scale; a common method is min-max normalization. The preprocessed voltage sag waveform data is denoted as $x(n)$, $n = 0, 1, \dots, N-1$.
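The preprocessing step can be illustrated with a minimal NumPy sketch. It assumes a single-level Haar wavelet with a universal soft threshold, which is only one possible form of wavelet thresholding; the embodiment does not specify the wavelet family, decomposition depth, or threshold rule. The function names `haar_denoise`, `min_max_normalize`, and `preprocess` are hypothetical.

```python
import numpy as np

def haar_denoise(x):
    """One-level Haar wavelet soft-threshold denoising (illustrative assumption)."""
    x = np.asarray(x, dtype=float)
    n = len(x) - (len(x) % 2)              # Haar pairs need an even length
    even, odd = x[0:n:2], x[1:n:2]
    approx = (even + odd) / np.sqrt(2)     # low-pass (approximation) coefficients
    detail = (even - odd) / np.sqrt(2)     # high-pass (detail) coefficients
    # Universal threshold with a robust, median-based noise estimate.
    sigma = np.median(np.abs(detail)) / 0.6745
    thr = sigma * np.sqrt(2.0 * np.log(n))
    detail = np.sign(detail) * np.maximum(np.abs(detail) - thr, 0.0)
    out = x.copy()                         # a trailing odd sample is left unchanged
    out[0:n:2] = (approx + detail) / np.sqrt(2)
    out[1:n:2] = (approx - detail) / np.sqrt(2)
    return out

def min_max_normalize(x):
    """Scale the signal to the range [0, 1]."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        return np.zeros_like(x)
    return (x - x.min()) / span

def preprocess(raw):
    """Noise reduction filtering followed by normalization, in that order."""
    return min_max_normalize(haar_denoise(raw))
```

A production system would more likely use a multi-level decomposition with a smoother wavelet; the sketch only shows the order of operations described above.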
[0035] Step C12: Perform multimodal feature extraction on the preprocessed voltage sag waveform data to obtain the multimodal feature extraction results.
[0036] Preferably, in this embodiment, multimodal feature extraction is performed on the preprocessed voltage sag waveform data. Specifically, the following operations are performed: time-frequency domain feature extraction, CNN feature extraction based on time-frequency image, and LSTM feature extraction are performed on the preprocessed voltage sag waveform data to obtain the corresponding time-frequency domain feature vector, depth space feature vector, and depth time-series feature vector. The time-frequency domain feature vector, depth space feature vector, and depth time-series feature vector are used as the multimodal feature extraction results.
[0037] A schematic diagram of the multimodal feature extraction process for voltage sag waveform data is shown in Figure 2, and the specific implementation process is described below.
[0038] (1) Time-frequency domain feature extraction
Time-frequency domain feature extraction includes time-domain feature extraction and frequency-domain feature extraction.
[0039] 1) Time-domain feature extraction
Specifically, in this embodiment, time-domain feature extraction refers to extracting statistical quantities that reflect the signal amplitude, fluctuation, and shape characteristics from the preprocessed voltage sag waveform data as time-domain features, thereby constructing a time-domain feature vector. Preferably, in this embodiment, the time-domain features include dimensional features and dimensionless features. A detailed explanation follows.
[0040] Preferably, the dimensional characteristics include mean, root mean square, standard deviation, and peak value.
[0041] Mean:
$$\mu = \frac{1}{N}\sum_{n=0}^{N-1} x(n) \tag{1}$$
Root mean square (RMS):
$$X_{\mathrm{rms}} = \sqrt{\frac{1}{N}\sum_{n=0}^{N-1} x(n)^2} \tag{2}$$
Standard deviation:
$$\sigma = \sqrt{\frac{1}{N}\sum_{n=0}^{N-1} \bigl(x(n) - \mu\bigr)^2} \tag{3}$$
Peak value:
$$X_{\mathrm{peak}} = \max_{0 \le n \le N-1} |x(n)| \tag{4}$$
Dimensionless features are insensitive to load changes and have good stability.
[0042] Preferably, the dimensionless features include skewness, kurtosis, form factor, and crest factor.
[0043] Skewness measures the asymmetry of the signal distribution and is expressed as:
$$S_k = \frac{1}{N}\sum_{n=0}^{N-1} \left(\frac{x(n) - \mu}{\sigma}\right)^{3} \tag{5}$$
Kurtosis measures the steepness and tail characteristics of the signal distribution and is expressed as:
$$K_u = \frac{1}{N}\sum_{n=0}^{N-1} \left(\frac{x(n) - \mu}{\sigma}\right)^{4} \tag{6}$$
Form factor:
$$F_{\mathrm{form}} = \frac{X_{\mathrm{rms}}}{\frac{1}{N}\sum_{n=0}^{N-1} |x(n)|} \tag{7}$$
Crest factor:
$$F_{\mathrm{crest}} = \frac{X_{\mathrm{peak}}}{X_{\mathrm{rms}}} \tag{8}$$
Finally, the time-domain feature vector $\mathbf{F}_{\mathrm{t}}$ obtained by time-domain feature extraction is represented as:
$$\mathbf{F}_{\mathrm{t}} = \bigl[\mu,\ X_{\mathrm{rms}},\ \sigma,\ X_{\mathrm{peak}},\ S_k,\ K_u,\ F_{\mathrm{form}},\ F_{\mathrm{crest}}\bigr] \tag{9}$$
2) Frequency-domain feature extraction
The preprocessed voltage sag waveform data is converted to the frequency domain using the Fast Fourier Transform (FFT), and the energy distribution of its frequency components is analyzed to obtain several frequency-domain features, which form the frequency-domain feature vector. In this embodiment, the preferred frequency-domain features are the spectral centroid, spectral bandwidth, and spectral entropy, computed as described below.
[0044] Calculate the Discrete Fourier Transform (DFT) of the preprocessed voltage sag waveform data:
$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N} \tag{10}$$
Calculate the power spectral density (PSD) estimate:
$$P(k) = \frac{1}{N}\,|X(k)|^{2}, \quad k = 0, 1, \dots, M \tag{11}$$
where $M$ is the highest frequency index retained (for a one-sided spectrum, typically $\lfloor N/2 \rfloor$).
1) Spectral Centroid
The spectral centroid reflects where the spectral energy is concentrated and is expressed as:
$$f_c = \frac{\sum_{k=0}^{M} f_k\, P(k)}{\sum_{k=0}^{M} P(k)} \tag{12}$$
where $f_k = k f_s / N$ is the actual frequency value corresponding to the $k$-th frequency component and $f_s$ is the sampling frequency.
[0045] 2) Spectral Bandwidth
Spectral bandwidth measures the dispersion of the spectrum around the centroid and is expressed as:
$$B = \sqrt{\frac{\sum_{k=0}^{M} (f_k - f_c)^{2}\, P(k)}{\sum_{k=0}^{M} P(k)}} \tag{13}$$
3) Spectral Entropy
Spectral entropy measures the degree of disorder in the spectrum; the higher the entropy value, the more complex the frequency components.
[0046] First, normalize $P(k)$ into a probability distribution:
$$p(k) = \frac{P(k)}{\sum_{i=0}^{M} P(i)} \tag{14}$$
Then calculate the Shannon entropy:
$$H = -\sum_{k=0}^{M} p(k) \log p(k) \tag{15}$$
Finally, the frequency-domain feature vector $\mathbf{F}_{\mathrm{f}}$ obtained by frequency-domain feature extraction is represented as:
$$\mathbf{F}_{\mathrm{f}} = [f_c,\ B,\ H] \tag{16}$$
Therefore, the time-frequency domain feature vector $\mathbf{F}_{\mathrm{tf}}$ obtained by time-frequency domain feature extraction is the concatenation of the two:
$$\mathbf{F}_{\mathrm{tf}} = [\mathbf{F}_{\mathrm{t}},\ \mathbf{F}_{\mathrm{f}}] \tag{17}$$
In this embodiment, leveraging the powerful nonlinear mapping and feature learning capabilities of deep neural networks, higher-level and more discriminative deep features are automatically extracted from time-frequency images and original waveform sequences, respectively. This process is jointly implemented by parallel CNN feature extraction branches and LSTM feature extraction branches. Specific details are as follows.
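Before turning to the deep branches, the hand-crafted time-frequency domain feature vector of equations (1) to (17) can be sketched in NumPy. This is a minimal illustration under stated assumptions: the standard deviation uses the population form (division by $N$), the spectrum is one-sided (`numpy.fft.rfft`), and the entropy uses the natural logarithm. The function names are hypothetical.

```python
import numpy as np

def time_domain_features(x):
    """Mean, RMS, std, peak, skewness, kurtosis, form factor, crest factor."""
    x = np.asarray(x, dtype=float)
    mean = x.mean()
    rms = np.sqrt(np.mean(x ** 2))
    std = x.std()                       # population form (divide by N), an assumption
    peak = np.max(np.abs(x))
    z = (x - mean) / std                # undefined for a constant signal (std == 0)
    skew = np.mean(z ** 3)
    kurt = np.mean(z ** 4)
    form = rms / np.mean(np.abs(x))
    crest = peak / rms
    return np.array([mean, rms, std, peak, skew, kurt, form, crest])

def frequency_domain_features(x, fs):
    """Spectral centroid, spectral bandwidth, and spectral entropy."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    spectrum = np.fft.rfft(x)           # one-sided DFT, bins 0..floor(N/2)
    psd = np.abs(spectrum) ** 2 / n
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)   # f_k = k * fs / N
    centroid = np.sum(freqs * psd) / np.sum(psd)
    bandwidth = np.sqrt(np.sum((freqs - centroid) ** 2 * psd) / np.sum(psd))
    p = psd / np.sum(psd)
    p = p[p > 0]                        # skip empty bins so log() stays finite
    entropy = -np.sum(p * np.log(p))
    return np.array([centroid, bandwidth, entropy])

def time_frequency_features(x, fs):
    """Concatenate the time-domain and frequency-domain feature vectors."""
    return np.concatenate([time_domain_features(x),
                           frequency_domain_features(x, fs)])
```

With eight time-domain and three frequency-domain features, the resulting vector has eleven entries. The PSD scaling does not affect the centroid, bandwidth, or entropy, because each of them normalizes by the total power.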
[0047] (3) CNN feature extraction based on time-frequency images
In this embodiment, CNN feature extraction based on time-frequency images comprises the following steps: an S-transform is applied to the preprocessed voltage sag waveform data to generate a time-frequency image; CNN feature extraction is then performed on the time-frequency image to obtain the depth space feature vector.
[0048] The CNN feature extraction branch aims to extract non-stationary features in the joint time-frequency domain of a signal. This embodiment employs the S-Transform—a time-frequency analysis method that outperforms the Short-Time Fourier Transform (STFT) and provides frequency-dependent resolution. The core improvement of this embodiment lies in applying the Convolutional Neural Network (CNN) not directly to the original voltage waveform image, but rather to the time-frequency image generated by the S-Transform, and using this CNN feature extraction branch as a key component of the multimodal feature fusion system.
[0049] Perform an S-transform on the signal $x(n)$ to obtain the time-frequency matrix $S[m, p]$ (18), where $m = 0, 1, \dots, N-1$ is the time index; $p = 0, 1, \dots, M$ is the frequency index; $f_p$ is the actual frequency value of the $p$-th frequency; and $f_s$ is the sampling frequency.
[0050] Calculate the modulus $|S[m, p]|$ of the S-transform to obtain the time-frequency spectrum. This spectrum is used as the time-frequency image, which clearly shows how the frequency components evolve over time during the sag.
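A time-frequency image of this kind can be sketched as follows. Because equation (18) is not reproduced in this text, the sketch uses the standard frequency-domain form of the discrete S-transform (Stockwell transform), which may differ in detail from the variant used in the embodiment. The choice of $\lfloor N/2 \rfloor$ as the highest frequency index and the function name `s_transform_image` are assumptions.

```python
import numpy as np

def s_transform_image(x):
    """Modulus of the discrete S-transform, shape (N // 2 + 1, N).

    Rows are frequency indices p = 0..N//2, columns are time indices m.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    spectrum = np.fft.fft(x)
    k = np.fft.fftfreq(n) * n            # symmetric integer frequency offsets
    m_max = n // 2
    s = np.zeros((m_max + 1, n), dtype=complex)
    s[0, :] = x.mean()                   # zero-frequency row is the signal mean
    for p in range(1, m_max + 1):
        # Gaussian window in the frequency domain; its width scales with p,
        # which gives the frequency-dependent resolution.
        window = np.exp(-2.0 * np.pi ** 2 * k ** 2 / p ** 2)
        # Shift the spectrum by p, apply the window, return to the time axis.
        s[p, :] = np.fft.ifft(np.roll(spectrum, -p) * window)
    return np.abs(s)
```

For a pure sinusoid located exactly on DFT bin 8, every column of the image peaks at row 8, which is a quick sanity check on the frequency axis.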
[0051] The CNN feature extraction branch takes the time-frequency image $I = |S[m, p]|$, the modulus of the S-transform, as input. Its goal is to automatically learn the distinctive spatial patterns and structural features of voltage sag events in the joint time-frequency domain, such as specific energy accumulation regions and oscillation pattern trajectories. In specific implementation, this embodiment employs a multi-layer convolutional neural network (CNN) architecture whose core operation is convolution, extracting local features by sliding the convolution kernel across the image. The $j$-th feature map $X_j^{l}$ in layer $l$ is calculated as:
$$X_j^{l} = \mathrm{ReLU}\Bigl(\sum_{i} X_i^{l-1} * W_{ij}^{l} + b_j^{l}\Bigr) \tag{19}$$
where $*$ represents the convolution operation, $W_{ij}^{l}$ is the convolution kernel weight matrix connecting the $i$-th feature map of layer $l-1$ to the $j$-th feature map of layer $l$, $b_j^{l}$ is the bias, and ReLU is the activation function.
[0052] Convolutional layers are typically followed by pooling layers (such as max pooling) to reduce data dimensionality and enhance the translation invariance of features. By stacking multiple "convolution-activation-pooling" modules, the network can gradually abstract high-level semantic features that are highly relevant to the classification from low-level general features (such as edges and textures).
[0053] The CNN parameters $\{W, b\}$ are optimized through end-to-end gradient backpropagation. During the forward pass, the time-frequency image is input into the CNN and transformed layer by layer through the convolutional layers (convolution kernels $W$ and biases $b$), activation functions, and pooling layers; the depth space feature vector is finally obtained through global average pooling. During training, the gradients of the loss with respect to $W$ and $b$ in each layer are computed backward using the chain rule from the error signal output by the entire network, and the parameters are iteratively updated with the Adam optimizer at a preset initial learning rate, gradually improving the expressive ability of the depth space feature vector.
[0054] Finally, the 3D feature tensor output by the last convolutional layer is compressed into a 1D feature vector through a global average pooling layer, which serves as the depth space feature vector $\mathbf{F}_{\mathrm{cnn}}$:
$$\mathbf{F}_{\mathrm{cnn}}(c) = \frac{1}{H W}\sum_{h=1}^{H}\sum_{w=1}^{W} X_c^{L}(h, w) \tag{20}$$
where $X_c^{L}$ is the $c$-th feature map of the last convolutional layer $L$, of spatial size $H \times W$. It is a compact and information-rich spatial pattern representation obtained after deep mining of the time-frequency image.
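The forward pass of the CNN branch (convolution, ReLU, pooling, then global average pooling) can be sketched in plain NumPy. The layer count, kernel sizes, and channel widths are not given in this part of the text, so the values used in the example are assumptions, as are all function names. As in most deep learning frameworks, "convolution" is implemented here as cross-correlation.

```python
import numpy as np

def conv2d_relu(x, w, b):
    """x: (C_in, H, W); w: (C_out, C_in, kH, kW); b: (C_out,). Valid padding."""
    c_out, _, kh, kw = w.shape
    h_out, w_out = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    out = np.zeros((c_out, h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            patch = x[:, i:i + kh, j:j + kw]
            # Sum over input channels and kernel positions for every output map.
            out[:, i, j] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2])) + b
    return np.maximum(out, 0.0)          # ReLU activation

def max_pool2x2(x):
    """2x2 max pooling with stride 2; odd trailing rows/columns are dropped."""
    c, h, w = x.shape
    h2, w2 = h // 2, w // 2
    x = x[:, :h2 * 2, :w2 * 2].reshape(c, h2, 2, w2, 2)
    return x.max(axis=(2, 4))

def global_average_pool(x):
    """Compress a (C, H, W) tensor into a length-C feature vector."""
    return x.mean(axis=(1, 2))

def he_normal(shape, rng):
    """He normal initialization for a (C_out, C_in, kH, kW) kernel."""
    fan_in = shape[1] * shape[2] * shape[3]
    return rng.standard_normal(shape) * np.sqrt(2.0 / fan_in)

def cnn_features(image, params):
    """Stack conv-ReLU-pool modules, then apply global average pooling."""
    x = image[np.newaxis]                # add the channel axis: (1, H, W)
    for w, b in params:
        x = max_pool2x2(conv2d_relu(x, w, b))
    return global_average_pool(x)
```

The length of the returned vector equals the number of feature maps in the last convolutional layer, independent of the image size, which is what makes global average pooling convenient for fusing with the other modalities.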
[0055] In practice, training employs a mini-batch gradient descent strategy with a batch size of 32 and a maximum of 100 training epochs. The learning rate is scheduled using the ReduceLROnPlateau mechanism, which automatically reduces the learning rate when the validation error plateaus. The convolution kernel weights are initialized using a He normal distribution, and the biases are initialized to zero. To prevent overfitting and enhance generalization, an early stopping mechanism (patience of 15 epochs) is introduced during training, and a Dropout operation (dropout rate set to 0.3 to 0.5) is applied after the fully connected layer.
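The interaction between plateau-based learning-rate reduction and early stopping can be sketched in plain Python. Only the 100-epoch cap and the early-stopping patience of 15 come from the text; the initial learning rate, the reduction factor, and the scheduler patience below are assumed values, and `PlateauController` is a hypothetical helper rather than a framework class.

```python
class PlateauController:
    """Tracks validation loss, reduces the learning rate on plateaus,
    and signals when training should stop."""

    def __init__(self, lr=1e-3, factor=0.5, lr_patience=5,
                 stop_patience=15, max_epochs=100):
        self.lr = lr                        # assumed initial learning rate
        self.factor = factor                # assumed reduction factor
        self.lr_patience = lr_patience      # assumed scheduler patience
        self.stop_patience = stop_patience  # early-stopping patience (from the text)
        self.max_epochs = max_epochs        # epoch cap (from the text)
        self.best = float("inf")
        self.stale = 0                      # epochs since the last improvement
        self.epoch = 0

    def step(self, val_loss):
        """Record one epoch; return True if training should continue."""
        self.epoch += 1
        if val_loss < self.best:
            self.best = val_loss
            self.stale = 0
        else:
            self.stale += 1
            if self.stale % self.lr_patience == 0:
                self.lr *= self.factor      # plateau detected: lower the rate
        return self.stale < self.stop_patience and self.epoch < self.max_epochs
```

In a framework such as PyTorch, the same behaviour would normally come from the built-in `ReduceLROnPlateau` scheduler plus a small early-stopping check in the training loop.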
[0056] (4) LSTM feature extraction
In this embodiment, LSTM feature extraction is performed on the preprocessed voltage sag waveform data to obtain a depth time-series feature vector.
[0057] The LSTM feature extraction branch takes the preprocessed original waveform sequence x(n) as input and aims to capture the long-term dynamic evolution and dependencies of the voltage sag signal over time, thereby modeling the temporal dependencies of the original waveform. The core improvement of this branch lies in not using LSTM as an independent classifier or a simple pre-feature extractor, but rather positioning it as a parallel branch within a multimodal feature fusion system dedicated to modeling the temporal dependencies of the original sequence, and deeply fusing it with CNN and time-frequency domain features through attention weighting.
[0058] The parameters of the Long Short-Term Memory (LSTM) network, namely the weight matrices and bias vectors of the input gate $i_t$, forget gate $f_t$, output gate $o_t$, and candidate cell state $\tilde{c}_t$, are optimized using backpropagation through time. In the forward pass, the preprocessed voltage sequence $x(n)$ is fed into the LSTM unit one time step at a time, and the cell state $c_t$ and hidden state $h_t$ are updated according to the gating mechanism. After the sequence has been processed, the hidden state at the final time step is taken as the depth time-series feature vector. During training, based on the overall error signal output by the entire network, the gradient is backpropagated along the time dimension to compute the partial derivatives of the loss function with respect to each LSTM parameter, and an adaptive moment estimation (Adam) optimizer iteratively updates the parameters, so that the LSTM progressively learns the long- and short-term temporal dependencies in the voltage sag signal that are relevant to classification. Each time step $t$ of the LSTM corresponds directly to one discrete sampling point $x_t$ of the voltage signal. The calculation at each time step $t$ is:
$$
\begin{aligned}
f_t &= \sigma\bigl(W_f\,[h_{t-1}, x_t] + b_f\bigr) \\
i_t &= \sigma\bigl(W_i\,[h_{t-1}, x_t] + b_i\bigr) \\
\tilde{c}_t &= \tanh\bigl(W_c\,[h_{t-1}, x_t] + b_c\bigr) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
o_t &= \sigma\bigl(W_o\,[h_{t-1}, x_t] + b_o\bigr) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
\tag{21}
$$
where $\sigma(\cdot)$ is the Sigmoid function, $\odot$ represents element-wise multiplication, and $\tanh$ represents the hyperbolic tangent activation function. The forget gate receives the same input as the input gate, namely the voltage sample value $x_t$ at the current time step and the previous hidden state $h_{t-1}$. The Sigmoid function outputs a forgetting vector whose elements lie between 0 and 1. This vector acts as a "memory decay coefficient" applied directly to the previous cell state $c_{t-1}$ through element-wise multiplication, precisely controlling the proportion of historical information retained in the current cell state $c_t$.
[0059] LSTM training employs a mini-batch learning strategy unrolled over time, with a batch size of 32 and a number of time steps equal to the number of sampling points $N$ of the voltage sequence. The network parameters are orthogonally initialized to mitigate gradient issues in the early stages of training. To prevent the vanishing or exploding gradients common in deep temporal networks, gradient clipping is introduced during training, with the gradient norm threshold set to 5.0. Additionally, a dropout mechanism (typically set to 0.2-0.3) is optionally applied between LSTM layers to enhance the model's generalization ability by randomly dropping neuron connections. The learning rate is scheduled in line with the CNN branch and dynamically adjusted using the ReduceLROnPlateau strategy, so that the LSTM parameters converge in step with the other parts of the network.
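Gradient clipping by norm can be illustrated with a short NumPy sketch. It assumes the threshold of 5.0 applies to the global norm taken over all parameter gradients together, which is the common convention but is not stated explicitly; `clip_by_global_norm` is a hypothetical name.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # Gradients already inside the threshold are left untouched (scale == 1).
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length is limited.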
[0060] After processing the entire sequence of length $N$, the hidden state at the last time step is taken as a summary representation of the entire waveform sequence, namely the depth time-series feature vector $\mathbf{F}_{\mathrm{lstm}}$:
$$\mathbf{F}_{\mathrm{lstm}} = h_N \tag{22}$$
The depth time-series feature vector contains dynamic information about the entire process of voltage sag occurrence, development, and recovery.
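The LSTM forward pass and the extraction of the final hidden state can be sketched in NumPy using the standard gate equations. The hidden size, the single-layer structure, and the helper names are assumptions; the orthogonal initialization follows the text and is done here with a QR decomposition.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_lstm_params(hidden_size, rng):
    """Orthogonal weights and zero biases for the f, i, o gates and candidate c."""
    params = {}
    for gate in "fioc":
        # QR of a (hidden + 1, hidden) matrix gives orthonormal columns;
        # transposing yields a (hidden, hidden + 1) matrix with orthonormal rows.
        q, _ = np.linalg.qr(rng.standard_normal((hidden_size + 1, hidden_size)))
        params["W_" + gate] = q.T
        params["b_" + gate] = np.zeros(hidden_size)
    return params

def lstm_last_hidden(x, params, hidden_size):
    """Run a single-layer LSTM over a 1-D sequence; return the final hidden state."""
    h = np.zeros(hidden_size)
    c = np.zeros(hidden_size)
    for x_t in x:                                    # one time step per sample
        v = np.concatenate([h, [x_t]])               # [h_{t-1}, x_t]
        f = sigmoid(params["W_f"] @ v + params["b_f"])        # forget gate
        i = sigmoid(params["W_i"] @ v + params["b_i"])        # input gate
        o = sigmoid(params["W_o"] @ v + params["b_o"])        # output gate
        c_tilde = np.tanh(params["W_c"] @ v + params["b_c"])  # candidate state
        c = f * c + i * c_tilde                      # decay old memory, add new
        h = o * np.tanh(c)
    return h
```

The returned vector has `hidden_size` entries, each strictly between -1 and 1, regardless of the sequence length.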
[0061] Step C13: Standardize the multimodal feature extraction results to obtain standard multimodal feature extraction results.
[0062] Specifically, the time-frequency domain feature vector, the depth space feature vector, and the depth-time series feature vector are standardized respectively to obtain the corresponding standard time-frequency domain feature vector, standard depth space feature vector, and standard depth-time series feature vector.
[0063] In this embodiment, Z-score normalization is used to standardize both the time-frequency domain feature vector and the depth time-series feature vector, so that each feature dimension has a mean of 0 and a standard deviation of 1, yielding the standard time-frequency domain feature vector and the standard depth time-series feature vector, respectively. It should be noted that, since each feature in the depth space feature vector output by CNN feature extraction already lies between 0 and 1, feature standardization is not required; that is, the depth space feature vector can be used directly as the standard depth space feature vector.
[0064] Therefore, the standard multimodal feature extraction results of the voltage sag waveform data include the feature vectors of three modalities, namely the standard time-frequency domain feature vector, the standard depth space feature vector, and the standard depth time-series feature vector.
[0065] In this embodiment, the results of multimodal feature extraction are based on the time-frequency domain, depth space, and deep temporal dimensions, which preserve and characterize the essential features of the temporary landing event to the greatest extent possible, thereby constructing a comprehensive and rich multimodal feature set, laying the foundation for subsequent analysis.
[0066] Step C2: Perform feature fusion and enhancement based on a dual attention mechanism on the standard multimodal feature extraction results to obtain the corresponding feature enhancement representation vector.
[0067] First, it should be noted that the dual attention mechanism proposed in this embodiment represents a significant improvement over existing technologies in terms of architecture design and functional implementation. This embodiment innovatively constructs a dual attention mechanism based on a two-layer weighted fusion framework of "modal attention - feature attention": Modal attention is used at the macro level to dynamically evaluate and weight the contributions of three feature modalities—temporal (LSTM), spatial (CNN), and time-frequency domain—through competitive Softmax normalization, achieving cross-modal adaptive source selection; Feature attention, at the micro level, borrows ideas from SENet to squeeze, excite, and recalibrate the fused features, achieving adaptive enhancement and suppression of feature dimensions—thus highlighting the key information most relevant to classification, suppressing redundant features, and thereby enhancing the voltage sag source classifier's ability to perceive and express important features. This mechanism improves the discriminativeness and robustness of multimodal information fusion through hierarchical attention.
[0068] Step C21: Map the feature vectors of each modality in the standard multimodal feature extraction results to the same feature dimension space, and concatenate the feature vectors of all the mapped modalities to obtain the initial fused feature vector.
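[0069] The initial fused feature vector $F_{cat}$ is represented as: $F_{cat} = [F'_{tf};\ F'_{cnn};\ F'_{lstm}]$ (23) where $F'_{tf}$, $F'_{cnn}$, and $F'_{lstm}$ denote the standard time-frequency domain feature vector, the standard depth space feature vector, and the standard depth temporal feature vector, respectively, after being mapped to the same feature dimension space, and $[\,\cdot\,;\,\cdot\,]$ denotes concatenation.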
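The mapping-and-concatenation step can be sketched as follows. The text does not state how the mapping is implemented, so this sketch assumes one linear (fully connected) projection per modality; the common dimension `d`, the input sizes, and the random weights are all hypothetical.

```python
import random


def linear(W, b, x):
    """Affine map W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def init_linear(out_dim, in_dim, rng):
    W = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]
    return W, [0.0] * out_dim


rng = random.Random(0)
d = 4  # assumed common feature dimension

# Standardized modality vectors of different lengths (toy values).
features = {
    "tf": [0.3, -1.2, 0.8],
    "cnn": [0.1, 0.9, 0.4, 0.6, 0.2],
    "lstm": [0.5, -0.7],
}

# One projection per modality into the shared d-dimensional space.
layers = {name: init_linear(d, len(vec), rng) for name, vec in features.items()}
mapped = {name: linear(*layers[name], vec) for name, vec in features.items()}

# Initial fused feature vector: concatenation of the mapped modalities.
f_cat = mapped["tf"] + mapped["cnn"] + mapped["lstm"]
```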
[0070] Step C22: Perform modal attention fusion on the initial fused feature vector using a modal attention network to obtain a weighted fused feature vector.
[0071] The initial fused feature vector is input into the modal attention network to obtain the attention score of the feature vector of each modality. The attention score of the feature vector of each modality is normalized to obtain the attention weight of the feature vector of each modality. The attention weight of the feature vector of each modality is used to perform weighted fusion of the feature vector of the corresponding modality mapped to the same feature dimension space to obtain the weighted fused feature vector.
[0072] Modal attention fusion is performed using a modal attention network to automatically evaluate the importance of different modal features (time-frequency domain, image, time series) for the current specific classification task, and assign higher weights to more important modalities. The specific implementation process is described below.
[0073] The initial fused feature vector $F_{cat}$ is input into the modal attention network, which calculates the attention score for the feature vector of each modality. Specifically, the modal attention network is implemented through a nonlinear transformation (such as a single-layer neural network).
[0074] The attention score vector $e$ of all modal feature vectors output by the modal attention network is represented as: $e = W_{a2}\,\phi(W_{a1} F_{cat} + b_{a1}) + b_{a2}$ (24) where $W_{a1}$, $b_{a1}$, $W_{a2}$, and $b_{a2}$ are all learnable parameters of the modal attention network and $\phi(\cdot)$ is a nonlinear activation function. $e = [e_{tf}, e_{cnn}, e_{lstm}]$ is a score vector containing three elements, where $e_{tf}$, $e_{cnn}$, and $e_{lstm}$ are the attention scores corresponding to the standard time-frequency domain feature vector, the standard depth space feature vector, and the standard depth temporal feature vector, respectively.
[0075] Then, the attention scores of the feature vectors of each modality can be normalized into a probability distribution using the Softmax function to obtain the attention weights of the feature vectors of each modality.
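[0076] The attention weights $\alpha_{tf}$, $\alpha_{cnn}$, and $\alpha_{lstm}$ of the standard time-frequency domain feature vector, the standard depth space feature vector, and the standard depth temporal feature vector are computed from their attention scores $e_{tf}$, $e_{cnn}$, and $e_{lstm}$ as follows: $\alpha_{tf} = \frac{\exp(e_{tf})}{\exp(e_{tf}) + \exp(e_{cnn}) + \exp(e_{lstm})}$ (25) $\alpha_{cnn} = \frac{\exp(e_{cnn})}{\exp(e_{tf}) + \exp(e_{cnn}) + \exp(e_{lstm})}$ (26) $\alpha_{lstm} = \frac{\exp(e_{lstm})}{\exp(e_{tf}) + \exp(e_{cnn}) + \exp(e_{lstm})}$ (27) and they satisfy $\alpha_{tf} + \alpha_{cnn} + \alpha_{lstm} = 1$.
[0077] The weighted fused feature vector $F_w$ is represented as: $F_w = \alpha_{tf} F'_{tf} + \alpha_{cnn} F'_{cnn} + \alpha_{lstm} F'_{lstm}$ (28) where $F'_{tf}$, $F'_{cnn}$, and $F'_{lstm}$ are the modality feature vectors mapped to the same feature dimension space. This step allows the model to dynamically focus on the most relevant information sources. For example, if the sag signature of the current signal is especially distinctive in the time-frequency image, the weight $\alpha_{cnn}$ will be significantly greater than the other weights, and the features of the CNN modality will dominate the decision-making process.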
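The Softmax normalization and weighted fusion described above can be sketched as follows. The sketch assumes the fusion is a weighted sum of the mapped modality vectors and takes the three attention scores as given (in the method they come from the modal attention network); all numeric values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable Softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def modal_attention_fuse(mapped, scores):
    """Weight each mapped modality vector by its Softmax attention weight and sum."""
    alphas = softmax(scores)
    d = len(mapped[0])
    fused = [sum(a * vec[i] for a, vec in zip(alphas, mapped)) for i in range(d)]
    return alphas, fused


# Mapped vectors in the order: time-frequency, CNN (image), LSTM (time series).
mapped = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.0, 2.0, 0.0]  # hypothetical attention scores; the CNN modality scores highest
alphas, fused = modal_attention_fuse(mapped, scores)
```

Because the weights sum to 1, raising one modality's score necessarily lowers the share of the others, which is the competitive behaviour the text attributes to Softmax. The weights themselves can be logged per sample as a modality-level explanation.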
[0079] The modal attention fusion mechanism proposed in this embodiment has three significant advantages: (1) The weight score is derived from the initial fusion feature vector through a nonlinear transformation layer, avoiding decision bias caused by relying solely on single modal information; (2) Competitive weight normalization based on Softmax is introduced, so that the three modalities of time-frequency domain, image and time series form a dynamic competitive relationship in the forward inference process of each sample, thereby adaptively allocating attention according to signal characteristics; (3) The obtained weight value is directly associated with the feature modality with clear physical meaning, which not only realizes the adaptive fusion of multi-source information, but also provides a traceable decision basis for classification conclusions.
[0080] Step C23: Input the weighted fused feature vector into the feature attention enhancement network to obtain the corresponding feature enhancement representation vector.
[0081] After obtaining the weighted fused feature vector generated by modal attention fusion, this embodiment introduces an additional feature attention enhancement mechanism. The basic idea behind this mechanism is that even in the weighted fused feature vector, the contributions of each feature dimension to the final classification task still differ. Some dimensions are highly discriminative core features, while others may be redundant or even noisy features. The goal of the feature attention mechanism is to mimic the human brain's attention allocation method, performing "refined processing" on the feature vector, automatically learning and highlighting the feature dimensions crucial to the current classification task, while weakening unimportant or highly interfering dimensions. Its structure originates from the classic Squeeze-and-Excitation Network idea and adopts an implementation method adapted to fully connected layers. This process mainly includes three steps: squeezing, excitation, and recalibration. It should be noted that this embodiment improves the design and implementation of the traditional attention mechanism in the feature attention enhancement stage, taking into account the structural characteristics of the voltage sag multimodal fusion features. Current research largely adopts the channel attention mechanism from the image domain, failing to fully consider the physical meaning and coupling relationships of feature dimensions in power time-series signals. This embodiment, based on the Squeeze-and-Excitation concept, innovatively applies it to the vector representation after the fusion of time-series, spatial, and statistical features. Through a three-step process of "global information compression - adaptive weight learning - feature recalibration," it achieves precise enhancement of key discriminative feature dimensions. 
This method not only improves the model's classification sensitivity under complex sag modes but also, through the interpretability of the weight distribution, provides maintenance personnel with feature-level evidence of "why it is classified this way," further enhancing the reliability and practicality of the method in engineering practice.
[0082] Specifically, in this embodiment, the weighted fused feature vector is input into the feature attention enhancement network, which then sequentially performs global information embedding representation, adaptive weight learning, and feature enhancement on the weighted fused feature vector to obtain the corresponding feature enhancement representation vector. The specific implementation method is explained below.
[0083] 1) Squeeze: global information embedding representation. First, the weighted fused feature vector is embedded as global information through the Global Average Pooling (GAP) operation, generating a descriptor with a global receptive field for each feature dimension of the weighted fused feature vector, thus obtaining the global descriptor corresponding to the weighted fused feature vector.
[0084] In the specific implementation process, GAP compresses the weighted fused feature vector $F_w$ into a scalar statistic $z$. This scalar captures the average response strength of the feature dimensions across the entire vector. The global descriptor $z$ is represented as: $z = \frac{1}{D}\sum_{i=1}^{D} F_w(i)$ (29) where $D$ represents the feature dimension of the weighted fused feature vector $F_w$, and $F_w(i)$ represents the element of the $i$-th feature dimension of $F_w$.
[0085] 2) Excitation: adaptive weight learning. Adaptive weight learning is performed on the global descriptor corresponding to the weighted fused feature vector to obtain the feature attention weight vector.
[0086] Once the global descriptor is obtained, the importance weights of each feature dimension in the weighted fused feature vector can be learned based on this descriptor. Adaptive weight learning is achieved through a gating mechanism, specifically by using two fully connected layers (FC) and a nonlinear activation function to learn the complex nonlinear interaction relationships between channels in a parameterized manner.
[0087] The first fully connected layer serves to reduce dimensionality and perform nonlinear mapping, typically using the ReLU activation function.
[0088] The output vector $s_1$ of the first fully connected layer is represented as: $s_1 = \delta(W_1 z + b_1)$ (30) where $z$ is the global descriptor; $W_1$ is the weight matrix of the first fully connected layer, whose output dimension is $D/r$, with $D$ the feature dimension of the weighted fused feature vector; $\delta(\cdot)$ is the ReLU activation function; $r$ is a reduction ratio (usually 4, 8, or 16) used to compress the number of channels to reduce computation and improve generalization; and $b_1$ is the bias vector of the first fully connected layer.
[0089] The second fully connected layer restores the dimensionality, and the output weight values are normalized to the [0,1] interval using the Sigmoid function to obtain the feature attention weight vector $s$, which is represented as: $s = \sigma(W_2 s_1 + b_2)$ (31) where $s_1$ is the output vector of the first fully connected layer; $W_2$ is the weight matrix of the second fully connected layer, whose output dimension is $D$; $\sigma(\cdot)$ is the Sigmoid function; and $b_2$ is the bias vector of the second fully connected layer. Each element $s_i$ of $s$ is a scalar between 0 and 1 representing the importance of the $i$-th feature dimension.
[0090] 3) Recalibration: feature enhancement. The weighted fused feature vector and the feature attention weight vector are combined by element-wise multiplication to obtain the corresponding feature enhancement representation vector.
[0091] Specifically, the learned feature attention weight vector $s$ is multiplied element-wise with the weighted fused feature vector $F_w$ to generate the refined, attention-enhanced feature enhancement representation vector $\hat{F}$: $\hat{F} = s \odot F_w$ (32) Here, ⊙ represents element-wise multiplication, i.e., $\hat{F}(i) = s_i F_w(i)$. This operation amplifies the feature dimensions that are critical to the current classification task while reducing the contribution of useless or redundant dimensions, thereby generating a more discriminative feature-enhanced representation vector.
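The three steps (squeeze, excitation, recalibration) can be sketched end to end as follows. The sketch follows the text's description of the squeeze as a single scalar average, which is an interpretation of the garbled source formula; the dimension D = 4, the reduction ratio r = 2, and the hand-picked weights are hypothetical and not trained values.

```python
import math


def linear(W, b, x):
    """Affine map W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def feature_attention(f_w, W1, b1, W2, b2):
    """Squeeze-excitation style recalibration of a fused feature vector."""
    z = [sum(f_w) / len(f_w)]                           # squeeze: global average
    s1 = [max(0.0, v) for v in linear(W1, b1, z)]       # FC 1 + ReLU, D/r units
    s = [1.0 / (1.0 + math.exp(-v)) for v in linear(W2, b2, s1)]  # FC 2 + Sigmoid, D units
    enhanced = [si * fi for si, fi in zip(s, f_w)]      # recalibration: element-wise product
    return s, enhanced


f_w = [2.0, -1.0, 0.5, 2.5]            # D = 4, average = 1.0
W1, b1 = [[1.0], [-1.0]], [0.0, 0.0]   # maps the scalar descriptor to D/r = 2 units
W2 = [[2.0, 0.0], [-2.0, 0.0], [0.0, 0.0], [1.0, 1.0]]
b2 = [0.0, 0.0, 0.0, 0.0]
s, enhanced = feature_attention(f_w, W1, b1, W2, b2)
```

With these toy weights the first dimension receives a weight near 0.88 and the second a weight near 0.12, so the first is largely preserved and the second is suppressed; inspecting `s` is what gives the feature-level evidence the text refers to.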
[0092] In this embodiment, after the feature-enhanced representation vectors of all voltage sag waveform sample data have been obtained in the above manner, a voltage sag training set is constructed from these vectors and their corresponding voltage sag source types. The voltage sag source classifier is then trained on this training set, using the feature-enhanced representation vector of each voltage sag waveform sample as input and the corresponding voltage sag source type as the label, to obtain a successfully trained voltage sag source classifier.
[0093] In practical implementation, voltage sag source classifiers can be implemented using deep neural network classification models. Leveraging the high discriminative power of feature-enhanced representation vectors, deep neural network classification models can generate high-precision, highly interpretable voltage sag source classifiers.
[0094] In constructing and training the deep neural network classification model, this embodiment employs a specialized design to address the nonlinear decision boundaries of high-dimensional fused features in voltage sag source classification tasks. Compared to conventional approaches using shallow classification networks or fixed-structure fully connected layers, this embodiment designs a depth-configurable, strongly regularized multilayer perceptron classifier. This deep neural network classification model takes the refined features (i.e., the feature-enhanced representation vectors) output by the front-end dual attention mechanism as input. It performs deep feature transformation and information abstraction through multiple fully connected layers with ReLU activation and dropout, and applies L2 regularization to all weights to improve the model's generalization performance with limited samples. The classifier and the aforementioned feature extraction and fusion modules are trained end-to-end in an integrated manner. Global parameter co-optimization is achieved through an adaptive optimization algorithm and dynamic learning rate scheduling, keeping the feature learning and classification decision objectives consistent. This enhances the system's adaptability and stability under complex power conditions while maintaining high accuracy.
[0095] For example, in order to fully explore the deep nonlinear classification boundaries of these features, this embodiment uses a multi-layer deep neural network (DNN) as the final classifier.
[0096] This DNN classifier consists of an input layer, multiple hidden layers, and a final Softmax output layer. Its forward propagation process is as follows: a. Input layer: receives the feature-enhanced representation vector $\hat{F}$ from the attention mechanism, i.e., $h^{(0)} = \hat{F}$.
[0097] b. Hidden Layers: Consisting of L fully connected layers. Each layer contains a non-linear activation function and an optional dropout operation to prevent overfitting.
[0098] The computation of the $l$-th hidden layer can be represented as: $h^{(l)} = f(W^{(l)} h^{(l-1)} + b^{(l)})$ (33) where $h^{(l-1)}$ is the input to the layer (the output of the previous layer, with $h^{(0)}$ being the feature-enhanced representation vector); $W^{(l)}$ and $b^{(l)}$ represent the weight matrix and bias vector of the $l$-th layer; and $f(\cdot)$ is a non-linear activation function. This embodiment recommends using ReLU (Rectified Linear Unit) or its variants (such as LeakyReLU) because it effectively alleviates the vanishing gradient problem and accelerates training. Its expression is: $\mathrm{ReLU}(x) = \max(0, x)$ (34) Dropout: during training, the outputs of each hidden layer can be randomly set to zero with a certain probability (e.g., 0.5). This is an effective regularization technique that enhances the model's generalization ability by preventing co-adaptation of feature detectors.
[0099] c. Output layer: the output $h^{(L)}$ of the last hidden layer is passed to the output layer. The output layer is a fully connected layer whose number of neurons equals the number of target categories, $O$.
[0100] This layer first calculates the logits (scores): $z = W^{(out)} h^{(L)} + b^{(out)}$ (35) where $W^{(out)}$ and $b^{(out)}$ are the weight matrix and bias vector of the output layer. Then, the Softmax function transforms the logits $z$ into a normalized probability distribution $p$, representing the predicted probability that the input sample belongs to each category: $p_o = \frac{\exp(z_o)}{\sum_{j=1}^{O} \exp(z_j)}$ (36) where $p_o$ represents the predicted probability that the sample belongs to the $o$-th voltage sag source type.
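The forward pass through layers a to c, followed by selection of the most probable class, can be sketched as follows. The layer sizes and weights are toy values, and the dropout shown is the common "inverted" variant (surviving activations are rescaled during training), which is an implementation assumption that the text does not specify.

```python
import math
import random


def dense(W, b, x):
    """Affine map W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(v):
    return [max(0.0, x) for x in v]


def dropout(h, p, rng):
    """Inverted dropout: zero each unit with probability p, rescale the survivors."""
    keep = 1.0 - p
    return [x / keep if rng.random() < keep else 0.0 for x in h]


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(f_hat, hidden, out, p_drop=0.0, rng=None):
    """Hidden layers (ReLU, optional dropout), then logits, Softmax, and arg max."""
    h = list(f_hat)
    for W, b in hidden:
        h = relu(dense(W, b, h))
        if rng is not None and p_drop > 0.0:  # dropout only when training
            h = dropout(h, p_drop, rng)
    W_out, b_out = out
    probs = softmax(dense(W_out, b_out, h))
    pred = max(range(len(probs)), key=probs.__getitem__)
    return probs, pred


# One hidden layer with 3 units and 3 output classes (toy weights).
hidden = [([[1.0, -1.0], [0.5, 0.5], [-2.0, 1.0]], [0.0, 0.1, 0.0])]
out = ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0, 0.0])
f_hat = [1.0, 2.0]

probs, pred = forward(f_hat, hidden, out)                  # inference: no dropout
train_probs, _ = forward(f_hat, hidden, out, p_drop=0.5,   # training: dropout active
                         rng=random.Random(0))
```

At inference time dropout is disabled, so the same input always yields the same probability vector; `pred` is the index of the predicted voltage sag source type.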
[0101] Final decision: the classification result $\hat{y}$ is the index of the voltage sag source type corresponding to the maximum probability value in the probability vector: $\hat{y} = \arg\max_{o} p_o$ (37) The trained voltage sag source classifier can then be used to classify and identify unknown sag events, thereby achieving accurate identification and tracing of different sag sources.
In summary, this embodiment addresses the technical challenges of existing voltage sag analysis methods, including limited feature extraction dimensions, insufficient ability to distinguish complex sag modes, and low levels of automation and intelligence. First, the raw voltage sag data is preprocessed, and time-domain, frequency-domain, and time-series trajectory features are extracted from the voltage waveform to construct a multimodal feature set. Next, a dual attention mechanism adaptively weights and fuses the time-series and channel features, enhancing the expression of key features. Subsequently, a classification model is built on the fused features and trained using a deep neural network to obtain a high-precision and highly interpretable voltage sag source classifier. Finally, the trained model is used to classify and identify the test set, achieving accurate identification and tracing of different sag sources. By combining multimodal feature fusion with a dual attention mechanism, this invention enhances feature representation capabilities and significantly improves the accuracy, robustness, and interpretability of voltage sag source classification, and it is applicable to the analysis and mitigation of sag events in complex power system environments.
[0102] Those skilled in the art will understand that all or part of the processes of the methods described in the above embodiments can be implemented by a computer program instructing related hardware, and the program can be stored in a computer-readable storage medium. The computer-readable storage medium may be a disk, optical disk, read-only memory, or random access memory, etc.
[0103] The above description is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any changes or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in the present invention should be included within the scope of protection of the present invention.
Claims
1. A voltage sag source classification method based on multimodal feature extraction and dual attention mechanism, characterized in that, the method includes: constructing a voltage sag training set based on the feature-enhanced representation vectors of all voltage sag waveform sample data and their corresponding voltage sag source types, and training a voltage sag source classifier using the voltage sag training set to obtain a successfully trained voltage sag source classifier; inputting the feature-enhanced representation vector of real-time voltage sag waveform data into the trained voltage sag source classifier, which predicts the voltage sag source type corresponding to the real-time voltage sag waveform data; wherein the voltage sag waveform sample data and the real-time voltage sag waveform data are both voltage sag waveform data, and the corresponding feature-enhanced representation vector is obtained by performing multimodal feature extraction, and feature fusion and enhancement based on a dual attention mechanism, on the voltage sag waveform data.
2. The voltage sag source classification method based on multimodal feature extraction and dual attention mechanism according to claim 1, characterized in that, performing multimodal feature extraction, and feature fusion and enhancement based on a dual attention mechanism, on the voltage sag waveform data includes: sequentially performing preprocessing, multimodal feature extraction, and feature standardization on the voltage sag waveform data to obtain the corresponding standard multimodal feature extraction results; and performing feature fusion and enhancement based on a dual attention mechanism on the standard multimodal feature extraction results to obtain the corresponding feature enhancement representation vector.
3. The voltage sag source classification method based on multimodal feature extraction and dual attention mechanism according to claim 2, characterized in that, performing multimodal feature extraction on the preprocessed voltage sag waveform data includes: performing time-frequency domain feature extraction, CNN feature extraction based on a time-frequency image, and LSTM feature extraction on the preprocessed voltage sag waveform data to obtain the corresponding time-frequency domain feature vector, depth space feature vector, and depth time-series feature vector; and using the time-frequency domain feature vector, the depth space feature vector, and the depth time-series feature vector as the results of multimodal feature extraction.
4. The voltage sag source classification method based on multimodal feature extraction and dual attention mechanism according to claim 3, characterized in that, The multimodal feature extraction results are standardized by performing the following steps: The time-frequency domain feature vector, the depth space feature vector, and the depth-temporal feature vector are respectively standardized to obtain the corresponding standard time-frequency domain feature vector, standard depth space feature vector, and standard depth-temporal feature vector. The standard time-frequency domain feature vector, standard depth space feature vector, and standard depth time-series feature vector are used as the feature vectors of the three modes in the standard multimodal feature extraction results.
5. The voltage sag source classification method based on multimodal feature extraction and dual attention mechanism according to claim 4, characterized in that, The standard multimodal feature extraction results are subjected to feature fusion and enhancement based on a dual attention mechanism, which is performed as follows: The feature vectors of each modality in the standard multimodal feature extraction results are mapped to the same feature dimension space, and the feature vectors of all the mapped modalities are concatenated to obtain the initial fused feature vector; The initial fused feature vector is fused using a modal attention network to obtain a weighted fused feature vector; The weighted fused feature vector is input into the feature attention enhancement network to obtain the corresponding feature enhancement representation vector.
6. The voltage sag source classification method based on multimodal feature extraction and dual attention mechanism according to claim 5, characterized in that, The weighted fusion feature vector is obtained by performing the following operations: The initial fused feature vector is input into the modal attention network to obtain the attention score of the feature vector of each modality; The attention scores of the feature vectors of each modality are normalized to obtain the attention weights of the feature vectors of each modality. By using the attention weights of the feature vectors of each modality, the feature vectors of the corresponding modalities mapped to the same feature dimension space are weighted and fused to obtain a weighted fused feature vector.
7. The voltage sag source classification method based on multimodal feature extraction and dual attention mechanism according to claim 6, characterized in that, The corresponding feature-enhanced representation vector is obtained by performing the following operations: The weighted fused feature vector is input into the feature attention enhancement network, which then performs global information embedding, adaptive weight learning, and feature enhancement on the weighted fused feature vector in sequence to obtain the corresponding feature enhancement representation vector.
8. The voltage sag source classification method based on multimodal feature extraction and dual attention mechanism according to claim 3, characterized in that, performing time-frequency domain feature extraction on the preprocessed voltage sag waveform data includes: extracting time-domain features and frequency-domain features from the preprocessed voltage sag waveform data; and concatenating the time-domain and frequency-domain feature extraction results to form the corresponding time-frequency domain feature vector.
9. The voltage sag source classification method based on multimodal feature extraction and dual attention mechanism according to any one of claims 1-8, characterized in that, training the voltage sag source classifier using the voltage sag training set includes: using the feature enhancement representation vector of each voltage sag waveform sample data as input and the corresponding voltage sag source type as the label, training the voltage sag source classifier to obtain a successfully trained voltage sag source classifier.
10. The voltage sag source classification method based on multimodal feature extraction and dual attention mechanism according to claim 9, characterized in that, The voltage sag source classifier is implemented using a deep neural network classification model.