Decoupled fusion-based dual-domain self-supervised EEG signal representation learning method

By employing dual-domain embedding and dynamic feature decoupling, combined with curriculum-based masking and multi-objective loss functions, the problem of incomplete representation in EEG self-supervised learning is solved, achieving comprehensive feature capture and structural integrity of EEG signals, and improving the performance and generalization ability of downstream tasks.

CN122272047A · Pending · Publication Date: 2026-06-26 · SHENYANG AEROSPACE UNIVERSITY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHENYANG AEROSPACE UNIVERSITY
Filing Date
2026-03-20
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing EEG self-supervised learning models cannot fully capture the multimodal characteristics of signals, cannot decouple the complex interaction between transient time events and potential spectral rhythms, and lack a target function that enforces structural isomorphism during training, resulting in incomplete representations and insufficient generalization ability.

Method used

We employ a dual-domain self-supervised EEG signal representation learning method based on decoupling and fusion. Through dual-domain embedding, dynamic feature decoupling, hierarchical spatiotemporal fusion coding, curriculum-based structured masking, and multi-objective composite loss function, we learn a structurally complete and generalizable universal representation from large-scale unlabeled EEG data.

Benefits of technology

It significantly improves performance and generalization ability in various downstream tasks such as emotion recognition, motor imagery classification, and abnormal EEG detection, and achieves comprehensive capture of the spatiotemporal features and structural integrity of signals.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a dual-domain self-supervised EEG signal representation learning method based on decoupled fusion, belonging to the fields of artificial intelligence and biosignal processing technology. The method includes: acquiring and preprocessing EEG samples, segmenting them into patches; simultaneously extracting and fusing temporal and frequency domain features for each patch; decoupling the features into a temporal-centered view and a spatial-centered view through learnable gating; adding positional encoding and then performing a curriculum-based structured masking strategy; inputting the masked feature stream into a DeFuse encoder, extracting deep spatiotemporal representations through stacked decoupled fusion layers, and predicting the mask content; training the network with a multi-objective loss function including temporal reconstruction loss, spectral fidelity loss, and spatial covariance loss as the optimization objective. This invention can learn structurally complete and generalizable universal representations from unlabeled EEG data, and its performance on downstream tasks such as emotion recognition, motor imagery classification, and anomaly detection is significantly superior to existing methods.

Description

Technical Field

[0001] This invention belongs to the fields of artificial intelligence and biological signal processing technology, specifically relating to a dual-domain self-supervised EEG signal representation learning method based on decoupling fusion.

Background Technology

[0002] Electroencephalography (EEG) is the cornerstone of modern neuroscience and brain-computer interfaces (BCI), providing a non-invasive window into the complex dynamics of the brain's electrical activity. EEG signals are inherently multi-layered, encoding rich information across temporal, spatial, and spectral dimensions, reflecting the dynamic characteristics of brain function and cognition. The pursuit of decoding this information has driven the evolution of computational models, from traditional machine learning methods to task-specific deep neural networks. These advanced models have achieved significant success in various BCI applications, including motor imagery classification, emotion recognition, automated sleep staging, and epilepsy detection.

[0003] Despite these advances, supervised learning models typically rely on large amounts of precisely labeled data, limiting their generalization capabilities to specific tasks. In recent years, the field has undergone a profound paradigm shift towards self-supervised learning (SSL) to build versatile foundational models. These models aim to learn general EEG representations from large unlabeled datasets, significantly improving performance and generalization across a wide range of downstream tasks. Among these, masked autoencoders are a leading paradigm, drawing on their landmark successes in natural language processing and computer vision.

[0004] However, despite their promising prospects, current foundational EEG models often produce structurally incomplete representations. This deficiency stems from several key limitations, typically inherited from methods adapted from other domains, and poses a significant challenge to generalization across datasets:

[0005] 1. Representational Limitations: Existing models often fail to fully capture the multimodal characteristics of signals, where spatial, temporal, and spectral features are deeply intertwined. Processing EEG from a single perspective, such as general time-series grids or pure spectral inputs, makes it difficult to decouple the complex interactions between transient time events and underlying spectral rhythms.

[0006] 2. Oversimplified pre-training tasks: The learning process is often based on simple masking strategies. Standard random masks cannot reflect the structured characteristics of real-world neural events or physiological artifacts, limiting the model's ability to learn robust nonlocal dependencies.

[0007] 3. Coarse learning objective: The training process is constrained by an objective function that fails to enforce structural isomorphism. Traditional mean squared error (MSE) loss assesses pointwise accuracy but ignores higher-order structural characteristics. It does not penalize reconstructions that distort the signal's spectral composition or its cross-channel functional connectivity patterns.

[0008] Therefore, designing a self-supervised learning method that can comprehensively capture the temporal, spectral, and spatial features of EEG signals while maintaining the integrity of the signal's neurophysiological structure has become a key technical problem that needs to be solved.

Summary of the Invention

[0009] Therefore, the purpose of this invention is to provide a dual-domain self-supervised EEG signal representation learning method based on decoupling fusion. Through dual-domain embedding, dynamic feature decoupling, hierarchical spatiotemporal fusion coding, curriculum-based structured masking, and multi-objective composite loss function, it learns structurally complete and generalizable universal representations from large-scale unlabeled EEG data.

[0010] The technical solution provided by this invention is: a dual-domain self-supervised EEG signal representation learning method based on decoupling and fusion, comprising:

[0011] Acquire EEG sample data and preprocess it;

[0012] The preprocessed EEG samples are divided into non-overlapping patches of fixed size along the time axis to obtain a patch sequence.

[0013] For each patch, both time-domain and frequency-domain features are extracted simultaneously, and the extracted time-domain and frequency-domain features are fused to obtain the fused embedding vector for each patch.

[0014] Based on the fused embedding vector, the features are dynamically decoupled into a time-centered view and a spatial-centered view through a learnable gating mechanism;

[0015] Add position encoding to the time-centered view, the spatial-centered view, and the global feature stream respectively to obtain three position-aware feature streams;

[0016] A curriculum-based structured masking strategy is adopted to mask some patch embedding vectors in the three position-aware feature streams, and the patch embedding vectors at the masked positions are replaced with learnable mask tokens.

[0017] The three feature streams after masking are input into the DeFuse encoder; the DeFuse encoder performs initial parallel feature decoupling enhancement, view fusion and first joint context relationship calculation, view reshaping and second joint context relationship calculation on the input feature streams through stacked decoupling fusion layers, outputs a deep representation rich in spatiotemporal context information, and predicts the original patch content at the masked position;

[0018] Using a multi-objective loss function as the optimization objective, the error between the predicted patch content and the original patch is calculated. The network parameters of the DeFuse encoder are trained through backpropagation, thereby learning a general EEG signal representation.

[0019] Preferably, the temporal feature extraction is performed by extracting waveform features through a 2D convolutional layer:

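[0020] Each EEG patch is processed through a series of 2D convolutional layers to learn local temporal patterns and structural features directly from the original signal waveform. This process can be summarized as a function $f_{\text{time}}$:

[0021] $e^{t}_{c,n} = f_{\text{time}}(x_{c,n})$

[0022] where $x_{c,n}$ is the patch at channel c and sequence position n, and $e^{t}_{c,n} \in \mathbb{R}^{D}$ is a time-domain embedding vector of dimension D;

[0023] The frequency-domain feature extraction applies an FFT to each EEG patch and divides the resulting amplitude spectrum into 6 frequency bands. Each frequency band is then processed by a dedicated MLP, and the features of all frequency bands are ultimately fused into a single structured spectral representation. This process is formalized as follows:

[0024] $A_{c,n} = \left| \mathrm{FFT}(x_{c,n}) \right|$

[0025] $h^{(b)}_{c,n} = \mathrm{MLP}_b\!\left(A^{(b)}_{c,n}\right), \quad b = 1, \dots, 6$

[0026] $e^{f}_{c,n} = \mathrm{Fuse}\!\left(h^{(1)}_{c,n}, \dots, h^{(6)}_{c,n}\right)$

[0027] where $A_{c,n}$ is the amplitude spectrum, $A^{(b)}_{c,n}$ is its portion within the b-th frequency band, $h^{(b)}_{c,n}$ is the feature of the b-th frequency band, and $e^{f}_{c,n} \in \mathbb{R}^{D}$ is the final frequency-domain embedding vector;

[0028] The final patch embedding is obtained by summing the outputs of the time-domain and frequency-domain branches element-wise:

[0029] $e_{c,n} = e^{t}_{c,n} + e^{f}_{c,n}$.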

[0030] Preferably, based on the fused embedding vector, the features are dynamically decoupled into a temporally centered view and a spatially centered view through a learnable gating mechanism:

[0031] The embedded tensor $E \in \mathbb{R}^{C \times N \times D}$ is processed by a dynamic feature separator, which uses a learnable gating network to dynamically decouple the features into a time-centered view ($X_t$) and a space-centered view ($X_s$). The gating mechanism generates specific weights for each sample based on the global average of the embeddings across the spatial (C) and temporal (N) dimensions:

[0032] $(g_t, g_s) = \mathrm{Gate}\!\left(\frac{1}{CN} \sum_{c=1}^{C} \sum_{n=1}^{N} E_{c,n}\right)$

[0033] $X_t = W_t\,(g_t \odot E)$

[0034] $X_s = W_s\,(g_s \odot E)$

[0035] where $g_t$ and $g_s$ are the learned temporal and spatial gates, $\odot$ denotes element-wise multiplication with broadcasting, and $W_t$ and $W_s$ are linear transformations that project the features onto their respective D/2-dimensional views.

[0036] Preferably, positional encoding is added to the temporal center view, spatial center view, and global feature stream respectively:

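[0037] Temporal positional encoding $P_t$, added to the temporal view: $X_t \leftarrow X_t + P_t$;

[0038] Channel positional encoding $P_s$, added to the spatial view: $X_s \leftarrow X_s + P_s$;

[0039] Global positional encoding $P_g$, added to a compressed representation $X_g$ of the full feature set: $X_g \leftarrow X_g + P_g$.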

[0040] Preferably, the DeFuse encoder consists of L identically stacked DeFuse encoder layers, each accepting three input streams: a time-centered view ($X_t$), a space-centered view ($X_s$), and a global feature stream ($X_g$);

[0041] The initial parallel feature decoupling enhancement: the temporal and spatial views are first enhanced independently by a context relationship calculation (attention) module with standard pre-layer normalization (Pre-LN) to capture intra-domain dependencies. The temporal view $X_t$ is processed as:

[0042] $X_t' = X_t + \mathrm{Attn}_t(\mathrm{LN}(X_t))$

[0043] The spatial view $X_s$ undergoes the same transformation with an independent set of weights:

[0044] $X_s' = X_s + \mathrm{Attn}_s(\mathrm{LN}(X_s))$

[0045] The view fusion and first joint context relationship calculation:

[0046] The spatial features $X_s'$ are reshaped to match the temporal features $X_t'$, and the two are concatenated to form a joint representation $Z$. This representation then undergoes a joint attention step along the time dimension:

[0047] $\tilde{X}_s = \mathrm{Reshape}(X_s')$

[0048] $Z = \mathrm{Concat}(X_t', \tilde{X}_s)$

[0049] $Z' = Z + \mathrm{Attn}_{\text{time}}(\mathrm{LN}(Z))$

[0050] The view reshaping and second joint context relationship calculation: the output $Z'$ is reshaped into a time-batch view $\bar{Z}$, which shifts the focus of the second joint context relationship calculation to capturing the dependencies between channels, which are now rich in integrated temporal information:

[0051] $\bar{Z} = \mathrm{Reshape}(Z')$

[0052] $H = \bar{Z} + \mathrm{Attn}_{\text{chan}}(\mathrm{LN}(\bar{Z}))$

[0053] Feature defusion and output: the deeply fused representation $H$ is split and projected back into separate temporal and spatial views for use by the next layer, while a parallel global path processes the global feature stream $X_g$.

[0054] Preferably, the curriculum-based structured masking strategy includes: 40% tube masking, 30% channel masking, and 30% random masking.

[0055] The proportion of masked patches starts from a low value and gradually increases during training according to a cosine schedule.

[0056] For the current training round k, the mask ratio $r_k$ is:

[0057] $r_k = r_{\min} + \frac{1}{2}\,(r_{\max} - r_{\min})\left(1 - \cos\frac{\pi k}{K}\right)$

[0058] where K is the total number of training rounds, and $r_{\min}$ and $r_{\max}$ are the initial and final mask ratios.

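[0059] Preferably, the multi-objective loss function comprises:

[0060] Temporal reconstruction loss $\mathcal{L}_{\text{time}}$: the standard MSE between the predicted patches $\hat{x}_{c,n}$ and the original patches $x_{c,n}$ at the masked positions, which ensures waveform-level accuracy:

[0061] $\mathcal{L}_{\text{time}} = \frac{1}{|\mathcal{M}|} \sum_{(c,n) \in \mathcal{M}} \left\| \hat{x}_{c,n} - x_{c,n} \right\|_2^2$

[0062] Spectral fidelity loss $\mathcal{L}_{\text{spec}}$:

[0063] The L1 loss between patch amplitude spectra enforces perceptual similarity in the frequency domain:

[0064] $\mathcal{L}_{\text{spec}} = \frac{1}{|\mathcal{M}|} \sum_{(c,n) \in \mathcal{M}} \left\| \, |\mathrm{FFT}(\hat{x}_{c,n})| - |\mathrm{FFT}(x_{c,n})| \, \right\|_1$

[0065] Spatial structure preservation loss $\mathcal{L}_{\text{cov}}$:

[0066] The loss penalizes the difference between the channel covariance matrices of the reconstructed and original signals, measured with the Frobenius norm:

[0067] $\mathcal{L}_{\text{cov}} = \left\| \mathrm{Cov}(\hat{X}) - \mathrm{Cov}(X) \right\|_F$

[0068] Final training objective: a weighted sum of the three components:

[0069] $\mathcal{L} = \lambda_{\text{time}} \mathcal{L}_{\text{time}} + \lambda_{\text{spec}} \mathcal{L}_{\text{spec}} + \lambda_{\text{cov}} \mathcal{L}_{\text{cov}}$

[0070] where $\mathcal{M}$ is the set of masked positions, $\hat{X}$ and $X$ are the reconstructed and original multi-channel signals, and $\lambda_{\text{time}}$, $\lambda_{\text{spec}}$, and $\lambda_{\text{cov}}$ are hyperparameters that balance the contribution of each loss term.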

[0071] Furthermore, this invention also provides a dual-domain self-supervised EEG signal representation learning system based on decoupling and fusion, comprising: a data preprocessing module, a dual-domain embedding module, a dynamic decoupling module, a multi-view positional encoding module, a curriculum-based structured masking module, a DeFuse encoder module, and a multi-objective loss calculation module; each module executes the corresponding step of the method.

[0072] This invention provides a decoupled fusion-based dual-domain self-supervised EEG signal representation learning method. It simultaneously extracts features from EEG patches in both the time and frequency domains, fusing the transient morphological information of the original waveform with spectral power information segmented according to clinically relevant frequency bands (Delta, Theta, Alpha, Beta, Gamma, and High Gamma) into a unified patch representation. Based on this, a dynamic gating mechanism decouples the features into a time-centered view and a space-centered view, which are then fused hierarchically, so that complex spatiotemporal interactions are learned from multiple perspectives. In addition, the curriculum-based structured masking strategy, which combines tube masks, channel masks, and random masks with an easy-to-hard learning schedule, forces the model to learn robust nonlocal dependencies. The composite objective function composed of temporal reconstruction loss, spectral fidelity loss, and spatial covariance loss ensures that the learned representation is faithful to the original signal in terms of temporal accuracy, frequency composition, and spatial connectivity. The method can thus learn a structurally complete and dynamically isomorphic general representation from large-scale EEG data without manual annotation, which significantly improves performance and generalization ability on various downstream tasks such as emotion recognition, motor imagery classification, and abnormal EEG detection.

Attached Figure Description

[0073] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with the invention and, together with the description, serve to explain the principles of the invention.

[0074] Figure 1 is a general framework diagram of the method of the present invention.

Detailed Implementation

[0075] The present invention will be further explained below with reference to specific implementation schemes, but this explanation does not limit the scope of the invention.

[0076] Electroencephalography (EEG) signals are obtained by recording, with multiple electrodes placed on the scalp, the potential changes caused by the synchronized firing of groups of neurons in the brain. The raw data is a multi-channel time series, typically expressed as $X \in \mathbb{R}^{C \times T}$, where C is the number of electrode channels (e.g., 19 channels or more according to the 10-20 international system) and T is the number of time sampling points. Different brain-computer interface (BCI) applications use different methods to label EEG signals. For example, in emotion recognition tasks, after a subject watches material evoking specific emotions, the collected EEG segments are labeled with the corresponding emotion category (e.g., the FACED dataset contains 9 discrete emotion labels, and the SEED-V dataset contains 5 emotion labels); in motor imagery tasks, when a subject imagines performing a specific limb movement, the corresponding EEG segments are labeled with the movement category (e.g., the PhysioNet-MI dataset contains 4 labels: left fist, right fist, both fists, and both feet); in abnormal EEG detection tasks, clinical experts perform binary labeling of EEG recordings as normal / abnormal (e.g., the TUAB dataset), or more granular multi-class labeling of abnormal types (e.g., the TUEV dataset contains 6 event type labels).

[0077] Automatic analysis and feature extraction of EEG signals directly affect the recognition accuracy of various BCI tasks. However, current EEG representation methods based on self-supervised learning suffer from the following problems. First, existing models often process EEG signals from a single perspective, such as extracting features only from the time domain or only from the frequency domain. This fails to fully capture the multimodal characteristics of the signal, in which spatial, temporal, and spectral features are deeply intertwined, and makes it difficult to decouple the complex interaction between transient events and underlying spectral rhythms, resulting in structurally incomplete representations. Second, mainstream masked autoencoder pre-training methods typically employ simple random masking strategies, which cannot reflect the structured characteristics of real-world neural events or physiological artifacts, limiting the model's ability to learn robust nonlocal dependencies. Third, the mean squared error loss function commonly used during training only evaluates pointwise accuracy, ignoring higher-order structural characteristics such as signal spectral composition and cross-channel functional connectivity patterns, and therefore cannot force the reconstructed signal to maintain structural isomorphism with the original signal.

[0078] To address the aforementioned issues, this invention provides a dual-domain self-supervised EEG signal representation learning method based on decoupled fusion (DeFuseMod), whose overall architecture is shown in Figure 1. This method comprises three core innovative stages: (1) a dual-domain embedding module that extracts rich features from EEG patches; (2) a DeFuse context relationship calculation encoder that is specifically designed to model the unique dependencies in EEG data; and (3) a composite reconstruction objective that guides the learning process from multiple complementary perspectives.

[0079] A dual-domain self-supervised EEG signal representation learning method based on decoupled fusion includes the following steps:

[0080] This invention follows the masked autoencoder (MAE) paradigm. An acquired EEG sample is represented as $X \in \mathbb{R}^{C \times T}$, where C is the number of electrode channels and T is the total number of timestamps. The input sample is segmented into non-overlapping patches of a fixed window size, forming a tensor of shape (C, N, W), where N is the sequence length (number of patches) and W is the patch size. Subsequently, some patches are replaced with learnable mask tokens for the reconstruction task. The masking process is represented as:

[0081] $\tilde{x}_{c,n} = (1 - m_{c,n})\, x_{c,n} + m_{c,n}\, x_{\text{mask}}$

[0082] where $x_{c,n} \in \mathbb{R}^{W}$ is the original patch at channel c and sequence position n, $m_{c,n} \in \{0, 1\}$ is the corresponding mask indicator (1 for a masked position), and $x_{\text{mask}}$ is a learnable mask token.
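The mask-token replacement described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `apply_mask_tokens`, the 50% mask rate, and the all-zero token (a stand-in for a parameter that would be learned during training) are all hypothetical.

```python
import numpy as np

def apply_mask_tokens(patches, mask, mask_token):
    """Replace masked patches with a shared mask token.

    patches:    (C, N, W) array of original patches
    mask:       (C, N) array of 0/1 indicators, 1 = masked
    mask_token: (W,) vector standing in for the learnable token
    """
    m = mask[..., None].astype(patches.dtype)   # (C, N, 1), broadcast over W
    return (1.0 - m) * patches + m * mask_token

rng = np.random.default_rng(0)
C, N, W = 19, 30, 200                           # sizes taken from Example 1
patches = rng.standard_normal((C, N, W))
mask = (rng.random((C, N)) < 0.5).astype(np.int64)   # hypothetical 50% random mask
mask_token = np.zeros(W)                        # assumption: would be learned in practice
masked = apply_mask_tokens(patches, mask, mask_token)
print(masked.shape)
```

Unmasked positions pass through unchanged, so the reconstruction loss can later be restricted to the positions where the indicator is 1.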

[0083] To comprehensively extract features from each EEG block slice, this invention designs a block embedding module that operates simultaneously in the time and frequency domains. EEG analysis is inherently multifaceted: the time domain captures the morphology of transient neural events, while the frequency domain reveals the state of underlying neural oscillations.

[0084] Temporal Branch: Each EEG patch is first processed through a series of 2D convolutional layers. This branch is designed to learn local temporal patterns and structural features, such as the characteristic shape of event-related potentials (ERPs), directly from the original signal waveform. This process can be summarized as a function $f_{\text{time}}$:

[0085] $e^{t}_{c,n} = f_{\text{time}}(x_{c,n})$

[0086] where $x_{c,n}$ is the patch at channel c and sequence position n, and $e^{t}_{c,n} \in \mathbb{R}^{D}$ is a time-domain embedding vector of dimension D.

[0087] Frequency-Domain Branch: To explicitly inject spectral knowledge into the model, this invention employs a patch-based multi-band processor module. This module first applies a Fast Fourier Transform (FFT) to each patch. Unlike methods that treat the entire spectrum as a single vector, this method divides the amplitude spectrum into six clinically relevant bands with clear neurophysiological significance: Delta (0.5-4Hz), Theta (4-8Hz), Alpha (8-13Hz), Beta (13-30Hz), Gamma (30-50Hz), and High Gamma (50-75Hz). The spectral information of each band is processed by a dedicated multilayer perceptron (MLP), and the features of all bands are ultimately fused into a single structured spectral representation. This process is formalized as follows:

[0088] $A_{c,n} = \left| \mathrm{FFT}(x_{c,n}) \right|$

[0089] $h^{(b)}_{c,n} = \mathrm{MLP}_b\!\left(A^{(b)}_{c,n}\right), \quad b = 1, \dots, 6$

[0090] $e^{f}_{c,n} = \mathrm{Fuse}\!\left(h^{(1)}_{c,n}, \dots, h^{(6)}_{c,n}\right)$

[0091] where $A_{c,n}$ is the amplitude spectrum, $A^{(b)}_{c,n}$ is its portion within the b-th frequency band, $h^{(b)}_{c,n}$ is the feature of the b-th frequency band, and $e^{f}_{c,n} \in \mathbb{R}^{D}$ is the final frequency-domain embedding vector.

[0092] The final patch embedding is the element-wise sum of the time-domain and frequency-domain embeddings:

[0093] $e_{c,n} = e^{t}_{c,n} + e^{f}_{c,n}$
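To make the band-splitting step of the frequency-domain branch concrete, the sketch below computes the amplitude spectrum of one patch and slices it into the six bands listed above. It assumes a 1-second patch sampled at 200 Hz (the values used in Example 1), and half-open band edges so that no bin is counted twice. The helper name `band_amplitudes` is hypothetical, and the per-band MLPs and the fusion step are deliberately left out because the text does not specify their form.

```python
import numpy as np

FS = 200  # sampling rate in Hz (value from Example 1)
BANDS = {  # band edges in Hz, as listed in the description
    "delta": (0.5, 4.0), "theta": (4.0, 8.0), "alpha": (8.0, 13.0),
    "beta": (13.0, 30.0), "gamma": (30.0, 50.0), "high_gamma": (50.0, 75.0),
}

def band_amplitudes(patch, fs=FS):
    """Split the FFT amplitude spectrum of one patch into the six bands."""
    amp = np.abs(np.fft.rfft(patch))                      # amplitude spectrum
    freqs = np.fft.rfftfreq(patch.shape[-1], d=1.0 / fs)  # bin centre frequencies
    # Assumption: half-open intervals [lo, hi) so each bin lands in one band.
    return {name: amp[(freqs >= lo) & (freqs < hi)]
            for name, (lo, hi) in BANDS.items()}

t = np.arange(FS) / FS
patch = np.sin(2 * np.pi * 10.0 * t)     # synthetic 10 Hz oscillation
bands = band_amplitudes(patch)
strongest = max(bands, key=lambda k: bands[k].sum())
print(strongest)                          # alpha
```

With 200 samples at 200 Hz the frequency resolution is 1 Hz, so the bands receive 3, 4, 5, 17, 20, and 25 bins respectively; each band vector would then be fed to its own MLP.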

[0094] The time-frequency dual-domain embedding mechanism of this invention simultaneously captures the original waveform morphology and the spectral power distribution, providing a more comprehensive input representation than single-domain methods. Ablation experiments show that, compared with using only time-domain or only frequency-domain embedding, dual-domain embedding improves balanced accuracy by 1.2% and 1.9%, respectively, on the PhysioNet-MI dataset, and by 2.0% and 2.4%, respectively, on the FACED dataset.

[0095] After the patch embedding stage, the embedding tensor $E \in \mathbb{R}^{C \times N \times D}$ is first processed by the Dynamic Feature Separator module. This module uses a learnable gating network to dynamically decouple the features into a time-centered view ($X_t$) and a space-centered view ($X_s$). The gating mechanism generates specific weights for each sample based on the global average of the embeddings across the spatial (C) and temporal (N) dimensions:

[0096] $(g_t, g_s) = \mathrm{Gate}\!\left(\frac{1}{CN} \sum_{c=1}^{C} \sum_{n=1}^{N} E_{c,n}\right)$

[0097] $X_t = W_t\,(g_t \odot E)$

[0098] $X_s = W_s\,(g_s \odot E)$

[0099] where $g_t$ and $g_s$ are the learned temporal and spatial gates, $\odot$ denotes element-wise multiplication with broadcasting, and $W_t$ and $W_s$ are linear transformations that project the features onto their respective D/2-dimensional views.
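A minimal NumPy sketch of such a gated separator is given below. Several details are assumptions rather than facts from the text: the gate is modelled as a single linear map followed by a sigmoid, each view has dimension D/2, and all weights are random placeholders for parameters that would be learned. The function name `dynamic_feature_separator` is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dynamic_feature_separator(E, Wg_t, Wg_s, Wp_t, Wp_s):
    """E: (C, N, D) patch embeddings. Returns two (C, N, D // 2) views and the gates."""
    pooled = E.mean(axis=(0, 1))        # (D,) global average over channels and patches
    g_t = sigmoid(Wg_t @ pooled)        # (D,) temporal gate (assumed sigmoid activation)
    g_s = sigmoid(Wg_s @ pooled)        # (D,) spatial gate
    X_t = (g_t * E) @ Wp_t              # broadcast gating, then linear projection
    X_s = (g_s * E) @ Wp_s
    return X_t, X_s, g_t, g_s

rng = np.random.default_rng(0)
C, N, D = 19, 30, 8                     # small D chosen only for illustration
E = rng.standard_normal((C, N, D))
Wg_t, Wg_s = rng.standard_normal((2, D, D))        # placeholder gate weights
Wp_t, Wp_s = rng.standard_normal((2, D, D // 2))   # placeholder projection weights
X_t, X_s, g_t, g_s = dynamic_feature_separator(E, Wg_t, Wg_s, Wp_t, Wp_s)
print(X_t.shape, X_s.shape)             # (19, 30, 4) (19, 30, 4)
```

Because the gates depend on the pooled embedding of the sample itself, two different EEG samples receive different temporal and spatial weightings from the same parameters.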

[0100] Subsequently, three independent learnable absolute positional encodings are introduced to provide the necessary contextual information:

[0101] Temporal positional encoding $P_t$, added to the temporal view: $X_t \leftarrow X_t + P_t$;

[0102] Channel positional encoding $P_s$, added to the spatial view: $X_s \leftarrow X_s + P_s$;

[0103] Global positional encoding $P_g$, added to a compressed representation $X_g$ of the full feature set: $X_g \leftarrow X_g + P_g$.

[0104] A DeFuse encoder for learning EEG signal representations is constructed; a composite masking strategy is adopted to mask some patches in the patch tensor, and the original patches at the mask positions are replaced with learnable mask tokens to obtain the masked patch tensor.

[0105] The DeFuse encoder is a novel decoupled-fusion architecture, built on context relationship calculation, for modeling the complex interactions between the spatial (inter-channel) and temporal (sequence) dimensions of EEG signals. Unlike traditional parallel processing methods, the DeFuse encoder achieves deep integration of spatiotemporal information through a multi-stage process involving parallel processing, view fusion, successive joint attention, and feature decoupling-fusion.

[0106] The encoder consists of L identically stacked DeFuse encoder layers. Each layer accepts three input streams: a time-centered view ($X_t$), a space-centered view ($X_s$), and a global feature stream ($X_g$).

[0107] Initial Parallel Feature Decoupling Enhancement: The temporal and spatial views are first enhanced independently by a context relationship calculation (attention) module with standard pre-layer normalization (Pre-LN) to capture intra-domain dependencies. The temporal view $X_t$ is processed as:

[0108] $X_t' = X_t + \mathrm{Attn}_t(\mathrm{LN}(X_t))$

[0109] The spatial view $X_s$ undergoes the same transformation with an independent set of weights:

[0110] $X_s' = X_s + \mathrm{Attn}_s(\mathrm{LN}(X_s))$

[0111] View fusion and first joint context relationship calculation: the spatial features $X_s'$ are reshaped to match the temporal features $X_t'$, and the two are concatenated to form a joint representation $Z$. This representation then undergoes a joint attention step along the time dimension:

[0112] $\tilde{X}_s = \mathrm{Reshape}(X_s')$

[0113] $Z = \mathrm{Concat}(X_t', \tilde{X}_s)$

[0114] $Z' = Z + \mathrm{Attn}_{\text{time}}(\mathrm{LN}(Z))$

[0115] View reshaping and second joint context relationship calculation: The key innovation of this architecture lies in the subsequent step, where the output $Z'$ is reshaped into a time-batch view $\bar{Z}$. This crucial operation shifts the focus of the second joint context relationship calculation to capturing dependencies between channels, which are now rich in integrated temporal information:

[0116] $\bar{Z} = \mathrm{Reshape}(Z')$

[0117] $H = \bar{Z} + \mathrm{Attn}_{\text{chan}}(\mathrm{LN}(\bar{Z}))$

[0118] The deeply fused representation $H$ is split and projected back into separate temporal and spatial views for use by the next layer, while a parallel global path processes the global feature stream $X_g$.
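The order of operations inside one encoder layer can be traced with the shape-level sketch below. It is only an illustration under strong assumptions: attention is a parameter-free single-head version, both views are laid out as (C, N, d), concatenation happens along the feature axis, the final projection back into two views is replaced by a simple split, and the global feature stream is omitted. None of these choices is confirmed by the text; the function names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(x):
    """Parameter-free single-head attention over axis 1 of a (batch, seq, d) array."""
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

def pre_ln_block(x):
    """Pre-LN residual block: x + Attn(LN(x))."""
    return x + self_attention(layer_norm(x))

def defuse_layer(X_t, X_s):
    """X_t, X_s: (C, N, d) temporal and spatial views. Returns the updated views."""
    d = X_t.shape[-1]
    # 1. Parallel enhancement: attention over time per channel, over channels per time step.
    X_t = pre_ln_block(X_t)                                  # batch = C, sequence = N
    X_s = pre_ln_block(X_s.transpose(1, 0, 2))               # batch = N, sequence = C
    # 2. Reshape the spatial view back to (C, N, d), concatenate, attend along time.
    Z = np.concatenate([X_t, X_s.transpose(1, 0, 2)], axis=-1)   # (C, N, 2d)
    Z = pre_ln_block(Z)
    # 3. Time-batch view (N, C, 2d): the second joint attention runs across channels.
    H = pre_ln_block(Z.transpose(1, 0, 2)).transpose(1, 0, 2)    # back to (C, N, 2d)
    # 4. Split into two views for the next layer (stand-in for learned projections).
    return H[..., :d], H[..., d:]

rng = np.random.default_rng(0)
X_t = rng.standard_normal((19, 30, 8))
X_s = rng.standard_normal((19, 30, 8))
out_t, out_s = defuse_layer(X_t, X_s)
print(out_t.shape, out_s.shape)   # (19, 30, 8) (19, 30, 8)
```

The point of step 3 is visible in the transposes: the same fused tensor is attended first with patches as the sequence axis and then with channels as the sequence axis, so channel-wise attention operates on features that already carry temporal context.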

[0119] The DeFuse encoder proposed in this invention, through dynamic feature separation and hierarchical fusion mechanisms, can learn complex spatiotemporal interactions from multiple perspectives while avoiding premature feature mixing. Ablation experiments confirm that removing the joint attention module leads to a 2.0% decrease in balanced accuracy on PhysioNet-MI and a 4.0% decrease on FACED, indicating that hierarchical fusion is necessary for modeling complex cross-domain dependencies.

[0120] Recognizing the structured nature of dependencies in EEG data, this invention employs a composite masking strategy combining three different methods:

[0121] Tube Masking (40%): Inspired by video representation learning, it performs consistent joint masking on all (or a fixed set of) EEG channels in consecutive time steps. This blocks the cross-referencing of spatial information at the same moment, forcing the model to rely entirely on the historical and future context sequences for temporal interpolation, thereby gaining a deep understanding of the global temporal dynamics of the signal.

[0122] Channel Masking (30%): Masks the entire channel for a given sample. This forces the model to reconstruct using the spatial correlation between adjacent channels.

[0123] Random Masking (30%): Randomly mask individual patches to build robust, generalized contextual understanding.
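The three masking strategies can be sketched as a single mask sampler. The text does not say whether the 40/30/30 split is applied per sample or as a share of the masked patches; the sketch below assumes one reading, in which one strategy is drawn per sample with those probabilities. The function name `sample_mask`, the rounding rules, and the choice of masking all channels in a tube are likewise assumptions.

```python
import numpy as np

def sample_mask(C, N, ratio, rng):
    """Return (strategy_name, mask), where mask is a (C, N) 0/1 array with 1 = masked."""
    # Assumption: one strategy per sample, chosen with probabilities 0.4 / 0.3 / 0.3.
    strategy = str(rng.choice(["tube", "channel", "random"], p=[0.4, 0.3, 0.3]))
    mask = np.zeros((C, N), dtype=np.int64)
    if strategy == "tube":
        span = max(1, round(ratio * N))              # number of consecutive time steps
        start = rng.integers(0, N - span + 1)
        mask[:, start:start + span] = 1              # same span on every channel
    elif strategy == "channel":
        k = max(1, round(ratio * C))                 # number of whole channels
        mask[rng.choice(C, size=k, replace=False), :] = 1
    else:
        k = max(1, round(ratio * C * N))             # number of individual patches
        mask.flat[rng.choice(C * N, size=k, replace=False)] = 1
    return strategy, mask

rng = np.random.default_rng(0)
name, mask = sample_mask(C=19, N=30, ratio=0.5, rng=rng)
print(name, mask.mean())
```

Tube masks remove every channel at the chosen time steps, so only temporal context remains; channel masks remove every time step of the chosen electrodes, so only neighbouring channels remain.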

[0124] The composite masking strategy of this invention (tube mask, channel mask, and random mask), combined with curriculum learning scheduling, forces the model to learn more robust nonlocal dependencies. Experiments show that tube masks contribute more to tasks requiring temporal continuity (such as motor imagery), while channel masks are more beneficial for tasks that depend on spatial patterns (such as emotion recognition). Combining the strategies produces the best performance on all tasks.

[0125] Furthermore, this invention schedules the mask ratio. The proportion of masked patches starts from a low value and gradually increases during training following a cosine schedule. This curriculum learning strategy helps stabilize the initial training and allows the model to tackle more challenging tasks as it becomes more proficient. For the current training round k, the mask ratio $r_k$ is:

[0126] $r_k = r_{\min} + \frac{1}{2}\,(r_{\max} - r_{\min})\left(1 - \cos\frac{\pi k}{K}\right)$

[0127] where K is the total number of training rounds, and $r_{\min}$ and $r_{\max}$ are the initial and final mask ratios.
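A cosine ramp of this kind can be written in a few lines. The sketch assumes the common half-cosine form that rises from a starting ratio to a final ratio over the K training rounds; the endpoint values 0.3 and 0.7 are hypothetical placeholders, since the text does not state them.

```python
import math

def mask_ratio(k, K, r_min=0.3, r_max=0.7):
    """Cosine ramp of the mask ratio from r_min at round 0 to r_max at round K.

    r_min and r_max are illustrative defaults, not values from the source.
    """
    progress = min(max(k / K, 0.0), 1.0)     # clamp to [0, 1]
    return r_min + 0.5 * (r_max - r_min) * (1.0 - math.cos(math.pi * progress))

schedule = [mask_ratio(k, 100) for k in range(0, 101, 25)]
print([round(r, 3) for r in schedule])
```

The ramp is flat near both ends and steepest in the middle, which keeps the task easy during the first rounds and avoids a sudden jump in difficulty at the end.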

[0128] The masked patch tensor is input into the DeFuse encoder to extract deep features and predict the original patch content at the masked position; the error between the predicted patch and the original patch is calculated using a multi-objective loss function that includes temporal reconstruction loss, spectral fidelity loss and spatial covariance loss as the optimization objective, and the network parameters of the DeFuse encoder are trained by backpropagation.

[0129] Simple mean square error (MSE) is insufficient for high-fidelity reconstruction of complex signals such as EEG, as it often leads to overly smooth and perceptually unconvincing results. Therefore, a composite loss function was designed to ensure that the reconstruction is accurate in the time domain, faithful in the frequency domain, and structurally sound in the spatial domain.

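[0130] Temporal reconstruction loss ($\mathcal{L}_{\text{time}}$): the standard MSE between the predicted patches ($\hat{x}_{c,n}$) and the original patches ($x_{c,n}$) at the masked positions, which ensures waveform-level accuracy:

[0131] $\mathcal{L}_{\text{time}} = \frac{1}{|\mathcal{M}|} \sum_{(c,n) \in \mathcal{M}} \left\| \hat{x}_{c,n} - x_{c,n} \right\|_2^2$

[0132] Spectral fidelity loss ($\mathcal{L}_{\text{spec}}$): the L1 loss between patch amplitude spectra, which enforces perceptual similarity in the frequency domain:

[0133] $\mathcal{L}_{\text{spec}} = \frac{1}{|\mathcal{M}|} \sum_{(c,n) \in \mathcal{M}} \left\| \, |\mathrm{FFT}(\hat{x}_{c,n})| - |\mathrm{FFT}(x_{c,n})| \, \right\|_1$

[0134] Spatial structure preservation loss ($\mathcal{L}_{\text{cov}}$): to maintain the integrity of cross-electrode spatial correlation and functional connectivity patterns, this invention penalizes the difference between the channel covariance matrices of the reconstructed and original signals. The difference is measured with the Frobenius norm:

[0135] $\mathcal{L}_{\text{cov}} = \left\| \mathrm{Cov}(\hat{X}) - \mathrm{Cov}(X) \right\|_F$

[0136] The final training objective is a weighted sum of these three components:

[0137] $\mathcal{L} = \lambda_{\text{time}} \mathcal{L}_{\text{time}} + \lambda_{\text{spec}} \mathcal{L}_{\text{spec}} + \lambda_{\text{cov}} \mathcal{L}_{\text{cov}}$

[0138] where $\mathcal{M}$ is the set of masked positions, $\hat{X}$ and $X$ are the reconstructed and original multi-channel signals, and $\lambda_{\text{time}}$, $\lambda_{\text{spec}}$, and $\lambda_{\text{cov}}$ are hyperparameters that balance the contribution of each loss term.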
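The three loss terms and their weighted sum can be sketched in NumPy as follows. This is an illustration under assumptions, not the patented implementation: the temporal and spectral terms are averaged over masked patches, the covariance is computed over the full multi-channel sample, and the weights default to 1.0 as placeholders because the text gives no values. The function name `composite_loss` is hypothetical.

```python
import numpy as np

def composite_loss(pred, target, mask, lam_time=1.0, lam_spec=1.0, lam_cov=1.0):
    """pred, target: (C, N, W) patch tensors; mask: (C, N) with 1 = masked position."""
    m = mask.astype(bool)
    p, t = pred[m], target[m]                              # (M, W) masked patches only
    # Temporal term: squared error per masked patch, averaged over masked patches.
    l_time = np.mean(np.sum((p - t) ** 2, axis=-1))
    # Spectral term: L1 distance between amplitude spectra of masked patches.
    amp_p = np.abs(np.fft.rfft(p, axis=-1))
    amp_t = np.abs(np.fft.rfft(t, axis=-1))
    l_spec = np.mean(np.sum(np.abs(amp_p - amp_t), axis=-1))
    # Spatial term: Frobenius norm of the channel covariance difference
    # (assumption: covariance taken over the whole sample, one row per channel).
    C = pred.shape[0]
    cov_p = np.cov(pred.reshape(C, -1))
    cov_t = np.cov(target.reshape(C, -1))
    l_cov = np.linalg.norm(cov_p - cov_t, ord="fro")
    total = lam_time * l_time + lam_spec * l_spec + lam_cov * l_cov
    return total, (l_time, l_spec, l_cov)

rng = np.random.default_rng(0)
target = rng.standard_normal((19, 30, 200))
pred = target + 0.1 * rng.standard_normal(target.shape)   # synthetic "reconstruction"
mask = (rng.random((19, 30)) < 0.5).astype(np.int64)
total, parts = composite_loss(pred, target, mask)
print(total > 0.0)
```

A reconstruction can score well on the temporal term while smearing the spectrum or breaking inter-channel correlations; the two extra terms make those failures visible to the optimizer.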

[0139] The composite loss function designed in this invention ensures that the learned representation is faithful to the original signal in terms of temporal accuracy, frequency composition, and spatial connectivity. Ablation experiments show that using only the temporal reconstruction loss (simulating the standard MAE) resulted in a severe performance degradation (PhysioNet-MI: -2.3%, FACED: -4.5%), and that removing either of the other two loss terms individually also reduced performance (PhysioNet-MI: -0.9%, FACED: -1.2% in one case; PhysioNet-MI: -1.1%, FACED: -1.8% in the other).

[0140] After training, any EEG signal sample is input into the trained DeFuse encoder, and the output is a general EEG signal feature representation applicable to downstream tasks.

[0141] This invention demonstrates significantly superior cross-task generalization capabilities compared to existing methods: It consistently achieves state-of-the-art performance across three classes of downstream tasks on six diverse BCI datasets.

[0142] Emotion recognition task:

[0143] FACED dataset (9-class classification): Balanced accuracy of 58.99%, a 3.9% improvement over the suboptimal CBraMod method.

[0144] SEED-V dataset (5-class classification): Balanced accuracy 40.89%, F1-W 41.83%

[0145] Motion imagery classification task:

[0146] PhysioNet-MI dataset (4-class classification): Balanced accuracy of 64.28%, an improvement of 0.11% over the suboptimal method.

[0147] SHU-MI dataset (2-class classification): Balanced accuracy 62.92%, AUROC 66.87%

[0148] Abnormal EEG detection task:

[0149] TUEV dataset (6-class classification): Balanced accuracy 64.00%, nearly 4% improvement over CBraMod.

[0150] TUAB dataset (2-class classification): Balanced accuracy 79.81%, a 4.2% improvement over the second-best method; AUROC 87.38%; AUC-PR 87.61%.

[0151] Self-supervised pre-training significantly improves the performance of downstream tasks:

[0152] Ablation experiments show that training from scratch (without pre-training) resulted in a 5.2% decrease in balanced accuracy on PhysioNet-MI (64.28%→59.05%) and a 13.0% decrease on FACED (58.99%→46.05%), demonstrating the ability of self-supervised pre-training to learn generalizable representations.

[0153] Example 1: Large-scale pre-training based on the TUEG dataset

[0154] Step 1: Data Preprocessing

[0155] The Temple University Hospital EEG Corpus (TUEG) was used as the pre-training corpus. TUEG is currently the largest publicly available EEG dataset, containing 69,652 records from 14,987 subjects, spanning 26,846 sessions, with a total duration of over 27,000 hours.

[0156] The preprocessing steps are as follows: discard records shorter than 5 minutes; remove the first and last minutes of each record to avoid low-quality fragments; select 19 universal EEG channels conforming to the 10-20 international system (Fp1, Fp2, F7, F3, Fz, F4, F8, T3, C3, Cz, C4, T4, T5, P3, Pz, P4, T6, O1, O2).

[0157] The signal is band-pass filtered at 0.3-75 Hz, and a 60 Hz notch filter is applied to eliminate power-line noise.

[0158] The signal is resampled to 200 Hz and divided into non-overlapping 30-second samples. "Bad" samples whose absolute amplitude exceeds 100 μV at any time point are removed. Amplitudes are normalized with 100 μV as the unit scale, so that the values lie mainly in the range [-1, 1].

[0159] After preprocessing, more than 1.1 million EEG samples (more than 9,000 hours of clean EEG data) were retained for pre-training.
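The trimming, windowing, rejection, and normalization steps described above can be sketched as follows. This is an illustrative sketch only: it assumes the record has already been channel-selected, filtered, and resampled to 200 Hz, and the function and constant names (`segment_and_clean`, `FS`, `WINDOW_S`, `LIMIT_UV`) are hypothetical.

```python
import numpy as np

FS = 200           # sampling rate after resampling (Hz)
WINDOW_S = 30      # sample length (s)
LIMIT_UV = 100.0   # rejection threshold and normalization scale (µV)

def segment_and_clean(record_uv):
    """record_uv: (channels, time) in µV -> (n_samples, channels, 6000)."""
    n_channels, n_points = record_uv.shape
    win = WINDOW_S * FS
    # Discard records shorter than 5 minutes.
    if n_points < 5 * 60 * FS:
        return np.empty((0, n_channels, win))
    # Remove the first and last minute of the record.
    trimmed = record_uv[:, 60 * FS : n_points - 60 * FS]
    # Cut into non-overlapping 30-second samples, dropping any remainder.
    n_win = trimmed.shape[1] // win
    samples = trimmed[:, : n_win * win].reshape(n_channels, n_win, win)
    samples = samples.transpose(1, 0, 2)
    # Reject samples whose absolute amplitude exceeds 100 µV anywhere.
    keep = np.abs(samples).max(axis=(1, 2)) <= LIMIT_UV
    # Normalize so that 100 µV maps to 1.
    return samples[keep] / LIMIT_UV
```

Because rejection happens before scaling, every retained sample lies within [-1, 1] in this sketch.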

[0160] Step 2: Slicing: Each channel of each 30-second EEG sample is sliced into patches with a duration of 1 second (200 data points), resulting in 19 × 30 = 570 patches per sample.
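As a small illustration of the slicing arithmetic, one sample of 19 channels × 6,000 points can be reshaped into a 19 × 30 grid of 200-point patches. The function name `to_patches` is hypothetical.

```python
import numpy as np

def to_patches(sample, patch_len=200):
    """sample: (channels, time) -> (channels, n_patches, patch_len)."""
    channels, n_points = sample.shape
    if n_points % patch_len != 0:
        raise ValueError("sample length must be a multiple of patch_len")
    return sample.reshape(channels, n_points // patch_len, patch_len)

sample = np.zeros((19, 30 * 200))   # one 30-second sample at 200 Hz
patches = to_patches(sample)
print(patches.shape)                # (19, 30, 200)
```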

[0161] Step 3: Dual-domain embedding: The following two branches are applied to each patch in parallel:

[0162] Temporal branch: Waveform features are extracted through 2D convolutional layers;

[0163] Frequency-domain branch: After the FFT, the spectrum is divided into 6 frequency bands. Each frequency band is processed by a dedicated MLP, and the band features are then fused. The outputs of the two branches are summed element-wise to obtain the final patch embedding.
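The two-branch embedding can be sketched in NumPy as below. Several details are assumptions, since the text does not specify them: the band boundaries in `BAND_EDGES`, the MLP width of 64, the ReLU activation, and fusing the band features by summation are all illustrative choices. The temporal branch is replaced here by a single linear projection as a stand-in for the 2D convolutional layers, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 200                                   # embedding size (hidden dimension)
BAND_EDGES = [0, 4, 8, 13, 30, 50, 101]   # hypothetical bin edges, 1 Hz per bin
BANDS = list(zip(BAND_EDGES[:-1], BAND_EDGES[1:]))

def init_linear(n_in, n_out):
    """Random weight matrix and zero bias for one linear layer."""
    return rng.standard_normal((n_in, n_out)) / np.sqrt(n_in), np.zeros(n_out)

# One dedicated two-layer MLP per frequency band.
band_mlps = [(init_linear(hi - lo, 64), init_linear(64, D)) for lo, hi in BANDS]
# Stand-in for the convolutional temporal branch (assumption).
time_proj = init_linear(200, D)

def embed(patches):
    """patches: (..., 200) -> patch embeddings (..., D)."""
    w_t, b_t = time_proj
    e_time = patches @ w_t + b_t
    # Amplitude spectrum of a 200-point patch at 200 Hz: 101 bins of 1 Hz.
    amp = np.abs(np.fft.rfft(patches, axis=-1))
    e_freq = np.zeros(patches.shape[:-1] + (D,))
    for (lo, hi), ((w1, b1), (w2, b2)) in zip(BANDS, band_mlps):
        hidden = np.maximum(amp[..., lo:hi] @ w1 + b1, 0.0)   # ReLU
        e_freq += hidden @ w2 + b2      # fuse bands by summation (assumption)
    # Element-wise sum of the two branches gives the patch embedding.
    return e_time + e_freq
```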

[0164] Step 4: Dynamic feature decoupling

[0165] Using a gating network, the patch embedding is dynamically decoupled into a time-centered view and a space-centered view; temporal, channel, and global position encodings are then added to the time-centered view, the space-centered view, and the global feature stream, respectively.
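One plausible reading of the gated decoupling is sketched below. The gating equations are not reproduced in this text, so the sigmoid gate, the single linear layer per gate, and the D/2 output width per view are assumptions, and all parameter names are hypothetical. Position encodings are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 200

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Randomly initialized parameters, for illustration only.
params = {
    "gate_t": rng.standard_normal((D, D)) / np.sqrt(D),
    "gate_s": rng.standard_normal((D, D)) / np.sqrt(D),
    "proj_t": rng.standard_normal((D, D // 2)) / np.sqrt(D),
    "proj_s": rng.standard_normal((D, D // 2)) / np.sqrt(D),
}

def decouple(emb, p=params):
    """emb: (batch, C, N, D) -> time view, space view, each (batch, C, N, D//2)."""
    # Per-sample summary: global average over channels (C) and patches (N).
    summary = emb.mean(axis=(1, 2))                          # (batch, D)
    # Sample-specific gates, broadcast over the C and N axes.
    g_t = sigmoid(summary @ p["gate_t"])[:, None, None, :]
    g_s = sigmoid(summary @ p["gate_s"])[:, None, None, :]
    # Gate element-wise, then project each view linearly.
    view_t = (emb * g_t) @ p["proj_t"]
    view_s = (emb * g_s) @ p["proj_s"]
    return view_t, view_s
```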

[0166] Step 5: DeFuse encoder processing: The three position-aware streams are input into a 6-layer DeFuse encoder, each layer comprising:

[0167] Parallel temporal / spatial context enhancement;

[0168] First joint context relationship calculation (time dimension);

[0169] Second joint context relationship calculation (spatial dimension);

[0170] Feature defusion. Encoder configuration: 4 sets of context relationships (attention heads), hidden dimension 200, feedforward dimension 800.

[0171] Step 6: Masking and Reconstruction: Apply a composite masking strategy (40% pipeline, 30% channel, 30% random), gradually increasing the masking ratio from 0.3 to 0.75 (cosine scheduling). Train the model using a composite loss function (L_{MSE} + L_{Spectral} + L_{Cov}) to reconstruct the masked patch.
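The curriculum schedule and the composite masking strategy can be sketched as follows. The exact schedule formula is not reproduced in this text, so the half-cosine ramp in `mask_ratio` is an assumption that merely matches the stated endpoints (0.3 and 0.75). Interpreting the 40% / 30% / 30% split as the probability of choosing a strategy per sample, and interpreting pipeline masking as masking the same time positions across all channels, are likewise assumptions.

```python
import math
import numpy as np

def mask_ratio(epoch, total_epochs, r_start=0.30, r_end=0.75):
    """Half-cosine ramp from r_start (epoch 0) to r_end (last epoch)."""
    progress = epoch / total_epochs
    return r_start + (r_end - r_start) * (1.0 - math.cos(math.pi * progress)) / 2.0

def make_mask(ratio, rng, n_channels=19, n_patches=30):
    """Boolean mask over the (channels, patches) grid; True means masked."""
    kind = str(rng.choice(["pipeline", "channel", "random"], p=[0.4, 0.3, 0.3]))
    mask = np.zeros((n_channels, n_patches), dtype=bool)
    if kind == "pipeline":
        # Mask the same time positions on every channel.
        k = max(1, int(round(ratio * n_patches)))
        mask[:, rng.choice(n_patches, size=k, replace=False)] = True
    elif kind == "channel":
        # Mask whole channels.
        k = max(1, int(round(ratio * n_channels)))
        mask[rng.choice(n_channels, size=k, replace=False), :] = True
    else:
        # Mask individual patches uniformly at random.
        k = int(round(ratio * n_channels * n_patches))
        mask.flat[rng.choice(n_channels * n_patches, size=k, replace=False)] = True
    return kind, mask
```

With 40 epochs, this sketch gives a ratio of 0.30 at epoch 0, 0.525 at epoch 20, and 0.75 at epoch 40.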

[0172] Step 7: Pre-training: The model is trained with the AdamW optimizer (learning rate 5 × 10⁻⁴, weight decay 5 × 10⁻², cosine-annealing learning-rate schedule, batch size 128) for 40 epochs. The experiment was conducted on a server equipped with two NVIDIA RTX 4090 GPUs.

[0173] Example 2: Fine-tuning of downstream tasks in emotion recognition

[0174] Taking the FACED dataset as an example, this dataset records 32 channels of EEG from 123 subjects at 250Hz, covering 9 emotion categories.

[0175] Step 1: Data preprocessing: Divide the signal into 10-second windows; resample to 200Hz; subjects 1-80 / 81-100 / 101-123 are used for training / validation / testing respectively;
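The subject-wise split described in this step keeps every window of a given subject in a single partition. A minimal sketch, assuming each window carries an integer subject ID (the function name and argument layout are hypothetical):

```python
import numpy as np

def split_by_subject(subject_ids, train=(1, 80), val=(81, 100), test=(101, 123)):
    """Return index arrays for windows whose subject ID falls in each range."""
    ids = np.asarray(subject_ids)

    def pick(bounds):
        lo, hi = bounds
        return np.flatnonzero((ids >= lo) & (ids <= hi))

    return pick(train), pick(val), pick(test)
```

The default ranges follow the FACED split given above; other datasets would pass their own ranges.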

[0176] Step 2: Load the pre-trained model: Load the weights of the pre-trained DeFuseMod model from Example 1.

[0177] Step 3: Fine-tuning. Keep the same preprocessing and slicing configuration as pre-training, add a 9-class softmax layer after the classification head, fine-tune using 5 random seeds, and report the mean and standard deviation of the evaluation metrics.

[0178] Experimental results: Balanced accuracy: 58.99% ± 1.00%; Cohen's Kappa: 0.5373 ± 0.0121; Weighted F1: 59.78% ± 1.32%.
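For reference, balanced accuracy and Cohen's Kappa can be computed from a confusion matrix as sketched below, using their standard definitions. The helper names are hypothetical, the sketch assumes every class occurs at least once among the true labels, and the population standard deviation in `summarize` is an assumption, since the text does not state which estimator was used.

```python
import numpy as np

def confusion(y_true, y_pred, n_classes):
    """Confusion matrix with true classes on rows, predictions on columns."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

def balanced_accuracy(cm):
    """Mean of per-class recall."""
    return float((np.diag(cm) / cm.sum(axis=1)).mean())

def cohen_kappa(cm):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = cm.sum()
    p_o = np.trace(cm) / n
    p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2
    return float((p_o - p_e) / (1.0 - p_e))

def summarize(scores):
    """Mean and standard deviation over runs with different random seeds."""
    scores = np.asarray(scores, dtype=float)
    return float(scores.mean()), float(scores.std())
```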

[0179] Example 3: Fine-tuning of downstream tasks in motor imagery classification

[0180] Taking the PhysioNet-MI dataset as an example, this dataset contains 64 channels of EEG from 109 subjects performing four motor imagery tasks (left fist, right fist, both fists, and both feet), with a sampling rate of 160Hz.

[0181] Step 1: Data preprocessing: Resample the signal to 200Hz and divide it into 4-second windows; subjects 1-70 / 71-89 / 90-109 are used for training / validation / testing respectively;

[0182] Steps 2-3: Same as in Example 2

[0183] Experimental results: Balanced accuracy: 64.28% ± 1.74%; Cohen's Kappa: 0.5237 ± 0.0111; Weighted F1: 64.42% ± 1.21%.

[0184] Example 4: Fine-tuning of downstream tasks for abnormal EEG detection

[0185] Taking the TUAB dataset as an example, it is used for binary classification of abnormal / normal EEG signals and contains 16 channels of 250Hz EEG signals.

[0186] Step 1: Data preprocessing: Segment the signal into 10-second samples and resample to 200Hz;

[0187] Steps 2-3: Same as in Example 2

[0188] Experimental results: Balanced accuracy: 79.81% ± 0.45%, AUROC: 0.8738 ± 0.0101, AUC-PR: 0.8761 ± 0.0067.

[0189] The above description is merely a preferred embodiment of the present invention and is not intended to limit the invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A dual-domain self-supervised EEG signal representation learning method based on decoupled fusion, characterized in that it comprises: acquiring and preprocessing EEG sample data; dividing the preprocessed EEG samples into non-overlapping patches of fixed size along the time axis to obtain a patch sequence; for each patch, extracting time-domain and frequency-domain features simultaneously and fusing the extracted features to obtain a fused embedding vector for the patch; based on the fused embedding vectors, dynamically decoupling the features into a time-centered view and a space-centered view through a learnable gating mechanism; adding position encodings to the time-centered view, the space-centered view, and the global feature stream, respectively, to obtain three position-aware feature streams; adopting a curriculum-based structured masking strategy to mask some of the patch embedding vectors in the three position-aware feature streams, and replacing the patch embedding vectors at the masked positions with learnable mask tokens; inputting the three masked feature streams into the DeFuse encoder, wherein the DeFuse encoder, through stacked decoupling-fusion layers, performs initial parallel feature decoupling enhancement, view fusion and first joint context relationship calculation, and view reshaping and second joint context relationship calculation on the input feature streams, outputs a deep representation rich in spatiotemporal context information, and predicts the original patch content at the masked positions; and using a multi-objective loss function as the optimization objective, calculating the error between the predicted patch content and the original patches, and training the network parameters of the DeFuse encoder through backpropagation, thereby learning a general EEG signal representation.

2. The dual-domain self-supervised EEG signal representation learning method based on decoupling fusion according to claim 1, characterized in that the time-domain features are extracted as waveform features through 2D convolutional layers: each EEG patch is processed by a series of 2D convolutional layers that learn local temporal patterns and structural features directly from the original signal waveform, a process that can be summarized as a function mapping each patch to a time-domain embedding vector of dimension D; the frequency-domain features are extracted by applying an FFT to each EEG patch to obtain its amplitude spectrum, dividing the spectrum into 6 frequency bands, processing each frequency band with a dedicated MLP to obtain the feature of the b-th frequency band, and finally fusing the features of all frequency bands into a single structured spectral representation, which is the final frequency-domain embedding vector; and the final patch embedding is obtained by summing the outputs of the time-domain and frequency-domain branches element-wise.

3. The dual-domain self-supervised EEG signal representation learning method based on decoupling fusion according to claim 1, characterized in that, based on the fused embedding vectors, the features are dynamically decoupled into a time-centered view and a space-centered view through a learnable gating mechanism as follows: for the embedding tensor, a dynamic feature separator uses a learnable gating network to dynamically decouple the features into the time-centered view and the space-centered view; the gating mechanism generates sample-specific weights based on the global average of the embeddings over the spatial (C) and temporal (N) dimensions; the learned temporal and spatial gates are applied by element-wise multiplication with broadcasting; and linear transformations project the gated features onto their respective D/2-dimensional views.

4. The dual-domain self-supervised EEG signal representation learning method based on decoupling fusion according to claim 1, characterized in that position encodings are added to the time-centered view, the space-centered view, and the global feature stream as follows: temporal position encoding is added to the time-centered view; channel position encoding is added to the space-centered view; and global position encoding is added to the compressed representation of the full feature set.

5. The dual-domain self-supervised EEG signal representation learning method based on decoupling fusion according to claim 1, characterized in that the DeFuse encoder consists of L identical stacked DeFuse encoder layers, each accepting three input streams: the time-centered view, the space-centered view, and the global feature stream; initial parallel feature decoupling enhancement: the temporal and spatial views are first enhanced independently by a novel contextual relation computation module using standard pre-layer normalization (Pre-LN) to capture intra-domain dependencies, the spatial view undergoing the same transformation as the temporal view with an independent set of weights; view fusion and first joint context relationship calculation: the spatial features are reshaped to match the temporal features and concatenated with them to form a joint representation, which is then subjected to a joint attention step along the time dimension; view reshaping and second joint context relationship calculation: the output is reshaped into a time-batched view, and the focus of the second joint context relationship calculation shifts to capturing the dependencies between channels, which are now rich in integrated temporal information; feature defusion and output: the deeply fused representation is segmented and projected back into separate temporal and spatial views for use by the next layer, while the global path processes the global feature stream in parallel.

6. The dual-domain self-supervised EEG signal representation learning method based on decoupling fusion according to claim 1, characterized in that the curriculum-based structured masking strategy includes: 40% pipeline masking, 30% channel masking, and 30% random masking; the proportion of masked patches starts from a low value and gradually increases during training according to a cosine schedule, the mask ratio for the current training round k being determined by k and the total number of training rounds.

7. The dual-domain self-supervised EEG signal representation learning method based on decoupling fusion according to claim 1, characterized in that the multi-objective loss function comprises: a temporal reconstruction loss, which is the standard MSE between the predicted and original masked patches and ensures waveform-level accuracy; a spectral fidelity loss, which is the L1 loss on the patch amplitude spectra and enforces perceptual similarity in the frequency domain; and a spatial structure preservation loss, which penalizes the difference between the channel covariance matrices, measured using the Frobenius norm; the final training objective is a weighted sum of the three components, where the weighting coefficients are hyperparameters that balance the contribution of each loss term.

8. A dual-domain self-supervised EEG signal representation learning system based on decoupled fusion, comprising: a data preprocessing module, a dual-domain embedding module, a dynamic decoupling module, a multi-view position encoding module, a curriculum-based structured masking module, a DeFuse encoder module, and a multi-objective loss calculation module; each module executes the corresponding steps of the method according to any one of claims 1-7.