A method and device for detecting mood disorders based on multimodal heterogeneous graph convolutional neural networks
By fusing EEG, eye movement, micro-expression, and gait information through a multimodal heterogeneous graph convolutional neural network, the limitations of single-modal detection are addressed, enabling more accurate detection of emotional disorders and enhancing the robustness and adaptability of the model.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHANGHAI UNIV
- Filing Date
- 2024-12-09
- Publication Date
- 2026-06-30
AI Technical Summary
Existing methods for detecting mood disorders mainly rely on single-modal information, which suffers from high subjectivity, limitations in physiological signals, and insufficient comprehensiveness, making it impossible to fully capture the multidimensional characteristics of emotions.
We employ a multimodal heterogeneous graph convolutional neural network, combining EEG, eye movement, micro-expression, and gait information. Through data alignment, feature extraction, and fusion, we utilize low-rank matrix factorization and the attention mechanism of Transformer to construct an emotion disorder detection model.
The method improves the accuracy of mood disorder detection and the generalization ability of the model, reflects emotional state more comprehensively, reduces the influence of noise, adapts to imbalance between modalities, and provides a more accurate clinical diagnostic tool.
Figure CN119949829B_ABST
Description
Technical Field
[0001] This invention belongs to the field of medical artificial intelligence technology, and in particular relates to a method and device for detecting mood disorders based on multimodal heterogeneous graph convolutional neural networks.
Background Technology
[0002] Mood disorders refer to exaggerated, confused, or diminished normal emotional responses. They mainly include depression and anxiety disorders. According to the "Chinese Classification and Diagnostic Criteria for Mental Disorders (3rd Edition)," patients with mental disorders exhibit symptoms such as decreased energy or fatigue, poor concentration or distractibility, and reckless behavior.
[0003] Mood disorders can be detected from both neurophysiological and behavioral perspectives. For example, information such as electroencephalography (EEG), pupillary changes, electromyography (EMG), and facial expressions can quantitatively reflect emotional states. Studies have shown that patients with depression exhibit longer fixation times and fewer saccades when viewing negative emotional information; patients with anxiety disorders often show tension and unease in their facial expressions; and patients with mood disorders show significant changes in walking speed, posture, and stride length.
[0004] Traditional methods for detecting mood disorders often rely on information from only one modality, which has some limitations:
[0005] 1. High degree of subjectivity: Participants may not be able to accurately describe their emotional state, or may be influenced by other factors such as individual culture, personality, and context.
[0006] 2. Limitations of physiological signals: Physiological parameters of a single modality are insufficient to fully reflect emotional states. Physiological signals (such as skin conductance, heart rate, and respiratory rate) only indirectly reflect emotions and, given individual differences, cannot comprehensively capture complex emotional experiences.
[0007] 3. Insufficient comprehensiveness of a single modality: Emotion is multidimensional. It is not merely a physiological reaction or external expression, but also includes dimensions such as cognition and subjective experience. A single modality cannot fully capture these dimensions.
[0008] For example, patent document CN118490227A discloses a method and system for extracting spatiotemporal patterns of brain waves for mood disorder assessment tasks. This includes: brain wave data acquisition and preprocessing, brain wave time slicing and artifact removal, subsequence segmentation and temporal dimensionality reduction, spatial relationship matrix construction, establishment of a temporal optimization module, calculation of state position weights, merging of temporal states, and establishment of an assessment model and extraction of interpretable features. This invention, targeting mood assessment tasks, can obtain a comprehensive and dynamic brain spatial pattern from a relatively small amount of data. Based on this pattern, more accurate and stable assessment results can be obtained, assisting doctors in diagnosing illness and differentiating between depression and bipolar disorder.
[0009] Patent document CN118070127A discloses a method for feature extraction and classification of bipolar disorder based on higher-order functional networks, including the following steps: S1, acquiring resting-state functional magnetic resonance imaging (fMRI) data of subjects, preprocessing the data to obtain the BOLD time series of each subject, and constructing a higher-order functional network; S2, using the weights of the higher-order functional network as a candidate feature set, and performing feature selection to obtain a feature set E1 with the greatest recognition ability for bipolar disorder patients; S3, calculating the community overlap index of each subject based on the higher-order functional network, and using it as a feature set E2; S4, fusing E1 and E2 to obtain the final feature set E3; S5, using the final feature set E3 to train a support vector machine classification model to complete the identification of bipolar disorder.
Summary of the Invention
[0010] The purpose of this invention is to provide a method and apparatus for detecting mood disorders based on multimodal heterogeneous graph convolutional neural networks. The method for detecting mood disorders uses multimodal data, including electroencephalogram (EEG), eye movement, micro-expression, and gait information, to identify the types of mood disorders.
[0011] To achieve the first objective of this invention, the following technical solution is provided: a method for detecting mood disorders based on a multimodal heterogeneous graph convolutional neural network, comprising the following steps:
[0012] Acquire user data and corresponding emotional disorder types, including users' electroencephalogram (EEG) information, eye movement information, micro-expressions, and gait information;
[0013] Data alignment processing is performed on the user data to obtain an initial data set with consistent data modalities. The initial data corresponding to each type of information are used as nodes, and the logic of inferring mood disorders based on medical prior knowledge is used as edges, to construct the corresponding meta-paths for inferring mood disorders.
[0014] The dataset is composed of the initial data set, the types of mood disorders, and the corresponding meta-paths for inferring mood disorders.
[0015] A corresponding convolutional neural network is constructed based on a multi-network framework. The convolutional neural network includes a data alignment processing module, a heterogeneous feature extraction module, a feature fusion module, and a prediction module.
[0016] The data alignment processing module is used to perform data alignment processing on the input user data to generate an initial data set;
[0017] The heterogeneous feature extraction module includes a multimodal feature extractor, which is used to extract the data features of each group of data in the initial data group to output the corresponding multimodal feature group;
[0018] The feature fusion module performs feature encoding for each modal feature using a low-rank multimodal fusion method to obtain multiple sets of feature vectors in the same dimension, and uses a pre-constructed meta-path for inferring emotional disorders to associate each feature vector in the obtained multiple sets of feature vectors to obtain a dense representation of multimodal features.
[0019] The visualization module visualizes the obtained dense representation of multimodal features to output the mapping relationship of the distribution of the various data in the user data;
[0020] The convolutional neural network is trained using the dataset to obtain a classification model for detecting mood disorders.
[0021] The user data to be analyzed is input into the classification model to output the mapping relationship of the distribution of the various data.
[0022] This invention aligns multimodal data to ensure they have the same representation. To reduce noise from sensors or other sources, low-rank matrix factorization decomposes the feature representation of each modality into a low-dimensional representation, which reduces redundant information and improves the model's generalization ability. A heterogeneous graph is then constructed in which nodes represent features of different modalities and edges represent the relationships between them. Finally, the Transformer's attention mechanism adaptively allocates weights among the features of different modalities, amplifying the information of the effective modalities and reducing the impact of data noise. The attention mechanism also helps handle imbalances between modalities, ensuring that each modality receives appropriate attention. This comprehensive approach has broad application prospects in the field of mood disorder detection, providing a powerful tool for clinical diagnosis and intervention.
[0023] Specifically, the data alignment process includes cross-modal alignment and group-level alignment, which facilitates the construction of stable models and reduces the risk of overfitting.
[0024] Specifically, cross-modal alignment refers to mapping information from different modalities to a shared feature space and calculating attention scores between modalities to achieve synchronous conversion of multimodal asynchronous time series signals. This improves experimental results and also facilitates model generalization to new data samples.
[0025] Specifically, group-level alignment refers to calculating the unit vectors of the multimodal latent-space features and mapping the features onto a spherical space. The feature constraint of the Frobenius (L2) norm reduces the risk of overfitting and improves the stability and generalization of the model.
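The unit-vector mapping in group-level alignment can be illustrated with a minimal NumPy sketch. It assumes each row of an array is one sample's latent feature; the function name `project_to_unit_sphere`, the epsilon guard, and the array sizes are illustrative choices, not details from the patent.

```python
import numpy as np

def project_to_unit_sphere(features, eps=1e-12):
    """Scale each row (one sample's latent feature) to unit L2 norm."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    # eps guards against division by zero for an all-zero feature row
    return features / np.maximum(norms, eps)

rng = np.random.default_rng(0)
latent = rng.normal(size=(4, 8))      # 4 samples, 8-dim latent space (made-up sizes)
on_sphere = project_to_unit_sphere(latent)
row_norms = np.linalg.norm(on_sphere, axis=1)
```

After this projection only the direction of each feature carries information, which is one common way to keep feature magnitudes from dominating a similarity comparison.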
[0026] Specifically, the multimodal feature extractor includes an STGCN encoder for EEG information, a ResNet encoder for facial expression information, an LSTM encoder for eye movement information, and a ConLSTM encoder for gait information.
[0027] Specifically, the process of low-rank multimodal fusion is as follows:
[0028] For a single modal feature, LSTM is used to compress the corresponding time series information, and the hidden state context vector is extracted from the compressed single modal feature. The hidden state context vector is then encoded to obtain the corresponding feature vector.
[0029] Specifically, an LSTM network is used to process the data for each modality. The role of LSTM is to capture long-term dependencies in the time series, transforming complex time series data into a compact representation, thus preserving the dynamic changes most important for mood disorder detection. LSTM generates a series of hidden states while processing the time series, representing contextual information at each time point. Finally, information from certain key time steps (such as the last frame) or information from all time steps can be aggregated (e.g., by averaging or weighted averaging) to form a single contextual feature representation.
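The aggregation options mentioned above (last time step, average, weighted average) can be sketched as follows. This is an illustration only: it assumes the hidden states are already available as a `(T, D)` array, and the function name and mode strings are hypothetical.

```python
import numpy as np

def aggregate_hidden_states(hidden, mode="last", weights=None):
    """Collapse per-time-step hidden states of shape (T, D) into one context vector (D,)."""
    if mode == "last":
        return hidden[-1]                      # keep only the final time step
    if mode == "mean":
        return hidden.mean(axis=0)             # plain average over time
    if mode == "weighted":
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()                        # normalize weights to sum to 1
        return w @ hidden                      # weighted average over time
    raise ValueError(f"unknown mode: {mode}")

hidden = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])   # toy states, T=3, D=2
last_vec = aggregate_hidden_states(hidden, "last")
mean_vec = aggregate_hidden_states(hidden, "mean")
weighted_vec = aggregate_hidden_states(hidden, "weighted", weights=[1, 1, 2])
```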
[0030] For each modality's context vector, further processing is performed using specific neural network layers (such as fully connected layers, convolutional layers, or dimensionality reduction modules) to map it to a unified feature space. During encoding, the complexity of the network can be constrained, or specific methods (such as low-rank decomposition) can be used to optimize the feature representation and reduce data redundancy. Multimodal data is optimized and fused using methods such as tensor decomposition or attention mechanisms to ensure that each modality is fully represented in the final feature representation, while avoiding redundancy and noise.
[0031] Specifically, the mood disorder inference metapath includes the depression pathway, bipolar disorder pathway, cognitive impairment pathway, and anxiety disorder pathway.
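One possible way to represent such meta-paths in code is as named sequences of modality node types, sketched below. The four pathway names come from the text; the modality sequences assigned to each pathway are placeholders chosen only to make the example run and carry no clinical meaning. The averaging function is a stand-in for learned message passing, not the patent's fusion method.

```python
from collections import defaultdict

# Placeholder meta-paths: the node sequences are illustrative, not clinical claims.
META_PATHS = {
    "depression": ("eeg", "eye", "expression"),
    "bipolar_disorder": ("eeg", "expression", "gait"),
    "cognitive_impairment": ("eeg", "eye", "gait"),
    "anxiety_disorder": ("eeg", "expression", "eye"),
}

def build_edges(meta_paths):
    """Collect the typed, directed edges implied by consecutive nodes on each meta-path."""
    edges = defaultdict(set)
    for name, path in meta_paths.items():
        for src, dst in zip(path, path[1:]):
            edges[(src, dst)].add(name)
    return dict(edges)

def aggregate_along_path(features, path):
    """Average the node features visited by one meta-path."""
    dim = len(features[path[0]])
    total = [0.0] * dim
    for node in path:
        for k, value in enumerate(features[node]):
            total[k] += value
    return [t / len(path) for t in total]

edges = build_edges(META_PATHS)
features = {"eeg": [1.0, 0.0], "eye": [0.0, 1.0],
            "expression": [1.0, 1.0], "gait": [0.0, 0.0]}
pooled = aggregate_along_path(features, META_PATHS["depression"])
```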
[0032] To achieve the second objective of this invention, the following technical method is provided: an emotion disorder detection device, implemented through the above-described emotion disorder detection method based on a multimodal heterogeneous graph convolutional neural network.
[0033] Compared with the prior art, the beneficial effects of the present invention are as follows:
[0034] The present invention uses a multimodal heterogeneous graph neural network to effectively fuse information from different perceptual modalities. Low-rank matrix factorization (LMF) decomposes the feature representation of each modality into a low-dimensional representation, which helps reduce redundant information and improve the model's generalization ability. A heterogeneous graph is constructed in which nodes represent features of different modalities and edges represent the relationships between them, and the Transformer's attention mechanism adaptively allocates weights among the features of different modalities. This amplifies the information of the effective modalities while reducing the impact of data noise. The attention mechanism also helps address the imbalance between modalities, ensuring that each modality receives appropriate attention. This comprehensive approach has broad application prospects in the field of mood disorder detection, providing a powerful tool for clinical diagnosis and intervention.
Description of the Drawings
[0035] Figure 1 is a flowchart of the mood disorder detection method provided in this embodiment;
[0036] Figure 2 is a flowchart of the data alignment processing module provided in this embodiment;
[0037] Figure 3 is a flowchart of feature decomposition provided in this embodiment;
[0038] Figure 4 is a flowchart of feature encoding provided in this embodiment.
Detailed Implementation
[0039] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. The components of the embodiments of the present invention described and shown in the accompanying drawings can generally be arranged and designed in various different configurations. Therefore, the following detailed description of the embodiments of the present invention provided in the accompanying drawings is not intended to limit the scope of the claimed invention, but merely to illustrate selected embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort are within the scope of protection of the present invention.
[0040] As shown in Figure 1, this embodiment provides a method for detecting mood disorders, the steps of which are as follows: acquiring user data and the corresponding mood disorder type, including the user's EEG information, eye movement information, micro-expression information, and gait information;
[0041] Data alignment processing is performed on the user data to obtain an initial data set with consistent data modalities. The initial data corresponding to each type of information are used as nodes, and the logic of inferring mood disorders based on medical prior knowledge is used as edges, to construct the corresponding meta-paths for inferring mood disorders.
[0042] The dataset is composed of the initial data set, the types of mood disorders, and the corresponding meta-paths for inferring mood disorders.
[0043] A corresponding convolutional neural network is constructed based on a multi-network framework. The convolutional neural network includes a data alignment processing module, a heterogeneous feature extraction module, a feature fusion module, and a prediction module.
[0044] The data alignment processing module is used to perform data alignment processing on the input user data to generate an initial data set;
[0045] The heterogeneous feature extraction module includes a multimodal feature extractor, which is used to extract the data features of each group of data in the initial data group to output the corresponding multimodal feature group;
[0046] The feature fusion module performs feature encoding for each modal feature using a low-rank multimodal fusion method to obtain multiple sets of feature vectors in the same dimension, and uses a pre-constructed meta-path for inferring emotional disorders to associate each feature vector in the obtained multiple sets of feature vectors to obtain a dense representation of multimodal features.
[0047] The visualization module visualizes the obtained dense representation of multimodal features to output the mapping relationship of the distribution of the various data in the user data;
[0048] The convolutional neural network is trained using the dataset to obtain a classification model for detecting mood disorders.
[0049] The user data to be analyzed is input into the classification model to output the mapping relationship of the distribution of the various data.
[0050] More specifically, Figure 2 shows a multimodal feature alignment and optimization framework. This framework combines a convolutional neural network (CNN) and a multimodal Transformer (MulT) to solve cross-modal and group-level data alignment problems, and is particularly suitable for integrating and optimizing heterogeneous modal data in the temporal and feature spaces.
[0051] In this embodiment, the framework takes four modalities of data as input: eye tracking, electroencephalography (EEG), facial expression, and gait. First, a modality-specific convolutional network extracts features from the raw signals of each modality. The extracted low-level modal features contain time-series information and spatial distribution characteristics. These features are then input into a multimodal Transformer (MulT), which consists of multiple directional pairwise cross-modal Transformer modules.
[0052] In the MulT module, each cross-modal Transformer constructs query, key, and value matrices and uses an attention mechanism to learn the correlation between the features of two modalities. Specifically, this process iteratively strengthens the low-level feature associations between the target modality and a source modality, capturing complex intermodal interactions. To this end, MulT employs multi-head self-attention and cross-attention mechanisms in each directional module, resulting in more comprehensive and accurate information fusion.
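A single-head version of the cross-modal attention step can be sketched in NumPy as below. It assumes the query comes from the target modality and the keys and values from the source modality, as in standard scaled dot-product attention; the projection matrices are random stand-ins for learned weights, and all sizes are made up. The multi-head and stacked-layer structure of MulT is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(target, source, w_q, w_k, w_v):
    """Target modality attends to source modality (single head, no masking)."""
    q = target @ w_q                           # (T_target, d)
    k = source @ w_k                           # (T_source, d)
    v = source @ w_v                           # (T_source, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (T_target, T_source)
    attn = softmax(scores, axis=-1)            # each target step weights all source steps
    return attn @ v, attn

rng = np.random.default_rng(0)
eeg = rng.normal(size=(5, 4))                  # target: 5 time steps, 4 features
eye = rng.normal(size=(7, 3))                  # source: 7 time steps, 3 features
w_q = rng.normal(size=(4, 8))
w_k = rng.normal(size=(3, 8))
w_v = rng.normal(size=(3, 8))
fused, attn = cross_modal_attention(eeg, eye, w_q, w_k, w_v)
```

Because the output has one row per target time step, the source sequence is effectively resampled onto the target's time axis even though the two sequences have different lengths.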
[0053] The fused time-series data is mapped into a unified high-dimensional feature space, thereby achieving alignment and optimization of multimodal features. Within this high-dimensional feature space, the framework further introduces a Feature Alignment Loss function. This loss function not only considers the correlation between modalities but also reduces distribution discrepancies between modal features, improving the robustness and generalization ability of the final model.
[0054] After completing the feature alignment of multimodal data, in order to further ensure the differences between different individuals and maintain the feature consistency of the same individuals, this invention introduces a contrastive learning strategy. The aligned multimodal features are further optimized through the contrastive learning mechanism, thereby improving classification performance and model robustness.
[0055] Specifically, contrastive learning constructs positive pairs and negative pairs to ensure that the feature representations of samples within the same group are more closely linked, reducing confusion between samples from different groups. In this embodiment, temporal and spatial features of electroencephalogram (EEG) signals with spatiotemporal specificity are extracted. The specific steps are as follows:
[0056] 1. Feature extraction:
[0057] Spatial convolution is used to combine the amplitude characteristics of different potential spatial components to capture the differences in EEG signals in the spatial domain.
[0058] Temporal convolution is used to extract temporal patterns of EEG signal amplitude changes, thereby obtaining feature representations with strong temporal correlation.
[0059] 2. Contrastive learning process:
[0060] In contrastive learning, the model constructs a feature representation $Z_i$ for each sample in a mini-batch of training data. For a positive pair $Z_i$ and $Z_i^{+}$ drawn from the same group, the similarity is maximized; for a negative pair $Z_i$ and $Z_j$ drawn from different groups, the similarity is minimized.
[0061] By introducing a contrastive loss function, this model ensures that the distribution of feature representations in the latent space is consistent with the target properties.
[0062] 3. Design of the contrastive loss function:
[0063] The contrastive loss aims to maximize the similarity between positive pairs while minimizing the similarity between negative pairs. Specifically, it takes the following form:
[0064] $$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\left(\mathrm{sim}(Z_i, Z_i^{+})/\tau\right)}{\sum_{j=1}^{2N}\mathbb{1}_{[j\neq i]}\exp\left(\mathrm{sim}(Z_i, Z_j)/\tau\right)}$$
[0065] where:
[0066] $N$ is the number of samples in the mini-batch;
[0067] $\mathrm{sim}(Z_i, Z_j)$ is the similarity measure between samples $Z_i$ and $Z_j$ (usually cosine similarity); $\tau$ is a temperature parameter used to adjust the sensitivity of the contrastive loss.
[0068] In this embodiment, the total loss for the mini-batch is:
[0069]
[0070] In this way, the model can further optimize the feature distribution in the aligned high-dimensional feature space, ensuring that the feature representation of the same individual is more concentrated, while improving the discriminability between different individuals, thereby providing higher accuracy and generalization ability for multimodal data analysis.
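The contrastive loss described in this section can be sketched in NumPy as follows. The sketch assumes cosine similarity, stacks the N anchors and their N positives into 2N representations, and excludes only the anchor itself from the denominator; the function name, the default temperature, and the toy data are illustrative assumptions.

```python
import numpy as np

def contrastive_loss(z, z_pos, tau=0.5):
    """Average over N anchors of -log(exp(sim(z_i, z_i+)/tau) / sum_{j != i} exp(sim(z_i, z_j)/tau))."""
    n = z.shape[0]
    all_z = np.vstack([z, z_pos])                                   # 2N representations
    all_z = all_z / np.linalg.norm(all_z, axis=1, keepdims=True)    # unit norm, so dot = cosine
    sim = all_z @ all_z.T / tau                                     # (2N, 2N) scaled similarities
    losses = []
    for i in range(n):
        mask = np.ones(2 * n, dtype=bool)
        mask[i] = False                                             # indicator [j != i]
        log_denominator = np.log(np.exp(sim[i, mask]).sum())
        losses.append(log_denominator - sim[i, n + i])              # positive of anchor i is row n + i
    return float(np.mean(losses))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 6))
loss_aligned = contrastive_loss(anchors, anchors.copy())   # positives identical to anchors
loss_opposed = contrastive_loss(anchors, -anchors)         # positives point the opposite way
```

The loss is lower when each positive lies close to its anchor than when it points in the opposite direction, which is the behavior the loss is designed to reward.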
[0071] As shown in Figure 3, the brain regions and neural mechanisms underlying the functional tasks of mood disorder research form the biological theoretical basis of the entire framework. Figure 3 primarily involves several key brain regions and their association with emotional tasks:
Amygdala: As the core region for emotion processing, it is directly related to emotional tasks and participates in the generation and regulation of emotional responses.
Prefrontal Cortex: Plays an important role in facial expression recognition, gait detection, and attention regulation, and is responsible for higher-level cognitive and emotional control.
Anterior Cingulate Cortex: Closely related to emotional tasks, it helps regulate the intensity and response of emotions.
Hippocampus: Directly related to emotional memory and emotional disorders such as depression.
Right Temporoparietal Junction: Related to attention tasks, it is mainly responsible for the integration and distribution of multimodal information.
Occipital Cortex and Frontal Eye Fields: Primarily related to eye-tracking tasks and visual attention.
Basal Ganglia and Cerebellum: Involved in gait detection tasks, providing support for motor characteristics in emotional disorders.
Figure 3 also identifies the functional tasks of each brain region and their corresponding impairments:
Facial expression recognition task: dominated by the prefrontal cortex, it is a core component of emotion regulation; abnormalities may lead to depression or anxiety.
Eye movement task: primarily controlled by the frontal eye field; abnormalities may be associated with inattention or anxiety.
Gait detection: assesses an individual's motor and behavioral characteristics through the collaboration of the motor cortex and cerebellum; these characteristics often show abnormalities in mood disorders.
Emotion task: dominated by the amygdala and anterior cingulate cortex, it directly involves emotion generation and regulation.
Attention task: supported by the parieto-occipital junction and other cortical regions; abnormalities may be associated with bipolar disorder or attention deficit disorder.
Figure 3 also highlights the complex interaction networks between brain regions. The strong connectivity between the amygdala and the prefrontal cortex indicates the coupling of emotion processing and cognitive control. The connection between the parieto-occipital junction and the motor cortex reflects the synergistic effect of attention and motor regulation. The interaction between the hippocampus and the amygdala further supports the importance of emotional memory in mood disorders. Abnormal activity or disrupted connections in these brain regions may lead to different manifestations of mood disorders, providing a clear direction for subsequent functional network modeling and data fusion.
[0072] The upper right part of Figure 3 uses a functional network approach to show the abnormal features of brain region connectivity patterns in mood disorders (such as depression, bipolar disorder, and anxiety disorder). It provides a modeling approach for understanding the neural mechanisms of different mood disorders and reveals the differences among disorders in the form of functionally abnormal networks.
[0073] Depression typically manifests as weakened or abnormally enhanced connections between specific brain regions. For example, the connection between the amygdala and the prefrontal cortex may be too weak, leading to a decline in mood regulation. The intensity of node color in the figure may reflect the level of functional activity; some darker nodes in the depression pathway represent highly active areas, possibly related to excessive emotional burden. Compared with depression, the functional network of bipolar disorder may be more complex, characterized by excessive or unstable connectivity between different brain regions. This abnormal network feature may lead to extreme fluctuations in mood, attention, and behavior, and may be represented in the figure by the density or complexity of connections. The anxiety disorder pathway may be characterized by excessive connectivity in regions such as the amygdala and anterior cingulate cortex, leading to an overreaction to threatening signals. At the same time, abnormal connectivity in brain regions related to attention regulation may make it more difficult for individuals to shift their attention away from threatening stimuli. The network nodes and connections in these mood disorder pathways form the basis of the heterogeneous graph representation. Each disorder corresponds to a specific functional network, indicating the specificity of dynamic interactions between brain regions in different mood disorders.
[0074] These networks reveal the specific neural mechanisms underlying different mood disorders, providing a theoretical basis for understanding the pathological mechanisms of mood disorders. Analysis of these pathways can differentiate between different types of mood disorders, supporting more precise diagnosis.
[0075] Furthermore, feature extraction and fusion are performed on the multimodal data, including the EEG, eye movement, facial expression, and gait modalities. Through message passing, multi-head mapping, and heterogeneous fusion, the data of these modalities are integrated into a unified high-dimensional feature space, providing input for functional network modeling. This multimodal fusion method enhances the relevance and expressive power of cross-modal information. For functional network modeling, the figure shows the construction of functional networks using heterogeneous graphs and the extraction of specific brain region interaction patterns based on the meta-paths. This meta-path analysis provides a data-driven tool for revealing specific characteristics of mood disorders, while also supporting the prediction and classification of disorder types.
[0076] Figure 4 illustrates how features are extracted from four modalities: electroencephalography (EEG), facial expressions, eye movements, and gait. The characteristics of each modality determine its neural network architecture and processing strategy. The following describes each modality and its technical principle in detail:
[0077] Using a Spatio-Temporal Graph Convolution Network (STGCN) to extract EEG features can efficiently capture the interaction between spatial and temporal features in EEG signals.
[0078] 1) Graph Convolution: Using EEG channels as nodes, a graph structure is established through functional connections between brain regions, and graph convolution operations are applied to learn the relationship features between nodes.
[0079] 2) Temporal modeling: Based on graph convolution, temporal convolution is added to extract dynamic features and model the temporal dependence of signals.
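The two STGCN steps above (graph convolution over channels, then temporal convolution) can be sketched as a single untrained layer in NumPy. The sketch assumes a symmetric normalized adjacency with self-loops, one shared temporal filter, and toy sizes; the adjacency matrix, weights, and function names are illustrative, and the temporal filter is applied as a cross-correlation, as is usual in deep learning libraries.

```python
import numpy as np

def normalized_adjacency(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 of a channel graph."""
    a = adj + np.eye(adj.shape[0])            # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def st_graph_conv(x, adj, w_spatial, kernel_t):
    """x: (T, C, F) = time steps x EEG channels x features per channel."""
    a_hat = normalized_adjacency(adj)
    # Spatial step: at every time step, mix neighboring channels and project features.
    h = np.einsum("ij,tjf,fg->tig", a_hat, x, w_spatial)
    h = np.maximum(h, 0.0)                    # ReLU
    # Temporal step: slide a length-K filter along the time axis (no padding).
    k = kernel_t.shape[0]
    steps = h.shape[0] - k + 1
    return np.stack([np.tensordot(kernel_t, h[t:t + k], axes=(0, 0)) for t in range(steps)])

rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]], dtype=float)   # toy ring of 4 channels
x = rng.normal(size=(10, 4, 3))               # 10 time steps, 4 channels, 3 features
w_spatial = rng.normal(size=(3, 5))
kernel_t = np.array([0.25, 0.5, 0.25])
out = st_graph_conv(x, adj, w_spatial, kernel_t)
a_hat = normalized_adjacency(adj)
```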
[0080] Residual networks (ResNet) are used to extract visual features from facial expression images, including local changes (such as frowning, smiling, etc.) and overall shape, which are highly robust to subtle changes in expression.
[0081] 1) Convolution operation: Extract low-level features (such as edges and textures) and high-level features (such as overall expression patterns) of facial expressions through multi-layer convolution operations.
[0082] 2) Residual Connection: Solves the gradient vanishing problem in deep neural network training, while improving the model's expressive power.
[0083] 3) Pre-trained weights: Weights pre-trained on large-scale datasets (such as ImageNet) may be used for transfer learning to enhance the initial performance of the model.
[0084] Long Short-Term Memory (LSTM) networks are used to capture temporal features of eye-tracking data, such as changes in fixation point position and temporal correlation.
[0085] 1) Input data: Eye-tracking data typically includes fixation point coordinates (x, y) and time information, which are input into the LSTM model as time series data.
[0086] 2) Memory unit: The input gate, forget gate and output gate control the updating of historical state and the output of the current time step, and retain long-term dependency information.
[0087] 3) Output features: Generate a high-dimensional feature representation containing eye-tracking behavior patterns for subsequent fusion.
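The gating behavior described for the eye-tracking encoder follows the standard LSTM cell equations, sketched below for one time step. The input is assumed to be `(x, y, t)` per fixation; the weights are small random values standing in for trained parameters, and the function and variable names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, w, u, b):
    """One LSTM step. w: (4H, D), u: (4H, H), b: (4H,), stacked as input, forget, candidate, output."""
    hidden = h.shape[0]
    z = w @ x + u @ h + b
    i = sigmoid(z[:hidden])                    # input gate: how much new information to write
    f = sigmoid(z[hidden:2 * hidden])          # forget gate: how much old cell state to keep
    g = np.tanh(z[2 * hidden:3 * hidden])      # candidate cell content
    o = sigmoid(z[3 * hidden:])                # output gate: how much of the cell to expose
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4                   # (x, y, t) per fixation; hidden size is made up
w = rng.normal(scale=0.1, size=(4 * hidden_dim, input_dim))
u = rng.normal(scale=0.1, size=(4 * hidden_dim, hidden_dim))
b = np.zeros(4 * hidden_dim)
fixations = rng.uniform(size=(6, input_dim))   # toy sequence of 6 fixations
h = np.zeros(hidden_dim)
c = np.zeros(hidden_dim)
for x_t in fixations:
    h, c = lstm_step(x_t, h, c, w, u, b)
```

The final `h` plays the role of the eye-tracking feature vector passed on to fusion.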
[0088] A Convolutional Long Short-Term Memory (ConLSTM) network is used to extract the temporal dynamic features and spatial patterns of gait for modeling gait behavior. ConLSTM combines the spatial feature extraction capability of convolutional networks with the temporal modeling capability of LSTM, making it sensitive to dynamic changes in gait data.
[0089] 1) Convolutional module: captures local spatial features of gait videos, such as the direction and amplitude of movement.
[0090] 2) LSTM network: Further time modeling of these local features to generate global dynamic features of gait sequences.
[0091] 3) Data augmentation: Perform augmentation processing such as rotation and scaling on the input gait data to improve the model's generalization ability.
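Rotation and scaling augmentation can be sketched as below, assuming the gait data have been reduced to 2-D joint coordinates of shape `(T, J, 2)`. That representation, the function name, and the choice to rotate about the sequence centroid are assumptions; the text does not specify how augmentation is applied to the gait input.

```python
import numpy as np

def augment_gait(seq, angle_deg, scale):
    """Rotate and scale a (T, J, 2) joint-coordinate sequence about its centroid."""
    theta = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    center = seq.mean(axis=(0, 1), keepdims=True)     # centroid over all frames and joints
    return (seq - center) @ rot.T * scale + center

rng = np.random.default_rng(0)
seq = rng.normal(size=(5, 3, 2))                      # 5 frames, 3 joints (toy sizes)
unchanged = augment_gait(seq, angle_deg=0.0, scale=1.0)
augmented = augment_gait(seq, angle_deg=90.0, scale=2.0)
```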
[0092] Each modality utilizes a neural network architecture best suited to its characteristics, maximizing feature extraction efficiency. This ensures that the extracted multimodal features retain high-dimensional information representation, laying the foundation for subsequent fusion and emotional disorder modeling. Through neural networks designed with different architectures, multimodal data can be processed in parallel, improving the overall system efficiency.
[0093] Multimodal low-rank matrix factorization integrates the high-dimensional features produced by multimodal feature extraction. It uses low-rank decomposition to reduce computational complexity while preserving the important information interactions between and within modalities. Its technical aspects are described in detail below:
[0094] To achieve effective fusion of multimodal data and address issues such as feature differences between modalities, redundant information, and excessive dimensionality, low-rank decomposition is used to capture shared information between modalities while preserving key features unique to each modality.
[0095] The high-dimensional feature representation is decomposed into a set of low-rank matrices, which are represented as the product of modality-specific factors and shared factors.
[0096]
[0097] where one factor is the modality-specific low-rank factor and the other is the modality-shared factor.
[0098] By constraining the factors through regularization terms, redundant information is eliminated, ensuring that the fusion process focuses on useful features.
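For orientation, one common way to write a regularized factorization of this kind is sketched below. All notation here is assumed for illustration and is not taken from this text: $X_m \in \mathbb{R}^{d_m \times n}$ denotes the feature matrix of modality $m$ over $n$ aligned samples, $U_m \in \mathbb{R}^{d_m \times r}$ a modality-specific factor, $S \in \mathbb{R}^{r \times n}$ a factor shared by all $M$ modalities, $r \ll \min(d_m, n)$ the rank, and $\lambda$ a regularization weight. The Frobenius-norm penalty is likewise only one possible choice of regularization term.

```latex
X_m \approx U_m S, \qquad
\min_{\{U_m\},\, S} \;
\sum_{m=1}^{M} \lVert X_m - U_m S \rVert_F^2
\;+\; \lambda \Big( \sum_{m=1}^{M} \lVert U_m \rVert_F^2 + \lVert S \rVert_F^2 \Big)
```

The first term asks each modality to be reconstructed from its own factor and the shared factor; the second term penalizes large factors so that redundant components are suppressed.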
[0099] First, the feature matrices extracted from each modality (e.g., EEG, facial expression, eye movement, gait) are decomposed into modality-specific and shared components using low-rank decomposition. Modality-specific factors capture the unique features of each modality, such as frequency patterns in EEG and local textures in facial expressions; modality-shared factors capture characteristics common to the modalities, such as cross-modal temporal correlations. The modality-specific and shared factors are recombined using linear or nonlinear methods to generate a fused feature representation, which is further enhanced by an activation function (e.g., ReLU). Each modality is assigned a weight so that the more important modal features dominate during information fusion; these weights are learned through an additional attention mechanism that automatically adjusts the importance of each modality. Alignment operations in the temporal or spatial dimension address inconsistencies in feature representation across modalities.
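The decomposition and recombination described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of this embodiment: it uses a truncated SVD of the stacked feature matrices as the low-rank decomposition, fixed modality weights in place of learned attention weights, and a simple projection-plus-ReLU as the recombination. The names `joint_low_rank` and `fuse`, and the synthetic data, are hypothetical.

```python
import numpy as np

def joint_low_rank(features, rank):
    """features: dict name -> (d_m, n) matrix over the same n aligned samples.
    Returns modality-specific factors U_m (d_m, rank) and one shared factor S (rank, n)."""
    names = list(features)
    stacked = np.vstack([features[k] for k in names])        # (sum of d_m, n)
    U, s, Vt = np.linalg.svd(stacked, full_matrices=False)
    U_r = U[:, :rank]
    S = s[:rank, None] * Vt[:rank]                            # shared across modalities
    specific, start = {}, 0
    for k in names:
        d = features[k].shape[0]
        specific[k] = U_r[start:start + d]                    # rows belonging to modality k
        start += d
    return specific, S

def fuse(features, specific, weights):
    """Project each modality into the shared rank-r space, mix with the given
    modality weights (assumed fixed here), and apply ReLU."""
    mixed = sum(weights[k] * specific[k].T @ features[k] for k in features)
    return np.maximum(mixed, 0.0)

# Synthetic example: two modalities driven by the same 2-dimensional latent signal
rng = np.random.default_rng(0)
shared_true = rng.normal(size=(2, 50))
features = {
    "eeg": rng.normal(size=(16, 2)) @ shared_true,
    "gait": rng.normal(size=(8, 2)) @ shared_true,
}
specific, S = joint_low_rank(features, rank=2)
fused = fuse(features, specific, {"eeg": 0.6, "gait": 0.4})
```

Because the synthetic data are exactly rank 2, `specific["eeg"] @ S` reproduces the EEG matrix up to floating-point error; with real features the rank controls how much detail is kept.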
[0100] Multimodal low-rank matrix factorization is efficient, in that the low-rank representation reduces data dimensionality and computational complexity; robust, in that separating modality-specific factors from shared factors improves resistance to interference from redundant features; and information-preserving, in that it accounts for both the correlation between modalities and the independence within each modality, achieving effective information fusion.
[0101] Multimodal low-rank matrix factorization is one of the core steps of the entire model. Through its decomposition and reconstruction mechanism, complex high-dimensional features from multiple modalities are integrated into a compact and information-rich representation.
[0102] For each modality of data, an LSTM network is used for processing. The role of LSTM is to capture long-term dependencies in the time series, transforming complex time series data into a compact representation, thereby preserving the dynamic changes most important for mood disorder detection. LSTM generates a series of hidden states while processing the time series, representing the contextual information at each time point. Finally, information from certain key time steps (such as the last frame) can be extracted, or information from all time steps can be aggregated (such as by averaging or weighted averaging) to form a single contextual feature representation.
[0103] For each modality's context vector, further processing is performed using specific neural network layers (such as fully connected layers, convolutional layers, or dimensionality reduction modules) to map it to a unified feature space. During encoding, the complexity of the network can be constrained, or specific methods (such as low-rank decomposition) can be used to optimize the feature representation and reduce data redundancy. Multimodal data is then optimized and fused using methods such as tensor decomposition or attention mechanisms to ensure that each modality is fully represented in the final feature representation, while avoiding the introduction of redundancy and noise.
[0104] The cross-modal fusion component, by introducing a Transformer architecture and attention mechanism, aims to integrate multimodal data (such as EEG, facial expressions, eye movements, gait, etc.) into a semantically consistent shared representation, while preserving the diversity and correlation of modal features. The fusion process first uses features obtained from the low-rank cross-modal component as input, extracts local dependency features within the modality through convolutional operations (Conv1D), and adds positional information to the time-series data through positional embedding layers, thereby enhancing context awareness.
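A minimal sketch of the two preprocessing steps named above, a one-dimensional convolution along the time axis followed by the addition of positional information, is given below. The fixed sinusoidal position codes are an assumption (the text only states that positional embedding layers are used, which could equally be learned), and the helper names and kernel values are hypothetical.

```python
import numpy as np

def conv1d_same(x, kernel):
    """x: (T, d_in); kernel: (k, d_in, d_out) with odd k.
    Zero-pads in time so the output keeps T steps."""
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([
        np.einsum("kd,kdo->o", padded[t:t + k], kernel)
        for t in range(x.shape[0])
    ])

def sinusoidal_positions(T, d):
    """Fixed sinusoidal position codes of shape (T, d); d must be even."""
    pos = np.arange(T)[:, None]
    freq = np.exp(-np.log(10000.0) * np.arange(0, d, 2) / d)
    codes = np.zeros((T, d))
    codes[:, 0::2] = np.sin(pos * freq)
    codes[:, 1::2] = np.cos(pos * freq)
    return codes

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))                      # 5 time steps, 4 features
kernel = rng.normal(scale=0.1, size=(3, 4, 4))   # untrained placeholder weights
encoded = conv1d_same(x, kernel) + sinusoidal_positions(5, 4)
```

The convolution mixes each time step with its immediate neighbors (local dependencies within a modality), and the added codes let later attention layers distinguish positions in the sequence.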
[0105] The core cross-modal fusion is implemented by a Transformer, whose most important mechanism is crossmodal attention. In this mechanism, the feature vector of each modality is treated as a query, and the features of the other modalities are treated as keys and values. The attention calculation captures the dependencies between modalities. At the same time, the self-attention mechanism further enhances the fused feature representation, ensuring sufficient information interaction between modalities and ultimately generating semantically consistent multimodal features.
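The query/key/value arrangement described above can be sketched with single-head scaled dot-product attention. This is an illustrative sketch only: the projection matrices are random placeholders, the dimensions are arbitrary, and the function name `crossmodal_attention` is an assumption.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def crossmodal_attention(target, source, Wq, Wk, Wv):
    """target: (T_t, d) features of the querying modality.
    source: (T_s, d) features of another modality (keys and values).
    Returns the target sequence re-expressed using source information."""
    Q, K, V = target @ Wq, source @ Wk, source @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # (T_t, T_s) cross-modal affinities
    weights = softmax(scores, axis=-1)           # each target step attends over source steps
    return weights @ V, weights

rng = np.random.default_rng(0)
d, d_k = 6, 4
eeg = rng.normal(size=(10, d))                   # 10 EEG time steps (toy data)
gait = rng.normal(size=(7, d))                   # 7 gait time steps (toy data)
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d, d_k)) for _ in range(3))
eeg_from_gait, attn = crossmodal_attention(eeg, gait, Wq, Wk, Wv)
```

Note that the two sequences may have different lengths; the output keeps the length of the querying modality.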
[0106] To achieve cross-modal feature alignment, this section also employs strategies such as time synchronization, feature scale normalization, and modality mapping. Dynamic Time Warping (DTW) is used to align time-series modalities (such as eye movement and gait), while feature normalization adjusts the data scale of different modalities to be consistent. Simultaneously, the shared Transformer weight design enables different modalities to map and interact within a unified latent space.
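Dynamic Time Warping itself is a standard dynamic-programming algorithm; a minimal sketch for one-dimensional sequences is shown below. Restricting the input to scalar samples and using the absolute difference as the local cost are simplifications chosen for illustration, and the example sequences are made up.

```python
def dtw(a, b):
    """Align 1-D sequences a and b. Returns (total cost, warping path),
    where the path is a list of (index in a, index in b) pairs."""
    n, m = len(a), len(b)
    inf = float("inf")
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    # Backtrack from the end to recover which samples were matched
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (D[i - 1][j - 1], i - 1, j - 1),
            (D[i - 1][j], i - 1, j),
            (D[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return D[n][m], path

# The second sequence is a time-stretched copy of the first
slow = [0.0, 1.0, 2.0, 3.0]
fast = [0.0, 0.0, 1.0, 2.0, 2.0, 3.0]
cost, path = dtw(slow, fast)
```

The returned path can be used to resample one modality onto the time base of the other before feature normalization.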
[0107] In terms of optimization, the model combines multiple loss functions, including a modality alignment loss and task-specific classification or regression losses. The modality alignment loss enhances semantic consistency between modalities by minimizing the differences in mean and covariance among the modality feature distributions. Furthermore, regularization terms and Dropout are used to suppress overfitting and improve the model's generalization ability.
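One way to express an alignment loss based on mean and covariance differences, and to combine it with a task loss and a regularization term, is sketched below. The exact form of the loss, the loss weights (`align_weight`, `l2_weight`), and the function names are assumptions for illustration; the text does not give them.

```python
import numpy as np

def moment_alignment_loss(feat_a, feat_b):
    """feat_a, feat_b: (n, d) batches of features from two modalities.
    Squared distance between means plus squared Frobenius distance between covariances."""
    mean_gap = feat_a.mean(axis=0) - feat_b.mean(axis=0)
    cov_gap = np.cov(feat_a, rowvar=False) - np.cov(feat_b, rowvar=False)
    return float(mean_gap @ mean_gap + np.sum(cov_gap ** 2))

def total_loss(task_loss, align_loss, params, align_weight=0.1, l2_weight=1e-4):
    """Task loss + weighted alignment loss + L2 regularization over parameter arrays."""
    l2 = sum(float(np.sum(p ** 2)) for p in params)
    return task_loss + align_weight * align_loss + l2_weight * l2

rng = np.random.default_rng(0)
eeg_feat = rng.normal(size=(32, 3))
gait_feat = eeg_feat + 1.0        # same covariance, mean shifted by 1 in each of 3 dims
loss_same = moment_alignment_loss(eeg_feat, eeg_feat)
loss_shifted = moment_alignment_loss(eeg_feat, gait_feat)
combined = total_loss(0.5, loss_shifted, [np.ones((2, 2))])
```

In this toy case the covariances match and only the means differ, so the alignment loss equals the squared length of the mean shift.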
[0108] This embodiment also provides a mood disorder detection device, which is implemented through the mood disorder detection method provided in the above embodiment. The specific implementation details are as follows:
[0109] 1) Multimodal input and alignment: EEG, facial expression, eye movement, and gait data are aligned along the time axis and at the feature level and unified into a single multimodal input format, ensuring temporal consistency and feature alignment.
[0110] 2) Modeling of mood disorders: Functional areas are mapped using brain networks and linked to emotional task pathways, and the interaction patterns and task responses of brain regions related to mood disorders are analyzed.
[0111] 3) Low-rank matrix decomposition: Low-rank decomposition is performed on the multimodal feature matrix to extract shared features and modality-specific factors, reducing data dimensionality and enhancing feature expression.
[0112] 4) Cross-modal fusion: Modal features are fused through the interaction mechanism of the Transformer and the convolutional network to enhance cross-modal information sharing and high-order semantic expression.
[0113] 5) Diagnosis and prediction: Classification and performance evaluation are performed on the fused feature vectors to generate mood disorder detection results and an analysis of model performance metrics.
[0114] Furthermore, the terms "upper," "lower," "inner," "outer," "front," and "rear" are used for descriptive purposes only and should not be construed as indicating or implying relative importance. Unless otherwise specifically stated, the relative steps, numerical expressions, and values of the components and steps set forth in these embodiments do not limit the scope of the invention.
[0115] Of course, the above description is only a specific embodiment of the present invention and is not intended to limit the scope of the present invention. All equivalent changes or modifications made to the structure, features and principles described in the claims of the present invention should be included in the scope of the claims of the present invention.
[0116] Finally, it should be noted that the above-described embodiments are merely specific implementations of the present invention, used to illustrate its technical solutions rather than to limit them, and the scope of protection of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that, within the technical scope disclosed in the present invention, they can still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features; such modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and should all be covered by the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.
Claims
1. A method for detecting mood disorders based on multimodal heterogeneous graph convolutional neural networks, characterized in that it includes the following steps: acquiring user data and the corresponding mood disorder types, the user data including the user's electroencephalogram (EEG) information, eye movement information, micro-expressions, and gait information; performing data alignment processing on the user data to obtain an initial data set with consistent data modalities; using the initial data corresponding to each type of information as nodes, and the logic for inferring mood disorders from medical prior knowledge as connecting lines, to construct the corresponding mood disorder inference meta-paths, wherein the mood disorder inference meta-paths include a depression pathway, a bipolar disorder pathway, and an anxiety disorder pathway; the depression pathway is characterized by weakened or abnormally enhanced connectivity between the amygdala and the prefrontal cortex; the bipolar disorder pathway is characterized by excessive or unstable connectivity between different brain regions; the anxiety disorder pathway is characterized by excessive connectivity in the amygdala and anterior cingulate cortex regions; forming a dataset from the initial data set, the mood disorder types, and the corresponding mood disorder inference meta-paths; and constructing a corresponding convolutional neural network based on a multi-network framework, the convolutional neural network including a data alignment processing module, a heterogeneous feature extraction module, a feature fusion module, and a visualization module.
The data alignment processing module is used to perform data alignment processing on the input user data to generate the initial data set; the heterogeneous feature extraction module includes a multimodal feature extractor, which is used to extract the data features of each group of data in the initial data set and output the corresponding multimodal feature group; the feature fusion module encodes each modal feature using a low-rank multimodal fusion method to obtain multiple sets of feature vectors of the same dimension, and uses the pre-constructed mood disorder inference meta-paths to associate the feature vectors in the obtained sets, yielding a dense representation of the multimodal features; the visualization module visualizes the obtained dense representation of the multimodal features to output the mapping relationship of the distribution of the various data in the user data; the convolutional neural network is trained using the dataset to obtain a classification model for detecting mood disorders; and the user data to be analyzed is input into the classification model to output the mapping relationship of the distribution of the various data.
2. The method for detecting mood disorders based on multimodal heterogeneous graph convolutional neural networks according to claim 1, characterized in that the data alignment processing includes cross-modal alignment and group-level alignment.
3. The method for detecting mood disorders based on multimodal heterogeneous graph convolutional neural networks according to claim 2, characterized in that the cross-modal alignment refers to synchronously converting multimodal asynchronous time series signals by mapping information from different modalities to a shared feature space and calculating attention scores between modalities.
4. The method for detecting mood disorders based on multimodal heterogeneous graph convolutional neural networks according to claim 2, characterized in that the group-level alignment refers to calculating the unit vectors of the multimodal latent space features and mapping the features onto a spherical space under a Frobenius 2-norm feature constraint.
5. The method for detecting mood disorders based on multimodal heterogeneous graph convolutional neural networks according to claim 1, characterized in that the multimodal feature extractor includes an STGCN encoder for EEG information, a ResNet encoder for facial expression information, an LSTM encoder for eye movement information, and a ConLSTM encoder for gait information.
6. The method for detecting mood disorders based on multimodal heterogeneous graph convolutional neural networks according to claim 1, characterized in that the specific process of the low-rank multimodal fusion is as follows: for a single modal feature, an LSTM is used to compress the corresponding time series information, a hidden-state context vector is extracted from the compressed single modal feature, and the hidden-state context vector is then encoded to obtain the corresponding feature vector.
7. A mood disorder detection device, characterized in that it is used to implement the method for detecting mood disorders based on multimodal heterogeneous graph convolutional neural networks according to any one of claims 1 to 6.