A multi-modal emotion recognition method and system
By combining spatial and temporal context encoders with a modal attention mechanism, this multimodal fusion method addresses the failure of existing technologies to fully utilize the spatial contextual information of EEG signals and facial expression data, thereby achieving more efficient emotion recognition.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SOUTH CHINA NORMAL UNIV
- Filing Date
- 2024-07-18
- Publication Date
- 2026-06-19
AI Technical Summary
Existing multimodal emotion recognition methods fail to fully utilize the spatial contextual information of EEG signals and facial expression data, resulting in insufficient recognition accuracy and stability. Furthermore, existing methods have limitations in model training efficiency and fusion strategies.
Multimodal features are extracted using spatial context encoders and temporal context encoders, and multimodal fusion is performed with a modality attention mechanism. Modality features are learned through Swin Transformer and Transformer structures, and modality weights are dynamically adjusted for emotion classification.
It improves the accuracy and stability of emotion recognition, enhances the model's generalization ability, and enables it to adapt to different emotional expression scenarios, providing research and application support in the field of affective computing.
Figure CN119066559B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the fields of artificial intelligence, computer vision, and affective computing, and specifically relates to a multimodal emotion recognition method and system.
Background Technology
[0002] Emotion recognition technology has demonstrated its immense potential and value in numerous practical applications. For example, in human-computer interaction, this technology enables machines to better understand human needs and provide intelligent services. In medical diagnostics, emotion recognition greatly aids in the diagnosis and treatment of neurological disorders such as sleep disorders, schizophrenia, and Parkinson's disease, while also monitoring patients' physiological conditions, such as fatigue, drowsiness, depression, and pain. Furthermore, emotion recognition is also significant for research on mental illnesses such as autism, ADHD, and panic disorder.
[0003] In the past, research on emotion recognition mainly focused on the analysis and application of single data sources (such as text, speech signals, or facial expressions), for example Text Sentiment Analysis (TSA), Speech Emotion Recognition (SER), and Facial Expression Recognition (FER). Although these technologies can identify emotions relatively accurately through machine learning and deep learning, each single-modal recognition method has limitations, such as insufficient recognition accuracy, difficulty in balancing accuracy and time efficiency, and lack of interpretability.
[0004] In recent years, researchers in the field of emotion recognition have gradually recognized that emotion expression is a complex and multidimensional process. It does not rely solely on a single perceptual channel but requires the integration of information from multiple perceptual channels to form a more comprehensive and accurate emotional representation. Therefore, multimodal emotion recognition technology has emerged, combining information from multiple perceptual sources to identify and understand human emotional states. Compared to unimodal emotion recognition, multimodal emotion recognition can fully leverage the consistency and complementarity features between different modalities of emotional data.
[0005] Facial expressions are one of the modalities that can accurately represent a variety of emotions. As a non-contact natural feature, the human face is widely used in various scenarios due to its high distinguishability, ease of acquisition, and stability. Electroencephalography (EEG), as a type of physiological signal, can reflect the electrical activity of nerve cells in the cerebral cortex, thus revealing the true emotional state. Many psychophysiological studies have indicated that emotions are related to the concentrated electrical activity of nerve cells in the cerebral cortex. Therefore, combining EEG signals with facial expression data, utilizing the complementary information from these two sensory channels, can overcome the limitations of a single data source in emotion recognition, thereby improving the accuracy and stability of emotion recognition.
[0006] Current patents involving multimodal emotion recognition based on EEG signals and facial expressions have the following problems:
[0007] A multimodal emotion recognition method and system based on confidence fusion, patent application number CN117591967A, fails to fully consider the inherent spatial context information in each modality of signal (EEG signal and facial expression data), neglecting to utilize this information to improve recognition accuracy. The LSTM (Long Short-Term Memory) method, due to its element-wise processing characteristics, limits the model's parallelization capability, resulting in low training efficiency. Furthermore, LSTM has limited ability to capture long-distance dependencies when processing long sequences, further affecting the accuracy of emotion recognition.
[0008] A multimodal emotion recognition method, device, electronic device, and storage medium, patent application number CN115359576A, fails to fully utilize the spatial context information of each modality signal. In the multimodal fusion stage, it employs a fixed weight allocation method, which ignores the differences in the dominance of different emotions across different modalities and fails to capture the potentially nonlinear and complex relationships between modalities. Therefore, this fusion strategy lacks theoretical and methodological rationality.
[0009] A multimodal emotion recognition method and system based on wearable devices, patent application number CN117520826A, fails to effectively utilize the spatial context information of each modality signal. The SVM (Support Vector Machine) method it employs has limitations in processing high-dimensional, complex multimodal data; it is sensitive to data noise and outliers, and is more suitable for small-scale, low-dimensional scenarios with high data quality. Furthermore, the method does not mention a specific multimodal fusion scheme, which to some extent limits its effectiveness in practical applications.
[0010] A multimodal emotion recognition method, system, device, and medium, patent application number CN117409396A, also attempts to combine electroencephalogram (EEG) signals with facial expressions for multimodal emotion recognition, directly concatenating EEG feature vectors and facial feature vectors. However, due to the inherent heterogeneity between facial images and EEG signals, concatenating these two features for emotion classification may result in the loss of semantic information related to emotions in each modality, thus reducing classification performance. Furthermore, this method only supports binary emotion recognition, limiting its application in more in-depth research and practical scenarios within the field of emotion recognition.
Summary of the Invention
[0011] The purpose of this invention is to provide a multimodal emotion recognition method and system that solve the problem noted in the background art: current emotion recognition methods rely on a single data source, which prevents computers from effectively recognizing and understanding the user's emotional state.
[0012] To achieve the above objectives, the present invention provides the following technical solution: a multimodal emotion recognition method, specifically comprising the following steps:
[0013] Step 1: Data preprocessing and feature extraction. For emotional EEG, the EEG data is processed using the EEGLAB toolbox. First, a zero-phase FIR filter band-pass filters the EEG data to the 0.5-50 Hz band to suppress interference from 50 Hz line noise. Defective electrodes are removed and their positions are reconstructed by spherical interpolation, and ICA is performed to eliminate non-brain artifacts, ensuring the accuracy of subsequent analysis. The EEG data is then downsampled to a sampling frequency of 256 Hz to speed up computation. After this processing, the multi-channel EEG undergoes a three-dimensional transformation.
[0014] Step 2: Use a spatial context encoder to extract spatial context features from the preprocessed emotional EEG features and facial expression features.
[0015] Step 3: Further extract temporal context features using a temporal context encoder.
[0016] Step 4: Use a multimodal fusion module based on modal attention mechanism to fuse the spatiotemporal feature representations of different modalities, and finally perform multimodal feature emotion classification. Using the same network architecture to process data from different modalities is not only beneficial to the uniformity and efficiency of multimodal model training, but also helps to complement and fuse information from different modalities in subsequent stages, thereby improving the performance of emotion recognition.
[0017] As a preferred technical solution of the present invention, in step one, the three-dimensional transformation process involves converting the signals recorded by multiple EEG electrodes into a topographic map, wherein the power distribution of each electrode constitutes a specific point on the topographic map.
[0018] As a preferred technical solution in this invention, in step two, the spatial context encoder uses Swin Transformer as the backbone architecture to learn the spatial context information of EEG features and facial features, and ensures that high-resolution data information is not lost during processing, which is crucial for capturing subtle changes in the spatial distribution of the data. Spatial context features are extracted from the preprocessed EEG features or facial features to obtain a feature map that fuses spatial context information.
[0019] As a preferred technical solution in this invention, in step three, the temporal context encoder is designed based on the Transformer structure to mitigate problems in the time dimension, such as face occlusion and abnormal EEG peaks, and to capture how EEG signals and facial expressions change over time.
[0020] As a preferred technical solution in this invention, when exploring multimodal emotion recognition, it was observed that the dominance of different emotions varies across different modalities. Some emotions may be primarily dominated by electroencephalogram (EEG) signals, while others may be more represented by facial expressions. To adapt to the backbone network of this invention and fully utilize the information from these two modalities, a multimodal attention mechanism is designed. This mechanism, based on the calculated attention weights (i.e., in step four, the multimodal fusion module based on the modal attention mechanism performs weighted fusion of the encoded features of EEG and facial expressions), allows for dynamic adjustment of the weights according to the contribution of each modality to a specific emotion. For emotions dominated by EEG, higher weights are given to EEG features; while for emotions dominated by facial expressions, greater weights are given to facial expression features.
[0021] As a preferred technical solution in this invention, the calculation method of the multimodal fusion module based on the modal attention mechanism is as follows:
[0022] First, let the EEG or face encoder output from the backbone network be the feature $X \in \mathbb{R}^{pre}$, where $pre$ denotes the original dimension of $X$. This feature is the input of the module and is processed further to obtain multimodal feature representations. The following composite operation yields a set of $n$ features $X_n \in \mathbb{R}^{pre \times n}$, where $n$ is the number of modalities; in this task $n = 2$, indicating that bimodal data is processed. The composite operation applies the dimension-aligning convolution AlignCov (which aligns high-dimensional features to low-dimensional features), followed by batch normalization (BN) and the rectified linear unit (ReLU), which helps to align the dimensions of the feature maps of different modalities and improves the stability of training convergence.
[0023] $X_n = \mathrm{ReLU}(\mathrm{BN}(\mathrm{AlignCov}(X)))$ (1)
[0024] Subsequently, to compute the modal attention tensor, a fully connected operator Fc is introduced. The fully connected layer learns the complex relationships between features of different modalities and outputs a modal attention weight vector. The softmax operator is applied to normalize the weight vector to obtain the attention score for each modality. These scores reflect the importance of different modalities in the current task. The computation process of the modal attention tensor can be expressed as:
[0025]
[0026] Finally, the modal attention tensor $S_n$ is multiplied element-wise (⊙) with the feature set $X_n$ to obtain the modal-attention-weighted feature $X_{out}$. This operation weights and combines the features of the different modalities according to their attention scores, thereby achieving effective fusion of multimodal information. Mathematically, this is expressed as:
[0027] $X_{out} = S_n \odot X_n$ (4)
[0028] To learn cross-modal associations and perform classification, Xout is input into a Multilayer Perceptron (MLP). The constructed MLP is a three-layer stacked fully connected network, utilizing its non-linear fitting capability to further extract and integrate deep features from multimodal information. Through the layer-by-layer transmission and processing of the MLP, potential associations between different modalities can be captured, and emotion classification tasks can be performed based on these features. Finally, a softmax layer is applied to obtain the normalized output probability of each emotion category. The role of the softmax layer is to transform the raw output score of the MLP into a probability distribution, ensuring that the probability value of each emotion category is between 0 and 1, and that the sum of the probabilities of all categories is strictly equal to 1. The probability distribution form helps in subsequent loss calculation and interpretation of emotion recognition results. Subsequently, the cross-entropy loss function is used as the optimization objective during training. The cross-entropy loss function is an effective tool for measuring the difference between the probability distribution predicted by the model and the actual emotion label distribution. Its calculation formula is based on the probability distribution output by the softmax layer and the actual emotion label distribution, and the difference is quantified by calculating the negative log-likelihood value between the two. During training, the model continuously adjusts its internal parameters to minimize the cross-entropy loss function, so that the predicted probability distribution gradually approaches the actual emotion label distribution, thereby improving the accuracy and performance of emotion recognition.
[0029] This invention also discloses a multimodal emotion recognition system, which is implemented by combining B / S (browser / server) and C / S (client / server) architectures. The system is divided into three core components: front-end, back-end, and database, in order to improve the maintainability and scalability of the system.
[0030] The front-end component, serving as the user interaction portal, uses C++ with the Qt framework to build the client interface. It also uses the Vue.js framework to implement the data management interface for back-end administrators, providing rich operation interfaces and an intuitive display of emotion recognition results for both user types. This component receives user operation commands and sends task requests to the back-end via API interfaces. It also receives the task execution results returned by the back-end and displays them to the user visually.
[0031] The backend component, as the core of the business logic, uses the Python programming language and leverages the Django REST framework to build a RESTful API, enabling data interaction between the frontend and backend. The Jinja2 template engine is used for page rendering to improve user experience. The backend component receives task requests sent by the frontend, executes corresponding system management or emotion recognition logic, and calls the database component for data storage and retrieval.
[0032] The database component comprises MySQL, Redis, and SQLite. MySQL, a relational database management system, stores structured data such as user information and network models; Redis, an in-memory database, caches hot data to improve system response speed; SQLite handles encrypted storage and management of local data on the client side. The database component provides data access interfaces to serve data operation requests from the backend component.
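The division of labor between the relational store and the in-memory cache described above matches the common cache-aside read pattern. The sketch below only illustrates that pattern: plain Python dictionaries stand in for both stores, no real MySQL or Redis client is used, and the names `db`, `cache`, and `get_user` are hypothetical.

```python
from typing import Optional

# Hypothetical stand-ins: `db` plays the relational store (MySQL in the text),
# `cache` plays the in-memory cache (Redis in the text).
db = {"u1": {"name": "Alice"}, "u2": {"name": "Bob"}}
cache: dict = {}
db_reads = 0  # counts how often the slower store is consulted


def get_user(user_id: str) -> Optional[dict]:
    """Cache-aside read: try the cache first, fall back to the database."""
    global db_reads
    if user_id in cache:
        return cache[user_id]
    db_reads += 1
    record = db.get(user_id)
    if record is not None:
        cache[user_id] = record  # populate the cache for later requests
    return record


first = get_user("u1")   # cache miss: read from db, then cached
second = get_user("u1")  # cache hit: db is not consulted again
```

After the two calls, the database stand-in has been read once, which is the effect the text attributes to caching hot data.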
[0033] As a preferred technical solution of the present invention, the system also includes the following:
[0034] At the algorithm level, the MATLAB EEGLAB toolbox and the MNE-Python library are used to preprocess EEG signals and extract features, and the Pillow library is used to preprocess facial expression images and extract key information;
[0035] Deep learning models are built and managed with the PyTorch and OpenMMLab deep learning frameworks;
[0036] For service deployment, Nginx is used as the reverse proxy server and uWSGI as the web server.
[0037] Compared with the prior art, the beneficial effects of the present invention are:
[0038] 1. The method proposed in this invention fully considers the spatial and temporal contextual information of different modalities, namely EEG signals and facial expressions; by introducing a spatiotemporal context encoder, it can capture key information at different time points and spatial locations, thereby gaining a deeper understanding of the dynamic changes and spatial distribution characteristics of emotions; this comprehensive utilization of contextual information is crucial for improving the accuracy of emotion recognition; by simultaneously analyzing the neural dynamics in EEG signals and subtle changes in facial expressions, it can more accurately capture the complexity and multidimensionality of emotions, bringing significant performance improvements to emotion recognition technology;
[0039] 2. This invention proposes an innovative multimodal fusion method that can adaptively adjust the attention given to EEG signals and facial expressions according to specific circumstances. By introducing an attention mechanism or dynamic weight allocation strategy, the system can focus on key information in both modalities in real time to optimize the accuracy and robustness of emotion recognition. This adaptive fusion strategy not only improves the generalization ability of the model but also provides new ideas for the development of emotion recognition technology. In practical applications, it can cope with the diversity of emotional expression in different scenarios, providing strong support for research and application in the field of affective computing.
[0040] 3. This invention proposes a scheme to reduce the heterogeneity between EEG signals and facial expression data, which can reduce the differences between EEG signals and facial expression data and achieve effective fusion of the two modalities. This processing scheme not only improves the accuracy of multimodal emotion recognition, but also enhances the robustness and stability of the system.
Attached Figure Description
[0041] Figure 1 is a roadmap for extracting EEG features and facial expression features in the present invention;
[0042] Figure 2 is a flowchart of the emotion classification method of the present invention;
[0043] Figure 3 is a schematic diagram of the processing flow of the spatial context encoder of the present invention;
[0044] Figure 4 is a schematic diagram of the processing flow of the temporal context encoder of the present invention;
[0045] Figure 5 is a flowchart of the multimodal fusion module based on the modal attention mechanism of the present invention;
[0046] Figure 6 is a system diagram of the present invention.
Detailed Implementation
[0047] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0048] Referring to Figures 1 to 6, the present invention provides a technical solution: a multimodal emotion recognition method, specifically including the following steps:
[0049] Step 1: Data preprocessing and feature extraction. For emotional EEG, the EEG data is processed using the EEGLAB toolbox. First, a zero-phase FIR filter band-pass filters the EEG data to the 0.5-50 Hz band to suppress interference from 50 Hz line noise. Defective electrodes are removed and their positions are reconstructed by spherical interpolation, and ICA is performed to eliminate non-brain artifacts, ensuring the accuracy of subsequent analysis. The EEG data is then downsampled to a sampling frequency of 256 Hz to speed up computation. After this processing, the multi-channel EEG undergoes a three-dimensional transformation.
[0050] Step 2: Use a spatial context encoder to extract spatial context features from the preprocessed emotional EEG features and facial expression features.
[0051] Step 3: Further extract temporal context features using a temporal context encoder.
[0052] Step 4: Use a multimodal fusion module based on modal attention mechanism to fuse the spatiotemporal feature representations of different modalities, and finally perform multimodal feature emotion classification. Using the same network architecture to process data from different modalities is not only beneficial to the uniformity and efficiency of multimodal model training, but also helps to complement and fuse information from different modalities in subsequent stages, thereby improving the performance of emotion recognition.
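The filtering and downsampling in Step 1 can be sketched in NumPy. This is only an illustration, not the EEGLAB pipeline named in the text: it assumes a hypothetical raw sampling rate of 512 Hz, builds a Hamming-windowed sinc FIR band-pass (a symmetric kernel applied with centered convolution, which introduces no phase shift), and omits electrode interpolation and ICA. The function names and the tap count are assumptions.

```python
import numpy as np


def lowpass_kernel(cutoff_hz: float, fs: float, num_taps: int) -> np.ndarray:
    """Hamming-windowed sinc low-pass kernel, symmetric, with unit DC gain."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = np.sinc(2.0 * cutoff_hz / fs * n) * np.hamming(num_taps)
    return h / h.sum()


def bandpass_zero_phase(x, fs, low_hz=0.5, high_hz=50.0, num_taps=2049):
    """Band-pass one channel; the odd-length symmetric kernel adds no phase shift."""
    # Difference of two low-pass kernels: passes low_hz..high_hz, DC gain is zero.
    h = lowpass_kernel(high_hz, fs, num_taps) - lowpass_kernel(low_hz, fs, num_taps)
    return np.convolve(x, h, mode="same")


def preprocess_channel(x, fs_in=512, fs_out=256):
    """Filter to 0.5-50 Hz, then keep every (fs_in // fs_out)-th sample."""
    step = fs_in // fs_out
    # The band-pass already removes content above the new Nyquist frequency.
    return bandpass_zero_phase(x, fs_in)[::step]


fs = 512                                   # assumed raw sampling rate
t = np.arange(10 * fs) / fs                # 10 s of samples
raw = np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 100 * t) + 3.0
clean = preprocess_channel(raw, fs_in=fs)  # 10 Hz kept; 100 Hz and DC offset removed
print(clean.shape)                         # (2560,)
```

Samples within about one kernel half-length of either end are affected by zero padding, so a real pipeline would filter continuous recordings before cutting them into trials.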
[0053] In this embodiment, in step one, the three-dimensional transformation converts the signals recorded by multiple EEG electrodes into a topographic map format, in which the power of each electrode constitutes a specific point on the map. The invention extracts a rectangular power-distribution map for each specified frequency band. This makes the EEG data and the facial expression data consistent in form, so that both can be trained with the same network architecture. Taking 32 electrode channels as an example: first, the time-frequency decomposition tool of EEGLAB is used with Morlet wavelets (period set to 1, window size 256) to perform time-frequency analysis on the EEG data and plot event-related spectral perturbation (ERSP) images. For each EEG sample, this process generates one ERSP image per channel, so the 32 channels of a sample yield 32 ERSP images, which together show the power changes of the different channels across frequencies and time points. Second, to analyze the effect of visual stimulation on EEG activity, the moment of visual stimulation is marked as 0 ms on the time axis. Because fluctuations of event-related potentials (ERPs) usually occur within 1000 ms after stimulation, ERSP image data is extracted in 50 ms increments from 0 ms to 1000 ms, yielding a total of 20 frames. Each frame has a size of 32×18, representing the power distribution of the 32 EEG channels over 18 frequency bands (β band, from 14 Hz to 31 Hz) at a specific time. Furthermore, each 32×1 slice of a frame (the 32 channel values for one frequency band) can be plotted as a 32×32 square topographic map of that band using spline interpolation. These topographic maps show intuitively how power is distributed across the EEG channels at a given frequency band and time, giving an EEG representation that is easy to analyze and suited to the network model proposed in this invention, and providing a solid foundation for the subsequent emotion recognition task.
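To make the shapes concrete, the sketch below turns one 32×18 ERSP frame into eighteen 32×32 maps. It is a simplified stand-in rather than the described procedure: inverse-distance weighting replaces the spline interpolation named in the text so that the example needs only NumPy, and the electrode coordinates and power values are random placeholders, not a real montage or recording.

```python
import numpy as np


def idw_topomap(values, xy, size=32, power=2.0, eps=1e-9):
    """Interpolate per-electrode values onto a size x size grid (inverse-distance weighting)."""
    axis = np.linspace(-1.0, 1.0, size)
    gx, gy = np.meshgrid(axis, axis)
    grid = np.stack([gx.ravel(), gy.ravel()], axis=1)                 # (size*size, 2)
    dist = np.linalg.norm(grid[:, None, :] - xy[None, :, :], axis=2)  # (size*size, n_elec)
    weights = 1.0 / (dist + eps) ** power
    weights /= weights.sum(axis=1, keepdims=True)                     # each row sums to 1
    return (weights @ values).reshape(size, size)


rng = np.random.default_rng(0)
xy = rng.uniform(-0.9, 0.9, size=(32, 2))   # hypothetical 2-D electrode layout
frame = rng.normal(size=(32, 18))           # one ERSP frame: 32 channels x 18 frequency bins

# One map per frequency bin; stacking gives an image-like tensor for the encoder.
maps = np.stack([idw_topomap(frame[:, b], xy) for b in range(frame.shape[1])])
print(maps.shape)                           # (18, 32, 32)
```

Repeating this for all 20 frames of a sample would give a sequence of 20 tensors of shape (18, 32, 32), which matches the 18 input channels of the EEG patch-embedding convolution described later.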
[0054] For dynamic facial expression data, input videos of different lengths are processed uniformly: an interpolation and cropping operation fixes the number of frames in each video at 200, with the sampling interval synchronized with the EEG signal. Regardless of the length of the original video, this yields video segments of fixed length and frame count, providing standardized input for subsequent emotion recognition. The facial expression data is then batch processed: the RetinaFace algorithm detects the face region in each frame, and the detected regions are resized to a uniform 224×224 pixels to meet the input requirements of the deep learning model.
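The text does not detail the interpolation and cropping operation. One simple way to reach a fixed frame count, shown here purely as an assumption, is to pick evenly spaced frame indices, repeating frames for short clips and skipping frames for long ones. The function name `resample_indices` is hypothetical.

```python
def resample_indices(num_frames: int, target: int = 200) -> list:
    """Pick `target` frame indices spread evenly over a clip of `num_frames` frames."""
    if num_frames < 1 or target < 1:
        raise ValueError("num_frames and target must be positive")
    if target == 1:
        return [0]
    step = (num_frames - 1) / (target - 1)
    return [round(i * step) for i in range(target)]


short_clip = resample_indices(120)  # fewer than 200 frames: some indices repeat
long_clip = resample_indices(750)   # more than 200 frames: some frames are skipped
```

Both calls return exactly 200 indices that start at the first frame and end at the last, so every clip contributes the same number of frames to the later stages.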
[0055] Due to visual noise interference within pixels, directly applying the emotion recognition model of this invention to the original video frames may not be the best choice, as this would make it difficult for the model to obtain stable and robust facial expression feature representations. In this invention, a deep convolutional neural network ResNet-18 model is used to extract features from each frame of the image and generate corresponding feature maps. These feature maps not only remove visual noise in the original pixels, but also retain key information related to emotional expression.
[0056] In this embodiment, in step two, the spatial context encoder uses Swin Transformer as the backbone architecture to learn the spatial context information of EEG features and facial features, and ensures that high-resolution data information is not lost during processing, which is crucial for capturing subtle changes in the spatial distribution of the data. Spatial context features are extracted from the preprocessed EEG features or facial features to obtain a feature map that fuses spatial context information.
[0057] Considering the low resolution of the EEG feature maps, the input EEG feature maps are divided into non-overlapping patches of 2×2 pixels, while the input face feature maps are divided into non-overlapping patches of 4×4 pixels. Linear embedding converts the patches of the EEG and face feature maps into feature vectors. Specifically, a convolution Conv2d(18, 18, kernel_size = (2, 2), stride = (2, 2)) is applied to the EEG feature maps to obtain EEG feature vectors, followed by a reshape that yields feature maps with a window size of 4×4 and a shape of (16, 4, 4, 18); a convolution Conv2d(3, 96, kernel_size = (4, 4), stride = (4, 4)) is applied to the face feature maps to obtain face feature vectors, followed by a reshape that yields feature maps with a window size of 7×7 and a shape of (64, 7, 7, 96). The feature maps are then input into the Swin Transformer block of Stage 1 for window attention calculation. First, W-MSA (Window-based Multi-head Self-Attention) computes self-attention within each window of the feature map; then SW-MSA (Shifted-Window-based Multi-head Self-Attention) computes attention within the windows obtained after the window grid is shifted, in order to capture information between adjacent windows. Next, the PatchMerging module downsamples the output feature map of the previous Stage and expands the number of channels (the spatial size is halved while the number of channels is doubled) before passing it to the next Stage. Over N Stages, the contextual information of the feature map is extracted progressively, yielding EEG and facial feature maps that fuse spatial contextual information.
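The shapes quoted above follow from simple reshapes. The NumPy sketch below reproduces them under stated assumptions: a 32×32 EEG map with 2×2 patches gives a 16×16 token grid, and a 224×224 face input with 4×4 patches gives a 56×56 grid. The learned projections are replaced by zeros or random matrices, the attention computation itself is omitted, and the normalization that usually precedes the merging projection is left out. Function names are illustrative.

```python
import numpy as np


def window_partition(x: np.ndarray, ws: int) -> np.ndarray:
    """Split an (H, W, C) map into non-overlapping (ws, ws, C) windows."""
    h, w, c = x.shape
    x = x.reshape(h // ws, ws, w // ws, ws, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws, ws, c)


def cyclic_shift(x: np.ndarray, ws: int) -> np.ndarray:
    """Roll the map by half a window so new windows straddle old borders (SW-MSA)."""
    return np.roll(x, shift=(-(ws // 2), -(ws // 2)), axis=(0, 1))


def patch_merge_gather(x: np.ndarray) -> np.ndarray:
    """Gather each 2x2 neighbourhood into one position: (H, W, C) -> (H/2, W/2, 4C)."""
    return np.concatenate(
        [x[0::2, 0::2], x[1::2, 0::2], x[0::2, 1::2], x[1::2, 1::2]], axis=-1
    )


eeg_tokens = np.zeros((16, 16, 18))    # 32x32 map after 2x2 patch embedding
face_tokens = np.zeros((56, 56, 96))   # 224x224 input after 4x4 patch embedding

print(window_partition(eeg_tokens, 4).shape)    # (16, 4, 4, 18)
print(window_partition(face_tokens, 7).shape)   # (64, 7, 7, 96)

rng = np.random.default_rng(0)
proj = rng.normal(size=(4 * 96, 2 * 96))        # stand-in for the learned 4C -> 2C projection
merged = patch_merge_gather(face_tokens) @ proj
print(merged.shape)                             # (28, 28, 192)
```

W-MSA would attend within each partitioned window, and SW-MSA would apply the same partition to the output of `cyclic_shift`, which is how information crosses the original window borders.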
[0058] In this embodiment, in step three, the temporal context encoder is designed based on the Transformer structure to mitigate problems in the time dimension, such as face occlusion and abnormal EEG peaks, and to capture how EEG signals and facial expressions change over time. In a sequence of feature maps, swapping the positions of the features can significantly change the overall meaning and contextual relationships of the sequence, yet self-attention by itself is insensitive to the order of its inputs, so the model could not interpret such sequences correctly. To supply the positional information of the feature sequence, relative position encoding is added to the self-attention mechanism.
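The text does not say which form of relative position encoding is used. A common variant, shown here only as an assumption, adds a learned scalar bias to each attention score, indexed by the offset between the key and query positions. The sketch uses a single head, random values in place of learned parameters, and hypothetical names.

```python
import numpy as np


def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def relative_bias(seq_len: int, table: np.ndarray) -> np.ndarray:
    """Look up one bias per (query, key) pair from the offset j - i."""
    idx = np.arange(seq_len)
    offset = idx[None, :] - idx[:, None]      # values in [-(T-1), T-1]
    return table[offset + seq_len - 1]        # shift offsets to valid table indices


def attention_with_relative_bias(q, k, v, table):
    """Single-head scaled dot-product attention with an additive relative bias."""
    t, d = q.shape
    scores = q @ k.T / np.sqrt(d) + relative_bias(t, table)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


rng = np.random.default_rng(0)
T, D = 20, 8                                  # 20 time steps, hypothetical feature size
x = rng.normal(size=(T, D))
table = rng.normal(size=(2 * T - 1,))         # stand-in for learned bias parameters
out, attn = attention_with_relative_bias(x, x, x, table)
print(out.shape, attn.shape)                  # (20, 8) (20, 20)
```

Because the bias depends only on the offset between positions, the same temporal relationship is scored the same way wherever it occurs in the sequence.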
[0059] Next, layer normalization (LN) is performed. Pre-LN (LN applied before each sub-layer, inside the residual branch) trains more stably, whereas post-LN (LN applied after the residual addition) may risk divergence in the early stage of training and needs strategies such as learning-rate warm-up to ensure that training proceeds smoothly.
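The difference between the two placements comes down to where the normalization sits relative to the residual connection. A minimal NumPy sketch follows, with a random linear map standing in for the attention or feed-forward sub-layer and without the learned scale and shift of a full layer normalization.

```python
import numpy as np


def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize each row to zero mean and unit variance (no learned scale or shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def pre_ln_block(x, sublayer):
    """Pre-LN: normalize the sub-layer input; the residual path stays untouched."""
    return x + sublayer(layer_norm(x))


def post_ln_block(x, sublayer):
    """Post-LN: add the residual first, then normalize the sum."""
    return layer_norm(x + sublayer(x))


rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8)) * 0.1


def sublayer(h):
    """Stand-in for attention or the feed-forward network."""
    return h @ w


x = rng.normal(size=(20, 8))
y_pre = pre_ln_block(x, sublayer)
y_post = post_ln_block(x, sublayer)
```

In the pre-LN form the input reaches the output through an unnormalized identity path, which is the usual explanation for its more stable early training.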
[0060] To prevent multiple heads from learning the same attention, regularization methods are added to the multi-head attention mechanism to diversify the attention of multiple heads in the same layer; on the other hand, dropout technology is used to randomly suppress the output of some heads, which helps to prevent the formation of overly similar attention patterns between heads.
[0061] The Position-wise FFN (Feed-Forward Network) consists of two fully connected layers that operate on the last dimension of the input sequence. Because the state at each position in the sequence is updated individually, it is called position-wise. FFN accepts the output of self-attention as input, further extracting and transforming features to enhance the model's expressive power. Regarding the specific implementation and mathematical details of the position-wise FFN, it is usually expressed as FFN(x) = max(0, xW1+b1)W2+b2, where W1 and W2 are weight matrices, and b1 and b2 are bias terms; here, max(0, x) represents the ReLU activation function, used to introduce non-linearity.
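The formula above maps directly to a few lines of NumPy. The sizes and the random weights below are placeholders chosen for illustration.

```python
import numpy as np


def position_wise_ffn(x, w1, b1, w2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every position independently."""
    hidden = np.maximum(0.0, x @ w1 + b1)   # first layer followed by ReLU
    return hidden @ w2 + b2                 # second layer maps back to the model width


rng = np.random.default_rng(0)
d_model, d_hidden, seq_len = 8, 32, 20      # hypothetical sizes
w1 = rng.normal(size=(d_model, d_hidden))
b1 = np.zeros(d_hidden)
w2 = rng.normal(size=(d_hidden, d_model))
b2 = np.zeros(d_model)

x = rng.normal(size=(seq_len, d_model))
y = position_wise_ffn(x, w1, b1, w2, b2)
print(y.shape)  # (20, 8)
```

Feeding a single row of `x` through the function gives the same result as the corresponding row of `y`, which is what "position-wise" means here.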
[0062] By extracting temporal contextual features from EEG feature maps or facial feature maps that integrate spatial contextual information, a feature map integrating spatiotemporal contextual information can be obtained.
[0063] In this embodiment, when exploring multimodal emotion recognition, the present invention observed that the dominance of different emotions varies across different modalities. Some emotions may be primarily dominated by electroencephalogram (EEG) signals, while others may be more represented by facial expressions. To adapt to the backbone network of the present invention and fully utilize the information from these two modalities, a multimodal attention mechanism is designed. This mechanism, based on the calculated attention weights (i.e., in step four, the multimodal fusion module based on the modal attention mechanism performs weighted fusion of the encoded features of EEG and facial expressions), allows for dynamic adjustment of the weights according to the contribution of each modality to a specific emotion. For emotions dominated by EEG, higher weights are given to EEG features; while for emotions dominated by facial expressions, greater weights are given to facial expression features.
[0064] In this embodiment, the calculation method of the multimodal fusion module based on the modal attention mechanism is as follows:
[0065] First, the module takes as input the feature X ∈ R^{pre} output by the EEG or face encoder of the backbone network, where pre denotes the original dimension of X, and processes it further to obtain multimodal feature representations. A composite operation produces a feature set Xn ∈ R^{pre×n}, where n is the number of modalities; in this task n is set to 2, since bimodal data is processed. The composite operation applies the dimension-alignment convolution AlignCov (which aligns high-dimensional features to low-dimensional features), batch normalization (BN, written B in Equation (1)), and the rectified linear unit (ReLU). It aligns the dimensions of the feature maps of different modalities and improves the stability of training convergence.
[0066] Xn=ReLU(B(AlignCov(X))) (1)
[0067] Subsequently, to compute the modal attention tensor, a fully connected operator Fc is introduced. The fully connected layer learns the complex relationships between features of different modalities and outputs a modal attention weight vector. The softmax operator is applied to normalize the weight vector to obtain the attention score for each modality. These scores reflect the importance of different modalities in the current task. The computation process of the modal attention tensor can be expressed as:
[0068] Sn = softmax(Fc(Xn)), Sn ∈ R^{pre×n} (2)
[0069] Finally, the modal attention tensor Sn is multiplied element-wise (⊙) with the feature set Xn to obtain the modal attention-weighted feature Xout. This operation weights and combines the features of the different modalities according to their attention scores, thereby fusing the multimodal information. Mathematically, this is expressed as:
[0070] Xout=Sn⊙Xn (4)
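The attention and weighting steps, Sn = softmax(Fc(Xn)) followed by Xout = Sn⊙Xn, can be sketched as below. The sketch assumes Xn has already been produced by Equation (1) and is stored as pre rows with one column per modality. The text does not state the shape of Fc or the axis of the softmax, so two choices here are assumptions: Fc is modelled as a small linear layer over the modality axis with hypothetical parameters `W` and `b`, and the softmax is taken across modalities so that the scores in each row sum to 1.

```python
import math

def softmax(v):
    """Numerically stable softmax over a list of scores."""
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def modal_attention_fusion(Xn, W, b):
    """Xn: pre x n list of aligned features (one column per modality).
    W (n x n) and b (n): hypothetical parameters standing in for Fc.
    Returns (Sn, Xout), both pre x n."""
    n = len(b)
    Sn, Xout = [], []
    for row in Xn:
        # Fc: linear map over the modality axis (assumed form)
        logits = [sum(row[i] * W[i][j] for i in range(n)) + b[j]
                  for j in range(n)]
        scores = softmax(logits)          # Sn = softmax(Fc(Xn))
        Sn.append(scores)
        Xout.append([s * x for s, x in zip(scores, row)])  # Xout = Sn ⊙ Xn
    return Sn, Xout

# Toy bimodal example (n = 2): column 0 stands for EEG, column 1 for face.
Xn = [[0.8, 0.1],
      [0.2, 0.9],
      [0.5, 0.5]]
W = [[1.0, 0.0],
     [0.0, 1.0]]
b = [0.0, 0.0]
Sn, Xout = modal_attention_fusion(Xn, W, b)
```

With these toy values, the first row gives the EEG column the larger score and the second row favours the face column, which mirrors the dynamic weighting described above.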
[0071] To learn cross-modal associations and perform classification, Xout is input into a Multilayer Perceptron (MLP). The constructed MLP is a three-layer stacked fully connected network, utilizing its non-linear fitting capability to further extract and integrate deep features from multimodal information. Through the layer-by-layer transmission and processing of the MLP, potential associations between different modalities can be captured, and emotion classification tasks can be performed based on these features. Finally, a softmax layer is applied to obtain the normalized output probability of each emotion category. The role of the softmax layer is to transform the raw output score of the MLP into a probability distribution, ensuring that the probability value of each emotion category is between 0 and 1, and that the sum of the probabilities of all categories is strictly equal to 1. The probability distribution form helps in subsequent loss calculation and interpretation of emotion recognition results. Subsequently, the cross-entropy loss function is used as the optimization objective during training. The cross-entropy loss function is an effective tool for measuring the difference between the probability distribution predicted by the model and the actual emotion label distribution. Its calculation formula is based on the probability distribution output by the softmax layer and the actual emotion label distribution, and the difference is quantified by calculating the negative log-likelihood value between the two. During training, the model continuously adjusts its internal parameters to minimize the cross-entropy loss function, so that the predicted probability distribution gradually approaches the actual emotion label distribution, thereby improving the accuracy and performance of emotion recognition.
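The final softmax and cross-entropy step can be illustrated with a short sketch. The logits below are hypothetical raw MLP scores for three emotion classes, chosen only for illustration; the three-layer MLP itself is omitted.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that lie in (0, 1) and sum to 1."""
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Negative log-likelihood of the true class index under the prediction."""
    return -math.log(probs[label])

logits = [2.0, 0.5, -1.0]                # hypothetical MLP outputs
probs = softmax(logits)

loss_if_class0 = cross_entropy(probs, 0)  # true label matches the top score
loss_if_class2 = cross_entropy(probs, 2)  # true label is the least likely class
```

The loss is small when the true label receives high probability and grows as that probability shrinks, which is why minimizing it moves the predicted distribution toward the label distribution.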
[0072] This invention also discloses a multimodal emotion recognition system, which is implemented by combining B / S (browser / server) and C / S (client / server) architectures. The system is divided into three core components: front-end, back-end, and database, in order to improve the maintainability and scalability of the system.
[0073] The front-end component, serving as the user interaction portal, uses C++ combined with the Qt framework to build the client interface. It also uses the Vue.js framework to implement the back-end administrator's data management interface, providing rich operation interfaces and intuitive display of emotion recognition results for both user types. This component is responsible for receiving user operation commands and sending task requests to the back-end via API interfaces. It is also responsible for receiving the task execution results returned by the back-end and displaying them to the user in a visual manner.
[0074] The backend component, as the core of the business logic, uses the Python programming language and leverages the Django REST framework to build a RESTful API, enabling data interaction between the frontend and backend. The Jinja2 template engine is used for page rendering to improve user experience. The backend component receives task requests sent by the frontend, executes corresponding system management or emotion recognition logic, and calls the database component for data storage and retrieval.
[0075] Database components include MySQL, Redis, and SQLite. MySQL, as a relational database management system, is responsible for storing structured data, such as user information and network models; Redis, as an in-memory database, is used to cache hot data and improve system response speed; SQLite is used for encrypted storage and management of local data on the client side. The database components provide data access interfaces to support data operation requests from the backend component.
[0076] This embodiment also includes the following:
[0077] At the algorithm level, the MATLAB EEGLAB toolbox and the MNE-Python library are used to preprocess EEG signals and extract features, and the Pillow library is used to preprocess facial expression images and extract key information.
[0078] Deep learning models are built and managed with the PyTorch and OpenMMLab deep learning frameworks.
[0079] For service deployment, Nginx is used as the reverse proxy server and uWSGI is used as the web server.
[0080] Although embodiments of the invention have been shown and described (see the detailed description above), it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.
Claims
1. A multi-modal emotion recognition method, characterized in that it comprises the following steps: Step 1: Data preprocessing and feature extraction. For emotional EEG, the EEG data is processed using the EEGLAB tool. First, a zero-phase-shift FIR filter is used to bandpass filter the EEG data in the 0.5-50 Hz band to eliminate 50 Hz line-noise interference. Spherical interpolation is used to interpolate the positions of the removed defective electrodes, and ICA processing is performed to eliminate non-brain-related artifacts to ensure the accuracy of subsequent analysis. The EEG data is downsampled to a sampling frequency of 256 Hz to speed up the calculation. After this processing, a three-dimensional transformation of the multi-channel EEG is performed. Step 2: A spatial context encoder is used to extract spatial context features from the preprocessed emotional EEG features and facial expression features. Step 3: Temporal context features are further extracted using a temporal context encoder. Step 4: A multimodal fusion module based on a modal attention mechanism is used to fuse the spatiotemporal feature representations of the different modalities, and finally, multimodal feature emotion classification is performed. In Step 4, the multimodal fusion module based on the modal attention mechanism performs attention-weighted fusion of the EEG and facial expression encoding features, allowing the weights to be dynamically adjusted according to the contribution of each modality to a specific emotion. For emotions dominated by EEG, EEG features are given higher weights; for emotions dominated by facial expressions, facial expression features are given greater weights.
The calculation method of the multimodal fusion module based on the modal attention mechanism is as follows: First, given the EEG or face encoder output feature X ∈ R^{pre} from the backbone network, where pre represents the original dimension of the feature X, this feature X is input to the module for further processing to obtain a multimodal feature representation; the following composite operation is used to obtain a feature group Xn ∈ R^{pre×n}, where n represents the number of modalities; in the task, n is set to 2, indicating that bimodal data is processed; the composite operation includes applying a dimension-alignment convolution operation AlignCov, a rectified linear unit (ReLU), and batch normalization, which helps to align the dimensions of the feature maps of different modalities and improves the stability of model training convergence;
Xn=ReLU(B(AlignCov(X))) (1)
Subsequently, to compute the modal attention tensor, a fully connected operator Fc is introduced; the softmax operator is applied to normalize the weight vector to obtain the attention score for each modality; these scores reflect the importance of different modalities in the current task; the computation process of the modal attention tensor can be expressed as:
Sn = softmax(Fc(Xn)), Sn ∈ R^{pre×n} (2)
Finally, the modal attention-weighted feature Xout is obtained by element-wise multiplication of the modal attention tensor and the original feature group Xn; the mathematical expression is as follows:
Xout=Sn⊙Xn (4)
In order to learn cross-modal correlation information and perform classification, Xout is input into a Multilayer Perceptron (MLP); the constructed MLP is a three-layer stacked fully connected network, which utilizes its nonlinear fitting capability to further extract and integrate deep features from multimodal information; through the layer-by-layer transmission and processing of the MLP, the potential correlation between different modalities can be captured, and emotion classification tasks can be performed based on these features; finally, the output is processed by a softmax layer to obtain the normalized output probability of each emotion category; the role of the softmax layer is to convert the original output score of the MLP into a probability distribution, ensuring that the probability value of each emotion category is between 0 and 1, and that the sum of the probabilities of all categories is strictly equal to 1; subsequently, the cross-entropy loss function is used as the optimization objective during the training process. Its calculation formula is based on the probability distribution output by the softmax layer and the true emotion label distribution. The difference is quantified by calculating the negative log-likelihood between the two. During the training process, the model continuously adjusts its internal parameters to minimize the cross-entropy loss function, so that the predicted probability distribution gradually approaches the true emotion label distribution, thereby improving the accuracy and performance of emotion recognition.
2. The multimodal emotion recognition method according to claim 1, characterized in that: In step one, the three-dimensional transformation process of EEG signal data involves converting the signals recorded by multiple EEG electrodes into an electrode power distribution map, where the power of each electrode constitutes a pixel at a specific location on the feature map.
3. The multimodal emotion recognition method according to claim 1, characterized in that: In step two, the spatial context encoder uses the Swin Transformer as its backbone architecture to learn the spatial context information of EEG features and facial features. It ensures that high-resolution data information is not lost during processing and extracts spatial context features from the preprocessed EEG features or facial features to obtain a feature map that integrates spatial context information.
4. The multimodal emotion recognition method according to claim 1, characterized in that: In step three, the temporal context encoder is designed based on the Transformer structure to alleviate the problems of face occlusion and abnormal EEG peaks in the time dimension and to capture the dynamic features of EEG signals and facial expressions changing over time.
5. A multimodal emotion recognition system, comprising the multimodal emotion recognition method according to any one of claims 1 to 4, characterized in that: This system combines B / S and C / S architectures and is divided into three core components: front-end, back-end, and database. The front-end component, serving as the user interaction portal, uses C++ combined with the Qt framework to build the client interface. It also uses the Vue.js framework to implement the back-end administrator's data management interface, providing rich operation interfaces and intuitive display of emotion recognition results for both user types. This component is responsible for receiving user operation commands and sending task requests to the back-end via API interfaces. It is also responsible for receiving the task execution results returned by the back-end and displaying them to the user in a visual manner. The backend component, as the core of the business logic, uses the Python programming language and leverages the Django REST framework to build a RESTful API, enabling data interaction between the frontend and backend. The Jinja2 template engine is used for page rendering to improve user experience. The backend component receives task requests sent by the frontend, executes corresponding system management or emotion recognition logic, and calls the database component for data storage and retrieval. Database components include MySQL, Redis, and SQLite; MySQL, as a relational database management system, is responsible for storing structured data; Redis, as an in-memory database, is used to cache hot data and improve system response speed; SQLite is used for encrypted storage and management of local data on the client side; the database components provide data access interfaces to support data operation requests from the backend component.
6. A multimodal emotion recognition system according to claim 5, characterized in that it further includes the following: at the algorithm level, the MATLAB EEGLAB toolbox and the MNE-Python library are used to preprocess EEG signals and extract features, and the Pillow library is used to preprocess facial expression images and extract key information; deep learning models are built and managed with the PyTorch and OpenMMLab deep learning frameworks; for service deployment, Nginx is used as the reverse proxy server and uWSGI is used as the web server.
Citation Information
Patent Citations
Multi-modal emotion recognition method and device, electronic equipment and storage medium
CN115359576A
Multi-modal emotion recognition method, system and device and medium
CN117409396A
Multi-modal emotion recognition method and system based on wearable device
CN117520826A
Multi-modal emotion recognition method and system based on confidence fusion
CN117591967A