A multimodal emotion recognition method based on feature decoupling and missing modality completion
By employing feature decoupling and missing modality completion methods, the problem of missing modality data in multimodal emotion recognition is solved, achieving efficient emotion recognition even with missing modalities and improving recognition accuracy and stability.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- GUANGXI NORMAL UNIV
- Filing Date
- 2026-03-31
- Publication Date
- 2026-06-19
Figure CN122241473A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to computer intelligent recognition technology, and more particularly to data missing completion and multimodal emotion recognition technology, specifically a multimodal emotion recognition method based on feature decoupling and missing modality completion.
Background Technology
[0002] With the development of artificial intelligence and affective computing technologies, multimodal emotion recognition (MER) has gradually become an important research direction in fields such as human-computer interaction, intelligent education, and psychological state analysis. MER integrates data from different information sources, such as visual signals, speech signals, physiological signals, and text information, to characterize human emotional states from multiple perspectives, thereby obtaining more accurate and stable emotion recognition results than single-modal approaches.
[0003] In recent years, with the rapid development of online education and distance learning models, learning through video has become an important learning method. During video learning, the learner's emotional state (such as interest, confusion, boredom, or pleasure) directly affects learning outcomes. Therefore, automatically identifying learners' emotional states while watching instructional videos is crucial for building intelligent learning systems with emotional perception capabilities. By analyzing learners' emotional changes in real time, feedback information can be provided to intelligent teaching systems, thereby enabling personalized teaching and optimization of the learning process.
[0004] In video learning scenarios, learners' physiological signals (such as eye movement signals and photoplethysmography signals) are widely used in emotion recognition research due to their objectivity and resistance to spoofing. Meanwhile, the semantic information contained in the instructional video itself provides contextual background, and combining it with learners' physiological signals can more comprehensively reflect learners' emotional changes. Therefore, fusing video semantic information with learners' physiological signals has become an important direction in current research on emotion recognition in video learning.
[0005] However, in practical applications, multimodal emotion recognition systems often face the problem of missing modal data. For example, due to sensor malfunctions, unstable device wear, or interference during data acquisition, learners' physiological signal data (such as eye movement signals or photoplethysmography signals) may be partially missing. When some modal data is missing, traditional multimodal emotion recognition models often struggle to maintain stable recognition performance, thus affecting the system's reliability in real-world application scenarios.
[0006] To address the problem of missing modalities, some studies have attempted to recover them through modality completion or generation methods, such as using autoencoders or generative models to predict missing modal features. However, these methods typically reconstruct the original features directly and lack in-depth modeling of the information structure of different modalities, which can easily lead to significant discrepancies between the completed features and the true features. Furthermore, in multimodal data, different modalities often contain both shared and modality-specific information. If this information is not effectively distinguished, it can easily lead to information redundancy or feature confusion, thereby affecting the performance of subsequent modality completion and emotion recognition.
[0007] On the other hand, in the process of multimodal emotion recognition, there are usually certain semantic differences between features of different modalities. If the consistency information between modalities is not effectively constrained, it is difficult to map different modal features to a unified semantic space, which may lead to the model being unable to fully utilize the information of each modality during the fusion process. In addition, in the case of modality missing, how to reasonably fill in the missing modal information while ensuring the expressive power of modal features is also an important technical problem that needs to be solved in the current field of multimodal emotion recognition.
[0008] Therefore, in video learning scenarios, how to effectively extract shared and modality-specific information between different modalities when modal data is missing, and how to restore the missing modal information through a reasonable feature completion mechanism, so as to achieve accurate identification of learners' emotional states, has become a key problem that urgently needs to be solved in multimodal emotion recognition technology.
Summary of the Invention
[0009] The purpose of this invention is to address the shortcomings of existing technologies by providing a multimodal emotion recognition method based on feature decoupling and missing modality completion. This method can improve the accuracy and robustness of emotion recognition even under conditions of missing data.
[0010] The technical solution for achieving the objective of this invention is a multimodal emotion recognition method based on feature decoupling and missing modality completion, comprising the following steps:
1) Data acquisition: Eye-tracking and PPG signals generated by learners while watching instructional videos are collected simultaneously, and semantic descriptions of the instructional video segments are extracted, including:
1-1) Eye movement signal acquisition: A flat-panel eye tracker is used to collect the fixation, saccade, pupil size, and eye movement trajectory data generated by the learner during the learning process;
1-2) PPG signal acquisition: Wearable ear-clip sensors are used to collect the PPG signals generated by the learner during the learning process;
1-3) Video semantic extraction: Text descriptions of the video are generated to capture its rich semantic information. Each text description contains detailed information describing the scenes, objects, actions, and plot of the video, as well as the emotions and emotional background of the video content. The video semantic descriptions are generated automatically by fine-tuning the mPLUG-owl model released by Alibaba DAMO Academy;
2) Data preprocessing: The acquired eye-tracking data, PPG data, and video semantic information are preprocessed, including:
2-1) Eye-tracking data processing: Missing values in the eye-tracking data are filled by linear interpolation, defined as: $y = y_0 + \frac{(x - x_0)(y_1 - y_0)}{x_1 - x_0}$, where $x_0$ and $x_1$ are the time points of the frames adjacent to the missing value, $y_0$ and $y_1$ are the corresponding eye movement data values, $x$ is the time point of the missing eye movement data, and $y$ is the missing eye movement data value. Baseline correction is then performed on the filled eye movement data;
2-2) PPG signal processing: The PPG signal is first denoised with a filter and then baseline corrected;
2-3) Video semantic information processing: The semantic description text of the instructional video clips is processed. The text is first cleaned by removing special symbols, punctuation, and extra spaces; stop words and low-frequency words are then removed. Stop words are words that appear frequently in the text but carry no actual semantic meaning, while low-frequency words are words that appear rarely;
3) Feature extraction: Features are extracted from the preprocessed data, including:
3-1) Standard feature extraction of eye movement signals: Temporal features such as pupil diameter, fixation time, and saccade duration are extracted from the eye movement data. The Pearson correlation coefficient between each feature and the emotional state is then calculated, and its p-value is used to judge the significance of the correlation, that is, the association between the feature and the emotional state. When the p-value is less than 0.05, the correlation coefficient is considered significant, indicating a significant linear relationship between the two groups of samples; when the p-value is less than 0.01, the correlation coefficient is considered highly significant. Finally, 24 eye movement features that are significantly correlated with the emotional state (p-value less than 0.05) are selected: fixation count; saccade count; the maximum of fixation speed, fixation time, left pupil diameter, right pupil diameter, and average pupil diameter; the minimum of fixation speed, saccade speed, left pupil diameter, right pupil diameter, and average pupil diameter; the mean of fixation speed, saccade speed, fixation time, left pupil diameter, right pupil diameter, and average pupil diameter; and the standard deviation and variance of left pupil diameter, right pupil diameter, and average pupil diameter. The Pearson correlation coefficient is defined as: $r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$, where $n$ is the number of data points, $x_i$ and $y_i$ are the values of the two variables at the $i$-th data point, and $\bar{x}$ and $\bar{y}$ are the means of the two variables. The p-value is obtained from the Pearson correlation coefficient $r$ through a t-test. For a sample size of $n$: $t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$, and the p-value corresponding to the t-statistic is looked up in the t-distribution table according to the degrees of freedom $n - 2$;
3-2) PPG signal standard feature extraction: Fine-grained feature extraction is performed on the PPG data, covering time-domain, frequency-domain, and nonlinear features. The Pearson correlation coefficient between the PPG features and the emotional state is calculated, and 29 PPG features that are significantly correlated with the emotional state are selected based on the p-value: HR, IBI, maximum, minimum, mean, median of R peak, and standard deviation of HR, SDSD, NNI_20, PNNI_20, RMSSD, range_NNI, CVSD, CVNNI, lf, hf, vlf, lf_hf_ratio, lfnu, hfnu, total_power, triangular_index, and sd1;
3-3) Video semantic feature extraction: Meaningful features are extracted from the processed semantic description text to better understand and express the semantic information of the video content. A pre-trained BERT model performs deep semantic analysis on the cleaned video semantic description text to obtain semantic feature vectors containing the important information describing the video content. The PCA dimensionality reduction algorithm is then used to reduce the dimension of the features and remove redundant information, finally yielding a 25-dimensional semantic feature vector;
4) Feature decoupling: The complete eye-tracking features, PPG features, and video semantic features are each passed through a dual-branch encoder consisting of two parallel CNN layers and two parallel LSTM layers, which performs spatiotemporal feature extraction to obtain the consistency features and the unique features of each modality, including:
4-1) Contrastive learning: The extracted modality consistency features are projected onto the same subspace, and positive and negative pairs are obtained through "sample-level pairing" for contrastive learning. Positive pairs consist of two different features from the same sample, while negative pairs consist of any two features from different samples. The contrastive learning loss [formula not preserved] is computed with X as the anchor over its positive pair and its negative pairs, where N is the number of samples and the similarity measure is the cosine similarity between X and Y: $\mathrm{sim}(X, Y) = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert}$. The order of X and Y is then reversed, that is, the loss is calculated again with Y as the anchor, and the final contrastive learning loss is obtained from both directions [formula not preserved];
4-2) Orthogonal constraint: The modality consistency features and the unique features extracted by the encoder of each modality are constrained to be orthogonal. The orthogonal loss of each modality [formula not preserved] is computed using the Frobenius norm of a matrix, and the total orthogonal loss is taken over all modalities [formula not preserved];
4-3) Reconstruction constraint: The decoupled consistency features and unique features of each modality are concatenated and input into a decoder to obtain the reconstructed features. The reconstruction loss of each modality is the root mean square error between the original features and the reconstructed features, and the total reconstruction loss is taken over all modalities [formulas not preserved];
4-4) Classification constraint: The consistency features and unique features of all modalities are concatenated to obtain the fused features, which are fed into a classifier consisting of fully connected layers and the Softmax activation function. The classification loss is the cross-entropy between the classifier output and the true emotion label [formulas not preserved];
4-5) Joint training: The total loss of joint training in the feature decoupling stage combines the above losses, weighted by hyperparameters [formula not preserved];
5) Missing data completion and emotion recognition: A dual-branch encoder with the same structure as in the feature decoupling stage is used, initialized with the network parameters pre-trained in the feature decoupling stage. Missing data is used for both training and testing. Three missing conditions are considered: (a) video semantic features and PPG features are present, but eye-tracking features are missing; (b) video semantic features and eye-tracking features are present, but PPG features are missing; (c) video semantic features are present, but both eye-tracking features and PPG features are missing. Missing features are uniformly filled with zeros. The missing data is input into the dual-branch encoder to extract the modality consistency features and unique features of the missing data, which are then concatenated. In parallel, the complete data is input into the pre-trained encoder to extract the complete modality consistency features and unique features, which are concatenated, and the mirror-image missing features are input into the pre-trained encoder to extract their modality consistency features and modality-specific features, which are likewise concatenated, including:
5-1) Modality completion and classification: The concatenated features extracted from the missing data are input into an imagination module composed of five autoencoders (AEs) cascaded with residual connections to obtain the completed features. The completed features are concatenated and fed into the classifier, which performs emotion classification and yields the predicted emotion category, from which the classification loss is calculated [formulas not preserved]. The classifier consists of two fully connected layers and a Softmax function;
5-2) Joint optimization: Two further loss functions, the missing loss and the imagination loss, are designed to optimize the model [formulas not preserved]. The total loss of this stage combines the classification loss with the missing loss and the imagination loss, weighted by two hyperparameters [formula not preserved];
6) Testing and evaluation: The multimodal emotion recognition system is comprehensively tested and evaluated to validate the model, with weighted accuracy (WA) and unweighted accuracy (UA) used to measure its performance.
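The Pearson-based feature screening of steps 3-1) and 3-2) can be sketched in plain Python as follows. This is a minimal illustration rather than the applicant's implementation: the function names are hypothetical, and instead of computing an exact p-value the sketch compares |t| with a critical value that the caller reads from a t-distribution table for n - 2 degrees of freedom, which mirrors the table lookup described in the text.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need two sequences of equal length, at least 3 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r: float, n: int) -> float:
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    if abs(r) >= 1.0:
        return math.copysign(math.inf, r)  # perfect correlation
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


def is_significant(xs: Sequence[float], ys: Sequence[float],
                   t_critical: float) -> bool:
    """Keep a feature when |t| exceeds the table value for the chosen level.

    t_critical is an assumption of this sketch: the two-tailed critical
    value looked up by the caller for len(xs) - 2 degrees of freedom.
    """
    r = pearson_r(xs, ys)
    return abs(t_statistic(r, len(xs))) > t_critical


if __name__ == "__main__":
    feature = [1.0, 2.0, 3.0, 4.0, 5.0]   # toy feature values
    emotion = [1.0, 3.0, 2.0, 5.0, 4.0]   # toy emotion scores
    # 3.182 is the two-tailed 0.05 critical value for 3 degrees of freedom.
    print(pearson_r(feature, emotion), is_significant(feature, emotion, 3.182))
```

With only five toy points, a correlation of 0.8 does not clear the 0.05 threshold, which illustrates why the significance test is applied in addition to the coefficient itself.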
[0011] Building upon current technology, this technical solution aims to more effectively and accurately identify the cognitive and emotional states of online learners under modality-deficient conditions. It combines feature decoupling and missing modality completion networks. While learners watch instructional videos, eye trackers and clip-on sensors collect their eye movements and photoplethysmography (PPG) signals, respectively. Feature extraction is then performed to obtain physiological signal features. Furthermore, semantic information is extracted from instructional video clips to obtain video semantic features. Under full-modality data conditions, a feature decoupling strategy is used to train a dual-branch encoder to extract modality consistency and unique features. Then, under modality-deficient data conditions, a pre-trained dual-branch encoder, combined with missing loss and imagination loss, is used to train a missing modality imagination network. This allows the missing modality network to identify four emotional states—interest, confusion, boredom, and happiness—under data-deficient conditions.
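The three modality-missing conditions used in the second training stage, together with the uniform zero filling of missing features, can be sketched as follows. The feature sizes (24 eye-tracking, 29 PPG, 25 semantic) come from the text; the condition names, the dictionary layout, and the function name are assumptions made only for illustration.

```python
from typing import Dict, List, Tuple

# Feature sizes stated in the text.
FEATURE_DIMS: Dict[str, int] = {"eye": 24, "ppg": 29, "semantic": 25}

# Hypothetical names for the three conditions; video semantics are always present.
MISSING_CONDITIONS: Dict[str, Tuple[str, ...]] = {
    "eye_missing": ("eye",),
    "ppg_missing": ("ppg",),
    "eye_and_ppg_missing": ("eye", "ppg"),
}


def apply_missing_condition(sample: Dict[str, List[float]],
                            condition: str) -> Dict[str, List[float]]:
    """Return a copy of the sample with the missing modalities zero-filled."""
    missing = MISSING_CONDITIONS[condition]
    masked: Dict[str, List[float]] = {}
    for name, vector in sample.items():
        if len(vector) != FEATURE_DIMS[name]:
            raise ValueError(f"unexpected size for modality {name!r}")
        # Missing modalities become zero vectors of the same length.
        masked[name] = [0.0] * len(vector) if name in missing else list(vector)
    return masked


if __name__ == "__main__":
    sample = {"eye": [1.0] * 24, "ppg": [2.0] * 29, "semantic": [3.0] * 25}
    masked = apply_missing_condition(sample, "eye_missing")
    print(sum(masked["eye"]), sum(masked["ppg"]))
```

Returning a copy keeps the complete sample available, which matters here because the complete data is also passed through the pre-trained encoder alongside the zero-filled version.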
[0012] This technical solution has the following advantages:
1. Improve emotion recognition performance under modality missing conditions: This technical solution designs a modality feature decoupling and modality imagination mechanism, which can effectively supplement missing modality information when some modality data is missing, thereby reducing the impact of modality missing on emotion recognition performance and improving the robustness of the model in practical application scenarios;
2. Effective separation of modal shared information and modality-specific information: This technical solution introduces a feature decoupling structure to decompose different modal features into shared features and modality-specific features, and reduces redundancy and interference between features through a constraint mechanism, enabling the model to more fully mine complementary information in multimodal data and improve feature representation ability;
3. Enhance the consistent representation capability among multimodal features: This technical solution applies consistency constraints to shared features from different modalities, enabling information from different modalities to be mapped to a unified semantic space, thereby improving the effect of multimodal information fusion and enhancing the accuracy of emotion recognition;
4. Enhance the model's ability to represent complex emotional information: This technical solution integrates learner physiological signals and video semantic information, enabling the model to simultaneously utilize learner physiological responses and video content contextual information to characterize learner emotional states from multiple perspectives and improve the reliability of emotion recognition;
5. Improve system stability in real-world application environments: This technical solution promotes feature learning, modality completion, and emotion classification tasks through multi-task joint optimization during model training, thereby enhancing the overall system's stability and practicality in real-world application environments.
[0013] This method first uses a feature decoupling network jointly trained by contrastive learning, orthogonal constraints, reconstruction, and classification constraints to obtain a pre-trained feature decoupling encoder under complete data conditions. This allows the feature decoupling encoder to decouple modal features into consistent features and unique features. Then, the pre-trained feature decoupling encoder is loaded into a missing modality completion network. With the help of missing data and imagination loss, the two imagination modules of the missing modality completion network are trained to complete the missing data. This ultimately improves the accuracy and robustness of emotion recognition under data missing conditions.
Attached Figure Description
[0014] Figure 1 is a flowchart of the method of the embodiment; Figure 2 is a schematic diagram of the data flow in the method of the embodiment; Figure 3 is a schematic diagram of the process of video semantic generation and semantic feature extraction in the embodiment; Figure 4 is a network structure diagram of the dual-branch encoder in the embodiment; Figure 5 is a schematic diagram of the structure of the imagination module consisting of cascaded residual autoencoders in the embodiment.
Detailed Implementation
[0015] The present invention will be further described below with reference to the accompanying drawings and embodiments, but this is not intended to limit the scope of the invention.
[0016] Embodiment: Referring to Figure 1 and Figure 2, a multimodal emotion recognition method based on feature decoupling and missing modality completion includes the following steps:
1) Data acquisition: Instructional videos are played on a computer monitor, the eye-tracking and PPG signals generated by learners during viewing are collected, and semantic descriptions of the video segments are extracted, including:
1-1) Eye movement signal acquisition: A flat-panel eye tracker is used to collect the fixation, saccade, pupil size, and eye movement trajectory data generated by the learner during the learning process;
1-2) PPG signal acquisition: In this example, wearable ear-clip sensors are used to collect the PPG signals generated by the learner during the learning process;
1-3) Video semantic extraction: Text descriptions of the video are generated to capture its rich semantic information. Each text description contains detailed information describing the scenes, objects, actions, and plot of the video, as well as the emotions and emotional background of the video content. The video semantic descriptions are generated automatically by fine-tuning the mPLUG-owl model released by Alibaba DAMO Academy;
2) Data preprocessing: The acquired eye-tracking data, PPG data, and video semantic information are preprocessed, including:
2-1) Eye-tracking data processing: Missing values in the eye-tracking data are filled by linear interpolation, defined as: $y = y_0 + \frac{(x - x_0)(y_1 - y_0)}{x_1 - x_0}$, where $x_0$ and $x_1$ are the time points of the frames adjacent to the missing value, $y_0$ and $y_1$ are the corresponding eye movement data values, $x$ is the time point of the missing eye movement data, and $y$ is the missing eye movement data value. Baseline correction is then performed on the filled eye movement data to eliminate differences between subjects and ensure data consistency;
2-2) PPG signal processing: The raw PPG signal may be affected by a variety of interference factors, including motion, changes in illumination, noise, and electromagnetic interference, which may cause artifacts. A filter is first used to denoise the signal, reducing the influence of high-frequency noise and improving the quality of the PPG signal. Baseline correction is then performed on the PPG signal to eliminate differences between subjects and ensure data consistency. These steps help improve the accuracy and reliability of the PPG signal;
2-3) Video semantic information processing: The semantic description text of the instructional video clips is processed. The text is first cleaned by removing special symbols, punctuation, and extra spaces; stop words and low-frequency words are then removed. Stop words are words that appear frequently in the text but carry no actual semantic meaning, while low-frequency words are words that appear rarely;
3) Feature extraction: Features are extracted from the preprocessed data, including:
3-1) Standard feature extraction of eye movement signals: Temporal features such as pupil diameter, fixation time, and saccade duration are extracted from the eye movement data. To assess the correlation between these features and the emotional state, Pearson correlation coefficients are calculated, and the p-value is used to judge the significance of each correlation coefficient, that is, the association between each feature and the emotional state. When the p-value is less than 0.05, the correlation coefficient is considered significant, indicating a significant linear relationship between the two groups of samples; when the p-value is less than 0.01, the correlation coefficient is considered highly significant. In this example, the most significant correlation coefficient is...
Finally, 24 eye movement features that are significantly correlated with the emotional state (p-value less than 0.05) are selected: fixation count; saccade count; the maximum of fixation speed, fixation duration, left pupil diameter, right pupil diameter, and mean pupil diameter; the minimum of fixation speed, saccade speed, left pupil diameter, right pupil diameter, and mean pupil diameter; the mean of fixation speed, saccade speed, fixation duration, left pupil diameter, right pupil diameter, and mean pupil diameter; and the standard deviation and variance of left pupil diameter, right pupil diameter, and mean pupil diameter. The Pearson correlation coefficient is defined as: $r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$, where $n$ is the number of data points, $x_i$ and $y_i$ are the values of the two variables at the $i$-th data point, and $\bar{x}$ and $\bar{y}$ are the means of the two variables. To calculate the p-value from the Pearson correlation coefficient $r$, the t-statistic is calculated first. For a sample size of $n$: $t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$, and the p-value corresponding to the t-statistic is then looked up in the t-distribution table according to the degrees of freedom $n - 2$;
3-2) PPG signal standard feature extraction: Fine-grained feature extraction is performed on the PPG data, covering time-domain, frequency-domain, and nonlinear features. As with eye-tracking feature extraction, the Pearson correlation coefficient between the PPG features and the emotional state is calculated, and 29 PPG features significantly correlated with the emotional state are selected based on the p-value. The eye-tracking and PPG signal features extracted in this example are listed in Table 1:
Table 1. List of 24 eye-tracking features and 29 PPG signal features
3-3) Video semantic feature extraction: As shown in Figure 3, meaningful features are extracted from the processed semantic description text to better understand and express the semantic information of the video content.
To achieve this goal, a pre-trained BERT model is used to perform deep semantic analysis on the cleaned video semantic description text. Because the BERT model can capture the contextual relationships within a text, it represents the rich semantic information in the text description more accurately and yields semantic feature vectors containing the important information describing the video content. In this example, the text is first encoded and vectorized with the maximum output length set to 200, and the vectorized data is then fed into the BERT model to obtain the semantic features. Since the default output dimension of the BERT model is high (768 dimensions) and contains some redundant information, the PCA dimensionality reduction algorithm is further used to reduce the dimension of the features and remove the redundancy. Experiments reduced the semantic features to several dimensions, such as 20, 25, 70, and 100, and the best results were achieved at 25 dimensions. The semantic features are therefore finally reduced to 25 dimensions, giving a more compact and efficient representation of the semantic information;
4) Feature decoupling: In this stage, the complete modality data is used to train the feature decoupling network. The complete eye-tracking features, PPG features, and video semantic features are each passed through a dual-branch encoder consisting of two parallel CNN layers and two parallel LSTM layers for spatiotemporal feature extraction.
The structure of the dual-branch encoder is shown in Figure 4. The consistency features and the unique features of each modality are extracted, including:
4-1) Contrastive learning: The extracted modality consistency features are projected onto the same subspace, and "sample-level pairing" is then used to obtain positive and negative pairs for contrastive learning. Positive pairs consist of different features from the same sample, while negative pairs consist of any two features from different samples. The contrastive learning loss [formula not preserved] is computed with X as the anchor over its positive pair and its negative pairs, where N is the number of samples and sim is the cosine similarity between X and Y: $\mathrm{sim}(X, Y) = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert}$. To improve computational efficiency, the order of X and Y is reversed, that is, the loss is calculated again with Y as the anchor, and the final contrastive learning loss is obtained from both directions [formula not preserved];
4-2) Orthogonal constraint: The modality consistency features and the unique features extracted by the encoder of each modality are constrained to be orthogonal. The orthogonal loss of each modality [formula not preserved] is computed using the Frobenius norm of a matrix, and the total orthogonal loss is taken over all modalities [formula not preserved];
4-3) Reconstruction constraint: The decoupled consistency features and unique features of each modality are concatenated and input into a decoder to obtain the reconstructed features. The reconstruction loss of each modality is the root mean square error between the original features and the reconstructed features, and the total reconstruction loss is taken over all modalities [formulas not preserved];
4-4) Classification constraint: The consistency features and unique features of all modalities are concatenated to obtain the fused features, which are fed into a classifier consisting of fully connected layers and activation functions. The classification loss is the cross-entropy between the classifier output and the true emotion label [formulas not preserved];
4-5) Joint training: The total loss of joint training in the feature decoupling stage combines the above losses, weighted by hyperparameters [formula not preserved];
5) Missing data completion and emotion recognition: The encoder of this stage is initialized from the dual-branch encoder pre-trained in the feature decoupling stage, and missing data is used for both training and testing the model. In this example, missing eye-tracking features are taken as an example, where "miss" indicates that the corresponding modality data is missing and is filled with zero vectors. The missing data is input into the dual-branch encoder to extract its modality consistency features and unique features, which are concatenated. In parallel, the complete data is input into the pre-trained encoder to extract the complete modality consistency features and unique features, which are concatenated, and the mirror-image missing features are input into the pre-trained encoder to extract their modality consistency features and modality-specific features, which are likewise concatenated, including:
5-1) Modality completion and classification: The concatenated features extracted from the missing data are input into an imagination module composed of five autoencoders cascaded with residual connections, whose structure is shown in Figure 5, to obtain the completed features. The completed features are concatenated and fed into the classifier, which performs emotion classification and yields the predicted emotion category, from which the classification loss is calculated [formulas not preserved]. The classifier consists of two fully connected layers and a Softmax function;
5-2) Joint optimization: This example designs two further loss functions, the missing loss and the imagination loss, to optimize the model [formulas not preserved]. The total loss of this stage combines the classification loss with the missing loss and the imagination loss, weighted by two hyperparameters [formula not preserved];
6) Testing and evaluation: The multimodal emotion recognition system is comprehensively tested and evaluated to validate the model, with weighted accuracy (WA) and unweighted accuracy (UA) used to measure its performance. In this example, the test results are shown in Table 2:
Table 2. Experimental results on the VLMED dataset
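Two building blocks of the feature decoupling stage, the cosine similarity used in contrastive learning and an orthogonality penalty between consistency and unique features, can be sketched as follows. The cosine similarity follows its standard definition. The orthogonality penalty is an assumption: the loss formula is not legible in this text, so the sketch uses the squared Frobenius norm of the product of the transposed consistency matrix and the unique matrix, which is one common form of such a constraint and may differ from the form used in the application. Function and argument names are hypothetical.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]  # rows are samples, columns are feature dimensions


def cosine_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """sim(X, Y) = X . Y / (||X|| * ||Y||); returns 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)


def orthogonality_penalty(consistency: Matrix, unique: Matrix) -> float:
    """Squared Frobenius norm of consistency^T @ unique (assumed form).

    Both matrices hold one row per sample; the penalty is zero when every
    consistency dimension is uncorrelated with every unique dimension
    across the batch.
    """
    if len(consistency) != len(unique):
        raise ValueError("both matrices need the same number of samples")
    dim_c = len(consistency[0])
    dim_u = len(unique[0])
    total = 0.0
    for i in range(dim_c):
        for j in range(dim_u):
            # Entry (i, j) of consistency^T @ unique.
            entry = sum(c_row[i] * u_row[j]
                        for c_row, u_row in zip(consistency, unique))
            total += entry * entry
    return total


if __name__ == "__main__":
    print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
    print(orthogonality_penalty([[1.0, 0.0], [0.0, 0.0]],
                                [[0.0, 0.0], [0.0, 1.0]]))
```

In a real training loop these quantities would be computed on tensors with automatic differentiation; the list-based version above only makes the arithmetic explicit.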
[0017] Experimental results show that the proposed method performs well under missing-modality conditions. On the VLMED dataset, the model achieved an average WA of 70.32% and an average UA of 69.14% across the three missing conditions, an improvement of 1.8% to 3.9% over the traditional AE, CRA, and MMIN models. This demonstrates the effectiveness of the proposed method for emotion recognition under missing-modality conditions. Ablation experiments show that the average recognition accuracy on missing data is highest when all modules are present, which confirms that each module of the model is indispensable.
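The two metrics reported above can be computed as in the following minimal sketch. It assumes the convention common in emotion recognition work, in which WA is the overall fraction of correctly classified samples and UA is the mean of the per-class recalls; the document does not spell out these definitions, so this convention is an assumption. The function names and the example labels are hypothetical.

```python
from collections import Counter


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correctly classified samples divided by all samples."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Hypothetical labels for illustration only.
y_true = ["interest", "interest", "interest", "boredom"]
y_pred = ["interest", "interest", "boredom", "boredom"]

wa = weighted_accuracy(y_true, y_pred)    # 3 of 4 samples correct
ua = unweighted_accuracy(y_true, y_pred)  # mean of recalls 2/3 and 1/1
```

With imbalanced classes the two values differ, which is why both are reported: WA is dominated by the larger classes, while UA weights every emotion category equally.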
Claims
1. A multimodal emotion recognition method based on feature decoupling and missing modality completion, characterized in that it includes the following steps:
1) Data acquisition: eye-tracking signals and PPG signals generated by learners while watching instructional videos are collected simultaneously, and semantic descriptions of the instructional video segments are extracted, including:
1-1) Eye movement signal acquisition: a flat-panel eye tracker is used to collect the fixation, saccade, pupil size, and eye movement trajectory data generated by the learner during the learning process;
1-2) PPG signal acquisition: wearable ear-clip sensors are used to collect the PPG signals generated by learners during the learning process;
1-3) Video semantic extraction: video text descriptions are generated to capture the rich semantic information of the video. The text description contains detailed textual information that describes the scenes, objects, actions, and plots in the video and expresses the emotions and emotional background of the video content. The automatic generation of video semantic descriptions is achieved by fine-tuning the mPLUG-owl model released by Alibaba DAMO Academy;
2) Data preprocessing: the acquired eye-tracking data, PPG data, and video semantic information are preprocessed, including:
2-1) Eye-tracking data processing: missing values in the eye-tracking data are filled in by linear interpolation, defined as $y = y_0 + \frac{(x - x_0)(y_1 - y_0)}{x_1 - x_0}$, where $x_0$ and $x_1$ are the time points of the frames adjacent to the missing value, $y_0$ and $y_1$ are the corresponding eye movement data values, $x$ is the time point of the missing eye movement data, and $y$ is the missing eye movement data value; the filled eye movement data are then baseline corrected;
2-2) PPG signal processing: the PPG signal is first denoised using a filter and then baseline corrected;
2-3) Video semantic information processing: the semantic description text of the teaching video clips is processed. First, the text is cleaned by removing special symbols, punctuation, and extra spaces. Then, stop words and low-frequency words are removed. Stop words are words that appear frequently in the text but carry no actual semantic meaning, while low-frequency words are words that appear rarely in the text;
3) Feature extraction: features are extracted from the preprocessed data, including:
3-1) Standard feature extraction of eye movement signals: for the eye movement data, temporal features such as pupil diameter, fixation time, and saccade duration are extracted. The Pearson correlation coefficient between each feature and the emotional state is then calculated, and the p-value is used to determine the significance of the correlation coefficient, i.e., the association between each feature and the emotional state. When the p-value is less than 0.05, the correlation coefficient is considered significant, indicating a significant linear relationship between the two groups of samples; when the p-value is less than 0.01, the correlation coefficient is considered highly significant. Finally, 24 eye movement features significantly correlated with emotional state (p-value less than 0.05) are selected: fixation count; saccade count; the maximum of fixation speed, fixation time, left pupil diameter, right pupil diameter, and mean pupil diameter; the minimum of fixation speed, saccade speed, left pupil diameter, right pupil diameter, and mean pupil diameter; the mean of fixation speed, saccade speed, fixation time, left pupil diameter, right pupil diameter, and mean pupil diameter; and the standard deviation and variance of left pupil diameter, right pupil diameter, and mean pupil diameter. The Pearson correlation coefficient is defined as $r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$, where $n$ is the number of data points, $x_i$ and $y_i$ are the values of the $i$-th data point of the two variables, and $\bar{x}$ and $\bar{y}$ are the means of the two variables, respectively. The p-value is calculated from the Pearson correlation coefficient $r$ through a t-test. When the sample size is $n$, the t-statistic is $t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$, and the p-value corresponding to the t-statistic is looked up in the t-distribution table according to the degrees of freedom $n-2$;
3-2) PPG signal standard feature extraction: refined feature extraction is performed on the PPG data, covering time-domain features, frequency-domain features, and nonlinear features. The Pearson correlation coefficient between the PPG features and the emotional state is calculated, and 29 PPG features significantly related to emotional state are selected based on the significance of the p-value: the maximum, minimum, mean, and median of HR, IBI, and R peak; the standard deviation of HR; and SDSD, NNI_20, PNNI_20, RMSSD, range_NNI, CVSD, CVNNI, lf, hf, vlf, lf_hf_ratio, lfnu, hfnu, total_power, triangular_index, and sd1;
3-3) Video semantic feature extraction: a pre-trained BERT model is used to perform deep semantic analysis on the cleaned video semantic description text to obtain semantic feature vectors containing the important information describing the video content. The PCA dimensionality reduction algorithm is then used to reduce the dimension of the features and further remove redundant information, finally yielding a 25-dimensional semantic feature vector;
4) Feature decoupling: spatiotemporal features are extracted from the complete eye-tracking features, PPG features, and video semantic features using a dual-branch encoder consisting of two parallel CNN layers and two parallel LSTM layers, yielding the modality-consistency features and modality-unique features of each modality;
4-1) Contrastive learning: the extracted modality-consistency features are projected onto the same subspace, and positive and negative pairs are obtained by "sample-level pairing" for contrastive learning. Positive pairs consist of two different features from the same sample, while negative pairs consist of any two features from different samples. In the contrastive learning loss function, $X$ is the anchor, the positive and negative pairs are formed as described above, $N$ is the number of samples, and the similarity between $X$ and $Y$ is the cosine similarity $\mathrm{sim}(X, Y) = \frac{X \cdot Y}{\|X\|\,\|Y\|}$. The order of $X$ and $Y$ is then reversed, i.e., $Y$ is used as the anchor and the loss is calculated again, and the final contrastive learning loss combines the two directions;
4-2) Orthogonal constraint: the modality-consistency features and modality-unique features extracted by the encoder of each modality are constrained to be orthogonal. The orthogonality loss of each modality is calculated using the Frobenius norm $\|\cdot\|_F$ of a matrix, and the total orthogonality loss is taken over all modalities;
4-3) Reconstruction constraint: the decoupled consistency features and unique features of each modality are concatenated and input into the decoder to obtain the reconstructed features. The reconstruction loss is calculated between the reconstructed features and the original features using the root mean square error, and the total reconstruction loss is taken over all modalities;
4-4) Classification constraint: the consistency features and unique features of all modalities are concatenated to obtain the fused features, which are fed into a classifier consisting of fully connected layers and the Softmax activation function. The classification loss is the cross-entropy loss between the classifier output and the true emotion labels;
4-5) Joint training: the total loss of joint training in the feature decoupling stage combines the losses of steps 4-1) to 4-4), with the weighting coefficients as hyperparameters;
5) Missing data completion and emotion recognition: a dual-branch encoder with the same structure as in the feature decoupling stage is used. The encoder in this stage is initialized with the network parameters of the dual-branch encoder pre-trained in the feature decoupling stage.
Missing data is used for both training and testing. Three missing conditions are considered: in the first, the video semantic features and PPG features are present but the eye-tracking features are missing; in the second, the video semantic features and eye-tracking features are present but the PPG features are missing; in the third, the video semantic features are present but both the eye-tracking features and PPG features are missing. Missing features are uniformly filled with zeros. The missing data is input into the dual-branch encoder to extract the modality-consistency features and modality-unique features of the missing data, which are concatenated. At the same time, the complete data is input into the pre-trained encoder to extract the complete modality-consistency features and modality-unique features, which are concatenated; the mirror-image missing features are input into the pre-trained encoder to extract their modality-consistency features and modality-unique features, which are also concatenated;
5-1) Modality completion and classification: the concatenated features extracted from the missing data are input into an imagination module composed of five autoencoders (AEs) connected in series with residual connections to obtain the completed features, which are concatenated and input into the classifier. The classifier performs emotion classification to give the predicted emotion category, and the classification loss is calculated. The classifier consists of two fully connected layers and a SoftMax function;
5-2) Joint optimization: two loss functions are designed to optimize the model, a missing loss and an imagination loss. The total loss function for this stage combines the losses of this stage, weighted by two hyperparameters;
6) Testing and evaluation: the multimodal emotion recognition system is comprehensively tested and evaluated to validate the model, with weighted accuracy (WA) and unweighted accuracy (UA) used to measure the model's performance.
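The significance screening in steps 3-1) and 3-2) of the claim can be illustrated with a minimal sketch that computes the Pearson correlation coefficient and the corresponding t-statistic with $n-2$ degrees of freedom. The function names and the sample values are hypothetical; the final lookup of the p-value in a t-distribution table is left to the reader, as in the claim.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r, n):
    """t-statistic for testing r against zero, with its degrees of freedom (n - 2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r), n - 2


# Hypothetical values: one candidate feature and a numeric emotion score.
feature = [1.0, 2.0, 3.0, 4.0, 5.0]
emotion = [2.0, 1.0, 4.0, 3.0, 5.0]

r = pearson_r(feature, emotion)          # 0.8 for these values
t, dof = t_statistic(r, len(feature))    # t is about 2.31 with 3 degrees of freedom
```

For these five points the t-statistic stays below the two-tailed 0.05 critical value for 3 degrees of freedom (3.182), so this hypothetical feature would not pass the p < 0.05 screening despite its fairly high correlation; with so few samples only very strong correlations are significant.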