Speech expression multi-modal emotion recognition method based on weighted feature fusion

By extracting speech and facial expression keyframes, calculating the speech and facial expression weights in weighted feature fusion, extracting high-order emotion features using residual networks and 3D-CNN, and performing weighted feature fusion in a multilayer perceptron, the problem of insufficient single-modal emotion recognition rate is solved, and higher multimodal emotion recognition accuracy is achieved.

CN115691559B · Active · Publication Date: 2026-06-19 · CENT CHINA BRANCH OF CHINA DATANG CORP SCI & TECH RES INST CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CENT CHINA BRANCH OF CHINA DATANG CORP SCI & TECH RES INST CO LTD
Filing Date
2022-11-08
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing unimodal emotion recognition methods cannot fully utilize the complementarity between speech and facial expression modalities, resulting in insufficient emotion recognition rates.

Method used

By extracting speech and facial expression keyframes, calculating the speech and facial expression weights in weighted feature fusion, extracting high-order emotion features using residual networks and 3D-CNN, and performing weighted feature fusion in a multilayer perceptron, weighted fusion of speech and facial expression modalities is achieved.

Benefits of technology

It improves the recognition rate of multimodal emotion recognition, makes full use of the complementarity between speech and facial expression modalities, and effectively enhances the accuracy of emotion recognition.


Abstract

This invention discloses a multimodal emotion recognition method for speech and facial expressions based on weighted feature fusion, comprising: extracting speech keyframes and facial expression keyframes, and extracting MFCC features from the speech keyframes; aligning the extracted speech and facial expression keyframes, and calculating speech weights and facial expression weights in the weighted feature fusion by utilizing the complementarity of speech and facial expression modalities; extracting higher-order speech emotion features from the speech MFCC features using a trained residual network, and extracting higher-order facial expression emotion features from the facial expression keyframes using a trained 3D-CNN neural network; weighting the higher-order speech emotion features and higher-order facial expression emotion features using the speech weights and facial expression weights respectively, to obtain weighted higher-order speech emotion features and weighted higher-order facial expression emotion features; inputting the weighted higher-order speech emotion features and weighted higher-order facial expression emotion features into a multilayer perceptron (MLP) for weighted feature fusion and emotion classification. This invention can improve the recognition rate of multimodal emotion recognition for speech and facial expressions.

Description

Technical Field

[0001] This invention belongs to the field of artificial intelligence and, more specifically, relates to a method for multimodal emotion recognition of speech and facial expressions based on weighted feature fusion.

Background Technology

[0002] With the rapid development of artificial intelligence, emotion recognition, as a core component of human-computer interaction, has been applied in many fields, including smart homes, travel recommendation systems, and health monitoring. Through in-depth research, researchers have gradually developed a widely used classification standard for emotion recognition. This standard categorizes emotions into six types: anger, disgust, fear, sadness, surprise, and happiness. Furthermore, emotion recognition primarily includes speech, facial expression, and physiological signal emotion recognition. Speech emotion recognition identifies a person's emotional state by analyzing their speech rate, tone, and semantics. Facial expression emotion recognition analyzes facial expressions and muscle movements to identify a person's emotional state; physiological signal emotion recognition identifies a person's emotional state by analyzing physiological signals such as respiration, heart rate, blood pressure, electroencephalogram (EEG) signals, and body temperature. Among these, facial expression and speech emotion recognition have been extensively and deeply studied because they allow for the simple and intuitive construction of facial expression and speech emotion recognition datasets.

[0003] Researchers have conducted extensive studies on unimodal speech and facial expression emotion recognition, finding that unimodal emotion recognition does not convey complete emotional information, and its recognition rate falls far short of expectations. Furthermore, human emotion is a comprehensive expression integrating multiple modalities, with these modalities exhibiting a degree of complementarity. For example, when a person is happy, their face is usually full of smiles and their voice is loud; when a person is angry, their face is usually gloomy, their eyes widen, their voice is low, and various physiological information such as blood pressure, brainwave signals, and body temperature also change. These all indicate that human emotion is a comprehensive expression integrating multiple modalities, and that these modalities are complementary. Therefore, to fully utilize the complementarity between modalities to improve the recognition rate of emotion recognition, the research focus of emotion recognition has gradually shifted from unimodal to multimodal.

[0004] The core of multimodal emotion recognition lies in fusion methods, including decision-level fusion, model-level fusion, and feature-level fusion. Decision-level fusion first uses multiple classifiers to identify the emotional state of each modality, then employs decision rules such as mean, summation, and product to fuse the data. This fusion method is simple in principle, but it does not utilize the complementarity between different emotional modalities. Model-level fusion leverages the correlation between different modalities to achieve multimodal emotion recognition, but it requires comprehensive consideration of the characteristics of each modality and specific experiments to determine the parameters of model-level fusion, which severely limits its development.

[0005] Feature-level fusion first extracts features from each modality, then fuses and classifies them. This fusion method leverages the complementarity between different emotional modalities and is widely used in multimodal emotion recognition.

Summary of the Invention

[0006] The purpose of this invention is to propose a multimodal emotion recognition method for speech and facial expressions based on weighted feature fusion, thereby improving the recognition rate of multimodal emotion recognition for speech and facial expressions.

[0007] To achieve the above objective, this invention proposes a multimodal emotion recognition method for speech and facial expressions based on weighted feature fusion, comprising:

[0008] Extract speech and facial expression information from emotional videos, perform frame segmentation, extract speech keyframes and facial expression keyframes, and extract MFCC features of speech keyframes.

[0009] The extracted speech keyframes and facial expression keyframes are aligned, and the speech weights and facial expression weights in the weighted feature fusion are calculated by leveraging the complementarity of speech and facial expression modalities.

[0010] The high-order emotional features of speech in the speech MFCC features are extracted using the trained residual network, and the high-order emotional features of facial expressions in the facial expression keyframes are extracted using the trained 3D-CNN neural network.

[0011] The higher-order emotional features of speech and the higher-order emotional features of facial expressions are weighted using the speech weight and the facial expression weight, respectively, to obtain the weighted higher-order emotional features of speech and the weighted higher-order emotional features of facial expressions.

[0012] The weighted speech high-order emotion features and the weighted facial expression high-order emotion features are input into a multilayer perceptron (MLP) for weighted feature fusion and emotion classification.

[0013] Preferably, the extraction of speech keyframes includes:

[0014] The speech VAD algorithm is used to extract keyframes from speech information.

[0015] Preferably, the extraction of facial expression keyframes includes:

[0016] Extract the facial region from the expression frames and use information entropy to extract expression keyframes.

[0017] Preferably, the extraction of speech MFCC features includes:

[0018] Compensation preprocessing is performed on the high-frequency components of the speech information;

[0019] The preprocessed speech information is converted into the frequency domain using FFT transformation to obtain the spectrum of the speech information.

[0020] The Mel spectrum of speech information is obtained by processing the spectrum of speech information using a Mel filter bank.

[0021] Cepstral analysis is performed on the Mel spectrum of the speech information to obtain the Mel cepstrum of the speech information;

[0022] Based on the Mel cepstrum of the speech information, the MFCC features of the speech information are calculated using the Discrete Cosine Transform.

[0023] Calculate the first-order and second-order differences of the MFCC features of speech information.

[0024] Preferably, the step of calculating the speech weights and expression weights in the weighted feature fusion by utilizing the complementarity of speech and facial expression modalities includes:

[0025] After aligning the speech keyframes and facial expression keyframes, determine whether the speech keyframes and facial expression keyframes in the aligned data frame are valid keyframes containing emotional features.

[0026] If both the speech keyframe and the facial expression keyframe in the aligned data frame are valid keyframes, then the weights of both the speech keyframe and the facial expression keyframe are set to 1.

[0027] If the speech keyframe in the alignment data frame is a valid keyframe and the facial expression keyframe is an invalid keyframe, then the weight of the corresponding facial expression keyframe in the alignment data frame is set to 0, and the weight of the corresponding speech keyframe in the alignment data frame is set to 2.

[0028] If the facial expression keyframe in the alignment data frame is a valid keyframe and the speech keyframe is an invalid keyframe, then the weight of the corresponding speech keyframe in the alignment data frame is set to 0, and the weight of the corresponding facial expression keyframe in the alignment data frame is set to 2.

[0029] The total number of aligned data frames where the speech keyframe is a valid keyframe and the facial expression keyframe is an invalid keyframe is counted.

[0030] Based on the total number of facial expression keyframes, the total number of speech keyframes, the total number of aligned data frames where facial expression keyframes are valid and speech keyframes are invalid, and the total number of aligned data frames where speech keyframes are valid and facial expression keyframes are invalid, the speech weight and facial expression weight in the weighted feature fusion are calculated.

[0031] Preferably, the calculation formula for the speech weights in weighted feature fusion is as follows:

[0032]

[0033] The expression weights in weighted feature fusion are defined by the following calculation formula:

[0034]

[0035] Here, $W_s$ is the speech weight and $W_v$ is the facial expression weight in the weighted feature fusion; size(·) denotes the length of a vector; $C_v$ is the vector that stores the position parameters of the facial expression keyframes, so size($C_v$) is the total number of facial expression keyframes; $C_s$ is the vector that stores the position parameters of the speech keyframes, so size($C_s$) is the total number of speech keyframes; P is the total number of aligned data frames in which the facial expression keyframe is valid and the speech keyframe is invalid; and Q is the total number of aligned data frames in which the speech keyframe is valid and the facial expression keyframe is invalid.
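The per-frame rule of [0026] to [0028] and the counts P and Q can be illustrated with a short Python sketch. The closed-form expressions for $W_s$ and $W_v$ are not reproduced in this text, so the normalization used below (each modality's share of the summed per-frame weights) is only an assumption for illustration. The function names and the convention that keyframes are aligned by shared position index are also hypothetical.

```python
def frame_weights(speech_valid, face_valid):
    """Per-frame (speech, expression) weights from the validity rule."""
    if speech_valid and face_valid:
        return 1, 1
    if speech_valid:
        return 2, 0
    if face_valid:
        return 0, 2
    return 0, 0  # neither valid: not covered by the text, assumed to contribute nothing


def fusion_weights(c_s, c_v):
    """Aggregate weights from the keyframe position vectors C_s and C_v.

    Returns (w_s, w_v, p, q). The normalization is an illustrative
    assumption, not the patent's formula.
    """
    s, v = set(c_s), set(c_v)
    if not (s or v):
        raise ValueError("no valid keyframes in either modality")
    p = len(v - s)  # P: expression keyframe valid, speech keyframe invalid
    q = len(s - v)  # Q: speech keyframe valid, expression keyframe invalid
    total_s = total_v = 0
    for i in s | v:
        w_speech, w_face = frame_weights(i in s, i in v)
        total_s += w_speech
        total_v += w_face
    norm = total_s + total_v
    return total_s / norm, total_v / norm, p, q


# Example: speech keyframes at positions 0..3, expression keyframes at 2..4.
w_s, w_v, p, q = fusion_weights([0, 1, 2, 3], [2, 3, 4])
```

Under this assumed normalization, a modality that is often the only valid one in an aligned frame receives a larger share of the fusion weight.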

[0036] Preferably, the residual network includes five residual blocks with the same structure connected in sequence and two fully connected layers, wherein each residual block includes two convolutional layers, three ReLU activation function layers and two normalization layers connected in sequence.

[0037] The last fully connected layer has 2048 neurons.

[0038] Preferably, the 3D-CNN neural network includes seven convolutional layers, five pooling layers, and two fully connected layers connected in sequence;

[0039] The last fully connected layer has 2048 neurons.

[0040] Preferably, the higher-order emotional features of the speech and the higher-order emotional features of the facial expressions are weighted using the speech weights and the facial expression weights, respectively, and calculated using the following formula:

[0041] $X_W = W_s \cdot X$

[0042] $Y_W = W_v \cdot Y$

[0043] Here, $X_W$ is the weighted high-order speech emotion feature, $Y_W$ is the weighted high-order facial expression emotion feature, $W_s$ is the speech weight, $W_v$ is the facial expression weight, $X$ is the high-order speech emotion feature output by the residual network, and $Y$ is the high-order facial expression emotion feature output by the 3D-CNN.

[0044] Preferably, the multilayer perceptron (MLP) includes an input layer, two hidden layers, and an output layer. The input layer contains 4096 neurons, the two hidden layers contain 2048 neurons and 1024 neurons, respectively, and the number of neurons in the output layer is equal to the number of emotion categories.
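As a rough illustration of the weighting and fusion steps, the following Python sketch scales two feature vectors by their weights, concatenates them, and passes the result through a small MLP. Concatenation is inferred from the 4096-neuron input layer (2048 + 2048); the ReLU hidden activations, softmax output, random initialization, and all function names are assumptions, since the text does not specify them. The demo uses scaled-down layer sizes; the sizes in [0044] would be `[4096, 2048, 1024, n_classes]`.

```python
import math
import random


def weighted_concat(x, y, w_s, w_v):
    """Scale speech features x by w_s and expression features y by w_v, then concatenate."""
    return [w_s * a for a in x] + [w_v * b for b in y]


def init_mlp(sizes, seed=0):
    """Randomly initialized (weights, bias) pairs for consecutive layer sizes."""
    rng = random.Random(seed)
    return [
        ([[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)],
         [0.0] * n_out)
        for n_in, n_out in zip(sizes, sizes[1:])
    ]


def softmax(vec):
    m = max(vec)
    exps = [math.exp(v - m) for v in vec]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_forward(vec, layers):
    """Forward pass: ReLU on hidden layers (assumed), softmax on the output layer (assumed)."""
    for k, (weights, bias) in enumerate(layers):
        vec = [sum(w * v for w, v in zip(row, vec)) + b
               for row, b in zip(weights, bias)]
        if k < len(layers) - 1:
            vec = [max(0.0, v) for v in vec]
    return softmax(vec)


# Scaled-down demo: 4 + 4 input features, hidden layers of 4 and 2, six emotion classes.
fused = weighted_concat([1.0] * 4, [2.0] * 4, w_s=0.6, w_v=0.4)
probs = mlp_forward(fused, init_mlp([8, 4, 2, 6]))
```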

[0045] The beneficial effects of this invention are as follows:

[0046] The method of this invention first extracts speech keyframes and facial expression keyframes and then extracts the speech MFCC features. Next, the extracted speech and facial expression keyframes are aligned, and the speech weight and facial expression weight for the weighted feature fusion are calculated by utilizing the complementarity of the speech and facial expression modalities. Then, a trained residual network extracts high-order speech emotion features from the speech MFCC features, and a trained 3D-CNN neural network extracts high-order facial expression emotion features from the facial expression keyframes. Finally, the high-order speech and facial expression emotion features are weighted by the speech weight and the facial expression weight, respectively, and the weighted features are input into the feature fusion layer (MLP) for weighted fusion and emotion classification. Because the weights are derived from keyframe alignment, the fusion exploits the complementarity between the speech and facial expression modalities, effectively improving the recognition rate of multimodal emotion recognition.

[0047] The method of the present invention has other features and advantages that will be apparent from, or are set forth in detail in, the accompanying drawings and the following detailed description, which together serve to explain particular principles of the invention.

Attached Figure Description

[0048] The above and other objects, features and advantages of the present invention will become more apparent from the accompanying drawings, in which like reference numerals generally denote like parts.

[0049] Figure 1 is a flowchart of the steps of a multimodal emotion recognition method for speech and facial expressions based on weighted feature fusion according to an embodiment of the present invention.

[0050] Figure 2 is a schematic diagram of the principle of a multimodal emotion recognition method for speech and facial expressions based on weighted feature fusion according to an embodiment of the present invention.

[0051] Figure 3 shows the MFCC feature maps of speech information in a multimodal emotion recognition method for speech and facial expressions based on weighted feature fusion according to an embodiment of the present invention.

[0052] Figure 4 shows the experimental results of the emotion recognition method of this invention on the RML dataset.

[0053] Figure 5 shows the experimental results of the emotion recognition method of this invention on the eNTERFACE05 dataset.

[0054] Figure 6 shows the experimental results of the emotion recognition method of this invention on the BAUM-1s dataset.

Detailed Implementation

[0055] This invention proposes a multimodal emotion recognition method for speech and facial expressions based on weighted feature fusion, following the feature-level fusion approach: features are first extracted from each modality and then fused and classified. The principle is as follows. First, the emotional video is separated into speech information and facial expression information. Noise segments are removed from the speech information with a speech VAD algorithm, and the remaining speech segments are saved as speech keyframes. Facial expression keyframes are selected using information entropy together with the observation that emotion generation is a continuous process. Second, to fully utilize the complementarity between the speech and facial expression modalities and preserve important emotional information, the speech and facial expression weights are calculated using a keyframe alignment technique. Third, the MFCC features of the speech keyframes and the facial expression keyframes are input into a residual network and a 3D-CNN, respectively, to extract high-order speech and facial expression emotion features. Finally, the high-order speech and facial expression emotion features are input into a multilayer perceptron (MLP) for weighted fusion and emotion classification. Because the fusion weights come from the keyframe alignment technique, the method fully utilizes the complementarity between the speech and facial expression modalities and effectively improves the recognition rate of multimodal emotion recognition.

[0056] The invention will now be described in more detail with reference to the accompanying drawings. While preferred embodiments of the invention are shown in the drawings, it should be understood that the invention can be implemented in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that the invention will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.

[0057] Example

[0058] As shown in Figure 1, this embodiment provides a multimodal emotion recognition method for speech and facial expressions based on weighted feature fusion. The method specifically includes:

[0059] S1: Extract speech and facial expression information from the emotional video, perform frame segmentation, extract speech keyframes and facial expression keyframes, and extract MFCC features of the speech keyframes.

[0060] S2: Align the extracted speech keyframes and facial expression keyframes, and calculate the speech weights and facial expression weights in the weighted feature fusion by taking advantage of the complementarity of speech and facial expression modalities.

[0061] S3: Use the trained residual network to extract high-order emotional features of speech from the speech MFCC features, and use the trained 3D-CNN neural network to extract high-order emotional features of facial expressions from the facial expression keyframes.

[0062] S4: Use speech weight and facial expression weight to weight the higher-order emotional features of speech and facial expression respectively, and obtain the weighted higher-order emotional features of speech and facial expression.

[0063] S5: Input the weighted speech high-order emotion features and the weighted facial expression high-order emotion features into the multilayer perceptron (MLP) for weighted feature fusion and emotion classification.

[0064] The steps described above in this embodiment will be explained in detail below.

[0065] The principle of the multimodal emotion recognition method for speech and facial expressions based on weighted feature fusion in this embodiment is shown in Figure 2. The network model used in this embodiment mainly includes preprocessing, a residual network, a 3D-CNN, and a weighted feature fusion layer (multilayer perceptron, MLP).

[0066] Execute S1: Extract speech and facial expression information from the emotional video, perform frame segmentation, extract speech keyframes and facial expression keyframes, and extract MFCC features of the speech keyframes.

[0067] This step involves preprocessing to extract the speech and facial expression keyframes, from which the speech and facial expression weights in the weighted feature fusion are calculated. First, the emotional video is separated into speech information and facial expression information, which are then divided into speech frames and facial expression frames. Next, speech keyframes are extracted from the speech frames using the speech VAD algorithm, and the speech MFCC features are extracted. Then, the face region is extracted from each facial expression frame, and facial expression keyframes are selected using information entropy.

[0068] The specific process includes the following steps:

[0069] S101: Extract speech keyframes.

[0070] Specifically, speech information consists of speech segments and noise segments. Speech segments contain emotional features, whereas noise segments contain almost none. Noise segments therefore not only fail to improve emotion recognition performance but also pollute the data and reduce that performance. To address this issue, this invention uses the speech VAD algorithm to eliminate the noise segments in the speech information and saves the speech segments as speech keyframes. The speech VAD algorithm employs a dual-threshold detection method based on short-time energy and zero-crossing rate, where the short-time energy of the speech information is defined as:

[0071] $E_n = \sum_{m=-\infty}^{\infty} \left[ x(m)\, w(n-m) \right]^2 \quad (1)$

[0072] Where x(m) represents the speech information and w(n−m) represents the window function.

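[0073] The zero-crossing rate of the speech information is defined as:

[0074] $Z_n = \frac{1}{2} \sum_{m=-\infty}^{\infty} \left| \operatorname{sgn}[x(m)] - \operatorname{sgn}[x(m-1)] \right| w(n-m) \quad (2)$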

[0075] Where sgn[x(m)] is the sign function, defined as:

[0076] $\operatorname{sgn}[x(m)] = \begin{cases} 1, & x(m) \ge 0 \\ -1, & x(m) < 0 \end{cases} \quad (3)$

[0077] Speech segments of the speech information have high energy and a low zero-crossing rate, while noise segments have low energy and a high zero-crossing rate. Therefore, by setting dual thresholds on the short-time energy and zero-crossing rate, each frame of the speech information is classified as belonging to either a noise segment or a speech segment. The speech segments of the speech information are defined as the speech keyframes $S_k$, and a vector $C_s$ is defined to store the position parameter i of each speech keyframe $S_k$.
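A simplified Python sketch of this frame classification is shown below. It assumes a rectangular window, non-overlapping frames, and one threshold per measure, whereas a full dual-threshold detector typically also tracks segment start and end points. The threshold values and function names are hypothetical.

```python
import math


def short_time_energy(frame):
    """Sum of squared samples in one frame (rectangular window assumed)."""
    return sum(s * s for s in frame)


def zero_crossing_count(frame):
    """Number of sign changes between consecutive samples, with sgn(0) = 1."""
    def sgn(value):
        return 1 if value >= 0 else -1
    return sum(abs(sgn(frame[m]) - sgn(frame[m - 1]))
               for m in range(1, len(frame))) // 2


def speech_keyframe_positions(signal, frame_len, energy_thr, zcr_thr):
    """Return C_s: positions i of frames with high energy and few zero crossings."""
    c_s = []
    for i in range(len(signal) // frame_len):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        if (short_time_energy(frame) > energy_thr
                and zero_crossing_count(frame) < zcr_thr):
            c_s.append(i)
    return c_s


# Demo: a low-amplitude, rapidly alternating "noise" frame on each side of a tone.
noise = [0.01 * (-1) ** n for n in range(100)]
tone = [math.sin(2 * math.pi * 5 * n / 100) for n in range(100)]
c_s = speech_keyframe_positions(noise + tone + noise, frame_len=100,
                                energy_thr=1.0, zcr_thr=20)
```

In the demo only the middle frame satisfies both conditions, so only its position is stored in `c_s`.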

[0078] S102: Extract facial expression keyframes.

[0079] Specifically, facial expression information includes the face region and the background region, with emotional information residing primarily in the face region. Therefore, this invention uses the Viola-Jones method to extract the face region from the facial expression information. This method first computes one or more rectangular features of the facial expression image using an integral image, then trains multiple weak classifiers, combines the weak classifiers into strong classifiers using the AdaBoost method, and finally performs classification with a cascade. It is a simple and convenient face detection method. Furthermore, the generation of emotion is a continuous process that generally passes through four stages: onset, peak, decline, and end. Extracting the facial expression frame at the emotional peak effectively reduces redundancy among keyframes. Additionally, a facial expression image can be treated as a two-dimensional discrete signal whose information content can be measured with information entropy: an image with higher information entropy contains more information and better reflects the main content of the image, while an image with lower information entropy contains less information and reflects the main content less well. Therefore, this invention extracts facial expression keyframes based on information entropy and the fact that emotion generation is a continuous process. The specific principle is as follows:

[0080] First, the information entropy of the facial expression image is calculated. For a facial expression image I of size M×N and grayscale range L (0 < L < 256), its gray value at coordinates (x, y) is defined as f(x, y), with a value range of [0, L−1]. Furthermore, the number of pixels with gray level j in the i-th facial expression image is defined as $f_{i,j}(x, y)$, and the probability of gray level j is defined as:

[0081] $p_{i,j} = \frac{f_{i,j}(x, y)}{M \times N} \quad (4)$

[0082] Based on equation (4), the information entropy of the i-th facial expression image is defined as:

[0083] $H_i(I) = -\sum_{j=0}^{L-1} p_{i,j} \log_2 p_{i,j} \quad (5)$

[0084] The average information entropy of N frames of facial expression images is defined as:

[0085] $H_{ave} = \frac{1}{N} \sum_{i=1}^{N} H_i(I) \quad (6)$

[0086] Based on equation (5), the information entropy of each facial expression frame is calculated, and the frame with the maximum information entropy among the N frames of facial expression images is defined as the facial expression reference frame $V_r$. Because the generation of emotion is a continuous process, the greater the information entropy of a facial expression image, the more emotional information it contains. Therefore, the facial expression reference frame $V_r$ contains a large amount of emotional information. Furthermore, the position parameter i of the facial expression reference frame $V_r$ is stored in $C_v$. At the same time, to ensure that every facial expression reference frame $V_r$ is valid, $V_r$ should satisfy $H_i(I) > H_{ave}$.
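The reference-frame selection can be sketched in Python as follows. Frames are assumed to be grayscale images stored as lists of rows of integer gray levels, and the entropy uses a base-2 logarithm. The function names are hypothetical.

```python
import math


def image_entropy(image, levels=256):
    """Shannon entropy (in bits) of the gray-level histogram of one image."""
    counts = [0] * levels
    total = 0
    for row in image:
        for gray in row:
            counts[gray] += 1
            total += 1
    return -sum((c / total) * math.log2(c / total) for c in counts if c)


def reference_frame(frames):
    """Return (position, entropy) of the maximum-entropy frame.

    Returns None when that frame does not exceed the average entropy,
    i.e. when the validity condition H_i(I) > H_ave fails.
    """
    entropies = [image_entropy(f) for f in frames]
    h_ave = sum(entropies) / len(entropies)
    i = max(range(len(frames)), key=entropies.__getitem__)
    return (i, entropies[i]) if entropies[i] > h_ave else None


# Demo: three 2x2 frames with one, two, and four distinct gray levels.
flat = [[0, 0], [0, 0]]
two_level = [[0, 0], [255, 255]]
four_level = [[0, 64], [128, 255]]
v_r = reference_frame([flat, two_level, four_level])
```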

[0087] Second, the facial expression keyframes are extracted. By definition, the facial expression reference frame $V_r$ contains more emotional information than the other frames, so it is used as part of the facial expression keyframes. However, the number of facial expression reference frames $V_r$ is often insufficient to meet the requirements of the feature extraction model, so other frames must also be selected as facial expression keyframes. This invention calculates the similarity between the facial expression reference frame $V_r$ and the other original facial expression frames to determine whether they can be used as facial expression keyframes, because the more similar a frame is to the reference frame $V_r$, the more emotional information it contains. The similarity is calculated with the commonly used perceptual hashing algorithm and Hamming distance. Specifically, the perceptual hashing algorithm is first used to calculate the hash fingerprint of each facial expression frame, and then the Hamming distance between fingerprints is used to measure the similarity between the reference frame $V_r$ and each other original facial expression frame. If the Hamming distance is less than 10, the two facial expression frames are similar; if the Hamming distance is greater than 10, they are not similar. Finally, the position parameter i of each original facial expression frame similar to the reference frame $V_r$ is added to $C_v$, and the facial expression frames y(i) corresponding to the position parameters i in $C_v$ are defined as the facial expression keyframes $V_k$.
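A minimal Python sketch of the similarity test follows. The text does not specify which perceptual hash variant is used, so the fingerprint below is one common DCT-based form, assumed here for illustration: it keeps the low-frequency 8×8 block of the 2D DCT and thresholds it at the median. Frames are assumed to be already resized to small square grayscale images. The function names are hypothetical.

```python
import math


def phash_bits(gray, keep=8):
    """64-bit perceptual-hash fingerprint of a square grayscale image (list of rows)."""
    n = len(gray)
    coeffs = []
    for u in range(keep):
        for v in range(keep):
            acc = 0.0
            for x in range(n):
                cos_x = math.cos((2 * x + 1) * u * math.pi / (2 * n))
                for y in range(n):
                    acc += gray[x][y] * cos_x * math.cos(
                        (2 * y + 1) * v * math.pi / (2 * n))
            coeffs.append(acc)
    ac_sorted = sorted(coeffs[1:])  # exclude the DC term from the median
    median = ac_sorted[len(ac_sorted) // 2]
    return [1 if c > median else 0 for c in coeffs]


def hamming(a, b):
    """Number of positions at which two fingerprints differ."""
    return sum(x != y for x, y in zip(a, b))


def is_similar(ref_bits, frame_bits, threshold=10):
    """Frames count as similar when the Hamming distance is below the threshold."""
    return hamming(ref_bits, frame_bits) < threshold


# Demo on a 16x16 gradient image compared with itself.
gradient = [[(16 * x + y) % 256 for y in range(16)] for x in range(16)]
bits = phash_bits(gradient)
same = is_similar(bits, bits)
```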

[0088] S103: Extract MFCC features from speech keyframes.

[0089] Specifically, speech MFCC features are commonly used in speech recognition because they are based on cepstral features, which align with human auditory habits, making them an effective speech feature extraction method. Speech MFCC includes preprocessing, Fast Fourier Transform (FFT), Mel spectrogram, and Mel cepstral analysis.

[0090] The specific principle is as follows:

[0091] The high-frequency components of speech information are distorted by the oral and nasal cavities. Preprocessing is typically used to compensate for these high-frequency components, and this is represented as:

[0092] $x_{pre}(n) = x(n) - a\,x(n-1), \quad 0.9 < a < 1.0 \quad (9)$

[0093] Where $x_{pre}(n)$ represents the preprocessed speech information, x(n) represents the speech information at time n, and a is an empirical value, usually taken as 0.98.
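Equation (9) is a first-order difference filter and can be sketched in a few lines of Python. Passing the first sample through unchanged is an assumption, because equation (9) does not define x(−1). The function name is hypothetical.

```python
def pre_emphasis(x, a=0.98):
    """Apply x_pre(n) = x(n) - a * x(n-1); the first sample is kept as-is."""
    if not x:
        return []
    return [x[0]] + [x[n] - a * x[n - 1] for n in range(1, len(x))]


# A constant signal is strongly attenuated after the first sample,
# which is how the filter boosts high frequencies relative to low ones.
flattened = pre_emphasis([1.0, 1.0, 1.0], a=0.5)
```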

[0094] The frequency domain of speech information makes it easier to observe its features. The FFT transform is typically used to convert speech information into the frequency domain, and it is represented as:

[0095]

[0096] Where M represents the number of FFT points for each frame of speech information, which is typically 256 or 512. Finally, the spectrum of the speech information is obtained by calculating the square of the modulus of equation (10).
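The spectrum computation can be illustrated with the Python sketch below. For clarity it evaluates the discrete Fourier transform directly rather than with a fast algorithm, and it keeps only the non-negative frequency bins. Zero-padding short frames to `n_fft` points is an assumption, and the function name is hypothetical.

```python
import cmath
import math


def power_spectrum(frame, n_fft=256):
    """Squared modulus of the n_fft-point DFT of one frame (bins 0..n_fft/2)."""
    padded = list(frame[:n_fft]) + [0.0] * max(0, n_fft - len(frame))
    spectrum = []
    for k in range(n_fft // 2 + 1):
        x_k = sum(padded[n] * cmath.exp(-2j * cmath.pi * k * n / n_fft)
                  for n in range(n_fft))
        spectrum.append(abs(x_k) ** 2)
    return spectrum


# Demo: a cosine with exactly 4 cycles in a 16-sample frame concentrates
# its energy in bin 4.
tone = [math.cos(2 * math.pi * 4 * n / 16) for n in range(16)]
ps = power_spectrum(tone, n_fft=16)
peak_bin = max(range(len(ps)), key=ps.__getitem__)
```

In practice a library FFT routine would replace the inner sum; the direct form is used here only to stay dependency-free.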

[0097] The Mel spectrum of speech information is obtained by processing the spectrum of speech information using a Mel filter bank. The Mel filter bank typically uses a triangular filter, which is defined as follows:

[0098]

[0099] Where m represents the Mel frequency, whose relationship with the frequency f of the speech information is expressed as follows:

[0100] $m = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \quad (12)$
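The Mel mapping can be sketched in Python as shown below. It assumes the widely used form m = 2595·log10(1 + f/700) and places the centers of the triangular filters at equal spacing on the Mel scale. The number of filters and the frequency range in the demo are hypothetical.

```python
import math


def hz_to_mel(f):
    """Convert a frequency in Hz to the Mel scale (common 2595*log10 form)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, f_min, f_max):
    """Center frequencies (Hz) of triangular filters equally spaced in Mel."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]


# Demo: 26 filters between 0 Hz and 8 kHz (illustrative values).
centers = mel_center_frequencies(26, 0.0, 8000.0)
```

Equal spacing in Mel yields filters that are narrow at low frequencies and wide at high frequencies, which matches the resolution of human hearing.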

[0101] Cepstral analysis is performed on the Mel spectrum of the speech information to obtain the Mel cepstrum of the speech information. Specifically, the logarithmic energy s(m) of the output of each filter in the filter bank is calculated and expressed as:

[0102]

[0103] The MFCC feature C(n) of speech information is calculated using discrete cosine transform and is represented as:

[0104]
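The last two steps, log filter-bank energies followed by a discrete cosine transform, can be sketched in Python as follows. The sketch uses the common DCT-II form and adds a small constant before the logarithm for numerical safety. Both choices, along with the function names and the toy filter bank, are assumptions rather than the exact expressions of this document.

```python
import math


def log_filterbank_energies(power, filterbank, eps=1e-12):
    """s(m): natural log of the power spectrum weighted by each Mel filter."""
    return [math.log(sum(p * h for p, h in zip(power, filt)) + eps)
            for filt in filterbank]


def mfcc_from_log_energies(s, n_coeffs):
    """C(n) for n = 1..n_coeffs via a DCT-II over the log energies s(m)."""
    m_total = len(s)
    return [
        sum(s[m] * math.cos(math.pi * n * (m + 0.5) / m_total)
            for m in range(m_total))
        for n in range(1, n_coeffs + 1)
    ]


# Demo with a toy two-bin spectrum and two overlapping filters.
log_e = log_filterbank_energies([1.0, 3.0], [[1.0, 0.0], [0.5, 0.5]])
coeffs = mfcc_from_log_energies(log_e, n_coeffs=1)

# A flat log-energy profile carries no spectral shape, so every
# coefficient with n >= 1 vanishes.
flat_coeffs = mfcc_from_log_energies([1.0] * 8, n_coeffs=4)
```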

[0105] Finally, the first-order difference $d_t$ of the MFCC features of the speech information is calculated and is represented as:

[0106]

[0107] Here, $C_t$ represents the MFCC feature of the t-th frame of speech information, and K and Q represent the time difference and the order of the MFCC features of the speech information, respectively. Furthermore, the second-order difference of the speech information can be calculated from equation (15) and $d_t$.
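A Python sketch of the difference features is given below. Equation (15) is not reproduced in this text, so the sketch uses the common regression-style difference over a window of ±K frames, with edge frames replicated at the boundaries. Both the formula and the boundary handling are assumptions, and the function name is hypothetical. Applying the same function to its own output gives the second-order difference.

```python
def delta(features, k_max=2):
    """Regression-style difference of per-frame feature vectors.

    features: list of frames, each a list of coefficients.
    k_max:    window half-width K; frames beyond the edges are replicated.
    """
    t_total = len(features)
    denom = 2 * sum(k * k for k in range(1, k_max + 1))
    out = []
    for t in range(t_total):
        row = []
        for c in range(len(features[0])):
            acc = 0.0
            for k in range(1, k_max + 1):
                nxt = features[min(t + k, t_total - 1)][c]
                prv = features[max(t - k, 0)][c]
                acc += k * (nxt - prv)
            row.append(acc / denom)
        out.append(row)
    return out


# Demo: a linear ramp has a constant slope in the interior and zero curvature.
ramp = [[0.0], [1.0], [2.0], [3.0], [4.0]]
d1 = delta(ramp, k_max=1)   # first-order difference
d2 = delta(d1, k_max=1)     # second-order difference
```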

[0108] To demonstrate the MFCC features of speech information more intuitively, Figure 3 presents the MFCC features, first-order MFCC features, and second-order MFCC features of speech information, as well as the Mel cepstrum, first-order Mel cepstrum, and second-order Mel cepstrum. The first-order and second-order MFCC features are calculated from the MFCC features using equation (15), while the Mel cepstrum is obtained by compressing the MFCC features. Using the Mel cepstrum, first-order Mel cepstrum, and second-order Mel cepstrum as three channels, analogous to the RGB channels of an image, allows the temporal and spatial features of the speech information to be extracted effectively: the speech MFCC captures the spatial features of the speech information, and its first-order and second-order differences capture the temporal features.

[0109] Execute S2: Align the extracted speech keyframes and facial expression keyframes, and calculate the speech weights and facial expression weights in the weighted feature fusion by leveraging the complementarity of speech and facial expression modalities;

[0110] This step specifically includes:

[0111] After aligning the speech keyframes and facial expression keyframes, determine whether the speech keyframes and facial expression keyframes in the aligned data frame are valid keyframes containing emotional features.

[0112] If both the speech keyframe and the facial expression keyframe in the aligned data frame are valid keyframes, then the weights of both the speech keyframe and the facial expression keyframe are set to 1.

[0113] If the speech keyframe in the alignment data frame is a valid keyframe and the facial expression keyframe is an invalid keyframe, then the weight of the corresponding facial expression keyframe in the alignment data frame is set to 0, and the weight of the corresponding speech keyframe in the alignment data frame is set to 2.

[0114] If the facial expression keyframe in the alignment data frame is a valid keyframe and the speech keyframe is an invalid keyframe, then the weight of the corresponding speech keyframe in the alignment data frame is set to 0, and the weight of the corresponding facial expression keyframe in the alignment data frame is set to 2.

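[0115] The total number of aligned data frames in which the speech keyframe is a valid keyframe and the facial expression keyframe is an invalid keyframe is counted, as is the total number of aligned data frames in which the facial expression keyframe is a valid keyframe and the speech keyframe is an invalid keyframe.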

[0116] Based on the total number of facial expression keyframes, the total number of speech keyframes, the total number of aligned data frames where facial expression keyframes are valid and speech keyframes are invalid, and the total number of aligned data frames where speech keyframes are valid and facial expression keyframes are invalid, the speech weight and facial expression weight in the weighted feature fusion are calculated.

[0117] Specifically, the speech keyframes S_k and the facial expression keyframes V_k are aligned; each aligned data frame falls into one of three cases:

[0118] 1) Voice information is not a valid keyframe, but facial expression information is a valid keyframe. The number of frames in this case is defined as P.

[0119] 2) Voice information is a valid keyframe, but facial expression information is not. The number of frames in this case is defined as Q.

[0120] 3) Voice information is a valid keyframe, and facial expression information is a valid keyframe. The number of frames in this case is defined as M.
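The three cases and the per-frame weight rule of step S2 can be sketched as below. This is a minimal illustration: the input representation (one pair of validity flags per aligned data frame) and the names `frame_weights` and `count_cases` are assumptions, and the handling of a frame in which neither modality is valid is not specified in the text, so the sketch simply assigns zero to both. The fused weights W_s and W_v of equations (7) and (8) are not computed here because those equations are not reproduced in this text.

```python
from typing import List, Tuple


def frame_weights(speech_valid: bool, face_valid: bool) -> Tuple[int, int]:
    """Return (speech weight, facial expression weight) for one aligned frame."""
    if speech_valid and face_valid:
        return 1, 1  # case 3): both modalities carry emotion
    if speech_valid:
        return 2, 0  # case 2): the weight moves to speech
    if face_valid:
        return 0, 2  # case 1): the weight moves to the facial expression
    return 0, 0      # assumption: not covered by the description


def count_cases(aligned: List[Tuple[bool, bool]]) -> Tuple[int, int, int]:
    """Count P, Q, M over (speech_valid, face_valid) pairs."""
    P = sum(1 for s, v in aligned if v and not s)  # face valid, speech invalid
    Q = sum(1 for s, v in aligned if s and not v)  # speech valid, face invalid
    M = sum(1 for s, v in aligned if s and v)      # both valid
    return P, Q, M
```

For example, `count_cases([(True, True), (True, False), (False, True), (False, True)])` gives P = 2, Q = 1, M = 1.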

[0121] Case 1) should be emphasized. For example, when a person is too surprised to speak, the corresponding speech keyframe is not a valid keyframe, but the facial expression carries more emotional information. The facial expression information in such a frame is particularly important for multimodal emotion recognition and needs to be given priority. Similarly, case 2) also needs to be given priority. In feature layer fusion, the high-order emotional features of speech and facial expressions are generally concatenated in series or in parallel, which means that both have a weight of 1 in the fusion. Such fusion does not give priority to cases 1) and 2), which degrades the performance of multimodal emotion recognition. Therefore, this invention uses weighted feature fusion to give priority to cases 1) and 2); the specific principle is as follows:

[0122] First, the weights of both speech keyframes and facial expression keyframes are set to 1, because the high-order emotional features of speech and facial expressions both have a weight of 1 in feature layer fusion. Second, in case 1) the frame carries important emotional information, but although the speech and facial expression keyframes both have a weight of 1, the speech information in this frame is not a valid keyframe. To handle this situation, the complementarity between the speech and facial expression modalities is exploited: the weight of the speech keyframe is transferred to the facial expression keyframe, so that the weight of the facial expression keyframe (the valid keyframe) in case 1) is 2. Similarly, the weight of the speech keyframe in case 2) is set to 2. By increasing the weights in cases 1) and 2), this approach focuses on the important emotional information and effectively improves the performance of multimodal emotion recognition. Finally, based on the above analysis, the weights of speech and facial expressions are defined as follows:

[0123]

[0124]

[0125] Here, W_s denotes the speech weight in weighted feature fusion, W_v denotes the facial expression weight in weighted feature fusion, size(·) denotes the length of a vector, C_v is a vector that stores the position parameters of the facial expression keyframes, size(C_v) is the total number of facial expression keyframes, C_s is a vector that stores the position parameters of the speech keyframes, size(C_s) is the total number of speech keyframes, P is the total number of aligned data frames in which the facial expression keyframe is valid and the speech keyframe is invalid, and Q is the total number of aligned data frames in which the speech keyframe is valid and the facial expression keyframe is invalid.

[0126] As can be seen from equations (7) and (8), the more frames P and Q there are in cases 1) and 2), the greater the weight of speech and facial expression, which is more conducive to multimodal emotion recognition.

[0127] Execute S3: Use the trained residual network to extract high-order emotional features of speech from the speech MFCC features, and use the trained 3D-CNN neural network to extract high-order emotional features of facial expressions from the facial expression keyframes.

[0128] Specifically, as shown in Figure 2, the residual network extracts high-order emotional features from speech. It consists of five residual blocks with the same structure (RB1-RB2-RB3-RB4-RB5) and two fully connected layers (FC6-FC7). Each residual block contains two convolutional layers (CL1-CL2), three ReLU activation function layers, and two normalization layers (BN). The specific parameters are shown in Table 1. Furthermore, the fully connected layer FC7 has 2048 neurons, indicating that the high-order emotional features output by the residual network are...

[0129] Table 1 Parameters of the residual network

[0130]

[0131] The training process of the residual network is as follows: First, a softmax layer is added after the FC7 layer of the residual network to classify the six emotions. Then, the parameters of the residual network are updated by optimizing the following formula using the backpropagation algorithm:

[0132]

[0133] Here, K denotes the number of speech training samples, S_i denotes the i-th speech training sample, θ_s denotes the parameters of the residual network, y_i denotes the label, γ_s(S_i; θ_s) denotes the high-order emotional feature output by the FC7 layer of the residual network, and softmax(γ_s(S_i; θ_s)) denotes the classification result output by the softmax layer. The loss function is defined as follows:

[0134]

[0135] Here, the predicted output is softmax(γ_s(S_i; θ_s)) from expression (16); l denotes the number of emotion categories (anger, disgust, fear, sadness, surprise, and happiness) and is set to 6; y_{i,j} and the corresponding predicted value denote, respectively, the label of the j-th emotion category for the speech training sample and the classification result output by the network.
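The training objective can be illustrated with a short sketch. Expressions (16) and (17) are not reproduced in this text, so the sketch assumes the standard softmax cross-entropy loss summed over the K training samples with one-hot labels; the filing's exact form (for example, any averaging or regularization term) may differ. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over the FC7-derived class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: Sequence[float], one_hot: Sequence[float]) -> float:
    """Assumed per-sample loss: -sum_j y_j * log(p_j) over the l categories."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y > 0)


def batch_loss(batch_logits: Sequence[Sequence[float]],
               batch_labels: Sequence[Sequence[float]]) -> float:
    """Sum of per-sample losses over the K training samples."""
    return sum(cross_entropy(softmax(z), y)
               for z, y in zip(batch_logits, batch_labels))
```

With six categories and uniform scores, the per-sample loss equals ln 6, which is the value an untrained classifier would be expected to start near.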

[0136] Furthermore, as shown in Figure 2, a 3D-CNN is used to extract high-order emotional features from facial expressions. Originally designed for action recognition, this deep learning network has also been applied to video classification and multimodal emotion recognition because its three-dimensional convolution operations effectively extract the temporal and spatial features of emotional videos. The 3D-CNN contains seven convolutional layers (CL1-CL2-CL3-CL4a-CL4b-CL5a-CL5b), five pooling layers (Pool1-Pool2-Pool3-Pool4-Pool5), and two fully connected layers (FC6-FC7), with specific parameters shown in Table 2. The FC7 layer of the 3D-CNN has 2048 neurons, indicating that the dimensionality of the high-order emotional features output by the 3D-CNN is [missing information - likely a dimensionality value]. The training process of the 3D-CNN is similar to that of the residual network: it is only necessary to replace the speech training samples and speech labels in equations (16) and (17) with facial expression training samples and facial expression labels.

[0137] Table 2 Parameter Table for 3D-CNN

[0138]

[0139] Steps S4-S5: Speech and facial expression weights are used to weight the higher-order emotional features of speech and facial expressions, respectively, to obtain weighted higher-order emotional features of speech and facial expressions. These weighted features are then input into a multilayer perceptron (MLP) for weighted feature fusion and emotion classification.

[0140] As shown in Figure 2, the high-order emotional features of speech and facial expressions output by the FC7 layers of the residual network and the 3D-CNN are weighted using equations (7) and (8), respectively, and are represented as follows:

[0141]

[0142]

[0143] The weighted high-order speech features X_W and facial expression features Y_W are fed into a multilayer perceptron (MLP) to classify the six emotions. Specifically, the FC7 layers of the residual network and the 3D-CNN each contain 2048 neurons, so the input layer of the MLP contains 4096 neurons. The two hidden layers of the MLP contain 2048 neurons and 1024 neurons, respectively. The number of neurons in the output layer of the MLP equals the number of emotion categories, which is 6. Therefore, the structure of the MLP is 4096-2048-1024-6. The training process of the MLP is similar to that of the residual network.
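The dimensions involved in the fusion can be checked with a small sketch. The weighting equations are not reproduced in this text, so the sketch assumes each weight is a scalar that multiplies its 2048-dimensional feature vector before concatenation. The weight values passed in the example are placeholders, not values from the filing, and the parameter count assumes fully connected layers with bias terms.

```python
from typing import List, Sequence

LAYER_SIZES = [4096, 2048, 1024, 6]  # MLP structure stated in the description


def weight_and_concat(x: Sequence[float], y: Sequence[float],
                      w_s: float, w_v: float) -> List[float]:
    """Scale speech features by w_s and expression features by w_v, then concatenate."""
    return [w_s * v for v in x] + [w_v * v for v in y]


def mlp_parameter_count(sizes: Sequence[int] = LAYER_SIZES) -> int:
    """Weights plus biases of consecutive fully connected layers (assumed)."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


if __name__ == "__main__":
    speech_feat = [1.0] * 2048   # stand-in for the residual network FC7 output
    face_feat = [1.0] * 2048     # stand-in for the 3D-CNN FC7 output
    fused = weight_and_concat(speech_feat, face_feat, w_s=1.5, w_v=0.5)
    print(len(fused), mlp_parameter_count())
```

The concatenated vector has 2048 + 2048 = 4096 entries, matching the MLP input layer.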

[0144] This embodiment also includes:

[0145] Step S6: Experimentally verify the facial expression recognition effect of the method in this embodiment.

[0146] First, prepare the dataset.

[0147] The RML dataset is a publicly available dataset for speech and facial expression emotion recognition, collected by the Ryerson Multimedia Lab from subjects of eight different language, accent, and cultural backgrounds. It contains 720 emotional video clips expressing six basic emotions: anger, disgust, fear, sadness, surprise, and happiness. All emotional video clips were filmed in a quiet, bright environment with an audio sampling rate of 22050 Hz and a frame rate of 30 fps.

[0148] The eNTERFACE05 dataset is a publicly available dataset for voice and facial expression emotion recognition, collected from 42 subjects in 14 countries. It contains 1290 video clips expressing six basic emotions: anger, disgust, fear, sadness, surprise, and happiness. All emotional video clips were filmed in a quiet, well-lit environment with a sampling rate of 48000Hz and a frame rate of 25fps.

[0149] The BAUM-1s dataset is a publicly available dataset for speech and facial expression emotion recognition. It contains 1222 emotional video clips, of which only 521 express the six basic emotions: anger, disgust, fear, sadness, surprise, and happiness. Unlike the eNTERFACE05 and RML datasets, the BAUM-1s dataset represents the subjects' self-expression of emotions. The frame rate is 30fps.

[0150] Then, set the experimental parameters:

[0151] The experimental parameters are shown in Table 3. A residual network, a 3D-CNN, and a weighted feature fusion network (MLP) were implemented using PyTorch. The residual network and the 3D-CNN were pre-trained separately, and then the high-order speech and facial expression features output from the FC7 layer were input into the weighted feature fusion network (MLP) for training. To obtain better fusion results, the parameters of the residual network and the 3D-CNN were fixed during MLP training.

[0152] Table 3 Experimental parameters

[0153]

[0154] Perform ablation experiments:

[0155] The ablation experiment results are shown in Table 4. The speech-only and facial-expression-only experiments use the residual network and the 3D-CNN in Figure 2, respectively, to classify the six emotions. The multimodal experiment uses the method of this embodiment to classify the six emotions from speech and facial expressions together. Comparing the recognition rates of the unimodal and multimodal methods shows that multimodal fusion effectively improves the recognition rate of speech and facial expression emotion recognition. This is because emotion is a comprehensive expression that integrates multiple modalities, and these modalities are complementary; multimodal fusion exploits that complementarity.

[0156] Table 4 Ablation Experiment Results

[0157]

[0158]

[0159] Conduct fusion experiments:

[0160] As shown in Table 5, the weighted feature fusion proposed in this invention achieves a higher recognition rate than feature layer fusion, average decision fusion, and product decision fusion. Average decision fusion and product decision fusion are both decision layer fusions: the residual network and the 3D-CNN each identify the six emotions separately, and their outputs are then fused using the average rule or the product rule. While these two fusion methods improve the recognition rate compared with single-modal emotion recognition, they do not consider the complementarity between the speech and facial expression modalities, resulting in a lower recognition rate. Feature layer fusion uses the method shown in Figure 2 to identify the six emotions, but without weighted fusion. It therefore fails to process the key information in cases 1) and 2) of step S2, resulting in a low recognition rate. In contrast, the method of this invention builds on feature layer fusion and uses weighting to focus on the key information in cases 1) and 2), effectively improving the recognition rate of speech and facial expression emotion recognition.
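For reference, the two decision layer baselines can be sketched as follows. This is a generic illustration of the average rule and the product rule applied to the class probabilities of the two unimodal classifiers, not the code used in the experiments; renormalizing the product is a presentation choice that does not change the predicted class. The function names are hypothetical.

```python
from typing import List, Sequence


def average_fusion(p_speech: Sequence[float], p_face: Sequence[float]) -> List[float]:
    """Average rule: mean of the two per-class probabilities."""
    return [(a + b) / 2.0 for a, b in zip(p_speech, p_face)]


def product_fusion(p_speech: Sequence[float], p_face: Sequence[float]) -> List[float]:
    """Product rule: elementwise product, renormalized to sum to 1."""
    prod = [a * b for a, b in zip(p_speech, p_face)]
    total = sum(prod)
    return [v / total for v in prod]


def predict(scores: Sequence[float]) -> int:
    """Index of the highest fused score."""
    return max(range(len(scores)), key=scores.__getitem__)
```

Both rules operate only on the final class probabilities, so neither can emphasize frames in which one modality carries the emotion and the other does not, which is the limitation the paragraph above attributes to decision layer fusion.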

[0161] Table 5. Results of the fusion experiment

[0162]

[0163]

[0164] Finally, experiments were conducted for each emotion category.

[0165] The experimental results of the method in this embodiment for each emotion category on the three datasets are shown in Figures 4-6, respectively.

[0166] As can be seen from Figures 4-6, high-intensity emotion categories such as anger, happiness, and surprise are easier to identify.

[0167] The various embodiments of the present invention have been described above. These descriptions are exemplary and not exhaustive, nor are they limited to the disclosed embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described embodiments.

Claims

1. A speech and facial expression multimodal emotion recognition method based on weighted feature fusion, characterized in that it comprises:
extracting speech information and facial expression information from emotional videos, performing frame segmentation, extracting speech keyframes and facial expression keyframes, and extracting MFCC features of the speech keyframes;
aligning the extracted speech keyframes and facial expression keyframes, and calculating the speech weight and the facial expression weight in the weighted feature fusion by leveraging the complementarity of the speech and facial expression modalities;
wherein calculating the speech weight and the facial expression weight in the weighted feature fusion by leveraging the complementarity of the speech and facial expression modalities comprises: counting the total number of aligned data frames in which the speech keyframe is valid and the facial expression keyframe is invalid, and the total number of aligned data frames in which the facial expression keyframe is valid and the speech keyframe is invalid; and calculating the speech weight and the facial expression weight in the weighted feature fusion based on the total number of facial expression keyframes, the total number of speech keyframes, the total number of aligned data frames in which the facial expression keyframe is valid and the speech keyframe is invalid, and the total number of aligned data frames in which the speech keyframe is valid and the facial expression keyframe is invalid;
the formula for calculating the speech weight in the weighted feature fusion is as follows:
the formula for calculating the facial expression weight in the weighted feature fusion is as follows:
where W_s is the speech weight in the weighted feature fusion, W_v is the facial expression weight in the weighted feature fusion, size(·) denotes the length of a vector, C_v is a vector that stores the position parameters of the facial expression keyframes, size(C_v) is the total number of facial expression keyframes, C_s is a vector that stores the position parameters of the speech keyframes, size(C_s) is the total number of speech keyframes, P is the total number of aligned data frames in which the facial expression keyframe is valid and the speech keyframe is invalid, and Q is the total number of aligned data frames in which the speech keyframe is valid and the facial expression keyframe is invalid;
extracting the high-order emotional features of speech from the speech MFCC features using the trained residual network, and extracting the high-order emotional features of facial expressions from the facial expression keyframes using the trained 3D-CNN neural network;
weighting the high-order emotional features of speech and the high-order emotional features of facial expressions using the speech weight and the facial expression weight, respectively, to obtain the weighted high-order emotional features of speech and the weighted high-order emotional features of facial expressions, the weighting being calculated using the following formula:
where X_W is the weighted high-order emotional features of speech, Y_W is the weighted high-order emotional features of facial expressions, W_s is the speech weight, W_v is the facial expression weight, and the remaining symbols denote the high-order emotional features of speech output by the residual network and the high-order emotional features of facial expressions output by the 3D-CNN;
inputting the weighted high-order emotional features of speech and the weighted high-order emotional features of facial expressions into a multilayer perceptron (MLP) for weighted feature fusion and emotion classification.

2. The emotion recognition method according to claim 1, characterized in that extracting the speech keyframes comprises: using a speech VAD algorithm to extract keyframes from the speech information.

3. The emotion recognition method according to claim 1, characterized in that extracting the facial expression keyframes comprises: extracting the facial region from the facial expression frames, and using information entropy to extract the facial expression keyframes.

4. The emotion recognition method according to claim 1, characterized in that extracting the MFCC features of the speech keyframes comprises: performing compensation preprocessing on the high-frequency components of the speech information; converting the preprocessed speech information into the frequency domain using the FFT to obtain the spectrum of the speech information; processing the spectrum of the speech information with a Mel filter bank to obtain the Mel spectrum of the speech information; performing frequency analysis on the Mel spectrum of the speech information to obtain the Mel cepstrum of the speech information; calculating the MFCC features of the speech information from the Mel cepstrum using the discrete cosine transform; and calculating the first-order and second-order differences of the MFCC features of the speech information.

5. The emotion recognition method according to claim 1, characterized in that the residual network comprises five residual blocks with the same structure connected in sequence and two fully connected layers; each residual block comprises two convolutional layers, three ReLU activation function layers, and two normalization layers connected in sequence; and the last fully connected layer has 2048 neurons.

6. The emotion recognition method according to claim 5, characterized in that the 3D-CNN neural network comprises seven convolutional layers, five pooling layers, and two fully connected layers connected in sequence, and the last fully connected layer has 2048 neurons.

7. The emotion recognition method according to claim 1, characterized in that the multilayer perceptron (MLP) comprises an input layer, two hidden layers, and an output layer; the input layer contains 4096 neurons, the two hidden layers contain 2048 neurons and 1024 neurons, respectively, and the number of neurons in the output layer is equal to the number of emotion categories.