A dual-stream feature fusion facial expression recognition method and system based on enhanced ViT
By combining a local feature extraction module and an RKD-MSA module with a CB module for facial expression recognition, the accuracy problem of facial expression recognition models under occlusion and pose changes is solved, achieving high-precision and robust facial expression recognition.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HUAZHONG NORMAL UNIV
- Filing Date
- 2024-05-22
- Publication Date
- 2026-06-30
AI Technical Summary
During the training process, some images used by the facial expression recognition model suffer from partial occlusion, changes in lighting, and head posture, resulting in low recognition accuracy.
We employ a local feature extraction module to capture subtle differences, use a self-attention mechanism with random key discarding (RKD-MSA) and a context propagation module (CB) to enhance the ViT layer, combine global and local feature fusion, and enhance the mutual influence between feature vectors through nonlinear transformation and dynamic weight allocation.
It improves the accuracy and robustness of facial expression recognition, especially its ability to capture detailed facial expression changes in complex environments, enhances the model's resistance to occlusion and noise interference, and maintains high accuracy and stability.
Figure CN118470776B_ABST
Description
Technical Field
[0001] This invention belongs to the field of image classification in computer vision, specifically relating to a dual-stream feature fusion facial expression recognition method and system based on enhanced ViT.
Background Technology
[0002] Facial expression recognition technology refers to the use of computers to identify and judge the facial expressions of a person in order to determine their emotional state. This technology has wide applications in fields such as computer vision, human-computer interaction, and psychology. As computer technology has developed, facial expression recognition technology has gradually been introduced into the computer field.
[0003] With the rapid development of computer hardware and software technologies, modern facial expression recognition technology has made great progress, and the accurate learning of local facial features (such as subtle dynamic changes in eyebrows, eyes, and mouth) has been verified as a key strategy to enhance model performance in uncontrolled environments. Past research has typically relied on accurate facial landmarks. For example, in papers such as POSTER and POSTER++, researchers used the MobileFaceNet network to detect facial landmarks as a supplement to local information. While these models achieved good results, facial landmark detection has a prominent problem: in real-world environments, it is highly susceptible to interference from factors such as strong lighting, significant pose changes, or severe occlusion, leading to missing landmarks or incorrect detection and localization. This weakens the accuracy of the relationships between facial landmarks, further affecting the accuracy and efficiency of expression recognition.
Summary of the Invention
[0004] The main technical problem solved by this invention is that, during the training process of a facial expression recognition model, some images suffer from partial occlusion, changes in lighting, and changes in head posture, resulting in low recognition accuracy for some images after the model has completed training.
[0005] In view of these problems, it is crucial that this invention design a method capable of intelligently capturing key facial regions and suppressing irrelevant information when necessary. Such a method will bring new flexibility and robustness to the field of facial expression recognition, enabling it to maintain accurate recognition capabilities even in the face of challenges such as pose changes and occlusion. Specifically, it can be divided into three steps:
[0006] First, a novel local feature extraction module is proposed to capture subtle differences in facial expressions.
[0007] Secondly, considering that traditional self-attention mechanisms tend to over-rely on certain specific attention heads, previous studies have typically used random image patch discarding as a processing method. For example, TransFER proposed MAD, which sets the output of a selected attention head to zero with probability p1 to incentivize the model to utilize other attention heads to capture the diversity of the input data. However, in this invention, the features input to ViT are already low-dimensional features after convolutional dimensionality reduction, while the key vector plays a decisive role in the self-attention mechanism. Therefore, RKD-MSA is proposed, which uses the key as a discarding unit without structured discarding. This aims to break the model's fixed dependence on specific features, force the model to learn to assign higher weights to the remaining key vectors, and increase its robustness to occlusion and noise.
[0008] Finally, while the previous step allows the model to better focus on other unremoved vectors, it may result in information loss and perturbations in attention distribution. Therefore, the model introduces a CB module, which applies a carefully designed nonlinear transformation and dynamic weight allocation to each feature vector in the MLP output to enhance the interaction between feature vectors and the overall representational ability, thereby enabling the model to more accurately distinguish and identify different emotional states.
[0009] The facial expression recognition technology solution based on enhanced ViT dual-stream feature fusion provided by this invention is as follows:
[0010] Step 1: Obtain the training dataset of facial expression images and perform preprocessing;
[0011] Step 2: Construct an expression recognition neural network model, including a global feature extraction layer, a local feature extraction layer, a feature fusion layer, an enhanced ViT layer, and an output layer;
[0012] The global feature extraction layer is used to process the preprocessed facial expression images to obtain a global feature matrix;
[0013] The local feature extraction layer extracts local features from the preprocessed facial expression image using a local feature extraction algorithm to obtain a local feature matrix;
[0014] The feature fusion layer is used to align the two types of feature matrices in terms of dimensions. The feature dimension parameters are predefined. The two types of feature sub-blocks are aligned and concatenated in terms of feature dimensions through the convolutional layer. After positional encoding, the fused feature matrix is obtained.
[0015] The fused feature matrix is input into the enhanced ViT layer. In place of the multi-head attention mechanism of the original ViT, this layer uses a self-attention mechanism that randomly discards keys to calculate the attention weight matrix. A context broadcasting (CB) module is introduced at the end of the MLP to better learn the long-distance relationships of the feature matrix. Finally, the final expression probability output is obtained through the classification head.
[0016] Step 3: Use the dataset from Step 1 to train the facial expression recognition neural network model from Step 2. During the training process, a loss function is used to calculate the loss value between the predicted facial expression label and the real label.
[0017] Step 4: Use the trained facial expression recognition neural network model to achieve facial expression recognition.
[0018] Furthermore, the preprocessing includes: uniformly scaling and standardizing the original expression images, and performing data augmentation on each of them. The data augmentation methods include random horizontal flipping, random vertical flipping, random addition of Gaussian noise, and random removal of image regions.
[0019] Furthermore, the global feature extraction layer uses an IR50 model trained on the Ms-Celeb-1M dataset.
[0020] Furthermore, in the local feature extraction layer, the input image first passes through a convolutional layer that extracts low-level feature maps, giving the global feature map $F_{conv1}$. It then enters the max pooling layer to obtain $F_{maxpool}$. After that, it passes in sequence through the quad convolutional layer, the channel-spatial attention layer, and the inverted residual block layer to further enhance and extract local features.
[0021] Furthermore, the quad convolutional layer spatially segments the input feature map, generating four feature sub-blocks $F_i$ of size $H_i \times W_i \times C_i$, where $i \in \{1,2,3,4\}$ and $H_i$, $W_i$, $C_i$ represent the height, width, and number of channels of each feature sub-block, respectively. Each feature sub-block corresponds to a major part of the face. Each sub-block is refined through two cascaded 3×3 depthwise convolutional layers, each followed by batch normalization, according to the formula:
[0022] $F_i = \mathrm{BN2}(\mathrm{D2}(\mathrm{BN1}(\mathrm{D1}(F_i))))$
[0023] Here, D1 and D2 represent the operations of the first and second depthwise convolutional layers, respectively, while BN1 and BN2 represent the batch normalization operations corresponding to the first and second depthwise convolutional layers.
[0024] At this point, the dimensions of each feature sub-block become $H_i' \times W_i' \times C_{local}$, where $H_i'$, $W_i'$, $C_{local}$ represent its height, width, and number of channels, respectively.
[0025] These feature sub-blocks are then concatenated in their original spatial order to obtain the complete feature $F_{qc}$ of size $H_{qc} \times W_{qc} \times C_{qc}$, where $H_{qc}$, $W_{qc}$, $C_{qc}$ represent the height, width, and number of channels of $F_{qc}$, respectively.
[0026] Furthermore, the channel-spatial attention layer includes two parallel branches: a channel attention module and a spatial attention module.
[0027] 1) The channel attention module first performs global average pooling (GAP) on the input feature map $F_{qc}$ to obtain global statistics for each channel.
[0028]
[0029] Then, channel weights are obtained by passing them through a multilayer perceptron network (MLP), which includes one or more hidden layers and finally an output layer.
[0030]
[0031] Here, $W_k$ and $b_k$ represent the weights and biases of the k-th layer, respectively; L represents the total number of MLP layers; σ represents the sigmoid activation function, used to normalize the attention weights to the range (0,1); $z_k$ is an intermediate variable; ReLU is the activation function; and $M_c(F_{qc})$ represents the channel attention heatmap;
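The formulas belonging to paragraphs [0028] and [0030] do not survive in this text. One form that is consistent with the symbol definitions in [0031] is sketched below. It is an assumption rather than the patent's exact expression; in particular the symbol $z_0$ for the pooled statistics and the placement of ReLU on the hidden layers only are illustrative choices.

```latex
z_0 = \mathrm{GAP}(F_{qc}), \qquad
z_k = \mathrm{ReLU}(W_k z_{k-1} + b_k), \quad k = 1, \dots, L-1, \qquad
M_c(F_{qc}) = \sigma(W_L z_{L-1} + b_L)
```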
[0032] 2) The spatial attention module aims to highlight the most important spatial regions in the input feature map. It evaluates the importance of each spatial location through convolutional operations to generate a spatial weight map. Specifically, the input feature map first passes through a 1×1 convolutional layer, followed by a series of 3×3 dilated convolutional layers with dilation rate d. Each convolutional layer is followed by batch normalization and a ReLU activation function. Finally, a 1×1 convolutional layer generates the final spatial attention heatmap $M_s(F_{qc})$;
[0033] 3) Finally, the fusion module merges the channel attention heatmap $M_c(F_{qc})$ and the spatial attention heatmap $M_s(F_{qc})$:
[0034] $M(F_{qc}) = \sigma(M_c(F_{qc}) \odot M_s(F_{qc})) \odot F_{qc}$
[0035] where ⊙ represents element-wise multiplication, σ represents the sigmoid activation function, and $M(F_{qc})$ is the attention heatmap after fusion.
[0036] Furthermore, the inverted residual block layer divides the input feature map into two branches, performs convolution operations on only one of the branches, and then merges the outputs of the two branches. Specifically, the input feature map is $M(F_{qc})$, denoted here by X for simplicity, with dimensions $H \times W \times C_{inp}$. The input feature map is divided equally along the channel dimension into two parts, X1 and X2, each with $C_{inp}/2$ channels;
[0037] X1,X2 = split(X,2)
[0038] Here, "split" represents the segmentation operation, and the second branch contains three key convolution operations:
[0039] 1) 1x1 convolution operation: Apply 1×1 convolution to reduce the number of channels in the feature map, and then perform batch normalization and ReLU activation function;
[0040] X′2=ReLU(BN(Conv 1×1 (X2)))
[0041] 2) 3×3 depthwise separable convolution operation: Spatial filtering is performed using 3×3 depthwise separable convolution, followed by batch normalization and ReLU activation;
[0042] X″2=ReLU(BN(DepthwiseConv 3×3 (X′2)))
[0043] 3) 1x1 convolution operation: Apply 1×1 convolution again to restore the number of channels, and complete feature extraction through batch normalization and ReLU activation function;
[0044] X″′2=ReLU(BN(Conv1×1 (X″2)))
[0045] The processed X2 and the unprocessed X1 are merged through a channel-dimension concatenation operation, forming the intermediate feature map Y′ with dimensions $H \times W \times C_{out}$;
[0046] Y′=concat(X1,X″′2)
[0047] Here, `concat` represents the concatenation operation; the final step is the channel shuffling operation:
[0048] Y = ChannelShuffle(Y′)
[0049] Here, ChannelShuffle represents the channel shuffling operation; the obtained Y is the final local feature matrix $F_{local}$.
[0050] Furthermore, the self-attention mechanism RKD-MSA, which randomly discards keys, generates a dropout mask after calculating the attention score.
[0051] $attn = Attention + \mathrm{Bernoulli}(m\_r) \times (-1 \times 10^{-12})$
[0052] First, a tensor of the same size as Attention is initialized with every element set to 0.15 and assigned to m_r. Attention is then updated: a random binary tensor is sampled from the Bernoulli distribution parameterized by m_r, the positions where this tensor equals 1 are multiplied by $-1 \times 10^{-12}$, and the result is added to Attention, which reduces the value of Attention at those positions;
[0053] Finally, the elements in the dropout mask attn that are related to the Key are set to 0, indicating that these attention weights are discarded.
[0054] Furthermore, the loss function used in step 3 includes the cross-entropy loss function $L_{CE}$ and the label smoothing loss function $L_{LS}$. In the cross-entropy loss function, n represents the number of samples. The label smoothing loss function is defined in terms of the true value y, the predicted values, the smoothing parameter α, and the KL divergence (KLDivergence).
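The two loss terms can be illustrated with a small dependency-free sketch. The exact formulas are not reproduced in this text, so everything below is an assumption: cross-entropy is taken as the negative log-probability of the true class averaged over n samples, the label smoothing term is taken as the KL divergence between a smoothed target distribution and the prediction, the two terms are simply summed, and α = 0.1 is a placeholder value.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    """Negative log-probability of the true class for one sample."""
    return -math.log(probs[target])

def label_smoothing_loss(probs, target, alpha=0.1):
    """KL divergence between the smoothed target and the prediction (assumed form)."""
    k = len(probs)
    smooth = [alpha / k + (1.0 - alpha if i == target else 0.0) for i in range(k)]
    return sum(q * math.log(q / p) for q, p in zip(smooth, probs) if q > 0.0)

def batch_loss(batch_logits, targets, alpha=0.1):
    """Average of (cross-entropy + label smoothing term) over n samples."""
    n = len(targets)
    total = 0.0
    for logits, t in zip(batch_logits, targets):
        p = softmax(logits)
        total += cross_entropy(p, t) + label_smoothing_loss(p, t, alpha)
    return total / n

# Two samples, seven expression classes (labels 0-6).
example = batch_loss([[2.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
                      [0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 0.0]], [0, 2])
```

With α = 0 the smoothed target collapses to a one-hot vector and the KL term equals the cross-entropy, which is a convenient sanity check on the assumed form.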
[0055] This invention also provides a dual-stream feature fusion facial expression recognition system based on enhanced ViT, comprising the following modules:
[0056] The data acquisition module is used to acquire the training dataset of facial expression images and perform preprocessing.
[0057] The model building module is used to build a neural network model for facial expression recognition, including a global feature extraction layer, a local feature extraction layer, a feature fusion layer, an enhanced ViT layer, and an output layer.
[0058] The global feature extraction layer is used to process the preprocessed facial expression images to obtain a global feature matrix;
[0059] The local feature extraction layer extracts local features from the preprocessed facial expression image using a local feature extraction algorithm to obtain a local feature matrix;
[0060] The feature fusion layer is used to align the two types of feature matrices in terms of dimensions. The feature dimension parameters are predefined. The two types of feature sub-blocks are aligned and concatenated in terms of feature dimensions through the convolutional layer. After positional encoding, the fused feature matrix is obtained.
[0061] The fused feature matrix is input into the enhanced ViT layer. In place of the multi-head attention mechanism of the original ViT, this layer uses a self-attention mechanism that randomly discards keys to calculate the attention weight matrix. A context broadcasting (CB) module is introduced at the end of the MLP to better learn the long-distance relationships of the feature matrix. Finally, the final expression probability output is obtained through the classification head.
[0062] The model training module is used to train the facial expression recognition neural network model in the model building module using the dataset in the data acquisition module. During the training process, a loss function is used to calculate the loss value between the predicted facial expression label and the real label.
[0063] The facial expression recognition module is used to achieve facial expression recognition using a trained facial expression recognition neural network model.
[0064] Compared with the prior art, the present invention has the following beneficial effects:
[0065] (1) Enhanced local feature extraction capability: This invention uses a local feature extraction module to obtain subtle differences in facial expressions, which improves the accuracy of emotion recognition, especially in capturing subtle changes in facial expressions.
[0066] (2) Optimization of self-attention mechanism: Traditional self-attention mechanisms tend to over-rely on specific features. The RKD-MSA proposed in this invention reduces the model's dependence on a single attention head by setting the output of the key vector to zero with probability p1, thereby enhancing the ability to capture the diversity of input data and improving the model's robustness to occlusion and noise.
[0067] (3) Enhanced interaction between feature vectors: The CB module introduced in this invention applies nonlinear transformation and dynamic weight allocation to the MLP output after the self-attention mechanism, which increases the dense interaction between the undiscarded information, improves the interaction between feature vectors and the overall representation ability, and reduces the information loss and attention distribution disturbance that may be caused by RKD-MSA.
[0068] (4) Improved overall performance: This invention combines the advantages of the above technologies, enabling emotion recognition to maintain high accuracy and stability in various environments, especially in the fine recognition of facial expressions in complex situations.
Attached Figure Description
[0069] Figure 1 is a model structure diagram of the facial expression recognition method according to an embodiment of the present invention.
Detailed Implementation
[0070] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments.
[0071] This invention discloses a dual-stream feature fusion facial expression recognition method based on enhanced ViT. The method includes the following steps: screening a facial expression dataset captured in natural environments, preprocessing the original images, and inputting the preprocessed images into a network model. Each image first enters two feature extraction modules in parallel, which extract global and local features respectively. The global feature extraction layer uses an IR-50 network. In the local feature extraction layer, the image first passes through a convolutional layer that extracts low-level feature maps, and then through a max-pooling layer that enhances salient features, reduces model parameters and computational load, and lowers the risk of overfitting. The features then pass in sequence through a quad convolution layer (QuadConv), a channel-spatial attention layer, and an inverted residual block (IR block) layer to further enhance and extract local features. The extracted global and local features are fused and fed into the enhanced ViT, in which a self-attention mechanism with random key dropout (RKD-MSA) is used to calculate attention weights and a CB module is introduced at the end of the MLP. The newly proposed local feature extraction module applies depthwise convolution to key facial regions and uses a channel-spatial modulator to enhance attention, capturing finer local information. The improved ViT with random key dropout enhances the model's robustness in complex environments, while the CB module strengthens the relationships between the keys that are not dropped. Together, these address the low recognition rates caused by inter-class similarity and intra-class variation among expressions, as well as poor robustness to lighting conditions, partial occlusion, and head pose changes. Specifically:
[0072] Referring to Figure 1, the present invention provides a dual-stream feature fusion facial expression recognition method based on enhanced ViT, comprising the following steps:
[0073] S1. Predefine feature dimension parameters, obtain training dataset, and perform image preprocessing; This invention uses the natural environment facial expression datasets RAF-DB and FER+, using numbers 0-6 to represent expressions of happiness, surprise, sadness, anger, disgust, fear, and neutrality, respectively.
[0074] Furthermore, the predefined feature dimension parameter is the feature dimension to which the two feature sub-blocks are aligned during feature alignment. This parameter affects the subsequent position embedding, feature compression and extraction, and the initialization parameters of the fully connected layer. In this invention, the predefined parameter is 256.
[0075] In image preprocessing, the probabilities of random horizontal flipping, random vertical flipping, random addition of Gaussian noise, and random region erasure are set to 0.25, 0.5, 0.25, and 0.5, respectively. The aspect ratios of the randomly erased regions are set to 0.1 and 0.02, respectively. The image size used for image alignment during preprocessing is 224×224, so the input fused feature image and the original feature image are scaled to 224×224. The normalization parameters are those derived from ImageNet: the per-channel RGB means are 0.485, 0.456, and 0.406, and the per-channel standard deviations are 0.229, 0.224, and 0.225. Normalization standardizes the pixel values of each channel to roughly zero mean and unit variance, which accelerates model training and convergence. In data augmentation, random flipping, random noise, and random region erasure enhance the model's robustness to pose changes and local occlusion.
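The normalization step can be made concrete with a short sketch. It assumes 8-bit RGB input that is first rescaled to [0, 1] and then standardized per channel with the ImageNet constants quoted above; the helper names are illustrative only.

```python
# Per-channel ImageNet statistics quoted in the text (R, G, B).
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)

def to_unit(rgb8):
    """Rescale an 8-bit RGB pixel to the [0, 1] range."""
    return tuple(v / 255.0 for v in rgb8)

def normalize_pixel(rgb):
    """Standardize a [0, 1] RGB pixel: subtract the mean, divide by the deviation."""
    return tuple((v - m) / s for v, m, s in zip(rgb, MEAN, STD))

black = normalize_pixel(to_unit((0, 0, 0)))
white = normalize_pixel(to_unit((255, 255, 255)))
centre = normalize_pixel(MEAN)  # a pixel equal to the channel means maps to zero
```

For these constants a black pixel maps to roughly -2.1 in the red channel and a white pixel to roughly 2.6 in the blue channel, so the normalized values are centred on zero but are not confined to a fixed unit interval.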
[0076] Step S2: Construct an expression recognition neural network model, including a global feature extraction layer, a local feature extraction layer, a feature fusion layer, an enhanced ViT layer, and an output layer.
[0077] The global feature extraction layer takes the preprocessed image as input and produces the global feature matrix;
[0078] The local feature extraction layer applies the local feature extraction algorithm to the preprocessed facial expression image to obtain the local feature matrix;
[0079] The feature fusion layer is used to align the two types of feature matrices output by the feature extraction layer. The feature dimension parameters are predefined. The two types of feature matrices are aligned and concatenated in the feature dimension through the convolutional layer. After positional encoding, the fused feature matrix is obtained.
[0080] The fused feature matrix is input into the enhanced ViT layer, which uses the RKD-MSA (Multi-head Self-Attention with Random Key-drop) module instead of the multi-head attention module to calculate the attention weight matrix in the original ViT. A context broadcasting (CB) module is introduced at the end of the MLP to better learn the long-distance relationship of the feature matrix. Finally, the final expression probability output is obtained through the classification head.
[0081] S21. The global feature extraction layer in this invention uses the IR50 model trained on the Ms-Celeb-1M dataset. During training, the training parameters of the IR50 are fine-tuned. The input is a preprocessed image of size 224×224×3. The extracted global feature matrix $F_{global}$ has size $H_{global} \times W_{global} \times C_{global}$, where $H_{global} = W_{global} = 512$ and $C_{global} = 7$. Freezing training parameters effectively reduces the number of parameters during model training. The pre-trained model has good feature extraction capabilities, and after feature extraction, the model can obtain high-level semantic information of the image.
[0082] S22. In this invention, the local feature extraction layer first feeds a preprocessed image of size 224×224×3 into a convolutional layer (Conv) to extract low-level feature maps. This step generates a preliminary global feature map $F_{conv1}$ of size $H_{conv1} \times W_{conv1} \times C_{conv1}$, where $H_{conv1} = W_{conv1} = 112$ and $C_{conv1} = 16$. Next, the feature map passes through a max-pooling layer, which enhances salient features, reduces model parameters and computational load, and mitigates the risk of overfitting. The max-pooling operation yields $F_{maxpool}$, with $H_{maxpool} = W_{maxpool} = 56$. The features are then fed in sequence into the QuadConv layer, the Channel-spatial Modulator layer, and the Inverted Residual Block layer (IR Block) to further enhance and extract local features. Each module is described in detail below.
[0083] (1) The QuadConv performs spatial segmentation on the input feature map, generating four feature sub-blocks $F_i$ of size $H_i \times W_i \times C_i$, where $i \in \{1,2,3,4\}$ and $H_i = W_i = 28$. Each feature sub-block corresponds to a major facial feature, such as the eyes, nose, and mouth. Each sub-block is refined through two cascaded 3×3 depthwise convolutional layers, each followed by batch normalization, according to the formula:
[0084] $F_i = \mathrm{BN2}(\mathrm{D2}(\mathrm{BN1}(\mathrm{D1}(F_i))))$
[0085] Here, D1 and D2 represent the operations of the first and second depthwise convolutional layers, respectively, while BN1 and BN2 represent the batch normalization operations corresponding to those convolutional layers.
[0086] At this point, the dimensions of each feature sub-block become $H_i' \times W_i' \times C_{local}$, where $H_i' = W_i' = 14$ and $C_{local} = 64$. These feature sub-blocks are stitched together in their original spatial order to obtain the complete feature $F_{qc}$ of size $H_{qc} \times W_{qc} \times C_{qc}$, where $H_{qc} = W_{qc} = 28$.
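The shape bookkeeping of QuadConv can be traced with a single-channel sketch in plain Python. The text does not say how each 28×28 sub-block becomes 14×14, so `downsample2` below is only a hypothetical stand-in (stride-2 subsampling) for the two depthwise convolution and batch normalization layers; the function names are illustrative.

```python
def split_quadrants(fmap):
    """Split an H x W grid (list of rows) into four equal quadrants,
    ordered top-left, top-right, bottom-left, bottom-right."""
    h, w = len(fmap), len(fmap[0])
    hh, hw = h // 2, w // 2
    return [
        [row[:hw] for row in fmap[:hh]],
        [row[hw:] for row in fmap[:hh]],
        [row[:hw] for row in fmap[hh:]],
        [row[hw:] for row in fmap[hh:]],
    ]

def downsample2(block):
    """Placeholder for D1/BN1/D2/BN2: keep every second row and column."""
    return [row[::2] for row in block[::2]]

def stitch_quadrants(quads):
    """Reassemble four quadrants in their original spatial order."""
    tl, tr, bl, br = quads
    top = [a + b for a, b in zip(tl, tr)]
    bottom = [a + b for a, b in zip(bl, br)]
    return top + bottom

# One channel of a 56 x 56 map; each cell stores its own (row, col) position.
fmap = [[(r, c) for c in range(56)] for r in range(56)]
quads = split_quadrants(fmap)               # four 28 x 28 sub-blocks
refined = [downsample2(q) for q in quads]   # four 14 x 14 sub-blocks
f_qc = stitch_quadrants(refined)            # 28 x 28 stitched result
```

Because every cell carries its original coordinates, it is easy to check that stitching preserves the spatial order of the four sub-blocks.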
[0087] (2) The Channel-spatial Modulator includes two parallel branches, namely the channel attention module and the spatial attention module.
[0088] 1) The channel attention module first performs global average pooling on the input feature map $F_{qc}$ to obtain global statistics for each channel. The channel weights are then obtained through an MLP network, which includes one or more hidden layers followed by an output layer.
[0089] First, global average pooling is performed on the input feature map $F_{qc}$ to obtain global statistics for each channel.
[0090]
[0091] The channel weights are then obtained by passing these statistics through an MLP network, which includes one or more hidden layers followed by an output layer.
[0092]
[0093] Here, $W_i$ and $b_i$ represent the weights and biases of the i-th layer, respectively; L represents the total number of MLP layers; and σ represents the sigmoid activation function, used to normalize the attention weights to the range (0,1).
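The channel attention branch described above (global average pooling, then an MLP, then a sigmoid) can be sketched without any framework. The layer sizes and weight values below are made up for illustration, and applying ReLU only to the hidden layers is an assumption, since the formulas themselves are not reproduced in this text.

```python
import math

def global_average_pool(fmap):
    """fmap: list of C channels, each an H x W list of rows -> list of C means."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]

def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def channel_attention(fmap, layers):
    """layers: list of (weights, bias); ReLU on hidden layers, sigmoid on the last."""
    z = global_average_pool(fmap)
    for weights, bias in layers[:-1]:
        z = [max(0.0, v) for v in linear(z, weights, bias)]
    weights, bias = layers[-1]
    return [1.0 / (1.0 + math.exp(-v)) for v in linear(z, weights, bias)]

# Two channels of size 2 x 2 and a toy 2 -> 1 -> 2 MLP (illustrative weights).
fmap = [[[1.0, 1.0], [1.0, 1.0]], [[3.0, 3.0], [3.0, 3.0]]]
layers = [([[0.5, 0.5]], [0.0]), ([[1.0], [-1.0]], [0.0, 0.0])]
m_c = channel_attention(fmap, layers)
```

The sigmoid on the output layer is what keeps every channel weight strictly between 0 and 1, as stated in the text.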
[0094] 2) The spatial attention module aims to highlight the most important spatial regions in the input feature map. This is primarily achieved through convolutional operations that evaluate the importance of each spatial location, generating a spatial weight map. Specifically, the input feature map $F_{qc}$ first passes through a 1×1 convolutional layer, followed by a series of 3×3 dilated convolutional layers with dilation rate d. Each convolutional layer is followed by batch normalization and a ReLU activation function. Finally, a 1×1 convolutional layer generates the final spatial attention heatmap $M_s(F_{qc})$.
[0095] 3) Finally, the fusion module merges the channel attention heatmap $M_c(F_{qc})$ and the spatial attention heatmap $M_s(F_{qc})$:
[0096] $M(F_{qc}) = \sigma(M_c(F_{qc}) \odot M_s(F_{qc})) \odot F_{qc}$
[0097] where ⊙ represents element-wise multiplication, σ represents the sigmoid activation function, and $M(F_{qc})$ is the attention heatmap after fusion.
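The fusion formula can be traced on nested lists. The sketch assumes that the channel heatmap holds one value per channel and the spatial heatmap one value per location, and that both are broadcast to the full C×H×W shape before the element-wise product; the text does not spell out the shapes, so this broadcasting is an assumption.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def fuse_attention(fmap, m_c, m_s):
    """fmap: C x H x W nested lists; m_c: C channel scores; m_s: H x W spatial scores.
    Returns sigmoid(m_c * m_s) * fmap with m_c and m_s broadcast to C x H x W."""
    return [
        [
            [sigmoid(m_c[c] * m_s[i][j]) * fmap[c][i][j]
             for j in range(len(fmap[c][i]))]
            for i in range(len(fmap[c]))
        ]
        for c in range(len(fmap))
    ]

# One channel, one row, two columns; a zero channel score gives a gate of 0.5.
fused = fuse_attention([[[2.0, 4.0]]], [0.0], [[5.0, -3.0]])
```

In this toy case the gate is exactly 0.5 everywhere, so each feature value is halved.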
[0098] (3) The IR Block divides the input feature map into two branches, performs convolution operations on only one branch, and then merges the outputs of the two branches. This design can enhance feature representation without significantly increasing the computational burden. Specifically, the input feature map is $M(F_{qc})$, denoted here by X for simplicity, with dimensions $H \times W \times C_{inp}$. The input feature map is divided equally along the channel dimension into two parts, X1 and X2, each with $C_{inp}/2$ channels. This partitioning allows the module to process two smaller feature sets in parallel, thus reducing computational cost.
[0099] X1,X2 = split(X,2)
[0100] Here, "split" represents the segmentation operation, and the second branch contains three key convolution operations:
[0101] 1) Pointwise Convolution (1x1 Convolution): First, a 1×1 convolution is applied to reduce the number of channels in the feature map, and then batch normalization and ReLU activation function are performed.
[0102] X′2=ReLU(BN(Conv 1×1 (X2)))
[0103] 2) Depthwise Convolution (3×3 Convolution): Next, spatial filtering is performed using 3×3 depthwise separable convolution, followed by batch normalization and ReLU activation.
[0104] X″2=ReLU(BN(DepthwiseConv 3×3 (X′2)))
[0105] 3) Pointwise Convolution: Finally, a 1×1 convolution is applied again to restore the number of channels, and feature extraction is completed by batch normalization and ReLU activation function.
[0106] X″′2=ReLU(BN(Conv 1×1 (X″2)))
[0107] These three steps constitute a powerful feature extractor that significantly reduces the number of parameters and computational cost while maintaining sufficient model complexity to capture important features.
[0108] The processed X2 and the unprocessed X1 are merged through a channel-dimension concatenation operation, forming the intermediate feature map Y′ with dimensions $H \times W \times C_{out}$. This step ensures that the module output contains both the original feature information and the processed feature information.
[0109] Y′=concat(X1,X″′2)
[0110] Here, `concat` represents the concatenation operation; the final step is a channel shuffling operation, implemented through a special permutation function, designed to increase the cross-information between feature map channels. This effectively increases the diversity of feature combinations between network layers, thereby improving the overall performance of the model.
[0111] Y = ChannelShuffle(Y′)
[0112] Here, ChannelShuffle represents the channel shuffling operation; the obtained Y is the final local feature matrix $F_{local}$.
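The split, concatenate, and shuffle flow of the IR Block can be followed with channels represented as simple labels. The convolution branch is passed in as a callable, and the two-group interleaving used for ChannelShuffle is an assumption, since the text describes it only as a special permutation function.

```python
def split_channels(x):
    """Split a list of channels into two equal halves along the channel axis."""
    half = len(x) // 2
    return x[:half], x[half:]

def channel_shuffle(channels, groups=2):
    """Interleave channels across groups: view the list as (groups, n // groups),
    transpose, and flatten. The channel count must be divisible by groups."""
    per_group = len(channels) // groups
    return [channels[g * per_group + i]
            for i in range(per_group) for g in range(groups)]

def inverted_residual_block(x, branch):
    """x: list of channels; branch: callable applied to the second half only."""
    x1, x2 = split_channels(x)
    y_prime = x1 + branch(x2)  # concatenation along the channel axis
    return channel_shuffle(y_prime, groups=2)

# Channels as labels; upper-casing stands in for the three convolution steps.
x = ["a0", "a1", "a2", "b0", "b1", "b2"]
y = inverted_residual_block(x, lambda chans: [c.upper() for c in chans])
```

After the shuffle, untouched and processed channels alternate, which is how information from the two branches gets mixed for the next layer.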
[0113] S23. $F_{global}$ and $F_{local}$ are aligned in dimension according to the predefined feature dimension parameter. The two types of feature matrices are aligned and concatenated along the feature dimension through convolutional layers. After positional encoding, the fused feature matrix is obtained.
[0114] S24. Input the fused feature matrix into the enhanced ViT layer. As shown on the right side of Figure 1, the enhanced ViT layer uses the RKD-MSA (Multi-head Self-Attention with Random Key-drop) module instead of the multi-head attention module of the original ViT to calculate the attention weight matrix. A context broadcasting (CB) module is introduced at the end of the MLP to better learn the long-distance relationships of the feature matrix. Finally, the final expression probability output is obtained through a classification head. Specifically:
[0115] The fused feature matrix is then normalized using layer normalization (represented by "norm" in the diagram), and the attention weight matrix is calculated using RKD-MSA. Specifically, the input sequence is first linearly projected into three different spaces to generate query (Q), key (K), and value (V) matrices. These projections are achieved by multiplying by the corresponding weight matrix: Q = ZW Q K = ZW K V = ZW V ;
[0116] Where Z is the input embedding, W Q W K W VThese are the weight matrices for the query, key, and value, respectively. Q, K, and V are further divided into multiple "heads," each calculating an attention score. This score determines the importance of each input in generating the output. Here d k It represents the dimension of the key vector, used to scale the dot product and prevent the gradient vanishing problem.
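For orientation, the projection and scoring steps for a single head can be sketched as follows. This is a minimal sketch that assumes the standard scaled dot-product form softmax(QK^T / sqrt(d_k)) V, which is consistent with the scaling by d_k described above; the helper names `matmul`, `softmax`, and `single_head_attention` are illustrative and not part of the invention.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def single_head_attention(z, w_q, w_k, w_v):
    """Return (weights, output) for one head, assuming softmax(QK^T / sqrt(d_k)) V."""
    q, k, v = matmul(z, w_q), matmul(z, w_k), matmul(z, w_v)  # Q = ZW_Q, K = ZW_K, V = ZW_V
    d_k = len(k[0])                                           # dimension of the key vectors
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]                # each row sums to 1
    return weights, matmul(weights, v)


# Two tokens with two features each; identity projections keep the numbers readable.
identity = [[1.0, 0.0], [0.0, 1.0]]
z = [[1.0, 0.0], [0.0, 1.0]]
weights, out = single_head_attention(z, identity, identity, identity)
```

With these inputs each token attends most strongly to itself, because its query matches its own key best.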
[0117] The outputs computed by each head are concatenated to form a unified output matrix. This concatenated output then passes through a linear layer to integrate the information from the different heads, generating the final output result while maintaining the same input and output dimensions. Unlike traditional MSA, RKD-MSA requires generating a dropout mask after calculating the Attention layer.
[0118] m_r=torch.ones_like(Attention)×0.15
[0119] attn = Attention + Bernoulli(m_r) × (−1×10^{−12})
[0120] The first formula initializes a tensor of the same size as Attention with every value equal to 0.15 and assigns it to m_r; that is, each element of m_r is set to 0.15.
[0121] The second formula describes the update of Attention. First, a random tensor is drawn from the Bernoulli distribution parameterized by m_r. Then, the positions of this random tensor whose value is 1 (determined randomly according to the Bernoulli distribution) are multiplied by −1×10^{−12}, and the result is added to Attention. This reduces the Attention value at those positions.
[0122] Then, the attention weights associated with the Key are discarded: specifically, the elements of the masked attn that are associated with the Key are set to 0, indicating that these attention weights are discarded.
[0123] Next, renormalization is performed: to keep the sum of the attention weights equal to 1, the weights that remain after discarding are renormalized. Specifically, the remaining attention weights are divided by their sum, so that the normalized attention weights still form a probability distribution.
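The discard-and-renormalize behaviour described above can be sketched in plain Python. This sketch follows the zero-and-renormalize description rather than the additive formula, samples the mask per attention element with probability 0.15 as m_r suggests, and adds one assumption that is not in the text: a row whose keys were all dropped is left unchanged, to avoid dividing by zero. The function name `random_key_drop` is illustrative.

```python
import random


def random_key_drop(weights, drop_ratio=0.15, rng=None):
    """Zero a random subset of attention weights in each row, then renormalize."""
    if rng is None:
        rng = random.Random()
    result = []
    for row in weights:
        # Bernoulli mask drawn from m_r, where every entry of m_r equals drop_ratio.
        mask = [1 if rng.random() < drop_ratio else 0 for _ in row]
        kept = [0.0 if m else w for w, m in zip(row, mask)]
        total = sum(kept)
        if total == 0.0:
            # Assumption: if every key in the row was dropped, keep the row as it was.
            result.append(list(row))
        else:
            # Divide the remaining weights by their sum so the row sums to 1 again.
            result.append([w / total for w in kept])
    return result


attention = [[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]]
dropped = random_key_drop(attention, rng=random.Random(0))
```

Because the mask is resampled on every call, each training step sees a different subset of keys, which is the source of the regularizing effect attributed to RKD-MSA.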
[0124] Next, the result enters the MLP, which consists of two linear transformation layers. The first layer is Z′ = ReLU(ZW1 + b1), where W1 is a D×D_ff matrix, D_ff is the inner-layer dimension, and b1 is the bias. The second layer is Z″ = Z′W2 + b2, where W2 is a D_ff×D matrix and b2 is the bias. The CB module is introduced after the MLP layer, with the following steps:
[0125] 1) Global context calculation of sequences
[0126] The CB module first computes the global context of the sequence, a feature vector that summarizes all tokens in the sequence. This global context is obtained by averaging the feature vectors x_i of the tokens in the sequence, as shown in the following formula:
[0127] CB(X) = (1/N) Σ_{i=1}^{N} x_i
[0128] where N is the total number of tokens in the sequence, x_i is the feature vector of the i-th token, and CB(X) is a vector of dimension d representing the average feature of the entire sequence. This step is a simple arithmetic average, but it plays a crucial role in capturing global information.
[0129] 2) Local broadcasting of context information
[0130] After obtaining the global context representing the entire sequence, the CB module combines this context vector with each individual token x_i in the sequence. This process, called context broadcasting, ensures that each token retains its original local characteristics while incorporating global information. Each broadcast token CB(x_i) is calculated as follows:
[0131] CB(x_i) = x_i + CB(X)
[0132] Here, the original token x_i is added to the average vector CB(X) of the entire sequence to obtain a new representation vector CB(x_i). This operation not only preserves local information but also incorporates global context, thereby enhancing the model's understanding of each token in the sequence.
[0133] It's also worth noting that the CB module is placed after the MLP layer. This is because, through uniform attention, the weights in the preceding MSA and MLP blocks can be updated using gradient signals. If it were placed before the MLP, subsequent weights in the corresponding MLP block would not receive gradient signals during training.
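The two MLP layers followed by context broadcasting can be sketched as below. This is a minimal plain-Python sketch with illustrative helper names (`linear`, `context_broadcast`, `mlp_with_cb`) and toy weights chosen only to make the arithmetic easy to follow; it places CB after the second linear layer, as described above.

```python
def linear(z, w, b):
    """Compute ZW + b for Z given as a list of token rows."""
    return [[sum(x * wij for x, wij in zip(row, col)) + bj
             for col, bj in zip(zip(*w), b)]
            for row in z]


def context_broadcast(tokens):
    """Add the mean token CB(X) to every token: CB(x_i) = x_i + CB(X)."""
    n = len(tokens)
    dim = len(tokens[0])
    mean = [sum(tok[j] for tok in tokens) / n for j in range(dim)]
    return [[tok[j] + mean[j] for j in range(dim)] for tok in tokens]


def mlp_with_cb(z, w1, b1, w2, b2):
    """Two-layer MLP followed by context broadcasting."""
    hidden = [[max(0.0, v) for v in row] for row in linear(z, w1, b1)]  # Z' = ReLU(ZW1 + b1)
    out = linear(hidden, w2, b2)                                        # Z'' = Z'W2 + b2
    return context_broadcast(out)


# Toy sizes: D = 2, D_ff = 3, two tokens.
z = [[1.0, -1.0], [2.0, 0.0]]
w1 = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]   # D x D_ff
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # D_ff x D
b2 = [0.0, 0.0]
print(mlp_with_cb(z, w1, b1, w2, b2))  # [[2.5, 1.0], [3.5, 3.0]]
```

The CB step adds no learnable parameters: it only adds the sequence mean to each token.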
[0134] The final output is fed into the classification head to obtain the probability distribution of each expression.
[0135] Step S3: Use the dataset from Step S1 to train the facial expression recognition neural network model constructed in Step S2. During the training process, the loss function calculates the loss value between the predicted facial expression label and the real label.
[0136] The loss function used in this embodiment includes the cross-entropy loss function L_CE and the label smoothing loss function L_LS. In the formula for the cross-entropy loss function, n represents the number of samples. The label smoothing loss function is computed from the true value y and the predicted value, where α is the smoothing parameter and KLDivergence denotes the KL divergence.
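As a rough illustration of the two loss terms, the sketch below assumes common textbook forms: cross-entropy as the mean negative log-probability of the true class over n samples, and label smoothing as the KL divergence between a smoothed one-hot target, (1 − α)·one-hot + α/K for K classes, and the predicted distribution. These forms, the function names, and the default α = 0.1 are assumptions for illustration and may differ from the formulas of the embodiment. Predicted probabilities are assumed to be strictly positive.

```python
import math


def cross_entropy(probs, labels):
    """Mean negative log-probability of the true class over n samples."""
    n = len(labels)
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / n


def label_smoothing_kl(probs, labels, alpha=0.1):
    """Mean KL(q || p), where q is the smoothed one-hot target (assumed form)."""
    total = 0.0
    for p, y in zip(probs, labels):
        k = len(p)
        # Smoothed target: most of the mass on the true class, alpha spread uniformly.
        q = [(1.0 - alpha) * (1.0 if j == y else 0.0) + alpha / k for j in range(k)]
        total += sum(qj * math.log(qj / pj) for qj, pj in zip(q, p) if qj > 0.0)
    return total / len(labels)


# Two samples, three expression classes; each row is a predicted distribution.
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
labels = [0, 1]
ce = cross_entropy(probs, labels)
ls = label_smoothing_kl(probs, labels, alpha=0.1)
```

How the two terms are weighted when combined is not specified here, so the sketch returns them separately.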
[0137] Step S4: Use the trained facial expression recognition neural network model to achieve facial expression recognition.
[0138] Furthermore, the accuracy of the method of this invention on the RAF-DB and FERPlus datasets was compared with that of other facial expression recognition algorithms: SCN, PSR, RAN, KTN, VTFF, TransFER, Meta-Face2Exp, EAC, and POSTER. The results are shown in Table 1.
[0139] Table 1. Comparison of Experimental Results of Facial Expression Recognition Algorithms
[0140]
[0141]
[0142] As shown in Table 1, the facial expression recognition method proposed in this invention has a higher accuracy than common facial expression recognition algorithms such as SCN, RAN, KTN, VTFF, TransFER, GFFT, EAC, and PF-ViT.
[0143] In summary, the local feature extraction module proposed in this invention can capture subtle differences in facial expressions, enhancing the model's fine-grained recognition capabilities; the RKD-MSA module provides the model with stronger anti-interference capabilities and a better grasp of key information; and the introduction of the CB module enhances the connections between non-discarded key vectors.
[0144] In another embodiment, the present invention also provides a dual-stream feature fusion facial expression recognition system based on enhanced ViT, comprising the following modules:
[0145] The data acquisition module is used to acquire the training dataset of facial expression images and perform preprocessing.
[0146] The model building module is used to build a neural network model for facial expression recognition, including a global feature extraction layer, a local feature extraction layer, a feature fusion layer, an enhanced ViT layer, and an output layer.
[0147] The global feature extraction layer is used to process the preprocessed facial expression images to obtain a global feature matrix;
[0148] The local feature extraction layer extracts local features from the preprocessed facial expression image using a local feature extraction algorithm to obtain a local feature matrix;
[0149] The feature fusion layer is used to align the two types of feature matrices in terms of dimensions. The feature dimension parameters are predefined. The two types of feature sub-blocks are aligned and concatenated in terms of feature dimensions through the convolutional layer. After positional encoding, the fused feature matrix is obtained.
[0150] The fused feature matrix is input into the enhanced ViT layer, which uses a self-attention mechanism with random key discarding, in place of the multi-head attention mechanism of the original ViT, to calculate the attention weight matrix. A context broadcasting module is introduced at the end of the MLP to better learn the long-distance relationships of the feature matrix. Finally, the expression probability output is obtained through the classification head.
[0151] The model training module is used to train the facial expression recognition neural network model in the model building module using the dataset in the data acquisition module. During the training process, a loss function is used to calculate the loss value between the predicted facial expression label and the real label.
[0152] The facial expression recognition module is used to achieve facial expression recognition using a trained facial expression recognition neural network model.
[0153] The specific implementation of each module is the same as that of the corresponding step and is not repeated here.
[0154] The above-described embodiments illustrate several implementation methods of the present invention, and while the descriptions are relatively specific and detailed, they should not be construed as limiting the scope of the invention patent. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this invention patent should be determined by the appended claims.
Claims
1. A dual-stream feature fusion facial expression recognition method based on enhanced ViT, characterized in that it includes the following steps: Step 1: Obtain the training dataset of facial expression images and perform preprocessing; Step 2: Construct a facial expression recognition neural network model, including a global feature extraction layer, a local feature extraction layer, a feature fusion layer, an enhanced ViT layer, and an output layer; The global feature extraction layer is used to process the preprocessed facial expression images to obtain a global feature matrix; The local feature extraction layer extracts local features from the preprocessed facial expression image using a local feature extraction algorithm to obtain a local feature matrix; The feature fusion layer is used to align the two types of feature matrices in terms of dimensions: the feature dimension parameters are predefined, the two types of feature sub-blocks are aligned and concatenated along the feature dimension through the convolutional layer, and after positional encoding the fused feature matrix is obtained; The fused feature matrix is input into the enhanced ViT layer, which uses a self-attention mechanism with random key discarding, in place of the multi-head attention mechanism of the original ViT, to calculate the attention weight matrix; a context broadcasting module is introduced at the end of the MLP to better learn the long-distance relationships of the feature matrix; finally, the expression probability is output through the classification head; Step 3: Use the dataset from Step 1 to train the facial expression recognition neural network model from Step 2; during the training process, a loss function is used to calculate the loss value between the predicted facial expression label and the real label; Step 4: Use the trained facial expression recognition neural network model to achieve facial expression recognition.
2. The dual-stream feature fusion facial expression recognition method based on enhanced ViT as described in claim 1, characterized in that: The preprocessing includes uniformly scaling and standardizing the original expression images and performing data augmentation on each of them. The data augmentation methods include random horizontal flipping, random vertical flipping, random addition of Gaussian noise, and random removal of image regions.
3. The dual-stream feature fusion facial expression recognition method based on enhanced ViT as described in claim 1, characterized in that: The global feature extraction layer uses an IR50 model trained on the Ms-Celeb-1M dataset.
4. The dual-stream feature fusion facial expression recognition method based on enhanced ViT as described in claim 1, characterized in that: In the local feature extraction layer, the input image first passes through a convolutional layer to initially extract low-level feature maps and obtain global feature maps. Then it enters the max pooling layer to obtain Then, it sequentially enters the quartic convolutional layer, the channel-spatial attention layer, and the inverted residual block layer to further enhance and extract local features.
5. The dual-stream feature fusion facial expression recognition method based on enhanced ViT as described in claim 4, characterized in that: The quad convolutional layer spatially segments the input feature map, generating four convolutional layers of size 1.
5. Featured sub-blocks ,in , These represent the height, width, and channel of the feature sub-image, respectively. Each feature sub-block corresponds to a major part of the face. Each sub-block will be refined through two concatenated 3×3 depthwise convolutional layers. Each depthwise convolutional layer is followed by a batch normalization process, the formula of which is: here and These represent the operations of the first and second depthwise convolutional layers, respectively. and This indicates the corresponding batch normalization operation; at this point, the feature image dimension becomes... ,in Let these represent the height, width, and channels of the feature images, respectively. These feature images are then concatenated in their original spatial order to obtain a result of size [size missing]. Complete features ,in Representing features respectively Height, width, and passageway.
6. The dual-stream feature fusion facial expression recognition method based on enhanced ViT as described in claim 5, characterized in that: The channel-space attention layer includes two parallel branches: a channel attention module and a spatial attention module. The channel attention module first processes the input feature map Perform global average pooling To obtain the global statistics for each channel. ; Then, channel weights are obtained by passing them through a multilayer perceptron network (MLP), which includes one or more hidden layers and finally an output layer. in, and They represent the first Layer weights and biases This represents the total number of MLPs. This represents the sigmoid activation function, used to normalize the attention weights to... between, As an intermediate variable, For activation function, Channel attention heatmap ; The spatial attention module aims to highlight the most important spatial regions in the input feature map. It evaluates the importance of each spatial location through convolutional operations, generating a spatial weight map. Specifically, the input feature map first passes through a 1×1 convolutional layer, followed by a series of layers with dilation rates. The system uses 3×3 dilated convolutional layers, each followed by batch normalization and ReLU activation functions; finally, a 1×1 convolutional layer is used to generate the final spatial attention heatmap. ; Finally, the channel attention heatmap is processed by the fusion module. Spatial attention heatmap To merge: in This represents element-wise multiplication. This represents the sigmoid activation function. This is the attention heatmap after fusion.
7. The dual-stream feature fusion facial expression recognition method based on enhanced ViT as described in claim 6, characterized in that: The inverted residual block layer divides the input feature map into two branches, performs a convolution operation on only one branch, and then merges the outputs of the two branches. Specifically, the input feature map is... To simplify the writing, here we use X It indicates that its size is The input feature map is equally divided into two parts along the channel dimension. and The number of channels in each part is ; in, The second branch represents the segmentation operation and contains three key convolution operations: 1) 1x1 convolution operation: Apply 1×1 convolution to reduce the number of channels in the feature map, and then perform batch normalization and ReLU activation function; 2) 3×3 depthwise separable convolution operation: Spatial filtering is performed using 3×3 depthwise separable convolution, followed by batch normalization and ReLU activation; 3) 1x1 convolution operation: Apply 1×1 convolution again to restore the number of channels, and complete feature extraction through batch normalization and ReLU activation function; Processed Compared with unprocessed By merging the channels, a dimension of [dimensional value] is formed. intermediate feature map ; in, This indicates a splicing operation; the final step is a channel shuffling operation. in, This represents the channel shuffling operation; the resulting Y is the final local feature matrix. .
8. The dual-stream feature fusion facial expression recognition method based on enhanced ViT as described in claim 1, characterized in that: The self-attention mechanism RKD-MSA, which randomly discards keys, generates a dropout mask after the attention scores Attention are calculated; First, a tensor of the same size as Attention is initialized with all values equal to 0.15 and assigned to m_r, which means each element of m_r is set to 0.15; Then Attention is updated: a random tensor is first created using the Bernoulli distribution generated from m_r, and the positions in this random tensor whose element value is 1 are multiplied by −1×10^{−12} and the result is added to Attention, which reduces the Attention value at certain positions; Finally, the dropout mask attn is applied: setting the elements at positions related to the Key to 0 indicates that these attention weights are discarded.
9. The dual-stream feature fusion facial expression recognition method based on enhanced ViT as described in claim 1, characterized in that: The loss function used in step 3 includes the cross-entropy loss function L_CE and the label smoothing loss function L_LS. In the formula for the cross-entropy loss function, n represents the number of samples; the label smoothing loss function is computed from the true value y and the predicted value, where α is the smoothing parameter and KLDivergence denotes the KL divergence.
10. A dual-stream feature fusion facial expression recognition system based on enhanced ViT, characterized in that it includes the following modules: The data acquisition module is used to acquire the training dataset of facial expression images and perform preprocessing; The model building module is used to build a neural network model for facial expression recognition, including a global feature extraction layer, a local feature extraction layer, a feature fusion layer, an enhanced ViT layer, and an output layer; The global feature extraction layer is used to process the preprocessed facial expression images to obtain a global feature matrix; The local feature extraction layer extracts local features from the preprocessed facial expression image using a local feature extraction algorithm to obtain a local feature matrix; The feature fusion layer is used to align the two types of feature matrices in terms of dimensions: the feature dimension parameters are predefined, the two types of feature sub-blocks are aligned and concatenated along the feature dimension through the convolutional layer, and after positional encoding the fused feature matrix is obtained; The fused feature matrix is input into the enhanced ViT layer, which uses a self-attention mechanism with random key discarding, in place of the multi-head attention mechanism of the original ViT, to calculate the attention weight matrix; a context broadcasting module is introduced at the end of the MLP to better learn the long-distance relationships of the feature matrix; finally, the expression probability is output through the classification head; The model training module is used to train the facial expression recognition neural network model in the model building module using the dataset in the data acquisition module; during the training process, a loss function is used to calculate the loss value between the predicted facial expression label and the real label;
The facial expression recognition module is used to achieve facial expression recognition using a trained facial expression recognition neural network model.