Micro-expression recognition method and system based on Inception-CBAM+

By using the Inception-CBAM+ deep learning network, combined with optical flow and parallel attention mechanisms, the problem of uneven action information in micro-expression recognition is solved, improving the recognition effect, especially in terms of unweighted F1 score and average recall.

CN116052245B · Active · Publication Date: 2026-06-30 · JIANGSU UNIV OF SCI & TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
JIANGSU UNIV OF SCI & TECH
Filing Date
2022-12-07
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing micro-expression recognition methods suffer from problems such as cumbersome traditional manual feature design, inconsistent distribution of facial shapes and micro-expression movements among individuals, and uneven motion information in facial images, which leads to insufficient attention from deep learning networks and reduced recognition performance.

Method used

We employ the Inception-CBAM+ deep learning network, combining the Inception feature extraction module, the CBAM+ attention mechanism module, and the fully connected classification module. We extract action information using optical flow and use leave-one-out cross-validation for training and testing. We also design parallel channel and spatial attention mechanisms to reduce feature loss.

Benefits of technology

It improves the unweighted F1 score and unweighted average recall of micro-expression recognition, enhances the focus on key regions, and improves the recognition effect, outperforming traditional methods and other deep learning methods.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a micro-expression recognition method and system based on Inception-CBAM+, including: preprocessing the data by detecting and cropping faces with the Dlib algorithm, extracting the optical flow features of each micro-expression segment with the TV-L1 optical flow algorithm, and standardizing them by standard deviation; designing the Inception-CBAM+ deep learning network, which comprises an Inception feature extraction module, a CBAM+ attention mechanism module, and a fully connected classification module; and training and testing with leave-one-out cross-validation. Introducing the Inception module and the CBAM attention mechanism into micro-expression recognition allows multi-scale features to be extracted while favoring features that benefit the target task. Furthermore, the serial CBAM is restructured into the parallel CBAM+, which reduces the feature loss caused by the serial structure and effectively improves the accuracy of micro-expression recognition.

Description

Technical Field

[0001] This invention belongs to the field of computer vision and micro-expression recognition technology, and relates to a micro-expression recognition method and system based on Inception-CBAM+.

Background Technology

[0002] Facial expressions are the most direct way to reveal personal feelings, intentions, and emotional states, and can be divided into macro-expressions and micro-expressions. In terms of time, macro-expressions generally last 0.75s-2s, while micro-expressions generally last 0.04s-0.2s. Spatially, the range of motion in micro-expressions is much smaller than that in macro-expressions. Furthermore, because micro-expressions are spontaneous, meaning they are not under the body's control, they often reflect true emotions more accurately than macro-expressions, and are therefore widely used in lie detection, depression treatment, political psychology, and homeland security. However, due to their small range of motion and short duration, micro-expression recognition faces significant challenges.

[0003] Common micro-expression recognition methods can be divided into traditional methods and deep learning methods. Traditional methods generally rely on hand-crafted features, the most common being Local Binary Patterns from Three Orthogonal Planes (LBP-TOP). This method combines the temporal and spatial features of three orthogonal planes in an image sequence to represent the features of the entire micro-expression video. Because of its low computational complexity, many derivative methods have been proposed, such as Local Binary Pattern with Six Intersection Points (LBP-SIP) and the Spatiotemporal Local Binary Pattern (STLBP). Although traditional methods have steadily improved the accuracy of micro-expression recognition, they still lack efficient feature extraction capabilities. In contrast, Convolutional Neural Networks (CNNs) can automatically capture subtle changes in micro-expressions and achieve higher accuracy than traditional methods. Therefore, many deep learning-based micro-expression recognition methods have emerged in recent years. Research shows that networks that obtain motion features from start frames and peak frames can effectively recognize micro-expressions. For example, the Dual-Inception network takes TV-L1 optical flow as its input and performs micro-expression recognition without any data augmentation. The CapsuleNet network uses data augmentation and builds a capsule network on peak frames. The STSTNet network learns features from three optical flow features.

[0004] Current deep learning-based micro-expression recognition methods tend to employ multi-branch network designs and neglect the uneven distribution of information within the image itself: micro-expressions are concentrated around the eyebrows, mouth, and nose. Ordinary convolutional operations cause the deep learning network to attend to every part of the face, reducing attention to key regions and thus lowering recognition accuracy.

Summary of the Invention

[0005] The purpose of this invention is to overcome the shortcomings of the prior art and provide a micro-expression recognition method and system based on Inception-CBAM+, thereby solving the problems of the prior art: the traditional manual feature design for micro-expression recognition is cumbersome, the distribution of facial shapes and micro-expression movements is inconsistent among individuals, and the movement information in facial images is uneven.

[0006] To solve the above-mentioned technical problems, the present invention adopts the following technical solution.

[0007] A micro-expression recognition method based on Inception-CBAM+ includes the following steps:

[0008] Step 1: Collect micro-expression image datasets and preprocess the images, including cropping faces and extracting facial motion information using optical flow.

[0009] Step 2: Design the Inception-CBAM+ deep learning network, including the Inception feature extraction module, the CBAM+ attention mechanism module, and the fully connected classification module;

[0010] Step 3: Use leave-one-out cross-validation to split the training set and the test set, train on the training set, and test on the test set to obtain the final recognition result;

[0011] Step 4: Analyze the recognition results.

[0012] Specifically, step one includes the following process:

[0013] 1-1. Review the literature on micro-expression recognition in recent years and select the three most commonly used datasets: CASMEII, SAMM, and SMIC;

[0014] 1-2. Use the Dlib method to detect 68 landmarks of the face, determine the four vertices of the rectangle formed by the 68 landmarks, and use the rectangle formed by the four vertices to crop the face out of the original image;

[0015] 1-3. Extract motion information from each micro-expression image segment using the TV-L1 optical flow method: let the face images within the micro-expression segment be f = {f_1, f_2, f_3, ..., f_n}, take the start frame image and the peak frame image {f_onset, f_apex}, and compute the horizontal optical flow field u and the vertical optical flow field v between the start frame and the peak frame with the TV-L1 optical flow method:

[0016] u, v = tvl1(f_onset, f_apex) (1)

[0017] To enrich the information channels of the network input, the optical strain ε is calculated from u and v using the first-order partial derivatives of the optical flow components:

[0018]
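The strain computation can be sketched as follows. This is a minimal illustration, not the patent's own formula: it assumes the optical strain magnitude commonly used in the micro-expression literature, sqrt((∂u/∂x)² + (∂v/∂y)² + ½(∂u/∂y + ∂v/∂x)²), and approximates the partial derivatives with finite differences. The function name `optical_strain` is hypothetical.

```python
import numpy as np

def optical_strain(u, v):
    """Optical strain magnitude from flow fields u and v, each of shape (H, W).

    Assumption: the strain tensor is the symmetric part of the flow gradient,
    and its magnitude is sqrt(e_xx^2 + e_yy^2 + 2 * e_xy^2).
    """
    # np.gradient returns the derivative along axis 0 (y) first, then axis 1 (x).
    du_dy, du_dx = np.gradient(u)
    dv_dy, dv_dx = np.gradient(v)
    shear = 0.5 * (du_dy + dv_dx)  # off-diagonal strain component e_xy
    return np.sqrt(du_dx ** 2 + dv_dy ** 2 + 2.0 * shear ** 2)
```

A pure horizontal stretch (u = x, v = 0) gives a strain of 1 everywhere, while a rigid rotation (u = -y, v = x) gives 0, which is why strain highlights deformation rather than rigid head motion.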

[0019] The horizontal optical flow field u, the vertical optical flow field v, and the optical strain ε are used to construct a three-dimensional matrix x = (u, v, ε), which is then resized to 28×28. Finally, the resized three-dimensional matrix is standardized by standard deviation to generate the input of the Inception-CBAM+ network:

[0020]
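A minimal sketch of assembling the network input, under stated assumptions: the resize is done with nearest-neighbour sampling as a stand-in for whatever interpolation the authors used, and "standardized by standard deviation" is read as per-channel zero-mean, unit-variance scaling. The helper names `resize_nearest` and `build_network_input` are hypothetical.

```python
import numpy as np

def resize_nearest(channel, size=28):
    """Nearest-neighbour resize of an (H, W) map to (size, size); a stand-in resize."""
    h, w = channel.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return channel[np.ix_(rows, cols)]

def build_network_input(u, v, strain, size=28, eps=1e-8):
    """Stack (u, v, strain) into a (3, size, size) array and standardize it."""
    x = np.stack([resize_nearest(c, size) for c in (u, v, strain)])
    # Assumption: each channel is shifted to zero mean and scaled to unit std.
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = x.std(axis=(1, 2), keepdims=True)
    return (x - mean) / (std + eps)
```

The result has the 3×28×28 shape that the text gives for the network input.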

[0021] Specifically, the Inception-CBAM+ deep learning network mentioned in step two includes:

[0022] 2-1. Inception Feature Extraction Module: This module uses a combination of Inception and maxpooling layers to extract multi-scale features and reduce dimensionality of the optical flow map. Inception uses parallel 1×1 convolutions, 3×3 convolutions, 5×5 convolutions, and 3×3 maxpooling layers to calculate four components, which are then concatenated according to their dimensions, as shown in the following formula:

[0023] Inception(x) = concat(f_1×1(x), f_3×3(x), f_5×5(x), f_1×1(maxpool(x))) (4)

[0024] where concat refers to concatenating matrices along their dimensions, and f_1×1, f_3×3, and f_5×5 represent the 1×1, 3×3, and 5×5 convolutions, respectively.

[0025] The Inception feature extraction module stacks two Inception layers, each followed by a maxpooling layer.

[0026] The output feature of the module is given by the following formula:

[0027] feature(x)=maxpool(Inception(maxpool(Inception(x)))) (5)
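Equations (4) and (5) can be sketched in plain NumPy to make the tensor shapes concrete. This is an illustration under assumptions, not the patented network: the branch widths, the random weights, the 2×2 stride-2 pooling between Inception layers, and the absence of biases and activations are all choices of this sketch, and every function name is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_same(x, w):
    """x: (C_in, H, W); w: (C_out, C_in, k, k) with odd k. Zero padding keeps H and W."""
    k = w.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    win = sliding_window_view(xp, (k, k), axis=(1, 2))  # (C_in, H, W, k, k)
    return np.einsum("chwij,ocij->ohw", win, w)

def maxpool(x, k, stride, pad=0):
    """Max pooling over (C, H, W) with a k x k window."""
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)), constant_values=-np.inf)
    win = sliding_window_view(xp, (k, k), axis=(1, 2))[:, ::stride, ::stride]
    return win.max(axis=(-2, -1))

def make_weights(c_in, c_branch, rng):
    """Random weights for the 1x1, 3x3, 5x5 and pooled-1x1 branches (hypothetical widths)."""
    return [rng.normal(size=(c_branch, c_in, k, k)) for k in (1, 3, 5, 1)]

def inception(x, weights):
    """Eq. (4): concatenate the four parallel branches along the channel axis."""
    w1, w3, w5, wp = weights
    branches = [
        conv2d_same(x, w1),
        conv2d_same(x, w3),
        conv2d_same(x, w5),
        conv2d_same(maxpool(x, 3, 1, pad=1), wp),
    ]
    return np.concatenate(branches, axis=0)

def feature(x, weights_a, weights_b):
    """Eq. (5): maxpool(Inception(maxpool(Inception(x))))."""
    x = maxpool(inception(x, weights_a), 2, 2)
    return maxpool(inception(x, weights_b), 2, 2)
```

With a 3×28×28 input and these assumed pooling settings, the spatial size goes 28 → 14 → 7, and the channel count after each Inception layer is four times the branch width.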

[0028] 2-2. CBAM+ attention mechanism module: contains a channel attention mechanism and a spatial attention mechanism arranged in parallel. Given an input feature map F ∈ R^(C×H×W), the channel attention mechanism first uses a max-pooling layer (maxpool) and an average-pooling layer (avgpool) to extract two different spatial context descriptors from the feature map F. Both descriptors are then fed into a weight-shared multilayer perceptron (MLP). This MLP contains two hidden layers: the first reduces the number of neurons to C/r with reduction ratio r, and the second restores the number of neurons to C. Finally, the CBAM+ attention mechanism module adds the two MLP outputs and passes the sum through the sigmoid activation function σ to output the attention weight of each channel. The channel attention weight function M_c(·) can be expressed as:

[0029] M_c(F) = σ(MLP(avgpool(F)) + MLP(maxpool(F))) (6)

[0030] The spatial attention mechanism first applies max pooling and average pooling to the input feature map F along the channel dimension, obtaining two spatial descriptors. The two descriptors are then concatenated along the channel dimension, and a convolution operation generates the final spatial attention feature map. The spatial attention weight function M_s(·) can be expressed as:

[0031]

[0032] Finally, the input feature map F, the feature map weighted by the channel attention output M_c(F), and the feature map weighted by the spatial attention output M_s(F) are summed. The final output M(·) of the CBAM+ attention mechanism module can be expressed as:

[0033] M(F) = F + F·M_c(F) + F·M_s(F) (8)
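The parallel combination in equation (8) can be sketched in NumPy as below. Several details are assumptions borrowed from the original CBAM design rather than taken from this text: a ReLU between the two MLP layers, a sigmoid on the spatial map, and a 7×7 spatial kernel in the usage that follows. All names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(F, W1, W2):
    """Eq. (6): sigmoid(MLP(avgpool(F)) + MLP(maxpool(F))); returns shape (C, 1, 1)."""
    def mlp(d):
        # The same weights serve both descriptors: C -> C/r -> C (ReLU is assumed).
        return W2 @ np.maximum(W1 @ d, 0.0)
    avg = F.mean(axis=(1, 2))
    mx = F.max(axis=(1, 2))
    return sigmoid(mlp(avg) + mlp(mx))[:, None, None]

def spatial_attention(F, kernel):
    """Channel-wise max and mean maps, concatenated then convolved; returns (1, H, W)."""
    desc = np.stack([F.max(axis=0), F.mean(axis=0)])  # (2, H, W)
    k = kernel.shape[-1]
    p = k // 2
    padded = np.pad(desc, ((0, 0), (p, p), (p, p)))
    win = sliding_window_view(padded, (k, k), axis=(1, 2))  # (2, H, W, k, k)
    return sigmoid(np.einsum("chwij,cij->hw", win, kernel))[None]

def cbam_plus(F, W1, W2, kernel):
    """Eq. (8): both attention branches read the same F, then the results are summed."""
    return F + F * channel_attention(F, W1, W2) + F * spatial_attention(F, kernel)
```

In the serial CBAM the second attention branch would be computed from the output of the first, whereas here neither branch sees the other's output, which is the difference the text attributes to CBAM+.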

[0034] 2-3. Fully connected classification module: the output M(F) of the CBAM+ attention mechanism module is flattened into a vector and fed into a fully connected layer to generate a three-dimensional vector. Because the dataset is small, using all neurons directly would cause severe overfitting; therefore, half of the neurons are randomly discarded. Finally, the probability p_i of each class is calculated using the softmax function.
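A minimal sketch of this classification head, assuming that the random discarding is standard inverted dropout with rate 0.5 applied to the flattened vector before the fully connected layer, and only during training. The placement of dropout and the names `softmax` and `classify` are assumptions of this sketch.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def classify(M, W, b, rng=None, drop_p=0.5):
    """Flatten M(F), optionally apply dropout, then FC + softmax.

    M: attention output of shape (C, H, W); W: (3, C*H*W); b: (3,).
    Passing an rng enables training-mode dropout; rng=None means evaluation mode.
    """
    vec = M.reshape(-1)
    if rng is not None:
        keep = rng.random(vec.shape) >= drop_p
        vec = vec * keep / (1.0 - drop_p)  # rescale so the expected activation is unchanged
    return softmax(W @ vec + b)
```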

[0035] Specifically, step three describes the process of using leave-one-out cross-validation to split the training and test sets, training on the training set, and testing on the test set to obtain the final recognition result:

[0036] Using leave-one-out cross-validation, let the set of participants be S = {S_1, S_2, S_3, ..., S_n}, where each subject has several micro-expression samples. In each cross-validation round, the micro-expression samples of one subject are selected as the test set, and the remaining samples are used as the training set. The model is trained n times to obtain the final recognition results, including the unweighted average recall and the unweighted F1 score.
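The subject-wise split described above can be sketched in a few lines. The tuple layout `(subject_id, features, label)` and the function name are hypothetical; the point is only that every sample of the held-out subject goes to the test set.

```python
def leave_one_subject_out(samples):
    """Yield (held_out_subject, train, test) for each subject.

    samples: list of (subject_id, features, label) tuples.
    """
    subjects = sorted({s[0] for s in samples})
    for held_out in subjects:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test
```

Splitting by subject rather than by sample keeps one person's face out of both sets at once, so the test score is not inflated by identity cues.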

[0037] Specifically, the process of analyzing the recognition results described in step four includes:

[0038] Organize the recognition results, draw the confusion matrix, calculate the unweighted average recall and unweighted F1 score, compare them with traditional methods and popular deep learning methods, and plan the next steps based on the recognition results;

[0039] The analysis was conducted on a mixed dataset consisting of the publicly available micro-expression databases CASME II, SAMM, and SMIC. The video samples of CASME II and SAMM were recorded at 200 fps, while those of SMIC were recorded at 100 fps. Samples labeled "other" were removed, and the remaining samples were grouped into three categories: positive (happiness), negative (sadness, fear, and contempt), and surprise. The resulting mixed dataset contains 68 participants and 442 micro-expression samples.

[0040] Unweighted average recall (UAR) and unweighted F1 score (UF1) were used as evaluation metrics. Calculating the unweighted average recall requires the number of true positive samples TP_c in class c and the total number of samples N_c in class c. The calculation formula is:

[0041] UAR = (1/C) · Σ_{c=1}^{C} TP_c / N_c, where C is the number of classes

[0042] Calculating the unweighted F1 score requires the numbers of true positive, false positive, and false negative samples in class c, denoted TP_c, FP_c, and FN_c. The calculation formula is as follows:

[0043] UF1 = (1/C) · Σ_{c=1}^{C} 2·TP_c / (2·TP_c + FP_c + FN_c)
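The two metrics can be sketched as below, assuming their usual definitions: UAR is the mean of the per-class recalls TP_c / N_c, and UF1 is the mean of the per-class F1 scores 2·TP_c / (2·TP_c + FP_c + FN_c). Predictions are assumed to be pooled over all cross-validation folds before calling the hypothetical function `uar_uf1`.

```python
def uar_uf1(y_true, y_pred, num_classes=3):
    """Return (UAR, UF1) for integer class labels in range(num_classes)."""
    pairs = list(zip(y_true, y_pred))
    recalls, f1s = [], []
    for c in range(num_classes):
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        n_c = tp + fn  # total number of samples whose true class is c
        recalls.append(tp / n_c if n_c else 0.0)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(recalls) / num_classes, sum(f1s) / num_classes
```

Because every class contributes equally to the average, a classifier that ignores a rare class is penalized even if its overall accuracy is high.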

[0044] Furthermore, the analysis described in step four is implemented in Python on a Linux platform, using the Ubuntu 18.04 operating system, an NVIDIA GeForce GTX 1080 graphics card, and the PyTorch deep learning framework. The random seed is set to 100, and all analyses are performed under this random seed. The network input is a 3×28×28 three-channel optical flow map. The cross-entropy loss function is used, the initial learning rate is set to 0.001, and cosine annealing is used to decay the learning rate, with a minimum learning rate of 0.0001. The Adam optimizer is used, the maximum number of training epochs is 300, and the batch size is set to the total number of samples in the MEGC2019 dataset.
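The learning-rate schedule can be illustrated with the closed-form cosine annealing curve. This sketch assumes the annealing period equals the 300 training epochs, which the text does not state explicitly; the function name is hypothetical.

```python
import math

def cosine_annealing_lr(epoch, lr_max=1e-3, lr_min=1e-4, t_max=300):
    """Learning rate at a given epoch under cosine annealing from lr_max to lr_min."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * epoch / t_max))
```

The rate starts at 0.001, passes through the midpoint 0.00055 at epoch 150, and reaches the minimum 0.0001 at epoch 300.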

[0045] A micro-expression recognition system based on Inception-CBAM+ includes a data preprocessing module and an Inception-CBAM+ deep learning network unit;

[0046] The data preprocessing module preprocesses the image, including cropping faces and extracting facial motion information using optical flow.

[0047] The Inception-CBAM+ deep learning network unit includes:

[0048] The Inception feature extraction module uses a combination of Inception and maxpooling layers to extract and reduce the dimensionality of optical flow maps at multiple scales. Inception uses parallel 1×1, 3×3, and 5×5 convolutions and 3×3 max pooling for computation, and finally concatenates the four components according to their dimensions, as shown in the following formula:

[0049] Inception(x) = concat(f_1×1(x), f_3×3(x), f_5×5(x), f_1×1(maxpool(x))) (4)

[0050] where concat refers to concatenating matrices along their dimensions, and f_1×1, f_3×3, and f_5×5 represent the 1×1, 3×3, and 5×5 convolutions, respectively.

[0051] The final feature is calculated by stacking two layers of Inception and maxpool, and the formula is as follows:

[0052] feature(x)=maxpool(Inception(maxpool(Inception(x)))) (5)

[0053] CBAM+ attention mechanism module: contains a channel attention mechanism and a spatial attention mechanism arranged in parallel. Given an input feature map F ∈ R^(C×H×W), the channel attention mechanism first uses a max-pooling layer (maxpool) and an average-pooling layer (avgpool) to extract two different spatial context descriptors from the feature map F. Both descriptors are then fed into a weight-shared multilayer perceptron (MLP). This MLP contains two hidden layers: the first reduces the number of neurons to C/r with reduction ratio r, and the second restores the number of neurons to C. Finally, the CBAM+ attention mechanism module sums the two MLP outputs and passes the sum through the sigmoid activation function to output the attention weight of each channel. The channel attention weight function M_c(·) can be expressed as follows, where σ refers to the sigmoid activation function:

[0054] M_c(F) = σ(MLP(avgpool(F)) + MLP(maxpool(F))) (6)

[0055] The spatial attention mechanism first applies max pooling and average pooling to the input feature map F along the channel dimension, obtaining two spatial descriptors. The two descriptors are then concatenated along the channel dimension, and a convolution operation generates the final spatial attention feature map. The spatial attention weight function M_s(·) can be expressed as:

[0056]

[0057] Finally, the input feature map F, the feature map weighted by the channel attention output M_c(F), and the feature map weighted by the spatial attention output M_s(F) are summed. The final output M(·) of the CBAM+ attention mechanism module can be expressed as:

[0058] M(F) = F + F·M_c(F) + F·M_s(F) (8)

[0059] Fully connected classification module: the output M(F) of the CBAM+ attention mechanism module is flattened into a vector and fed into a fully connected layer to generate a three-dimensional vector. Because the dataset is small, using all neurons directly would cause severe overfitting; therefore, half of the neurons are randomly discarded. Finally, the probability p_i of each class is calculated using the softmax function.

[0060] Compared with the prior art, the present invention has the following advantages and beneficial effects:

[0061] 1. This invention constructs a deep learning network that combines Inception and CBAM attention mechanisms. After extracting features using Inception at multiple scales, more comprehensive fused features are obtained. After extracting features using channel attention and spatial attention, more discriminative features are obtained. Experimental verification shows that, compared with methods that do not combine Inception and CBAM attention mechanisms, this invention increases the unweighted F1 score and unweighted average recall by 0.0606 and 0.0676, respectively.

[0062] 2. This invention changes the traditional serial CBAM attention mechanism into the parallel CBAM+ attention mechanism, in which channel attention and spatial attention run in parallel, reducing the loss of micro-expression motion information caused by the serial structure. Experiments have verified that changing the CBAM attention mechanism to a parallel structure improves these two metrics by 0.0037 and 0.001, respectively.

[0063] 3. Comparative experiments were designed on the MEGC2019 mixed dataset. The results show that, compared with traditional methods and popular deep learning methods, the Inception-CBAM+ proposed in this invention achieves the best recognition results, demonstrating the effectiveness of the method.

Attached Figure Description

[0064] Figure 1 is a flowchart of the method according to an embodiment of the present invention.

[0065] Figure 2 is a block diagram of the system structure according to an embodiment of the present invention.

[0066] Figure 3 is a diagram of the Inception architecture and parameters according to an embodiment of the present invention.

[0067] Figure 4 is a diagram of the CBAM+ architecture according to an embodiment of the present invention.

[0068] Figure 5 is a confusion matrix diagram according to an embodiment of the present invention.

[0069] Figure 6 is a diagram of the architecture and parameters of the simple convolutional network used in the ablation experiment of the present invention.

Detailed Implementation

[0070] This invention provides a micro-expression recognition method and system based on Inception-CBAM+, addressing the problems of tedious manual feature extraction, inconsistent distribution of facial shapes and micro-expression movements among individuals, and uneven motion information in facial images. The method includes: data preprocessing, in which faces are detected and cropped with the Dlib method, and the optical flow features of each micro-expression segment are extracted with the TV-L1 optical flow method and standardized by standard deviation; the design of the Inception-CBAM+ network, which consists of three modules: the Inception feature extraction module, the CBAM+ attention mechanism module, and the fully connected classification module; and training and testing with leave-one-out cross-validation. By introducing the Inception module and the CBAM attention mechanism into micro-expression recognition, the method extracts multi-scale features while favoring features that benefit the target task. Furthermore, the traditional serial CBAM structure is modified into the parallel CBAM+, which reduces the feature loss caused by the serial structure and further improves micro-expression recognition results.

[0071] The present invention will now be described in further detail with reference to the accompanying drawings.

[0072] A micro-expression recognition method based on Inception-CBAM+ includes the following steps:

[0073] Step 1: Collect micro-expression image datasets and preprocess the images, including cropping faces and extracting facial motion information using optical flow.

[0074] Step 2: Design the Inception-CBAM+ deep learning network, including the Inception feature extraction module, the CBAM+ attention mechanism module, and the fully connected classification module;

[0075] Step 3: Use leave-one-out cross-validation to split the training set and the test set, train on the training set, and test on the test set to obtain the final recognition result;

[0076] Step 4: Analyze the recognition results;

[0077] The specific methods for collecting micro-expression image datasets and preprocessing the images in step one of this invention, including cropping faces and extracting facial motion information using optical flow, are as follows:

[0078] (1) Review the literature on micro-expression recognition in recent years and select the three most commonly used datasets: CASMEII, SAMM and SMIC.

[0079] (2) During the training of deep learning networks, irrelevant background factors should be avoided. This invention uses the Dlib method to detect 68 landmarks of the face, determines the four vertices of the rectangle formed by the 68 landmarks, and uses the rectangle formed by the four vertices to crop the face out of the original image.
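The cropping step can be sketched as follows, assuming the 68 landmarks have already been detected and are supplied as an array of (x, y) pixel coordinates. The landmark detector itself is not invoked here, and the function name `crop_face` is hypothetical.

```python
import numpy as np

def crop_face(image, landmarks):
    """Crop the axis-aligned rectangle that encloses all landmarks.

    image: array of shape (H, W) or (H, W, C); landmarks: (N, 2) array of (x, y).
    """
    h, w = image.shape[:2]
    x_min, y_min = np.floor(landmarks.min(axis=0)).astype(int)
    x_max, y_max = np.ceil(landmarks.max(axis=0)).astype(int)
    # Clamp to the image so landmarks that fall slightly outside do not break slicing.
    x_min, y_min = max(x_min, 0), max(y_min, 0)
    x_max, y_max = min(x_max, w - 1), min(y_max, h - 1)
    return image[y_min:y_max + 1, x_min:x_max + 1]
```

The four vertices of the rectangle are simply the combinations of the minimum and maximum x and y over the landmark set.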

[0080] (3) Micro-expressions are a type of motion information characterized by small amplitude. However, the original image contains too much identity information, and feeding the original image directly into the network causes it to learn too much identity information while ignoring motion information. This invention uses the TV-L1 optical flow method to extract motion information from each micro-expression segment. Let the face images within the micro-expression segment be f = {f_1, f_2, f_3, ..., f_n}, take the start frame image and the peak frame image {f_onset, f_apex}, and compute the horizontal optical flow field u and the vertical optical flow field v between the start frame and the peak frame with the TV-L1 optical flow method:

[0081] u, v = tvl1(f_onset, f_apex) (1)

[0082] To enrich the information channels of the network input, the optical strain ε is calculated from u and v using the first-order partial derivatives of the optical flow components:

[0083]

[0084] The horizontal optical flow field u, the vertical optical flow field v, and the optical strain ε are used to construct a three-dimensional matrix x = (u, v, ε), which is then resized to 28×28. Finally, the resized three-dimensional matrix is standardized by standard deviation to generate the input of the Inception-CBAM+ network:

[0085]

[0086] As shown in Figure 2, this invention provides a micro-expression recognition system based on Inception-CBAM+, including a data preprocessing module and an Inception-CBAM+ deep learning network unit.

[0087] The data preprocessing module preprocesses the image, including cropping faces and extracting facial motion information using optical flow.

[0088] The Inception-CBAM+ deep learning network unit includes: the Inception feature extraction module, the CBAM+ attention mechanism module, and the fully connected classification module.

[0089] In step two of this invention, the specific process of designing the Inception-CBAM+ deep learning network includes:

[0090] (1) Inception feature extraction module: in traditional convolutional networks, each layer uses a single kernel size, so features can only be learned on a fixed-size receptive field, and the resulting spatial features are very limited. Considering that face shapes differ between people and the distribution of micro-expressions is inconsistent, excessively large convolutional kernels fail to learn subtle features and require excessive computation, while excessively small convolutional kernels yield a small receptive field. Although the receptive field can be enlarged by stacking small convolutional kernels, the resulting large number of network layers leads to severe overfitting because micro-expression datasets are small. To solve this problem, this module combines Inception and maxpooling layers to extract multi-scale features from the optical flow map and reduce its dimensionality. Inception uses parallel 1×1 convolutions, 3×3 convolutions, 5×5 convolutions, and 3×3 maxpooling layers to calculate four components and concatenates them along their dimensions, as shown in the following formula:

[0091] Inception(x) = concat(f_1×1(x), f_3×3(x), f_5×5(x), f_1×1(maxpool(x))) (5)

[0092] where concat refers to concatenating matrices along their dimensions, and f_1×1, f_3×3, and f_5×5 represent the 1×1, 3×3, and 5×5 convolutions, respectively.

[0093] The final feature is calculated by stacking two layers of Inception and maxpool, as shown in the following formula:

[0094] feature(x)=maxpool(Inception(maxpool(Inception(x)))) (6)

[0095] (2) The nose and eyebrow regions of the face contribute more to micro-expression recognition. To improve the network's attention to these regions, this invention incorporates a CBAM attention mechanism into the deep learning network. Analysis of its original sequential channel-spatial attention structure reveals that, regardless of which attention module is applied first, the weights calculated by the second are influenced by the feature maps generated by the first. However, micro-expressions are characterized by small movement amplitudes, which makes feature extraction difficult and prevents the first attention module from fully extracting the important regions. This leads to unstable weight learning in the original CBAM network and reduces the final micro-expression recognition performance. To address this issue, this invention employs CBAM+, a CBAM mechanism with parallel channel and spatial attention.

[0096] The CBAM+ attention mechanism module contains a channel attention mechanism and a spatial attention mechanism arranged in parallel. Given an input feature map F ∈ R^(C×H×W), the channel attention mechanism first uses a max-pooling layer (maxpool) and an average-pooling layer (avgpool) to extract two different spatial context descriptors from the feature map F. Both descriptors are then fed into a weight-shared multilayer perceptron (MLP). This MLP contains two hidden layers: the first reduces the number of neurons to C/r with reduction ratio r, and the second restores the number of neurons to C. Finally, the CBAM+ attention mechanism module adds the two MLP outputs and passes the sum through the sigmoid activation function σ to output the attention weight of each channel. The channel attention weight function M_c(·) can be expressed as:

[0097] M_c(F) = σ(MLP(avgpool(F)) + MLP(maxpool(F))) (7)

[0098] The spatial attention mechanism first applies max pooling and average pooling to the input feature map F along the channel dimension, obtaining two spatial descriptors. The two descriptors are then concatenated along the channel dimension, and a convolution operation generates the final spatial attention feature map. The spatial attention weight function M_s(·) can be expressed as:

[0099]

[0100] Finally, the input feature map F, the feature map weighted by the channel attention output M_c(F), and the feature map weighted by the spatial attention output M_s(F) are summed. The final output M(·) of the CBAM+ attention mechanism module can be expressed as:

[0101] M(F) = F + F·M_c(F) + F·M_s(F) (9)

[0102] (3) Fully connected classification module: the output of the previous step is flattened into a vector and fed into the fully connected layer to generate a three-dimensional vector. Because the dataset is small, using all neurons directly would cause severe overfitting; therefore, half of the neurons are randomly discarded. Finally, the probability p_i of each class is calculated using the softmax function.

[0103] In step three of this invention, the training set and the test set are split using leave-one-out cross-validation. The training set is used for training, and the test set is used for testing to obtain the final recognition result, which is described as follows:

[0104] (1) Using leave-one-out cross-validation, let the set of participants be S = {S_1, S_2, S_3, ..., S_n}, where each subject has several micro-expression samples. In each cross-validation round, the micro-expression samples of one subject are selected as the test set, and the remaining samples are used as the training set. The model is trained n times to obtain the final recognition results, including the unweighted average recall and the unweighted F1 score.

[0105] The analysis and description of the recognition results in step four of this invention are as follows:

[0106] (1) Organize the recognition results, draw the confusion matrix, calculate the unweighted average recall and unweighted F1 score, compare them with traditional methods and popular deep learning methods, and plan the next steps based on the recognition results.

[0107] This invention analyzes a mixed dataset composed of the publicly released micro-expression databases CASME II, SAMM, and SMIC. The video samples of CASME II and SAMM were recorded at 200 fps, while those of SMIC were recorded at 100 fps. A very small number of samples labeled "other" were removed, and the remaining samples were grouped into three categories: positive (happiness), negative (sadness, fear, contempt, and anger), and surprise. The resulting mixed dataset contains 68 participants and 442 micro-expression samples, and the specific sample distribution is shown in Table 1.
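The grouping of emotion labels into three classes can be sketched as a lookup table. Only the emotions named in the text are mapped; treating every other label as dropped, the integer class indices, and the function name are assumptions of this sketch.

```python
# Emotion labels named in the text, mapped to the three categories.
EMOTION_TO_CLASS = {
    "happiness": "positive",
    "sadness": "negative",
    "fear": "negative",
    "contempt": "negative",
    "anger": "negative",
    "surprise": "surprise",
}

# Hypothetical integer indices for the three classes.
CLASS_INDEX = {"negative": 0, "positive": 1, "surprise": 2}

def to_class_index(emotion):
    """Return the class index, or None for labels that are dropped (such as 'other')."""
    group = EMOTION_TO_CLASS.get(emotion.lower())
    return None if group is None else CLASS_INDEX[group]
```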

[0108] Table 1 Sample Distribution

[0109]

[0110] Unweighted average recall (UAR) and unweighted F1 score (UF1) are used as evaluation metrics. These metrics require the proposed method to classify every class well and prevent the classification results from being biased toward one class. To calculate the unweighted average recall, the number of true positive samples in class c, $TP_c$, and the total number of samples in class c, $N_c$, are needed. With C denoting the number of classes, the calculation formula is as follows:

[0111] $$\mathrm{UAR} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{N_c}$$

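[0112] To calculate the unweighted F1 score, the numbers of true positive, false positive, and false negative samples in class c, denoted $TP_c$, $FP_c$, and $FN_c$, are needed. With C denoting the number of classes, the calculation formula is as follows:

[0113] $$\mathrm{UF1} = \frac{1}{C} \sum_{c=1}^{C} \frac{2\,TP_c}{2\,TP_c + FP_c + FN_c}$$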
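The two metrics can be computed from a confusion matrix as sketched below. The sketch uses the standard per-class definitions (recall = TP/N and F1 = 2TP/(2TP + FP + FN), each averaged over classes with equal weight); the function name and the row-is-true-class matrix layout are assumptions for this illustration.

```python
def uar_uf1(matrix):
    """Return (UAR, UF1) from a square confusion matrix.

    matrix[t][p] counts samples of true class t predicted as class p.
    """
    num_classes = len(matrix)
    recalls, f1_scores = [], []
    for c in range(num_classes):
        tp = matrix[c][c]
        n_c = sum(matrix[c])                                   # samples whose true class is c
        fp = sum(matrix[r][c] for r in range(num_classes)) - tp
        fn = n_c - tp
        recalls.append(tp / n_c if n_c else 0.0)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(recalls) / num_classes, sum(f1_scores) / num_classes
```

Because every class contributes equally to the average regardless of its sample count, a classifier that favors the largest class is penalized on both metrics.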

[0114] The analysis of this invention is implemented in Python on a Linux platform, with Ubuntu 18.04 as the operating system, an NVIDIA GeForce GTX 1080 graphics card, and the PyTorch deep learning framework. To reduce experimental randomness and improve reproducibility, the random seed is set to 100, and all analyses are performed under this seed. The network input is a 3×28×28 three-channel optical flow map. Cross-entropy loss is used with the Adam optimizer; the initial learning rate is 0.001 and is decayed by cosine annealing to a minimum learning rate of 0.0001, and the maximum number of training epochs is 300. Because the input images are small and the network is shallow, each training run uses little GPU memory; therefore, the batch size is set to the number of samples in the entire MEGC2019 dataset.
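The learning-rate decay can be illustrated with the standard closed-form cosine annealing schedule, which is the form documented for PyTorch's `CosineAnnealingLR`. The sketch assumes that the annealing period equals the maximum number of epochs (`t_max=300`); the original text does not state the period, and the function name is hypothetical.

```python
import math


def cosine_annealing_lr(epoch, lr_max=1e-3, lr_min=1e-4, t_max=300):
    """Learning rate at a given epoch under cosine annealing without restarts.

    t_max=300 is an assumption (period = maximum number of epochs).
    """
    cosine = math.cos(math.pi * epoch / t_max)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)
```

Under these assumptions the rate starts at 0.001, passes through 0.00055 at epoch 150, and reaches the minimum of 0.0001 at epoch 300.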

[0115] As shown in Table 2, the method proposed in this invention outperforms other methods on the mixed dataset and the SMIC dataset. Specifically, compared with the traditional hand-crafted feature method LBP-TOP, this invention improves the unweighted F1 score and the unweighted average recall on the mixed dataset by 0.1538 and 0.165, respectively. Furthermore, the method scores significantly higher on CASME II than on the other two datasets. This is because CASME II is captured by a 200 fps camera and provides more accurate peak frames, which leads to more accurate optical flow calculations and a better characterization of motion changes. Conversely, the method shows lower recognition performance on the SMIC dataset, which may be due to two reasons. First, the SMIC dataset does not provide the exact location of the peak frames, and peak frame detection methods cannot locate them accurately. Second, the SMIC video frame rate is lower (100 fps), and there is significant background noise such as shadows, highlights, illumination changes, and flickering lights.

[0116] The results of the ablation experiments are shown in Table 3. Network performance improves significantly when the simple convolutional-layer feature extraction is replaced with Inception multi-scale feature extraction, and improves further when the traditional serial CBAM is introduced. Replacing the serial CBAM with the CBAM+ proposed in this invention raises network performance to its highest level. This demonstrates the benefit of multi-scale feature extraction and of the parallel dual-branch attention mechanism.

[0117] Because the dataset is imbalanced, the negative emotion category, which accounts for a larger share of the samples, is recognized better than the other categories. Planned future work is to use generative networks such as GANs to augment the data of the under-represented categories, or to construct deep networks that pay more attention to these categories, thereby improving overall recognition performance.

[0118] Table 2 Comparison of the present invention with existing methods

[0119]

[0120] Table 3 Ablation Experiment Results

[0121]

Claims

1. A micro-expression recognition method based on Inception-CBAM+, characterized in that it includes the following steps:
Step 1: Collect micro-expression image datasets and preprocess the images, including cropping faces and extracting facial motion information using optical flow;
Step 2: Design the Inception-CBAM+ deep learning network, which includes an Inception feature extraction module, a CBAM+ attention mechanism module, and a fully connected classification module;
Step 3: Use leave-one-out cross-validation to split the training set and the test set, train on the training set, and test on the test set to obtain the final recognition result;
Step 4: Analyze the recognition results;
The process of step one includes:
1-1. Based on a review of recent literature on micro-expression recognition, select the three most commonly used datasets: CASME II, SAMM, and SMIC;
1-2. Use the Dlib method to detect 68 facial landmarks, determine the four vertices of the rectangle enclosing the 68 landmarks, and use that rectangle to crop the face from the original image;
1-3. Extract motion information from each micro-expression clip using the TV-L1 optical flow method: from the face images within the clip, take the start frame image and the peak frame image, and calculate the horizontal optical flow field and the vertical optical flow field between the start frame and the peak frame using the TV-L1 optical flow method; to increase the number of input channels of the network, calculate the optical strain from the horizontal and vertical optical flow fields, the optical strain being computed from the first-order partial derivatives of the optical flow components; stack the horizontal optical flow field, the vertical optical flow field, and the optical strain into a three-dimensional matrix and reduce that matrix to 28×28; finally, standardize the reduced three-dimensional matrix by its standard deviation to generate the input of the Inception-CBAM+ network;
The Inception-CBAM+ deep learning network described in step two includes:
2-1. Inception feature extraction module: this module combines Inception layers and max-pooling layers to extract multi-scale features from the optical flow map and reduce its dimensionality. Each Inception layer computes four components in parallel, using 1×1 convolution, 3×3 convolution, 5×5 convolution, and 3×3 max pooling, and concatenates the four components along the channel dimension. The output features of the Inception feature extraction module are calculated by stacking two Inception layers and a max-pooling layer;
2-2. CBAM+ attention mechanism module: contains a channel attention mechanism and a spatial attention mechanism arranged in parallel. Given an input feature map F, the channel attention mechanism first uses a max-pooling layer and an average-pooling layer to extract two different kinds of spatial context information from the feature map; the two pooled results are then fed into a weight-sharing multilayer perceptron (MLP). The MLP contains two hidden layers: the first reduces the number of neurons to C / r, where r is the sampling ratio, and the second restores the number of neurons to C. Finally, the CBAM+ attention mechanism module adds the two MLP outputs and passes the sum through the sigmoid activation function to output the attention weight of each channel. The spatial attention mechanism first performs max pooling and average pooling on the input feature map along the channel dimension, then concatenates the two pooled maps along the channel dimension and applies a convolution to generate the final spatial attention feature map. Finally, the input feature map F, the output of the channel attention mechanism, and the output of the spatial attention mechanism are added together to give the final output of the CBAM+ attention mechanism module;
2-3. Fully connected classification module: the output of the CBAM+ attention mechanism module is flattened into a vector and fed into a fully connected layer to generate a three-dimensional vector. Because the dataset is small, using all neurons directly would cause severe overfitting, so half of the neurons are randomly dropped. Finally, the probability of each class is calculated using the softmax function.

2. The micro-expression recognition method based on Inception-CBAM+ according to claim 1, characterized in that the process in step three of using leave-one-out cross-validation to split the training set and the test set, training on the training set, and testing on the test set to obtain the final recognition result is as follows: using leave-one-out cross-validation, let the set of participants be denoted $S = \{S_1, S_2, S_3, \dots, S_n\}$, where each subject has several micro-expression samples; in each round of cross-validation, the micro-expression samples of one subject are used as the test set, and the samples of all remaining subjects are used as the training set; the model is trained n times to obtain the final recognition results, including the unweighted average recall and the unweighted F1 score.

3. The micro-expression recognition method based on Inception-CBAM+ according to claim 1, characterized in that the process of analyzing the recognition results described in step four includes:
Organize the recognition results, draw the confusion matrix, calculate the unweighted average recall and unweighted F1 score, compare them with traditional methods and popular deep learning methods, and plan the next steps based on the recognition results;
The analysis is conducted on a mixed dataset consisting of the publicly available micro-expression databases CASME II, SAMM, and SMIC. The video samples of CASME II and SAMM are at 200 fps, and those of SMIC are at 100 fps. Samples labeled "other" are removed, and the remaining samples, labeled happiness (positive), sadness (negative), fear (negative), contempt (negative), and surprise (positive), are divided into three categories. The resulting mixed dataset contains 68 participants and 442 micro-expression samples;
Unweighted average recall (UAR) and unweighted F1 score (UF1) are used as evaluation metrics. To calculate the unweighted average recall, the number of true positive samples in class c, $TP_c$, and the total number of samples in class c, $N_c$, are required; with C denoting the number of classes, the calculation formula is: $\mathrm{UAR} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{N_c}$; to calculate the unweighted F1 score, the numbers of true positive, false positive, and false negative samples in class c, $TP_c$, $FP_c$, and $FN_c$, are required, and the calculation formula is: $\mathrm{UF1} = \frac{1}{C} \sum_{c=1}^{C} \frac{2\,TP_c}{2\,TP_c + FP_c + FN_c}$.

4. The micro-expression recognition method based on Inception-CBAM+ according to claim 3, characterized in that the analysis described in step four is implemented in Python on a Linux platform, with an Ubuntu 18.04 operating system, an NVIDIA GeForce GTX 1080 graphics card, and the PyTorch deep learning framework. The random seed is set to 100, and all analyses are performed under this random seed. The network input is a 3×28×28 three-channel optical flow map. The cross-entropy loss function is used, the initial learning rate is set to 0.001, and the cosine annealing learning rate decay method is used, with a minimum learning rate of 0.0001. The Adam optimizer is used, the maximum number of training epochs is 300, and the batch size is set to the number of samples in the entire MEGC2019 dataset.

5. A micro-expression recognition system based on Inception-CBAM+, characterized in that it includes a data preprocessing module and an Inception-CBAM+ deep learning network unit;
The data preprocessing module preprocesses the images, including cropping faces and extracting facial motion information using optical flow;
The Inception-CBAM+ deep learning network unit includes:
Inception feature extraction module: this module combines Inception layers and max-pooling layers to extract multi-scale features from the optical flow map and reduce its dimensionality. Each Inception layer computes four components in parallel, using 1×1 convolution, 3×3 convolution, 5×5 convolution, and 3×3 max pooling, and finally concatenates the four components along the channel dimension. The final features are calculated by stacking two layers of Inception and max pooling;
CBAM+ attention mechanism module: contains a channel attention mechanism and a spatial attention mechanism arranged in parallel. Given an input feature map F, the channel attention mechanism first uses a max-pooling layer and an average-pooling layer to extract two different kinds of spatial context information from the feature map; the two pooled results are then fed into a weight-sharing multilayer perceptron (MLP). The MLP contains two hidden layers: the first reduces the number of neurons to C / r, where r is the sampling ratio, and the second restores the number of neurons to C. Finally, the CBAM+ attention mechanism module adds the two MLP outputs and passes the sum through the activation function, which is the sigmoid, to output the attention weight of each channel. The spatial attention mechanism first performs max pooling and average pooling on the input feature map along the channel dimension, then concatenates the two pooled maps along the channel dimension and applies a convolution to generate the final spatial attention feature map. Finally, the input feature map F, the output of the channel attention mechanism, and the output of the spatial attention mechanism are added together to give the final output of the CBAM+ attention mechanism module;
Fully connected classification module: the output of the CBAM+ attention mechanism module is flattened into a vector and fed into a fully connected layer to generate a three-dimensional vector. Because the dataset is small, using all neurons directly would cause severe overfitting, so half of the neurons are randomly dropped. Finally, the probability of each class is calculated using the softmax function.