Micro-expression recognition method and device, computer device and storage medium
By using a 3D micro-expression recognition model to perform face detection and spatiotemporal feature extraction on videos, the problem of low micro-expression recognition rate has been solved, achieving more accurate micro-expression recognition and improving the efficiency of intelligent diagnosis and remote consultation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- PING AN TECH (SHENZHEN) CO LTD
- Filing Date
- 2023-06-09
- Publication Date
- 2026-06-26
AI Technical Summary
Existing micro-expression recognition technologies have low recognition rates and struggle to accurately capture and describe micro-expression characteristics, especially since micro-expression video sequences are short in duration and limited in number.
A three-dimensional micro-expression recognition model is used to extract the spatiotemporal features of micro-expressions and macro-expressions by performing face detection and data preprocessing on the video to be processed, and then extracting shared spatiotemporal features for expression classification.
It improves the accuracy and expressiveness of micro-expression recognition, thereby enhancing the efficiency and effectiveness of consultations in intelligent diagnosis and remote consultations.
[Figure CN116665278B_ABST: abstract drawing]
Description
Technical Field
[0001] This invention relates to the fields of image recognition and digital healthcare, and more particularly to a micro-expression recognition method, apparatus, computer device, and storage medium.
Background Technology
[0002] In recent years, with the rapid development of artificial intelligence technology, the task of facial micro-expression recognition has received increasing attention. Currently, micro-expression recognition methods mainly identify facial movement characteristics and texture features to capture micro-expressions. For example, in digital healthcare scenarios such as intelligent diagnosis and remote consultation, recognizing a patient's micro-expressions can assist in identifying their current condition. However, because facial movements in micro-expressions are relatively small, it is difficult to fully capture and accurately describe their characteristics. Furthermore, micro-expression video sequences are often short in duration and limited in number, which increases the difficulty of feature capture and hinders further improvements in recognition rates. Therefore, existing micro-expression recognition methods suffer from low recognition rates.
Summary of the Invention
[0003] Therefore, it is necessary to provide a micro-expression recognition method, device, computer equipment, and storage medium to address the aforementioned technical problems and solve the problem of low micro-expression recognition rate in existing micro-expression recognition technologies.
[0004] A micro-expression recognition method includes:
[0005] Obtain the video to be processed, which includes micro-expressions and macro-expressions;
[0006] The video to be processed is subjected to data preprocessing including face detection to obtain a preprocessed video;
[0007] Spatiotemporal features of the preprocessed video are extracted using a three-dimensional micro-expression recognition model to obtain spatiotemporal features of micro-expressions and spatiotemporal features of macro-expressions.
[0008] Extract the shared spatiotemporal features of the micro-expression and the macro-expression from the micro-expression spatiotemporal features and the macro-expression spatiotemporal features;
[0009] The micro-expressions are classified according to the shared spatiotemporal features, the micro-expression spatiotemporal features, and the macro-expression spatiotemporal features to obtain the micro-expression recognition results.
[0010] A micro-expression recognition device, comprising:
[0011] The pending video module is used to acquire pending videos containing micro-expressions and macro-expressions;
[0012] The preprocessing video module is used to perform data preprocessing on the video to be processed, including face detection, to obtain a preprocessed video.
[0013] The spatiotemporal feature extraction module is used to extract spatiotemporal features from the preprocessed video using a three-dimensional micro-expression recognition model to obtain micro-expression spatiotemporal features and macro-expression spatiotemporal features.
[0014] A shared spatiotemporal feature module is used to extract shared spatiotemporal features of micro-expressions and macro-expressions from the micro-expression spatiotemporal features and the macro-expression spatiotemporal features;
[0015] The recognition result module is used to classify the micro-expressions according to the shared spatiotemporal features, the micro-expression spatiotemporal features, and the macro-expression spatiotemporal features, and obtain the recognition result of the micro-expressions.
[0016] A computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the micro-expression recognition method described above when executing the computer-readable instructions.
[0017] One or more readable storage media storing computer-readable instructions, which, when executed by one or more processors, cause the one or more processors to perform the micro-expression recognition method described above.
[0018] The aforementioned micro-expression recognition method, apparatus, computer equipment, and storage medium acquire a video to be processed containing micro-expressions and macro-expressions; perform data preprocessing, including face detection, on the video to be processed to obtain a preprocessed video; extract spatiotemporal features from the preprocessed video using a three-dimensional micro-expression recognition model to obtain spatiotemporal features of micro-expressions and macro-expressions; extract shared spatiotemporal features of the micro-expressions and macro-expressions from the spatiotemporal features of the micro-expressions and macro-expressions; and classify the micro-expressions according to the shared spatiotemporal features, the spatiotemporal features of the micro-expressions, and the spatiotemporal features of the macro-expressions to obtain the recognition result of the micro-expressions. This invention acquires the spatiotemporal features of micro-expressions, macro-expressions, and shared spatiotemporal features between the spatiotemporal features of micro-expressions and macro-expressions from the video to be processed using a three-dimensional micro-expression recognition model, and performs micro-expression recognition based on the acquired spatiotemporal features of the video to be processed in three dimensions. While considering the spatiotemporal features of micro-expressions, it also fully considers the spatiotemporal features of macro-expressions, thus increasing the expressive power of micro-expressions. Furthermore, the spatiotemporal features of macro-expressions are richer than those of micro-expressions, thus improving the accuracy of micro-expression recognition. 
The aforementioned micro-expression recognition method can be applied to intelligent diagnosis and treatment and to remote consultation, improving the accuracy with which patients' micro-expressions are recognized during consultations and, in turn, consultation efficiency and effectiveness.
Attached Figure Description
[0019] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments of the present invention will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0020] Figure 1 is a schematic diagram of an application environment for a micro-expression recognition method according to an embodiment of the present invention;
[0021] Figure 2 is a flowchart illustrating a micro-expression recognition method according to an embodiment of the present invention;
[0022] Figure 3 is a schematic diagram of a micro-expression recognition device according to an embodiment of the present invention;
[0023] Figure 4 is a schematic diagram of a computer device according to an embodiment of the present invention.
Detailed Implementation
[0024] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0025] The micro-expression recognition method provided in this embodiment can be applied in an application environment such as the one shown in Figure 1, in which a client communicates with a server. Clients include, but are not limited to, various personal computers, laptops, smartphones, tablets, and portable wearable devices. The server can be implemented using a standalone server or a server cluster consisting of multiple servers.
[0026] In one embodiment, as shown in Figure 2, a micro-expression recognition method is provided. Taking its application to the server in Figure 1 as an example, the method includes the following steps:
[0027] S10. Obtain the video to be processed, which contains micro-expressions and macro-expressions.
[0028] In essence, micro-expressions are fleeting facial expressions that reveal a person's true feelings and emotions. Conversely, macro-expressions are facial expressions that last longer and also reveal true feelings and emotions. The video to be processed refers to a video containing both micro-expressions and macro-expressions, intended for micro-expression recognition. Generally, micro-expressions express emotions such as happiness, sadness, fear, anger, and disgust. Similarly, macro-expressions also express emotions such as happiness, sadness, fear, anger, and disgust.
[0029] S20. Perform data preprocessing, including face detection, on the video to be processed to obtain a preprocessed video.
[0030] In essence, data preprocessing refers to the preprocessing of data from the video to be processed. Data preprocessing includes, but is not limited to, face detection. Face detection specifically refers to the process of detecting faces in the sequence of video frames to be processed, thereby obtaining facial landmarks. Preprocessed video refers to the video obtained after data preprocessing.
[0031] S30. The preprocessed video is subjected to spatiotemporal feature extraction using a three-dimensional micro-expression recognition model to obtain micro-expression spatiotemporal features and macro-expression spatiotemporal features.
[0032] Understandably, the 3D micro-expression recognition model is a trained 3D convolutional neural network. This model includes a micro-expression recognition network and a macro-expression recognition network, used to extract spatiotemporal features from the preprocessed video. For example, the 3D micro-expression recognition model could be 3D-ResNet10 (a 3D deep residual learning network). The spatiotemporal features of the preprocessed video include micro-expression spatiotemporal features and macro-expression spatiotemporal features. Micro-expression spatiotemporal features refer to the features of the preprocessed video regarding micro-expressions in both time and space dimensions. Macro-expression spatiotemporal features refer to the features of the preprocessed video regarding macro-expressions in both time and space dimensions.
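To illustrate why a 3D convolution yields features in both time and space, the following is a minimal pure-Python sketch of a single 3D kernel sliding over a small grayscale clip. It is not the 3D-ResNet10 model named above; the function name `conv3d_valid`, the toy clip, and the kernel values are assumptions made only for illustration.

```python
from typing import List

Volume = List[List[List[float]]]  # indexed as [time][row][col]


def conv3d_valid(video: Volume, kernel: Volume) -> Volume:
    """Slide a 3D kernel over a (T, H, W) clip with stride 1 and no padding.

    As in most deep learning libraries, this is cross-correlation
    (the kernel is not flipped).
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out: Volume = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                # Accumulate over the temporal and both spatial kernel axes.
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += video[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Hypothetical 2x1x1 kernel that responds to change between consecutive frames.
kernel = [[[-1.0]], [[1.0]]]
# Three 2x2 frames in which only the top-left pixel brightens over time.
video = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[0.5, 0.0], [0.0, 0.0]],
    [[1.0, 0.0], [0.0, 0.0]],
]
response = conv3d_valid(video, kernel)
```

The response is non-zero only where the pixel intensity changes between frames, which is the kind of subtle motion cue a spatiotemporal feature extractor is meant to pick up.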
[0033] S40. Extract the shared spatiotemporal features of the micro-expression and the macro-expression from the micro-expression spatiotemporal features and the macro-expression spatiotemporal features.
[0034] Understandably, there are shared spatiotemporal features between micro-expressions and macro-expressions in a preprocessed video. Shared spatiotemporal features are the same spatiotemporal features between micro-expression spatiotemporal features and macro-expression spatiotemporal features. For example, if a preprocessed video has both micro-expression spatiotemporal features expressing happiness and macro-expression spatiotemporal features expressing happiness, then there are shared spatiotemporal features between the micro-expression spatiotemporal features and macro-expression spatiotemporal features of the preprocessed video that express happiness.
[0035] S50. Classify the micro-expressions according to the shared spatiotemporal features, the micro-expression spatiotemporal features, and the macro-expression spatiotemporal features to obtain the micro-expression recognition result.
[0036] Understandably, after acquiring the shared spatiotemporal features, micro-expression spatiotemporal features, and macro-expression spatiotemporal features of the preprocessed video, the 3D micro-expression recognition model can classify expressions based on the correlations between these features. The types of expressions include, but are not limited to, happiness, sadness, fear, anger, and disgust. The recognition result is the result of the 3D micro-expression recognition model identifying and classifying micro-expressions in the preprocessed video based on the shared spatiotemporal features, micro-expression spatiotemporal features, and macro-expression spatiotemporal features.
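The source does not state how the three groups of features are combined for classification. As one possible reading, the sketch below concatenates them and applies a linear layer followed by softmax over the five expression types listed above. The concatenation, the `classify` function, and the toy weights are all assumptions, not the patented classifier.

```python
import math
from typing import List, Sequence, Tuple

LABELS = ["happiness", "sadness", "fear", "anger", "disgust"]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(
    shared: Sequence[float],
    micro: Sequence[float],
    macro: Sequence[float],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> Tuple[str, List[float]]:
    """Concatenate the three feature vectors (an assumption) and score each label."""
    features = list(shared) + list(micro) + list(macro)
    logits = [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs


# Toy one-dimensional features and hand-picked weights (hypothetical values).
weights = [[2.0, 0.0, 0.0]] + [[0.0, 0.0, 0.0] for _ in range(4)]
bias = [0.0] * 5
label, probs = classify([1.0], [0.0], [0.0], weights, bias)
```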
[0037] In steps S10-S50, a video to be processed containing micro-expressions and macro-expressions is acquired; the video to be processed undergoes data preprocessing including face detection to obtain a preprocessed video; spatiotemporal features are extracted from the preprocessed video using a three-dimensional micro-expression recognition model to obtain micro-expression spatiotemporal features and macro-expression spatiotemporal features; shared spatiotemporal features of the micro-expressions and macro-expressions are extracted from the micro-expression spatiotemporal features and macro-expression spatiotemporal features; and micro-expressions are classified according to the shared spatiotemporal features, the micro-expression spatiotemporal features, and the macro-expression spatiotemporal features to obtain the micro-expression recognition result. In this embodiment, the three-dimensional micro-expression recognition model acquires the micro-expression spatiotemporal features, macro-expression spatiotemporal features, and shared spatiotemporal features between the micro-expression spatiotemporal features and macro-expression spatiotemporal features of the video to be processed, and performs micro-expression recognition based on the acquired spatiotemporal features of the video to be processed in three dimensions. This approach considers both the spatiotemporal features of micro-expressions and the spatiotemporal features of macro-expressions, thereby increasing the expressive power of micro-expressions. Furthermore, the spatiotemporal features of macro-expressions are richer than those of micro-expressions, improving the accuracy of micro-expression recognition. 
The aforementioned micro-expression recognition method can be applied in the field of digital healthcare, such as in intelligent diagnosis and remote consultations, where it can improve the accuracy of recognizing patients' micro-expressions during consultations, thereby enhancing consultation efficiency and effectiveness.
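The flow of steps S10 to S50 can be sketched as a small pipeline in which each step is a pluggable callable. The class name `MicroExpressionPipeline` and the stub functions are hypothetical placeholders; real implementations of each step would replace them.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class MicroExpressionPipeline:
    preprocess: Callable[[Any], Any]            # S20: face detection and preprocessing
    extract: Callable[[Any], Tuple[Any, Any]]   # S30: returns (micro, macro) features
    share: Callable[[Any, Any], Any]            # S40: shared spatiotemporal features
    classify: Callable[[Any, Any, Any], str]    # S50: recognition result

    def run(self, video: Any) -> str:
        """Run S20 to S50 on a video obtained in S10."""
        preprocessed = self.preprocess(video)
        micro, macro = self.extract(preprocessed)
        shared = self.share(micro, macro)
        return self.classify(shared, micro, macro)


# Stub steps standing in for real components (all hypothetical).
def stub_preprocess(video: Any) -> Any:
    return video


def stub_extract(video: Any) -> Tuple[List[float], List[float]]:
    return [1.0, 0.0], [1.0, 2.0]


def stub_share(micro: List[float], macro: List[float]) -> List[float]:
    # Placeholder notion of "shared": the element-wise minimum.
    return [min(a, b) for a, b in zip(micro, macro)]


def stub_classify(shared: List[float], micro: List[float], macro: List[float]) -> str:
    return "happiness" if sum(shared) > 0 else "unknown"


pipeline = MicroExpressionPipeline(stub_preprocess, stub_extract, stub_share, stub_classify)
result = pipeline.run(video="frames-placeholder")
```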
[0038] Optionally, in step S20, i.e., performing data preprocessing including face detection on the video to be processed to obtain a preprocessed video, the following steps are included:
[0039] S201. Use a vision library to perform face detection on the video frame sequence to be processed in the video to be processed, and obtain facial key points;
[0040] S202. Perform face cropping based on the facial key points to obtain a face video frame sequence;
[0041] S203. Generate the preprocessed video based on the face video frame sequence.
[0042] Understandably, a vision library generally refers to pre-written code and data used to build or optimize computer programs. For example, OpenCV is the most widely used open-source computer vision library to date, encompassing applications such as face recognition and object detection. The video to be processed contains at least one sequence of video frames. This sequence of video frames refers to a sequence of video frames containing micro-expressions and / or macro-expressions. Face detection refers to the process of using a vision library to detect faces within the sequence of video frames to be processed, obtaining facial landmarks. These facial landmarks can include 68 key points of the face. Face cropping refers to cropping faces from the sequence of video frames to be processed based on the detected facial landmarks, obtaining a sequence of video frames containing micro-expressions and / or macro-expressions, eliminating the influence of irrelevant background on micro-expression recognition. A pre-processed video is generated based on the sequence of video frames containing micro-expressions and / or macro-expressions.
[0043] In this embodiment, by performing face detection on the video frame sequence containing micro-expressions and / or macro-expressions, and cropping the detected faces, faces containing micro-expressions and / or macro-expressions can be accurately cropped out, and the influence of irrelevant backgrounds is eliminated, thereby improving the accuracy of micro-expression recognition.
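The cropping step in S202 can be sketched as taking the bounding box of the detected key points and slicing each frame to it. The sketch assumes the key points have already been produced by a face detector (for example, the 68 key points mentioned above); the function names, the `margin` parameter, and the toy frame are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Frame = List[List[int]]   # grayscale image indexed as [row][col]
Point = Tuple[int, int]   # (x, y) pixel coordinates


def landmark_bbox(
    points: Sequence[Point], width: int, height: int, margin: int = 0
) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom), inclusive, clamped to the frame."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    left = max(min(xs) - margin, 0)
    top = max(min(ys) - margin, 0)
    right = min(max(xs) + margin, width - 1)
    bottom = min(max(ys) + margin, height - 1)
    return left, top, right, bottom


def crop_face(frame: Frame, points: Sequence[Point], margin: int = 0) -> Frame:
    """Crop the frame to the bounding box of the facial key points."""
    height, width = len(frame), len(frame[0])
    left, top, right, bottom = landmark_bbox(points, width, height, margin)
    return [row[left:right + 1] for row in frame[top:bottom + 1]]


# Toy 4x5 frame whose pixel value encodes its position (row * 10 + col).
frame = [[r * 10 + c for c in range(5)] for r in range(4)]
# Two hypothetical key points spanning columns 1..3 and rows 1..2.
face = crop_face(frame, [(1, 1), (3, 2)])
```

Applying `crop_face` to every frame with that frame's key points gives the face video frame sequence; background pixels outside the box are discarded.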
[0044] Optionally, in step S203, namely generating the preprocessed video based on the face video frame sequence, the following steps are included:
[0045] S2031. Based on the facial key points, perform face alignment on the face video frames in the face video frame sequence to generate a reference face video.
[0046] S2032. Perform temporal image interpolation on the reference face video to obtain the preprocessed video.
[0047] In essence, face alignment refers to aligning the faces in the face video frame sequence based on the facial key points to obtain a reference face video. An aligned reference face video allows micro-expressions to be recognized better and faster, improving accuracy and efficiency. Because micro-expressions are short in duration, it is difficult to obtain their spatiotemporal features accurately, so extending the duration of micro-expressions makes those features easier to capture. Similarly, extending the duration of macro-expressions makes their spatiotemporal features easier to capture. Specifically, temporal image interpolation is performed on the reference face video, that is, frames are interpolated along the time dimension of the reference face video, which increases the number of video frames it contains. This extends the duration of the micro-expressions and / or macro-expressions in the reference face video so that their spatiotemporal features are captured more fully, thereby enhancing the accuracy of micro-expression recognition.
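The source does not say which interpolation scheme is used in S2032. As a minimal sketch, the code below assumes simple linear interpolation between neighbouring frames to stretch a clip to a target number of frames; the function name `interpolate_frames` and the toy clip are hypothetical.

```python
from typing import List

Frame = List[List[float]]  # grayscale image indexed as [row][col]


def interpolate_frames(frames: List[Frame], target_len: int) -> List[Frame]:
    """Resample a clip to target_len frames by linear interpolation in time."""
    n = len(frames)
    if n < 2 or target_len < 2:
        raise ValueError("need at least two source frames and two target frames")
    out: List[Frame] = []
    for i in range(target_len):
        # Fractional position of output frame i on the source time axis.
        pos = i * (n - 1) / (target_len - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        w = pos - lo
        blended = [
            [(1 - w) * a + w * b for a, b in zip(row_lo, row_hi)]
            for row_lo, row_hi in zip(frames[lo], frames[hi])
        ]
        out.append(blended)
    return out


# Two 1x1 frames stretched to five frames.
clip = [[[0.0]], [[10.0]]]
stretched = interpolate_frames(clip, 5)
values = [f[0][0] for f in stretched]
```

The first and last frames are preserved, and the inserted frames lie between their neighbours, so a brief expression occupies more frames without changing its start or end state.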
[0048] Optionally, the three-dimensional micro-expression recognition model includes a micro-expression recognition network and a macro-expression recognition network;
[0049] In step S30, namely, extracting spatiotemporal features from the preprocessed video using a three-dimensional micro-expression recognition model to obtain micro-expression spatiotemporal features and macro-expression spatiotemporal features, the following steps are taken:
[0050] S301. Micro-expression spatiotemporal features are extracted from the video frame sequence to be processed in the preprocessed video through the micro-expression recognition network to obtain the micro-expression spatiotemporal features.
[0051] S302. The macro-expression spatiotemporal features are extracted from the sequence of video frames to be processed in the preprocessed video through the macro-expression recognition network to obtain the macro-expression spatiotemporal features.
[0052] Understandably, the micro-expression recognition network is used to identify micro-expressions in pre-processed videos and extract their spatiotemporal features. The macro-expression recognition network is used to identify macro-expressions in pre-processed videos and extract their spatiotemporal features. Both the micro-expression recognition network and the macro-expression recognition network share the same feature encoder structure and feature parameters, thereby extracting common features between the micro-expression spatiotemporal features and macro-expression spatiotemporal features.
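Sharing the encoder structure and parameters between the two networks can be pictured as both networks holding a reference to the same encoder object, so that any parameter update is seen by both. The classes below are toy stand-ins with hypothetical names, not the actual network layers.

```python
from typing import List, Sequence


class SharedEncoder:
    """Toy stand-in for the shared feature encoder (one weight per input value)."""

    def __init__(self, weights: Sequence[float]) -> None:
        self.weights = list(weights)

    def encode(self, clip: Sequence[float]) -> List[float]:
        return [w * x for w, x in zip(self.weights, clip)]


class ExpressionNetwork:
    """A recognition network that delegates feature extraction to an encoder."""

    def __init__(self, encoder: SharedEncoder, name: str) -> None:
        self.encoder = encoder  # a reference to the shared object, not a copy
        self.name = name

    def features(self, clip: Sequence[float]) -> List[float]:
        return self.encoder.encode(clip)


encoder = SharedEncoder([1.0, 2.0])
micro_net = ExpressionNetwork(encoder, "micro")
macro_net = ExpressionNetwork(encoder, "macro")

before = micro_net.features([1.0, 1.0])
encoder.weights[0] = 3.0  # one parameter update affects both networks
after_micro = micro_net.features([1.0, 1.0])
after_macro = macro_net.features([1.0, 1.0])
```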
[0053] Optionally, before step S30, that is, before extracting spatiotemporal features from the preprocessed video using a three-dimensional micro-expression recognition model to obtain micro-expression spatiotemporal features and macro-expression spatiotemporal features, the following steps are included:
[0054] S303. Obtain a micro-expression video sample set and a macro-expression video sample set;
[0055] S304. Using the initial three-dimensional micro-expression recognition model, perform spatiotemporal feature sample extraction on the micro-expression video sample set and the macro-expression video sample set to obtain a micro-expression spatiotemporal feature sample set corresponding to the micro-expression video sample set and a macro-expression spatiotemporal feature sample set corresponding to the macro-expression video sample set.
[0056] S305. Construct a quadruplet loss function based on the micro-expression spatiotemporal feature sample set and the macro-expression spatiotemporal feature sample set;
[0057] S306. Determine the loss value based on the quadruplet loss function and the cross-entropy loss function;
[0058] S307. When the loss value does not meet the convergence condition, the initial parameters of the initial three-dimensional micro-expression recognition model are iteratively updated, and a new loss value is calculated based on the updated initial parameters; when the new loss value meets the convergence condition, the initial three-dimensional micro-expression recognition model corresponding to the new loss value is determined as the three-dimensional micro-expression recognition model.
[0059] Understandably, the micro-expression video sample set includes several micro-expression videos of various expressions. For example, the micro-expression video sample set includes a standard micro-expression video, a first micro-expression sample video, and a second micro-expression sample video. The standard micro-expression video and the first micro-expression sample video are different micro-expression videos corresponding to a first type of expression. The second micro-expression sample video is a micro-expression video corresponding to a second type of expression, which belongs to a different category than the first type. The macro-expression video sample set includes several macro-expression videos of various expressions. The initial 3D micro-expression recognition model is an untrained 3D convolutional neural network. It includes an initial micro-expression recognition network and an initial macro-expression recognition network, which extract spatiotemporal feature samples from the micro-expression video sample set and the macro-expression video sample set, respectively. The micro-expression spatiotemporal feature sample set includes several micro-expression spatiotemporal feature samples, and the macro-expression spatiotemporal feature sample set includes several macro-expression spatiotemporal feature samples. To learn the shared spatiotemporal features between micro-expression and macro-expression spatiotemporal feature samples, a quadruplet loss function $I_q$ is constructed. To address the imbalance of micro-expression samples, a cross-entropy loss function $I_{Fcoal}$ is introduced. The total loss function $I$ is determined from the quadruplet loss function $I_q$ and the cross-entropy loss function $I_{Fcoal}$ as $I = I_q + I_{Fcoal}$.
In model training, a smaller loss value is better. Training continues until the model's loss value meets the convergence condition. The convergence condition can be that the loss value is less than a preset convergence threshold; for example, if the preset convergence threshold is 0.1, the convergence condition is a loss value less than 0.1. When the loss value meets the convergence condition, training stops, and the initial 3D micro-expression recognition model corresponding to that loss value is determined as the 3D micro-expression recognition model. When the loss value does not meet the convergence condition, the initial parameters of the initial 3D micro-expression recognition model are iteratively updated, a new loss value is calculated based on the updated parameters, and it is determined whether the new loss value meets the convergence condition. Training stops once a new loss value meets the convergence condition, and the initial 3D micro-expression recognition model corresponding to that new loss value is determined as the 3D micro-expression recognition model.
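The stopping rule in S306 and S307 can be sketched as a loop that sums the two loss terms and keeps updating parameters until the total falls below the threshold. The sketch uses a single scalar parameter and toy loss and update functions, all of which are assumptions for illustration; a real model would compute both losses from network outputs and update its parameters with an optimizer.

```python
from typing import Callable, Tuple


def train_until_converged(
    params: float,
    losses: Callable[[float], Tuple[float, float]],
    update: Callable[[float], float],
    threshold: float = 0.1,
    max_updates: int = 1000,
) -> Tuple[float, float, int]:
    """Update params until quadruplet loss + cross-entropy loss < threshold."""
    quad, ce = losses(params)
    total = quad + ce
    updates = 0
    while total >= threshold:
        if updates >= max_updates:
            raise RuntimeError("loss did not converge")
        params = update(params)      # iterative parameter update
        updates += 1
        quad, ce = losses(params)    # new loss value from updated parameters
        total = quad + ce
    return params, total, updates


# Toy stand-ins: both "losses" shrink as the parameter approaches zero.
def toy_losses(p: float) -> Tuple[float, float]:
    return p * p, abs(p)


def toy_update(p: float) -> float:
    return p * 0.5


final_p, final_loss, n_updates = train_until_converged(1.0, toy_losses, toy_update)
```

The `max_updates` guard is an addition not mentioned in the source; it only prevents the sketch from looping forever if the threshold is never reached.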
[0060] In this embodiment, the loss value is determined by the quadruplet loss function and the cross-entropy loss function. This not only learns the common features between micro-expressions and macro-expressions, but also addresses the problem of imbalanced micro-expression samples, which can improve the recognition accuracy of the 3D micro-expression recognition model.
[0061] Optionally, the micro-expression video sample set includes standard micro-expression videos, first micro-expression sample videos, and second micro-expression sample videos; the standard micro-expression videos and the first micro-expression sample videos refer to different micro-expression videos corresponding to the first type of expression, and the second micro-expression sample videos refer to micro-expression videos corresponding to the second type of expression, which belongs to a different category than the first type of expression; the macro-expression video sample set includes macro-expression sample videos corresponding to the second type of expression;
[0062] In step S305, namely, constructing a quadruplet loss function based on the micro-expression spatiotemporal feature sample set and the macro-expression spatiotemporal feature sample set, the following steps are included:
[0063] S3051. From the micro-expression spatiotemporal feature sample set, obtain the first spatiotemporal feature of the standard micro-expression video, the second spatiotemporal feature of the first micro-expression sample video, and the third spatiotemporal feature of the second micro-expression sample video;
[0064] S3052. Obtain the fourth spatiotemporal feature of the macro-expression sample video from the macro-expression spatiotemporal feature sample set;
[0065] S3053. Construct the quadruplet loss function based on the first spatiotemporal feature, the second spatiotemporal feature, the third spatiotemporal feature, and the fourth spatiotemporal feature.
[0066] Understandably, the first and second categories of expressions are different types of expressions. For example, if the first category of expression is happiness, then the second category of expression could be fear. The standard micro-expression video is used as the anchor sample for the first category of expressions, and the first micro-expression sample video is used as the positive sample for the first category of expressions. The second micro-expression sample video and the first micro-expression sample video are micro-expression videos of different expression categories.
[0067] In this embodiment, a quadruplet loss function is constructed by acquiring the spatiotemporal features of micro-expressions and macro-expressions of different expression types. This allows for the full learning of the common features among micro-expressions and macro-expressions of different expression types, thereby improving the recognition accuracy of the 3D micro-expression recognition model.
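The source does not give the formula of the quadruplet loss. The sketch below assumes a margin-based form with two hinge terms, in which the anchor-to-positive distance is compared with the anchor's distance to each of the two second-category samples (the micro-expression one and the macro-expression one). Some published quadruplet losses instead compare against the distance between the two negatives, so this mapping of the four features, the margins, and the function names are assumptions and may differ from the patented formulation.

```python
from typing import Sequence


def sq_dist(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def quadruplet_loss(
    first: Sequence[float],    # anchor: standard micro-expression video
    second: Sequence[float],   # positive: first micro-expression sample video
    third: Sequence[float],    # negative: second micro-expression sample video
    fourth: Sequence[float],   # macro-expression sample video (second category)
    margin1: float = 1.0,
    margin2: float = 0.5,
) -> float:
    """Assumed two-term hinge form; not confirmed by the source."""
    d_ap = sq_dist(first, second)
    term_micro = max(0.0, d_ap - sq_dist(first, third) + margin1)
    term_macro = max(0.0, d_ap - sq_dist(first, fourth) + margin2)
    return term_micro + term_macro


# Toy two-dimensional features (hypothetical values).
loss = quadruplet_loss(
    first=[0.0, 0.0],
    second=[0.0, 1.0],
    third=[3.0, 0.0],
    fourth=[1.0, 0.0],
)
```

In this toy case the micro-expression negative is already far enough from the anchor, so only the macro-expression term contributes to the loss.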
[0068] It should be understood that the sequence number of each step in the above embodiments does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
[0069] In one embodiment, a micro-expression recognition device is provided, which corresponds one-to-one with the micro-expression recognition method described in the above embodiments. As shown in Figure 3, the micro-expression recognition device includes a video to be processed module 10, a preprocessed video module 20, a spatiotemporal feature extraction module 30, a shared spatiotemporal feature module 40, and a recognition result module 50. Detailed descriptions of each functional module are as follows:
[0071] The video to be processed module 10 is used to acquire videos containing micro-expressions and macro-expressions.
[0072] The preprocessing video module 20 is used to perform data preprocessing, including face detection, on the video to be processed to obtain a preprocessed video;
[0073] The spatiotemporal feature extraction module 30 is used to extract spatiotemporal features from the preprocessed video using a three-dimensional micro-expression recognition model to obtain micro-expression spatiotemporal features and macro-expression spatiotemporal features.
[0074] The shared spatiotemporal feature module 40 is used to extract the shared spatiotemporal features of the micro-expression and the macro-expression from the micro-expression spatiotemporal features and the macro-expression spatiotemporal features;
[0075] The recognition result module 50 is used to classify the micro-expression according to the shared spatiotemporal features, the micro-expression spatiotemporal features and the macro-expression spatiotemporal features, and obtain the recognition result of the micro-expression.
[0076] Preprocessing video module 20 includes:
[0077] The facial landmark unit is used to perform face detection on the video frame sequence to be processed in the video to be processed using a vision library to obtain facial landmarks;
[0078] A face video frame sequence unit is used to perform face cropping based on the face key points to obtain a face video frame sequence;
[0079] A preprocessing video unit is used to generate the preprocessed video based on the face video frame sequence.
[0080] Optionally, the preprocessing video unit includes:
[0081] A face alignment unit is used to perform face alignment on face video frames in the face video frame sequence based on the face key points to generate a reference face video.
[0082] The temporal image interpolation unit is used to perform temporal image interpolation on the reference face video to obtain the preprocessed video.
[0083] Optionally, the three-dimensional micro-expression recognition model includes a micro-expression recognition network and a macro-expression recognition network;
[0084] The spatiotemporal feature extraction module 30 includes:
[0085] The micro-expression spatiotemporal feature unit is used to extract micro-expression spatiotemporal features from the video frame sequence to be processed in the preprocessed video through the micro-expression recognition network, and obtain the micro-expression spatiotemporal features.
[0086] The macro-expression spatiotemporal feature unit is used to extract macro-expression spatiotemporal features from the sequence of video frames to be processed in the preprocessed video through the macro-expression recognition network, so as to obtain the macro-expression spatiotemporal features.
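As a rough illustration of two branches reading the same preprocessed clip, the sketch below assumes that "three-dimensional" refers to convolution over time, height, and width; this section does not state the network architecture, so that is an assumption. The function `conv3d_valid` and both kernels are hypothetical, and the kernels are hand-picked rather than learned as they would be in a trained network.

```python
from typing import List

Volume = List[List[List[float]]]   # indexed as [time][row][column]


def conv3d_valid(clip: Volume, kernel: Volume) -> Volume:
    """Slide a 3D kernel over a clip without padding (cross-correlation form)."""
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    ot = len(clip) - kt + 1
    oh = len(clip[0]) - kh + 1
    ow = len(clip[0][0]) - kw + 1
    return [[[sum(clip[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                  for dt in range(kt) for dy in range(kh) for dx in range(kw))
              for x in range(ow)]
             for y in range(oh)]
            for t in range(ot)]


# Two hypothetical branches applied to the same preprocessed clip.
micro_kernel = [[[-1.0]], [[1.0]]]   # frame-to-frame difference (illustrative only)
macro_kernel = [[[0.5]], [[0.5]]]    # two-frame average (illustrative only)

clip = [[[0.0, 0.0], [0.0, 0.0]],
        [[0.0, 1.0], [0.0, 0.0]],
        [[0.0, 3.0], [0.0, 0.0]]]    # 3 frames of 2x2 pixels
micro_features = conv3d_valid(clip, micro_kernel)
macro_features = conv3d_valid(clip, macro_kernel)
```

Because each kernel spans two frames, every output value depends on both appearance and change over time, which is what makes the resulting features spatiotemporal.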
[0087] Optionally, the device further includes the following modules, which operate before the spatiotemporal feature extraction module 30:
[0088] The video sample set module is used to acquire micro-expression video sample sets and macro-expression video sample sets;
[0089] The spatiotemporal feature sample set module is used to extract spatiotemporal feature samples from the micro-expression video sample set and the macro-expression video sample set through the initial three-dimensional micro-expression recognition model, so as to obtain the micro-expression spatiotemporal feature sample set corresponding to the micro-expression video sample set and the macro-expression spatiotemporal feature sample set corresponding to the macro-expression video sample set.
[0090] The quadruplet loss function module is used to construct a quadruplet loss function based on the micro-expression spatiotemporal feature sample set and the macro-expression spatiotemporal feature sample set.
[0091] The loss value module is used to determine the loss value based on the quadruplet loss function and the cross-entropy loss function.
[0092] The three-dimensional micro-expression recognition model module is used to iteratively update the initial parameters of the initial three-dimensional micro-expression recognition model when the loss value does not meet the convergence condition, and calculate a new loss value based on the updated initial parameters; when the new loss value meets the convergence condition, the initial three-dimensional micro-expression recognition model corresponding to the new loss value is determined as the three-dimensional micro-expression recognition model.
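The training procedure carried out by these modules can be sketched as below. Several details are assumptions, since this section does not fix them: the two losses are combined as a weighted sum, the convergence condition is taken to be "the loss changes by less than a tolerance between iterations", and the update rule is plain gradient descent. The names `total_loss`, `train`, and `toy_loss_and_grad` are hypothetical.

```python
import math
from typing import Callable, List, Tuple


def cross_entropy(probs: List[float], label: int) -> float:
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[label])


def total_loss(quadruplet: float, ce: float, weight: float = 1.0) -> float:
    """Assumed combination: cross-entropy plus a weighted quadruplet term."""
    return ce + weight * quadruplet


def train(params: List[float],
          loss_and_grad: Callable[[List[float]], Tuple[float, List[float]]],
          lr: float = 0.1, tol: float = 1e-6, max_iters: int = 10_000
          ) -> Tuple[List[float], float, int]:
    """Update params until the loss change falls below tol (assumed convergence condition)."""
    loss, grad = loss_and_grad(params)
    for step in range(1, max_iters + 1):
        params = [p - lr * g for p, g in zip(params, grad)]
        new_loss, grad = loss_and_grad(params)   # new loss from the updated parameters
        if abs(loss - new_loss) < tol:
            return params, new_loss, step        # converged: keep this model
        loss = new_loss
    return params, loss, max_iters


# Usage: a one-parameter quadratic stands in for the model's loss surface.
def toy_loss_and_grad(params: List[float]) -> Tuple[float, List[float]]:
    (p,) = params
    return (p - 3.0) ** 2, [2.0 * (p - 3.0)]


final_params, final_loss, steps = train([0.0], toy_loss_and_grad)
```

In a real setting `loss_and_grad` would run the initial three-dimensional micro-expression recognition model on the sample sets and return `total_loss` together with its gradient.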
[0093] Optionally, the micro-expression video sample set includes standard micro-expression videos, first micro-expression sample videos, and second micro-expression sample videos; the standard micro-expression videos and the first micro-expression sample videos refer to different micro-expression videos corresponding to the first type of expression, and the second micro-expression sample videos refer to micro-expression videos corresponding to the second type of expression, which belongs to a different category than the first type of expression; the macro-expression video sample set includes macro-expression sample videos corresponding to the second type of expression;
[0094] Accordingly, the quadruplet loss function module includes:
[0095] The micro-expression spatiotemporal feature acquisition unit is used to acquire, from the micro-expression spatiotemporal feature sample set, the first spatiotemporal feature of the standard micro-expression video, the second spatiotemporal feature of the first micro-expression sample video, and the third spatiotemporal feature of the second micro-expression sample video;
[0096] The macro-expression spatiotemporal feature acquisition unit is used to acquire the fourth spatiotemporal feature of the macro-expression sample video from the macro-expression spatiotemporal feature sample set;
[0097] The quadruplet loss function unit is used to construct the quadruplet loss function based on the first spatiotemporal feature, the second spatiotemporal feature, the third spatiotemporal feature, and the fourth spatiotemporal feature.
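One plausible form of a loss built from these four features is sketched below. This section does not give the formula, so the two hinge terms, the margins, and the use of Euclidean distance are assumptions rather than the patented definition. The first term asks the standard micro-expression feature to lie closer to the same-class sample than to the different-class sample; the second asks the macro-expression feature to lie closer to the micro-expression sample of its own class than to the standard video of the other class.

```python
import math
from typing import Sequence


def distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def quadruplet_loss(f1: Sequence[float], f2: Sequence[float],
                    f3: Sequence[float], f4: Sequence[float],
                    margin_1: float = 1.0, margin_2: float = 0.5) -> float:
    """Hypothetical quadruplet loss over the first to fourth spatiotemporal features.

    f1: standard micro-expression video (first type of expression)
    f2: first micro-expression sample video (first type)
    f3: second micro-expression sample video (second type)
    f4: macro-expression sample video (second type)
    """
    # Same-class micro-expressions should be closer than different-class ones.
    term_1 = max(0.0, distance(f1, f2) - distance(f1, f3) + margin_1)
    # The macro feature should sit nearer its own class than the other class.
    term_2 = max(0.0, distance(f3, f4) - distance(f1, f4) + margin_2)
    return term_1 + term_2


# Usage with made-up 2-D features.
loss_bad = quadruplet_loss([0, 0], [0, 1], [3, 4], [3, 0])    # macro feature misplaced
loss_good = quadruplet_loss([0, 0], [0, 1], [3, 4], [3, 4])   # macro feature well placed
```

In the first call only the second term is active, because the macro-expression feature sits nearer the other class than its own; in the second call both constraints hold and the loss is zero.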
[0098] For specific limitations regarding the micro-expression recognition device, please refer to the limitations of the micro-expression recognition method above, which will not be repeated here. Each module in the aforementioned micro-expression recognition device can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device in hardware form, or stored in the memory of a computer device in software form, so that the processor can call and execute the operations corresponding to each module.
[0099] In one embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in Figure 4. The computer device includes a processor, a memory, a network interface, and a database connected via a system bus. The processor provides computational and control capabilities. The memory includes a readable storage medium and internal memory. The readable storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions stored in the readable storage medium. The database stores data related to the micro-expression recognition method. The network interface communicates with external terminals via a network connection. When the computer-readable instructions are executed by the processor, a micro-expression recognition method is implemented. The readable storage medium provided in this embodiment includes both non-volatile and volatile readable storage media.
[0100] In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor performs the following steps when executing the computer-readable instructions:
[0101] Obtain the video to be processed, which includes micro-expressions and macro-expressions;
[0102] The video to be processed is subjected to data preprocessing including face detection to obtain a preprocessed video;
[0103] Spatiotemporal features of the preprocessed video are extracted using a three-dimensional micro-expression recognition model to obtain spatiotemporal features of micro-expressions and spatiotemporal features of macro-expressions.
[0104] Extract the shared spatiotemporal features of the micro-expression and the macro-expression from the micro-expression spatiotemporal features and the macro-expression spatiotemporal features;
[0105] The micro-expressions are classified according to the shared spatiotemporal features, the micro-expression spatiotemporal features, and the macro-expression spatiotemporal features to obtain the micro-expression recognition results.
[0106] In one embodiment, one or more computer-readable storage media storing computer-readable instructions are provided. The readable storage media provided in this embodiment include non-volatile readable storage media and volatile readable storage media. The readable storage media stores computer-readable instructions, which, when executed by one or more processors, perform the following steps:
[0107] Obtain the video to be processed, which includes micro-expressions and macro-expressions;
[0108] The video to be processed is subjected to data preprocessing including face detection to obtain a preprocessed video;
[0109] Spatiotemporal features of the preprocessed video are extracted using a three-dimensional micro-expression recognition model to obtain spatiotemporal features of micro-expressions and spatiotemporal features of macro-expressions.
[0110] Extract the shared spatiotemporal features of the micro-expression and the macro-expression from the micro-expression spatiotemporal features and the macro-expression spatiotemporal features;
[0111] The micro-expressions are classified according to the shared spatiotemporal features, the micro-expression spatiotemporal features, and the macro-expression spatiotemporal features to obtain the micro-expression recognition results.
[0112] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing related hardware with computer-readable instructions. These computer-readable instructions can be stored in a non-volatile readable storage medium or a volatile readable storage medium. When executed, these computer-readable instructions can include the processes of the embodiments of the above methods. Any references to memory, storage, databases, or other media used in the embodiments provided in this application can include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), SyncLink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
[0113] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is used as an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above.
[0114] The above-described embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention, and should all be included within the protection scope of the present invention.
Claims
1. A micro-expression recognition method, characterized in that it comprises:
obtaining a video to be processed, which includes micro-expressions and macro-expressions;
performing data preprocessing, including face detection, on the video to be processed to obtain a preprocessed video;
extracting spatiotemporal features from the preprocessed video using a three-dimensional micro-expression recognition model to obtain micro-expression spatiotemporal features and macro-expression spatiotemporal features;
extracting, from the micro-expression spatiotemporal features and the macro-expression spatiotemporal features, the shared spatiotemporal features of the micro-expression and the macro-expression, the shared spatiotemporal features being the spatiotemporal features that the micro-expression spatiotemporal features and the macro-expression spatiotemporal features have in common;
classifying the micro-expressions according to the shared spatiotemporal features, the micro-expression spatiotemporal features, and the macro-expression spatiotemporal features to obtain the micro-expression recognition results;
wherein, before extracting spatiotemporal features from the preprocessed video using the three-dimensional micro-expression recognition model to obtain the micro-expression spatiotemporal features and the macro-expression spatiotemporal features, the method further comprises:
obtaining a micro-expression video sample set and a macro-expression video sample set, wherein the micro-expression video sample set includes a standard micro-expression video, a first micro-expression sample video, and a second micro-expression sample video; the standard micro-expression video and the first micro-expression sample video are different micro-expression videos corresponding to a first type of expression, and the second micro-expression sample video is a micro-expression video corresponding to a second type of expression, which belongs to a different category than the first type of expression; and the macro-expression video sample set includes macro-expression sample videos corresponding to the second type of expression;
extracting spatiotemporal feature samples from the micro-expression video sample set and the macro-expression video sample set using an initial three-dimensional micro-expression recognition model, to obtain a micro-expression spatiotemporal feature sample set corresponding to the micro-expression video sample set and a macro-expression spatiotemporal feature sample set corresponding to the macro-expression video sample set;
constructing a quadruplet loss function based on the micro-expression spatiotemporal feature sample set and the macro-expression spatiotemporal feature sample set;
determining a loss value based on the quadruplet loss function and a cross-entropy loss function;
when the loss value does not meet the convergence condition, iteratively updating the initial parameters of the initial three-dimensional micro-expression recognition model and calculating a new loss value based on the updated initial parameters; and when the new loss value meets the convergence condition, determining the initial three-dimensional micro-expression recognition model corresponding to the new loss value as the three-dimensional micro-expression recognition model.
2. The micro-expression recognition method as described in claim 1, characterized in that the step of performing data preprocessing, including face detection, on the video to be processed to obtain a preprocessed video includes: performing face detection on the video frame sequence to be processed using a visual library to obtain face key points; performing face cropping based on the face key points to obtain a face video frame sequence; and generating the preprocessed video based on the face video frame sequence.
3. The micro-expression recognition method as described in claim 2, characterized in that the step of generating the preprocessed video based on the face video frame sequence includes: performing, based on the face key points, face alignment on the face video frames in the face video frame sequence to generate a reference face video; and performing temporal image interpolation on the reference face video to obtain the preprocessed video.
4. The micro-expression recognition method as described in claim 1, characterized in that the three-dimensional micro-expression recognition model includes a micro-expression recognition network and a macro-expression recognition network; and the step of extracting spatiotemporal features from the preprocessed video using a three-dimensional micro-expression recognition model to obtain micro-expression spatiotemporal features and macro-expression spatiotemporal features includes: extracting micro-expression spatiotemporal features from the video frame sequence to be processed in the preprocessed video through the micro-expression recognition network to obtain the micro-expression spatiotemporal features; and extracting macro-expression spatiotemporal features from the video frame sequence to be processed in the preprocessed video through the macro-expression recognition network to obtain the macro-expression spatiotemporal features.
5. The micro-expression recognition method as described in claim 1, characterized in that the step of constructing a quadruplet loss function based on the micro-expression spatiotemporal feature sample set and the macro-expression spatiotemporal feature sample set includes: obtaining, from the micro-expression spatiotemporal feature sample set, the first spatiotemporal feature of the standard micro-expression video, the second spatiotemporal feature of the first micro-expression sample video, and the third spatiotemporal feature of the second micro-expression sample video; obtaining, from the macro-expression spatiotemporal feature sample set, the fourth spatiotemporal feature of the macro-expression sample video; and constructing the quadruplet loss function based on the first spatiotemporal feature, the second spatiotemporal feature, the third spatiotemporal feature, and the fourth spatiotemporal feature.
6. A micro-expression recognition device, characterized in that it comprises:
a pending video module, used to acquire a video to be processed containing micro-expressions and macro-expressions;
a preprocessing video module, used to perform data preprocessing, including face detection, on the video to be processed to obtain a preprocessed video;
a spatiotemporal feature extraction module, used to extract spatiotemporal features from the preprocessed video using a three-dimensional micro-expression recognition model to obtain micro-expression spatiotemporal features and macro-expression spatiotemporal features;
a shared spatiotemporal feature module, used to extract the shared spatiotemporal features of the micro-expression and the macro-expression from the micro-expression spatiotemporal features and the macro-expression spatiotemporal features, the shared spatiotemporal features being the spatiotemporal features that the micro-expression spatiotemporal features and the macro-expression spatiotemporal features have in common;
a recognition result module, used to classify the micro-expressions according to the shared spatiotemporal features, the micro-expression spatiotemporal features, and the macro-expression spatiotemporal features, and obtain the recognition result of the micro-expressions;
wherein the device further includes the following modules, which operate before the spatiotemporal feature extraction module:
a video sample set module, used to acquire a micro-expression video sample set and a macro-expression video sample set, wherein the micro-expression video sample set includes a standard micro-expression video, a first micro-expression sample video, and a second micro-expression sample video; the standard micro-expression video and the first micro-expression sample video are different micro-expression videos corresponding to a first type of expression, and the second micro-expression sample video is a micro-expression video corresponding to a second type of expression, which belongs to a different category than the first type of expression; and the macro-expression video sample set includes macro-expression sample videos corresponding to the second type of expression;
a spatiotemporal feature sample set module, used to extract spatiotemporal feature samples from the micro-expression video sample set and the macro-expression video sample set through an initial three-dimensional micro-expression recognition model, so as to obtain a micro-expression spatiotemporal feature sample set corresponding to the micro-expression video sample set and a macro-expression spatiotemporal feature sample set corresponding to the macro-expression video sample set;
a quadruplet loss function module, used to construct a quadruplet loss function based on the micro-expression spatiotemporal feature sample set and the macro-expression spatiotemporal feature sample set;
a loss value module, used to determine a loss value based on the quadruplet loss function and a cross-entropy loss function;
a three-dimensional micro-expression recognition model module, used to iteratively update the initial parameters of the initial three-dimensional micro-expression recognition model when the loss value does not meet the convergence condition, and calculate a new loss value based on the updated initial parameters; and, when the new loss value meets the convergence condition, to determine the initial three-dimensional micro-expression recognition model corresponding to the new loss value as the three-dimensional micro-expression recognition model.
7. The micro-expression recognition device as described in claim 6, characterized in that the preprocessing video module includes: a face key point unit, used to perform face detection on the video frame sequence to be processed in the video to be processed using a visual library to obtain face key points; a face video frame sequence unit, used to perform face cropping based on the face key points to obtain a face video frame sequence; and a preprocessing video unit, used to generate the preprocessed video based on the face video frame sequence.
8. A computer device comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, characterized in that, when the processor executes the computer-readable instructions, it implements the micro-expression recognition method as described in any one of claims 1 to 5.
9. One or more readable storage media storing computer-readable instructions, characterized in that, when the computer-readable instructions are executed by one or more processors, they cause the one or more processors to perform the micro-expression recognition method as described in any one of claims 1 to 5.