A micro-expression recognition method based on active three-dimensional imaging
By acquiring multimodal data using an active 3D camera and combining spatiotemporal self-attention and cross-modal cross-attention mechanisms, this method solves the problems of insufficient generalization ability and video trimming dependence of traditional micro-expression recognition methods under complex lighting conditions, and achieves efficient micro-expression detection and recognition.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- XI AN JIAOTONG UNIV
- Filing Date
- 2025-06-16
- Publication Date
- 2026-06-26
AI Technical Summary
Traditional micro-expression recognition methods lack generalization ability under complex lighting conditions, rely on manual video editing leading to information loss, struggle to handle multiple emotional segments and nesting in long videos, and existing deep learning models face lighting dependence and background frame processing issues in real-world scenarios.
Active 3D cameras are used to collect multimodal data. End-to-end detection and classification are performed through spatiotemporal self-attention and cross-modal cross-attention mechanisms to construct a cross-modal micro-expression dataset. Infrared mode is used to eliminate ambient light interference, depth mode is used to record 3D structural dynamics, and RGB mode is used to preserve appearance texture, achieving illumination robustness and eliminating the need for video trimming.
It improves the model's feature robustness and adaptability under complex lighting conditions, significantly enhances the accuracy and generalization ability of micro-expression detection, and can directly process raw long video sequences, reducing localization errors and preprocessing costs.
Figure CN120635966B_ABST
Description
Technical Field
[0001] This invention relates to the field of micro-expression recognition technology, and more specifically to a micro-expression recognition method based on active three-dimensional imaging.
Background Technology
[0002] The brief duration and subtle facial deformation of micro-expressions make them difficult to capture effectively using traditional visual perception methods, placing extremely high demands on the spatiotemporal resolution and feature sensitivity of detection algorithms.
[0003] Current mainstream micro-expression datasets (such as SAMM and CASME) are mainly collected in controlled lighting environments in laboratories, relying on uniform white light illumination and fixed camera angles. While these datasets provide standardized benchmarks for early algorithm validation, they have significant limitations in real-world applications. On the one hand, lighting variations in real-world scenes (such as low illumination, strong reflections, and color temperature differences) severely interfere with the color consistency and texture clarity of RGB images, resulting in insufficient generalization ability of detection models based on appearance features. On the other hand, the natural occurrence of micro-expressions is sporadic and subtle, and the performance of expressions during artificially induced collection deviates from real-world scenes. Furthermore, annotating the start and end frames and categories of expressions frame by frame requires a significant amount of professional manpower, resulting in generally small-scale publicly available datasets that are insufficient to support the training needs of complex deep learning models.
[0004] Micro-expression analysis typically involves two core stages: micro-expression localization and micro-expression recognition. Early micro-expression localization methods relied on manually designed spatiotemporal features, using a sliding window to detect segments with significant inter-frame differences.
[0005] With the development of deep learning, temporal modeling methods based on three-dimensional convolutional neural networks (3D CNN) and Long Short-Term Memory (LSTM) networks have been proposed. However, existing methods generally suffer from two major drawbacks. First, most models assume that a micro-expression exists as a single segment, making it difficult to handle the nesting and overlapping of multiple emotional segments in long videos. Second, they rely on manually trimmed short segments as input and cannot directly process original long video sequences, so videos must be pre-segmented in practical applications, which increases preprocessing costs and the risk of information loss. In recognition tasks, traditional methods classify micro-expressions by extracting static appearance features from peak frames, ignoring the temporal information in the dynamic changes of micro-expressions. In recent years, end-to-end models based on spatiotemporal networks have attempted to use the inter-frame motion information of video sequences, but they still face many challenges in real-world scenarios, such as strong dependence on laboratory lighting conditions and difficulty in handling invalid background frames in untrimmed videos.
Summary of the Invention
[0006] To overcome the shortcomings of the existing technologies, this invention provides a micro-expression recognition method based on active 3D imaging. It uses an active 3D camera to collect multimodal data and achieves end-to-end detection and classification through spatiotemporal self-attention and cross-modal cross-attention mechanisms. It has the characteristics of good illumination robustness and no need for video trimming.
[0007] To achieve the above objectives, the technical solution adopted by the present invention is as follows:
[0008] A micro-expression recognition method based on active 3D imaging includes the following steps:
[0009] Step 1: Acquire images using active 3D cameras to construct a micro-expression dataset with illumination robustness and multimodal features; active 3D cameras include time-of-flight cameras, structured light cameras, etc.
[0010] Step 2: Input the micro-expression dataset into a 3D convolutional network to obtain initial fusion features;
[0011] Step 3: Input the initial fused features into the spatiotemporal self-attention mechanism module, and use the multi-head attention mechanism to perform spatiotemporal joint modeling of the feature sequence to capture the long-range temporal dependence and dynamic change regions of micro-expressions; introduce the cross-attention mechanism to obtain fused features, and compress the fused features through global average pooling into compressed features that preserve key spatiotemporal information;
[0012] Step 4: Input the compressed features into the dual-branch network to simultaneously achieve temporal localization and emotion classification of micro-expression fragments.
[0013] In step one, during the dataset construction phase, an active 3D camera, the Intel RealSense LiDAR L515, is used to acquire active infrared light intensity maps, depth maps, and RGB images.
[0014] Multiple scenarios are constructed (a dark room with low illumination of <10 lux and normal indoor lighting of 300-500 lux), and participants of different ages, genders, and skin colors are invited to watch emotionally impactful videos to naturally induce micro-expressions, ensuring data diversity.
[0015] The videos are analyzed frame by frame to label seven micro-expression categories (anger, disgust, fear, happiness, surprise, sadness, and others) and time boundaries (start frame, peak frame, and end frame).
[0016] During preprocessing, histogram equalization and other enhancement processes are applied to the active infrared light intensity map, median filtering is used to denoise the depth map, white balance correction is performed on the RGB image, and multimodal data time sequence alignment is achieved by analyzing timestamps. The video sequence is stabilized based on optical flow to reduce positional deviations caused by shaking or jitter.
[0017] In step two:
[0018] The window size is set to s frames according to the duration characteristics of micro-expressions, and the window slides with a fixed step size (balancing detection real-time performance and accuracy, and ensuring that adjacent windows overlap to avoid missed detections) to extract continuous frame subsequences of the active infrared light intensity map, depth map, and RGB image, which are used as inputs to the 3D convolutional network branches to extract the spatiotemporal features of each modality;
[0019] The RGB modality branch takes the RGB image as 3-channel input, and a 3D convolutional network extracts the spatiotemporal features of appearance texture;
[0020] The infrared modality branch takes the active infrared light intensity map as 1-channel input, and a 3D convolutional network extracts the spatiotemporal features of intensity changes;
[0021] The depth modality branch takes the depth map as 1-channel input and extracts the spatiotemporal features of facial 3D structural displacement;
[0022] The output features of the branches are concatenated along the channel dimension to form the initial fused features.
[0023] Step two specifically involves:
[0024] The preprocessed multimodal sequence data (RGB, active infrared, depth map) is normalized and then input into the modality-specific feature extraction module;
[0025] Design independent 3D convolutional networks for different modal characteristics:
[0026] The first layer of the RGB modality branch is a 3D convolutional layer (kernel size […], stride […], padding […]) configured with 64 convolutional kernels, followed by a batch normalization (BN) layer and a rectified linear unit (ReLU) activation function, to extract the spatiotemporal features of the appearance texture (such as the spatiotemporal gradient of facial muscle texture changes);
[0027] The first layer of the infrared and depth modality branches is a 3D convolutional layer (kernel size […], stride […], padding […]) configured with 32 convolutional kernels, with subsequent processing the same as in the RGB branch, to extract the spatiotemporal features of intensity change (infrared) and facial 3D structure displacement (depth);
[0028] After each branch undergoes two layers of 3D convolution, the output feature tensor size is unified to […] (T is the sequence length and C is the number of channels, with C=64 for the RGB branch and C=32 for the infrared and depth branches), and the three types of features are concatenated along the channel dimension to form the initial fused features.
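As a quick check of the channel arithmetic above, the following minimal sketch concatenates the three branch outputs along the channel axis. The channel counts (64, 32, 32) come from the text; the helper names are hypothetical, and plain Python lists stand in for feature tensors.

```python
# Channel counts stated in the text: 64 for RGB, 32 each for infrared and depth.
BRANCH_CHANNELS = {"rgb": 64, "infrared": 32, "depth": 32}


def fused_channels(branch_channels):
    """Channel count after concatenating branch outputs along the channel axis."""
    return sum(branch_channels.values())


def concat_channels(*feature_vectors):
    """Concatenate per-position feature vectors (plain lists) channel-wise."""
    fused = []
    for vec in feature_vectors:
        fused.extend(vec)
    return fused


TOTAL_CHANNELS = fused_channels(BRANCH_CHANNELS)  # 64 + 32 + 32 = 128
```

The resulting 128 channels agree with the 128-dimensional input of the fully connected layers (128→256→128 and 128→512→256→8) described for the dual-branch network.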
[0029] Step three specifically involves:
[0030] A spatiotemporal self-attention mechanism module is employed to capture the long-range temporal dependencies and dynamically changing regions of the initial fused features:
[0031] First, spatiotemporal positional encoding is added to the frames within each segment of the initial fused features. The temporal encoding uses a sine function in which k is the dimension index and D is the feature dimension; t, x, and y correspond to the temporal length of the segment and the width and height of the video, respectively. The ratio of the dimension index k to the total dimension D serves as the exponent, so that different dimensions of the positional encoding have different frequencies. This allows the spatiotemporal self-attention mechanism to distinguish relative positions in time and space and to capture the long-range temporal dependence and spatial structural features of micro-expressions.
[0032] Then, a multi-head attention mechanism is used to perform spatiotemporal joint modeling on the time-encoded feature sequence, outputting features that include temporal dynamic priorities. Each encoder layer contains a multi-head self-attention and feed-forward network (FFN).
[0033] Multi-head self-attention projects the feature sequence into query Q, key K, and value V, which are generated by linear mapping of the input features, and calculates the attention weights by scaled dot product, dividing by a dimension scaling factor; the activation function is GELU, followed by residual connections and layer normalization (LN) to prevent vanishing gradients; after encoder processing, the output contains features with temporal dynamic priorities, and the regions with high attention weights correspond to micro-expression keyframes (such as the start frame and peak frame).
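The scaled dot-product step can be illustrated with a minimal single-head sketch. It assumes the standard form softmax(QK^T / sqrt(d))V, with sqrt(d) as the dimension scaling factor; the learned Q/K/V projections, the multiple heads, and the FFN are omitted, and lists of lists stand in for tensors.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Single-head attention.

    Q: n x d, K: m x d, V: m x dv (lists of lists).
    Returns (output, weights), where weights[i][j] is the attention that
    query i pays to key j.
    """
    scale = math.sqrt(len(Q[0]))  # dimension scaling factor (assumed sqrt(d))
    weights = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights.append(softmax(scores))
    output = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in weights
    ]
    return output, weights
```

In this picture, frames whose keys align strongly with a query receive larger weights, which is how high-weight regions can end up corresponding to key frames such as the start and peak frames.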
[0034] A cross-modal attention mechanism is designed to achieve intermodal information complementarity, using the RGB modality features as the query and the infrared and depth modality features as key-value pairs; linear transformations unify their dimensions to 64;
[0035] Modal fusion is achieved by calculating intermodal correlation weights using a learnable parameter matrix, where one such weight indicates the dependence of the RGB modality on the infrared modality. In low-light scenarios, this weight automatically increases to strengthen the compensation of appearance noise by infrared features. Finally, a gating parameter is introduced to dynamically balance the cross-modal fused features and the intrinsic modality features, avoiding information overload. The fused features are then compressed using global average pooling, preserving key spatiotemporal information.
[0036] In step four, one branch passes through two fully connected layers followed by two regression layers to output the normalized start and end frames of the micro-expression, and a smooth L1 loss function is used to calculate the localization loss. The other branch passes through three fully connected layers, with the last layer using the SoftMax function to output a probability distribution over seven emotion categories, and a weighted cross-entropy loss function is used to calculate the classification loss to address the class imbalance problem. The localization loss and classification loss are jointly optimized: during training, the loss is continuously reduced through the backpropagation algorithm, thereby improving the localization accuracy of micro-expression segments and the recognition accuracy of emotion categories.
[0037] Step four specifically involves:
[0038] Joint detection and classification are achieved by inputting fused features into a dual-branch network.
[0039] One branch of the dual-branch network uses two fully connected layers (128→256→128), followed by two regression layers (128→64→2), outputting a normalized start frame and end frame (range [0,1], mapped back to actual frame numbers);
[0040] To address the ambiguity of micro-expression boundaries, a smooth L1 loss, defined in terms of the frame difference and an indicator function, is used as the localization loss;
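A minimal sketch of a smooth L1 localization loss is given below. It assumes the commonly used definition (quadratic for small differences, linear for large ones) with a threshold `beta` of 1.0; the threshold and the averaging over the start and end frames are assumptions, since the exact formula is not legible in this text.

```python
def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss for one coordinate.

    Quadratic when |diff| < beta, linear otherwise. beta=1.0 is an assumed
    threshold, not a value taken from the patent.
    """
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


def localization_loss(pred_bounds, true_bounds):
    """Mean smooth L1 over the (start, end) pair of normalized frame positions."""
    losses = [smooth_l1(p, t) for p, t in zip(pred_bounds, true_bounds)]
    return sum(losses) / len(losses)
```

The two cases of the piecewise definition play the role of the indicator function mentioned above: small boundary errors are penalized gently, which suits boundaries that are inherently ambiguous.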
[0041] The other path of the dual-branch network passes through three fully connected layers (128→512→256→8), and the final SoftMax layer outputs a probability distribution of 7 types of emotions.
[0042] To address the class imbalance issue in the micro-expression dataset, where there are significant differences in the number of samples for each emotion category, a weighted cross-entropy loss method is employed.
[0043] The class weight is calculated as the reciprocal of the number of samples in each category (for example, if the proportion of neutral samples is high, its weight is reduced) to balance the gradient contribution of each emotion category during training.
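The reciprocal-count weighting can be sketched as follows. The sample counts are hypothetical, and normalizing the weights so that they average to 1 is an added convention rather than something stated in the text.

```python
import math


def inverse_frequency_weights(counts):
    """Weight each class by the reciprocal of its sample count.

    The weights are rescaled to average 1 (an assumed convention).
    """
    raw = {label: 1.0 / n for label, n in counts.items()}
    mean = sum(raw.values()) / len(raw)
    return {label: w / mean for label, w in raw.items()}


def weighted_cross_entropy(probs, label, weights):
    """Loss for one sample: -w[label] * log(p[label])."""
    return -weights[label] * math.log(probs[label])


# Hypothetical class counts, for illustration only.
COUNTS = {"happiness": 50, "surprise": 100, "others": 400}
WEIGHTS = inverse_frequency_weights(COUNTS)
```

With these counts, a misclassified "happiness" sample contributes a larger loss than an equally misclassified "others" sample, so rare categories are not drowned out during training.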
[0044] The overall network model is trained end-to-end with a joint loss function that combines the localization loss and the classification loss. The optimizer is AdamW (weight decay 0.01) with an initial learning rate of 1 […]. Cosine annealing learning rate scheduling is adopted (200 epochs per cycle, minimum learning rate 1 […]).
[0045] The beneficial effects of this invention are:
[0046] By simultaneously acquiring active infrared light intensity maps, depth maps, and RGB images, a cross-modal micro-expression dataset is constructed. The infrared modality eliminates ambient light interference, the depth modality records 3D structural dynamics, and the RGB modality preserves appearance texture; using these complementary properties effectively mitigates the feature degradation of traditional single-modal RGB data under complex lighting conditions. In low-light scenarios in particular, feature robustness is enhanced, which significantly improves the adaptability of the detection model to real-world environments.
[0047] A cross-modal micro-expression dataset is constructed by simultaneously acquiring active infrared light intensity maps, depth maps, and RGB images. The principle behind this approach is to leverage the physical characteristics of different modalities to achieve environmental adaptability and feature complementarity: the active infrared modality avoids interference from ambient visible light based on the thermal radiation principle of 860 nm near-infrared light, stably recording the dynamics of the facial temperature field to achieve illumination invariance; the depth modality uses the Time-of-Flight (ToF) method to calculate the 3D coordinates of the face by measuring the round-trip time of laser pulses, recording the spatial deformation trajectory of micro-expressions; and the RGB modality preserves facial texture and color information based on visible light reflectance characteristics. The cross-modal complementarity mechanism dynamically allocates weights and strengthens semantic associations through spatiotemporal self-attention and cross-modal cross-attention, forming composite features encompassing "illumination robustness + spatial structure + apparent details," thus solving the feature degradation problem of traditional single-modal RGB under complex lighting conditions from the physical signal level. The spatiotemporal joint detection and classification model proposed in this invention breaks through the traditional phased framework of "first localization, then recognition," automatically focusing on key frame intervals of micro-expressions through a spatiotemporal self-attention mechanism and achieving feature fusion by combining multimodal cross-attention, allowing direct processing of original long video sequences without manual video trimming. This method effectively reduces localization errors and improves the model's detection efficiency for randomly occurring, non-fixed-duration micro-expressions in natural scenes.
[0048] By collaboratively optimizing spatiotemporal self-attention and multimodal cross-attention, the model effectively captures the instantaneous dynamic features and cross-modal complementary information of micro-expressions, overcoming the dependence of existing methods on manually edited videos and single modalities. This technical solution significantly improves the robustness and generalization ability of the model while enhancing detection accuracy, providing an efficient solution for micro-expression detection in complex scenarios.
Attached Figure Description
[0049] Figure 1 is a schematic diagram comparing the present invention with other micro-expression recognition methods.
[0050] Figure 2 shows the distribution of the seven emotions among micro-expressions and macro-expressions in the micro-expression dataset constructed in this invention.
[0051] Figure 3 is a distribution diagram showing the duration of facial expression segments in the micro-expression dataset collected by this invention.
[0052] Figure 4 is a schematic diagram of the sliding window mechanism used in this invention.
[0053] Figure 5 shows the complete frame processing flow for facial expression segments in the micro-expression detection of this invention.
[0054] Figure 6 is a comparison diagram of the confusion matrices of micro-expression detection models under different attention mechanism configurations in this invention.
Detailed Implementation
[0055] The present invention will now be described in further detail with reference to the accompanying drawings.
[0056] A micro-expression recognition method based on active 3D imaging includes the following steps:
[0057] 1. Dataset Construction:
[0058] To construct a micro-expression dataset with illumination robustness and multimodal characteristics, this invention uses an Intel RealSense LiDAR L515 ToF LiDAR depth camera for data acquisition. This device can simultaneously acquire active infrared light intensity maps, depth maps, and RGB images, providing rich modal information for the dataset. The camera's LiDAR module operates at an active infrared wavelength of 860 nm and employs a time-of-flight (ToF) method for depth acquisition. It calculates the target distance by measuring the round-trip time of the laser pulse, achieving a depth resolution of 1024×768 pixels with an accuracy of ±0.5 mm (within the 0.5-5 m range) and a frame rate of 30 fps (depth map synchronized with infrared image). The RGB camera, using a color CMOS sensor, captures images with a resolution of 1920×1080 pixels and a spectral response range of 400-700 nm (visible light band), also at a frame rate of 30 fps, synchronized with the LiDAR module via hardware triggering.
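The time-of-flight relation mentioned above can be written as distance = c × t / 2, since the pulse travels to the target and back. A minimal sketch follows; the function names are hypothetical.

```python
SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def tof_distance(round_trip_seconds):
    """Target distance in metres: the pulse covers the path twice, so halve it."""
    return SPEED_OF_LIGHT * round_trip_seconds / 2.0


def round_trip_time(distance_m):
    """Round-trip time in seconds for a target at the given distance."""
    return 2.0 * distance_m / SPEED_OF_LIGHT
```

For a face at 1 m, the round trip takes roughly 6.7 nanoseconds, which gives a sense of the timing resolution that ToF depth sensing relies on.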
[0059] During the data collection process, multiple lighting scenarios were constructed to simulate various lighting conditions in real-world environments. These included low-light scenarios (such as a dark room environment with <10 lux) and normal indoor lighting scenarios (approximately 300-500 lux). In each lighting scenario, multiple subjects were invited to participate in the collection of micro-expression data. Subjects included individuals of different ages, genders, and skin colors to ensure the diversity and representativeness of the dataset. Before collection, the purpose and process of micro-expression data collection were explained to the subjects to ensure their understanding of and consent to the experiment. Subjects were required to sit in a fixed position, keeping their faces roughly within the center of the camera's field of view, with no obstructions (such as hair or hats) to ensure image quality. A natural induction method was used to guide subjects to produce micro-expressions: subjects were shown emotionally impactful video clips to induce natural micro-expression responses. Throughout each micro-expression, the device simultaneously acquired active infrared light intensity maps, depth maps, and RGB images, recording the complete dynamic process of the micro-expression.
[0060] During the data annotation process, the annotation team analyzed the collected video sequences frame by frame to determine the category of micro-expressions (including anger, disgust, fear, happiness, surprise, sadness, and others). Based on the facial movement features and emotional expressions of micro-expressions, combined with relevant knowledge and experience in micro-expression recognition, the annotation team determined the category of each micro-expression segment. Temporal boundaries were also annotated: the start frame (Onset), peak frame (Apex), and end frame (Offset) of the micro-expression were labeled to determine the temporal boundaries of the micro-expression within the video sequence.
[0061] Because active infrared light intensity maps can be affected by environmental factors, leading to low image contrast and unclear details, an image enhancement algorithm employing histogram equalization and contrast-limited adaptive histogram equalization is used to improve the contrast and clarity of the active infrared light intensity map, enhancing the recognizability of facial thermal radiation features. Depth maps, in turn, may contain noise and rough areas, which could affect subsequent analysis of muscle deformation depth changes. A median filtering algorithm is used to smooth the depth map, removing noise points while preserving key details of facial muscle deformation. Color correction is performed to address potential color deviations in RGB images under different lighting conditions: a white balance correction algorithm adjusts the color balance of the RGB image, ensuring accurate color information under various lighting scenarios and improving the visual quality and feature consistency of the RGB image.
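To illustrate the depth-map denoising step, here is a minimal median filter sketch on a depth map stored as a list of rows. The 3×3 window (`radius=1`) and the border handling are assumptions; in practice a library routine such as OpenCV's `cv2.medianBlur` would more likely be used.

```python
from statistics import median


def median_filter(depth, radius=1):
    """Median-filter a 2D depth map given as a list of rows.

    Each output pixel is the median of its (2*radius+1)^2 neighborhood.
    Border pixels use only the neighbors that fall inside the image.
    """
    height, width = len(depth), len(depth[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = [
                depth[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            out[y][x] = median(window)
    return out
```

An isolated outlier (for example a single spurious depth reading) is replaced by the value of its neighbors, while edges between smooth regions are largely preserved, which is why median filtering suits depth noise better than simple averaging.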
[0062] Because data from different modalities (active infrared intensity maps, depth maps, and RGB images) may have slight temporal differences during acquisition, they need to be temporally aligned. By analyzing the timestamp information of different modalities, temporal calibration is performed on the data to ensure their consistency in the temporal dimension, so that relevant information in multimodal data can be accurately fused and analyzed subsequently. Simultaneously, the temporally aligned video sequence is stabilized to reduce image position deviations caused by factors such as slight head movements of the subject and camera shake. Using an optical flow-based approach, by subtracting the optical flow of the nose tip of the face from the global optical flow, each frame in the video sequence is registered with a reference frame (such as the first frame), the displacement and rotation parameters between the images are calculated, and corresponding geometric transformations are performed on subsequent frames to keep the entire video sequence spatially stable, improving data quality and usability.
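The timestamp-based alignment described above can be sketched as nearest-neighbor matching between two streams. The function names, the choice of one stream as the reference, and the `max_gap` tolerance are all hypothetical; the stabilization step is not covered here.

```python
from bisect import bisect_left


def nearest_index(timestamps, t):
    """Index of the timestamp closest to t in a sorted list."""
    pos = bisect_left(timestamps, t)
    if pos == 0:
        return 0
    if pos == len(timestamps):
        return len(timestamps) - 1
    before, after = timestamps[pos - 1], timestamps[pos]
    return pos if (after - t) < (t - before) else pos - 1


def align(reference_ts, other_ts, max_gap):
    """Pair each reference frame with the nearest frame of another stream.

    Returns (reference_index, other_index) pairs. Frames whose nearest
    match is farther than max_gap seconds away are dropped.
    """
    pairs = []
    for i, t in enumerate(reference_ts):
        j = nearest_index(other_ts, t)
        if abs(other_ts[j] - t) <= max_gap:
            pairs.append((i, j))
    return pairs
```

For example, with RGB timestamps as the reference and a tolerance of a few milliseconds, each RGB frame is paired with the depth or infrared frame captured closest to it, and frames without a sufficiently close partner are discarded rather than mismatched.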
[0063] 2. Spatiotemporal detection model training
[0064] The preprocessed multimodal sequence data (RGB, active infrared, depth map) is normalized and then input into the modality-specific feature extraction module. Independent 3D convolutional networks are designed for the different modal characteristics. The RGB modality branch uses 3-channel input; its first layer is a 3D convolutional layer (kernel size […], stride […], padding […]) configured with 64 convolutional kernels, followed by a batch normalization (BN) layer and a rectified linear unit (ReLU) activation function, to extract spatiotemporal features of the appearance texture (such as the spatiotemporal gradient of facial muscle texture changes). The infrared and depth modality branches each use 1-channel input; their first layer is a 3D convolutional layer (kernel size […], stride […], padding […]) configured with 32 convolutional kernels, with subsequent processing the same as in the RGB branch, to extract spatiotemporal features of intensity changes (infrared) and facial 3D structure displacement (depth), respectively. After two layers of 3D convolution, the output feature tensor size of each branch is unified to […] (T is the sequence length and C is the number of channels, with C=64 for the RGB branch and C=32 for the infrared and depth branches), and the initial fused feature is formed by channel concatenation.
[0065] To capture the long-range temporal dependencies and dynamically changing regions of micro-expressions, the model employs a spatiotemporal self-attention mechanism. The time dimension is divided into non-overlapping segments of fixed length (set to 30 frames based on the average duration of micro-expressions, so that a segment covers the entire expression cycle) to avoid loss of boundary information. Spatiotemporal positional encoding is added to the frames within each segment, with the temporal encoding using a sine function in which k is the dimension index and D is the feature dimension. A multi-head attention mechanism then performs joint spatiotemporal modeling of the feature sequence. Each encoder layer contains a multi-head self-attention network and a feed-forward network (FFN). The multi-head self-attention projects the features into a query Q, a key K, and a value V, which are generated by linear mapping of the input features, and calculates the attention weights by scaled dot product, dividing by a dimension scaling factor. The activation function is GELU, followed by residual connections and layer normalization (LN) to prevent vanishing gradients. After encoder processing, the output contains features with temporal dynamic priorities, and the regions with high attention weights correspond to micro-expression keyframes (such as the start frame and peak frame).
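A minimal sketch of a sinusoidal temporal encoding for one 30-frame segment is shown below. It assumes the standard Transformer form, with base 10000 and alternating sine and cosine channels; the text only states that a sine function is used with the dimension index k and feature dimension D, so the base and the cosine channels are assumptions.

```python
import math


def temporal_encoding(t, D, base=10000.0):
    """Sinusoidal encoding of frame index t with feature dimension D.

    Channel k uses frequency 1 / base^(2*(k//2)/D), so the exponent depends
    on the ratio of the dimension index to D. Even channels use sine and
    odd channels cosine (assumed, following the standard Transformer form).
    """
    encoding = []
    for k in range(D):
        frequency = 1.0 / (base ** (2 * (k // 2) / D))
        angle = t * frequency
        encoding.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return encoding


SEGMENT_LENGTH = 30  # frames per segment, as stated in the text
SEGMENT_ENCODING = [temporal_encoding(t, D=8) for t in range(SEGMENT_LENGTH)]
```

Because each channel oscillates at a different frequency, every frame position in the segment receives a distinct vector, which lets the attention layers tell frames apart by their relative position.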
[0066] A cross-modal attention mechanism is designed to achieve intermodal information complementarity, using the RGB modality features as the query and the infrared and depth modality features as key-value pairs; linear transformations unify their dimensions to 64. Intermodal correlation weights are calculated using a learnable parameter matrix, where one such weight indicates the dependence of the RGB modality on the infrared modality. In low-light scenarios, this weight automatically increases to strengthen the compensation of appearance noise by infrared features. Finally, a gating parameter is introduced to dynamically balance the cross-modal fused features and the intrinsic modality features, avoiding information overload. The fused features are then compressed using global average pooling, preserving key spatiotemporal information.
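The cross-modal fusion can be sketched for a single feature vector per modality. Everything specific here is an assumption made for illustration: the bilinear score q^T W k with a parameter matrix W, the softmax over the two auxiliary modalities, and a scalar gate in [0, 1] that blends the attended features with the original RGB features. The patent's exact formulas are not legible in this text.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, x):
    return [dot(row, x) for row in W]


def cross_modal_fuse(f_rgb, f_ir, f_depth, W, gate):
    """Fuse an RGB feature vector (query) with infrared and depth vectors.

    Returns (fused, weights), where weights[0] is the RGB-to-infrared
    correlation weight and weights[1] the RGB-to-depth weight.
    """
    keys = [f_ir, f_depth]
    scores = [dot(f_rgb, matvec(W, k)) for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    attended = [
        sum(w * k[i] for w, k in zip(weights, keys)) for i in range(len(f_rgb))
    ]
    fused = [gate * a + (1.0 - gate) * r for a, r in zip(attended, f_rgb)]
    return fused, weights


def global_average_pool(frames):
    """Average a list of per-frame feature vectors into a single vector."""
    n = len(frames)
    return [sum(f[i] for f in frames) / n for i in range(len(frames[0]))]
```

With `gate` close to 0 the output stays near the original RGB features, and with `gate` close to 1 it relies on the infrared and depth features, which matches the described role of the gating parameter in balancing fused and intrinsic features.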
[0067] Joint detection and classification are achieved by inputting the fused features into a dual-branch network. The localization branch consists of two fully connected layers (128→256→128) followed by two regression layers (128→64→2) and outputs a normalized start frame and end frame (range [0,1], mapped back to actual frame numbers). To handle the ambiguity of micro-expression boundaries, a smooth L1 loss, defined in terms of the frame difference and an indicator function, is used as the localization loss. The classification branch passes through three fully connected layers (128→512→256→8), and the final SoftMax layer outputs a probability distribution over seven emotion categories. To address the class imbalance problem, a weighted cross-entropy loss is used. The class weight is calculated as the reciprocal of the number of samples in each category (for example, if the proportion of neutral samples is high, its weight is reduced) to balance the gradient contribution of each emotion category during training.
[0068] The model is trained end-to-end with a joint loss function that combines the localization loss and the classification loss. The optimizer is AdamW (weight decay 0.01) with an initial learning rate of 1 […]. Cosine annealing learning rate scheduling is used (200 epochs per cycle, minimum learning rate 1 […]).
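The cosine annealing schedule can be sketched as a plain function of the epoch. The 200-epoch cycle comes from the text; the restart at the end of each cycle and the learning-rate values used in the usage note are assumptions, since the actual values are not legible here.

```python
import math


def cosine_annealing_lr(epoch, lr_max, lr_min, period=200):
    """Learning rate at a given epoch under cosine annealing.

    The rate starts at lr_max, decays along a half cosine toward lr_min
    over `period` epochs, then restarts (the restart is an assumption).
    """
    t = epoch % period
    cosine = math.cos(math.pi * t / period)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)
```

For instance, with hypothetical values `lr_max=1e-3` and `lr_min=1e-5`, the rate begins at 1e-3, passes the midpoint of the two values at epoch 100, and approaches 1e-5 near epoch 200. Deep learning frameworks typically ship equivalent schedulers, so a hand-written version like this is mainly useful for inspecting the curve.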
[0069] Figure 1 illustrates a comparison between the present invention and other micro-expression recognition methods. Generally, the method for classifying micro-expression video emotions involves first manually cutting the original video into multiple video segments containing micro-expression instances, and then using a video classification model to identify the emotion in each segment. The present invention instead applies a temporal emotion detection model directly to the entire untrimmed video. This not only locates multiple emotions in the video in time but also simultaneously identifies the category of each emotion.
[0070] Figure 2 presents the distribution of the seven emotions among micro-expressions and macro-expressions in the micro-expression dataset constructed in this invention, using two radar charts that compare the distribution by quantity and by proportion. The left radar chart displays the distribution by quantity: the blue portion represents macro-expressions, and the red portion represents micro-expressions. The seven emotion categories (fear, disgust, anger, other, surprise, sadness, and happiness) are arranged evenly around the axes. Compared with macro-expressions, the number of micro-expressions is relatively small. This chart allows a direct comparison of the absolute difference between the numbers of macro-expressions and micro-expressions for each emotion. The right radar chart displays the distribution by proportion: again, the blue portion represents macro-expressions, and the red portion represents micro-expressions. This chart presents the relative proportion of macro-expressions to micro-expressions for each emotion in the overall data, facilitating the analysis of the distribution characteristics of different emotions in the dataset.
[0071] Figure 3 shows the distribution of the duration of facial expression segments in the micro-expression dataset collected by this invention. The horizontal axis is the segment length in frames, ranging from 0 to 120 frames, which corresponds to actual durations of 0 to 4 seconds at a video frame rate of 30 frames/second. The vertical axis is the frequency of expression segments of each length in the dataset. By the time definitions of micro-expressions and macro-expressions, micro-expressions last no more than 0.5 seconds, i.e., segment lengths of no more than 15 frames; macro-expressions last from 0.5 to 4 seconds, i.e., segment lengths of 15 to 120 frames. The frame length ranges of both types are marked in the graph, and the frequency of segments differs across the length ranges. The graph thus presents the distribution pattern of segment lengths, which supports distinguishing micro-expressions from macro-expressions and analyzing the dataset composition. The high-frequency ranges it identifies also provide a data basis for subsequent targeted extraction of typical facial expression features and optimization of micro-expression detection algorithms.
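The frame-to-duration arithmetic above can be illustrated with a short sketch. The thresholds (0.5 seconds, 4 seconds, 30 frames/second) come from the text; the function name, the handling of the shared 15-frame boundary (counted as a micro-expression, following "no more than 0.5 seconds"), and the `"out_of_range"` label are assumptions made for illustration.

```python
FPS = 30  # video frame rate stated in the text


def expression_type(n_frames, fps=FPS):
    """Classify a segment as micro- or macro-expression from its length in frames."""
    if n_frames <= 0:
        raise ValueError("segment length must be positive")
    seconds = n_frames / fps
    if seconds <= 0.5:   # at most 15 frames at 30 frames/second
        return "micro"
    if seconds <= 4.0:   # at most 120 frames at 30 frames/second
        return "macro"
    return "out_of_range"  # longer than any segment shown in Figure 3
```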
[0072] Figure 4 is a schematic diagram of the sliding window mechanism used in this invention. For a video segment, a sliding window is used to detect emotion instances. The window size is set to s frames, chosen according to the duration characteristics of micro-expressions so that a complete micro-expression movement is captured. The window slides over the image frame sequence from left to right with a fixed step size. The step size is chosen to balance the real-time and accuracy requirements of detection: adjacent windows overlap to some degree so that emotion instances are not missed, while processing efficiency is still improved.
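The sliding window mechanism described above can be sketched as follows. The text does not give numeric values for the window size s or the step size, so the values in the example call are hypothetical. The function name and the extra tail window that covers the last frames are also assumptions.

```python
def sliding_windows(n_frames, window, stride):
    """Return (start, end) frame index pairs, end exclusive, with overlapping windows."""
    if not 0 < stride < window:
        raise ValueError("stride must be positive and smaller than window so windows overlap")
    starts = list(range(0, max(n_frames - window, 0) + 1, stride))
    if n_frames > window and starts[-1] + window < n_frames:
        starts.append(n_frames - window)  # tail window so the final frames are covered (assumption)
    return [(s, min(s + window, n_frames)) for s in starts]


# Hypothetical sizes: a 16-frame window moved 8 frames at a time over a 40-frame clip.
windows = sliding_windows(40, window=16, stride=8)
```

Requiring the stride to be smaller than the window enforces the overlap between adjacent windows that the description relies on to avoid missed instances.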
[0073] Figure 5 illustrates the complete frame processing workflow for facial expression segments in the micro-expression detection process of this invention. The top of the image shows a series of consecutive image frames. The 3D region highlighted in green is a specific example of a facial expression segment, marked with three keyframes: #0138 as the start frame, where the micro-expression begins; #0143 as the peak frame, where the micro-expression intensity reaches its maximum; and #0148 as the end frame, where the micro-expression ends. These three keyframes define the time range of the segment. Below the keyframe annotations, the ground truth section gives the reference positions of the start, peak, and end frames, against which the accuracy of the detection results is measured. The labeling section shows frame-level labels: dark green rectangles are frames that belong to a facial expression segment, and light green rectangles are frames that do not. The probability section is a segment-level probability histogram with one bar per frame, where the height of the bar is the predicted probability that the frame belongs to an expression segment. The post-processing section refines the probability results, correcting potential errors or unreasonable ranges to further improve detection accuracy.
The final prediction result, drawn as black lines, is the expression segment range determined after these processing steps. The diagram thus presents the key stages of micro-expression detection, from frame annotation and label definition through probability calculation and post-processing to the final prediction, and shows how the stages work together to determine the temporal location of expression segments.
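The step from per-frame probabilities to a final segment range can be sketched as below. The text does not specify the post-processing rules, so everything here is an assumption chosen for illustration: thresholding at 0.5, merging segments separated by a short gap, and discarding very short segments, along with the function and parameter names.

```python
def probs_to_segments(probs, threshold=0.5, max_gap=1, min_len=2):
    """Turn per-frame probabilities into (start, end) segments, end inclusive."""
    # 1) Threshold: collect runs of frames whose probability reaches the threshold.
    segments, start = [], None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            segments.append([start, i - 1])
            start = None
    if start is not None:
        segments.append([start, len(probs) - 1])
    # 2) Merge runs separated by at most `max_gap` low-probability frames.
    merged = []
    for seg in segments:
        if merged and seg[0] - merged[-1][1] - 1 <= max_gap:
            merged[-1][1] = seg[1]
        else:
            merged.append(seg)
    # 3) Drop runs shorter than `min_len` frames.
    return [(s, e) for s, e in merged if e - s + 1 >= min_len]


example = probs_to_segments([0.1, 0.8, 0.9, 0.3, 0.7, 0.8, 0.1, 0.1, 0.6, 0.1])
```

In the example, the one-frame dip at index 3 is bridged and the isolated frame at index 8 is discarded, leaving a single segment from frame 1 to frame 5.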
[0074] Figure 6 compares the confusion matrices of the micro-expression detection model under different attention mechanism configurations. The main framework of the model is as follows: first, a self-attention mechanism extracts spatiotemporal features from the videos of each modality; then, a cross-attention mechanism extracts inter-modal difference information. From left to right, the diagram shows the emotion prediction confusion matrices of the complete model (with both self-attention and cross-attention), the model without the self-attention mechanism, and the model without the cross-attention mechanism. The horizontal axis is the predicted emotion, and the vertical axis is the true emotion. The emotion categories are anger, disgust, fear, happiness, sadness, surprise, and others. The comparison shows that the complete model, which combines the spatiotemporal feature extraction of the self-attention mechanism with the inter-modal difference information mined by the cross-attention mechanism, has an advantage in predicting multiple emotions. This demonstrates the role of the two attention mechanisms in improving emotion recognition accuracy and provides a visual basis for evaluating their effectiveness in the model framework of this invention.
[0075] The specific embodiments of the present invention will be further described below with reference to the accompanying drawings:
[0076] To further clarify the purpose, technical solution, and core advantages of this invention, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. Please note that the specific embodiments described below are for illustrative purposes only and are not intended to limit the invention. Furthermore, the technical features involved in different implementation schemes in the embodiments can be combined with each other as long as they do not conflict with each other.
[0077] Step 1: Referring to Figure 2 and Figure 3, in the dataset construction phase this invention uses an active 3D camera, the Intel RealSense LiDAR L515 ToF, to acquire active infrared light intensity maps, depth maps, and RGB images. Multiple scenarios were constructed, including low light (e.g., a dark room at <10 lux) and normal indoor lighting (approximately 300-500 lux). Subjects of different ages, genders, and skin colors were invited to watch emotionally impactful videos to naturally induce micro-expressions, ensuring data diversity. The annotation team analyzed the videos frame by frame, labeling micro-expression categories (anger, disgust, fear, happiness, surprise, sadness, others) and temporal boundaries (start frame, peak frame, end frame). During preprocessing, histogram equalization and other enhancement processes were applied to the active infrared light intensity maps, median filtering was used to denoise the depth maps, white balance correction was performed on the RGB images, and the multimodal data were aligned in time by analyzing timestamps. Optical flow was used to stabilize the video sequences, reducing positional deviations caused by shaking or jitter.
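Two of the preprocessing operations above, timestamp-based alignment and histogram equalization, can be sketched with NumPy. This is an illustration under stated assumptions rather than the invention's pipeline: nearest-timestamp matching is one possible alignment rule, the infrared map is assumed to be 8-bit, and the function names are hypothetical.

```python
import numpy as np


def align_nearest(ref_ts, other_ts):
    """For each reference timestamp, index of the nearest timestamp in sorted `other_ts`."""
    ref_ts = np.asarray(ref_ts, dtype=float)
    other_ts = np.asarray(other_ts, dtype=float)  # must be sorted, length >= 2
    right = np.clip(np.searchsorted(other_ts, ref_ts), 1, len(other_ts) - 1)
    left = right - 1
    choose_right = (other_ts[right] - ref_ts) < (ref_ts - other_ts[left])
    return np.where(choose_right, right, left)


def equalize_hist(img):
    """Histogram equalization of an 8-bit single-channel image (assumed uint8)."""
    cdf = np.bincount(img.ravel(), minlength=256).cumsum()
    cdf_min = cdf[cdf > 0][0]            # cumulative count at the darkest level present
    denom = cdf[-1] - cdf_min
    if denom == 0:                       # constant image: nothing to stretch
        return img.copy()
    lut = np.round((cdf - cdf_min) / denom * 255).clip(0, 255).astype(np.uint8)
    return lut[img]


# Toy data: RGB timestamps as reference, infrared timestamps slightly offset (milliseconds).
ir_index = align_nearest([0, 33, 66, 100], [1, 30, 70, 99])
ir_eq = equalize_hist(np.array([[50, 50], [100, 200]], dtype=np.uint8))
```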
[0078] Step 2: Referring to Figure 1 and Figure 4, the spatiotemporal detection model is trained. The sliding window mechanism of Figure 4 sets the window size to s frames according to the duration characteristics of micro-expressions and slides with a fixed step size (balancing real-time detection and accuracy, with adjacent windows overlapping to avoid missed detections) to extract continuous frame subsequences. The RGB modality branch takes 3-channel input and extracts spatiotemporal features of appearance texture through 3D convolution; the infrared and depth modality branches each take 1-channel input and undergo similar processing to extract spatiotemporal features of intensity changes and facial 3D structure displacement, respectively. The output features of the branches are concatenated to form the initial fused features.
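The per-modality 3D convolution and channel concatenation can be sketched in NumPy as follows. This is a simplified illustration: each branch here has a single convolution layer with random weights and no batch normalization, and the 3x3x3 kernel, the clip dimensions, and the helper names are assumptions. The channel counts (64 for RGB, 32 for infrared and depth) follow the values given in claim 4.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv3d_relu(x, weight):
    """3D convolution (stride 1, zero 'same' padding, odd kernel) followed by ReLU.

    x: (C_in, T, H, W); weight: (C_out, C_in, kt, kh, kw); returns (C_out, T, H, W).
    """
    kt, kh, kw = weight.shape[2:]
    xp = np.pad(x, [(0, 0), (kt // 2,) * 2, (kh // 2,) * 2, (kw // 2,) * 2])
    win = sliding_window_view(xp, (kt, kh, kw), axis=(1, 2, 3))  # (C_in, T, H, W, kt, kh, kw)
    out = np.einsum("cthwijk,ocijk->othw", win, weight)
    return np.maximum(out, 0.0)


rng = np.random.default_rng(0)
T, H, W = 8, 8, 8  # hypothetical clip size: 8 frames of 8x8 pixels


def branch(x, c_out):
    """One modality branch with random (untrained) weights, for shape illustration only."""
    w = rng.standard_normal((c_out, x.shape[0], 3, 3, 3)) * 0.1
    return conv3d_relu(x, w)


f_rgb = branch(rng.standard_normal((3, T, H, W)), 64)    # 3-channel RGB input
f_ir = branch(rng.standard_normal((1, T, H, W)), 32)     # 1-channel infrared input
f_depth = branch(rng.standard_normal((1, T, H, W)), 32)  # 1-channel depth input
fused = np.concatenate([f_rgb, f_ir, f_depth], axis=0)   # channel concatenation
```

With these channel counts the concatenated tensor has 64 + 32 + 32 = 128 channels while the temporal and spatial sizes are unchanged.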
[0079] Step 3: Referring to Figure 5, the model employs a spatiotemporal self-attention mechanism. The time dimension is divided into non-overlapping segments of a given length, and spatiotemporal positional encoding is added to the frames within each segment, where k is the dimension index, D is the feature dimension, and t, x, and y are the time segment length and the width and height of the video, respectively. A multi-head attention mechanism jointly models the spatiotemporal features of the sequence, and the encoder outputs features with temporal dynamic priorities that highlight the key frames of micro-expressions (such as the start frame and peak frame). A cross-modal cross-attention mechanism is also designed: the RGB modality features serve as the base query, and the infrared modality features and depth modality features serve as key-value pairs. After linear transformations unify the dimensions, the inter-modal correlation weights are calculated, with a scaling factor introduced to mitigate gradient vanishing. A gating parameter dynamically balances the cross-modal fused features against the intrinsic modality features. The fused features are then compressed using global average pooling, preserving key spatiotemporal information.
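The cross-modal cross-attention, gating, and pooling described above can be sketched as follows. The exact formulas are not reproduced in this text, so the sketch assumes standard scaled dot-product attention, treats infrared and depth tokens as one joint key-value set, and uses a scalar gate `g` in the form `g * cross + (1 - g) * rgb`. All names, shapes, and the random weights are hypothetical.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def cross_modal_fusion(f_rgb, f_ir, f_depth, Wq, Wk, Wv, gate):
    """RGB tokens query infrared and depth tokens; returns (pooled feature, attention)."""
    q = f_rgb @ Wq                                 # (N, d) queries from RGB
    ctx = np.concatenate([f_ir, f_depth], axis=0)  # (2N, C) infrared and depth tokens
    k, v = ctx @ Wk, ctx @ Wv                      # linear maps unify dimensions to d
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # scaled inter-modal weights
    cross = attn @ v                               # cross-modal features
    fused = gate * cross + (1.0 - gate) * q        # gated balance (assumed form)
    return fused.mean(axis=0), attn                # global average pooling over tokens


rng = np.random.default_rng(0)
N, d = 6, 16  # hypothetical token count and attention dimension
f_rgb = rng.standard_normal((N, 64))
f_ir = rng.standard_normal((N, 32))
f_depth = rng.standard_normal((N, 32))
Wq = rng.standard_normal((64, d))
Wk = rng.standard_normal((32, d))
Wv = rng.standard_normal((32, d))
pooled, attn = cross_modal_fusion(f_rgb, f_ir, f_depth, Wq, Wk, Wv, gate=0.5)
```

Setting `gate` to 0 reduces the output to the pooled RGB projection alone, which is one way to see how the gate trades off intrinsic and cross-modal features.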
[0080] Step 4: The fused features are input into a dual-branch network. One branch passes through two fully connected layers followed by two regression layers and outputs the normalized micro-expression start frame and end frame, with a smooth L1 loss as the localization loss. The other branch passes through three fully connected layers, with the final SoftMax layer outputting a probability distribution over seven emotion categories; a weighted cross-entropy loss is used to address class imbalance. The model is trained end-to-end with a joint loss function whose two terms are the classification loss and the localization loss. The optimizer is AdamW (weight decay 0.01), with an initial learning rate of 1. Cosine annealing learning rate scheduling is used (200 rounds per cycle, minimum learning rate 1). After the loss is computed during training, the backpropagation algorithm is used to continuously reduce it and improve the accuracy of localization and recognition.
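The two loss terms and their combination can be sketched as below. The joint loss formula is not reproduced in this text, so the sum `l_cls + lam * l_loc` with a balancing factor `lam` is an assumption, as are the weight normalization, the mapping from normalized positions to frame numbers, and all function names. The smooth L1 form and the reciprocal-of-sample-count weighting follow their standard definitions.

```python
import numpy as np


def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss: quadratic for small errors, linear for large ones."""
    d = np.abs(np.asarray(pred, float) - np.asarray(target, float))
    return np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta).mean()


def inverse_frequency_weights(class_counts):
    """Class weights proportional to the reciprocal of each class's sample count."""
    w = 1.0 / np.asarray(class_counts, dtype=float)
    return w / w.sum() * len(w)  # rescaling so weights average to 1 is an assumption


def weighted_cross_entropy(probs, label, weights):
    """Weighted cross-entropy for one sample, given SoftMax output `probs`."""
    return -weights[label] * np.log(probs[label] + 1e-12)


def to_frame(norm_pos, n_frames):
    """Map a normalized position in [0, 1] to a frame index (assumed mapping)."""
    return int(round(norm_pos * (n_frames - 1)))


def joint_loss(probs, label, weights, pred_bounds, true_bounds, lam=1.0):
    """Classification loss plus `lam` times localization loss (assumed combination)."""
    l_cls = weighted_cross_entropy(probs, label, weights)
    l_loc = smooth_l1(pred_bounds, true_bounds)
    return l_cls + lam * l_loc


# Toy example with two classes and hypothetical sample counts.
w = inverse_frequency_weights([10, 40])
loss = joint_loss(np.array([0.5, 0.5]), 0, w, [0.0, 3.0], [0.5, 1.0])
```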
[0081] The multimodal dataset constructed in Step 1 (including active infrared, depth, and RGB modalities) forms the basis for training the spatiotemporal detection model in Step 2. Step 2's "training the spatiotemporal detection model using the micro-expression dataset" directly relies on the multimodal data collected and preprocessed in Step 1. The dataset's illumination robustness and diversity determine the effectiveness of model training. The initial fusion features output in Step 2 are the input to the spatiotemporal self-attention mechanism in Step 3. Step 3's "inputting the initial fusion features into the spatiotemporal self-attention mechanism module" requires initial fusion features formed by channel concatenation of the spatiotemporal features extracted from each modality through 3D convolution in Step 2, providing a multimodal information foundation for subsequent spatiotemporal modeling. The compressed features generated in Step 3 are the input to the dual-branch network in Step 4. The "fusion features" in Step 4's "inputting the fusion features into the dual-branch network" refer to the features processed in Step 3 through spatiotemporal self-attention and cross-modal cross-attention. The key spatiotemporal information retained directly affects the accuracy of the dual-branch network in micro-expression localization and classification.
[0082] In step one (dataset construction): an Intel RealSense LiDAR L515 ToF camera is used to simultaneously acquire active infrared light intensity maps, depth maps, and RGB images. Its 860 nm near-infrared light and ToF technology respectively achieve resistance to ambient light interference and facial 3D structure measurement, thus constructing a multimodal dataset; the depth map is denoised and the time sequence is aligned to the camera characteristics during preprocessing.
[0083] Step 2 (Model Training): The infrared and depth modality branches are adapted directly to the ToF output data to extract intensity change and 3D structural displacement features, which are then concatenated with the RGB features to form the initial fused features.
[0084] Steps 3 and 4 (Modeling and Detection): The cross-modal attention mechanism enhances the interaction between the ToF modalities and RGB: infrared weights are increased in low light, and depth localization is used to dynamically adjust spatial conditions. The dual-branch network improves detection robustness in complex scenes with the help of the ToF data.
[0085] This invention utilizes the Intel RealSense LiDAR L515 active 3D imaging device to simultaneously acquire active infrared light intensity maps, depth maps, and RGB images, constructing a cross-modal micro-expression dataset. The active infrared modality uses an 860nm near-infrared light source to eliminate the influence of ambient light and preserve stable facial thermal radiation features; the depth modality acquires facial 3D coordinate information through a structured light camera, recording the depth changes in muscle deformation during micro-expression; and the RGB modality provides traditional facial texture features. The complementarity of these three modalities effectively improves the data's adaptability to complex lighting conditions. Even under varying lighting intensities, the multi-modal fusion features can still fully preserve the dynamic details of facial expressions, providing crucial data support for the development of lighting-robust algorithms.
[0086] The collection of facial data in this application complies with the Personal Information Protection Law, and the data subject consented to the use of the data for emotion analysis and micro-expression recognition.
[0087] This invention proposes a spatiotemporal joint detection and classification model that does not require video trimming, achieving unified modeling from raw video input to emotion category output. This model automatically focuses on time intervals with significant facial expression changes within a video sequence through a temporal attention mechanism. Simultaneously, it combines a multi-stage localization and regression network to accurately predict the start and end frames of micro-expressions, jointly optimizing segment localization and classification losses. Compared to the traditional two-stage "localization then recognition" framework, this method achieves joint optimization of localization and recognition, avoiding error accumulation in intermediate stages, and is particularly suitable for processing long raw videos containing complex emotional dynamics.
Claims
1. A micro-expression recognition method based on active three-dimensional imaging, characterized in that it includes the following steps:
Step 1: Use active 3D cameras to acquire images and construct a micro-expression dataset with illumination robustness and multimodal features, the active 3D cameras including time-of-flight cameras and structured light cameras;
Step 2: Input the micro-expression dataset into a 3D convolutional network to obtain initial fused features;
Step 3: Input the initial fused features into the spatiotemporal self-attention mechanism module, and use the multi-head attention mechanism to perform joint spatiotemporal modeling of the feature sequence to capture the long-range temporal dependence and dynamic change regions of micro-expressions; introduce the cross-attention mechanism to obtain fused features, and compress the fused features into compressed features through global average pooling, preserving key spatiotemporal information;
Step 4: Input the compressed features into the dual-branch network to simultaneously achieve temporal localization and emotion classification of micro-expression segments;
In step 4, one branch passes through two fully connected layers followed by two regression layers to output normalized micro-expression start and end frames, and uses a smooth L1 loss function to calculate the localization loss; the other branch passes through three fully connected layers, with the last layer using the SoftMax function to output a probability distribution over seven emotion categories, and uses a weighted cross-entropy loss function to calculate the classification loss to address the class imbalance problem; the localization loss and classification loss are jointly optimized, and during training the loss is continuously reduced through the backpropagation algorithm to improve the localization accuracy of micro-expression segments and the recognition accuracy of emotion categories;
Step 4 specifically involves: joint detection and classification are achieved by inputting the fused features into a dual-branch network. One branch of the dual-branch network uses two fully connected layers, followed by two regression layers, to output a normalized start frame and end frame in the range [0,1], which are mapped to actual frame numbers; a smooth L1 loss is used as the localization loss to handle the ambiguity of micro-expression boundaries, expressed in terms of the frame difference and indicator functions. The other branch of the dual-branch network passes through three fully connected layers, with the final SoftMax layer outputting a probability distribution over seven emotion categories. To address the class imbalance in the micro-expression dataset, where the number of samples differs significantly between emotion categories, a weighted cross-entropy loss is employed. The weight of each emotion category, which sets its gradient contribution during training, is calculated as the reciprocal of the number of samples in that category.
2. The micro-expression recognition method based on active three-dimensional imaging according to claim 1, characterized in that, in step 1, during the dataset construction phase, an active 3D camera, the Intel RealSense LiDAR L515 ToF, is used to acquire active infrared light intensity maps, depth maps, and RGB images. Multiple scenarios are constructed (a dark room with low illumination of <10 lux and normal indoor lighting of 300-500 lux), and participants of different ages, genders, and skin colors are invited to watch emotionally impactful videos to naturally induce micro-expressions, ensuring data diversity. Frame-by-frame video analysis is performed, and seven categories of micro-expressions and their time boundaries are labeled. During preprocessing, histogram equalization enhancement is performed on the active infrared light intensity map, median filtering is used to denoise the depth map, white balance correction is performed on the RGB image, and multimodal data temporal alignment is achieved by analyzing timestamps. The video sequence is stabilized based on optical flow to reduce positional deviations caused by shaking or jitter.
3. The micro-expression recognition method based on active three-dimensional imaging according to claim 2, characterized in that, in step 2: the window size is set to s frames according to the duration characteristics of micro-expressions, and the window slides with a fixed step size to extract continuous frame subsequences of the active infrared light intensity map, depth map, and RGB image, which are used as input to the 3D convolutional network branches to extract the spatiotemporal features of each modality; the RGB modality branch takes the RGB image as 3-channel input and extracts the spatiotemporal features of appearance texture through a 3D convolutional network; the infrared modality branch takes the active infrared light intensity map as 1-channel input and extracts the spatiotemporal features of intensity changes through a 3D convolutional network; the depth modality branch takes the depth map as 1-channel input and extracts the spatiotemporal features of facial 3D structural displacement. The output features of each branch are concatenated along the channel dimension to form the initial fused features.
4. The micro-expression recognition method based on active three-dimensional imaging according to claim 3, characterized in that step 2 specifically involves: designing independent 3D convolutional networks for the characteristics of the different modalities. The first layer of the RGB modality branch is a 3D convolutional layer with 64 convolutional kernels, followed by a batch normalization layer and a rectified linear unit (ReLU) activation function, to extract the spatiotemporal features of appearance texture. The first 3D convolutional layer of the infrared modality branch is configured with 32 convolutional kernels, followed by a batch normalization layer and a ReLU activation function, to extract the spatiotemporal features of infrared changes and facial 3D depth displacement. The first 3D convolutional layer of the depth modality branch is configured with 32 convolutional kernels, followed by a batch normalization layer and a ReLU activation function, to extract the spatiotemporal features of intensity changes and facial 3D structure displacement. After each branch undergoes two layers of 3D convolution, the output feature tensor size is unified, where T is the sequence length and C is the number of channels, with C=64 for the RGB branch and C=32 for the infrared and depth branches. The three types of features are concatenated along the channel dimension to form the initial fused features.
5. The micro-expression recognition method based on active three-dimensional imaging according to claim 4, characterized in that step 3 specifically involves: a spatiotemporal self-attention mechanism module is employed to capture the long-range temporal dependencies and dynamically changing regions of the initial fused features. First, spatiotemporal positional encoding is added to the frames within each segment of the initial fused features. The encoding uses a sine function, where k is the dimension index, D is the feature dimension, and t, x, and y are the time segment length and the width and height of the video, respectively; the ratio of the dimension index k to the total dimension D is used as the exponent, so that different dimensions of the positional encoding have different frequencies, thereby distinguishing relative positional information in time and space in the spatiotemporal self-attention mechanism and capturing the long-range temporal dependence and spatial structural features of micro-expressions. Then, a multi-head attention mechanism is used to perform joint spatiotemporal modeling on the position-encoded feature sequence, outputting features that include temporal dynamic priorities. A cross-modal attention mechanism is designed to achieve inter-modal information complementarity, using the RGB modality features as the base query and the infrared modality features and depth modality features as key-value pairs, with dimensions unified through linear transformation. Modal fusion is achieved by calculating inter-modal correlation weights using a learnable parameter matrix, where the weight indicates the degree of dependence of the RGB modality on the infrared modality; in low-light scenes, this weight is automatically increased to enhance the compensation of appearance noise by infrared features, and a dimension scaling factor is applied. Finally, a gating parameter is introduced to dynamically balance the cross-modal fused features and the intrinsic modality features, giving the fused features. The fused features are then compressed using global average pooling, preserving key spatiotemporal information.
6. The micro-expression recognition method based on active three-dimensional imaging according to claim 5, characterized in that the multi-head self-attention projects the feature sequence into query Q, key K, and value V, and calculates the attention weights by scaled dot product, where Q, K, and V are generated by linear mapping of the input features and a dimension scaling factor is used; the activation function is GELU, followed by residual connections and layer normalization to prevent gradient vanishing; after encoder processing, the output features contain temporal dynamic priorities, and the regions with high attention weights correspond to micro-expression keyframes.