A neural network architecture and system for micro-expression recognition in the elderly based on a decoupled state-space model

By using a neural network architecture based on a decoupled state-space model, the problems of feature coupling and poor robustness in micro-expression recognition of the elderly are solved, and efficient and accurate micro-expression recognition is achieved.

CN122244919A · Pending · Publication Date: 2026-06-19 · SHANDONG XIEHE UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHANDONG XIEHE UNIV
Filing Date
2026-03-12
Publication Date
2026-06-19

Abstract

This application relates to a neural network architecture and system for micro-expression recognition in the elderly based on a decoupled state-space model. The system preprocesses acquired micro-expression video sequences and elderly attribute features; extracts spatial, temporal, and individual features from the preprocessing results using three parallel variational autoencoders; concatenates the decoupled features and inputs them into forward and backward Mamba architectures for temporal modeling; then, it fuses the forward and backward state features through an attention mechanism; generates a parameter modulation matrix based on the elderly attribute features and adaptively adjusts the fused state features; finally, it fuses the adaptive features through temporal attention and global pooling, and inputs the fused features into a classification network to obtain the probability distribution of micro-expression categories. This system significantly improves the accuracy and robustness of micro-expression recognition in the elderly population, taking into account the unique characteristics of their facial features.

Description

Technical Field

[0001] This invention belongs to the field of micro-expression recognition technology, and relates to a neural network architecture and system for micro-expression recognition in the elderly based on a decoupled state-space model.

Background Technology

[0002] Micro-expressions, as an important carrier of human emotional expression, can reflect an individual's true psychological state and are of great value in the early identification of mental health problems such as depression and cognitive impairment in the elderly.

[0003] Micro-expression recognition technology has evolved from traditional computer vision methods to deep learning methods. Traditional methods are mainly based on manual feature extraction, such as LBP-TOP and HOG-TOP, which have poor robustness in complex scenarios. In recent years, deep learning methods, especially the application of convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have significantly improved the performance of micro-expression recognition.

[0004] Despite the progress made in existing micro-expression recognition technology, the following limitations still exist when applied to the elderly population:

[0005] (1) Lack of targeted design: Existing algorithms are mostly trained based on data from young people, without considering the special characteristics of facial features of the elderly. When directly applied to the elderly population, their performance drops significantly.

[0006] (2) Feature coupling problem: Spatial features, temporal features and individual features are coupled with each other, making it difficult to analyze and optimize them separately, which affects the interpretability and generalization ability of the model.

[0007] (3) Low efficiency of time series modeling: Traditional time series modeling methods are inefficient when processing long sequences and cannot meet real-time requirements.

[0008] (4) Insufficient individual self-adaptation ability: There is a lack of effective mechanisms for handling individual differences, and it is impossible to make dynamic adjustments based on the individual characteristics of the elderly.

[0009] (5) Poor robustness: Existing methods are not robust enough to factors such as illumination changes, pose changes, and background interference, and their performance is unstable in real application scenarios.

Summary of the Invention

[0010] To address the problems existing in the aforementioned traditional systems, this invention proposes a neural network architecture and system for recognizing micro-expressions in the elderly based on a decoupled state-space model. This system significantly improves the accuracy and robustness of recognizing micro-expressions in the elderly population through innovative mechanisms such as feature decoupling, efficient temporal modeling, and adaptive parameters for the elderly.

[0011] To achieve the above objectives, the embodiments of the present invention adopt the following technical solutions. In one aspect, a neural network architecture and system for recognizing micro-expressions in the elderly based on a decoupled state-space model is provided, including:

A data preprocessing unit, used to preprocess the acquired micro-expression video sequences and the attribute features of the elderly;

A feature decoupling unit, used to input the preprocessing results into the feature decoupling module and extract spatial features, temporal features, and individual features respectively through three parallel encoders based on the variational autoencoder architecture;

A state feature extraction unit, used to concatenate the extracted spatial feature vector, temporal feature vector, and individual feature vector and input them into the state space modeling module, which uses the Mamba architecture to perform temporal modeling on the concatenated features from the past to the future and from the future to the past, and fuses the resulting forward state features and backward state features to obtain the fused state features;

A parameter adaptation unit, used to input the fused state features and the elderly attribute features into the elderly parameter adaptation module and obtain adaptive features through feature mapping, adaptive transformation, and residual connection processing;

A micro-expression recognition unit, used to input the adaptive features into the micro-expression classification module, extract weighted temporal features through a temporal attention mechanism, extract static temporal features through a global pooling operation, fuse the weighted temporal features and static temporal features, and then process them with a classification network to obtain the micro-expression classification result.

[0012] One of the above technical solutions has the following advantages and beneficial effects: The aforementioned neural network architecture and system for micro-expression recognition in the elderly, based on a decoupled state-space model, preprocesses the acquired micro-expression video sequences and elderly attribute features. It then extracts spatial, temporal, and individual features from the preprocessing results using three parallel variational autoencoders. The decoupled features are concatenated and input into forward and backward Mamba architectures for temporal modeling. An attention mechanism is then used to fuse the forward and backward state features. A parameter modulation matrix is generated based on the elderly attribute features, and the fused state features are adaptively adjusted. Finally, the adaptive features are fused using temporal attention and global pooling, and input into a classification network to obtain the probability distribution of micro-expression categories. This system significantly improves the accuracy and robustness of micro-expression recognition in the elderly population, taking into account the unique characteristics of their facial features.

Description of the Drawings

[0013] To more clearly illustrate the technical solutions in the embodiments of this application or the conventional technology, the drawings used in the description of the embodiments or the conventional technology will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0014] Figure 1 is an architecture diagram of the elderly micro-expression recognition model based on a decoupled state-space model in one embodiment; Figure 2 is a schematic diagram of the structure of the state fusion module in one embodiment.

Detailed Description

[0015] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0016] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.

[0017] It should be noted that, in this document, a reference to an "embodiment" means that a particular feature, structure, or characteristic described in connection with that embodiment may be included in at least one embodiment of the invention. The appearance of this phrase in various places throughout the specification does not necessarily refer to the same embodiment, nor does it denote a separate or alternative embodiment that is mutually exclusive with other embodiments. Those skilled in the art will understand that the embodiments described herein can be combined with other embodiments. The term "and/or" as used herein refers to any and all possible combinations of one or more of the associated listed items.

[0018] The embodiments of the present invention will now be described in detail with reference to the accompanying drawings.

[0019] In one embodiment, as shown in Figure 1, a neural network architecture and system for recognizing micro-expressions in the elderly based on a decoupled state-space model is provided, specifically including: The data preprocessing unit, which is used to preprocess the acquired micro-expression video sequences and the attribute features of the elderly.

[0020] Specifically, the input preprocessing module is the entry point of the entire system. It receives the raw input and standardizes it so that subsequent modules receive consistent, high-quality input data.

[0021] The module receives a micro-expression video sequence and elderly attribute features as input and standardizes them. The micro-expression video sequence is represented as a four-dimensional tensor $X \in \mathbb{R}^{T \times H \times W \times 3}$, where T is the number of time steps (frames), H and W are the image height and width, respectively, and 3 is the number of RGB color channels. The preprocessing steps for the micro-expression video sequence X include: (1) Frame rate unification: The input video is uniformly resampled to 30 fps to ensure consistency in the time dimension.

[0022] (2) Size standardization: Adjust the video frame size to 128 × 128 pixels to form a four-dimensional tensor of T × 128 × 128 × 3.

[0023] (3) Facial region cropping: The MTCNN (Multi-Task Cascaded Convolutional Networks) face detection algorithm is used to locate and crop the facial region and remove background interference.

[0024] (4) Illumination normalization: The CLAHE (Contrast Limited Adaptive Histogram Equalization) algorithm is used to normalize the illumination and reduce the impact of illumination changes.

[0025] (5) Pixel value standardization: Normalize the pixel values from the range of [0, 255] to the range of [-1, 1] to accelerate the convergence of network training.
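The size standardization and pixel value standardization steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the function names `resize_nearest` and `normalize_pixels` are hypothetical, nearest-neighbour sampling is an assumed stand-in for whatever interpolation is actually used, and face cropping (MTCNN) and CLAHE are omitted because they depend on external libraries.

```python
import numpy as np

def resize_nearest(video, size=128):
    """Resize each frame of a (T, H, W, 3) video to (size, size) by nearest-neighbour sampling."""
    _, h, w, _ = video.shape
    rows = np.arange(size) * h // size   # source row index for each output row
    cols = np.arange(size) * w // size   # source column index for each output column
    return video[:, rows][:, :, cols]

def normalize_pixels(video):
    """Map pixel values in [0, 255] to floats in [-1, 1]."""
    return video.astype(np.float32) / 127.5 - 1.0

# Random frames stand in for a real clip (assumption for illustration).
rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(16, 240, 320, 3), dtype=np.uint8)
x = normalize_pixels(resize_nearest(raw))   # tensor of shape (T, 128, 128, 3)
```

The result has the T × 128 × 128 × 3 layout described in step (2), with values bounded by the [-1, 1] range of step (5).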

[0026] The attribute features of the elderly are represented as a $d_p$-dimensional vector $P_{elderly} \in \mathbb{R}^{d_p}$, containing individual information such as age, gender, facial feature descriptors, and health status. The preprocessing steps for the attribute features of the elderly include: (1) Numerical feature standardization: Z-score standardization is applied to numerical features such as age, height, and weight so that each has a mean of 0 and a standard deviation of 1.

[0027] (2) Category feature encoding: One-hot encoding or embedding is applied to categorical features such as gender, race, and health status.

[0028] (3) Facial feature descriptor extraction: Facial feature descriptors, such as facial key point coordinates, wrinkle density, and skin elasticity, are extracted from facial images as supplementary attribute features.

[0029] (4) Feature concatenation: The processed features are concatenated into a $d_p$-dimensional vector, which is used as the input of the individual feature encoder.
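The attribute preprocessing steps above (standardization, category encoding, and concatenation) can be illustrated with a short NumPy sketch. The helper names `zscore` and `one_hot` and the toy cohort values are assumptions made for illustration; facial feature descriptor extraction is omitted.

```python
import numpy as np

def zscore(column):
    """Standardize a numerical column to zero mean and unit standard deviation."""
    column = np.asarray(column, dtype=np.float64)
    std = column.std()
    return (column - column.mean()) / (std if std > 0 else 1.0)

def one_hot(labels, categories):
    """One-hot encode a list of labels against a fixed category list."""
    index = {c: i for i, c in enumerate(categories)}
    out = np.zeros((len(labels), len(categories)))
    out[np.arange(len(labels)), [index[l] for l in labels]] = 1.0
    return out

# Toy cohort with made-up values, for illustration only.
age = [68, 74, 81, 90]
gender = ["f", "m", "f", "m"]
p_elderly = np.concatenate(
    [zscore(age)[:, None], one_hot(gender, ["f", "m"])], axis=1
)  # shape (4, 3): one d_p-dimensional row per person, here d_p = 3
```

Note that Z-score standardization fixes the mean and standard deviation of a feature; it does not change the shape of its distribution.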

[0030] The feature decoupling unit is used to input the preprocessing results into the feature decoupling module, which extracts spatial features, temporal features, and individual features through three parallel encoders based on a variational autoencoder architecture.

[0031] Specifically, encoders based on the variational autoencoder (VAE) architecture achieve effective separation of spatial features, temporal features, and individual features, including: Spatial feature encoder: Employs a 3D convolutional neural network structure to extract spatial features from micro-expression video sequences and outputs a spatial feature vector $z_{spatial} \in \mathbb{R}^{512}$.

[0032] Temporal feature encoder: First calculates the temporal gradient between adjacent frames, $X_{grad}[t] = X[t] - X[t-1]$, to highlight the motion changes of micro-expressions, and then extracts temporal features through a 3D convolutional neural network, outputting a temporal feature vector $z_{temporal} \in \mathbb{R}^{512}$.

[0033] Individual feature encoder: Processes the attribute features of the elderly through a multilayer perceptron (MLP) and outputs an individual feature vector $z_{individual} \in \mathbb{R}^{512}$.

[0034] Decoders: Corresponding to the three encoders, three decoders reconstruct the spatial, temporal, and individual parts of the input respectively; their outputs are used to calculate the reconstruction loss and supervise the feature learning of the encoders.

[0035] The spatial feature encoder extracts spatial features from low to high levels by applying stacked 3D convolutional and pooling layers to the preprocessed micro-expression video sequence. A fully connected layer maps the extracted spatial features to a high-dimensional feature space, two parallel fully connected layers output the mean vector and log-variance vector of the spatial features, and the spatial feature vector is generated by sampling with the reparameterization technique. The temporal feature encoder computes the temporal gradient of the preprocessed micro-expression video sequence to obtain a temporal gradient tensor, and extracts temporal features from that tensor with stacked 3D convolutional and pooling layers. A fully connected layer maps the extracted temporal features to a high-dimensional feature space, two parallel fully connected layers output the mean vector and log-variance vector of the temporal features, and the temporal feature vector is generated by sampling with the reparameterization technique. The individual feature encoder uses a multilayer perceptron to map the preprocessed elderly attribute features to a fixed-dimensional feature space; two parallel fully connected layers then output the mean vector and log-variance vector of the individual features, and the individual feature vector is generated by sampling with the reparameterization technique.
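The step shared by all three encoders (two parallel heads producing a mean vector and a log-variance vector, followed by reparameterized sampling) can be sketched as below. This is an illustrative sketch only: the 1024-dimensional input, the weight names `w_mu` and `w_log_var`, and the random weights are assumptions, since the layer sizes are not given in this text.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), where sigma = exp(0.5 * log_var)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
features = rng.standard_normal(1024)                  # stand-in for the fully connected layer output
w_mu = rng.standard_normal((1024, 512)) * 0.01        # mean head (hypothetical weights)
w_log_var = rng.standard_normal((1024, 512)) * 0.01   # log-variance head (hypothetical weights)
mu, log_var = features @ w_mu, features @ w_log_var
z = reparameterize(mu, log_var, rng)                  # latent feature vector in R^512
```

Writing the sample as a deterministic function of `mu`, `log_var`, and external noise is what allows gradients to flow through the sampling step during training.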

[0036] The state feature extraction unit is used to concatenate the extracted spatial feature vector, temporal feature vector, and individual feature vector and input them into the state space modeling module. The concatenated features are modeled in a time sequence from the past to the future and from the future to the past using the Mamba architecture. The forward and backward state features are then fused to obtain the fused state features.

[0037] Specifically, efficient time-series modeling based on the Mamba architecture includes: The extracted spatial, temporal, and individual feature vectors are concatenated to obtain the concatenated feature $z_{concat} = [z_{spatial}, z_{temporal}, z_{individual}] \in \mathbb{R}^{1536}$. The forward state space model (forward SSM) performs temporal modeling from the past to the future through linear projection and sequence expansion, and outputs the forward state features $H_{forward} \in \mathbb{R}^{T \times d_{model}}$.

[0038] The backward state space model (backward SSM) reverses the concatenated feature sequence and performs temporal modeling from the future to the past, outputting the backward state features $H_{backward} \in \mathbb{R}^{T \times d_{model}}$.

[0039] The forward and backward state features are concatenated and input into the state fusion module. The features are fused through a multi-head self-attention mechanism, and the fused state features $H_{fused} \in \mathbb{R}^{T \times d_{model}}$ are output.

[0040] The parameter adaptation unit is used to input the fused state features and elderly attribute features into the elderly parameter adaptation module, and obtain adaptive features through feature mapping, adaptive transformation and residual connection processing.

[0041] Specifically, the model parameters are dynamically adjusted based on the individual characteristics of the elderly, including: Elderly feature mapping: The elderly attribute features $P_{elderly}$ are mapped to a parameter modulation matrix $W_{modulate} \in \mathbb{R}^{d_{model} \times d_{model}}$ using an MLP.

[0042] Adaptive transformation: The fused state features $H_{fused}$ are multiplied by the parameter modulation matrix $W_{modulate}$ to achieve adaptive adjustment of the features.

[0043] Residual connection and layer normalization: The original features are added to the adaptively transformed features through a residual connection, and layer normalization is then performed to output the final adaptive features.
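The three adaptation steps above can be sketched in NumPy as follows. This is a sketch under assumptions, not the applicant's implementation: the two-layer MLP with weights `w1` and `w2`, the ReLU activation, the layer normalization without learned scale and shift, and the small sizes are all chosen for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned scale or shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def adapt(h_fused, p_elderly, w1, w2):
    """Modulate fused state features with a matrix generated from attribute features."""
    d_model = h_fused.shape[-1]
    hidden = np.maximum(0.0, p_elderly @ w1)               # MLP hidden layer (ReLU)
    w_modulate = (hidden @ w2).reshape(d_model, d_model)   # parameter modulation matrix
    transformed = h_fused @ w_modulate                     # adaptive transformation
    return layer_norm(h_fused + transformed)               # residual connection + layer norm

rng = np.random.default_rng(0)
T, d_model, d_p = 10, 16, 6                                # small illustrative sizes
h_fused = rng.standard_normal((T, d_model))
p_elderly = rng.standard_normal(d_p)
w1 = rng.standard_normal((d_p, 32)) * 0.1
w2 = rng.standard_normal((32, d_model * d_model)) * 0.1
h_adapted = adapt(h_fused, p_elderly, w1, w2)              # shape (T, d_model)
```

Because the modulation matrix is generated per individual, two people with different attribute vectors apply different linear transformations to the same fused features.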

[0044] The micro-expression recognition unit is used to input the adaptive features into the micro-expression classification module, extract weighted temporal features through a temporal attention mechanism, extract static temporal features through a global pooling operation, fuse the weighted temporal features and static temporal features, and then process them using a classification network to obtain the micro-expression classification result.

[0045] Specifically, the processed features are converted into a probability distribution over micro-expression categories, including: Temporal attention mechanism: Calculates the attention weights of features at different time steps to generate weighted temporal features.

[0046] Global temporal pooling: Performs global average pooling or max pooling on the adaptive features to generate static temporal features.

[0047] Feature fusion: Concatenates the weighted temporal features and static temporal features to form the final classification features.

[0048] Classification network: A multilayer perceptron maps the classification features to a probability distribution over seven micro-expression categories.
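The classification steps above can be sketched as a small NumPy function. The scoring vector `w_att`, the single hidden layer of width 32, and the choice of average rather than max pooling are assumptions for illustration; the source does not specify these details.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def classify(h, w_att, w_hidden, w_out):
    """Temporal attention + global average pooling + MLP over (T, d_model) features."""
    weights = softmax(h @ w_att, axis=0)          # (T,) attention weight per time step
    weighted = weights @ h                        # weighted temporal features, (d_model,)
    static = h.mean(axis=0)                       # global average pooling, (d_model,)
    fused = np.concatenate([weighted, static])    # classification features, (2 * d_model,)
    hidden = np.maximum(0.0, fused @ w_hidden)    # MLP hidden layer (ReLU)
    return softmax(hidden @ w_out)                # probabilities over 7 categories

rng = np.random.default_rng(0)
T, d_model = 10, 16
h = rng.standard_normal((T, d_model))             # stand-in for the adaptive features
probs = classify(
    h,
    rng.standard_normal(d_model),
    rng.standard_normal((2 * d_model, 32)) * 0.1,
    rng.standard_normal((32, 7)) * 0.1,
)
```

The attention branch emphasizes the few frames in which a micro-expression actually occurs, while the pooled branch summarizes the whole clip.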

[0049] The elderly micro-expression recognition model based on a decoupled state space model, abbreviated as DSSM-EMER, consists of a feature decoupling module, a state space modeling module, an elderly parameter adaptation module, and a micro-expression classification module. The overall structure of the model is shown in Figure 1.

[0050] The aforementioned neural network architecture and system for micro-expression recognition in the elderly, based on a decoupled state-space model, preprocesses the acquired micro-expression video sequences and elderly attribute features. It then extracts spatial, temporal, and individual features from the preprocessing results using three parallel variational autoencoders. The decoupled features are concatenated and input into forward and backward Mamba architectures for temporal modeling. An attention mechanism is then used to fuse the forward and backward state features. A parameter modulation matrix is generated based on the elderly attribute features, and the fused state features are adaptively adjusted. Finally, the adaptive features are fused using temporal attention and global pooling, and input into a classification network to obtain the probability distribution of micro-expression categories. This system significantly improves the accuracy and robustness of micro-expression recognition in the elderly population, taking into account the unique characteristics of their facial features.

[0051] In one embodiment, the feature decoupling module includes three encoders based on a variational autoencoder architecture: a spatial feature encoder, a temporal feature encoder, and an individual feature encoder. The feature decoupling unit is further configured to input the preprocessed micro-expression video sequence into the spatial feature encoder to obtain a spatial feature vector. The spatial feature encoder includes two 3D convolutional pooling modules, two fully connected layers, two parallel fully connected layers, and a reparameterization sampling module. Each 3D convolutional pooling module includes two 3D convolutional layers and one 3D max pooling layer. The preprocessed micro-expression video sequence is input into the temporal feature encoder to obtain a temporal feature vector. The temporal feature encoder includes a gradient calculation module, two 3D convolutional pooling modules, two fully connected layers, two parallel fully connected layers, and a reparameterization sampling module. Each 3D convolutional pooling module includes two 3D convolutional layers and one 3D max pooling layer. The preprocessed elderly attribute features are input into the individual feature encoder to obtain an individual feature vector. The individual feature encoder includes three fully connected layers, two parallel fully connected layers, and a reparameterization sampling module.

[0052] Specifically, the feature decoupling module is based on a variational autoencoder architecture, which effectively separates spatial, temporal, and individual features, thereby improving the interpretability and generalization ability of the model.

[0053] (1) The spatial feature encoder is responsible for extracting the spatial features of micro-expressions, that is, the static change patterns of facial muscles at different time points. The specific structural parameters of the spatial feature encoder are shown in Table 1 below.

[0054] Table 1. Specific structural parameters of the spatial feature encoder

[0055] The working principle of the spatial feature encoder is as follows: The input is the preprocessed micro-expression video sequence $X \in \mathbb{R}^{T \times H \times W \times 3}$. By stacking multiple 3D convolutional layers and pooling layers, spatial features are extracted progressively from low to high levels. 3D convolution can capture information in both the spatial and temporal dimensions simultaneously, but in the spatial feature encoder the temporal extent of the convolution kernel is kept relatively small so that the encoder focuses primarily on spatial features. The fully connected layer maps the convolutional features to a high-dimensional feature space, and two parallel fully connected layers then output the mean vector $\mu_{spatial}$ and the log-variance vector $\log \sigma_{spatial}^2$. Sampling is performed using the reparameterization technique to generate the spatial feature vector:

$$z_{spatial} = \mu_{spatial} + \sigma_{spatial} \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

[0056] where $z_{spatial}$ is the spatial feature vector, $\sigma_{spatial} = \exp\left(\tfrac{1}{2} \log \sigma_{spatial}^2\right)$, and $\epsilon$ is standard normal noise. The spatial feature decoder adopts a structure symmetrical to the encoder. Through transposed convolution and upsampling operations, it reconstructs the spatial part of the original video sequence from the spatial feature vector, which is used to calculate the spatial feature reconstruction loss.

[0057] (2) The temporal feature encoder is responsible for extracting the temporal features of micro-expressions, that is, the dynamic change pattern of facial muscles in the time dimension. Its innovation lies in first calculating the temporal gradient to highlight the motion changes of micro-expressions.

[0058] The working principle of the temporal feature encoder is as follows: The input is the preprocessed micro-expression video sequence $X \in \mathbb{R}^{T \times H \times W \times 3}$. The temporal gradient $X_{grad}[t] = X[t] - X[t-1]$ is calculated, giving a temporal gradient tensor of size $(T-1) \times H \times W \times 3$ that highlights motion changes between adjacent frames. The subsequent 3D convolutional layer structure is similar to that of the spatial feature encoder, but the temporal extent of the convolution kernel is set larger in order to better capture temporal features. The mean vector $\mu_{temporal}$ and the log-variance vector $\log \sigma_{temporal}^2$ are output through fully connected layers, and sampling is performed using the reparameterization technique to generate the temporal feature vector $z_{temporal}$.

[0059] The advantages of temporal gradient calculation are: enhancing the expression of motion features and highlighting the dynamic changes of micro-expressions; suppressing static background information and reducing the interference of redundant information; and improving robustness to illumination changes, since the illumination changes between adjacent frames are relatively small.
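The temporal gradient and the properties listed above can be illustrated with a minimal NumPy sketch. The synthetic clip (a static background with a brief local change in one frame) is an assumption constructed for illustration.

```python
import numpy as np

def temporal_gradient(video):
    """Frame differences X_grad[t] = X[t] - X[t-1] for a (T, H, W, 3) video; returns (T-1, H, W, 3)."""
    return video[1:] - video[:-1]

rng = np.random.default_rng(0)
background = rng.uniform(-1, 1, size=(1, 128, 128, 3))
video = np.repeat(background, 8, axis=0)   # static scene, 8 frames
video[5, 40:60, 40:60] += 0.2              # brief local movement in frame 5
grad = temporal_gradient(video)            # shape (7, 128, 128, 3)

# A uniform brightness offset applied to every frame cancels in the difference.
grad_brighter = temporal_gradient(video + 0.3)
```

In this toy clip the gradient is zero wherever the scene is static and non-zero only in the region that moved, and a constant brightness offset leaves the gradient unchanged up to floating-point rounding. Real illumination changes are rarely perfectly uniform, so the cancellation is approximate in practice.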

[0060] The temporal feature decoder adopts a structure symmetrical to the encoder, reconstructing the temporal feature vectors into temporal gradient tensors, which are used to calculate the temporal feature reconstruction loss.

[0061] (3) The individual feature encoder is responsible for extracting features related to the elderly individual, providing a basis for subsequent adaptive processing.

[0062] The working principle of the individual feature encoder is as follows: The input is the preprocessed elderly attribute feature vector $P_{elderly} \in \mathbb{R}^{d_p}$. The attribute features are mapped to a fixed-dimensional feature space through a multilayer perceptron, which includes a dropout layer to prevent overfitting. The mean vector $\mu_{individual}$ and the log-variance vector $\log \sigma_{individual}^2$ are output through two parallel fully connected layers, and sampling is performed using the reparameterization technique to generate the individual feature vector $z_{individual}$.

[0063] The individual feature decoder adopts a structure symmetrical to the encoder, reconstructing the individual feature vector into the original attribute feature vector, which is used to calculate the individual feature reconstruction loss.

[0064] In one embodiment, the state space modeling module includes: a forward SSM, a backward SSM, and a state fusion module; a state feature extraction unit is further used to concatenate the extracted spatial feature vector, temporal feature vector, and individual feature vector to obtain concatenated features; the concatenated features are input into the forward SSM and backward SSM respectively to obtain forward state features and backward state features; the forward SSM is used to process temporal information from the past to the future; the backward SSM is used to process the reversed input sequence and capture the temporal dependencies from the future to the past; the forward state features and backward state features are input into the state fusion module to obtain fused state features; the state fusion module is used to concatenate the forward state features and backward state features and then perform feature fusion using a multi-head self-attention mechanism to obtain fused state features.

[0065] In one embodiment, the state fusion module includes: a feature concatenation module, a multi-head self-attention mechanism, and a feedforward network; inputting forward and backward state features into the state fusion module to obtain fused state features includes: inputting the forward and backward state features into the feature concatenation module to obtain concatenated state features; inputting the concatenated state features into the multi-head self-attention mechanism to obtain attention features; adding the attention features and the concatenated state features and then performing layer normalization to obtain normalized features; inputting the normalized features into the feedforward network, performing a nonlinear transformation on the normalized features, adding the nonlinearly transformed features and the normalized features, and performing layer normalization on the sum to obtain the fused state features.
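The data flow of the state fusion module described above (concatenation, multi-head self-attention, add and normalize, feedforward network, add and normalize) can be sketched in NumPy as follows. This is a sketch under assumptions: the weights are random, the head count and feedforward width are arbitrary, and layer normalization has no learned parameters. The sketch keeps the concatenated width of 2 × d_model throughout; the text gives d_model as the width of the fused features but does not state where the width is reduced, so no projection is invented here.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def multi_head_self_attention(x, wq, wk, wv, wo, heads):
    T, d = x.shape
    d_k = d // heads
    # Project, then split the feature axis into heads: (heads, T, d_k).
    q, k, v = [(x @ w).reshape(T, heads, d_k).transpose(1, 0, 2) for w in (wq, wk, wv)]
    scores = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_k))   # (heads, T, T)
    merged = (scores @ v).transpose(1, 0, 2).reshape(T, d)      # concatenate heads
    return merged @ wo                                          # output projection

def fuse(h_forward, h_backward, params, heads=4):
    x = np.concatenate([h_forward, h_backward], axis=-1)        # feature concatenation
    x = layer_norm(x + multi_head_self_attention(x, *params["attn"], heads))
    w1, b1, w2, b2 = params["ffn"]
    ffn = np.maximum(0.0, x @ w1 + b1) @ w2 + b2                # two linear maps with ReLU
    return layer_norm(x + ffn)

rng = np.random.default_rng(0)
T, d_model = 10, 8
d = 2 * d_model
params = {
    "attn": tuple(rng.standard_normal((d, d)) * 0.1 for _ in range(4)),
    "ffn": (rng.standard_normal((d, 4 * d)) * 0.1, np.zeros(4 * d),
            rng.standard_normal((4 * d, d)) * 0.1, np.zeros(d)),
}
h_fused = fuse(rng.standard_normal((T, d_model)), rng.standard_normal((T, d_model)), params)
```

Self-attention lets every time step draw on both directions of context at every other time step, which is why it is applied after the forward and backward features have been joined.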

[0066] Specifically, the state-space modeling module is based on the Mamba architecture and enables efficient temporal modeling, effectively capturing long-range temporal dependencies of micro-expressions while maintaining linear time complexity.

[0067] (1) Mamba architecture principle: Mamba is a neural network architecture based on the selective state space model, whose theoretical foundation is the continuous-time state space model.

[0068] $$h'(t) = A h(t) + B x(t), \qquad y(t) = C h(t) + D x(t)$$

[0069] where $h(t)$ is the hidden state vector, $x(t)$ is the input vector, $y(t)$ is the output vector, and $A$, $B$, $C$, $D$ are the system parameter matrices.

[0070] Mamba's innovation lies in its selective mechanism, which enables the model to adaptively retain or forget information based on the input content. This mechanism is implemented through the parameters Δ, B, and C: Δ controls the rate of state updates, while B and C control the weights of the inputs and outputs.

[0071] The discretized state update formula is:

[0072]

[0073] Here, ⊙ denotes element-wise multiplication, and the activation function is the sigmoid linear unit (SiLU). The update-rate parameter controls the rate of state updates: when it is large, the model retains more historical information; when it is small, the model is more inclined to accept new information.
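To make the idea of an input-dependent, element-wise state update concrete, the following NumPy sketch runs a simplified selective scan. It is not the formula of this application, which is not reproduced in the text above; it follows a commonly used simplified discretization (a diagonal state matrix with decay factor `exp(delta * a)`), and all weight names, sizes, and the softplus step-size function are assumptions. In this sketch, the quantity that determines how much history is kept is the decay factor `a_bar`: values near 1 carry the previous state forward, values near 0 overwrite it.

```python
import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)   # log(1 + exp(x)), keeps the step size positive

def selective_scan(x, a, w_delta, w_b, w_c):
    """Run a simplified selective SSM over a (T, d) sequence with state size n.

    a: (d, n) negative diagonal state parameters; w_delta: (d, d); w_b, w_c: (d, n).
    """
    T, d = x.shape
    h = np.zeros((d, a.shape[1]))
    y = np.empty((T, d))
    for t in range(T):                            # one pass: cost grows linearly with T
        delta = softplus(x[t] @ w_delta)          # (d,) input-dependent step size
        b_t, c_t = x[t] @ w_b, x[t] @ w_c         # (n,) input-dependent B and C
        a_bar = np.exp(delta[:, None] * a)        # (d, n) decay factor in (0, 1)
        b_bar = delta[:, None] * b_t[None, :]     # (d, n) simplified discretization of B
        h = a_bar * h + b_bar * x[t][:, None]     # element-wise state update
        y[t] = h @ c_t                            # (d,) output read-out
    return y

rng = np.random.default_rng(0)
T, d, n = 20, 8, 4
x = rng.standard_normal((T, d))
a = -np.exp(rng.standard_normal((d, n)))          # negative, so exp(delta * a) < 1
w_delta, w_b, w_c = (rng.standard_normal(s) * 0.1 for s in [(d, d), (d, n), (d, n)])
y = selective_scan(x, a, w_delta, w_b, w_c)       # shape (T, d)
```

The loop visits each time step once, which is the source of the linear time complexity attributed to the state space modeling module; production implementations replace the Python loop with a fused or parallel scan.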

[0074] (2) Forward SSM (forward state space model): The forward SSM processes temporal information from the past to the future. Its detailed workflow is as follows: 1) Feature concatenation: The spatial feature $z_{spatial}$, temporal feature $z_{temporal}$, and individual feature $z_{individual}$ are concatenated as $z_{concat} \in \mathbb{R}^{1536}$.

[0075] 2) Linear projection: The concatenated feature is projected onto the model dimension $d_{model}$ through a fully connected layer to obtain $x_{proj} \in \mathbb{R}^{d_{model}}$.

[0076] 3) Sequence expansion: $x_{proj}$ is repeated T times to form a sequence of length T.

[0077] 4) Selective parameter generation: The selective parameters Δ, B, and C are generated from the input sequence through linear transformations.

[0078] 5) State update: The hidden state is updated based on the selective parameters and the input sequence.

[0079] 6) Output calculation: The output is calculated from the hidden state, and the forward state features $H_{forward} \in \mathbb{R}^{T \times d_{model}}$ are obtained through linear projection and layer normalization.
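Steps 1) to 3) of the workflow above determine the shape of the input to the state space model and can be sketched directly. The values of `T` and `d_model`, the random feature vectors, and the projection weights `w_proj` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model = 12, 64                                   # illustrative sizes (assumptions)
z_spatial, z_temporal, z_individual = (rng.standard_normal(512) for _ in range(3))

z_concat = np.concatenate([z_spatial, z_temporal, z_individual])   # (1536,)
w_proj = rng.standard_normal((1536, d_model)) * 0.02               # hypothetical projection weights
x_proj = z_concat @ w_proj                                          # (d_model,)
x_seq = np.tile(x_proj, (T, 1))                                     # (T, d_model): x_proj repeated T times
```

Every row of `x_seq` is identical at this point, so any variation across time steps in the resulting state features comes from the state update in steps 4) to 6).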

[0080] (3) Backward SSM (backward state space model): The backward SSM has the same structure as the forward SSM, but it processes the input sequence in reverse order to capture temporal dependencies from the future to the past. The backward SSM is structured as a reversed Mamba module, including: Feature fusion: concatenating the three decoupled features; Sequence reversal: reversing the input sequence; Projection and expansion: linear projection and sequence dimension transformation; Selective scanning (the core Mamba mechanism): input projection and parameter generation, followed by state-space model computation; Sequence recovery: restoring the output sequence to its forward order; Post-processing: output projection and normalization; Final output: the reversed feature representation.

[0081] Technical features of the backward SSM: bidirectional modeling, in which reverse temporal modeling is achieved by reversing the sequence order; context capture, in which temporal dependencies from later to earlier time steps are captured; and feature enhancement, in which the module complements the forward Mamba to provide a more comprehensive sequence representation. Combined with the forward module, this backward Mamba module forms a bidirectional temporal modeling architecture.

[0082] The state update formula for backward SSM is:

[0083] where the input sequence is taken in reverse order. After processing, the output sequence is restored to its original order to obtain the backward state features.
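The reverse, scan, and restore pattern of the backward SSM can be illustrated as below. This is a sketch under simplifying assumptions: `causal_scan` is a hypothetical stand-in recurrence with a fixed decay, used only to make the direction of information flow visible; it is not the Mamba selective scan itself.

```python
def causal_scan(xs, decay=0.5):
    """Stand-in recurrence h_t = decay * h_{t-1} + x_t (hypothetical, not Mamba)."""
    h, out = 0.0, []
    for x in xs:
        h = decay * h + x
        out.append(h)
    return out

def backward_scan(xs, decay=0.5):
    """Reverse the sequence, scan it, then restore the original order."""
    reversed_out = causal_scan(xs[::-1], decay)
    return reversed_out[::-1]

# An impulse at the last time step.
seq = [0.0, 0.0, 1.0]
forward = causal_scan(seq)     # [0.0, 0.0, 1.0]: earlier steps cannot see it
backward = backward_scan(seq)  # [0.25, 0.5, 1.0]: earlier steps receive it
```

The forward pass leaves the first two positions untouched by the late impulse, while the backward pass propagates it to earlier positions, which is the future-to-past dependency the text describes.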

[0084] (4) State fusion module. The state fusion module fuses the forward and backward state features to obtain complete temporal information, using a multi-head self-attention mechanism for feature fusion. The network structure of the state fusion module is shown in Figure 2. The mathematical expression of the multi-head self-attention mechanism is:

[0085]

[0086]

[0087] where h is the number of attention heads; the three per-head matrices are the query, key, and value projection matrices; the remaining matrix is the output projection matrix; and the remaining scalar is the dimension of each attention head.

[0088] A feedforward network (FFN) consists of two linear transformations and a ReLU activation function:
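For reference, the widely used textbook forms of multi-head attention and of a two-layer ReLU feedforward network are given below. They match the definitions in the surrounding paragraphs (h heads, three projection matrices per head, an output projection matrix, a per-head dimension, two linear transformations with a ReLU), but the symbols $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, $W^{O}$, $d_k$, $W_1$, $W_2$, $b_1$, and $b_2$ are assumed notation, and the application's own expressions may differ in detail.

```latex
\begin{align}
\mathrm{MultiHead}(Q, K, V) &= \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O} \\
\mathrm{head}_i &= \mathrm{softmax}\!\left(\frac{(Q W_i^{Q})(K W_i^{K})^{\top}}{\sqrt{d_k}}\right) V W_i^{V} \\
\mathrm{FFN}(x) &= \max(0,\; x W_1 + b_1)\, W_2 + b_2
\end{align}
```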

[0089] The detailed steps of state fusion performed by the state fusion module are as follows: 1) Feature concatenation: the forward state features Hforward and the backward state features are concatenated along the feature dimension to obtain the concatenated state features.

[0090] 2) Multi-head self-attention: the concatenated state features are projected into query (Q), key (K), and value (V) matrices, the attention weights are calculated, and the values are summed with those weights to obtain the attention output.

[0091] 3) Residual connection and layer normalization: the attention output and the input are added together, and layer normalization is then applied.

[0092] 4) Feedforward network: the normalized features are transformed nonlinearly through a feedforward network.

[0093] 5) Residual connection and layer normalization: the output and input of the feedforward network are added together, and layer normalization is applied to obtain the fused state features.
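Steps 1) to 3) above can be sketched in plain Python as follows. This is a heavily simplified illustration, assuming a single attention head with identity Q/K/V projections and no learned parameters; the feedforward stage of steps 4) and 5) is omitted. The function name `fuse_states` is hypothetical.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(v, eps=1e-5):
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

def fuse_states(h_forward, h_backward):
    """Inputs: T lists of d floats each. Returns T lists of 2*d floats."""
    # Step 1: concatenate forward and backward features along the feature axis.
    h_cat = [f + b for f, b in zip(h_forward, h_backward)]
    dim = len(h_cat[0])
    fused = []
    for q in h_cat:
        # Step 2: scaled dot-product self-attention (single head, identity projections).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in h_cat]
        weights = softmax(scores)
        attn = [sum(w * v[j] for w, v in zip(weights, h_cat)) for j in range(dim)]
        # Step 3: residual connection followed by layer normalization.
        fused.append(layer_norm([a + x for a, x in zip(attn, q)]))
    return fused

fused = fuse_states([[1.0, 2.0], [0.0, 1.0]], [[3.0, 4.0], [1.0, 0.0]])
```

Each output row has twice the per-direction feature width because concatenation happens before attention, as in step 1).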

[0094] In one embodiment, the elderly parameter adaptation module includes an elderly feature mapping unit and an adaptive transformation unit. The parameter adaptation unit is further configured to: input the elderly attribute features into the elderly feature mapping unit, map the elderly attribute features into a multidimensional vector through a multilayer perceptron, and reshape the multidimensional vector into a perturbation of the parameter modulation matrix; initialize the parameter modulation matrix to the sum of the identity matrix and the perturbation of the parameter modulation matrix; perform matrix multiplication between the fused state features and the parameter modulation matrix to obtain the modulated features; and perform a residual connection between the modulated features and the fused state features, followed by layer normalization, to obtain the adaptive features.

[0095] Specifically, the elderly parameter adaptive module is a personalized design of this invention for the elderly population. It can dynamically adjust the model parameters according to the individual characteristics of the elderly to improve recognition accuracy.

[0096] The working principle of the elderly parameter adaptive module is as follows: (1) Elderly feature mapping: the elderly attribute features are mapped through a multilayer perceptron to a multidimensional vector, which is then reshaped into matrix form to build the parameter modulation matrix.

[0097] (2) Initialization of the parameter modulation matrix: Wmodulate is initialized as a perturbation of the identity matrix, that is, as the sum of the identity matrix and a small perturbation term. The small perturbation ensures that the model's behavior does not change significantly in the initial state.

[0098] (3) Adaptive transformation: the fused state features are multiplied by the parameter modulation matrix (matrix multiplication) to obtain the modulated features.

[0099] (4) Residual connection and layer normalization: the modulated features are residually connected with the original features, and layer normalization is then performed.

[0100] where the learnable adaptive coefficients control the strength of the adaptive adjustment.
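The four steps above can be sketched as follows. This is an illustration under stated assumptions, not the application's implementation: the perturbation `delta_w` stands in for the reshaped MLP output, a single scalar `alpha` stands in for the learnable adaptive coefficients, and the residual is assumed to take the form fused + alpha × modulated. The name `adapt` is hypothetical.

```python
import math

def layer_norm(v, eps=1e-5):
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

def adapt(h_fused, delta_w, alpha=1.0):
    """h_fused: list of d floats; delta_w: d x d perturbation (assumed MLP output)."""
    d = len(h_fused)
    # Step (2): W_modulate = I + delta_w, a small perturbation of the identity.
    w = [[(1.0 if i == j else 0.0) + delta_w[i][j] for j in range(d)]
         for i in range(d)]
    # Step (3): matrix multiplication, row vector times matrix.
    modulated = [sum(h_fused[i] * w[i][j] for i in range(d)) for j in range(d)]
    # Step (4): residual connection (assumed form) and layer normalization.
    return layer_norm([h + alpha * m for h, m in zip(h_fused, modulated)])

h = [1.0, 2.0, 3.0, 4.0]
zero_perturbation = [[0.0] * 4 for _ in range(4)]
out = adapt(h, zero_perturbation)
```

With a zero perturbation the modulation matrix is the identity, so the output matches the layer-normalized input up to the small epsilon term, which mirrors the stated goal of leaving initial behavior nearly unchanged.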

[0101] The advantages of the elderly parameter adaptive module are as follows: it dynamically adjusts model parameters according to the individual characteristics of each elderly person, adapting to differences in facial features among individuals; it achieves a feature space transformation through the parameter modulation matrix rather than simple feature weighting, which gives it stronger adjustment capability; the residual connection ensures good model performance in the initial stage, and the adaptive adjustment improves on this baseline performance; and the learnable adaptive coefficients automatically balance the intensity of the adaptive adjustment, avoiding overfitting caused by over-adjustment.

[0102] In one embodiment, the micro-expression classification module includes: a temporal attention mechanism, a global temporal pooling layer, and a classification network; the classification network includes three fully connected layers and a Softmax activation function; the micro-expression recognition unit is further configured to use the temporal attention mechanism to calculate the attention weights of features at different time steps and generate temporal attention features; process the adaptive features using the global temporal pooling layer to obtain static temporal features; concatenate the static temporal features and the temporal attention features to obtain temporal concatenated features; and input the temporal concatenated features into the classification network to obtain the micro-expression classification result.

[0103] Specifically, the micro-expression classification module is responsible for converting the processed features into the final micro-expression category probability distribution. Its innovation lies in combining temporal attention mechanisms and global pooling to fully utilize the temporal information in adaptive features.

[0104] (1) Temporal attention mechanism. The temporal attention mechanism calculates the importance weights of features at different time steps, highlighting the time steps that contribute most to classification. The calculation process is as follows:

[0105]

[0106]

[0107] where the first quantity is the attention score at time step t, calculated by a two-layer MLP; the second is the normalized attention weight; and the third is the temporal attention feature obtained by weighted summation.
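The score, normalize, and weighted-sum sequence described above can be sketched as follows. The attention scores are passed in directly here as a stand-in for the two-layer MLP output, and softmax normalization over time steps is an assumption consistent with "normalized attention weights". The function name `temporal_attention` is hypothetical.

```python
import math

def temporal_attention(features, scores):
    """features: T lists of d floats; scores: T raw attention scores.

    Returns the normalized weights and the weighted-sum feature vector.
    """
    m = max(scores)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]      # softmax over time steps
    d = len(features[0])
    pooled = [sum(w * f[j] for w, f in zip(weights, features)) for j in range(d)]
    return weights, pooled

weights, pooled = temporal_attention([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
# Equal scores give uniform weights, so the result is the plain temporal mean.
```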

[0108] (2) Global temporal pooling. Global temporal pooling captures the static features of the entire sequence using a combination of average pooling and max pooling:

[0109]

[0110]

[0111] where the first quantity is the average pooling feature, the second is the max pooling feature, and the third is the static temporal feature obtained by concatenating the two.

[0112] (3) Feature fusion and classification. The temporal attention features and the static temporal features are concatenated to form the final classification features.
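The pooling and concatenation described above can be sketched as follows. The ordering of the concatenated parts (attention feature, then average, then max) is an assumption, and the function names are hypothetical.

```python
def global_temporal_pool(features):
    """features: T lists of d floats. Returns [average pool | max pool], length 2*d."""
    t, d = len(features), len(features[0])
    avg = [sum(f[j] for f in features) / t for j in range(d)]
    mx = [max(f[j] for f in features) for j in range(d)]
    return avg + mx

def classification_feature(attn_feature, features):
    """Concatenate the temporal attention feature with the static temporal feature."""
    return attn_feature + global_temporal_pool(features)

static = global_temporal_pool([[1.0, 4.0], [3.0, 2.0]])
final = classification_feature([0.5, 0.5], [[1.0, 4.0], [3.0, 2.0]])
```

For per-step features of width d, the resulting classification feature has width 3·d under these assumptions: d from temporal attention, d from average pooling, and d from max pooling.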

[0113] The classification network employs a multilayer perceptron architecture with the following structure: the first layer is a fully connected layer that takes the final classification features as input and has an output dimension of 1024, with ReLU activation and Dropout (0.5); the second layer is a fully connected layer with an input dimension of 1024 and an output dimension of 512, with ReLU activation and Dropout (0.3); the third layer is a fully connected layer with an input dimension of 512 and an output dimension of 7, with no activation function. The Softmax layer converts the output into a probability distribution over the 7 micro-expression categories: happiness, sadness, anger, fear, surprise, disgust, and contempt, covering the basic emotional categories.

[0114] In one embodiment, an elderly micro-expression recognition model based on a decoupled state-space model is constructed, consisting of a feature decoupling module, a state-space modeling module, an elderly parameter adaptation module, and a micro-expression classification module. The total loss function used during the training of the elderly micro-expression recognition model is:

[0115] $$L_{total} = L_{cls} + \lambda_1 L_{rec} + \lambda_2 L_{VAE} + \lambda_3 L_{dec} + \lambda_4 L_{reg}$$

where $L_{total}$ is the total loss; $L_{cls}$ is the classification loss, which uses the cross-entropy loss function to measure the difference between the predicted probability distribution and the true label; $L_{rec}$ is the reconstruction loss, which uses the mean squared error loss function; $L_{VAE}$ is the sum of the VAE losses of the three encoders based on the variational autoencoder architecture; $L_{dec}$ is the decoupling loss, used to enhance the independence between different feature dimensions; $L_{reg}$ is the regularization loss; and $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ are the weight coefficients for the reconstruction loss, the sum of the VAE losses, the decoupling loss, and the regularization loss, respectively.

[0116] In one embodiment, the VAE loss of each encoder based on the variational autoencoder architecture is:

[0117] $$L_{VAE}^{enc} = L_{recon}^{enc} + \beta L_{KL}^{enc} = -\mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] + \beta\, D_{KL}\left(q_\phi(z|x) \,\|\, p(z)\right)$$

where $L_{VAE}^{enc}$ is the VAE loss of an encoder based on the variational autoencoder architecture; $L_{recon}^{enc}$ is its reconstruction loss; $L_{KL}^{enc}$ is its KL divergence loss; $\beta$ is a hyperparameter that balances the reconstruction loss and the KL divergence loss; $p_\theta(x|z)$ is the conditional probability distribution output by the decoder; $q_\phi(z|x)$ is the approximate posterior distribution output by the encoder; $p(z)$ is the prior distribution of the latent variables; and $x$ and $z$ are the input data and the latent variables, respectively.

[0118] Specifically, each loss component in the total loss during training is as follows: (1) Classification loss. The classification loss uses the cross-entropy loss function to measure the difference between the predicted probability distribution and the true label. The classification loss expression is:

[0119] $$L_{cls} = -\sum_{i=1}^{C} y_i \log \hat{y}_i$$

where $C$ is the number of micro-expression categories, $y_i$ is the $i$-th component of the one-hot encoding of the true label, and $\hat{y}_i$ is the predicted probability of the $i$-th micro-expression category.

[0120] (2) Reconstruction loss. The reconstruction loss uses the mean squared error (MSE) loss function to measure the decoder's ability to reconstruct the input data. The expression for the reconstruction loss is:

[0121] where the reconstructed quantities are the video sequence reconstructed by the spatial and temporal feature decoders and the attribute feature vector reconstructed by the individual feature decoder.

[0122] (3) VAE loss. The VAE loss is the sum of the VAE losses of the three encoders. The VAE loss of each variational autoencoder consists of the reconstruction loss and the KL divergence loss. The loss function of each variational autoencoder is:

[0123] $$L_{VAE}^{enc} = -\mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] + \beta\, D_{KL}\left(q_\phi(z|x) \,\|\, p(z)\right)$$

The first term is the reconstruction loss, which measures the decoder's ability to reconstruct the input data, using mean squared error (MSE) or cross-entropy loss. The second term is the KL divergence loss, which measures the difference between the distribution output by the encoder, $q_\phi(z|x)$, and the prior distribution $p(z)$, prompting the encoder to learn a smooth feature space. $\beta$ is a hyperparameter that balances the reconstruction loss and the KL divergence loss. Experiments have shown that the preferred setting of this hyperparameter maintains good reconstruction capability while ensuring the decoupling effect of the features.
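The per-encoder loss described above (a reconstruction term plus a weighted KL term) can be sketched as follows. This sketch assumes a diagonal Gaussian approximate posterior parameterized by a mean and a log-variance, a standard normal prior, and an MSE reconstruction term; the closed-form KL expression follows from those assumptions. The default `beta=1.0` is a placeholder, since the preferred value is not recoverable from the text.

```python
import math

def vae_loss(x, x_recon, mu, logvar, beta=1.0):
    """Return (total, reconstruction, kl) for one sample.

    x, x_recon : input and reconstruction, lists of equal length
    mu, logvar : mean and log-variance of the diagonal Gaussian posterior (assumed)
    beta       : balance hyperparameter (placeholder value)
    """
    # Reconstruction term: mean squared error.
    recon = sum((a - b) ** 2 for a, b in zip(x, x_recon)) / len(x)
    # KL(N(mu, sigma^2) || N(0, 1)) in closed form, summed over latent dimensions.
    kl = -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))
    return recon + beta * kl, recon, kl

total, recon, kl = vae_loss([1.0, 2.0], [1.0, 2.0], [1.0], [0.0], beta=0.5)
```

When the posterior already equals the prior (zero mean, zero log-variance) the KL term vanishes, so the loss reduces to the reconstruction error alone.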

[0124] The total VAE loss of the feature decoupling module is the sum of the VAE losses of the three encoders:

[0125] where the three terms are the VAE loss of the spatial feature encoder, the VAE loss of the temporal feature encoder, and the VAE loss of the individual feature encoder.

[0126] (4) Decoupling loss. The decoupling loss is used to enhance the independence between different feature dimensions by minimizing mutual information.

[0127] where the mutual information between each pair of features is estimated by a neural network, specifically using the MINE (Mutual Information Neural Estimation) method.

[0128] (5) Regularization loss. The regularization loss uses L2 regularization to prevent overfitting caused by excessively large model parameters.

[0129] where the first symbol denotes the set of all learnable parameters of the model, and the norm is the Frobenius norm of a parameter matrix W.

[0130] (6) Loss weighting coefficient The weighting coefficients for each loss were determined experimentally, and the specific values ​​are shown in Table 2.

[0131] Table 2 Loss Weight Coefficient Values

[0132] The specific training process for the elderly micro-expression recognition model based on the decoupled state-space model includes: (1) Training configuration. The AdamW optimizer is used, combining the advantages of Adam's adaptive learning rate and weight decay. Its settings comprise the exponential decay rate of the first-moment estimate, the exponential decay rate of the second-moment estimate, a numerical stability parameter, and a weight decay of 0.01.

[0133] A cosine annealing learning rate scheduler is used together with a learning rate warm-up strategy. Warm-up phase: over the first 10 epochs, the learning rate is increased linearly up to the initial learning rate. Annealing phase: afterwards, the learning rate is gradually decreased following a cosine function.
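The warm-up plus cosine annealing schedule can be sketched as a plain function of the epoch index. The exact warm-up starting point and the minimum learning rate are assumptions, and `base_lr` is left as a parameter because the initial learning rate value is not recoverable from the text; the 10 warm-up epochs and 200 maximum epochs come from the description.

```python
import math

def learning_rate(epoch, base_lr, warmup_epochs=10, max_epochs=200, min_lr=0.0):
    """Linear warm-up to base_lr, then cosine decay toward min_lr (assumed 0)."""
    if epoch < warmup_epochs:
        # Linear ramp that reaches base_lr at the last warm-up epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, max_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical base learning rate, used for illustration only.
schedule = [learning_rate(e, base_lr=1e-3) for e in range(200)]
```

In a PyTorch training loop, the same shape would usually be obtained from the library's built-in schedulers rather than a hand-written function; the version here only makes the arithmetic explicit.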

[0134] (2) Training parameters. The batch size is set to 32 and the maximum number of training epochs to 200. Early stopping: training stops when the validation set accuracy has not improved for 20 consecutive epochs. Gradient clipping: the upper limit of the gradient norm is set to 1.0 to prevent gradient explosion. Regularization: in addition to L2 regularization, Dropout is used in the fully connected layers with a ratio between 0.2 and 0.5. Mixed-precision training: FP16 mixed-precision training is used to accelerate training and reduce memory usage.
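Two of the safeguards above, gradient-norm clipping at 1.0 and early stopping with a patience of 20, can be sketched independently of any framework. The function and class names are hypothetical, and treating "no improvement" as "not strictly greater than the best accuracy so far" is an assumption.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a flat list of gradients so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=20):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_accuracy):
        if val_accuracy > self.best:       # strict improvement resets the counter
            self.best = val_accuracy
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

clipped = clip_by_global_norm([3.0, 4.0])  # norm 5.0 is rescaled to norm 1.0
stopper = EarlyStopping(patience=20)
```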

[0135] (3) Data augmentation To improve the model's generalization ability, the following data augmentation operations were performed on the micro-expression video sequences: random flipping: horizontal flipping probability 50%; random rotation: random rotation between -10° and 10°; brightness adjustment: brightness randomly varied by ±15%; contrast adjustment: contrast randomly varied by ±15%; Gaussian noise: Gaussian noise with a standard deviation of 0.01 was added; temporal perturbation: a small number of frames (≤10%) were randomly deleted or repeated to simulate changes in the duration of micro-expressions.

[0136] (4) Initialization strategy. The following parameter initialization strategy is adopted to ensure the stability and convergence speed of model training. Fully connected layer initialization: the Xavier initialization method is used so that the variance of the signal stays consistent in forward and backward propagation. Convolutional layer initialization: the Kaiming initialization method is used, which is particularly suitable for the ReLU activation function and avoids the vanishing gradient problem. Mamba parameter initialization: matrix A is initialized as a diagonal matrix whose diagonal elements are random values between -1 and -5 to ensure initial state stability; the Δ parameter is initialized to 0 to ensure a moderate initial state update rate. Adaptive module initialization: the parameter modulation matrix is initialized as a small perturbation of the identity matrix to ensure stable model behavior in the initial stage. Bias initialization: all bias parameters are initialized to 0.

[0137] (5) Model training process. The model training process includes the following steps: 1) Data preparation: collect micro-expression video data and the corresponding elderly attribute data, and divide them into a training set (70%), a validation set (15%), and a test set (15%); 2) Data preprocessing: preprocess the video sequences and attribute features; 3) Model construction: construct the elderly micro-expression recognition model based on the decoupled state-space model; 4) Parameter initialization; 5) Training cycle, comprising forward propagation (input the preprocessed data and calculate the model output and the total loss), backpropagation (calculate the gradient of the loss with respect to each parameter and apply gradient clipping), parameter update (update the model parameters using the AdamW optimizer), and learning rate adjustment (adjust the learning rate according to the cosine annealing scheduler); 6) Model evaluation: after each epoch, evaluate the model on the validation set and record metrics such as accuracy, precision, recall, and F1 score; 7) Early stopping check: if the validation set performance does not improve for 20 consecutive epochs, stop training; 8) Model saving: save the weights of the model that performs best on the validation set; 9) Testing and evaluation: after training, evaluate the final model on the test set.

[0138] In one embodiment, the elderly attribute features include: age, height, weight, gender, ethnicity, health status, and facial image. The preprocessing process for the elderly attribute features includes: performing Z-score standardization on age, height, and weight to obtain standardized numerical features; performing one-hot encoding or embedding on gender, ethnicity, and health status to obtain categorical encoding features; extracting facial feature descriptors from the facial image as supplementary attribute features; and concatenating the standardized numerical features, categorical encoding features, and supplementary attribute features into a multi-dimensional vector as the preprocessed elderly attribute features.
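The attribute preprocessing described above (Z-score standardization, one-hot encoding, and concatenation) can be sketched as follows. The record layout, the dictionary keys, and the `face_descriptor` field are hypothetical; the facial feature descriptor is taken as an already-computed list of numbers, and the embedding alternative to one-hot encoding is not shown.

```python
import math

def z_score(values):
    """Standardize a column to zero mean and unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std if std > 0 else 0.0 for v in values]

def one_hot(value, categories):
    return [1.0 if value == c else 0.0 for c in categories]

def preprocess_attributes(records, categories):
    """records: list of dicts (hypothetical keys); categories: key -> category list."""
    numeric_keys = ["age", "height", "weight"]
    columns = {k: z_score([r[k] for r in records]) for k in numeric_keys}
    vectors = []
    for idx, r in enumerate(records):
        vec = [columns[k][idx] for k in numeric_keys]   # standardized numeric features
        for key, cats in categories.items():
            vec += one_hot(r[key], cats)                # categorical encoding features
        vec += list(r["face_descriptor"])               # supplementary descriptor
        vectors.append(vec)
    return vectors

records = [
    {"age": 60, "height": 160, "weight": 60, "gender": "female", "face_descriptor": [0.1, 0.2]},
    {"age": 80, "height": 170, "weight": 60, "gender": "male", "face_descriptor": [0.3, 0.4]},
]
vectors = preprocess_attributes(records, {"gender": ["female", "male"]})
```

In practice the standardization statistics would be computed on the training split only and reused for validation and test data; the sketch computes them on whatever records it is given.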

[0139] In a verification embodiment, to verify the effectiveness of the invention, experiments were conducted on multiple micro-expression datasets, including: public datasets and micro-expression datasets of the elderly.

[0140] Publicly available datasets include CASME II, SMIC, and MEGC 2019. Specifically: CASME II contains 355 micro-expression samples from 25 subjects, with a resolution of 300×200 pixels and a frame rate of 200fps. SMIC contains 164 micro-expression samples from 16 subjects, categorized into high, medium, and low intensities. MEGC 2019 is the MEGC competition dataset, containing micro-expression data from multiple sources.

[0141] The Elderly Micro-expression Dataset was specifically designed to evaluate the recognition performance on the elderly population. The dataset included: Participants: 50 elderly individuals aged 60-85 years, with a balanced male-to-female ratio; Sample size: Each participant recorded 7 basic micro-expressions, with 3-5 samples per category, totaling approximately 1500 micro-expression samples; Recording conditions: Standardized lighting conditions, multi-angle shooting (frontal, 45° side view); Attribute information: Recorded attributes such as age, gender, health status, and facial feature descriptions for each participant.

[0142] The experiment used the following setup: Hardware environment: NVIDIA Tesla V100 GPU, Intel Xeon E5-2690 CPU, 64GB RAM; Software environment: PyTorch 1.10.0, CUDA 11.3, Python 3.8; Evaluation metrics: Accuracy, Precision, Recall, and F1 Score, all using macro-average. Comparison methods: including traditional methods (LBP-TOP, HOG-TOP) and deep learning methods (3DCNN, CNN-LSTM, Transformer, etc.).
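Macro-averaging, as used for the evaluation metrics above, computes each metric per class and then takes the unweighted mean over classes. A minimal sketch follows; treating an undefined ratio (a zero denominator) as 0.0 is an assumption, and the function name `macro_metrics` is hypothetical.

```python
def macro_metrics(y_true, y_pred, num_classes):
    """Return (macro precision, macro recall, macro F1) over integer class labels."""
    precisions, recalls, f1s = [], [], []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0   # assumed zero-division rule
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = float(num_classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

precision, recall, f1 = macro_metrics([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

Because every class contributes equally, macro-averaged scores are sensitive to performance on rare categories, which matters for micro-expression datasets with unbalanced classes.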

[0143] (1) Performance on public datasets The comparison results with existing methods on the CASME II and SMIC datasets are shown in Table 3.

[0144] Table 3. Comparison results with existing methods on the CASME II and SMIC datasets.

[0145] As can be seen from Table 3, the performance of this method on public datasets is significantly better than existing methods, with significant improvements in accuracy and F1 score, thus verifying the effectiveness of the present invention.

[0146] (2) Performance on the elderly dataset. The experimental results on the elderly micro-expression dataset are shown in Table 4.

Table 4. Experimental results on the elderly micro-expression dataset.

[0147] As can be seen from Table 4, the performance of all methods on the elderly dataset is lower than that on the public dataset, indicating that micro-expression recognition in the elderly is indeed more challenging.

[0148] Our method achieves an accuracy of 77.3% on the elderly dataset, significantly outperforming other comparative methods. Compared to the version without the adaptive module, our method improves accuracy by 8.8%, validating the effectiveness of the elderly parameter adaptive module.

[0149] (3) Ablation test To verify the contribution of each module, an ablation experiment was conducted, and the results are shown in Table 5.

[0150] Table 5 Ablation Experiment Results

[0151] Ablation experiments show that the feature decoupling module has the greatest impact on performance, with accuracy decreasing by 12.1% after removal, verifying the importance of feature decoupling. The elderly adaptive module and the Mamba architecture also have a significant impact on performance, with accuracy decreasing by 8.8% and 8.4% respectively after removal. The backward SSM and attention fusion modules contribute relatively little to performance, but still have a positive effect. All modules work together to form a high-performance DSSM-EMER model.

[0152] (4) Visualization of feature decoupling effect By using t-SNE dimensionality reduction to visualize spatial, temporal, and individual features, the results show that the three features can be clearly separated in the low-dimensional space, verifying the effect of feature decoupling. Spatial features mainly distinguish differences in facial structure, temporal features mainly distinguish differences in movement patterns, and individual features mainly distinguish different subjects.

[0153] (5) Computational efficiency analysis When processing micro-expression sequences of different lengths, a comparison of the computation time of this invention and Transformer shows that as the sequence length increases, the computation time of Transformer increases quadratically, while the computation time of this invention increases linearly, verifying the efficiency advantage of the Mamba architecture in processing long sequences.

[0154] Experimental results show that the proposed method achieves excellent recognition performance on both public datasets and elderly datasets, significantly outperforming existing methods. The feature decoupling module, Mamba temporal modeling, and elderly-adaptive module are key factors in the performance improvement. This invention exhibits good generalization ability and robustness, adapting to different datasets and experimental conditions. Compared to traditional methods, this method has a significant advantage in computational efficiency, especially when processing long sequences. These results fully validate the effectiveness and practicality of this invention, providing an advanced technical solution for micro-expression recognition in the elderly. The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0155] The above embodiments are merely illustrative of several implementation methods of this application, and their descriptions are relatively specific and detailed. However, they should not be construed as limiting the scope of protection of this application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and all such modifications and improvements fall within the scope of protection of this application.

Claims

1. A neural network architecture and system for recognizing micro-expressions in the elderly based on a decoupled state-space model, characterized in that it includes: a data preprocessing unit, used to preprocess the acquired micro-expression video sequences and the elderly attribute features; a feature decoupling unit, used to input the preprocessing results into the feature decoupling module and extract spatial features, temporal features, and individual features respectively through three parallel encoders based on the variational autoencoder architecture; a state feature extraction unit, used to concatenate the extracted spatial feature vector, temporal feature vector, and individual feature vector and input them into the state space modeling module, which uses the Mamba architecture to perform temporal modeling from the past to the future and from the future to the past on the concatenated features, the obtained forward state features and backward state features being fused to obtain the fused state features; a parameter adaptation unit, used to input the fused state features and the elderly attribute features into the elderly parameter adaptation module and obtain adaptive features through feature mapping, adaptive transformation, and residual connection processing; and a micro-expression recognition unit, used to input the adaptive features into the micro-expression classification module, extract weighted temporal features through a temporal attention mechanism, extract static temporal features through a global pooling operation, fuse the weighted temporal features and the static temporal features, and then process them using a classification network to obtain the micro-expression classification result.

2. The neural network architecture and system for recognizing micro-expressions in the elderly based on a decoupled state-space model as described in claim 1, characterized in that the feature decoupling module includes three encoders based on a variational autoencoder architecture: a spatial feature encoder, a temporal feature encoder, and an individual feature encoder; and the feature decoupling unit is further used to: input the preprocessed micro-expression video sequence into the spatial feature encoder to obtain spatial feature vectors, wherein the spatial feature encoder includes two 3D convolutional pooling modules, two fully connected layers, two parallel fully connected layers, and a reparameterization sampling module, and each 3D convolutional pooling module includes two 3D convolutional layers and one 3D max pooling layer; input the preprocessed micro-expression video sequence into the temporal feature encoder to obtain temporal feature vectors, wherein the temporal feature encoder includes a gradient calculation module, two 3D convolutional pooling modules, two fully connected layers, two parallel fully connected layers, and a reparameterization sampling module, and each 3D convolutional pooling module includes two 3D convolutional layers and one 3D max pooling layer; and input the preprocessed elderly attribute features into the individual feature encoder to obtain individual feature vectors, wherein the individual feature encoder includes three fully connected layers, two parallel fully connected layers, and a reparameterization sampling module.

3. The neural network architecture and system for recognizing micro-expressions in the elderly based on a decoupled state-space model as described in claim 1, characterized in that the state space modeling module includes a forward SSM, a backward SSM, and a state fusion module; the state feature extraction unit is further used to: concatenate the extracted spatial feature vector, temporal feature vector, and individual feature vector to obtain concatenated features; input the concatenated features into the forward SSM and the backward SSM respectively to obtain forward state features and backward state features, wherein the forward SSM is used to process temporal information from the past to the future, and the backward SSM is used to process the reversed input sequence and capture the temporal dependencies from the future to the past; and input the forward state features and the backward state features into the state fusion module to obtain fused state features, wherein the state fusion module is used to concatenate the forward state features and the backward state features and then perform feature fusion using a multi-head self-attention mechanism.

4. The neural network architecture and system for recognizing micro-expressions in the elderly based on a decoupled state-space model according to claim 3, characterized in that the state fusion module includes a feature concatenation module, a multi-head self-attention mechanism, and a feedforward network; and inputting the forward state features and the backward state features into the state fusion module to obtain the fused state features includes: inputting the forward state features and the backward state features into the feature concatenation module to obtain the concatenated state features; inputting the concatenated state features into the multi-head self-attention mechanism to obtain attention features; adding the attention features and the concatenated state features together and applying layer normalization to obtain the normalized features; and inputting the normalized features into the feedforward network for a nonlinear transformation, adding the nonlinearly transformed features and the normalized features together, and applying layer normalization to the sum to obtain the fused state features.

5. The neural network architecture and system for recognizing micro-expressions in the elderly based on a decoupled state-space model according to claim 1, characterized in that the elderly parameter adaptation module includes an elderly feature mapping unit and an adaptive transformation unit; and the parameter adaptation unit is further configured to: input the elderly attribute features into the elderly feature mapping unit, map the elderly attribute features into a multidimensional vector through a multilayer perceptron, and reshape the multidimensional vector into a perturbation of the parameter modulation matrix; initialize the parameter modulation matrix to the sum of the identity matrix and the perturbation of the parameter modulation matrix; perform matrix multiplication between the fused state features and the parameter modulation matrix to obtain the modulated features; and perform a residual connection between the modulated features and the fused state features, followed by layer normalization, to obtain the adaptive features.

6. The neural network architecture and system for recognizing micro-expressions in the elderly based on a decoupled state-space model according to claim 1, characterized in that the micro-expression classification module includes a temporal attention mechanism, a global temporal pooling layer, and a classification network; the classification network includes three fully connected layers and a Softmax activation function; and the micro-expression recognition unit is further configured to: use the temporal attention mechanism to calculate the attention weights of features at different time steps and generate temporal attention features; process the adaptive features using the global temporal pooling layer to obtain static temporal features; concatenate the static temporal features and the temporal attention features to obtain temporal concatenated features; and input the temporal concatenated features into the classification network to obtain the micro-expression classification result.

7. The neural network architecture and system for recognizing micro-expressions in the elderly based on a decoupled state-space model according to claim 1, characterized in that an elderly micro-expression recognition model based on the decoupled state-space model consists of a feature decoupling module, a state-space modeling module, an elderly parameter adaptation module, and a micro-expression classification module, and the total loss function used during the training of this model is: $$L_{total} = L_{cls} + \lambda_1 L_{rec} + \lambda_2 L_{VAE} + \lambda_3 L_{dec} + \lambda_4 L_{reg}$$ where $L_{total}$ is the total loss; $L_{cls}$ is the classification loss, which uses the cross-entropy loss function; $L_{rec}$ is the reconstruction loss, which uses the mean squared error loss function; $L_{VAE}$ is the sum of the VAE losses of the three encoders based on the variational autoencoder architecture; $L_{dec}$ is the decoupling loss, used to enhance the independence between different feature dimensions; $L_{reg}$ is the regularization loss; and $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ are the weight coefficients for the reconstruction loss, the sum of the VAE losses, the decoupling loss, and the regularization loss, respectively.

8. The neural network architecture and system for recognizing micro-expressions in the elderly based on a decoupled state-space model according to claim 7, characterized in that the VAE loss of each encoder based on the variational autoencoder architecture is: $$L_{VAE}^{enc} = L_{recon}^{enc} + \beta L_{KL}^{enc} = -\mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] + \beta\, D_{KL}\left(q_\phi(z|x) \,\|\, p(z)\right)$$ where $L_{VAE}^{enc}$ is the VAE loss of an encoder based on the variational autoencoder architecture; $L_{recon}^{enc}$ is its reconstruction loss; $L_{KL}^{enc}$ is its KL divergence loss; $\beta$ is a hyperparameter that balances the reconstruction loss and the KL divergence loss; $p_\theta(x|z)$ is the conditional probability distribution output by the decoder; $q_\phi(z|x)$ is the approximate posterior distribution output by the encoder; $p(z)$ is the prior distribution of the latent variables; and $x$ and $z$ are the input data and the latent variables, respectively.

9. The neural network architecture and system for recognizing micro-expressions in the elderly based on a decoupled state-space model according to claim 1, characterized in that the elderly attribute features include: age, height, weight, gender, ethnicity, health status, and facial image; and the preprocessing of the elderly attribute features includes: performing Z-score standardization on age, height, and weight to obtain standardized numerical features; performing one-hot encoding or embedding on gender, ethnicity, and health status to obtain categorical encoding features; extracting facial feature descriptors from the facial image as supplementary attribute features; and concatenating the standardized numerical features, the categorical encoding features, and the supplementary attribute features into a multidimensional vector, which serves as the preprocessed elderly attribute features.