A cross-library facial expression recognition method and system

CN115471897B · Active · Publication Date: 2026-06-26 · NANJING UNIV OF POSTS & TELECOMM

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
NANJING UNIV OF POSTS & TELECOMM
Filing Date
2022-09-27
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing emotion recognition technologies perform poorly in cross-database applications because the data domain distributions differ widely, which makes it difficult to train effective models. They also ignore facial muscle movement information in videos.

Method used

We employ a method based on spatiotemporal feature point motion attention and subdomain adaptation. We extract spatiotemporal feature points using the Harris corner detection algorithm and combine it with Local Maximum Mean Difference (LMMD) to construct a deep subdomain adaptive network. This reduces the feature distribution difference between the source and target domains, captures fine-grained information, and improves recognition accuracy.

Benefits of technology

It effectively combines facial and motion information to accurately reflect muscle movement areas, reduce inter-domain differences, improve emotion recognition performance, and obtain better cross-database facial expression recognition results.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a cross-database facial expression recognition method and system based on spatiotemporal feature point motion attention and subdomain adaptation. The method comprises the following steps: extracting the main part of each video in a database, trimming the extracts to the same length, and dividing them into frames; detecting the spatiotemporal feature points of the face and its motion in each video, calculating the weight value of the corresponding feature points in each frame, and forming a facial spatiotemporal feature point weight map; constructing a deep subdomain adaptive network based on spatiotemporal feature point motion attention; and inputting the extracted expression features into an SVM classifier and a softmax layer to output a classification result. The application makes full use of spatiotemporal feature points, effectively combines face information and motion information, and accurately reflects the facial areas where muscle movement occurs. It uses Local Maximum Mean Difference (LMMD) to reduce the feature distribution difference between the source domain and the target domain, captures more fine-grained information through subdomain adaptation, and obtains better recognition results than deep domain adaptation methods.

Description

Technical Field

[0001] This invention belongs to the field of artificial intelligence technology, and in particular relates to a cross-database facial expression recognition method and a system for implementing the method based on spatiotemporal feature point motion attention and subdomain adaptation.

Background Technology

[0002] In recent years, with the rapid development of artificial intelligence and affective computing in academia and industry, emotion recognition has shown increasingly broad application prospects in multimedia entertainment, human-computer interaction, and machine intelligence. In intelligent human-computer interaction especially, emotion recognition plays a crucial role in perception and recognition between humans and computers. However, for many practical problems, a task-specific emotion recognition model cannot be trained directly. First, training an emotion recognition model requires large datasets and powerful computing resources. Second, a specific recognition task often has only very limited training samples, and the training and test samples are collected in different environments, so the domain distributions of the data differ significantly and an effective emotion recognition model cannot be trained. Transferring a pre-trained emotion recognition model to a specific task is therefore very valuable. Domain adaptation methods handle domain distribution differences and help improve the model's generalization ability. Using them to eliminate the domain distribution differences caused by irrelevant factors in facial expression samples, and thereby improve the reliability of emotion recognition, has significant research value.

[0003] Previous studies on emotion recognition based on transfer learning have mostly used facial image information, neglecting facial muscle movement information in a video. To combine facial and motion information, optical flow algorithms can be used to effectively depict motion information, cascaded CNN-LSTM networks can be used to incorporate temporal information from image sequences, or spatiotemporal feature point detection algorithms can be used to directly calculate the correlation between spatial and temporal information. Attention mechanisms in deep learning significantly improve the recognition performance of deep learning networks by weighting different features with learned attention weight maps. However, these learned weights are uncertain. Attention weight maps generated by methods such as optical flow and spatiotemporal feature points can accurately reflect the motion regions of facial expressions. Therefore, more and more researchers are introducing the concept of attention mechanisms from different perspectives, rather than simply embedding them into the network structure.

[0004] In view of this, it is necessary to design a cross-database facial expression recognition method based on spatiotemporal feature point motion attention and subdomain adaptation, and a system for implementing the method, in order to solve the above problems.

Summary of the Invention

[0005] The main objective of this invention is to fully utilize spatiotemporal feature points, combining facial and motion information, to accurately reflect the facial regions where muscle movements occur; to use Local Maximum Mean Difference (LMMD) to reduce the feature distribution differences between the source and target domains; and to capture more fine-grained information through subdomain adaptation, thereby achieving better emotion recognition results.

[0006] To achieve the above objectives, this invention provides a cross-database facial expression recognition method based on spatiotemporal feature point motion attention and subdomain adaptation, comprising the following steps:

[0007] Step (1): Extract the main part of each video in the database, trim the extracts to the same length, and divide them into frames;

[0008] Step (2): Detect the spatiotemporal feature points of the face and its motion in each video, calculate the weight value of the corresponding feature point in each frame, and form a facial spatiotemporal feature point weight map;

[0009] Step (3): Construct a deep subdomain adaptive network based on spatiotemporal feature point motion attention;

[0010] Step (4): Extract features from facial expressions through the above steps, input the obtained features into the SVM classifier and softmax layer, and output the classification result.

[0011] A further improvement of the present invention is that, in step (1), the videos in the Enterface and Ravdess databases are divided into six categories, namely anger, disgust, fear, happiness, sadness and surprise. The main part of each video is extracted and trimmed to the same length, then divided into frames so that every video has 20 frames.

[0012] A further improvement of the present invention is that step (2) uses the Harris corner detection algorithm to detect the spatiotemporal feature points of each video face and action, calculates the weight value of the corresponding feature point in each frame, and forms a facial spatiotemporal feature point weight map.

[0013] A further improvement of the present invention is that step (2) includes the following steps:

[0014] Step S201: By sliding a local window across the image, points with drastic changes in the first-order gradient of the image are selected, and spatiotemporal feature points of facial motion in two adjacent frames are detected. The Harris corner detector calculates the horizontal (x-direction) and vertical (y-direction) gradients through the following two windows, respectively:

[0015]

[0016] Using the two windows mentioned above, the gradients of an image I in the x and y directions can be calculated, and the gradient matrix M of I is obtained as follows:

[0017]

[0018] This leads to the corner response:

[0019] $R = \det(M) - k\,(\operatorname{trace}(M))^2$,

[0020] Where det(M) is the determinant of M, trace(M) is the trace of M, and k is a constant;

[0021] To calculate the gradient matrix of the image sequence, the above method is extended to three dimensions:

[0022]

[0023] where $I_t$ is the gradient of image I along the time axis. This gives the corner response of the image sequence:

[0024] $R_3 = \det(M_3) - k\,(\operatorname{trace}(M_3))^3$;

[0025] Step S202: Use local non-maximum suppression to select the maximum value of each local region as a candidate corner point, and compare it with a threshold to obtain the final corner point positions. These positions are where the image gradient changes most sharply and carry the richest gradient information. Then construct a cube centered on each key point. Its size is determined by two parameters: sigma, the amount of outward expansion and padding in the spatial dimensions, and tau, the amount of expansion and padding along the time axis. For 3 frames of 224*224 grayscale images, a (3+2tau)*(224+2sigma)*(224+2sigma) cube is obtained. The cube is flattened directly into a one-dimensional vector. The k-nearest neighbor algorithm is then used to cluster the large number of cubes. Finally, the histogram of each type of descriptor of the clustered cubes is calculated as the feature description.

[0026] A further improvement of the present invention is that step (2) further includes: after obtaining the spatiotemporal feature point weight map of the face, assigning, for each frame image, the spatiotemporal feature point weight map calculated from the previous frame, the grayscale image of the face in that frame, and the spatiotemporal feature point weight map calculated from the next frame to the R, G, and B dimensions of the RGB image respectively, to generate a spatiotemporal motion attention expression feature representation map.

[0027] A further improvement of the present invention is that step (3) includes:

[0028] (1) ResNet50 is used as the base network, and Local Maximum Mean Difference (LMMD) is introduced in a specific layer of the ResNet50 network. ResNet50 consists of five stacked residual modules, each containing a certain number of convolutional layers, pooling layers, BN layers and ReLU layers. CONV1 serves as a preprocessing layer for the input, and the last four residual modules are composed of Bottleneck blocks.

[0029] (2) After the above five residual modules, a 1000d result is obtained by mean pooling. After passing through the fully connected layer and the softmax activation function, the output is obtained. After the ResNet50 basic network is constructed, LMMD is introduced in different layers to align the relevant subdomains.

[0030] (3) Align relevant subdomains using Local Maximum Mean Difference (LMMD);

[0031] (4) After introducing LMMD into a specific layer of the base network, the loss function of the entire network can be defined as follows:

[0032]

[0033] A further improvement of the present invention is that the parameters of CONV1 are as follows:

[0034] CONV1:f=7×7,c=64,s=2,p=3

[0035] maxpool: f = 3 × 3, s = 2

[0036] Where f represents the size of the convolutional kernel or pooling layer, c represents the number of convolutional kernels, s represents the stride of the convolutional kernel or pooling layer, and p is the padding value. The formula for calculating the output size of the convolutional layer is:

[0037]

[0038] Where output_size represents the output size of the convolutional layer, and input_size represents the input size of the convolutional layer;

[0039] The parameters of the residual modules CONV2_x, CONV3_x, CONV4_x, and CONV5_x are as follows:

[0040]

[0041]

[0042]

[0043]

[0044] A further improvement of the present invention is that the formula for LMMD is as follows:

[0045]

[0046] where $x^s$ and $x^t$ are instances in the source domain $D_s$ and the target domain $D_t$, and $p^{(c)}$ and $q^{(c)}$ are the distributions of $D_s$ and $D_t$ for class $c$, respectively; $\mathcal{H}$ is the reproducing kernel Hilbert space (RKHS) with feature kernel $k$, $\phi(\cdot)$ denotes a feature mapping that maps the original samples to the RKHS, and the feature kernel is $k(x^s, x^t) = \langle \phi(x^s), \phi(x^t) \rangle$, where $\langle \cdot, \cdot \rangle$ denotes the inner product of vectors. Minimizing $d_{\mathcal{H}}(p, q)$ in a deep network reduces the distribution difference between related subdomains within the same category. Assuming each sample belongs to each category according to a weight $\omega^c$, the above formula is transformed into:

[0047]

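[0048] where $\omega_i^{sc}$ and $\omega_j^{tc}$ denote the weights with which $x_i^s$ and $x_j^t$, respectively, belong to class $c$, and $\sum_{x_i \in D} \omega_i^c \phi(x_i)$ is a weighted sum over class $c$. The weight $\omega_i^c$ of instance $x_i$ is calculated as $\omega_i^c = y_{ic} / \sum_{(x_j, y_j) \in D} y_{jc}$, where $y_{ic}$ is the label of $x_i$ for class $c$. LMMD is introduced in the $l$-th layer of the deep network, and its expansion is: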

[0049]

[0050] where $z^l$ is the activation generated by the $l$-th layer of the deep network. This expression is defined as the adaptive loss of the $l$-th layer, and LMMD can be implemented in most feedforward network models.

[0051] A further improvement of the present invention is that, in step (4), the Enterface and Ravdess datasets are processed in step (2) to generate spatiotemporal motion attention expression features, which serve as the source domain data and the target domain data, respectively, and are input into the deep subdomain adaptive network of step (3) to obtain the classification result.

[0052] To achieve the above-mentioned objectives, the present invention also provides a cross-database facial expression recognition system, which can implement the methods described in any of the foregoing descriptions.

[0053] The beneficial effects of this invention are as follows: This invention performs cross-database facial expression recognition based on spatiotemporal feature point motion attention and subdomain adaptation. It makes full use of spatiotemporal feature points, effectively combining facial and motion information to accurately reflect the facial regions where muscle movements occur. Furthermore, it uses Local Maximum Mean Difference (LMMD) to reduce the feature distribution difference between the source and target domains, and captures more fine-grained information through subdomain adaptation, achieving better emotion recognition results than deep domain adaptation methods.

Description of the Attached Figures

[0054] Figure 1 is a network structure block diagram of the present invention;

[0055] Figure 2 is a block diagram of the spatiotemporal feature point detection module of the present invention;

[0056] Figure 3 is a block diagram comparing the global domain adaptation and subdomain adaptation concepts of this invention;

[0057] Figure 4 is a block diagram of the subdomain adaptive LMMD structure of the present invention.

Detailed Implementation

[0058] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be described in detail below with reference to the accompanying drawings and specific embodiments.

[0059] It should be emphasized that, in describing this invention, various formulas and constraints are distinguished by consistent reference numerals, but it is not excluded that different reference numerals may be used to identify the same formulas and / or constraints. The purpose of this arrangement is to more clearly illustrate the features of this invention.

[0060] As shown in Figure 1, the cross-database facial expression recognition method based on spatiotemporal feature point motion attention and subdomain adaptation of the present invention mainly includes the following steps:

[0061] Step (1): Extract the main part of each video in the database, trim the extracts to the same length, and divide them into frames;

[0062] Step (2): Detect the spatiotemporal feature points of the face and its motion in each video, calculate the weight value of the corresponding feature point in each frame, and form a facial spatiotemporal feature point weight map;

[0063] Step (3): Construct a deep subdomain adaptive network based on spatiotemporal feature point motion attention;

[0064] Step (4): Extract features from facial expressions through the above steps, input the obtained features into the SVM classifier and softmax layer, and output the classification result.

[0065] Each step will be described in detail below with reference to the embodiments.

[0066] Step (1) Divide the videos in the Enterface and Ravdess databases into six categories: anger, disgust, fear, happiness, sadness, and surprise. Extract the main part of the video to the same length and divide it into frames so that all videos have the same number of frames (20 frames).
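The frame-count normalization in step (1) can be sketched as follows. This is a minimal illustration, assuming the main part of a clip is already known as a frame range and that the 20 frames are taken at evenly spaced positions; the function name, the even spacing, and the example segment boundaries are hypothetical and not specified by the source.

```python
import numpy as np

def sample_frame_indices(start, end, num_frames=20):
    """Pick `num_frames` evenly spaced frame indices from the half-open range [start, end)."""
    if end - start < 1:
        raise ValueError("segment must contain at least one frame")
    # Evenly spaced positions over the segment, rounded to valid frame indices.
    idx = np.linspace(start, end - 1, num_frames)
    return np.round(idx).astype(int)

# Hypothetical main segment: frames 10..89 of a longer clip.
indices = sample_frame_indices(start=10, end=90)
print(len(indices), indices[0], indices[-1])  # 20 10 89
```

If the main segment has fewer than 20 frames, this sketch repeats some indices so that every video still yields 20 frames.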

[0067] Step (2) uses the Harris corner detection algorithm to detect the spatiotemporal feature points of each video face and action, calculates the weight value of the corresponding feature point in each frame, and forms a facial spatiotemporal feature point weight map. The specific steps are as follows:

[0068] S201: By sliding a local window across the image, points with drastic changes in the first-order gradient of the image are selected, and spatiotemporal feature points of facial motion in two adjacent frames are detected. The Harris corner detector calculates the horizontal (x-direction) and vertical (y-direction) gradients through the following two windows:

[0069]

[0070] Using the two windows mentioned above, the gradients of an image I in the x and y directions can be calculated, and the gradient matrix M of I is obtained as follows:

[0071]
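For orientation, the textbook Harris formulation of these quantities is given here as an assumption about what the gradient formulas look like. The names $w_x$ and $w_y$ for the two gradient windows, $I_x$ and $I_y$ for the resulting gradients, and $W$ for the local summation window are illustrative notation, not taken from the source.

```latex
I_x = I * w_x, \qquad I_y = I * w_y, \qquad
M = \sum_{(u,v) \in W}
\begin{bmatrix}
I_x^2   & I_x I_y \\
I_x I_y & I_y^2
\end{bmatrix}
```

The sum over a local window matters: a single-pixel outer product of gradients has rank one, so its determinant would always be zero.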

[0072] This leads to the corner response:

[0073] $R = \det(M) - k\,(\operatorname{trace}(M))^2$,

[0074] Where det(M) is the determinant of M, trace(M) is the trace of M, and k is a constant.

[0075] To calculate the gradient matrix of the image sequence, the above method is extended to three dimensions:

[0076]

[0077] where $I_t$ is the gradient of image I along the time axis. This gives the corner response of the image sequence:

[0078] $R_3 = \det(M_3) - k\,(\operatorname{trace}(M_3))^3$.
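The three-dimensional corner response can be sketched in NumPy as below. This is a rough illustration rather than the patented implementation: it assumes central-difference gradients instead of the patent's gradient windows, a box-shaped summation window of radius `r`, and k = 0.005, none of which are given in the source.

```python
import numpy as np

def box_sum(a, r=1):
    """Sum `a` over a (2r+1)^3 neighborhood, using edge padding at the borders."""
    p = np.pad(a, r, mode="edge")
    win = np.lib.stride_tricks.sliding_window_view(p, (2 * r + 1,) * 3)
    return win.sum(axis=(-3, -2, -1))

def harris3d_response(seq, k=0.005, r=1):
    """Corner response R3 = det(M3) - k * trace(M3)**3 for a (T, H, W) sequence."""
    seq = seq.astype(np.float64)
    It, Iy, Ix = np.gradient(seq)  # central differences along t, y, x
    grads = (Ix, Iy, It)
    # M3 holds the windowed sums of all pairwise gradient products per voxel.
    M3 = np.empty(seq.shape + (3, 3))
    for a in range(3):
        for b in range(3):
            M3[..., a, b] = box_sum(grads[a] * grads[b], r)
    det = np.linalg.det(M3)
    trace = np.trace(M3, axis1=-2, axis2=-1)
    return det - k * trace ** 3

# Synthetic sequence with a step edge in both space and time.
seq = np.zeros((5, 16, 16))
seq[2:, 8:, 8:] = 1.0
R3 = harris3d_response(seq)
print(R3.shape)  # (5, 16, 16)
```

A sequence with no intensity change gives a response of zero everywhere, which is a quick sanity check for the sketch.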

[0079] S202: Local non-maximum suppression is used to select the maximum value of each local region as a candidate corner point, which is compared with a threshold to obtain the final corner point positions. These positions are where the image gradient changes most sharply and carry the richest gradient information. Then a cube is constructed centered on each key point. Its size is determined by two parameters: sigma, the amount of outward expansion and padding in the spatial dimensions, and tau, the amount of expansion and padding along the time axis. For 3 frames of 224*224 grayscale images, a (3+2tau)*(224+2sigma)*(224+2sigma) cube is obtained. The cube is flattened directly into a one-dimensional vector. The k-nearest neighbor algorithm is then used to cluster the large number of cubes. Finally, the histogram of each type of descriptor of the clustered cubes is calculated as the feature description.
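The suppression and padding steps of S202 can be illustrated with a short NumPy sketch. It assumes a cubic suppression neighborhood of radius `r` and edge-replicating padding; both choices, and the function names, are assumptions made for illustration only.

```python
import numpy as np

def local_maxima(R, threshold, r=1):
    """Return (t, y, x) indices where R is the maximum of its (2r+1)^3
    neighborhood and also exceeds `threshold`."""
    R = np.asarray(R, dtype=np.float64)
    p = np.pad(R, r, mode="constant", constant_values=-np.inf)
    win = np.lib.stride_tricks.sliding_window_view(p, (2 * r + 1,) * 3)
    neighborhood_max = win.max(axis=(-3, -2, -1))
    keep = (R == neighborhood_max) & (R > threshold)
    return np.argwhere(keep)

def pad_volume(seq, sigma, tau):
    """Expand a (T, H, W) volume by tau frames in time and sigma pixels in space."""
    return np.pad(seq, ((tau, tau), (sigma, sigma), (sigma, sigma)), mode="edge")

# Three 224x224 frames padded with sigma=2, tau=1 give (3+2*1, 224+2*2, 224+2*2).
cube = pad_volume(np.zeros((3, 224, 224)), sigma=2, tau=1)
print(cube.shape)        # (5, 228, 228)
print(cube.ravel().shape)  # flattened one-dimensional representation
```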

[0080] As shown in Figure 2, step (2) further includes: after obtaining the spatiotemporal feature point weight map of the face, assigning the spatiotemporal feature point weight map calculated from the previous frame, the grayscale image of the face in each frame, and the spatiotemporal feature point weight map calculated from the next frame to the R, G, and B dimensions of the RGB image respectively, to generate a spatiotemporal motion attention expression feature representation map.
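The channel assignment described in this paragraph amounts to stacking three single-channel maps into one RGB image. The sketch below assumes each map is min-max scaled to 8 bits before stacking; that scaling and the function name are illustrative assumptions, not details from the source.

```python
import numpy as np

def motion_attention_image(prev_weight, gray, next_weight):
    """Stack the previous-frame weight map (R), the grayscale face (G), and
    the next-frame weight map (B) into one H x W x 3 uint8 image."""
    def to_uint8(a):
        a = np.asarray(a, dtype=np.float64)
        span = a.max() - a.min()
        if span == 0:
            return np.zeros(a.shape, dtype=np.uint8)  # constant map: all zeros
        return np.round(255 * (a - a.min()) / span).astype(np.uint8)
    channels = [to_uint8(prev_weight), to_uint8(gray), to_uint8(next_weight)]
    return np.stack(channels, axis=-1)

# Toy 4x4 inputs standing in for 224x224 maps.
gray = np.arange(16, dtype=float).reshape(4, 4)
img = motion_attention_image(np.zeros((4, 4)), gray, gray[::-1])
print(img.shape, img.dtype)  # (4, 4, 3) uint8
```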

[0081] Step (3) Construct a deep subdomain adaptive network as the base network of this invention, as shown in Figure 1. The functions of each part are as follows:

[0082] S301: This invention uses ResNet50 as the base network and introduces Local Maximum Mean Difference (LMMD) in a specific layer of the ResNet50 network. ResNet50 consists of five stacked residual modules, each containing a fixed number of convolutional layers, pooling layers, BN layers, and ReLU layers. Among them, CONV1 has a relatively simple structure and can be regarded as a preprocessing of the input. The latter four residual modules are all composed of Bottleneck blocks and have similar structures.

[0083] The parameters of CONV1 are as follows:

[0084] CONV1:f=7×7,c=64,s=2,p=3

[0085] maxpool: f = 3 × 3, s = 2

[0086] Where f represents the size of the convolutional kernel or pooling layer, c represents the number of convolutional kernels, s represents the stride of the convolutional kernel or pooling layer, and p is the padding value. The formula for calculating the output size of the convolutional layer is:

[0087]

[0088] Where output_size represents the output size of the convolutional layer, and input_size represents the input size of the convolutional layer.
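The usual convolution arithmetic relating these quantities is output_size = floor((input_size - f + 2p) / s) + 1, which is assumed here to be the rule the paragraph above refers to. The sketch applies it to CONV1 with a 224×224 input; the max-pooling padding is not stated in the source, so both p = 0 and p = 1 are shown as assumptions.

```python
def conv_output_size(input_size, f, s, p=0):
    """output_size = floor((input_size - f + 2p) / s) + 1"""
    return (input_size - f + 2 * p) // s + 1

conv1 = conv_output_size(224, f=7, s=2, p=3)      # CONV1: f=7, s=2, p=3
pool_p0 = conv_output_size(conv1, f=3, s=2, p=0)  # maxpool, assuming no padding
pool_p1 = conv_output_size(conv1, f=3, s=2, p=1)  # maxpool, assuming padding of 1
print(conv1, pool_p0, pool_p1)  # 112 55 56
```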

[0089] The parameters of the residual modules CONV2_x, CONV3_x, CONV4_x, and CONV5_x are as follows:

[0090]

[0091]

[0092]

[0093]

[0094] S302: After passing through the above five residual modules, mean pooling is used to obtain a 1000d result, which is then passed through a fully connected layer and a softmax activation function to obtain the output. After constructing the ResNet50 basic network, only LMMD needs to be introduced in different layers to align the relevant subdomains.

[0095] S303: This invention uses Local Maximum Mean Difference (LMMD) to align relevant subdomains. As a nonparametric distance estimate between two distributions, Maximum Mean Difference (MMD) has been widely used to measure the distributional difference between a source and a target domain, primarily focusing on the alignment of global distributions, but neglecting the relationship between two subdomains within the same category. Considering the relationship between relevant subdomains, aligning the distributions of subdomains within the same category between the source and target domains using LMMD is also crucial. A comparison of these two ideas is shown in Figure 3. The formula for LMMD is as follows:

[0096]
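One way to write this definition, following the standard formulation in the subdomain adaptation literature and assumed here to correspond to the formula intended at this point, is:

```latex
d_{\mathcal{H}}(p, q) \triangleq \mathbb{E}_{c}
\left\| \mathbb{E}_{p^{(c)}}\!\left[ \phi(x^{s}) \right]
      - \mathbb{E}_{q^{(c)}}\!\left[ \phi(x^{t}) \right] \right\|_{\mathcal{H}}^{2}
```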

[0097] where $x^s$ and $x^t$ are instances in the source domain $D_s$ and the target domain $D_t$, and $p^{(c)}$ and $q^{(c)}$ are the distributions of $D_s$ and $D_t$ for class $c$, respectively. $\mathcal{H}$ is the reproducing kernel Hilbert space (RKHS) with feature kernel $k$, $\phi(\cdot)$ denotes a feature mapping that maps the original samples to the RKHS, and the feature kernel is $k(x^s, x^t) = \langle \phi(x^s), \phi(x^t) \rangle$, where $\langle \cdot, \cdot \rangle$ denotes the inner product of vectors. Minimizing $d_{\mathcal{H}}(p, q)$ in a deep network reduces the distribution difference between related subdomains within the same category. Assuming each sample belongs to each category according to a weight $\omega^c$, the above formula is transformed into:

[0098]
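Under the same assumption (standard subdomain adaptation notation, with $C$ as an illustrative symbol for the number of classes), the weighted estimate can be written as:

```latex
\hat{d}_{\mathcal{H}}(p, q) = \frac{1}{C} \sum_{c=1}^{C}
\left\| \sum_{x_i^{s} \in D_s} \omega_i^{sc}\, \phi(x_i^{s})
      - \sum_{x_j^{t} \in D_t} \omega_j^{tc}\, \phi(x_j^{t}) \right\|_{\mathcal{H}}^{2}
```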

[0099] where $\omega_i^{sc}$ and $\omega_j^{tc}$ denote the weights with which $x_i^s$ and $x_j^t$, respectively, belong to class $c$, and $\sum_{x_i \in D} \omega_i^c \phi(x_i)$ is a weighted sum over class $c$. The weight $\omega_i^c$ of instance $x_i$ is calculated as $\omega_i^c = y_{ic} / \sum_{(x_j, y_j) \in D} y_{jc}$, where $y_{ic}$ is the label of $x_i$ for class $c$. The source domain data uses the true labels, while the target domain data uses the probability distribution predicted by the network. As shown in Figure 4, LMMD is introduced in the $l$-th layer of the deep network, and its expansion is:

[0100]
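Expanding the squared norm with the kernel trick gives the following layer-wise form, stated here as an assumption consistent with the definitions in this section. The symbols $n_s$ and $n_t$ for the numbers of source and target samples and $C$ for the number of classes are illustrative.

```latex
\hat{d}_{l}(p, q) = \frac{1}{C} \sum_{c=1}^{C} \Bigg[
  \sum_{i=1}^{n_s} \sum_{j=1}^{n_s} \omega_i^{sc} \omega_j^{sc}\, k\!\left(z_i^{sl}, z_j^{sl}\right)
+ \sum_{i=1}^{n_t} \sum_{j=1}^{n_t} \omega_i^{tc} \omega_j^{tc}\, k\!\left(z_i^{tl}, z_j^{tl}\right)
- 2 \sum_{i=1}^{n_s} \sum_{j=1}^{n_t} \omega_i^{sc} \omega_j^{tc}\, k\!\left(z_i^{sl}, z_j^{tl}\right)
\Bigg]
```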

[0101] where $z^l$ is the activation generated by the $l$-th layer of the deep network. This expression is defined as the adaptive loss of the $l$-th layer, and LMMD can be implemented in most feedforward network models.
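A compact NumPy sketch of the layer-wise LMMD estimate follows. It assumes a single Gaussian kernel with a fixed bandwidth and class weights normalized per class (each column of the label matrix divided by its sum); the kernel choice, the bandwidth, and all function names are assumptions made for illustration. Source labels are one-hot, and target labels are predicted probabilities, as described above.

```python
import numpy as np

def class_weights(y):
    """w[i, c] = y[i, c] / sum_j y[j, c]; columns with zero mass stay zero."""
    y = np.asarray(y, dtype=np.float64)
    col = y.sum(axis=0, keepdims=True)
    return np.divide(y, col, out=np.zeros_like(y), where=col > 0)

def gaussian_kernel(a, b, bandwidth=1.0):
    """Pairwise Gaussian kernel between rows of a and rows of b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def lmmd(zs, ys, zt, yt_prob, bandwidth=1.0):
    """LMMD estimate between source activations zs and target activations zt."""
    ws, wt = class_weights(ys), class_weights(yt_prob)
    kss = gaussian_kernel(zs, zs, bandwidth)
    ktt = gaussian_kernel(zt, zt, bandwidth)
    kst = gaussian_kernel(zs, zt, bandwidth)
    num_classes = ws.shape[1]
    total = 0.0
    for c in range(num_classes):
        a, b = ws[:, c], wt[:, c]
        total += a @ kss @ a + b @ ktt @ b - 2.0 * (a @ kst @ b)
    return total / num_classes

rng = np.random.default_rng(0)
zs = rng.normal(size=(8, 4))                 # toy source activations
ys = np.eye(2)[[0, 1] * 4]                   # one-hot source labels
print(lmmd(zs, ys, zs + 3.0, ys) > lmmd(zs, ys, zs, ys))  # True
```

When the two domains have identical activations and labels, the three kernel terms cancel and the estimate is zero, which is the behavior the adaptive loss is meant to drive toward.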

[0102] S304: After introducing LMMD into a specific layer of the base network, the loss function of the entire network can be defined as follows:

[0103]

[0104] The first term above uses cross-entropy as the classification loss, and the second term uses LMMD as the adaptive loss. The optimization objective is therefore simpler than that of most unsupervised domain adaptation (UDA) methods.
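A loss with this two-term structure is commonly written as below. This form is an assumption based on the description of the two terms; the symbols $f$ (the network), $J$ (cross-entropy), $n_s$ (number of labeled source samples), $\lambda$ (trade-off weight), and $L$ (the set of layers where LMMD is applied) are illustrative notation not given in the source.

```latex
\min_{f} \; \frac{1}{n_s} \sum_{i=1}^{n_s} J\!\left( f(x_i^{s}),\, y_i^{s} \right)
\; + \; \lambda \sum_{l \in L} \hat{d}_{l}(p, q)
```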

[0105] In step (4), the Enterface and Ravdess datasets are processed in step (2) to generate spatiotemporal motion attention expression features, which serve as the source domain data and the target domain data, respectively. These features are then input into the deep subdomain adaptive network of step (3) to obtain the classification results.

[0106] Based on the above inventive concept, the present invention also discloses a cross-database facial expression recognition system based on spatiotemporal feature point motion attention and subdomain adaptation, including at least one computing device. The computing device includes a memory, a processor, and a computer program stored in the memory and capable of running on the processor. When the computer program is loaded onto the processor, it can implement the above-mentioned cross-database facial expression recognition method based on spatiotemporal feature point motion attention and subdomain adaptation.

[0107] This invention utilizes a spatiotemporal feature point motion attention and subdomain adaptation method for cross-database facial expression recognition. It makes full use of spatiotemporal feature points, effectively combining facial and motion information to accurately reflect facial regions where muscle movements occur. Furthermore, it employs Local Maximum Mean Difference (LMMD) to reduce the feature distribution difference between the source and target domains, capturing more fine-grained information through subdomain adaptation and achieving better emotion recognition results than deep domain adaptation methods.

[0108] The above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims

1. A cross-database facial expression recognition method based on spatiotemporal feature point motion attention and subdomain adaptation, characterized in that it includes the following steps:
Step (1): Extract the main part of each video in the database, trim the extracts to the same length, and divide them into frames;
Step (2): Detect the spatiotemporal feature points of the face and its motion in each video, calculate the weight value of the corresponding feature points in each frame, and form a facial spatiotemporal feature point weight map; this step specifically includes:
Step S201: By sliding a local window across the image, points with drastic changes in the first-order gradient of the image are selected, and spatiotemporal feature points of facial motion in two adjacent frames are detected. The Harris corner detector calculates the horizontal (x-direction) and vertical (y-direction) gradients through two windows, respectively: [formula not reproduced]. Using the two windows, the gradients of an image $I$ in the x and y directions are calculated, giving the gradient matrix $M$: [formula not reproduced]. This leads to the corner response $R = \det(M) - k\,(\operatorname{trace}(M))^2$, where $\det(M)$ is the determinant of $M$, $\operatorname{trace}(M)$ is the trace of $M$, and $k$ is a constant. To calculate the gradient matrix of the image sequence, the above method is extended to three dimensions, giving $M_3$: [formula not reproduced], where $I_t$ is the gradient of image $I$ along the time axis. This gives the corner response of the image sequence: $R_3 = \det(M_3) - k\,(\operatorname{trace}(M_3))^3$;
Step S202: Use local non-maximum suppression to select the maximum value of each local region as a candidate corner point, and compare it with a threshold to obtain the final corner point positions. Construct a cube centered on each key point, use the k-nearest neighbor algorithm to cluster the large number of cubes, and finally calculate the histogram of each type of descriptor of the clustered cubes as the feature description;
Step S203: After obtaining the spatiotemporal feature point weight map of the face, assign the spatiotemporal feature point weight map calculated from the previous frame, the grayscale image of the face in each frame, and the spatiotemporal feature point weight map calculated from the next frame to the R, G, and B dimensions of the RGB image, respectively, and generate a spatiotemporal motion attention expression feature representation map;
Step (3): Construct a deep subdomain adaptive network based on spatiotemporal feature point motion attention. The deep subdomain adaptive network uses ResNet50 as the base network and introduces Local Maximum Mean Difference (LMMD) in a specific layer of ResNet50 to align related subdomains. The expansion of the LMMD in the $l$-th layer of the deep network is as follows: [formula not reproduced], where $z^l$ is the activation generated by the $l$-th layer of the deep network, $\omega^{sc}$ and $\omega^{tc}$ represent the weights of samples belonging to class $c$ in the source and target domains, respectively, and $k$ is the kernel function. The loss function of the entire network is defined as: [formula not reproduced];
Step (4): Input the spatiotemporal motion attention expression feature representation map generated in step (2) into the deep subdomain adaptive network constructed in step (3) for feature extraction. Input the obtained features into the SVM classifier and softmax layer, and output the classification result.

2. The method according to claim 1, characterized in that: Step (1) Divide the videos in the Enterface and Ravdess databases into six categories: anger, disgust, fear, happiness, sadness, and surprise. Extract the main part of the video to the same length and divide it into frames so that all videos have the same number of frames, which is 20.

3. The method according to claim 1, characterized in that step (3) includes: (1) ResNet50 is used as the base network, and Local Maximum Mean Difference (LMMD) is introduced in a specific layer of the ResNet50 network. ResNet50 consists of five stacked residual modules, each containing a fixed number of convolutional layers, pooling layers, batch normalization (BN) layers, and ReLU layers. CONV1 serves as a preprocessing layer for the input, and the last four residual modules consist of Bottleneck blocks; (2) After the above five residual modules, a 1000d result is obtained by mean pooling, and the output is obtained after the fully connected layer and the softmax activation function. After the ResNet50 base network is constructed, LMMD is introduced in different layers to align the relevant subdomains; (3) The relevant subdomains are aligned using Local Maximum Mean Difference (LMMD); (4) After introducing LMMD into a specific layer of the base network, the loss function of the entire network is defined as follows: [formula not reproduced].

4. The method according to claim 3, characterized in that the parameters of CONV1 are as follows: CONV1: f = 7 × 7, c = 64, s = 2, p = 3; maxpool: f = 3 × 3, s = 2; where f indicates the size of the convolution kernel or pooling layer, c indicates the number of convolution kernels, s indicates the stride of the convolution kernel or pooling layer, and p is the padding value. The formula for calculating the output size of the convolutional layer is as follows: [formula not reproduced], where output_size indicates the output size of the convolutional layer and input_size indicates the input size of the convolutional layer. The parameters of the residual modules CONV2_x, CONV3_x, CONV4_x, and CONV5_x are as follows: [table not reproduced].

5. The method according to claim 4, characterized in that the formula for LMMD is as follows: [formula not reproduced], where $x^s$ and $x^t$ are instances in the source domain $D_s$ and the target domain $D_t$, and $p^{(c)}$ and $q^{(c)}$ are the distributions of $D_s$ and $D_t$ for class $c$, respectively; $\mathcal{H}$ is the reproducing kernel Hilbert space (RKHS) with feature kernel $k$, $\phi(\cdot)$ denotes a feature mapping that maps the original samples to the RKHS, and the feature kernel is $k(x^s, x^t) = \langle \phi(x^s), \phi(x^t) \rangle$, where $\langle \cdot, \cdot \rangle$ denotes the inner product of vectors; minimizing $d_{\mathcal{H}}(p, q)$ in a deep network reduces the distribution difference between related subdomains within the same category; assuming each sample belongs to each category according to a weight $\omega^c$, the above formula is transformed into: [formula not reproduced], where $\omega_i^{sc}$ and $\omega_j^{tc}$ denote the weights with which $x_i^s$ and $x_j^t$, respectively, belong to class $c$, and the weighted term is a weighted sum over class $c$; the weight $\omega_i^c$ of instance $x_i$ is calculated as: [formula not reproduced], where $y_{ic}$ is the label of $x_i$ for class $c$; LMMD is introduced in the $l$-th layer of the deep network, and its expansion is: [formula not reproduced], where $z^l$ is the activation generated by the $l$-th layer of the deep network; this expression is defined as the adaptive loss of the $l$-th layer, and LMMD can be implemented in most feedforward network models.

6. The method according to claim 2, characterized in that: In step (4), the Enterface and Ravdess datasets are processed in step (2) to generate spatiotemporal motion attention expression features, which serve as the source domain data and target domain data, respectively. These features are then input into the deep subdomain adaptive network of step (3) to obtain the classification results.

7. A cross-database facial expression recognition system, which can implement the method as described in any one of claims 1 to 6.