A gait emotion recognition method and system based on a two-stream network
By combining global and local spatiotemporal features with a dual-stream network model and fusing features using self-attention and cross-attention modules, the problem of low accuracy in gait emotion recognition in existing technologies has been solved, achieving higher-precision emotion recognition.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ARTIFICIAL INTELLIGENCE RES INST OF HEFEI COMPREHENSIVE NAT SCI CENT (ANHUI ARTIFICIAL INTELLIGENCE LAB)
- Filing Date
- 2023-08-25
- Publication Date
- 2026-06-23
AI Technical Summary
Existing gait emotion recognition methods fail to effectively combine global and local spatiotemporal features, resulting in low accuracy in emotion recognition.
A two-stream network-based approach is adopted, which extracts global and local spatiotemporal features of 3D gait data through a global capture module and a local capture module, respectively. The feature fusion module is used to perform feature fusion, and the feature connectivity is enhanced by combining a self-attention module and a cross-attention module to output the emotion recognition type.
It enables the model to capture both global and local information about human movement, which improves the accuracy of gait emotion recognition.
Figure CN117115912B_ABST
Description
Technical Field
[0001] This invention relates to the field of emotion recognition technology, and in particular to a gait emotion recognition method and system based on a two-stream network.
Background Technology
[0002] With the rapid development of computer vision technology, human-computer interaction has gradually gained attention, and emotion recognition, as an important part of human-computer interaction, has become a research hotspot in the field of computer vision. Among existing technologies, some have proposed methods for emotion recognition based on human gait.
[0003] Patent [CN114863563A] provides an emotion recognition method and apparatus based on gait information, which decomposes video data of a target object into a sequence of image frames; obtains three-dimensional posture features of the target object based on the image frame sequence and an emotion recognition model's posture extraction module; obtains first and second motion trajectory features based on the image frame sequence and an emotion recognition model's motion trajectory feature extraction module; obtains fused features based on the three-dimensional posture features, the first and second motion trajectory features, and the feature fusion layer of the emotion recognition model; and obtains the emotion recognition result based on the fused features and the output layer of the emotion recognition model.
[0004] Patent [CN202210789016.X] relates to a gait emotion recognition method, device, electronic device, and storage medium based on Transformer. The gait emotion recognition network described in this invention is an autoencoder network model based on SpatialTemporalTransformer. The gait emotion recognition method based on Transformer described in this invention introduces and applies the Transformer algorithm, designing an autoencoder network model based on SpatialTemporalTransformer, which significantly improves the algorithm's performance.
[0005] Patent [CN111950449A] provides a method for emotion recognition based on walking posture, relating to the field of emotion recognition technology. The embodiment of this invention first obtains a video of the gait of the subject, then obtains the positions of key limb nodes of the subject, further obtaining an array of key limb nodes, and finally inputs the array of key limb nodes into a pre-trained emotion recognition model to obtain the emotion category of the subject.
[0006] However, the existing emotion recognition methods mentioned above do not consider both global and local aspects simultaneously, resulting in low accuracy in identifying emotion categories.
Summary of the Invention
[0007] Based on the technical problems existing in the background technology, this invention proposes a gait emotion recognition method and system based on a two-stream network, which improves the accuracy of emotion recognition for gait data by combining global spatiotemporal features and local spatiotemporal features.
[0008] This invention proposes a gait emotion recognition method based on a two-stream network, comprising the following steps:
[0009] Feature extraction is performed on the acquired walking video to obtain three-dimensional gait data of skeleton points;
[0010] The three-dimensional gait data is input into a pre-trained two-stream network model, which includes a global capture module, a local capture module, and a feature fusion module.
[0011] Global spatiotemporal features of 3D gait data are extracted using a global capture module.
[0012] Local spatiotemporal features of 3D gait data are extracted using a local capture module.
[0013] The feature fusion module is used to fuse global and local spatiotemporal features to output the predicted emotion recognition type.
[0014] Furthermore, the 3D gait data is split into two channels and input into the two-stream network model. In one channel, the coordinates of the extracted 3D gait data are mapped onto a plane to form a planar image, which is then input into the global capture module of the two-stream network model. In the other channel, the coordinates of the 3D gait data are extracted and converted into a graph structure, which is then input into the local capture module.
[0015] Furthermore, the global capture module includes a first global block, a second global block, a first max pooling layer, a third global block, a fourth global block, a second max pooling layer, and a fully connected layer connected in sequence. The fully connected layer outputs global spatiotemporal features, and the first global block inputs a planar image.
[0016] Furthermore, the first global block, the second global block, the third global block, and the fourth global block each include a global two-dimensional convolutional layer (Conv2D), a global activation layer (ReLU), and a global normalization layer (BatchNorm) connected in sequence. The first global block's Conv2D receives a planar image as input, the output of the first global block's BatchNorm is connected to the input of the second global block's Conv2D, the output of the second global block's BatchNorm is connected to the input of the first max pooling layer, the input of the third global block's Conv2D is connected to the output of the first max pooling layer, the output of the third global block's BatchNorm is connected to the input of the fourth global block's Conv2D, and the output of the fourth global block's BatchNorm is connected to the input of the second max pooling layer.
[0017] Furthermore, the local capture module includes a first local block, a second local block, a third local block, a local average pooling layer, a first local normalization layer, a local two-dimensional convolutional layer Conv2D, and a second local normalization layer connected in sequence; the first local block inputs the coordinates of the three-dimensional gait data, and the second local normalization layer outputs local spatiotemporal features.
[0018] Furthermore, the first local block, the second local block, and the third local block each include an ST-GCN layer, a local activation layer ReLU, and a third local normalization layer BatchNorm connected in sequence;
[0019] The ST-GCN layer of the first local block takes the coordinates of the 3D gait data as input. The output of the third local normalization layer BatchNorm of the first local block is connected to the input of the ST-GCN layer of the second local block. The output of the third local normalization layer BatchNorm of the second local block is connected to the input of the ST-GCN layer of the third local block. The output of the third local normalization layer BatchNorm of the third local block is connected to the input of the local average pooling layer.
[0020] Furthermore, the feature fusion module includes two self-attention modules (SAMs) and three cross-attention modules (CAMs). The inputs of the two parallel SAMs are connected one-to-one to the outputs of the global capture module and the local capture module, respectively. The outputs of the two SAMs are connected to the two parallel CAMs. The outputs of the two CAMs are connected to the input of the last CAM. The output of the last CAM is connected to the input of the classifier, which outputs the emotion recognition type.
[0021] Furthermore, the self-attention module SAM includes multi-head self-attention and self-residual. The multi-head self-attention is input with global or local spatiotemporal features, and the output of the multi-head self-attention is connected to the self-residual.
[0022] Furthermore, the cross-attention module CAM includes a multi-head cross-attention, a first cross-residual connection, a feedforward network, and a second cross-residual connection connected in sequence. The output of the self-residual is connected to the input of the multi-head cross-attention and the input of the first cross-residual connection, respectively. The output of the first cross-residual connection is connected to the input of the feedforward network and the input of the second cross-residual connection, respectively. The output of the second cross-residual is connected to the multi-head cross-attention of the last cross-attention module CAM.
[0023] A gait emotion recognition system based on a two-stream network model includes a feature extraction module, an input module, a global capture module, a local capture module, and a feature fusion module.
[0024] The feature extraction module is used to extract features from the acquired walking video to obtain three-dimensional gait data of the skeleton points;
[0025] The input module is used to input 3D gait data into a pre-trained two-stream network model, which includes a global capture module, a local capture module, and a feature fusion module.
[0026] The global capture module is used to extract global spatiotemporal features from 3D gait data;
[0027] The local capture module is used to extract local spatiotemporal features from 3D gait data;
[0028] The feature fusion module is used to fuse global and local spatiotemporal features to output a three-dimensional predicted emotion recognition type.
[0029] The advantages of the gait emotion recognition method and system based on a two-stream network provided by this invention are as follows: The gait emotion recognition method and system based on a two-stream network provided in the structure of this invention...
Attached Figure Description
[0030] Figure 1 is a schematic diagram of the structure of the present invention;
[0031] Figure 2 is a schematic diagram of the global capture module, where orange represents the global 2D convolutional layer Conv2D, yellow represents the global activation layer ReLU, gray represents the global normalization layer BatchNorm, and green represents the second max pooling layer.
[0032] Figure 3 is a schematic diagram of the local capture module, where blue represents the ST-GCN layer, yellow represents the ReLU local activation layer, gray represents the BatchNorm local normalization layer, and green represents the Avepool2D local average pooling layer.
[0033] Figure 4 is a schematic diagram of the feature fusion module.
Detailed Implementation
[0034] The technical solution of the present invention will now be described in detail through specific embodiments. Many specific details are set forth in the following description to provide a thorough understanding of the invention. However, the present invention can be implemented in many other ways different from those described herein, and those skilled in the art can make similar modifications without departing from the spirit of the invention. Therefore, the present invention is not limited to the specific embodiments disclosed below.
[0035] As shown in Figure 1, the gait emotion recognition method based on a two-stream network proposed in this invention includes the following steps:
[0036] Feature extraction is performed on the acquired walking video to obtain three-dimensional gait data of skeleton points;
[0037] The three-dimensional gait data is input into a pre-trained two-stream network model, which includes a global capture module, a local capture module, and a feature fusion module.
[0038] Global spatiotemporal features of 3D gait data are extracted using a global capture module.
[0039] Local spatiotemporal features of 3D gait data are extracted using a local capture module.
[0040] The feature fusion module is used to fuse global and local spatiotemporal features to output the predicted emotion recognition type.
[0041] This embodiment extracts gait data from walking videos using skeleton points and processes it into inputs for a suitable two-channel feature extraction network, capturing global and local spatiotemporal features respectively. A self-attention module (SAM) and a cross-attention module (CAM) are introduced into the feature fusion network to enhance feature connectivity across the two channels, thereby improving the accuracy of emotion recognition.
[0042] The 3D gait data is input into the two-stream network model through two channels. For global spatiotemporal features, the gait data is first preprocessed to obtain the 3D coordinates of the joints of the human skeleton. These 3D coordinates are mapped onto a planar image, which serves as the input to the global capture module. The global capture module extracts the required global spatiotemporal features from the planar image, and temporal information is then extracted by the BiLSTM module. Finally, the global spatiotemporal features, carrying both spatial and temporal information, are sent to the feature fusion module for feature fusion.
[0043] For local features, the gait data is likewise preprocessed to obtain the three-dimensional coordinates of the joints of the human motion skeleton. The three-dimensional coordinates form a graph structure, which serves as the input to the local capture module. The local capture module extracts the spatial and temporal information of the local features from the graph structure and sends the resulting local spatiotemporal features to the feature fusion module for feature fusion.
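The two input channels described above can be illustrated with a minimal sketch. The description does not specify how coordinates are mapped onto the plane, so the layout here is an assumption: rows are frames, columns are joints, and the x, y, z values are min-max scaled to 0..255 as three image channels. The function names `to_planar_image` and `to_graph`, the bone list, and the toy coordinates are all hypothetical.

```python
from typing import List, Tuple

Frame = List[Tuple[float, float, float]]  # one (x, y, z) tuple per joint


def to_planar_image(seq: List[Frame]) -> List[List[Tuple[int, int, int]]]:
    """Map a T x J x 3 coordinate sequence to a T x J image with 3 channels.

    Assumption: rows are frames, columns are joints, and each axis is
    min-max scaled to 0..255 over the whole sequence.
    """
    lo = [min(joint[a] for frame in seq for joint in frame) for a in range(3)]
    hi = [max(joint[a] for frame in seq for joint in frame) for a in range(3)]

    def scale(value: float, axis: int) -> int:
        span = hi[axis] - lo[axis]
        return 0 if span == 0 else round(255 * (value - lo[axis]) / span)

    return [[tuple(scale(j[a], a) for a in range(3)) for j in frame] for frame in seq]


def to_graph(num_joints: int, bones: List[Tuple[int, int]]) -> List[List[int]]:
    """Build a symmetric adjacency matrix with self-loops from a bone list."""
    adj = [[1 if r == c else 0 for c in range(num_joints)] for r in range(num_joints)]
    for a, b in bones:
        adj[a][b] = adj[b][a] = 1
    return adj


# Toy 3-joint chain (hip, knee, ankle) over 2 frames; the values are made up.
seq = [
    [(0.0, 1.0, 0.0), (0.0, 0.5, 0.1), (0.0, 0.0, 0.2)],
    [(0.1, 1.0, 0.0), (0.2, 0.5, 0.1), (0.4, 0.0, 0.2)],
]
image = to_planar_image(seq)               # input for the global channel
adjacency = to_graph(3, [(0, 1), (1, 2)])  # input structure for the local channel
```

The same joint coordinates feed both channels; only the representation differs, which is what lets one stream see the whole body over time as an image while the other sees joint-to-joint relations as a graph.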
[0044] The feature fusion module improves the accuracy of emotion recognition in the two-stream network model. This module includes two self-attention modules (SAM) to refine the characteristics of each channel, and three cross-attention modules (CAM) to fuse the obtained global and local channel features, which are explained in detail below.
[0045] (A) As shown in Figures 1 to 3, the two-stream network model mainly consists of a CNN-LSTM network and an ST-GCN network, specifically:
[0046] (A1) The global capture module is built on the basis of the CNN-LSTM network. Specifically, it includes a first global block, a second global block, a first max pooling layer, a third global block, a fourth global block, a second max pooling layer, and a fully connected layer connected in sequence. The fully connected layer outputs global spatiotemporal features, and the first global block inputs a planar image. The first, second, third, and fourth global blocks each include a global 2D convolutional layer (Conv2D), a global activation layer (ReLU), and a global normalization layer (BatchNorm) connected in sequence. The first global block's Conv2D input is a planar image; the output of the first global block's BatchNorm is connected to the input of the second global block's Conv2D; the output of the second global block's BatchNorm is connected to the input of the first max pooling layer; the third global block's Conv2D input is connected to the output of the first max pooling layer; the output of the third global block's BatchNorm is connected to the input of the fourth global block's Conv2D; and the output of the fourth global block's BatchNorm is connected to the input of the second max pooling layer.
[0047] It should be noted that in order to gradually extract the required features from the input data, more complex features can be extracted by combining multiple convolutional kernels. Although the first, second, third, and fourth global blocks have the same structure, the kernel size of the global 2D convolutional layer Conv2D in the blocks is different. The kernel size of the global 2D convolutional layer Conv2D in the first global block is 8, while the kernel size of the global 2D convolutional layer Conv2D in the second, third, and fourth global blocks is 32.
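The layer ordering of the global capture module can be written out as a plain ordered plan, which makes the wiring easy to check. This is only a sketch: the layer labels are hypothetical, the kernel sizes (8 for the first block, 32 for the others) are copied from the description, and the Bi-LSTM stage is left out because it is not part of the enumerated layer sequence.

```python
def global_block(index: int, kernel_size: int) -> list:
    """One global block: Conv2D -> ReLU -> BatchNorm, in that order."""
    return [
        (f"block{index}.Conv2D", {"kernel_size": kernel_size}),
        (f"block{index}.ReLU", {}),
        (f"block{index}.BatchNorm", {}),
    ]


def global_capture_plan() -> list:
    """Ordered layer plan for the global capture module.

    Kernel sizes come from the description; the names are illustrative labels.
    """
    plan = []
    plan += global_block(1, 8)
    plan += global_block(2, 32)
    plan.append(("MaxPool1", {}))
    plan += global_block(3, 32)
    plan += global_block(4, 32)
    plan.append(("MaxPool2", {}))
    plan.append(("FullyConnected", {}))
    return plan


plan = global_capture_plan()
layer_names = [name for name, _ in plan]
```

Each entry feeds the next one in the list, so the output of `block2.BatchNorm` goes into `MaxPool1` and the output of `block4.BatchNorm` goes into `MaxPool2`, matching the connections listed above.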
[0048] As shown in Figure 2, in the CNN-LSTM model, the extracted skeleton joint coordinates are mapped to a plane to form a planar image, which is then input into the global capture module. Global spatiotemporal features are extracted through a convolutional network, and then temporal features are extracted using a Bi-LSTM network.
[0049] (A2) The local capture module is built on the ST-GCN model. In the ST-GCN model, temporal edges are added alongside the spatial edges, extending GCN to the time dimension, so that the local capture module can extract spatial and temporal features at the same time and thereby obtain local spatiotemporal features.
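The idea of extending a skeleton graph to the time dimension can be sketched as follows. Spatial edges link joints connected by a bone within one frame, and temporal edges link the same joint in consecutive frames. The node indexing scheme (`t * num_joints + j`), the function name, and the toy bone list are assumptions made for illustration.

```python
def spatiotemporal_edges(num_frames: int, num_joints: int, bones: list):
    """Return (spatial, temporal) edge lists over nodes indexed t * J + j."""

    def node(t: int, j: int) -> int:
        return t * num_joints + j

    # Spatial edges: the bone connections repeated inside every frame.
    spatial = [(node(t, a), node(t, b)) for t in range(num_frames) for a, b in bones]
    # Temporal edges: each joint linked to itself in the next frame.
    temporal = [
        (node(t, j), node(t + 1, j))
        for t in range(num_frames - 1)
        for j in range(num_joints)
    ]
    return spatial, temporal


# Toy skeleton: 3 joints in a chain, observed over 3 frames.
spatial, temporal = spatiotemporal_edges(3, 3, [(0, 1), (1, 2)])
```

With both edge sets in one graph, a graph convolution over a node's neighbourhood mixes information from adjacent joints and from adjacent frames in the same operation.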
[0050] The local capture module specifically includes a first local block, a second local block, a third local block, a local average pooling layer, a first local normalization layer, a local 2D convolutional layer (Conv2D), and a second local normalization layer, connected in sequence. The first local block takes the coordinates of the 3D gait data as input, and the second local normalization layer outputs local spatiotemporal features. Each of the first, second, and third local blocks includes an ST-GCN layer, a local activation layer (ReLU), and a third local normalization layer (BatchNorm), connected in sequence. The ST-GCN layer of the first local block takes the coordinates of the 3D gait data as input, the output of the third local normalization layer (BatchNorm) of the first local block is connected to the input of the ST-GCN layer of the second local block, the output of the third local normalization layer (BatchNorm) of the second local block is connected to the input of the ST-GCN layer of the third local block, and the output of the third local normalization layer (BatchNorm) of the third local block is connected to the input of the local average pooling layer.
[0051] Similarly, in order to extract more complex features through the combination of multiple convolutional kernels, although the first local block, the second local block, and the third local block have the same structure, the size of the convolutional kernel of the ST-GCN layer in the block is different. The size of the convolutional kernel of the ST-GCN layer in the first local block is 32, while the size of the convolutional kernel of the ST-GCN layer in the second local block and the third local block is 64.
[0052] The above global and local spatiotemporal features enable the two-stream network model to not only focus on local information such as the relationship between skeleton points and skeleton edges, but also to focus on global information about the changes in the entire human body's motion during the pedestrian's walking process, thereby enabling it to better predict pedestrian emotions.
[0053] (B) As shown in Figures 1 to 4, the feature fusion module includes two self-attention modules (SAMs) and three cross-attention modules (CAMs). The inputs of the two parallel SAMs are connected one-to-one to the outputs of the global capture module and the local capture module, respectively. The outputs of the two SAMs are connected to the two parallel CAMs. The outputs of the two CAMs are connected to the input of the last CAM. The output of the last CAM is connected to the input of the classifier, which outputs the emotion recognition type.
[0054] The feature fusion module employs both self-attention and cross-attention. SAM learns information from different feature locations using multi-head attention and a residual connection. CAM fuses two feature sets from different channels using cross-attention and a fully connected feedforward network (FFN) consisting of two linear transformations and a ReLU layer. Specifically, the self-attention module SAM includes multi-head self-attention and a self-residual. The multi-head self-attention takes global or local spatiotemporal features as input, and its output is connected to the self-residual. The cross-attention module CAM includes a multi-head cross-attention, a first cross-residual connection, a feedforward network, and a second cross-residual connection, all connected in sequence. The output of the self-residual is connected to the inputs of the multi-head cross-attention and the first cross-residual connection, respectively. The output of the first cross-residual connection is connected to the inputs of the feedforward network and the second cross-residual connection, respectively. The output of the second cross-residual connection is connected to the multi-head cross-attention of the last cross-attention module CAM.
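The data flow through the SAM and CAM modules can be sketched in plain Python. This is a simplified illustration, not the patented implementation: it uses a single attention head with no learned projections, the feedforward network is a stand-in with identity weights, and which stream acts as the query in each parallel CAM is an assumption, since the description only states how the modules are connected.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, key, value):
    """Scaled dot-product attention on lists of row vectors (single head)."""
    d = len(query[0])
    out = []
    for q in query:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in key]
        weights = softmax(scores)
        out.append(
            [sum(w * v[c] for w, v in zip(weights, value)) for c in range(len(value[0]))]
        )
    return out


def add(x, y):
    """Element-wise sum, used for the residual connections."""
    return [[a + b for a, b in zip(r, s)] for r, s in zip(x, y)]


def sam(x):
    """Self-attention module: self-attention followed by the self-residual."""
    return add(x, attention(x, x, x))


def ffn(x):
    """Stand-in FFN (linear, ReLU, linear) with identity weights, i.e. a ReLU."""
    return [[max(0.0, v) for v in row] for row in x]


def cam(query_feat, context_feat):
    """Cross-attention module: cross-attention + residual, then FFN + residual."""
    h = add(query_feat, attention(query_feat, context_feat, context_feat))
    return add(h, ffn(h))


def fuse(global_feat, local_feat):
    """Two parallel SAMs, two parallel CAMs, then a final CAM."""
    g, l = sam(global_feat), sam(local_feat)
    g_from_l = cam(g, l)  # assumption: global stream queries the local stream
    l_from_g = cam(l, g)  # assumption: local stream queries the global stream
    return cam(g_from_l, l_from_g)


# Toy features: 2 tokens of dimension 2 per stream; the values are made up.
fused = fuse([[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, 1.0]])
```

In a full model the output of `fuse` would go to the classifier, and each attention call would use several heads with learned query, key, and value projections.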
[0055] The above settings for self-residual, first cross residual, and second cross residual can effectively solve the problem of gradient vanishing when updating coefficients during backpropagation of network gradients in the feature fusion module.
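Why residual connections help against vanishing gradients can be seen from a short derivation. The symbols are assumptions introduced for illustration: x is the input to a sublayer, F is the sublayer itself (attention or feedforward network), y is its output after the residual connection, and L is the training loss.

```latex
y = x + F(x), \qquad
\frac{\partial \mathcal{L}}{\partial x}
  = \frac{\partial \mathcal{L}}{\partial y}
    \left( I + \frac{\partial F}{\partial x} \right)
```

The identity term I passes the upstream gradient through unchanged, so the gradient reaching x does not shrink to zero even when the Jacobian of F is very small.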
[0056] As an example:
[0057] Step 1: Obtain a pedestrian walking video dataset and use OpenPose pose extraction technology to extract the 3D coordinates of the pedestrian skeleton joints.
[0058] Step 2: Map the 3D coordinates of these key points onto a planar image, and input the planar image into the global capture module of the two-stream network model to output global spatiotemporal features;
[0059] Step 3: Convert the 3D coordinates of these joints into a graph structure and input it into the local capture module of the two-stream network model to capture local spatiotemporal features.
[0060] Step 4: Input the captured global and local spatiotemporal features into the designed feature fusion module, and output the predicted emotion recognition type through self-attention and cross-attention.
[0061] The above description is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitutions or modifications made by those skilled in the art within the scope of the technology disclosed in the present invention, based on the technical solution and inventive concept of the present invention, should be covered within the scope of protection of the present invention.
Claims
1. A gait emotion recognition method based on a two-stream network, characterized in that, Includes the following steps: Feature extraction is performed on the acquired walking video to obtain three-dimensional gait data of skeleton points; The three-dimensional gait data is input into a pre-trained two-stream network model, which includes a global capture module, a local capture module, and a feature fusion module. Global spatiotemporal features of 3D gait data are extracted using a global capture module. Local spatiotemporal features of 3D gait data are extracted using a local capture module. The feature fusion module is used to fuse global and local spatiotemporal features to output the predicted emotion recognition type. The 3D gait data is split into two channels and input into the two-stream network model. In one channel, the coordinates of the extracted 3D gait data are mapped onto a plane to form a planar image, which is then input into the global capture module of the two-stream network model. In the other channel, the coordinates of the 3D gait data are extracted and converted into a graph structure, which is then input into the local capture module.
2. The gait emotion recognition method based on a two-stream network according to claim 1, characterized in that, The global capture module includes a first global block, a second global block, a first max pooling layer, a third global block, a fourth global block, a second max pooling layer, and a fully connected layer connected in sequence. The fully connected layer outputs global spatiotemporal features, and the first global block inputs a planar image.
3. The gait emotion recognition method based on a two-stream network according to claim 2, characterized in that, The first, second, third, and fourth global blocks each include a globally connected two-dimensional convolutional layer (Conv2D), a globally activated layer (ReLU), and a globally normalized layer (BatchNorm). The first global block's Conv2D receives a planar image as input, the output of the first global block's BatchNorm is connected to the input of the second global block's Conv2D, the output of the second global block's BatchNorm is connected to the input of the first max pooling layer, the input of the third global block's Conv2D is connected to the output of the first max pooling layer, the output of the third global block's BatchNorm is connected to the input of the fourth global block's Conv2D, and the output of the fourth global block's BatchNorm is connected to the input of the second max pooling layer.
4. The gait emotion recognition method based on a two-stream network according to claim 1, characterized in that, The local capture module includes a first local block, a second local block, a third local block, a local average pooling layer, a first local normalization layer, a local two-dimensional convolutional layer (Conv2D), and a second local normalization layer, connected in sequence. The first local block inputs the coordinates of the three-dimensional gait data, and the second local normalization layer outputs local spatiotemporal features.
5. The gait emotion recognition method based on a two-stream network according to claim 4, characterized in that, The first local block, the second local block, and the third local block each include an ST-GCN layer, a local activation layer ReLU, and a third local normalization layer BatchNorm connected in sequence; The ST-GCN layer of the first local block takes the coordinates of the 3D gait data as input. The output of the third local normalization layer BatchNorm of the first local block is connected to the input of the ST-GCN layer of the second local block. The output of the third local normalization layer BatchNorm of the second local block is connected to the input of the ST-GCN layer of the third local block. The output of the third local normalization layer BatchNorm of the third local block is connected to the input of the local average pooling layer.
6. The gait emotion recognition method based on a two-stream network according to claim 1, characterized in that, The feature fusion module includes two self-attention modules (SAMs) and three cross-attention modules (CAMs). The inputs of the two parallel SAMs are connected one-to-one to the outputs of the global capture module and the local capture module, respectively. The outputs of the two SAMs are connected to the two parallel CAMs. The outputs of the two CAMs are connected to the input of the last CAM. The output of the last CAM is connected to the input of the classifier, which outputs the emotion recognition type.
7. The gait emotion recognition method based on a two-stream network according to claim 6, characterized in that, The self-attention module SAM includes multi-head self-attention and self-residual. The multi-head self-attention is input with global or local spatiotemporal features, and the output of the multi-head self-attention is connected to the self-residual.
8. The gait emotion recognition method based on a two-stream network according to claim 7, characterized in that, The cross-attention module (CAM) includes a multi-head cross-attention, a first cross-residual connection, a feedforward network, and a second cross-residual connection connected in sequence. The output of the self-residual is connected to the input of the multi-head cross-attention and the input of the first cross-residual connection, respectively. The output of the first cross-residual connection is connected to the input of the feedforward network and the input of the second cross-residual connection, respectively. The output of the second cross-residual is connected to the multi-head cross-attention of the last cross-attention module (CAM).
9. A gait emotion recognition system based on a two-stream network model, characterized in that, It includes a feature extraction module, an input module, a global capture module, a local capture module, and a feature fusion module; The feature extraction module is used to extract features from the acquired walking video to obtain three-dimensional gait data of the skeleton points; The input module is used to input 3D gait data into a pre-trained two-stream network model, which includes a global capture module, a local capture module, and a feature fusion module. The global capture module is used to extract global spatiotemporal features from 3D gait data; The local capture module is used to extract local spatiotemporal features from 3D gait data; The feature fusion module is used to fuse global and local spatiotemporal features to output the predicted emotion recognition type in three dimensions; The 3D gait data is divided into two channels and input into the two-stream network model. In one channel, the coordinates of the extracted 3D gait data are mapped onto a plane to form a planar image, which is then input into the global capture module of the two-stream network model. In the other channel, the coordinates of the 3D gait data are extracted and converted into a graph structure, which is then input into the local capture module.