A sign language emotion recognition system based on electrocardio and skeleton multi-modal fusion

The sign language emotion recognition system, which integrates ECG and skeletal multimodal data collection, uses a monocular camera and a single-lead ECG sensor to collect data. This solves the problems of low accuracy in high-arousal emotion recognition and high cost of multimodal data synchronization in existing technologies, and achieves high-precision, low-cost, and robust emotion recognition for hearing-impaired individuals.

CN122196867A · Pending · Publication Date: 2026-06-12 · DALIAN UNIV OF TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: DALIAN UNIV OF TECH
Filing Date: 2026-02-05
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

Existing technologies for sign language emotion recognition in hearing-impaired individuals suffer from several problems: low accuracy in recognizing high-arousal emotions, costly and failure-prone multimodal data synchronization, poor generalization across subjects, and a conflict between privacy and computing power.

Method used

A sign language emotion recognition system based on ECG and skeleton multimodal fusion is adopted. Data is collected through a monocular high-definition camera and a portable single-lead ECG sensor. By utilizing a synchronous acquisition module, a preprocessing alignment module, a skeleton extraction module, a physiological coding module, a multimodal feature extraction unit, and a cross-attention fusion module, the system achieves synchronization and feature fusion of ECG and skeleton data, thereby improving the accuracy and robustness of emotion recognition.

Benefits of technology

Eliminating the impact of hardware clock drift significantly improves the recognition accuracy of similar emotions, reduces system hardware costs and deployment complexity, enhances the generalization ability to different individuals, and achieves privacy protection and edge computing deployment.



Abstract

The application relates to the field of artificial intelligence and assisted interaction technology, and provides a sign language emotion recognition system based on electrocardiogram (ECG) and skeleton multimodal fusion. The system comprises an acquisition device, a main control computing device, and an interactive output device. The acquisition device comprises a visual acquisition device and a physiological acquisition device. The main control computing device is communicatively connected to the acquisition device and comprises a synchronous acquisition module, a preprocessing alignment module, a skeleton extraction module, a physiological coding module, a multimodal feature extraction unit, a cross-attention fusion module, and an emotion classification module. The interactive output device feeds the finally recognized emotion category back to the user. The application can improve the accuracy and reliability of emotion recognition and enhance generalization across different individuals.

Description

Technical Field

[0001] This invention relates to the field of artificial intelligence and assisted interaction technology, and in particular to a sign language emotion recognition system based on electrocardiogram and skeleton multimodal fusion.

Background Technology

[0002] Currently, emotion recognition for hearing-impaired individuals primarily employs single-modal sign language recognition methods based on computer vision. This involves capturing sign language videos using RGB cameras or depth cameras, extracting key skeletal points of the hands and limbs through human pose estimation algorithms (such as OpenPose or MediaPipe), or directly processing video frames using convolutional neural networks (CNNs). Subsequently, long short-term memory networks (LSTM) or graph convolutional networks (GCNs) are used to perform spatiotemporal modeling of the skeletal sequence, classifying the meaning of sign language or inferring emotions based on the trajectory, amplitude, and speed of limb movements. Some advanced solutions attempt to combine facial expression recognition to assist in emotion assessment.

[0003] Although the aforementioned existing technologies have made some progress in sign language semantic translation, they still have the following major shortcomings in emotion recognition tasks:

[0004] Defect 1: Low accuracy in recognizing similar emotions with high arousal levels.

[0005] Current technologies rely solely on visual appearance features. However, in sign language, high-arousal emotions such as "anger" and "happiness" are both expressed through large, rapid movements. Using only visual skeleton data, algorithms struggle to capture the essential physiological differences between the two (such as heart rate changes and levels of tension), leading to frequent confusion and a high rate of misidentification.

[0006] Defect 2: Multimodal data synchronization is costly and prone to failure.

[0007] To address the limitations of vision, a few existing technologies attempt to introduce physiological sensors (multimodal). However, these solutions typically rely on expensive hardware synchronization boards to force alignment between sensors of different frequencies (such as cameras and ECG monitors). Without hardware synchronization, long-term data acquisition can lead to severe system clock drift, causing data misalignment on the timeline and resulting in algorithm failure. This limits their application in low-cost consumer devices.

[0008] Defect 3: Poor generalization across subjects (low robustness).

[0009] Existing models are prone to overfitting to the movement styles of specific training subjects. Because sign language habits and movement amplitude differ greatly between people, the recognition performance of a vision-only model often drops significantly on unseen subjects, so such models lack robustness across different groups.

[0010] Defect 4: Conflict between privacy and computing power.

[0011] If facial expressions are relied upon for assisted recognition, clear images of the user's face need to be collected, which poses a risk of privacy leakage; moreover, processing high-resolution facial videos requires extremely high computing power from the terminal device, making it difficult to achieve real-time operation on portable edge devices.

Summary of the Invention

[0012] This invention primarily addresses the technical problems in existing emotion recognition technologies, such as the difficulty of distinguishing highly aroused but similar emotions due to a single visual modality, leading to misjudgments; the inability to achieve low-cost, accurate alignment due to clock drift between the camera and ECG sensor in the absence of expensive hardware synchronization boards; and the tendency for models to overfit to the movement habits of specific subjects, resulting in poor robustness in recognizing users they have not seen. The invention proposes a sign language emotion recognition system based on ECG and skeletal multimodal fusion. It incorporates ECG data reflecting the state of the autonomic nervous system and designs a cross-attention fusion module, using physiological features as query vectors to dynamically weight visual features, thereby improving the accuracy and reliability of emotion recognition and enhancing its generalization ability across different individuals.

[0013] This invention provides a sign language emotion recognition system based on electrocardiogram and skeleton multimodal fusion, comprising: a data acquisition device, a main control computing device, and an interactive output device;

[0014] The acquisition device includes a visual acquisition device and a physiological acquisition device; the visual acquisition device acquires video frame image data, and the physiological acquisition device simultaneously acquires electrocardiogram data.

[0015] The main control computing device is communicatively connected to the acquisition device; the main control computing device includes a synchronous acquisition module, a preprocessing alignment module, a skeleton extraction module, a physiological coding module, a multimodal feature extraction unit, a cross-attention fusion module, and an emotion classification module;

[0016] The synchronous acquisition module acquires video frame image data transmitted by the visual acquisition device and electrocardiogram data transmitted by the physiological acquisition device in parallel; at the moment of receiving each video frame image data and each electrocardiogram data, it adds an image timestamp to the image data and adds an electrocardiogram data timestamp to the electrocardiogram data.

[0017] The preprocessing alignment module performs time-series alignment between image timestamps and ECG data timestamps.

[0018] The skeleton extraction module integrates a pose estimation algorithm to extract the coordinates of key points of the human upper body skeleton from the aligned video frame data and construct a spatiotemporal skeleton atlas.

[0019] The physiological coding module is used to construct a physiological sequence atlas containing temporal changes in the subject's physiological arousal level from the aligned electrocardiogram data.

[0020] A multimodal feature extraction unit, including an ES-GCN model, is used to input the processed skeleton spatiotemporal atlas and physiological sequence atlas into the ES-GCN model to extract skeleton action feature vectors and physiological feature vectors.

[0021] The ES-GCN model includes a parallel visual feature extraction module and a physiological feature extraction module;

[0022] The visual feature extraction module inputs the skeleton spatiotemporal map into the ST-GCN module; the ST-GCN module encodes the skeleton spatiotemporal map into a visual feature vector containing spatiotemporal dynamic information, and then uses the visual feature vector to extract the skeleton action feature vector through spatiotemporal map convolution operation;

[0023] The physiological feature extraction module inputs the aligned physiological sequence map into a one-dimensional convolutional network, extracts the emotional arousal features hidden in the electrocardiogram data waveform through convolution operation, and encodes the emotional arousal features into a physiological feature vector containing emotional arousal information.

[0024] The cross-attention fusion module uses a cross-attention mechanism to fuse the skeleton action feature vector from the visual feature extraction module and the physiological feature vector from the physiological feature extraction module to obtain a fused emotion feature vector.

[0025] The interactive output device is used to provide feedback to the user on the finally identified emotion category.

[0026] Preferably, the visual acquisition device uses a monocular high-definition camera; the physiological acquisition device uses a portable single-lead electrocardiogram sensor.

[0027] Preferably, the synchronous acquisition module, upon receiving each video frame image data and each electrocardiogram (ECG) data, adds an image timestamp to the image data and an ECG data timestamp to the ECG data, including:

[0028] Upon receiving each frame of image data and each ECG data, the system master clock is invoked to immediately read the current system time in milliseconds, and an image timestamp is added to the image data and an ECG data timestamp is added to the ECG data. Based on the image timestamp, the continuous video frame image data is divided into video frame image data sample segments with a duration of L seconds, and based on the ECG data timestamp, the continuous ECG data is divided into ECG data sample segments with a duration of L seconds.
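The software timestamping and segmentation step can be illustrated with a minimal Python sketch. The function names `stamp` and `segment`, the use of `time.time` as the "system master clock", and the optional shared origin `t0_ms` are assumptions made for illustration; the patent text does not specify an implementation.

```python
import time
from collections import defaultdict

def stamp(sample, clock=time.time):
    """Attach a millisecond timestamp read from the host clock at reception."""
    return (int(clock() * 1000), sample)

def segment(stamped, seg_seconds, t0_ms=None):
    """Split (timestamp_ms, sample) pairs into consecutive windows of seg_seconds.

    t0_ms is a hypothetical shared origin so that the video and ECG streams
    can use the same window boundaries; it defaults to the first timestamp.
    """
    if not stamped:
        return []
    seg_ms = int(seg_seconds * 1000)
    start = stamped[0][0] if t0_ms is None else t0_ms
    buckets = defaultdict(list)
    for t_ms, sample in stamped:
        buckets[(t_ms - start) // seg_ms].append((t_ms, sample))
    return [buckets[k] for k in sorted(buckets)]

# Usage: eight frames stamped one second apart, cut into L = 4 second segments.
frames = [(t, f"frame{t}") for t in range(0, 8000, 1000)]
segments = segment(frames, 4)
```

Because both streams are stamped by the same host clock, their segments can be paired by window index. A wall clock such as `time.time` can jump if the system time is adjusted, so a monotonic clock may be preferable in practice.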

[0029] Preferably, the preprocessing alignment module performs time-series alignment of image timestamps and ECG data timestamps, including:

[0030] The preprocessing alignment module extracts the coordinates of key points of the upper body skeleton from each video frame image data using a human pose estimation algorithm; and interpolates or downsamples the ECG data according to the ECG data timestamp to ensure that the ECG data timestamp is strictly aligned with the video frame data timestamp, generating time-aligned multimodal data pairs.
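As an illustration of the alignment step, the following sketch resamples ECG values at the video frame timestamps using linear interpolation with NumPy. The function name `align_ecg_to_frames` and the choice of plain linear interpolation are assumptions; the patent only states that the ECG data is interpolated or downsampled to match the frame timestamps.

```python
import numpy as np

def align_ecg_to_frames(ecg_t_ms, ecg_values, frame_t_ms):
    """Return one ECG value per video frame by linear interpolation in time."""
    ecg_t_ms = np.asarray(ecg_t_ms, dtype=float)
    ecg_values = np.asarray(ecg_values, dtype=float)
    frame_t_ms = np.asarray(frame_t_ms, dtype=float)
    order = np.argsort(ecg_t_ms)  # np.interp expects increasing sample times
    return np.interp(frame_t_ms, ecg_t_ms[order], ecg_values[order])

# Usage: ECG sampled every 10 ms, frames arriving at 5 ms and 25 ms.
aligned = align_ecg_to_frames([0, 10, 20, 30], [0.0, 1.0, 2.0, 3.0], [5, 25])
```

Note that `np.interp` clamps to the first or last ECG value for frame timestamps outside the ECG time range, and that this sketch applies no anti-aliasing filter before reducing the sampling rate.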

[0031] Preferably, in the skeleton extraction module, the spatiotemporal graph of the skeleton is an undirected graph of the form $G = (V, E)$;

[0032] $V$ denotes the set of nodes of the undirected graph, which contains all the skeleton key points in a frame; $E$ denotes the set of edges, which includes spatial edges and temporal edges;

[0033] A spatial edge connects joints that are adjacent in the human physiological structure;

[0034] A temporal edge connects the positions of the same joint in consecutive frames.
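The two edge types can be made concrete with a short sketch that enumerates them for a sequence of frames. Nodes are indexed as `(frame, joint)` pairs; the function name `build_edges` and the three-joint example skeleton are hypothetical and chosen only for illustration.

```python
def build_edges(num_joints, bones, num_frames):
    """Return (spatial, temporal) edge lists over nodes indexed as (frame, joint)."""
    # Spatial edges: anatomically adjacent joints within the same frame.
    spatial = [((t, i), (t, j)) for t in range(num_frames) for i, j in bones]
    # Temporal edges: the same joint in two consecutive frames.
    temporal = [((t, j), (t + 1, j))
                for t in range(num_frames - 1)
                for j in range(num_joints)]
    return spatial, temporal

# Usage: shoulder (0), elbow (1), wrist (2) over two frames.
spatial_edges, temporal_edges = build_edges(3, [(0, 1), (1, 2)], 2)
```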

[0035] Preferably, the spatiotemporal graph convolution formula of the visual feature extraction module is:

[0036] For the input $f_{in}$ of the ST-GCN module, the output $f_{out}$ is calculated as:

[0037] $$f_{out} = \sum_{k} \Lambda_k^{-\frac{1}{2}} A_k \Lambda_k^{-\frac{1}{2}} f_{in} W_k$$

[0038] where $f_{out}$ denotes the output of the ST-GCN module; $f_{in}$ denotes the skeleton feature tensor input to the ST-GCN module; $W_k$ denotes the weight matrix that the network needs to learn; $A_k$ denotes the adjacency matrix that describes the connection relationships of the skeleton; $\Lambda_k$ denotes the degree matrix used for normalization; and $k$ indexes the different spatial partitioning strategies.

[0039] Preferably, the emotional arousal characteristics include the R-wave interval change rate, instantaneous heart rate, RR interval standard deviation, and root mean square of the difference between adjacent RR intervals.

[0040] Preferably, the cross-attention mechanism of the cross-attention fusion module includes the following three stages:

[0041] The first stage: Constructing the feature mapping and query matrix between physiological feature vectors and skeletal movement feature vectors;

[0042] The cross-attention fusion module receives physiological feature vectors from the physiological feature extraction module and skeleton action feature vectors from the visual feature extraction module, respectively.

[0043] The physiological feature vector is mapped to a query matrix through a built-in linear projection layer, and the skeletal motion feature vector is mapped to a key matrix and a value matrix, respectively.

[0044] The query matrix contains feature vectors representing the baseline of intrinsic emotion output by the physiological feature extraction module; the key matrix contains feature vectors representing the external spatiotemporal action pattern output by the visual feature extraction module; and the value matrix contains skeleton action feature vectors representing the high-dimensional semantic content of sign language output by the visual feature extraction module.

[0045] The second stage involves calculating the attention weights of the physiological feature vector and the skeletal motion feature vector, and then performing weighted fusion.

[0046] The attention weights and fusion formulas for calculating the correlation between physiological feature vectors and skeletal motion feature vectors are as follows:

[0047] $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$

[0048] Where: $Q$ represents the feature matrix output by the physiological feature extraction module, representing the baseline of internal emotion; $K$ represents the feature matrix output by the visual feature extraction module, representing the external spatiotemporal action pattern; $V$ represents the feature matrix output by the visual feature extraction module, representing the high-dimensional semantic content of sign language; $\sqrt{d_k}$ indicates the scaling factor; and $\mathrm{softmax}$ represents the normalized exponential function;

[0049] The third stage: Output the classification results of the cross-attention mechanism.
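The mapping and weighting stages above can be sketched in NumPy as scaled dot-product cross-attention, with the physiological features supplying the queries and the skeleton features supplying the keys and values. The projection matrices `Wq`, `Wk`, `Wv`, the feature sizes, and the function names are assumptions for illustration; in a trained system the projections would be learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable normalized exponential function."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(phys, skel, Wq, Wk, Wv):
    """phys: (Tp, Dp) physiological features; skel: (Ts, Ds) skeleton features."""
    Q = phys @ Wq   # queries from the physiological stream, shape (Tp, d_k)
    K = skel @ Wk   # keys from the skeleton stream, shape (Ts, d_k)
    V = skel @ Wv   # values from the skeleton stream, shape (Ts, d_v)
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (Tp, Ts), rows sum to 1
    return weights @ V, weights

# Usage with random stand-in features and projections.
rng = np.random.default_rng(0)
phys, skel = rng.normal(size=(2, 8)), rng.normal(size=(5, 16))
Wq, Wk, Wv = (rng.normal(size=s) for s in [(8, 4), (16, 4), (16, 6)])
fused, attn = cross_attention(phys, skel, Wq, Wk, Wv)
```

Each row of `attn` says how strongly one physiological time step attends to each skeleton time step, which is the sense in which physiological arousal dynamically weights the visual features.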

[0050] Preferably, the emotion classification module includes a global average pooling layer, a fully connected layer, and a Softmax classifier;

[0051] The emotion classification module takes the emotion feature vector fused by the cross-attention fusion module, passes it through a global average pooling layer, and inputs it into a fully connected layer for linear mapping; it uses the Softmax function to calculate the probability value of each emotion category; and outputs the category with the highest probability as the final recognition result; the emotion categories include anger, happiness, sadness, and calmness.
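A minimal sketch of such a classification head follows, assuming the fused features arrive as a `(time, feature)` array and that pooling is taken over the time axis. The weights `W` and bias `b` stand in for learned parameters, and the function name `classify` is hypothetical.

```python
import numpy as np

EMOTIONS = ("anger", "happiness", "sadness", "calmness")

def classify(fused, W, b):
    """fused: (T, D) fused features; W: (D, 4); b: (4,). Returns (label, probs)."""
    pooled = fused.mean(axis=0)          # global average pooling over time
    logits = pooled @ W + b              # fully connected linear mapping
    logits = logits - logits.max()       # shift for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()  # Softmax probabilities
    return EMOTIONS[int(np.argmax(probs))], probs

# Usage: a bias that favors the third category.
label, probs = classify(np.ones((5, 3)), np.zeros((3, 4)), np.array([0.0, 0.0, 5.0, 0.0]))
```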

[0052] Preferably, the interactive output device includes a display screen and / or a speech synthesis speaker.

[0053] The present invention provides a sign language emotion recognition system based on electrocardiogram and skeleton multimodal fusion, which has the following advantages compared with the prior art:

[0054] 1. Eliminate the effects of hardware clock drift.

[0055] This invention acquires video streams of a subject's sign language gestures and real-time electrocardiogram (ECG) signals in parallel using an image acquisition device (monocular camera) and a physiological monitoring device (single-lead ECG sensor). To achieve low-cost and accurate synchronization, this invention employs an alignment method based on the master control software timestamp. Specifically, the system's current time is acquired in real-time within the data acquisition thread, and each frame of image data and each ECG data packet is labeled with a uniform millisecond-level timestamp. Heterogeneous data is then downsampled and time-aligned based on this timestamp, thereby eliminating the impact of hardware clock drift at the software level.

[0056] 2. The intrinsic driving mechanism of physiological responses on limb expression was simulated at the data processing level.

[0057] This invention utilizes a pose estimation algorithm to extract skeleton sequences containing human spatial topological information from videos and employs one-dimensional convolution to process electrocardiogram waveform sequences. The core step of this invention lies in performing a feature fusion process based on cross-attention: the system maps extracted physiological and emotional features to a query matrix, and visual skeleton action features to a key and value matrix. By calculating the correlation weight between the two, the visual features are dynamically weighted using arousal features from physiological signals. This process simulates the intrinsic driving mechanism of physiological responses on limb expression at the data processing level, suppressing ineffective background motion interference. Finally, the system inputs the fused feature vector into a classifier and outputs the corresponding sign language emotion category control signal or label.

[0058] 3. Significantly improved the accuracy of similar emotion recognition.

[0059] The present invention includes a synchronous preprocessing module for performing the aforementioned software labeling and alignment operations to construct multimodal data pairs; a feature extraction network module comprising parallel spatiotemporal graph convolutional network units for encoding the spatial dynamics of the skeleton and the temporal morphology of the electrocardiogram; and an emotion interaction module for driving the display terminal to respond to the recognized emotional intent based on the final classification result, or playing corresponding emotional speech through a speech synthesis unit, thereby assisting hearing-impaired individuals in conducting barrier-free communication containing emotional nuances. This invention introduces electrocardiogram (ECG) signals reflecting the state of the autonomic nervous system and designs a cross-attention fusion module, using physiological features as query vectors to dynamically weight visual features. This effectively solves the problem of confusion when distinguishing between high-arousal emotions such as "anger" and "happiness" using a single visual modality. Even with similar limb movement amplitudes, the system can make correct judgments by capturing intrinsic physiological differences such as heart rate variability, thereby significantly improving the accuracy and reliability of emotion recognition.

[0060] 4. Reduced system hardware costs and deployment complexity.

[0061] This invention eliminates the need for expensive external hardware synchronization boards, innovatively employing a software marking and alignment algorithm based on the main control system clock. While ensuring millisecond-level synchronization accuracy, the system requires only a standard USB camera and a portable ECG module to operate, eliminating the need for complex wiring and additional synchronization hardware. This significantly reduces the overall system construction cost and simplifies the installation and maintenance process, facilitating large-scale deployment.

[0062] 5. It enhances the ability to generalize to different individuals.

[0063] This invention constructs a visual and physiological dual-stream complementary network. Since physiological responses (such as heart rate characteristics during stress) share certain biological commonalities across different populations, this network can compensate for visual feature biases caused by individual sign language habits and non-standard movements. This means that the system can maintain stable recognition performance even when facing unseen users, overcoming the overfitting problem of traditional models.

[0064] 6. It achieves privacy protection and edge computing deployment.

[0065] This invention employs skeleton key point extraction technology to replace full image processing and uses only a single-lead ECG instead of facial expression recognition. It does not collect or store high-resolution facial images of users, maximizing the protection of portrait privacy for people with disabilities and ensuring high security. The data dimensionality is significantly reduced (from tens of millions of pixels to coordinate point matrix level), greatly reducing computational power consumption. This allows the high-precision algorithm to run smoothly on laptops or embedded edge devices, achieving low-latency real-time interaction.

[0066] 7. This invention uses wearable devices and cameras to capture the movements and physiological states of sign language users in real time, and translates simple sign language movements into semantic outputs containing emotional colors such as "joy, anger, sorrow, and happiness". It can be applied to scenarios such as sign language emotion recognition and barrier-free communication for hearing-impaired people, smart wearable devices and multimodal emotion computing interfaces, and can be widely used in smart disability assistance terminals, barrier-free education systems and human-computer emotion interaction interfaces.

Attached Figure Description

[0067] Figure 1 is a schematic diagram of the module composition of the sign language emotion recognition system based on ECG and skeleton multimodal fusion provided by the present invention;

[0068] Figure 2 is a flowchart illustrating the implementation of the sign language emotion recognition system based on ECG and skeleton multimodal fusion provided by the present invention;

[0069] Figure 3 is a schematic diagram of the cross-attention mechanism.

Detailed Implementation

[0070] To make the technical problems solved by this invention, the technical solutions adopted, and the technical effects achieved clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and not intended to limit it. Furthermore, it should be noted that, for ease of description, only the parts relevant to the invention are shown in the accompanying drawings, not all of them.

[0071] As shown in Figures 1 and 2, an embodiment of the present invention provides a sign language emotion recognition system based on ECG and skeleton multimodal fusion, including: a data acquisition device 100, a main control computing device 200 and an interactive output device 300.

[0072] The acquisition device 100 includes a visual acquisition device 101 and a physiological acquisition device 102; the visual acquisition device 101 acquires video frame image data, and the physiological acquisition device 102 simultaneously acquires electrocardiogram data.

[0073] The visual acquisition device 101 employs a monocular high-definition camera (such as a USB camera or an industrial camera) to capture real-time video streams of the subject's sign language movements. The visual acquisition device 101 is wired to the main control computing device 200 via a data transmission interface (such as a USB 3.0 interface).

[0074] The physiological data acquisition device 102 employs a portable single-lead electrocardiogram sensor (e.g., a wearable device integrating a BMD101 chip) to acquire real-time electrocardiogram (ECG) data of the subject. The physiological data acquisition device 102 communicates with the main control computing device 200 via a wireless communication module (e.g., Bluetooth / Wi-Fi) or a serial interface.

[0075] The main control computing device 200 is communicatively connected to the acquisition device 100 and serves as the core execution entity of the system, responsible for performing data synchronization, preprocessing, feature extraction, and emotion recognition tasks. The main control computing device 200 is a computer terminal (such as a high-performance workstation or embedded edge computing box) that includes memory, a processor (CPU / GPU), and a communication bus.

[0076] The main control computing device 200 includes a synchronous acquisition module 201, a preprocessing alignment module 202, a skeleton extraction module 203, a physiological coding module 204, a multimodal feature extraction unit, a cross-attention fusion module 207, and an emotion classification module 208.

[0077] The main control computing device 200 starts the data acquisition thread and initializes the system main clock.

[0078] The input end of the synchronous acquisition module 201 is connected to the visual acquisition device 101 and the physiological acquisition device 102, and the output end is connected to the preprocessing alignment module 202.

[0079] The synchronous acquisition module 201 starts receiving threads for video frame data and ECG data in parallel. It calls the system master clock to append a uniform millisecond-level system timestamp to each received video frame data and ECG data.

[0080] Specifically, the synchronous acquisition module 201 acquires video frame image data transmitted by the visual acquisition device 101 and electrocardiogram data transmitted by the physiological acquisition device 102 in parallel. At the moment each frame of image data and each ECG data is received, the system master clock is invoked and the current system time in milliseconds is read immediately, so that an image timestamp is appended to the image data and an ECG data timestamp is appended to the ECG data. Based on the image timestamps, the continuous video frame image data is divided into video frame image data sample segments with a duration of L seconds; based on the ECG data timestamps, the continuous ECG data is divided into ECG data sample segments with a duration of L seconds, for example 4 seconds.

[0081] The input end of the preprocessing alignment module 202 is connected to the synchronous acquisition module 201; the output end is connected to the skeleton extraction module 203 and the physiological coding module 204.

[0082] The preprocessing alignment module 202 performs time-series alignment between image timestamps and ECG data timestamps: based on the ECG data timestamps, it downsamples or interpolates the high-sampling-rate ECG data to keep the ECG data consistent with the video frame rate, generating time-aligned multimodal data pairs; and performs filtering processing on the ECG data (such as removing baseline drift).

[0083] Specifically, the preprocessing alignment module 202 extracts the coordinates of key points of the upper body skeleton from each video frame image data using a human pose estimation algorithm; and, based on the ECG data timestamps, interpolates or downsamples the ECG data so that the ECG data timestamps are strictly aligned with the video frame data timestamps, generating time-aligned multimodal data pairs that form a one-to-one sequence relationship.

[0084] The human pose estimation algorithm uses the MediaPipe framework. The key points include hand joints, elbows, shoulders, and facial contour points.

[0085] Before the ECG data is interpolated or downsampled based on the ECG data timestamps, the ECG data can also be filtered and denoised, for example to remove baseline drift.
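One simple way to remove baseline drift is to subtract a moving-average estimate of the slowly varying baseline. The patent does not name a specific filter, so the technique below, the function name `remove_baseline`, and the 0.6 second window are assumptions chosen purely for illustration; a high-pass or median-based filter would be an equally plausible choice.

```python
import numpy as np

def remove_baseline(ecg, fs_hz, window_s=0.6):
    """Subtract a centered moving-average baseline from a 1-D ECG signal."""
    ecg = np.asarray(ecg, dtype=float)
    win = max(1, int(window_s * fs_hz)) | 1       # force an odd window length
    pad = win // 2
    padded = np.pad(ecg, pad, mode="edge")        # repeat edge values at both ends
    baseline = np.convolve(padded, np.ones(win) / win, mode="valid")
    return ecg - baseline                          # same length as the input

# Usage: a constant offset is treated as baseline and removed entirely.
cleaned = remove_baseline(np.full(100, 5.0), fs_hz=250)
```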

[0086] The input end of the skeleton extraction module 203 is connected to the preprocessing alignment module 202, and the output end is connected to the visual feature extraction module 205.

[0087] The skeleton extraction module 203 integrates a pose estimation algorithm (MediaPipe) to extract the coordinates of keypoints of the upper body skeleton from the aligned video frame data and construct a spatiotemporal skeleton atlas.

[0088] Each skeleton spatiotemporal graph in the skeleton spatiotemporal graph set is an undirected graph of the form $G = (V, E)$. $V$ represents the set of nodes in the undirected graph, which contains all the skeletal keypoints (such as left and right hand, arm nodes) in a frame. The set of edges $E$ consists of two parts: spatial edges, which connect adjacent joints in the human body's physiological structure (such as the wrist connecting to the elbow); and temporal edges, which connect the positions of the same joint between consecutive frames.

[0089] The physiological coding module 204 is used to map and format the aligned electrocardiogram data in the feature space; it processes the one-dimensional electrocardiogram data as a time-series vector to construct a physiological sequence map containing the time-series changes in the subject's physiological arousal, thereby providing structured data input for subsequent multimodal feature fusion and emotional arousal feature extraction.

[0090] The multimodal feature extraction unit includes the ES-GCN model. The multimodal feature extraction unit inputs the processed skeleton spatiotemporal atlas and physiological sequence atlas into the ES-GCN (Emotion-Aware Sign Language Recognition via Cross-Modal Fusion) model to extract skeleton action feature vectors and physiological feature vectors.

[0091] The ES-GCN model includes a parallel visual feature extraction module 205 and a physiological feature extraction module 206.

[0092] The visual feature extraction module 205 inputs the skeleton spatiotemporal graph into the ST-GCN module; the ST-GCN module encodes the skeleton spatiotemporal graph into a visual feature vector containing spatiotemporal dynamic information, and then uses the visual feature vector to extract the skeleton action feature vector through spatiotemporal graph convolution operation. The ST-GCN (Spatial Temporal Graph Convolutional Networks) module is a deep learning network for processing skeleton sequence data, capable of extracting spatial and temporal features simultaneously.

[0093] Spatiotemporal graph convolution formula:

[0094] For the input $f_{in}$ of the ST-GCN module, the output $f_{out}$ is calculated as:

[0095] $$f_{out} = \sum_{k} \Lambda_k^{-\frac{1}{2}} A_k \Lambda_k^{-\frac{1}{2}} f_{in} W_k$$

[0096] where $f_{out}$ denotes the output of the ST-GCN module; $f_{in}$ denotes the skeleton feature tensor (a known quantity) input to the ST-GCN module; $W_k$ denotes the weight matrix (an unknown quantity, obtained through training) that the network needs to learn; $A_k$ denotes the adjacency matrix (a known quantity, defined according to human anatomy) that describes the connection relationships of the skeleton; $\Lambda_k$ denotes the degree matrix (a known quantity) used for normalization; and $k$ indexes the different spatial division strategies (such as centripetal and centrifugal). The purpose of this step is to capture the spatial posture changes and temporal trajectory of sign language movements.
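The spatial part of this graph convolution can be sketched in NumPy as a sum, over partitions, of a symmetrically degree-normalized adjacency matrix applied to the input features and followed by a per-partition weight matrix. The function name, tensor layout `(frames, joints, channels)`, and the random-free toy inputs are assumptions; the temporal convolution that ST-GCN applies along the frame axis is omitted here.

```python
import numpy as np

def spatial_graph_conv(f_in, adjs, weights):
    """f_in: (T, V, C_in); adjs: list of (V, V); weights: list of (C_in, C_out)."""
    out = 0.0
    for A, W in zip(adjs, weights):
        A = np.asarray(A, dtype=float)
        deg = A.sum(axis=1)                      # diagonal of the degree matrix
        d_inv_sqrt = np.zeros_like(deg)
        nonzero = deg > 0
        d_inv_sqrt[nonzero] = deg[nonzero] ** -0.5
        # Symmetric normalization: D^{-1/2} A D^{-1/2}
        A_norm = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
        # Aggregate neighbor features, then mix channels with W.
        out = out + np.einsum("vu,tuc,cd->tvd", A_norm, f_in, W)
    return out

# Usage: with an identity adjacency and identity weights the input is unchanged.
f_in = np.arange(24, dtype=float).reshape(4, 3, 2)
f_out = spatial_graph_conv(f_in, [np.eye(3)], [np.eye(2)])
```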

[0097] The physiological feature extraction module 206 inputs the aligned physiological sequence map into a one-dimensional convolutional network, extracts the emotional arousal features hidden in the electrocardiogram data waveform through convolution operation, and encodes the emotional arousal features into a physiological feature vector (Physiological Embedding) containing emotional arousal information.
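The convolution step can be sketched as follows. This is a minimal NumPy illustration of a single one-dimensional convolutional layer followed by ReLU, not the network used in the embodiment: the kernel count, kernel width, stride, segment length, and the function name `conv1d_relu` are assumptions, and the random kernels stand in for trained weights. It relies on `numpy.lib.stride_tricks.sliding_window_view`, available in NumPy 1.20 and later.

```python
import numpy as np

def conv1d_relu(signal, kernels, stride=1):
    """Slide each kernel over a 1-D signal (valid padding) and apply ReLU.

    signal: (length,), kernels: (n_kernels, width) -> (n_steps, n_kernels)
    """
    width = kernels.shape[1]
    windows = np.lib.stride_tricks.sliding_window_view(signal, width)[::stride]
    return np.maximum(windows @ kernels.T, 0.0)

rng = np.random.default_rng(0)
ecg_segment = rng.normal(size=250)   # hypothetical aligned ECG samples
kernels = rng.normal(size=(8, 16))   # 8 untrained kernels of width 16
embedding = conv1d_relu(ecg_segment, kernels, stride=5)
print(embedding.shape)
```

Each row of the result is a feature vector for one time step of the ECG segment, so the output keeps a time axis that a later fusion stage can attend over.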

[0098] The emotional arousal features include the R-wave interval variation rate, instantaneous heart rate, standard deviation of RR interval (SDNN), and root mean square of the difference between adjacent RR intervals (RMSSD). These emotional arousal features reflect the subject's internal emotional fluctuations.
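For reference, these quantities can also be computed directly from R-peak times, as in the minimal sketch below. In the system described here they are learned implicitly by the convolutional network, so the code only illustrates what the features measure. The function name `hrv_features`, the sample R-peak times, the use of the sample standard deviation for SDNN, and the definition of the interval change rate as the relative change between successive RR intervals are assumptions.

```python
import numpy as np

def hrv_features(r_peak_times_s):
    """Compute arousal-related features from R-peak times given in seconds."""
    rr_ms = np.diff(np.asarray(r_peak_times_s, dtype=float)) * 1000.0
    successive_diff = np.diff(rr_ms)
    return {
        "instantaneous_hr_bpm": 60000.0 / rr_ms,
        # Assumed definition: relative change between successive RR intervals.
        "rr_change_rate": successive_diff / rr_ms[:-1],
        "sdnn_ms": float(np.std(rr_ms, ddof=1)),  # sample standard deviation
        "rmssd_ms": float(np.sqrt(np.mean(successive_diff ** 2))),
    }

# Hypothetical R-peak times; the RR intervals are 800, 800, 900, 800 ms.
features = hrv_features([0.0, 0.8, 1.6, 2.5, 3.3])
print(features["sdnn_ms"], features["rmssd_ms"])
```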

[0099] The cross-attention fusion module 207 is connected to the visual feature extraction module 205 and the physiological feature extraction module 206, respectively, and executes a cross-modal fusion algorithm: it maps the physiological feature vectors to a query matrix; maps the skeletal action feature vectors to key and value matrices; calculates the attention weight matrix; weights the skeletal action feature vectors according to the physiological feature vectors; and outputs the fused emotion feature vector.

[0100] The cross-attention fusion module 207 uses a cross-attention mechanism to fuse the skeleton action feature vector from the visual feature extraction module 205 with the physiological feature vector from the physiological feature extraction module 206, obtaining a fused emotion feature vector. This cross-modal fusion addresses the challenge of distinguishing actions that are visually similar but express different emotions.

[0101] As shown in Figure 3, the cross-attention mechanism includes the following three stages:

[0102] The first stage: Constructing the feature mapping and query matrix between the physiological feature vectors and the skeletal motion feature vectors.

[0103] The cross-attention fusion module 207 receives physiological feature vectors from the physiological feature extraction module 206 and skeletal action feature vectors from the visual feature extraction module 205, respectively.

[0104] The physiological feature vectors are mapped to a query matrix (Q) through built-in linear projection layers, while the skeletal motion feature vectors are mapped to a key matrix (K) and a value matrix (V) respectively.

[0105] The query matrix (Query, Q) contains the feature vectors representing the baseline of intrinsic emotion, output by the physiological feature extraction module 206. The key matrix (Key, K) contains the feature vectors representing external spatiotemporal action patterns, output by the visual feature extraction module 205. The value matrix (Value, V) contains the skeleton action feature vectors representing the high-dimensional semantic content of sign language, also output by the visual feature extraction module 205.

[0106] The second stage involves calculating the attention weights of the physiological feature vector and the skeletal motion feature vector, and then performing weighted fusion.

[0107] The attention weights, which measure the correlation between the physiological feature vectors and the skeletal motion feature vectors, and the fused output are calculated as follows:

[0108] $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

[0109] where: $Q$ represents the feature matrix output by the physiological feature extraction module 206, representing the baseline of the inner emotion; $K$ represents the feature matrix output by the visual feature extraction module 205, representing the external spatiotemporal action pattern; $V$ represents the feature matrix output by the visual feature extraction module 205, representing the high-dimensional semantic content of sign language; $\sqrt{d_k}$ represents the scaling factor, which is usually the square root of the feature dimension $d_k$ (a known constant); and $\mathrm{softmax}$ represents the normalized exponential function.

[0110] The attention weights and fusion formula calculate the correlation between physiological signals and each visual action moment. If the electrocardiogram characteristics (such as a sudden increase in heart rate) are significant at a certain moment, the corresponding visual action weight will be amplified. This allows the model to "focus" on those micro-movements that truly express emotions, while ignoring irrelevant background movements.
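The second stage can be sketched in NumPy as scaled dot-product cross-attention, with queries projected from the physiological features and keys and values projected from the skeleton features. This is a minimal single-head illustration under stated assumptions: the sequence lengths, channel sizes, and the function name `cross_attention` are hypothetical, and the random projection matrices stand in for the trained linear projection layers.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(phys, skel, W_q, W_k, W_v):
    """phys: (T_p, C_p) physiological features; skel: (T_s, C_s) skeleton features."""
    Q = phys @ W_q  # queries from the physiological modality
    K = skel @ W_k  # keys from the skeleton modality
    V = skel @ W_v  # values from the skeleton modality
    d_k = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # (T_p, T_s), each row sums to 1
    return attn @ V, attn

rng = np.random.default_rng(0)
phys = rng.normal(size=(6, 8))    # 6 time steps, 8 physiological channels
skel = rng.normal(size=(10, 16))  # 10 time steps, 16 skeleton channels
W_q = rng.normal(size=(8, 4))
W_k = rng.normal(size=(16, 4))
W_v = rng.normal(size=(16, 12))
fused, attn = cross_attention(phys, skel, W_q, W_k, W_v)
print(fused.shape, attn.shape)
```

Row `i` of `attn` shows how strongly physiological time step `i` weights each skeleton time step, which is the quantity the paragraph above describes as being amplified when the ECG features are salient.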

[0111] The third stage: Output the classification results of the cross-attention mechanism.

[0112] The input of the emotion classification module 208 is connected to the cross-attention fusion module 207, and its output is connected to the interactive output device 300. The emotion classification module 208 includes a global average pooling layer, a fully connected (FC) layer, and a Softmax classifier; it maps the fused emotion feature vector to a probability for each emotion category and outputs the label with the highest probability.

[0113] The emotion classification module 208 passes the emotion feature vector fused by the cross-attention fusion module 207 through the global average pooling layer and feeds the result into the fully connected (FC) layer for linear mapping; it then calculates the probability of each emotion category using the Softmax function and outputs the category with the highest probability as the final recognition result, which is displayed or announced through the interactive output device 300. The emotion categories include anger, happiness, sadness, and calmness.
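The classification head can be sketched as follows: global average pooling over the time axis, one fully connected layer, Softmax, and selection of the most probable label. The feature sizes, the function name `classify_emotion`, and the hand-picked weights and bias are assumptions for illustration only; a trained model would supply its own parameters.

```python
import numpy as np

EMOTIONS = ["anger", "happiness", "sadness", "calmness"]

def classify_emotion(fused, W_fc, b_fc):
    """fused: (time_steps, channels); W_fc: (channels, 4); b_fc: (4,)."""
    pooled = fused.mean(axis=0)           # global average pooling over time
    logits = pooled @ W_fc + b_fc         # fully connected layer
    exp = np.exp(logits - logits.max())   # numerically stable Softmax
    probs = exp / exp.sum()
    return EMOTIONS[int(np.argmax(probs))], probs

rng = np.random.default_rng(0)
fused = rng.normal(size=(6, 12))          # hypothetical fused feature sequence
W_fc = np.zeros((12, 4))                  # zero weights so only the bias decides
b_fc = np.array([0.0, 3.0, 0.0, 0.0])     # bias favouring "happiness"
label, probs = classify_emotion(fused, W_fc, b_fc)
print(label, probs.round(3))
```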

[0114] The interactive output device 300 includes a display screen and / or a voice synthesis speaker.

[0115] The interactive output device 300 is connected to the output port of the main control computing device 200 and is used to provide feedback to the user on the finally identified emotion category, such as displaying the text "angry" or "happy".

[0116] To further verify the effectiveness of the present invention, it was compared with existing mainstream time-series processing models (including LSTM, Bi-LSTM, Transformer) and single-modal benchmark models under the same experimental environment.

[0117] 4.1 Comparison of experimental results

[0118] The experiment statistically analyzed the performance of each model on four metrics: accuracy, precision, recall, and F1 score. Key data comparisons (means) are shown in the table below:

[0119]

[0120] Analysis of the above experimental data:

[0121] 1. Superiority of multimodal fusion: The ES-GCN model (bimodal) proposed in this invention achieves a recognition accuracy of 86.67%, outperforming the single-modal Hand-ST-GCN (85.25%) and ECG-ST-GCN (84.63%) by 1.42 and 2.04 percentage points, respectively. This indicates that introducing and effectively fusing ECG signals can compensate for the deficiencies of the visual modality and improve overall recognition performance.

[0122] 2. Superiority of the cross-attention mechanism: Under the same "ECG+Hand" bimodal input conditions, the ES-GCN architecture adopted in this invention improves accuracy by 6.38 percentage points (86.67% vs 80.29%) over the traditional Transformer fusion architecture, and by 10.15 percentage points (86.67% vs 76.52%) over the Bi-LSTM fusion architecture. This indicates that the cross-attention mechanism designed in this invention, which "uses physiological features to query visual features," captures emotional features more effectively than simple feature concatenation or temporal attention.

[0123] 3. High-precision emotion discrimination capability: The precision of this invention reaches 89.69%, the highest among all comparison groups. In practical applications, this corresponds to a low false alarm rate: when the system reports an emotion category, that category is usually the correct one.

[0124] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications to the technical solutions described in the foregoing embodiments, or equivalent substitutions for some or all of the technical features, do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A sign language emotion recognition system based on electrocardiogram and skeleton multimodal fusion, characterized in that it comprises: an acquisition device (100), a main control computing device (200), and an interactive output device (300); the acquisition device (100) includes a visual acquisition device (101) and a physiological acquisition device (102); the visual acquisition device (101) acquires video frame image data, and the physiological acquisition device (102) simultaneously acquires electrocardiogram data; the main control computing device (200) is communicatively connected to the acquisition device (100); the main control computing device (200) includes a synchronous acquisition module (201), a preprocessing alignment module (202), a skeleton extraction module (203), a physiological coding module (204), a multimodal feature extraction unit, a cross-attention fusion module (207), and an emotion classification module (208); the synchronous acquisition module (201) acquires, in parallel, the video frame image data transmitted by the visual acquisition device (101) and the electrocardiogram data transmitted by the physiological acquisition device (102), and, at the moment of receiving each video frame image data and each electrocardiogram data, adds an image timestamp to the image data and an electrocardiogram data timestamp to the electrocardiogram data; the preprocessing alignment module (202) performs time-series alignment of the image timestamps and the electrocardiogram data timestamps; the skeleton extraction module (203) integrates a pose estimation algorithm to extract the coordinates of key points of the human upper body skeleton from the aligned video frame data and construct a skeleton spatiotemporal graph; the physiological coding module (204) is used to construct, from the aligned electrocardiogram data, a physiological sequence map containing the temporal changes in the subject's physiological arousal level; the multimodal feature extraction unit includes an ES-GCN model and is used to input the processed skeleton spatiotemporal graph and physiological sequence map into the ES-GCN model to extract skeleton action feature vectors and physiological feature vectors; the ES-GCN model includes a visual feature extraction module (205) and a physiological feature extraction module (206) arranged in parallel; the visual feature extraction module (205) inputs the skeleton spatiotemporal graph into the ST-GCN module, and the ST-GCN module encodes the skeleton spatiotemporal graph into a visual feature vector containing spatiotemporal dynamic information and then uses the visual feature vector to extract the skeleton action feature vector through a spatiotemporal graph convolution operation; the physiological feature extraction module (206) inputs the aligned physiological sequence map into a one-dimensional convolutional network, extracts the emotional arousal features hidden in the electrocardiogram data waveform through a convolution operation, and encodes the emotional arousal features into a physiological feature vector containing emotional arousal information; the cross-attention fusion module (207) uses the cross-attention mechanism to fuse the skeleton action feature vector of the visual feature extraction module (205) and the physiological feature vector of the physiological feature extraction module (206) to obtain the fused emotion feature vector; and the interactive output device (300) is used to provide feedback to the user on the finally identified emotion category.

2. The sign language emotion recognition system based on ECG and skeleton multimodal fusion according to claim 1, characterized in that the visual acquisition device (101) uses a monocular high-definition camera, and the physiological acquisition device (102) uses a portable single-lead electrocardiogram sensor.

3. The sign language emotion recognition system based on ECG and skeleton multimodal fusion according to claim 1, characterized in that the synchronous acquisition module (201), upon receiving each video frame image data and each electrocardiogram (ECG) data, adding an image timestamp to the image data and an ECG data timestamp to the ECG data includes: upon receiving each frame of image data and each ECG data, invoking the system master clock to immediately read the current system time in milliseconds, and adding an image timestamp to the image data and an ECG data timestamp to the ECG data; and, based on the image timestamps, dividing the continuous video frame image data into video frame image data sample segments with a duration of L seconds, and, based on the ECG data timestamps, dividing the continuous ECG data into ECG data sample segments with a duration of L seconds.

4. The sign language emotion recognition system based on ECG and skeleton multimodal fusion according to claim 3, characterized in that the preprocessing alignment module (202) performing time-series alignment of the image timestamps and the ECG data timestamps includes: the preprocessing alignment module (202) extracts the coordinates of key points of the upper body skeleton from each video frame image data using a human pose estimation algorithm, and interpolates or downsamples the ECG data according to the ECG data timestamps so that the ECG data timestamps are strictly aligned with the video frame data timestamps, generating time-aligned multimodal data pairs.

5. The sign language emotion recognition system based on ECG and skeleton multimodal fusion according to claim 1, characterized in that, in the skeleton extraction module (203), the skeleton spatiotemporal graph takes the form of an undirected graph $G = (V, E)$; $V$ represents the set of nodes of the undirected graph, which contains all the skeleton keypoints in a frame; $E$ represents the set of edges, which includes spatial edges and temporal edges; a spatial edge connects joints that are adjacent in the human physiological structure; and a temporal edge connects the positions of the same joint in consecutive frames.

6. The sign language emotion recognition system based on ECG and skeleton multimodal fusion according to claim 1, characterized in that the spatiotemporal graph convolution formula of the visual feature extraction module (205) is as follows: given the input $f_{in}$ of the ST-GCN module, the output $f_{out}$ is calculated as: $f_{out} = \sum_{k} \Lambda_k^{-\frac{1}{2}} A_k \Lambda_k^{-\frac{1}{2}} f_{in} W_k$; where: $f_{out}$ denotes the output of the ST-GCN module; $f_{in}$ denotes the skeleton feature tensor input to the ST-GCN module; $W_k$ denotes the weight matrix that the network needs to learn; $A_k$ denotes the adjacency matrix that describes the connection relationships of the skeleton; $\Lambda_k$ denotes the degree matrix used for normalization; and $k$ indexes the different spatial partitioning strategies.

7. The sign language emotion recognition system based on ECG and skeleton multimodal fusion according to claim 1, characterized in that the emotional arousal features include the R-wave interval change rate, instantaneous heart rate, standard deviation of RR intervals, and root mean square of the differences between adjacent RR intervals.

8. The sign language emotion recognition system based on ECG and skeleton multimodal fusion according to claim 1, characterized in that the cross-attention mechanism of the cross-attention fusion module (207) includes the following three stages: the first stage: constructing the feature mapping and query matrix between the physiological feature vectors and the skeletal motion feature vectors; the cross-attention fusion module (207) receives physiological feature vectors from the physiological feature extraction module (206) and skeleton action feature vectors from the visual feature extraction module (205), respectively; the physiological feature vector is mapped to a query matrix through a built-in linear projection layer, and the skeletal motion feature vector is mapped to a key matrix and a value matrix, respectively; the query matrix contains feature vectors representing the baseline of inner emotions output by the physiological feature extraction module (206); the key matrix contains feature vectors representing the external spatiotemporal action patterns output by the visual feature extraction module (205); and the value matrix contains skeleton action feature vectors representing the high-dimensional semantic content of sign language output by the visual feature extraction module (205); the second stage: calculating the attention weights of the physiological feature vector and the skeletal motion feature vector, and then performing weighted fusion; the attention weights, which measure the correlation between the physiological feature vectors and the skeletal motion feature vectors, and the fused output are calculated as follows: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$; where: $Q$ represents the feature matrix representing the baseline of inner emotion output by the physiological feature extraction module (206); $K$ represents the feature matrix representing the external spatiotemporal action pattern output by the visual feature extraction module (205); $V$ represents the feature matrix representing the high-dimensional semantic content of sign language output by the visual feature extraction module (205); $\sqrt{d_k}$ represents the scaling factor; and $\mathrm{softmax}$ represents the normalized exponential function; the third stage: outputting the classification results of the cross-attention mechanism.

9. The sign language emotion recognition system based on ECG and skeleton multimodal fusion according to claim 8, characterized in that the emotion classification module (208) includes a global average pooling layer, a fully connected layer, and a Softmax classifier; the emotion classification module (208) passes the emotion feature vector fused by the cross-attention fusion module (207) through the global average pooling layer and inputs it into the fully connected layer for linear mapping; it uses the Softmax function to calculate the probability value of each emotion category, and outputs the category with the highest probability as the final recognition result; the emotion categories include anger, happiness, sadness, and calmness.

10. The sign language emotion recognition system based on ECG and skeleton multimodal fusion according to claim 1, characterized in that the interactive output device (300) includes a display screen and/or a speech synthesis speaker.