An audio-visual bimodal emotion recognition system and method for assisting robot decision making
By using audiovisual feature encoding and dynamic fusion modules to isolate the speaker's identity, the robustness of the model is enhanced. This solves the problems of poor generalization ability and insufficient robustness of audiovisual emotion recognition models in cross-domain and cross-person recognition, thereby improving the rationality of robot decision-making and the interactive experience.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- NANKAI UNIV
- Filing Date
- 2026-02-03
- Publication Date
- 2026-06-26
Abstract
Description
Technical Field
[0001] This invention belongs to the field of artificial intelligence and pattern recognition technology, and in particular relates to an audiovisual bimodal emotion recognition system and method for assisting robot decision-making.
Background Technology
[0002] Emotion recognition has significant application value in fields such as human-computer interaction, robot-assisted decision-making, and mental health monitoring. Existing emotion recognition methods are mainly based on single-modal or multi-modal fusion of vision, speech, or text, with audiovisual bimodal methods becoming a research hotspot due to their complementary information. Traditional methods typically rely on large amounts of labeled data, performing well on specific datasets, but exhibiting poor generalization ability across domains, individuals, and real-world scenarios. In existing technologies, deep learning-based emotion recognition methods are prone to overfitting to speaker identity features, leading to performance degradation on unseen individuals. Furthermore, speech information in real-world scenarios is often missing or subject to noise interference, and existing multi-modal fusion mechanisms lack robustness to missing modalities.
[0003] The defects and shortcomings of the existing technology are as follows: 1. The model is susceptible to interference from speaker identity features and has poor cross-individual generalization ability. Existing deep learning-based audiovisual emotion recognition models tend to learn and rely on static identity features (such as facial structure and timbre) related to specific speakers during training, rather than more universal dynamic emotion expression patterns. This leads to a significant drop in recognition performance when facing new users outside the training set, severely limiting the usability of assistive robots in diverse real-world service scenarios.
[0004] 2. Insufficient robustness to modal loss or impairment. In real-world human-computer interaction, audio information is often partially or completely lost due to environmental noise, user silence, or device limitations. Most existing multimodal fusion methods assume that bimodal information is always complete and lack effective mechanisms to handle situations where single-modal (especially speech) information is incomplete or has an extremely low signal-to-noise ratio, leading to system performance crashes when critical modalities fail.
[0005] 3. Multimodal information fusion strategies are static and coarse, failing to fully explore the complementary relationships between modalities. Mainstream methods often employ simple early feature concatenation or late decision score averaging, which are static fusion approaches. These strategies cannot dynamically evaluate and weigh the contributions of different modalities based on the specific content and quality of the input signal, nor can they achieve deep cross-modal interaction and semantic alignment at the feature level, thus limiting the efficiency of information fusion and the final discrimination accuracy.
[0006] 4. The recognition model is disconnected from decision-making applications, failing to form a task-oriented optimization loop. Existing research often optimizes emotion recognition as an independent perception module, with its training objectives (such as classification accuracy) not directly related to the optimal performance of downstream robot decision-making tasks (such as user satisfaction and interaction fluency). This "perception-decision" separation design means that the feature representation of the recognition model may not be the most favorable representation for triggering optimal robot behavior, making it difficult to achieve system-level optimization of the interactive experience.
Summary of the Invention
[0007] The purpose of this invention is to overcome the shortcomings of the prior art and propose an audiovisual bimodal emotion recognition system and method to assist robot decision-making, so as to solve the problems of weak generalization ability, poor robustness, static fusion strategy and perception-decision disconnect in the prior art.
[0008] The technical problem solved by this invention is achieved through the following technical solution: An audiovisual bimodal emotion recognition system for assisting robot decision-making includes an audiovisual feature encoder, a domain adversarial de-identification module, a masked modality contrastive learning module, a video-guided multimodal attention-gated fusion module, an emotion classifier, and a robot decision-making strategy module. These modules are sequentially connected. The audiovisual feature encoder acquires and preprocesses synchronized audiovisual data, extracting and aligning visual and audio features. The domain adversarial de-identification module learns cross-domain invariant emotion features. The masked modality contrastive learning module enhances the model's robustness to modality loss. The video-guided multimodal attention-gated fusion module enables adaptive multimodal feature fusion guided by video. The emotion classifier outputs an emotion category prediction. The robot decision-making strategy module generates decision actions based on the emotion recognition results.
[0009] A recognition method for an audiovisual bimodal emotion recognition system that assists robot decision-making includes the following steps: Step 1: The audiovisual feature encoder acquires synchronized visual and audio data and performs preprocessing; it extracts visual and audio features from the preprocessed data and projects them onto a common feature space for alignment to obtain aligned visual and audio features. Step 2: Input the aligned visual features and aligned audio features into the domain adversarial de-identification module. By jointly optimizing the emotion classification task and the adversarial domain discrimination task, cross-domain invariant emotion features stripped of speaker identity information are learned. Step 3: During the training phase, the masked modality contrastive learning module applies random masks to the aligned audio features and uses the contrastive learning loss function to constrain the relationship between the audio features before and after masking and the corresponding visual features in the representation space, so as to enhance the robustness of the model to missing audio information. Step 4: Input the features processed in Steps 2 and 3 into the video-guided multimodal attention-gated fusion module. Inter-modal information interaction is performed through the cross-attention mechanism, and visual and audio features are dynamically fused using adaptively generated gating vectors to obtain the final emotion fusion features. Step 5: Input the emotion fusion features into the emotion classifier to obtain the emotion category prediction results. Step 6: Input the emotion category prediction results into the robot decision-making strategy module to generate robot interaction actions.
[0010] Furthermore, step 1 includes the following steps: Step 1.1: The audiovisual feature encoder acquires synchronized visual and audio data and segments the video data into a fixed-length sequence of image frames $V = \{v_1, \dots, v_T\}$, where $T$ is the number of visual frames sampled and each $v_t$ is a tensor of dimension $H \times W \times C$ representing the visual content of one time slice; the original audio waveform $A$ synchronized with the video is processed to a fixed length $L$, where $L$ is determined by the audio sampling rate and duration; Step 1.2: The audiovisual feature encoder uses a pre-trained Vision Transformer network to extract visual features from the visual data; Step 1.3: The audiovisual feature encoder uses a pre-trained Audio Spectrogram Transformer to extract audio features from the audio data.
[0011] Furthermore, step 2 includes the following steps: Step 2.1: The domain adversarial de-identification module uses a lightweight multilayer perceptron $\mathrm{MLP}$. It receives the concatenated audiovisual features and extracts the preliminarily fused emotion-related features: $F_{emo} = \mathrm{MLP}([F_v; F_a])$, where $[\cdot\,;\cdot]$ denotes concatenation along the feature dimension, $F_v$ is the preprocessed visual modal features, $F_a$ is the preprocessed audio modal features, $b$ is the bias vector of the multilayer perceptron, $T$ is the time step of the feature sequence, $d_e$ is the fused emotional feature dimension, and $\mathbb{R}^{d_e}$ is the $d_e$-dimensional real vector space; Step 2.2: The domain adversarial de-identification module uses an identity classifier $C_{id}$ to predict the speaker identity label $y_{id}$ from the features $F_{emo}$: $L_{id} = \mathrm{CE}(C_{id}(F_{emo}), y_{id})$, where $\mathrm{CE}$ represents the cross-entropy loss and $L_{id}$ is the identity classification loss; Step 2.3: The domain adversarial de-identification module uses the domain discriminator $D$ to distinguish whether features come from the training set or from simulated unknown users or environments, and uses a gradient reversal layer $\mathrm{GRL}_{\lambda}$ to achieve the adversarial effect: $L_{dom} = \mathrm{CE}(D(\mathrm{GRL}_{\lambda}(F_{emo})), y_{dom})$. Finally, the optimization objective of the encoder (the multilayer perceptron) is to minimize the emotion classification loss while maximizing the domain discrimination loss, thereby stripping away domain-specific information and outputting the de-identified emotion features $F_{emo}$.
[0012] Furthermore, step 3 includes the following steps: Step 3.1: The masked modality contrastive learning module enhances the model's robustness when speech information is missing: certain time segments of the speech feature sequence $F_a$ are randomly masked to obtain the masked features $\hat{F}_a$. The visual features and the corresponding audio features of the same sample are treated as positive sample pairs, and hard negative sample pairs are also designated; Step 3.2: The masked modality contrastive learning module introduces contrastive learning at the feature level: a projection head $g(\cdot)$ (such as an MLP) maps the features to the contrastive learning space. For a single sample $i$ within a batch: $z_i^v = g(F_i^v)$, $z_i^a = g(F_i^a)$, where the superscript indicates whether the features are derived from the visual or the audio modality. The loss function is the normalized temperature-scaled cross-entropy (NT-Xent) loss: $\ell_i = -\log \frac{\exp(\mathrm{sim}(z_i^v, z_i^a)/\tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(z_i^v, z_j^a)/\tau)}$, $L_{con} = \frac{1}{N} \sum_{i=1}^{N} \ell_i$, where $\mathrm{sim}(\cdot,\cdot)$ denotes cosine similarity, $\tau$ is the temperature hyperparameter, and $N$ is the batch size.
[0013] Moreover, the specific implementation method of step 4 is as follows: using visual features as query vectors and audio features as key vectors and value vectors, calculate visually guided audio attention to obtain context-modulated audio features; generate gating vectors based on the modulated visual features and audio features; and perform weighted summation of the modulated visual features, audio features, and their interaction features according to the gating vectors to obtain the emotion fusion features.
[0014] Furthermore, the specific implementation method of step 5 is as follows: the emotion fusion features $F_{fused}$ are input into the emotion classifier $C_{emo}$, which predicts the probability distribution of emotion categories $\hat{y}$; the corresponding loss is the emotion classification loss $L_{emo}$.
[0015] Furthermore, the specific implementation method of step 6 is as follows: the robot decision-making strategy module takes the emotion recognition result $\hat{y}$ and the original features as input and outputs a decision action $u$. The decision-making strategy $\pi$ is based on predefined rules.
[0016] Furthermore, in step 3 of the training phase, a reward signal based on interaction effects is introduced and jointly optimized with the emotion recognition loss. The total loss function is a weighted sum of the individual loss terms, where the weights are hyperparameters that balance the various losses.
[0017] The advantages and positive effects of this invention are: 1. This invention introduces a domain adversarial training mechanism based on a gradient reversal layer (GRL) to actively and explicitly remove domain-specific features related to speaker identity and the acquisition environment. This forces the model to learn more universal, cross-domain invariant core emotion expression patterns, thereby achieving robust generalization across identities and scenarios. This directly addresses the sharp performance drop that existing models show on new users and in new environments due to overfitting to individual identity features. It ensures the consistency and reliability of the assistive robot's emotion perception when facing diverse users and complex scenarios, laying a solid foundation for subsequent decision-making.
[0018] 2. The masked modal contrastive learning strategy proposed in this invention actively constructs challenging samples with missing speech features during the training phase. Through a carefully designed contrastive learning task, it forces the model to mine and strengthen the deep complementary correlations between audiovisual modalities, thereby significantly enhancing the model's inherent robustness to missing or damaged speech information. This enables the system to maintain stable and reliable emotion discrimination capabilities even in real-world complex scenarios (such as high noise or user silence), relying only on incomplete audiovisual input. It effectively reduces robot decision-making delays or misjudgments caused by missing information, ensuring the immediacy and accuracy of interactive responses.
[0019] 3. The video-guided attention-gated fusion module designed in this invention completely abandons static fusion strategies such as early splicing or late averaging. It innovatively achieves bidirectional guidance and deep fusion of contextual information between modalities through a cross-attention mechanism, and utilizes learnable adaptive gating to dynamically calibrate the contribution weights of each modality. This perceptually adaptive dynamic fusion mechanism greatly improves the utilization efficiency and discrimination accuracy of multimodal information. When applied to assistive robots, it can intelligently adjust the fusion strategy based on real-time signal quality and emotional content, thereby achieving more refined and reliable emotion perception and supporting more reasonable and context-adaptive robot decision-making.
[0020] 4. This invention innovatively integrates an emotion recognition model with a robot decision-making strategy module through task-driven end-to-end joint training, constructing a closed-loop optimization framework that combines perception and decision-making. This framework introduces reward signals based on interaction effects, directly guiding the learning process of emotion features with the goal of optimizing the final robot behavior utility, ensuring that the learned feature representations are most conducive to triggering reasonable and effective robot responses. This system-level optimization breaks through the limitations of module fragmentation in traditional recognition-decision pipelines, achieving a qualitative leap from "accurate perception" to "humanized interaction." For example, it can autonomously and smoothly initiate comfort strategies based on detected sadness, greatly enhancing the overall experience and application value of human-computer interaction.
Attached Figure Description
[0021] Figure 1 is a flowchart of the present invention; Figure 2 is a schematic diagram illustrating the deployment of the model and its integration with the robot in this invention.
Detailed Implementation
[0022] The present invention will be further described in detail below with reference to the accompanying drawings.
[0023] An audiovisual bimodal emotion recognition system for assisting robot decision-making includes an audiovisual feature encoder, a domain adversarial de-identification module, a masked modality contrastive learning module, a video-guided multimodal attention-gated fusion module, an emotion classifier, and a robot decision-making strategy module. These modules are sequentially connected. The audiovisual feature encoder acquires and preprocesses synchronized audiovisual data, extracting and aligning visual and audio features. The domain adversarial de-identification module learns cross-domain invariant emotion features. The masked modality contrastive learning module enhances the model's robustness to modality loss. The video-guided multimodal attention-gated fusion module enables adaptive multimodal feature fusion guided by video. The emotion classifier outputs an emotion category prediction. The robot decision-making strategy module generates decision actions based on the emotion recognition results.
[0024] A recognition method for an audiovisual bimodal emotion recognition system to assist robot decision-making, as shown in Figure 1, includes the following steps: Step 1: The audiovisual feature encoder acquires synchronized visual and audio data and performs preprocessing; it extracts visual and audio features from the preprocessed data and projects them into a common feature space for alignment to obtain aligned visual and audio features.
[0025] Step 1.1: The audiovisual feature encoder acquires synchronized visual and audio data and segments the video data into a fixed-length sequence of image frames $V = \{v_1, \dots, v_T\}$, where $T$ is the number of visual frames sampled and each $v_t$ is a tensor of dimension $H \times W \times C$ representing the visual content of one time slice; the original audio waveform $A$ synchronized with the video is processed to a fixed length $L$, where $L$ is determined by the audio sampling rate and duration.
[0026] Step 1.2: The audiovisual feature encoder uses a pre-trained Vision Transformer network to extract visual features from the visual data.
[0027] A pre-trained Vision Transformer (ViT) network is used as the visual backbone encoder. Each frame $v_t$ is input into ViT independently to extract its high-level semantic features. For a batch of $B \times T$ frame images, the model output is $F_{frame}$ with shape $(B, T, d_{vit})$, where $d_{vit}$ is the feature dimension output by ViT.
[0028] To capture dynamic emotional changes in video sequences, the frame-level features of each video segment (shape $(T, d_{vit})$) are input into a multi-scale temporal feature aggregator. This aggregator uses multiple 1D convolutional layers with different kernel sizes (e.g., kernel sizes of 3, 5, and 7) to process the temporal dimension in parallel, then performs multi-scale feature fusion and adaptive pooling, finally outputting a time-independent aggregated visual feature vector $f_v^{agg}$ that represents the emotional context of the entire video, where $f_v^{agg} \in \mathbb{R}^{d_{agg}}$.
[0029] Through a linear projection layer $W_v$, the aggregated visual features $f_v^{agg}$ are mapped to a common dimension space aligned with the other modal features, yielding $F_v \in \mathbb{R}^{d}$, where $d$ is the alignment dimension (set to 256 in subsequent modules).
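The multi-scale temporal aggregation described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the patented implementation: each branch is simplified to a depthwise 1D convolution whose taps are shared across feature channels (real layers would have learned per-channel weights), branch outputs are fused by element-wise averaging, and adaptive pooling is taken to mean average pooling to length 1. The names `conv1d_same` and `multi_scale_aggregate` are hypothetical.

```python
import numpy as np

def conv1d_same(x, kernel):
    """Depthwise 1-D convolution over time with zero padding ('same' length).

    x: (T, D) frame features; kernel: (k,) taps shared across all D channels.
    """
    kernel = np.asarray(kernel, dtype=float)
    k = len(kernel)
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))  # pad the time axis only
    return np.stack(
        [(xp[t:t + k] * kernel[:, None]).sum(axis=0) for t in range(x.shape[0])]
    )

def multi_scale_aggregate(x, kernels):
    """Run one branch per kernel, average the branches, then pool over time."""
    branches = [conv1d_same(x, k) for k in kernels]
    fused = np.mean(branches, axis=0)   # (T, D) multi-scale fusion by averaging
    return fused.mean(axis=0)           # (D,) average pooling to length 1

# Example: 75 frames of 8-dim features, moving-average kernels of size 3, 5, 7.
frames = np.ones((75, 8))
clip_vector = multi_scale_aggregate(frames, [np.ones(k) / k for k in (3, 5, 7)])
```

The result is one fixed-length vector per clip regardless of the number of frames, which is what lets the later linear projection work on a time-independent input.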
[0030] Step 1.3: The audio-visual feature encoder uses a pre-trained Audio Spectrogram Transformer to extract audio features from the audio data.
[0031] A pre-trained Audio Spectrogram Transformer (AST) is used as the audio backbone encoder. AST receives the raw audio waveform $A$ (or a preprocessed Mel spectrogram) and outputs a high-level semantic feature vector $f_{ast}$, where $f_{ast} \in \mathbb{R}^{d_{ast}}$.
[0032] In parallel, a set of low-level acoustic statistical features are extracted from the original audio waveform to supplement the semantic information captured by the AST. These features include: MFCC (Mel-frequency cepstral coefficients): 20th-order MFCCs are calculated and their mean is taken to obtain a 20-dimensional feature. RMS (Root Mean Square Energy): The root mean square energy of the audio signal is calculated to obtain a 1-dimensional feature. ZCR (Zero Crossing Rate): The zero crossing rate of the audio signal is calculated to obtain a 1-dimensional feature.
[0033] After concatenating the above features, a 22-dimensional statistical feature vector $f_{stat} \in \mathbb{R}^{22}$ is obtained.
[0034] The AST semantic features $f_{ast}$ and the statistical features $f_{stat}$ are concatenated along the feature dimension to obtain $f_a^{cat} = [f_{ast}; f_{stat}]$. Subsequently, a linear projection layer $W_a$ maps it to the common dimensional space aligned with the visual features, yielding $F_a \in \mathbb{R}^{d}$.
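As a rough illustration of the low-level statistics, the sketch below computes the mean RMS energy and mean zero-crossing rate with plain NumPy and assembles the 22-dimensional vector. The frame length of 400 samples and hop of 160 samples are assumptions chosen for illustration, and the 20-dimensional MFCC mean is taken as an input here rather than computed; the function names are hypothetical.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D waveform into overlapping frames (no padding)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])

def rms_and_zcr(x, frame_len=400, hop=160):
    """Return the mean RMS energy and mean zero-crossing rate over frames."""
    frames = frame_signal(np.asarray(x, dtype=np.float64), frame_len, hop)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    # A zero crossing is a sign change between two consecutive samples.
    signs = np.signbit(frames)
    zcr = np.mean(signs[:, 1:] != signs[:, :-1], axis=1)
    return float(rms.mean()), float(zcr.mean())

def statistical_features(mfcc_mean, x):
    """Concatenate a 20-dim MFCC mean with RMS and ZCR into a 22-dim vector."""
    rms, zcr = rms_and_zcr(x)
    return np.concatenate([np.asarray(mfcc_mean, dtype=np.float64), [rms, zcr]])

# Example: one second of a 100 Hz sine sampled at 16 kHz.
t = np.arange(16000) / 16000.0
wave = np.sin(2 * np.pi * 100 * t)
stats = statistical_features(np.zeros(20), wave)  # MFCC mean is a placeholder
```

For a unit-amplitude sine the RMS comes out close to 0.707, which is a quick sanity check that the framing and averaging behave as expected.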
[0035] Step 2: Input the aligned visual features and aligned audio features into the domain adversarial de-identification module. By jointly optimizing the emotion classification task and the adversarial domain discrimination task, cross-domain invariant emotion features stripped of speaker identity information are learned.
[0036] The domain-adversarial de-identification module aims to learn emotion-discriminating features from raw audiovisual data through a domain-adversarial training mechanism, while suppressing speaker-identity-related features, ultimately obtaining a high-level emotion representation with cross-domain invariance. Specifically, this module constructs a multi-task learning framework that, by jointly optimizing the emotion classification task and the domain classification task, forces the feature encoder to learn feature representations that can effectively distinguish emotion categories but remain robust to changes in speaker identity and the acquisition domain.
[0037] Step 2.1: The domain adversarial de-identification module uses a lightweight multilayer perceptron $\mathrm{MLP}$. It receives the concatenated audiovisual features and extracts the preliminarily fused emotion-related features: $F_{emo} = \mathrm{MLP}([F_v; F_a])$, where $[\cdot\,;\cdot]$ denotes concatenation along the feature dimension, $F_v$ is the preprocessed visual modal features, $F_a$ is the preprocessed audio modal features, $b$ is the bias vector of the multilayer perceptron, $T$ is the time step of the feature sequence, $d_e$ is the fused emotional feature dimension, and $\mathbb{R}^{d_e}$ is the $d_e$-dimensional real vector space; Step 2.2: The domain adversarial de-identification module uses an identity classifier $C_{id}$ to predict the speaker identity label $y_{id}$ from the features $F_{emo}$: $L_{id} = \mathrm{CE}(C_{id}(F_{emo}), y_{id})$, where $\mathrm{CE}$ represents the cross-entropy loss and $L_{id}$ is the identity classification loss.
[0038] Step 2.3: The domain adversarial de-identification module uses the domain discriminator $D$ to distinguish whether the features originate from the "source domain" (training set) or the "target domain" (simulated unknown user / environment). The features extracted by the encoder should be able to "deceive" $D$, so that the discriminator cannot tell which domain they come from.
[0039] This invention uses a gradient reversal layer $\mathrm{GRL}_{\lambda}$ to achieve the adversarial effect. In forward propagation, $\mathrm{GRL}_{\lambda}$ is an identity transformation, $\mathrm{GRL}_{\lambda}(x) = x$; in backpropagation, the gradient of $L_{dom}$ passed back with respect to $F_{emo}$ is multiplied by a negative coefficient $-\lambda$: $\frac{\partial\, \mathrm{GRL}_{\lambda}(x)}{\partial x} = -\lambda I$. Finally, the optimization objective of the encoder (the multilayer perceptron) is to minimize the emotion classification loss while maximizing the domain discrimination loss, thereby stripping away domain-specific information and outputting the de-identified emotion features $F_{emo}$.
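The behavior of the gradient reversal layer can be shown with a small hand-written forward/backward pair. This is only a sketch of the mechanism: the class name `GradientReversal` and the coefficient value 0.5 are assumptions for illustration, and in a deep learning framework the same idea would be expressed as a custom autograd operation.

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; scales gradients by -lam in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # reversal strength (hypothetical value chosen by the user)

    def forward(self, x):
        # Features pass through unchanged on the way to the domain discriminator.
        return x

    def backward(self, grad_output):
        # The gradient coming back from the discriminator is flipped and scaled,
        # so the encoder is pushed to increase the discriminator's loss.
        return -self.lam * grad_output

grl = GradientReversal(lam=0.5)
features = np.array([1.0, -2.0, 3.0])
out = grl.forward(features)
grad_from_discriminator = np.array([0.2, 0.4, -0.6])
grad_to_encoder = grl.backward(grad_from_discriminator)
```

Because only the backward pass differs, the discriminator itself still trains normally to separate the domains, while everything upstream of the layer receives the opposite signal.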
[0040] Step 3: During the training phase, the masked modality contrastive learning module applies random masking to the aligned audio features. It then uses the contrastive learning loss function to constrain the relationship between the audio features before and after masking and the corresponding visual features in the representation space, thereby enhancing the model's robustness to missing audio information.
[0041] This step receives the original features from step 2 and enhances them to improve their robustness, thus providing a more robust feature input for the final fusion classification in step 4.
[0042] Step 3.1: The masked modality contrastive learning module enhances the model's robustness when speech information is missing: certain time segments of the speech feature sequence $F_a$ are randomly masked (replaced with zero vectors or a learnable mask vector) to obtain the masked features $\hat{F}_a$. The visual features and the corresponding audio features of the same sample are treated as positive sample pairs, and hard negative sample pairs are also designated (or negative pairs are constructed from other samples in the batch).
[0043] Step 3.2: The masked modality contrastive learning module introduces contrastive learning at the feature level: a projection head $g(\cdot)$ (such as an MLP) maps the features to the contrastive learning space. For a single sample $i$ within a batch: $z_i^v = g(F_i^v)$, $z_i^a = g(F_i^a)$, where the superscript indicates whether the features are derived from the visual or the audio modality. The loss function is the normalized temperature-scaled cross-entropy (NT-Xent) loss: $\ell_i = -\log \frac{\exp(\mathrm{sim}(z_i^v, z_i^a)/\tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(z_i^v, z_j^a)/\tau)}$, $L_{con} = \frac{1}{N} \sum_{i=1}^{N} \ell_i$, where $\mathrm{sim}(\cdot,\cdot)$ denotes cosine similarity, $\tau$ is the temperature hyperparameter, and $N$ is the batch size.
[0044] Steps 3.1 and 3.2 constitute a complete closed loop of contrastive learning, with "policy construction" and "target computation" being interdependent. Step 3.1 constructs positive and negative sample pairs through masking to simulate speech loss scenarios; step 3.2 maps these samples to the contrast space and calculates the loss function, establishing the "learning standard" at the mathematical level and driving the optimization of model parameters.
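A cross-modal contrastive loss of the kind described in step 3.2 can be sketched as below. The sketch assumes the standard NT-Xent / InfoNCE form with cosine similarity, in-batch negatives, and a temperature of 0.1; the function names and the temperature value are illustrative assumptions rather than values taken from the source.

```python
import numpy as np

def l2_normalize(z, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return z / (np.linalg.norm(z, axis=1, keepdims=True) + eps)

def nt_xent(z_query, z_key, tau=0.1):
    """Cross-modal NT-Xent: row i of z_query should match row i of z_key.

    z_query, z_key: (N, D) projected features, e.g. visual and audio projections.
    """
    q = l2_normalize(z_query)
    k = l2_normalize(z_key)
    logits = q @ k.T / tau                               # (N, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))            # positives on the diagonal

rng = np.random.default_rng(0)
z_visual = rng.normal(size=(8, 64))
z_audio = z_visual + 0.01 * rng.normal(size=(8, 64))     # well-aligned pairs
loss_matched = nt_xent(z_visual, z_audio)
loss_shuffled = nt_xent(z_visual, z_audio[::-1])         # deliberately misaligned
```

To cover the masked case, the same function could be evaluated a second time with the projections of the masked audio features as `z_key` and the two losses added, so that both complete and masked audio are pulled toward the matching visual projection.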
[0045] Step 4: Input the features processed in Steps 2 and 3 into the video-guided multimodal attention-gated fusion module. Inter-modal information interaction is performed through the cross-attention mechanism, and visual and audio features are dynamically fused using adaptively generated gating vectors to obtain the final emotion fusion features.
[0046] Constructing cross-modal attention: in visual-guided speech attention, visual features are used as the query and audio features as the key / value to calculate the attention weights $\alpha$ assigned at each time step, $\alpha = \mathrm{softmax}(Q_v K_a^{\top} / \sqrt{d})$, where $Q_v$ is obtained from the visual features and $K_a$, $V_a$ are obtained from the audio features. This generates speech features modulated by visual context, $F_a' = \alpha V_a$.
[0047] Constructing a gating fusion mechanism: a gating vector $g$ is designed whose values lie between 0 and 1 and determine the ratio between visual and speech contributions in the final fused features: $g = \sigma(W_g [F_v; F_a'] + b_g)$, where $\sigma$ is the sigmoid function. Final fused features: $F_{fused} = g \odot F_v + (1 - g) \odot F_a' + W_i (F_v \odot F_a')$, where $\odot$ denotes element-wise multiplication, the last term $W_i (F_v \odot F_a')$ captures inter-modal interaction information, and $W_g$, $b_g$, and $W_i$ are learnable weights. This design allows the model to dynamically select the dominant modality based on signal quality and content importance.
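The attention and gating steps can be sketched in NumPy as follows. This is a simplified illustration under several assumptions: the learned query/key/value projections are omitted (features are used directly), attention is the standard scaled dot-product form, and the gate combines the visual features, the modulated audio features, and a linearly transformed interaction term. All function and parameter names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def visual_guided_audio_attention(f_v, f_a):
    """Scaled dot-product attention: queries from video, keys/values from audio.

    f_v: (Tv, d) visual sequence; f_a: (Ta, d) audio sequence; returns (Tv, d).
    """
    d = f_v.shape[-1]
    alpha = softmax(f_v @ f_a.T / np.sqrt(d), axis=-1)  # (Tv, Ta) weights
    return alpha @ f_a

def gated_fusion(f_v, f_a_mod, W_g, b_g, W_i):
    """g * f_v + (1 - g) * f_a_mod + (f_v * f_a_mod) @ W_i, with g in (0, 1)."""
    g = sigmoid(np.concatenate([f_v, f_a_mod], axis=-1) @ W_g + b_g)
    return g * f_v + (1.0 - g) * f_a_mod + (f_v * f_a_mod) @ W_i

rng = np.random.default_rng(0)
f_v = rng.normal(size=(3, 4))          # 3 visual steps, d = 4
f_a = rng.normal(size=(5, 4))          # 5 audio steps, d = 4
f_a_mod = visual_guided_audio_attention(f_v, f_a)
fused = gated_fusion(f_v, f_a_mod,
                     W_g=rng.normal(size=(8, 4)), b_g=np.zeros(4),
                     W_i=rng.normal(size=(4, 4)))
```

With zero gate weights the gate sits at 0.5 and the two modalities are averaged; as the gate learns, values near 1 favor the visual stream and values near 0 favor the visually modulated audio stream.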
[0048] Step 5: Input the emotion fusion features into the emotion classifier to obtain the emotion category prediction results.
[0049] The emotion fusion features $F_{fused}$ are input into the emotion classifier $C_{emo}$, which predicts the probability distribution of emotion categories $\hat{y}$; the corresponding loss is the emotion classification loss $L_{emo}$.
[0050] Step 6: Input the emotion category prediction results into the robot decision-making strategy module to generate robot interaction actions.
[0051] The robot decision-making strategy module takes the emotion recognition result $\hat{y}$ and the original features as input and outputs a decision action $u$. The decision-making strategy $\pi$ is based on predefined rules.
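A rule-based strategy of this kind might look like the sketch below. The rule table, the action names, and the confidence threshold are all hypothetical; the source only states that the strategy is based on predefined rules and gives comforting a sad user as an example. For brevity the sketch uses only the emotion probabilities and ignores the original features.

```python
EMOTIONS = ["happiness", "sadness", "anger", "surprise", "neutrality"]

# Hypothetical rule table: only "sadness -> comfort" is suggested by the text.
RULES = {
    "sadness": "comfort",
    "anger": "comfort",
    "happiness": "entertain",
    "surprise": "remind",
    "neutrality": "no_action",
}

def decide(probs, min_confidence=0.5):
    """Pick an action from emotion probabilities; do nothing when unsure."""
    if len(probs) != len(EMOTIONS):
        raise ValueError("expected one probability per emotion")
    best = max(range(len(probs)), key=lambda i: probs[i])
    if probs[best] < min_confidence:
        return "no_action"  # avoid acting on a low-confidence prediction
    return RULES[EMOTIONS[best]]

action = decide([0.1, 0.7, 0.1, 0.05, 0.05])  # sadness dominates
```

Gating on a minimum confidence is one simple way to keep recognition errors from immediately turning into inappropriate robot behavior.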
[0052] During the model training phase, a reward signal $r$ based on the interaction effect is introduced (e.g., an estimate of positive user feedback, or a decrease in physiological stress indicators) and jointly optimized with the emotion recognition loss. The total loss function is a weighted sum of the individual loss terms, where the weights are hyperparameters that balance the various losses. Through joint training, the emotion recognition model is explicitly optimized to produce feature representations that are most conducive to triggering correct decisions by the robot.
[0053] Based on the above-mentioned audiovisual bimodal emotion recognition system and method for assisting robot decision-making, data testing was conducted to verify the effectiveness of the present invention.
[0054] Step 1: Data Collection. The data should cover diverse emotion categories (e.g., happiness, sadness, anger, surprise, neutrality), speakers of different genders and ages, and individuals from different cultural / ethnic backgrounds (e.g., samples from East Asia, Western Europe, North America, Africa, etc.). It should also encompass various lighting and background environments to simulate real-world scenarios. When constructing the training set, ensure a relatively balanced number of samples from different cultural backgrounds to avoid the model being biased towards the emotional expression patterns of a particular culture.
[0055] Step 1.1, Video preprocessing: Perform face detection and alignment on the video sequence to ensure that the facial area is centered in the image.
[0056] Step 1.2, Timing Normalization: Downsample the video to a fixed frame rate (e.g., 25 fps) and crop or interpolate it to a fixed duration (e.g., 3 seconds) to obtain the image frame sequence $V$, where $T = 75$ for these example values.
[0057] Step 1.3, Image Standardization: Standardize each frame of the image (scale to 224×224 pixels, normalize pixel values).
[0058] Step 1.4, Signal Cleaning: Perform preprocessing on the synchronized audio, such as noise reduction and silence removal.
[0059] Step 1.5, Resampling: Resample the audio to a fixed sampling rate (e.g., 16 kHz) and divide it into segments of the same length as the video to obtain the audio waveform $A$.
[0060] Step 1.6: Generate a Mel spectrogram as input to the audio encoder.
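The fixed-length bookkeeping in the preprocessing steps above can be checked with a few lines of Python. The example rates and duration come from the text; the helper `fix_length` and its zero-padding behavior are assumptions for illustration (the text mentions cropping or interpolation for video, and does not say how short audio is handled).

```python
def fix_length(seq, target_len, pad_value=0.0):
    """Crop or right-pad a sequence to exactly target_len items."""
    seq = list(seq)
    if len(seq) >= target_len:
        return seq[:target_len]
    return seq + [pad_value] * (target_len - len(seq))

FPS, SAMPLE_RATE, DURATION_S = 25, 16_000, 3  # example values from the text

num_frames = FPS * DURATION_S           # visual frames per clip
num_samples = SAMPLE_RATE * DURATION_S  # audio samples per clip

# A clip that is too short is padded; one that is too long is cropped.
short_audio = fix_length([0.1, -0.2], 4)
long_audio = fix_length(range(10), 4)
```

With these example values each clip contributes 75 frames and 48,000 audio samples, so both modalities cover the same three-second window.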
[0061] Step 1.7: Input each preprocessed image frame $v_t$ into ViT and take the output of the [CLS] token as the 1024-dimensional feature vector of that frame.
[0062] Step 1.8: The multi-scale temporal aggregator is implemented as three parallel 1D convolutional layers (kernel sizes of 3, 5, and 7, with padding to preserve output length) that process the frame-level feature sequence. The three outputs are averaged along the feature dimension and then compressed along the time dimension by an adaptive average pooling layer to output the aggregated visual feature vector $f_v^{agg}$.
[0063] Step 1.9: Through the linear layer $W_v$, project $f_v^{agg}$ onto the alignment dimension to obtain $F_v$.
[0064] Step 1.10: Input the audio waveform or its Mel spectrogram into the AST, and take the output after global average pooling as the audio semantic feature $f_{ast}$.
[0065] Step 1.11: For each audio waveform, use the librosa library to calculate the 20th-order MFCC mean (20-dimensional), RMS energy mean (1-dimensional), and zero-crossing rate mean (1-dimensional), and concatenate them to obtain a 22-dimensional statistical feature vector $f_{stat}$.
[0066] Step 1.12: Concatenate $f_{ast}$ and $f_{stat}$ along the feature dimension to obtain $f_a^{cat}$.
[0067] Step 1.13: Through the linear layer $W_a$, project $f_a^{cat}$ onto the alignment dimension to obtain $F_a$.
[0068] Step 2: Train a shared emotion feature encoder that can strip away identity information.
[0069] Step 2.1: Build the shared encoder $\mathrm{MLP}$. The input is the concatenated audiovisual features $[F_v; F_a]$, and the output is the preliminary emotional feature $F_{emo}$.
[0070] Step 2.2: Construct an identity classifier $C_{id}$, that is, a linear classification layer $\mathbb{R}^{d_e} \to \mathbb{R}^{N_{spk}}$, where $N_{spk}$ is the total number of speakers in the training set.
[0071] Step 2.3: Construct the domain discriminator $D$.
[0072] Step 2.4: Construct the Gradient Reversal Layer (GRL), whose forward function returns the input itself and whose backward function multiplies the incoming gradient by $-\lambda$ and then returns it.
[0073] Step 2.5, Forward Propagation: The identity classification loss $L_{id}$ is calculated from $C_{id}(F_{emo})$.
[0074] Step 2.6: After passing through the GRL, $F_{emo}$ is input into the domain discriminator to calculate the domain classification loss $L_{dom}$.
[0075] Step 2.7: During backpropagation, the GRL ensures that the encoder receives the reversed gradient from $D$, so that its parameters are updated in the direction that "confuses" the domain discriminator, while the identity classifier $C_{id}$ provides a normal gradient to preserve emotional information.
[0076] Step 2.8: The final output is the de-identified emotional features $F_{emo}$.
[0077] Step 3: Improve the model's robustness to missing speech through self-supervised contrastive learning.
[0078] Step 3.1: In each training batch, randomly select a subset of samples with a preset probability and replace their audio features at all time steps with a learnable mask vector to obtain the masked audio features.
[0079] Step 3.2: Define separate projection heads for the visual and audio features. Both are two-layer MLPs that project the 128-dimensional features onto a 64-dimensional contrastive learning space.
[0080] Step 3.3: For each sample within a batch, its visual projection and its full audio projection form a positive sample pair, and its visual projection and its masked audio projection also form a positive sample pair. The loss function adopts the InfoNCE form with a temperature coefficient, and this contrastive loss participates in optimization together with the main loss.
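A minimal sketch of the masking and contrastive loss in Steps 3.1 to 3.3 is given below. The function names, the masking probability of 0.5, the temperature of 0.1, and the zero-valued mask vector are assumptions for illustration; the two-layer projection heads are omitted and random 64-dimensional vectors stand in for their outputs.

```python
import numpy as np

def mask_audio(audio, mask_vector, p, rng):
    """Replace all time steps of randomly chosen samples with the mask vector.

    audio: (B, T, D); mask_vector: (D,); p: per-sample masking probability.
    """
    chosen = rng.random(audio.shape[0]) < p
    masked = audio.copy()
    masked[chosen] = mask_vector          # broadcasts over the T time steps
    return masked, chosen

def info_nce(z_query, z_key, tau):
    """InfoNCE loss; row i of z_query and row i of z_key form the positive pair."""
    q = z_query / np.linalg.norm(z_query, axis=1, keepdims=True)
    k = z_key / np.linalg.norm(z_key, axis=1, keepdims=True)
    logits = q @ k.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)

# Masking: in training the mask vector would be a learnable parameter.
audio = rng.normal(size=(8, 5, 128))
masked, chosen = mask_audio(audio, np.zeros(128), p=0.5, rng=rng)

# Contrastive loss: matched pairs should score lower than mismatched pairs.
z_visual = rng.normal(size=(8, 64))
z_audio = z_visual + 0.1 * rng.normal(size=(8, 64))
loss_matched = info_nce(z_visual, z_audio, tau=0.1)
loss_mismatched = info_nce(z_visual, z_audio[::-1], tau=0.1)
print(loss_matched < loss_mismatched)
```

In training, the same loss would be evaluated once with the full audio projections and once with the masked audio projections as keys, and the two terms added to the main loss.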
[0081] Step 4: Implementation of the video-guided multimodal attention gating fusion module.
[0082] Step 4.1, cross-modal attention: first, the features are reshaped into sequences of a given sequence length. Visual-guided audio attention (with visual features as the query and audio features as the key and value) and audio-guided visual attention are then calculated separately to obtain two mutually enhanced features.
[0083] Step 4.2: The two enhanced features and their element-wise product are concatenated to obtain the fusion basis vector.
[0084] Step 4.3: The gating vector is generated by a linear layer followed by sigmoid activation and is used to weight the contributions of the visual and audio enhanced features.
[0085] Step 4.4: The final fusion feature is obtained from the gated enhanced features through a learnable linear transformation matrix.
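One possible reading of Steps 4.1 to 4.4 is sketched below. The source text does not preserve the fusion formula, so several choices here are assumptions: single-head scaled dot-product attention, mean pooling of each enhanced sequence to one vector, and a convex combination `gate * v + (1 - gate) * a` before the output matrix. The function names and the sizes (sequence length 4, feature dimension 128) are likewise hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(query, key, value):
    """Single-head scaled dot-product attention; query (Lq, d), key/value (Lk, d)."""
    scores = query @ key.T / np.sqrt(query.shape[-1])
    return softmax(scores) @ value

def gated_fusion(visual, audio, w_gate, b_gate, w_out):
    audio_enh = cross_attention(visual, audio, audio)     # visual-guided audio attention
    visual_enh = cross_attention(audio, visual, visual)   # audio-guided visual attention
    v = visual_enh.mean(axis=0)    # assumption: mean-pool each sequence to one vector
    a = audio_enh.mean(axis=0)
    base = np.concatenate([v, a, v * a])                  # fusion basis vector, size 3d
    gate = 1.0 / (1.0 + np.exp(-(w_gate @ base + b_gate)))  # sigmoid gate
    mixed = gate * v + (1.0 - gate) * a                   # assumption: convex combination
    return w_out @ mixed, gate

rng = np.random.default_rng(0)
L, d = 4, 128
visual = rng.normal(size=(L, d))
audio = rng.normal(size=(L, d))
w_gate = rng.normal(size=(d, 3 * d)) * 0.05
b_gate = np.zeros(d)
w_out = rng.normal(size=(d, d)) * 0.05
fused, gate = gated_fusion(visual, audio, w_gate, b_gate, w_out)
print(fused.shape, gate.shape)
```

A per-dimension gate lets the module lean on the visual stream in some feature dimensions and on the audio stream in others, which is useful when one modality is noisy or masked.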
[0086] Step 5: Complete emotion recognition and drive appropriate robot behavior.
[0087] Step 5.1: The emotion classifier is implemented as a three-layer MLP whose input is the final fusion feature and whose output is the probability distribution over emotion categories.
[0088] Step 5.2: The robot decision network is a deep Q-network (DQN) or a simple policy network that takes the emotion probabilities as the state input and outputs a probability distribution over actions such as comforting, reminding, and entertaining.
[0089] Step 5.3: Define reward signals in simulated or real human-computer interaction. For example, if a user's mood shifts towards a positive direction shortly after an action is performed, a positive reward is given.
[0090] Step 5.4: Calculate the loss of the policy network and combine it with the total emotion recognition loss by weighted summation for end-to-end optimization, so that feature learning directly serves decision optimization.
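The reward definition and weighted loss of Steps 5.3 and 5.4 might be sketched as follows. The source does not specify the policy loss, so a REINFORCE-style term is swapped in here as an assumption; the linear softmax policy, the emotion ordering, the action names, the binary reward, and the weight `beta` are also hypothetical.

```python
import numpy as np

EMOTIONS = ["happy", "neutral", "sad"]        # hypothetical category order
ACTIONS = ["comfort", "remind", "entertain"]  # action types named in Step 5.2

def policy_probs(emotion_probs, weight):
    """Linear softmax policy over actions, with emotion probabilities as the state."""
    logits = weight @ emotion_probs
    e = np.exp(logits - logits.max())
    return e / e.sum()

def mood_shift_reward(before, after, positive_ids):
    """Return 1.0 if probability mass on positive emotions increased, else 0.0."""
    return 1.0 if after[positive_ids].sum() > before[positive_ids].sum() else 0.0

def total_loss(emotion_loss, action_prob, reward, beta):
    """Weighted sum of the emotion loss and a REINFORCE-style policy loss."""
    policy_loss = -reward * np.log(action_prob)
    return emotion_loss + beta * policy_loss

before = np.array([0.1, 0.2, 0.7])   # user mostly sad before the action
after = np.array([0.4, 0.3, 0.3])    # mood shifted toward positive afterwards
probs = policy_probs(before, np.zeros((3, 3)))   # untrained policy: uniform
action = ACTIONS.index("comfort")
reward = mood_shift_reward(before, after, positive_ids=[0])
loss = total_loss(emotion_loss=0.9, action_prob=probs[action], reward=reward, beta=0.5)
print(reward, round(float(loss), 4))
```

Minimizing this combined objective raises the probability of actions that were followed by a positive mood shift, while the emotion term keeps the features discriminative.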
[0091] Step 6: Model deployment and robot system integration. As shown in Figure 2, the process for this step is as follows: Step 6.1, Model Lightweighting and Stabilization: Before deployment, the trained model (.pth file) is lightweighted, including model pruning, to reduce computational overhead. Then, the model is converted to an inference format suitable for edge devices (such as ONNX or TensorRT) and stabilized to generate the final deployment file.
[0092] Step 6.2: Construct a real-time emotion recognition service node: On the robot's onboard computer (such as an industrial PC equipped with ROS or a Jetson platform), create an independent emotion recognition service node. This node continuously subscribes to synchronous data streams from the robot's camera and microphone. Internally, the node implements an efficient inference pipeline: it performs face detection and alignment on the input real-time video stream, performs frame segmentation and feature extraction on the audio stream, then calls a fixed model for forward inference, and finally outputs the emotion category and its confidence level.
[0093] Step 6.3, Decision-Execution Closed-Loop Integration: The structured emotional information output by the above service nodes is published to the robot's central decision-making system. The decision-making system can generate corresponding robot behavior instructions (such as speech synthesis content, on-screen facial animations, or specific actions) based on a preset rule base (e.g., triggering a comforting dialogue if the emotion is "sadness" and the confidence level is >0.7) or a lightweight reinforcement learning policy network. These instructions are ultimately transformed into human-like interactive behaviors through the robot's underlying actuator nodes (such as motion control and voice broadcasting nodes), completing the closed loop from "perceiving emotions" to "responding".
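The preset rule base mentioned in Step 6.3 can be illustrated with a small lookup. Only the "sadness" rule with its 0.7 threshold comes from the text above; the other rules, the behavior names, and the `EmotionMessage` structure are hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass
class EmotionMessage:
    """Structured output of the emotion recognition service node."""
    label: str
    confidence: float

# Rule base: (emotion label, minimum confidence, behavior instruction).
# Only the first rule is taken from the description; the rest are illustrative.
RULES = [
    ("sadness", 0.7, "comforting_dialogue"),
    ("anger", 0.7, "calming_prompt"),
    ("happiness", 0.6, "entertainment_interaction"),
]

def decide(message, rules=RULES, default="no_action"):
    """Return the first behavior whose label matches and whose threshold is exceeded."""
    for label, threshold, behavior in rules:
        if message.label == label and message.confidence > threshold:
            return behavior
    return default

print(decide(EmotionMessage("sadness", 0.85)))   # comforting_dialogue
print(decide(EmotionMessage("sadness", 0.55)))   # no_action
```

Falling back to `no_action` when confidence is low avoids triggering a response on an uncertain prediction.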
[0094] Step 6.4, System Verification and Real-Time Assurance: Conduct system integration testing on a real robot platform. Key metrics include end-to-end latency (target <500 milliseconds) and robustness in noise-filled and lighting-varying environments. Real-time performance can be ensured by adjusting the inference frame rate, optimizing the data flow pipeline, and using hardware acceleration (such as GPU / TensorRT), enabling natural and timely interactive feedback from the robot.
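The end-to-end latency check in Step 6.4 could be performed with a simple timing harness such as the one below. The function names and the dummy pipeline are assumptions; on a real robot the callable would wrap face detection, feature extraction, model inference, and decision output.

```python
import time

def measure_latency_ms(pipeline, sample, runs=20):
    """Return the mean and worst-case wall-clock latency of pipeline(sample) in ms."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        pipeline(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    return sum(timings) / len(timings), max(timings)

def dummy_pipeline(sample):
    # Stand-in for detection, feature extraction, inference, and decision.
    return sum(sample) / len(sample)

TARGET_MS = 500.0   # latency target stated in Step 6.4
mean_ms, worst_ms = measure_latency_ms(dummy_pipeline, list(range(1000)))
meets_target = worst_ms < TARGET_MS
print(f"mean={mean_ms:.3f} ms, worst={worst_ms:.3f} ms, meets target: {meets_target}")
```

Reporting the worst case alongside the mean matters for interaction quality, since occasional slow responses are what users notice.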
[0095] It should be emphasized that the embodiments described in this invention are illustrative rather than limiting. Therefore, this invention includes, but is not limited to, the embodiments described in the specific implementation. Any other implementations derived by those skilled in the art based on the technical solutions of this invention are also within the scope of protection of this invention.
Claims
1. A dual-modal audiovisual emotion recognition system for assisting robot decision-making, characterized in that it includes an audiovisual feature encoder, a domain adversarial de-identification module, a masked modality contrastive learning module, a video-guided multimodal attention gating fusion module, an emotion classifier, and a robot decision strategy module, which are connected in sequence. The audiovisual feature encoder is used to acquire and preprocess synchronized audiovisual data and to extract and align visual and audio features; the domain adversarial de-identification module is used to learn cross-domain invariant emotional features; the masked modality contrastive learning module is used to enhance the model's robustness to modality loss; the video-guided multimodal attention gating fusion module is used to achieve video-guided adaptive multimodal feature fusion; the emotion classifier is used to output emotion category predictions; and the robot decision strategy module is used to generate decision actions based on emotion recognition results.
2. A recognition method for the audiovisual bimodal emotion recognition system for assisting robot decision-making as described in claim 1, characterized in that it includes the following steps: Step 1: The audiovisual feature encoder acquires synchronized visual and audio data and performs preprocessing. Visual and audio features are extracted separately from the preprocessed data and then projected onto a common feature space for alignment to obtain aligned visual and audio features. Step 2: Input the aligned visual features and aligned audio features into the domain adversarial de-identification module. By jointly optimizing the emotion classification task and the adversarial domain discrimination task, cross-domain invariant emotion features stripped of speaker identity information are learned. Step 3: During the training phase, the masked modality contrastive learning module applies random masks to the aligned audio features and uses the contrastive learning loss function to constrain the relationship, in the representation space, between the audio features before and after masking and the corresponding visual features, so as to enhance the robustness of the model to missing audio information. Step 4: Input the features processed in Steps 2 and 3 into the video-guided multimodal attention gating fusion module. Intermodal information interaction is performed through the cross-attention mechanism, and visual and audio features are dynamically fused using adaptively generated gating vectors to obtain the final emotion fusion features. Step 5: Input the emotion fusion features into the emotion classifier to obtain the emotion category prediction results. Step 6: Input the emotion category prediction results into the robot decision-making strategy module to generate robot interaction actions.
3. The recognition method of the audiovisual bimodal emotion recognition system for assisting robot decision-making according to claim 2, characterized in that Step 1 includes the following steps: Step 1.1: The audiovisual feature encoder acquires synchronized visual and audio data and segments the video data into a fixed-length sequence of image frames, the length being the number of visual frames sampled, where each frame is a tensor of fixed dimensions representing the visual content of one time slice; the original audio waveform synchronized with the video is processed to a fixed length determined by the audio sampling rate and duration; Step 1.2: The audiovisual feature encoder uses a pre-trained Vision Transformer network to extract visual features from the visual data; Step 1.3: The audiovisual feature encoder uses a pre-trained Audio Spectrogram Transformer to extract audio features from the audio data.
4. The recognition method of the audiovisual bimodal emotion recognition system for assisting robot decision-making according to claim 2, characterized in that Step 2 includes the following steps: Step 2.1: The domain adversarial de-identification module uses a lightweight multilayer perceptron that receives the concatenated audiovisual features and extracts the preliminarily fused emotion-related features, where the concatenation is performed along the feature dimension over the preprocessed visual modal features and the preprocessed audio modal features, the multilayer perceptron includes a bias vector, the feature sequence has a fixed number of time steps, and the fused emotional features lie in a real vector space whose dimension is the fused emotional feature dimension; Step 2.2: The domain adversarial de-identification module uses an identity classifier to predict the speaker identity labels from these features, and the identity classification loss is the cross-entropy loss between the prediction and the labels; Step 2.3: The domain adversarial de-identification module uses the domain discriminator to distinguish whether features come from the training set or from simulated unknown users or environments, and uses a gradient reversal layer to achieve the adversarial objective; finally, the optimization objective of the multilayer perceptron is to minimize the emotion classification loss while maximizing the domain discrimination loss, thereby stripping away domain-specific information and outputting de-identified emotion features.
5. The recognition method of a bimodal audiovisual emotion recognition system for assisting robot decision-making according to claim 2, characterized in that Step 3 includes the following steps: Step 3.1: The masked modality contrastive learning module enhances the model's robustness when speech information is missing: certain time segments of the speech feature sequence are randomly masked to yield the masked features, and positive sample pairs and hard negative sample pairs are constructed from the resulting features; Step 3.2: The masked modality contrastive learning module introduces contrastive learning at the feature level: for each sample within a batch, a projection head maps the features to the contrastive learning space, where the superscript indicates whether the features are derived from the visual or the speech modality, and the loss function is the normalized temperature-scaled cross-entropy loss, whose parameters are a temperature hyperparameter and the batch size.
6. The recognition method of the audiovisual bimodal emotion recognition system for assisting robot decision-making according to claim 2, characterized in that: The specific implementation method of step 4 is as follows: using visual features as query vectors and audio features as key vectors and value vectors, calculate visually guided audio attention to obtain context-modulated audio features. A gating vector is generated based on the modulated visual and audio features; the modulated visual features, audio features, and their interaction features are weighted and summed according to the gating vector to obtain the emotion fusion feature.
7. The recognition method of a bimodal audiovisual emotion recognition system for assisting robot decision-making according to claim 2, characterized in that the specific implementation method of step 5 is as follows: the fusion features are input into the emotion classifier, which predicts the probability distribution over emotion categories, and the emotion classification loss is calculated from this prediction.
8. The recognition method of the audiovisual bimodal emotion recognition system for assisting robot decision-making according to claim 2, characterized in that the specific implementation method of step 6 is as follows: the robot decision-making strategy module takes the emotion recognition results and the original features as input and outputs a decision action; the decision-making strategy is based on predefined rules.
9. The recognition method of a bimodal audiovisual emotion recognition system for assisting robot decision-making according to claim 2, characterized in that: in step 3, during the training phase, a reward signal based on interaction effects is introduced and jointly optimized with the emotion recognition loss. The total loss function is a weighted sum of the individual losses, in which the weighting hyperparameters balance the various losses.