Multimodal co-empathy interaction method and system based on physiological signals and audiovisual fusion

A multimodal empathic interaction method integrates visual, auditory, and physiological signals and uses deep learning and a large language model for cross-modal analysis. It addresses inaccurate recognition of user emotions and the lack of continuous tracking and adaptive feedback strategies in complex scenarios, thereby improving the empathic interaction of embodied intelligence systems.

CN122241577A · Pending · Publication Date: 2026-06-19 · SHANGHAI UNIVERSITY OF ELECTRIC POWER +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHANGHAI UNIVERSITY OF ELECTRIC POWER
Filing Date
2026-03-17
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies struggle to accurately identify users' true emotions in complex scenarios. Single-modal information is easily spoofed, feedback strategies are superficial, and multimodal perception and continuous tracking capabilities are lacking.

Method used

A multimodal empathic interaction method based on physiological signals and audiovisual fusion is adopted. By synchronously collecting video, audio and physiological signals, deep learning algorithms are used to extract features, and cross-modal consistency analysis and dynamic weight allocation are carried out in combination with a large language model to construct a closed-loop mechanism for emotional trends after interaction.

Benefits of technology

It has achieved the identification of users' emotional masquerading and the effective handling of semantic conflicts, generating emotional profiles that are closer to the user's true state, continuously tracking emotional changes and dynamically adjusting feedback strategies, thereby improving the empathy and interaction quality of the embodied intelligence system in real companionship scenarios.

✦ Generated by Eureka AI based on patent content.

Figure CN122241577A_ABST

Abstract

This invention provides a multimodal empathic interaction method and system based on the fusion of physiological signals and audiovisual information. By synchronously collecting multimodal signals and using deep learning algorithms for feature extraction, combined with a large language model for cross-modal consistency analysis and dynamic weight allocation, it can effectively identify emotional masquerading and semantic conflicts, generating an emotional profile that is closer to the user's true inner state. At the same time, by constructing a closed loop of emotional trends after interaction, it can continuously track and predict the evolution of the user's emotions in the short term, and dynamically adjust the empathic feedback strategy. This solves the problems of weak single-modal recognition ability, difficulty in continuously tracking emotional changes, and superficial feedback strategies in existing technologies, significantly improving the empathic ability and interaction quality of embodied intelligence systems in real companionship scenarios.

Description

Technical Field

[0001] This invention relates to the field of embodied intelligence technology, and in particular to a multimodal empathic interaction method and system based on the fusion of physiological signals and audiovisual information.

Background Technology

[0002] With the rapid development of artificial intelligence, sensor, and robotics technologies, the interaction methods of embodied intelligent systems are gradually evolving from the traditional "command-response" to empathic interaction based on "contextual understanding—emotional resonance—strategic feedback." Empathic interaction not only requires the system to recognize the user's surface-level emotional categories (such as happiness, anger, and sadness), but also to determine the intensity, trend, triggering cause of the emotion, and whether the user is engaging in emotional regulation or concealment. Based on this, it should generate reassuring, guiding, or assisting behaviors that conform to the context and relational boundaries. Currently, most emotion recognition and interaction solutions rely on single-modal signals for inference, such as recognizing facial expressions or body postures based on computer vision, or analyzing tone, speed, and energy changes based on speech features. While such solutions can achieve certain results in controlled environments, in complex scenarios such as real-world social interactions and family companionship, single-modal information often suffers from insufficient observability and high spoofability.

[0003] Specifically, taking the visual modality as an example, facial expressions are easily affected by factors such as lighting, occlusion, angle, makeup, masks, or glasses. Simultaneously, users often exhibit emotional regulation behaviors such as "forced smiles," "emotional suppression," and "frozen expressions" under social etiquette, professional norms, or self-protection, causing a deviation between outward expressions and inner experiences. In such cases, the system may misinterpret a "polite smile" as pleasure and a "calm face" as emotional stability, thus outputting feedback that does not match the user's true needs, and even causing misunderstandings and interaction resistance. Taking the voice modality as an example, voice emotional characteristics are significantly affected by environmental noise, echoes, microphone distance, speaker dialects, and individual voice differences. Users can achieve "linguistic calmness" by deliberately controlling their tone, speed, and wording, making it difficult for the system to distinguish between "calm narration" and states such as "suppressed anxiety / forced sadness." Furthermore, psychological research shows that emotional experience can usually be described by the "valence-arousal" dimension, where "arousal" reflects the physiological activation level of emotions such as tension, excitement, and fear. Most existing single-vision or single-voice solutions are better at determining valence direction, but lack reliable means of observing arousal intensity, making it difficult to build accurate user psychological profiles in scenarios with high arousal that are masked. 
Although some solutions incorporate physiological signals such as heart rate and blood oxygen, relying solely on a single physiological indicator also has limitations: physiological changes may be caused by exercise, caffeine intake, ambient temperature, or individual differences, and using them alone can easily lead to misjudgments such as "high arousal = negative emotions," and lacks joint reasoning with visual, voice, and contextual information, making it difficult to explain the triggering reasons and contextual meanings of physiological changes. Emotional interaction solutions based on large language models mostly use dialogue text as the core input and output. Even when using voice assistants, they often rely on speech-to-text for reasoning, lacking real-time multimodal perception constraints, making it difficult to identify implicit emotions using "non-verbal cues" and "physiological activation cues," and lacking quantitative modeling of continuous changes in user state, resulting in superficial feedback strategies, inappropriate timing, or mismatch with real needs.

[0004] Therefore, there is an urgent need in this field to solve the problem of how to integrate visual, auditory and deep physiological signals in complex scenarios to build a multimodal interactive system that can identify emotional masquerading, continuously track emotional changes and generate accurate empathic feedback.

Summary of the Invention

[0005] The purpose of this invention is to provide a multimodal empathic interaction method and system based on the fusion of physiological signals and audiovisual information, so as to solve the problems existing in the prior art.

[0006] To achieve the above objectives, the present invention provides the following solution. This invention provides a multimodal empathic interaction method based on the fusion of physiological signals and audiovisual information, comprising:
Multimodal data acquisition step: synchronously acquire the user's video signals, audio signals, and physiological signals;
Multimodal feature extraction step: process the video signal, audio signal, and physiological signal using a visual model, a speech model, and a physiological signal analysis algorithm respectively, to obtain the emotion-related information and confidence level corresponding to each modality;
Intelligent fusion decision-making step: integrate the emotion-related information and confidence scores of each modality obtained in the multimodal feature extraction step into structured data and input it into a large language model, which performs cross-modal consistency analysis, conflict detection, and dynamic weight allocation on the structured data to generate a comprehensive user emotion profile;
Empathic response generation step: generate a multimodal empathic response that includes text, voice, and action strategies based on the comprehensive user emotion profile;
Post-interaction emotion trend closed-loop step: after outputting an empathic response, continue to collect subsequent multimodal data of the user, analyze the emotion evolution trend, and feed the trend analysis results back to the intelligent fusion decision-making step to update the weight allocation and emotion prediction for the next round.

[0007] Preferably, in the multimodal feature extraction step, the visual model uses an emotion classification algorithm to process the video signal and outputs the emotion category and its confidence level; the speech model uses an end-to-end speech emotion recognition algorithm to extract the emotion features in the audio signal; and the physiological signal analysis algorithm extracts one or more physiological indicators from the physiological signal, such as heart rate, heart rate variability, pulse, blood oxygen, skin conductance, and respiratory rate, and their corresponding arousal levels.

[0008] Preferably, in the intelligent fusion decision-making step, the emotion-related information and confidence scores of each modality are integrated into structured data in JSON format. The structured data includes at least timestamps, emotion outputs of each modality, confidence scores of each modality, candidate weights, conflict markers, comprehensive emotion vectors, and context summary fields.

[0009] Preferably, the cross-modal consistency analysis includes: comparing whether the emotional valence of the visual modality matches the arousal index of the physiological modality; when the visual modality outputs a positive emotion while the physiological modality outputs a high-arousal stress state, determining an emotional masquerade conflict; when the visual modality outputs an aggressive emotion while the voice modality outputs fear characteristics, determining a defensive emotional conflict; and when the signal quality of any modality is lower than a preset threshold, determining an environmental interference conflict.

[0010] Preferably, the dynamic weight allocation includes: if the conflict is determined to be an emotional masquerade conflict, the weight of the visual modality is reduced and the weight of the physiological signal modality is increased; if the conflict is determined to be a defensive emotional conflict, the weights of the voice modality and the physiological modality are allocated evenly; if the conflict is determined to be an environmental interference conflict, the weight of the affected modality is reduced and the weights of the remaining stable modalities are increased.

[0011] Preferably, within the time interval between the end of a system conversation and the next user-initiated conversation, the following dynamic weight allocation logic is performed on the visual emotion of each frame: When a user exhibits common emotions such as neutral and happy for multiple consecutive frames, the system maintains an initial balanced weight for the seven categories of emotions: happiness, sadness, anger, surprise, neutrality, disgust, and fear, to ensure stable recognition of common emotions. When an uncommon emotion such as disgust or fear suddenly appears in a frame, the system immediately increases the weight of that emotion while appropriately reducing the weight of neutral and happy high-frequency emotions to highlight the attention to potential stress and negative emotions.

[0012] Preferably, the post-interaction emotion trend closed-loop step includes: Real-time calculation of the rate of change of the user's subsequent physiological indicators; If the rate of change meets the preset smoothing characteristics, the current empathy strategy is predicted to be effective, and the current weight allocation scheme is recorded in the individual preference database. If the rate of change meets the preset stress fluctuation characteristics, the current strategy is predicted to fail, an early warning weight correction parameter is generated and fed back to the intelligent fusion decision-making step.

[0013] This invention also provides a multimodal empathic interaction system based on the fusion of physiological signals and audiovisual information, comprising:
Acquisition module: synchronously acquires the user's video signals, audio signals, and physiological signals;
Perception module: includes a visual model processing unit, a speech feature extraction unit, and a physiological signal analysis unit, which process the video signal, audio signal, and physiological signal respectively and output the emotion-related information and confidence level corresponding to each modality;
Fusion decision module: includes a large language model analyzer, which integrates the emotion-related information and confidence scores of each modality output by the perception module into structured data and performs cross-modal consistency analysis, conflict detection, and dynamic weight allocation to generate a comprehensive user emotion profile;
Execution module: generates and executes multimodal empathic responses based on the comprehensive user emotion profile;
Closed-loop monitoring module: after an empathic response is output, continues to collect subsequent multimodal data from the user, analyzes the emotion evolution trend, and feeds the trend analysis results back to the fusion decision module.

[0014] Preferably, in the perception module, the visual model processing unit adopts an emotion classification algorithm, the voice feature extraction unit adopts an end-to-end voice emotion recognition algorithm, and the physiological signal analysis unit adopts non-contact rPPG technology or contact sensors to extract physiological indicators; the fusion decision module is deployed on an edge computing box or a cloud server; the execution module is an embodied intelligent robot terminal with text, voice and action output capabilities.

[0015] The present invention also provides a storage medium storing a computer program thereon, which, when executed by a processor, implements a multimodal empathic interaction method based on the fusion of physiological signals and audiovisual information.

[0016] The present invention achieves the following beneficial technical effects compared to the prior art: This invention provides a multimodal empathic interaction method and system based on the fusion of physiological signals and audiovisual information. By synchronously collecting multimodal signals and using deep learning algorithms for feature extraction, combined with a large language model for cross-modal consistency analysis and dynamic weight allocation, it can effectively identify emotional masquerading and semantic conflicts, generating an emotional profile that is closer to the user's true inner state. At the same time, by constructing a closed loop of emotional trends after interaction, it can continuously track and predict the evolution of the user's emotions in the short term, and dynamically adjust the empathic feedback strategy. This solves the problems of weak single-modal recognition ability, difficulty in continuously tracking emotional changes, and superficial feedback strategies in existing technologies, significantly improving the empathic ability and interaction quality of embodied intelligence systems in real companionship scenarios.

Attached Figure Description

[0017] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0018] Figure 1 is a schematic diagram of the multimodal empathic interaction method based on the fusion of physiological signals and audiovisual information provided by the present invention;
Figure 2 is a schematic diagram of the conflict detection logic of the fusion decision layer in this invention;
Figure 3 is a schematic diagram of the feedback layer and emotion trend prediction in this invention;
Figure 4 is a simplified front view of an embodied intelligent system based on the fusion of physiological signals and audiovisual information provided by this invention.

Detailed Implementation

[0019] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0020] The purpose of this invention is to provide a multimodal empathic interaction method and system based on the fusion of physiological signals and audiovisual information. This aims to address the problems in existing technologies, such as insufficient observability of single-modal information, susceptibility to emotional faking, difficulty in accurately identifying the user's true inner state, and superficial feedback strategies. This invention deeply integrates visual, auditory, and physiological signals, introduces a large language model for cross-modal consistency analysis and dynamic weight allocation, and constructs a closed-loop mechanism for post-interaction emotional trends, achieving accurate identification, continuous tracking, and adaptive feedback of the user's emotional state.

[0021] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.

[0022] Example 1: As shown in Figure 1, the multimodal empathic interaction method based on physiological signals and audiovisual fusion provided by this invention includes five core steps: multimodal data acquisition, multimodal feature extraction, intelligent fusion decision-making, empathic response generation, and the post-interaction emotion trend closed loop. The technical solution of the present invention is described in detail below with reference to Figures 1 to 4.

[0023] First, the system hardware configuration of this invention will be described. As shown in Figure 4, this invention provides a multimodal empathic interaction system based on the fusion of physiological signals and audiovisual information, which can be implemented as an embodied intelligent robot terminal. The robot integrates multiple functional modules, with its core components including a display module, a voice module, a vision module, a sensor module, a computing module, and a motion module. The display module presents interactive interfaces in the form of text, images, or videos, such as soothing emoticons or text. The voice module, which includes a microphone array and speakers, receives user voice commands and carries out voice acquisition and interactive dialogue. The vision module captures the user's facial images and body movements through a high-definition camera and transmits them to the computing module. The sensor module monitors the user's heart rate, body temperature, pulse, skin conductance, and other physiological signals and transmits them to the computing module; it can use contact sensors (such as smart bracelets, wristbands, or finger-clip blood oxygen probes carried at the end of the robot's arm) or non-contact sensors (such as cameras based on rPPG technology). The computing module, as the brain of the embodied system, is responsible for computing tasks such as data feature extraction, model inference, and network communication; it can be deployed on the edge computing box of the robot body, or some computing tasks can be offloaded to a cloud server. The motion module, through a wheeled chassis and a multi-degree-of-freedom robotic arm, gives the robot the ability to move and perform physical tasks.
For example, when it detects that the user is sad, the robot can move to the user's side and perform a soothing patting motion.

[0024] Based on the aforementioned hardware system, the method of this invention first performs a multimodal data acquisition step. The system synchronously acquires multimodal data through an input layer, using a camera, microphone, and physiological sensors to obtain the user's video signals, audio signals, and physiological signals, respectively. The physiological signals include one or more of heart rate, heart rate variability, pulse, blood oxygen saturation, skin conductance, and respiratory rate. To ensure the accuracy of subsequent feature extraction, the acquired raw signals need to be preprocessed, such as performing frame rate unification and illumination correction on the video stream, noise filtering and echo cancellation on the audio stream, and denoising and baseline drift correction on the physiological signals.
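The baseline drift correction mentioned above can be illustrated with a minimal sketch. The source does not say which correction method is used; the centered moving-average subtraction below, the function name `remove_baseline_drift`, and the `window` parameter are all assumptions made for illustration.

```python
from statistics import fmean


def remove_baseline_drift(samples, window=5):
    """Subtract a centered moving-average baseline from a physiological trace.

    Assumption: a moving average stands in for the unspecified drift-correction
    method; `window` is a hypothetical odd window length in samples.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    corrected = []
    for i, value in enumerate(samples):
        lo = max(0, i - half)                  # window is clipped at the edges
        hi = min(len(samples), i + half + 1)
        baseline = fmean(samples[lo:hi])       # local estimate of the slow drift
        corrected.append(value - baseline)
    return corrected
```

A slowly rising trace is flattened in its interior, while fast pulse-like variations shorter than the window are preserved.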

[0025] The system then proceeds to the multimodal feature extraction step. The preprocessed multimodal input data is processed by deep learning algorithms to obtain emotion-related information and confidence information for each modality. The emotion-related information includes one or more of the following: emotion category (e.g., anger, disgust, fear, happiness, sadness, surprise, calmness), valence index (whether the emotion is positive or negative), and arousal index (the physiological activation intensity of the emotion). The confidence information characterizes the reliability of the modality's analysis result and is affected by factors such as data quality and environmental interference. In a specific embodiment of this invention, the visual algorithm can be POSTER++, which performs well in seven-class emotion recognition. It detects the facial expression in each frame of the camera video stream and outputs the emotion classification result and its confidence value for each frame. The speech algorithm can be wav2vec 2.0, open-sourced by Meta, whose core feature is that it processes raw audio waveforms directly and outputs feature vectors carrying rich pitch, speech rate, and prosody information. This allows subtle emotional fluctuations such as anger, sadness, and joy to be characterized and the valence index to be determined. For physiological signals, this invention combines non-contact and contact acquisition. In the non-contact method, rPPG technology uses a camera to capture the subtle color changes in facial skin caused by heartbeats. Algorithms such as CHROM (chrominance-based method) or POS (plane-orthogonal-to-skin method) are used to suppress ambient light interference and reconstruct the heart rate curve and respiratory rate from the video.
For the contact method, a robotic arm on an embodied intelligent system is used to actively or passively contact the user through sensors (such as photoelectric pulse sensors and ECG electrodes) carried by its end effector to collect accurate physiological data. The system precisely aligns the visual, auditory, and physiological features according to the acquisition timestamps, aggregating them to generate a multimodal feature data pool, which serves as the raw input for the core decision layer. The output data format of this step can be a structured vector containing emotion labels for each modality, arousal values, and confidence levels.
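The timestamp alignment that builds the multimodal feature data pool can be sketched as follows. The source only states that features are aligned by acquisition timestamp; nearest-neighbor matching within a tolerance, and the names `nearest_sample`, `align_streams`, and `tolerance`, are assumptions made for illustration.

```python
from bisect import bisect_left


def nearest_sample(timestamps, values, t, tolerance):
    """Return the value whose timestamp is closest to t, or None if none is within tolerance.

    `timestamps` must be sorted in ascending order.
    """
    if not timestamps:
        return None
    i = bisect_left(timestamps, t)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(timestamps)]
    best = min(candidates, key=lambda j: abs(timestamps[j] - t))
    if abs(timestamps[best] - t) > tolerance:
        return None
    return values[best]


def align_streams(reference_ts, streams, tolerance=0.1):
    """Build one record per reference timestamp from several (timestamps, values) streams."""
    pool = []
    for t in reference_ts:
        record = {"timestamp": t}
        for name, (ts, vals) in streams.items():
            # A missing entry (None) signals that the modality had no sample near t.
            record[name] = nearest_sample(ts, vals, t, tolerance)
        pool.append(record)
    return pool
```

Leaving a gap as `None` rather than interpolating keeps the later confidence check honest about modalities that had no data at a given moment.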

[0026] The next step is the core intelligent fusion decision-making process of this invention. The system integrates the emotion-related information and confidence scores of each modality obtained from the multimodal feature extraction step according to a preset structured template, generating structured data for large language model inference. Preferably, this structured data is in JSON format, and its content includes at least: timestamp, emotion output of each modality, confidence / quality index of each modality, candidate weights, conflict markers, comprehensive emotion vector, and a context summary field. Subsequently, this JSON structured data is input into the large language model analyzer of the fusion decision layer. This analyzer is responsible for performing consistency analysis, conflict detection, and dynamic weight allocation on the multimodal parsing results, ultimately outputting a comprehensive user emotion profile.
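A minimal sketch of such a structured record is shown below. The source lists the kinds of fields but not their names or layout, so every key name here (`modalities`, `candidate_weights`, `conflict_flags`, `fused_emotion_vector`, `context_summary`) and the equal initial weights are assumptions.

```python
import json


def build_fusion_record(timestamp, modality_outputs, context_summary=""):
    """Serialize per-modality results into one JSON document for the LLM analyzer.

    `modality_outputs` maps a modality name to a dict holding its emotion output
    and confidence. All key names are hypothetical.
    """
    if not modality_outputs:
        raise ValueError("at least one modality is required")
    n = len(modality_outputs)
    record = {
        "timestamp": timestamp,
        "modalities": modality_outputs,
        # Assumed starting point: equal candidate weights before any conflict handling.
        "candidate_weights": {name: round(1.0 / n, 4) for name in modality_outputs},
        "conflict_flags": [],           # filled in by conflict detection
        "fused_emotion_vector": None,   # filled in after fusion
        "context_summary": context_summary,
    }
    return json.dumps(record, ensure_ascii=False, indent=2)


example = build_fusion_record(
    12.5,
    {
        "visual": {"label": "happiness", "confidence": 0.91},
        "voice": {"label": "neutral", "confidence": 0.78},
        "physio": {"arousal": 0.86, "confidence": 0.88},
    },
    context_summary="user has just returned home",
)
```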

[0027] Specifically, as shown in Figure 2, the core decision-making layer first performs confidence verification on the input multimodal feature data to eliminate low-quality noise caused by sensor detachment or drastic environmental changes. Then, the large language model analyzer performs cross-modal consistency detection to determine whether the emotional information captured by the different modalities matches. This detection includes semantic alignment and conflict determination. Semantic alignment compares whether the valence output of the visual model matches the arousal index of the physiological signals within the emotional space. For example, if the visual detection is "smiling" (high valence) and the physiological signals show a stable heart rate (low arousal), the two match; if the visual detection is "smiling" but the physiological signals show a sudden increase in heart rate and skin conductance (high arousal), a conflict occurs. Conflict determination is based on several preset typical conflict scenarios. When the visual modality outputs a positive emotion (such as a smile) and the physiological modality outputs a high-arousal stress state (such as an increased heart rate), an "emotional masquerade conflict" is determined, which is common when users force a smile. When the visual modality outputs an aggressive emotion (such as an angry expression) and the voice modality outputs fear characteristics (such as a trembling tone), a "defensive emotional conflict" is determined, which is common when users bluff to cover up inner fear. In addition, if the signal quality of any modality is below a preset threshold (for example, insufficient lighting causing a sharp drop in visual confidence, or environmental noise causing a sharp drop in voice confidence), an "environmental interference conflict" is determined.
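The three preset conflict scenarios can be written as explicit rules, as in the sketch below. In the source this analysis is carried out by the large language model analyzer; the rule-based function here is only a simplified stand-in. The `ModalityReading` type, the thresholds `min_confidence` and `high_arousal`, and the label strings are assumptions.

```python
from dataclasses import dataclass


@dataclass
class ModalityReading:
    label: str         # emotion label reported by the modality
    valence: float     # assumed range -1 (negative) to 1 (positive)
    arousal: float     # assumed range 0 (calm) to 1 (highly activated)
    confidence: float  # assumed range 0 to 1, used as the signal-quality proxy


def detect_conflicts(visual, voice, physio, min_confidence=0.4, high_arousal=0.7):
    """Return a list of (conflict_type, modalities) pairs; empty means consistent."""
    readings = {"visual": visual, "voice": voice, "physio": physio}
    conflicts = []

    # Confidence verification comes first, mirroring the order in the text.
    low_quality = [name for name, r in readings.items() if r.confidence < min_confidence]
    if low_quality:
        conflicts.append(("environmental_interference", low_quality))

    def usable(name):
        return name not in low_quality

    # Positive face but high physiological arousal: possible forced smile.
    if usable("visual") and usable("physio"):
        if visual.valence > 0 and physio.arousal >= high_arousal:
            conflicts.append(("emotional_masquerade", ["visual", "physio"]))

    # Aggressive face but fearful voice: possible bluffing.
    if usable("visual") and usable("voice"):
        if visual.label == "anger" and voice.label == "fear":
            conflicts.append(("defensive_emotion", ["visual", "voice"]))

    return conflicts
```

Skipping low-quality modalities in the later rules is a design choice of this sketch, so that a dark or noisy input does not also trigger a spurious masquerade or defensive conflict.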

[0028] If the cross-modal information is determined to be consistent, the system treats the user as being in a genuine emotional state, follows the conventional decision-making path, and maintains the preset standard weight distribution. If the cross-modal information is determined to be conflicting, the decision layer enters a semantic contradiction classification procedure and applies a targeted strategy to each contradiction type, that is, it dynamically reshapes the weight allocation of each modality. The specific logic is as follows. For an emotional masquerade conflict, a weight reshaping strategy is executed: the trust weight of the visual modality is reduced (because overt expressions can be deceptive), and the weight of the physiological signal modality is raised to the preset highest-priority threshold, because physiological signals are difficult to conceal completely by subjective will. For a defensive emotional conflict, a balanced weight strategy is executed: the weights of the voice modality and the physiological modality are allocated evenly, and the true emotion is inferred from the fear characteristics of the voice and the high-arousal characteristics of the body. For an environmental interference conflict, a compensation strategy is executed: the failed sensor is identified, the weight of the failed modality is reduced, and the weights of the remaining stable modalities are increased. The core decision layer integrates the above weight adjustment results to generate a structured decision summary, and finally outputs a comprehensive user emotion profile in JSON format containing emotion labels, psychological stress levels, and the reasons for the weight allocation. This profile reflects the system's best estimate of the user's true internal state.
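The three weight strategies can be sketched as a small function. The source gives only the direction of each adjustment, so the multipliers `boost` and `damp`, the renormalization step, and the decision to leave the visual weight unchanged in the defensive case are assumptions of this sketch.

```python
def reallocate_weights(base_weights, conflict=None, affected=(), boost=3.0, damp=0.2):
    """Return normalized modality weights after applying one conflict strategy.

    `base_weights` must contain the keys "visual", "voice" and "physio".
    `boost` and `damp` are hypothetical multipliers, not values from the source.
    """
    weights = dict(base_weights)
    if conflict == "emotional_masquerade":
        weights["visual"] *= damp    # overt expression is treated as unreliable
        weights["physio"] *= boost   # physiological signal is given priority
    elif conflict == "defensive_emotion":
        shared = (weights["voice"] + weights["physio"]) / 2
        weights["voice"] = weights["physio"] = shared  # even split between the two
    elif conflict == "environmental_interference":
        for name in affected:
            weights[name] *= damp    # down-weight the degraded modality
    elif conflict is not None:
        raise ValueError(f"unknown conflict type: {conflict}")
    total = sum(weights.values())
    # Renormalizing raises the relative share of the modalities that were not damped.
    return {name: w / total for name, w in weights.items()}
```

For example, `reallocate_weights({"visual": 0.4, "voice": 0.3, "physio": 0.3}, "emotional_masquerade")` ranks the physiological modality highest and the visual modality lowest.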

[0029] As one implementation method, during the time interval between the end of a system conversation and the next user-initiated conversation, the following dynamic weight allocation logic is performed on the visual emotion of each frame: When a user exhibits common emotions such as neutral and happy for multiple consecutive frames, the system maintains an initial balanced weight for the seven categories of emotions: happiness, sadness, anger, surprise, neutrality, disgust, and fear, to ensure stable recognition of common emotions. When an uncommon emotion such as disgust or fear suddenly appears in a frame, the system immediately increases the weight of that emotion while appropriately reducing the weight of neutral and happy high-frequency emotions to highlight the attention to potential stress and negative emotions.
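The per-frame logic above can be sketched as follows. The source names disgust and fear only as examples of uncommon emotions and gives no numeric factors, so the `RARE` and `COMMON` sets and the multipliers `boost` and `damp` are assumptions.

```python
EMOTIONS = ("happiness", "sadness", "anger", "surprise", "neutrality", "disgust", "fear")
COMMON = ("neutrality", "happiness")  # high-frequency emotions named in the text
RARE = ("disgust", "fear")            # examples from the text, treated as the full set here


def frame_weights(frame_label, boost=2.0, damp=0.5):
    """Return the seven per-emotion weights to apply for one video frame."""
    if frame_label not in EMOTIONS:
        raise ValueError(f"unknown emotion label: {frame_label}")
    # Balanced initial weights for stable recognition of common emotions.
    weights = {e: 1.0 / len(EMOTIONS) for e in EMOTIONS}
    if frame_label in RARE:
        weights[frame_label] *= boost      # highlight the sudden uncommon emotion
        for e in COMMON:
            weights[e] *= damp             # tone down the high-frequency emotions
        total = sum(weights.values())
        weights = {e: w / total for e, w in weights.items()}
    return weights
```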

[0030] The allocation of emotion weights within a visual modality is a prerequisite for cross-modal consistency analysis. If the visual modality still outputs "positive emotion" after weighting, but the physiological modality detects "high arousal stress state", then an emotional masquerade conflict is triggered, further reducing the weight of the visual modality and increasing the weight of the physiological modality.

[0031] If the visual modality outputs "aggressive emotion" after weight allocation, while the speech modality extracts "fear features", then a defensive emotional conflict is triggered, and the weights of the speech and physiological modalities are evenly allocated.

[0032] The core difference between the two is: Intramodal weighting: Optimize emotion recognition accuracy by considering the frequency and significance differences of different emotion categories within a single modality.

[0033] Cross-modal dynamic weight allocation: Optimize the reliability of multimodal fusion decision-making to address signal conflicts or quality differences between different modes.

[0034] After obtaining the comprehensive user emotion profile, the system proceeds to the empathic response generation step. The system inputs the comprehensive user emotion profile in JSON format, together with preset structured prompts, as contextual information to the generation model (which can be the same large language model or a dedicated dialogue generation model). These prompts guide the model to generate an empathic response that matches the user's current emotional state (e.g., "the user appears calm but is highly anxious internally"), the relational boundaries (e.g., the "companion robot" role), and the interaction goals (e.g., "soothing"). The empathic response is a multimodal instruction sequence containing generated text content (e.g., comforting words), a suggested tone of voice (e.g., gentle, slow speech), and a corresponding embodied action sequence (e.g., the robot approaches the user, gently pats the user on the back, and displays a heart on its screen). The execution module (i.e., the embodied intelligent robot) coordinates the speech synthesis module, motion control module, and display module according to this instruction sequence to complete the multimodal empathic expression.
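
The response generation step of paragraph [0034] can be illustrated by the sketch below, which assembles a structured prompt and parses a multimodal instruction sequence. No model is called here. The names `build_prompt`, `parse_response`, and `EmpathicResponse`, the prompt wording, and the JSON keys `text`, `tone`, and `actions` are all assumptions; the specification does not define a prompt template or a response schema.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class EmpathicResponse:
    text: str            # generated text content, e.g. comforting words
    tone: str            # suggested tone of voice
    actions: List[str]   # embodied action sequence for the robot


def build_prompt(profile_json: str, role: str, goal: str) -> str:
    """Combine the emotion profile with role and goal into one prompt."""
    profile = json.loads(profile_json)  # fails early on malformed profiles
    return (
        f"You are acting as a {role}.\n"
        f"Interaction goal: {goal}.\n"
        f"User emotion profile: {json.dumps(profile)}\n"
        'Reply with JSON containing the keys "text", "tone" and "actions".'
    )


def parse_response(raw: str) -> EmpathicResponse:
    """Turn the model's JSON reply into an instruction sequence."""
    data = json.loads(raw)
    return EmpathicResponse(text=data["text"], tone=data["tone"],
                            actions=list(data["actions"]))


prompt = build_prompt('{"emotion_label": "anxiety", "stress_level": "high"}',
                      role="companion robot", goal="soothing")
# Hand-written stand-in for a model reply, for illustration only.
reply = ('{"text": "I am here with you.", "tone": "gentle, slow", '
         '"actions": ["approach_user", "pat_back", "show_heart"]}')
response = parse_response(reply)
```

The execution module would then dispatch `response.text` and `response.tone` to speech synthesis and `response.actions` to the motion control and display modules.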

[0035] Finally, the system executes the closed-loop step for post-interaction emotion trends. After outputting an empathic response, the system does not stop there but immediately begins a new round of monitoring. As shown in Figure 3, while executing the intelligent response, the execution module triggers the trend evaluation engine of the feedback and prediction layer. This engine continues to collect the user's subsequent multimodal data, performs feature extraction and preliminary analysis again, and calculates in real time the rate of change of the user's physiological indicators after the interaction, including but not limited to the heart rate recovery speed (i.e., the slope of the heart rate returning to the baseline), the stability of the respiratory rate, and the magnitude of the decrease in skin conductance. Based on these rates of change, the system determines the emotion evolution trend, classifying the interaction result as either a "stabilizing state" or a "deteriorating state." If the rate of change meets the preset recovery characteristics (such as a rapid decrease in heart rate and a stabilizing breathing pattern), the current empathy strategy is predicted to be effective. The system then suggests maintaining the current interaction boundaries and records the successful modal weight allocation scheme (such as an effective strategy for a specific user) in the individual preference database to provide more personalized services to that user in the future. If the rate of change matches the preset stress fluctuation characteristics (such as heart rate increasing instead of decreasing, and enhanced skin conductance), the system predicts that the interaction strategy may have triggered the user's stress response or failed to alleviate their negative emotions. The system then suggests immediately adjusting the weight bias or changing the interaction strategy in subsequent rounds and generates early-warning weight correction parameters. The feedback and prediction layer generates the next round of prediction and warning parameters based on the trend judgment results, and updates the starting data of the core decision layer through a feedback loop mechanism. For example, in the next round of interaction, the sensitivity to physiological signals may be increased in advance, or the large language model may be prompted to adopt a more cautious, tentative questioning strategy.
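
The trend evaluation of paragraph [0035] can be sketched as below, using a least-squares slope as the rate of change. The function names `slope` and `evaluate_trend`, the zero-slope decision rule, the output field names, and the sensitivity gain of 1.2 are assumptions; the specification does not define the preset recovery or stress fluctuation characteristics numerically. Respiratory rate stability is omitted for brevity, and any case that does not meet the assumed recovery rule is treated as deteriorating.

```python
def slope(samples, dt=1.0):
    """Least-squares slope of evenly spaced samples (units per second)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    t_mean = (n - 1) * dt / 2
    y_mean = sum(samples) / n
    num = sum((i * dt - t_mean) * (y - y_mean) for i, y in enumerate(samples))
    den = sum((i * dt - t_mean) ** 2 for i in range(n))
    return num / den


def evaluate_trend(heart_rate, skin_conductance, dt=1.0):
    """Classify the post-interaction trend and emit feedback parameters."""
    hr_slope = slope(heart_rate, dt)
    sc_slope = slope(skin_conductance, dt)
    # Assumed recovery rule: heart rate falling, skin conductance not rising.
    if hr_slope < 0 and sc_slope <= 0:
        return {"state": "stabilizing", "record_weights": True}
    # Otherwise emit hypothetical early-warning correction parameters.
    return {
        "state": "deteriorating",
        "record_weights": False,
        "physio_sensitivity_gain": 1.2,
        "prompt_hint": "cautious, tentative questioning",
    }


calming = evaluate_trend([95, 90, 86, 82], [5.0, 4.6, 4.3, 4.0])
stressed = evaluate_trend([82, 86, 91, 97], [4.0, 4.4, 4.9, 5.3])
```

A "stabilizing" result would lead to the current weight allocation being stored in the individual preference database, while a "deteriorating" result would feed the correction parameters back to the core decision layer for the next round.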

[0036] Through the closed-loop architecture shown in Figures 1 to 3, this invention realizes a complete interaction chain from low-level sensory perception to high-level logic verification and dynamic strategy correction. Through its logic intervention mechanism, the invention effectively identifies and exposes users' social pretenses, ensuring that the empathic feedback output by the embodied intelligence system addresses the user's true psychological needs. Furthermore, it continuously learns and optimizes through ongoing interaction, significantly improving the quality of human-computer empathic interaction in complex scenarios.

[0037] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0038] It should be noted that the components mentioned in the above embodiments are all general standard parts or components known to those skilled in the art. Their structures and principles can be learned by those skilled in the art through technical manuals or conventional experimental methods.

[0039] This invention has illustrated its principles and implementation methods using specific examples. The descriptions of these embodiments are merely illustrative of the method and its core ideas; furthermore, those skilled in the art will recognize that modifications may be made to the specific implementation methods and application scope based on the principles of this invention. Therefore, the content of this specification should not be construed as limiting the invention.

Claims

1. A multimodal empathic interaction method based on the fusion of physiological signals and audiovisual information, characterized in that it includes: a multimodal data acquisition step: synchronously acquiring the user's video signals, audio signals, and physiological signals; a multimodal feature extraction step: processing the video signal, audio signal, and physiological signal using a visual model, a speech model, and a physiological signal analysis algorithm, respectively, to obtain the emotion-related information and confidence level corresponding to each modality; an intelligent fusion decision-making step: integrating the emotion-related information and confidence scores of each modality obtained in the multimodal feature extraction step into structured data and inputting the structured data into a large language model, which performs cross-modal consistency analysis, conflict detection, and dynamic weight allocation on the structured data to generate a comprehensive user emotion profile; an empathic response generation step: generating a multimodal empathic response that includes text, voice, and action strategies based on the comprehensive user emotion profile; and a post-interaction emotion trend closed-loop step: after outputting the empathic response, continuing to collect subsequent multimodal data of the user, analyzing the emotion evolution trend, and feeding the trend analysis results back to the intelligent fusion decision-making step to update the weight allocation and emotion prediction for the next round.

2. The multimodal empathic interaction method based on the fusion of physiological signals and audiovisual information according to claim 1, characterized in that, in the multimodal feature extraction step, the visual model uses an emotion classification algorithm to process the video signal and outputs the emotion category and its confidence level; the speech model uses an end-to-end speech emotion recognition algorithm to extract the emotion features in the audio signal; and the physiological signal analysis algorithm extracts one or more physiological indicators from the physiological signals, such as heart rate, heart rate variability, pulse, blood oxygen, skin conductance, and respiratory rate, and their corresponding arousal levels.

3. The multimodal empathic interaction method based on the fusion of physiological signals and audiovisual information according to claim 1, characterized in that, in the intelligent fusion decision-making step, the emotion-related information and confidence scores of each modality are integrated into structured data in JSON format, the structured data including at least timestamps, emotion outputs of each modality, confidence scores of each modality, candidate weights, conflict markers, comprehensive emotion vectors, and context summary fields.

4. The multimodal empathic interaction method based on the fusion of physiological signals and audiovisual information according to claim 1, characterized in that the cross-modal consistency analysis includes: comparing whether the emotional valence of the visual modality matches the arousal index of the physiological modality; when the visual modality outputs positive emotions while the physiological modality outputs a high-arousal stress state, determining an emotional masquerade conflict; when the visual modality outputs aggressive emotions while the speech modality outputs fear characteristics, determining a defensive emotional conflict; and when the signal quality of any modality is lower than a preset threshold, determining an environmental interference conflict.

5. The multimodal empathic interaction method based on the fusion of physiological signals and audiovisual information according to claim 4, characterized in that the dynamic weight allocation includes: if the conflict is determined to be an emotional masquerade conflict, reducing the weight of the visual modality and increasing the weight of the physiological signal modality; if the conflict is determined to be a defensive emotional conflict, evenly allocating the weights of the speech modality and the physiological modality; and if the conflict is determined to be an environmental interference conflict, reducing the weight of the corresponding modality and increasing the weights of the remaining stable modalities.

6. The multimodal empathic interaction method based on the fusion of physiological signals and audiovisual information according to claim 1, characterized in that, within the time interval between the end of a dialogue and the start of the next dialogue by the user, the following dynamic weight allocation logic is applied to the visual emotion of each frame: when the user exhibits common emotions such as neutrality and happiness for multiple consecutive frames, the system maintains the initial balanced weights for the seven emotion categories (happiness, sadness, anger, surprise, neutrality, disgust, and fear) to ensure stable recognition of common emotions; and when an uncommon emotion such as disgust or fear suddenly appears in a frame, the system immediately increases the weight of that emotion while appropriately reducing the weights of the high-frequency emotions neutrality and happiness, so as to draw attention to potential stress and negative emotions.

7. The multimodal empathic interaction method based on the fusion of physiological signals and audiovisual information according to claim 1, characterized in that the post-interaction emotion trend closed-loop step includes: calculating in real time the rate of change of the user's subsequent physiological indicators; if the rate of change meets the preset smoothing characteristics, predicting that the current empathy strategy is effective and recording the current weight allocation scheme in the individual preference database; and if the rate of change meets the preset stress fluctuation characteristics, predicting that the current strategy has failed, generating an early-warning weight correction parameter, and feeding it back to the intelligent fusion decision-making step.

8. A multimodal empathic interaction system based on the fusion of physiological signals and audiovisual information, employing the multimodal empathic interaction method based on the fusion of physiological signals and audiovisual information as described in any one of claims 1-7, characterized in that it includes: an acquisition module, used to synchronously acquire the user's video signals, audio signals, and physiological signals; a perception module, including a visual model processing unit, a speech feature extraction unit, and a physiological signal analysis unit, which process the video signal, audio signal, and physiological signal respectively and output the emotion-related information and confidence level corresponding to each modality; a fusion decision module, including a large language model analyzer, which integrates the emotion-related information and confidence scores of each modality output by the perception module into structured data and performs cross-modal consistency analysis, conflict detection, and dynamic weight allocation to generate a comprehensive user emotion profile; an execution module, used to generate and execute multimodal empathic responses based on the comprehensive user emotion profile; and a closed-loop monitoring module, which, after an empathic response is output, continues to collect subsequent multimodal data from the user, analyzes the emotion evolution trend, and feeds the trend analysis results back to the fusion decision module.

9. The multimodal empathic interaction system based on the fusion of physiological signals and audiovisual information according to claim 8, characterized in that, in the perception module, the visual model processing unit adopts an emotion classification algorithm, the speech feature extraction unit adopts an end-to-end speech emotion recognition algorithm, and the physiological signal analysis unit adopts non-contact rPPG technology or contact sensors to extract physiological indicators; the fusion decision module is deployed on an edge computing box or cloud server; and the execution module is an embodied intelligent robot terminal with text, voice, and action output capabilities.

10. A storage medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, it implements the multimodal empathic interaction method based on the fusion of physiological signals and audiovisual information as described in any one of claims 1 to 7.