A multi-modal emotional interaction system suitable for companion devices

By improving the MGCMA model and the multimodal fusion method, the invention addresses the cross-modal alignment problem in multimodal data processing, improving the accuracy of emotion recognition and the rationality of interactive behavior, and thereby enhancing the emotion perception and response capabilities of companion devices.

CN122196694A · Pending · Publication Date: 2026-06-12 · 江西冠英智能科技股份有限公司


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: 江西冠英智能科技股份有限公司
Filing Date: 2026-04-09
Publication Date: 2026-06-12

Application Information

Patent Timeline

09 Apr 2026: Application
12 Jun 2026: Publication (CN122196694A)

IPC: G06F18/241; G06F18/25; G06F3/01

AI Tagging

Application Domain

Input/output for user-computer interaction; Graph reading




AI Technical Summary

Technical Problem

Existing technologies lack a unified cross-modal alignment and association mechanism in multimodal data processing, resulting in insufficient information utilization, limited accuracy of emotion judgment results, and failure to comprehensively consider device capabilities, interaction scenarios, and user attribute constraints in interaction decisions, leading to interaction behaviors that do not conform to actual execution conditions.

Method used

A multimodal fusion and collaborative scheduling method is adopted. Cross-modal correlation modeling and emotion determination are performed through an improved MGCMA model. Combined with device capabilities, interaction scenarios and user attribute constraints, multi-channel collaborative interaction commands are generated to realize emotion recognition and interaction control.

Benefits of technology

It improves the accuracy of emotion judgment results and the rationality of interactive behavior, enhances the system's ability to identify and adapt to complex emotional states, and improves the coherence of multi-channel output and user experience.

✦ Generated by Eureka AI based on patent content.

Smart Images

Figure CN122196694A_ABST

Patent Text Reader

Abstract

The application discloses a multimodal emotion interaction system suitable for companion devices, comprising: a data acquisition module for collecting and preprocessing multimodal raw data in companion device interaction scenarios; a feature construction module for extracting the original multimodal feature set; an emotion determination module for outputting an emotion determination result through an improved MGCMA model; a decision construction module for determining emotion interaction decision constraints and associating and combining them with the emotion determination result; a result generation module for determining a target interaction result from a preset set of candidate interaction behaviors; an instruction generation module for mapping the target interaction result to the companion device and generating a multi-channel collaborative interaction instruction set; and an instruction execution module for executing the multi-channel collaborative interaction instruction set. The application adopts a multimodal fusion and collaborative scheduling method to realize emotion recognition and interaction control, with the advantages of high recognition accuracy and strong interaction coordination.


Description

Technical Field

[0001] This invention relates to the field of emotion interaction, and more particularly to a multimodal emotion interaction system suitable for companion devices.

Background Technology

[0002] As the application of companion devices in smart terminals continues to expand, technologies that adjust interactions based on users' emotional states are gaining increasing attention. Existing technologies typically determine user emotions using single-modal methods such as speech recognition, facial expression recognition, or behavioral analysis, and trigger preset interactive behaviors based on the determination results to achieve basic human-computer interaction functions. However, these technologies still have certain limitations in multimodal data fusion, emotional semantic expression, and interactive response coordination, making it difficult to meet the needs of emotion perception and response in complex interactive scenarios.

[0003] Existing technologies often lack a unified cross-modal alignment and association mechanism in multimodal data processing, leading to insufficient utilization of information across different modalities and limited accuracy in emotion judgment. Furthermore, during the interaction decision-making stage, most solutions fail to comprehensively consider device capability constraints, interaction scenario constraints, and user attribute constraints, easily resulting in interactive behaviors that do not conform to actual execution conditions. In addition, the lack of an effective channel coordination and scheduling mechanism during multi-channel output makes it difficult to achieve orderly coordination between voice, visual, and motion feedback, impacting the overall interaction effect and user experience.

[0004] Therefore, how to provide a multimodal emotion interaction system suitable for companion devices is a problem that those skilled in the art urgently need to solve.

Summary of the Invention

[0005] One objective of this invention is to propose a multimodal emotion interaction system suitable for companion devices. This invention employs a multimodal fusion and collaborative scheduling method to achieve emotion recognition and interactive control, possessing the advantages of high recognition accuracy and strong interactive coordination.

[0006] A multimodal emotion interaction system suitable for companion devices according to an embodiment of the present invention includes:
a data acquisition module, used to collect multimodal raw data in the interaction scenarios of companion devices and preprocess it to generate a standardized multimodal input dataset;
a feature construction module, used to extract multimodal features from the standardized multimodal input dataset and establish the correspondence between the features of each modality to obtain the original multimodal feature set;
an emotion determination module, used to perform cross-modal correlation modeling and emotion determination processing on the original multimodal feature set through an improved MGCMA model and output the emotion determination result;
a decision construction module, used to determine the emotion interaction decision constraints and to associate and combine the emotion determination result with those constraints to obtain the set of emotion interaction decision inputs;
a result generation module, used to determine the target interaction result from a preset set of candidate interaction behaviors based on the set of emotion interaction decision inputs;
an instruction generation module, used to map the target interaction result to the voice output channel, visual display channel, and action execution channel of the companion device and generate a multi-channel collaborative interaction instruction set based on the channel scheduling relationship;
an instruction execution module, used to send the multi-channel collaborative interaction instruction set to the companion device and drive the device to perform the corresponding interactive operations.

[0007] Optionally, the multimodal raw data includes user voice data, facial expression image data, body behavior data, interaction scene data, and user attribute data. The preprocessing includes aligning the multimodal raw data based on timestamps, removing abnormal segments, completing missing segments, unifying the dimensions of each modality, and associating and organizing the multimodal raw data according to the interaction object identifier and the collection time sequence.

[0008] Optionally, the feature construction module is configured to:
perform modality separation and structured segmentation on the standardized multimodal input dataset to generate speech time window sequences, facial expression image frame sequences, behavioral action segment sequences, scene state segment sequences, and attribute record sequences;
perform speech information parsing on each speech time window in the speech time window sequence to obtain speech emotion features;
perform facial expression semantic parsing on each facial expression image frame in the facial expression image frame sequence to obtain facial expression semantic features;
perform behavioral state parsing on each behavioral action segment in the behavioral action segment sequence to obtain behavioral state features;
perform scene content recognition on the scene state information in the scene state segment sequence to obtain scene context features;
perform attribute content processing on each attribute record in the attribute record sequence to obtain user attribute features;
match, based on a unified time reference and the interaction object identifier, the time positions corresponding to the speech emotion features, facial expression semantic features, behavioral state features, scene context features, and user attribute features, group feature items whose time interval is less than a preset time threshold and whose interaction object identifiers are consistent into the same association group, and establish the correspondence between the features to obtain the original multimodal feature set.
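The association-group rule at the end of this paragraph can be illustrated with a minimal Python sketch. The `FeatureItem` record, the function name, and the choice to measure the time interval from the first item of each group are assumptions made for illustration; the patent text does not specify how the interval is anchored.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureItem:
    modality: str     # e.g. "speech", "expression", "behavior" (hypothetical labels)
    object_id: str    # interaction object identifier
    timestamp: float  # seconds on the unified time reference
    values: tuple     # placeholder for the extracted feature values


def build_association_groups(items, time_threshold):
    """Group items that share an object_id and lie within time_threshold
    of the first item in the group (assumed anchoring rule)."""
    by_object = defaultdict(list)
    for item in items:
        by_object[item.object_id].append(item)

    groups = []
    for object_id in sorted(by_object):
        current = []
        for item in sorted(by_object[object_id], key=lambda i: i.timestamp):
            # Start a new group once the gap to the group's first item
            # is no longer below the threshold.
            if current and item.timestamp - current[0].timestamp >= time_threshold:
                groups.append(current)
                current = []
            current.append(item)
        if current:
            groups.append(current)
    return groups


items = [
    FeatureItem("speech", "u1", 0.00, (0.3,)),
    FeatureItem("expression", "u1", 0.10, (0.7,)),
    FeatureItem("behavior", "u1", 0.90, (0.1,)),
    FeatureItem("speech", "u2", 0.05, (0.5,)),
]
groups = build_association_groups(items, time_threshold=0.5)
```

Here the two `u1` items that are 0.10 s apart share a group, the later `u1` item starts a new group, and the `u2` item is never mixed with `u1` regardless of timing.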

[0009] Optionally, the emotion determination module is configured to:
input the original multimodal feature set into the multi-branch input encoding layer of the improved MGCMA model, and perform encoding through the speech encoding branch, expression encoding branch, and behavior encoding branch to obtain the speech emotion encoding result, expression semantic encoding result, and behavior state encoding result, respectively;
perform cross-modal distribution-level alignment on the speech emotion encoding result, expression semantic encoding result, and behavior state encoding result through a distribution-level alignment unit to obtain the distribution-level alignment result;
perform cross-modal sample-level alignment on the distribution-level alignment result through a sample-level alignment unit to obtain the sample-level alignment result;
perform fine-grained cross-modal alignment on the speech time window units, facial expression local region units, and behavioral action segment units in the sample-level alignment result through a fine-grained cross-modal alignment unit to obtain the fine-grained cross-modal alignment result;
perform joint embedding on the fine-grained cross-modal alignment result, the scene context features, and the user attribute features to obtain the emotion fusion representation result;
input the emotion fusion representation result into the structured determination output layer to perform emotion category determination, emotion intensity determination, and emotion semantic state determination, and output the emotion determination result, which includes an emotion category identifier, an emotion intensity identifier, and an emotion semantic state identifier.

[0010] Optionally, the decision construction module is configured to:
perform boundary processing on the voice output capability, visual display capability, action execution capability, and channel parallel scheduling capability of the companion device to determine the voice output range, visual presentation range, action feedback range, and channel coordination range, and aggregate these ranges to obtain the device capability constraints;
determine, based on the scene context features, the environment adaptation conditions, time period adaptation conditions, location adaptation conditions, and object adaptation conditions corresponding to the current interaction state, and organize these adaptation conditions to obtain the interaction scenario constraints;
determine, based on the user attribute features, the age adaptation conditions, relationship adaptation conditions, preference adaptation conditions, habit adaptation conditions, and sensitive response adaptation conditions corresponding to the current interaction object, and organize these adaptation conditions to obtain the user attribute constraints;
jointly organize the emotion type information corresponding to the emotion category identifier, the intensity level information corresponding to the emotion intensity identifier, and the semantic tendency information corresponding to the emotion semantic state identifier in the emotion determination result to obtain the basic information for emotion decision-making;
calculate a constraint satisfaction level value for each of the device capability constraints, interaction scenario constraints, and user attribute constraints, and perform weighted summation to obtain a comprehensive constraint satisfaction value;
associate the basic information for emotion decision-making with the device capability constraints, interaction scenario constraints, and user attribute constraints to construct candidate units for emotion interaction decisions;
match each candidate unit with its corresponding comprehensive constraint satisfaction value, compare that value with a preset constraint threshold, and retain the candidate units that meet the threshold to obtain the set of emotion interaction decision inputs.
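A minimal Python sketch of the weighted summation and threshold screening described in this paragraph follows. The dictionary layout, the weight values, the threshold, and the reading of "meets the threshold" as greater than or equal are all assumptions for illustration; the patent does not give concrete values or a data format.

```python
def comprehensive_satisfaction(levels, weights):
    """Weighted sum of the per-constraint satisfaction level values."""
    return sum(weights[name] * levels[name] for name in weights)


def filter_candidate_units(candidates, weights, threshold):
    """Keep candidate units whose comprehensive value reaches the threshold."""
    retained = []
    for unit in candidates:
        value = comprehensive_satisfaction(unit["levels"], weights)
        if value >= threshold:  # assumed meaning of "meets the threshold"
            retained.append({**unit, "satisfaction": value})
    return retained


# Hypothetical weights for the three constraint families named in the text.
weights = {"device": 0.5, "scene": 0.25, "user": 0.25}

candidates = [
    {"name": "unit_a", "levels": {"device": 1.0, "scene": 0.5, "user": 0.5}},
    {"name": "unit_b", "levels": {"device": 0.25, "scene": 0.5, "user": 1.0}},
]

decision_inputs = filter_candidate_units(candidates, weights, threshold=0.6)
```

With these illustrative numbers `unit_a` scores 0.75 and is retained, while `unit_b` scores 0.5 and is dropped.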

[0011] Optionally, the result generation module is configured to:
perform behavior mapping and sorting on each candidate interaction behavior in the preset set of candidate interaction behaviors based on the set of emotion interaction decision inputs, to determine the candidate voice interaction content, candidate visual presentation content, candidate action feedback content, and candidate channel scheduling relationship corresponding to each candidate interaction behavior;
calculate, from these four items for each candidate interaction behavior, the emotion matching degree value, device adaptation degree value, scene adaptation degree value, and attribute adaptation degree value, and perform weighted summation to obtain the comprehensive behavior score of each candidate interaction behavior;
compare the comprehensive behavior score of each candidate interaction behavior with a preset behavior retention threshold, retain the candidate interaction behaviors that reach the threshold, and remove those that do not, to obtain the initially screened candidate interaction behavior set;
perform cross-behavior compatibility judgment on the candidate interaction behaviors in the initially screened candidate interaction behavior set to obtain a candidate interaction behavior combination set;
perform adaptation matching scoring on each candidate interaction behavior combination in the combination set to obtain the corresponding combination score;
take the candidate interaction behavior combination with the highest combination score as the target interaction behavior combination;
determine the target execution order from the candidate channel scheduling relationship, the emotion expression priority value, and the interaction response timeliness value of each candidate interaction behavior in the target interaction behavior combination;
aggregate, according to the target interaction behavior combination and the target execution order, the candidate voice interaction content, candidate visual presentation content, candidate action feedback content, and candidate channel scheduling relationship in the target interaction behavior combination to obtain the target interaction result.
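The screening, combination, and ordering steps above can be sketched in Python as follows. Several choices here are assumptions rather than details from the patent: equal weights, a combination score equal to the plain sum of member scores, compatibility expressed as a set of forbidden pairs, and an execution order sorted by descending priority and then descending timeliness. All behavior names are hypothetical.

```python
from itertools import combinations


def behavior_score(behavior, weights):
    """Weighted sum of the four matching/adaptation degree values."""
    return sum(weights[key] * behavior["degrees"][key] for key in weights)


def select_target(behaviors, weights, retain_threshold, incompatible_pairs):
    # Initial screening against the behavior retention threshold.
    kept = [b for b in behaviors if behavior_score(b, weights) >= retain_threshold]

    best_combo, best_score = (), float("-inf")
    for size in range(1, len(kept) + 1):
        for combo in combinations(kept, size):
            names = {b["name"] for b in combo}
            # Cross-behavior compatibility: skip combos containing a forbidden pair.
            if any(pair <= names for pair in incompatible_pairs):
                continue
            score = sum(behavior_score(b, weights) for b in combo)  # assumed formula
            if score > best_score:
                best_combo, best_score = combo, score

    # Assumed ordering rule: higher priority first, then higher timeliness.
    ordered = sorted(best_combo, key=lambda b: (-b["priority"], -b["timeliness"]))
    return [b["name"] for b in ordered], best_score


weights = {"emotion": 0.25, "device": 0.25, "scene": 0.25, "attribute": 0.25}
behaviors = [
    {"name": "soothing_speech", "priority": 3, "timeliness": 2,
     "degrees": {"emotion": 1.0, "device": 1.0, "scene": 0.5, "attribute": 1.0}},
    {"name": "calm_face", "priority": 2, "timeliness": 3,
     "degrees": {"emotion": 0.75, "device": 1.0, "scene": 0.75, "attribute": 0.5}},
    {"name": "energetic_dance", "priority": 1, "timeliness": 1,
     "degrees": {"emotion": 0.25, "device": 0.5, "scene": 0.25, "attribute": 0.0}},
    {"name": "loud_alarm", "priority": 1, "timeliness": 3,
     "degrees": {"emotion": 0.5, "device": 1.0, "scene": 0.5, "attribute": 0.5}},
]
incompatible = {frozenset({"soothing_speech", "loud_alarm"})}

order, score = select_target(behaviors, weights, 0.5, incompatible)
```

In this example `energetic_dance` fails the retention threshold, `loud_alarm` cannot be combined with `soothing_speech`, and the pair of `soothing_speech` and `calm_face` wins with a combined score of 1.625. Exhaustive enumeration is only practical for small candidate sets.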

[0012] Optionally, the instruction generation module is configured to:
extract the voice interaction content, visual presentation content, and action feedback content from the target interaction result, and match them with the execution capabilities of the voice output channel, visual display channel, and action execution channel of the companion device to obtain voice interaction command items, visual presentation command items, and action feedback command items;
determine, based on the channel scheduling relationship in the target interaction result, the start order, parallel output relationship, duration connection relationship, and conflict avoidance relationship among the voice output channel, visual display channel, and action execution channel to obtain the channel collaborative scheduling result;
perform channel allocation on the voice interaction command items, visual presentation command items, and action feedback command items based on the channel collaborative scheduling result, and adjust the execution sequence of the commands in each command queue according to the start order, parallel output relationship, duration connection relationship, and conflict avoidance relationship to obtain a multi-channel execution sequence;
convert, based on the multi-channel execution sequence, the voice interaction command items into voice output channel control commands, the visual presentation command items into visual display channel control commands, and the action feedback command items into action execution channel control commands, then sort and integrate the control commands of each channel to generate the multi-channel collaborative interaction instruction set.
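One way to picture the channel collaborative scheduling described above is the small Python scheduler below. It is a sketch under stated assumptions, not the patent's scheduler: commands arrive already in start order, a `parallel` flag stands in for the parallel output relationship, a non-parallel command starts when the previous one ends (duration connection), and conflict avoidance means that two commands never overlap on the same channel. The `Command` type and all command names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Command:
    name: str
    channel: str            # "voice", "visual" or "action"
    duration: float         # seconds
    parallel: bool = False  # start together with the previous command


def schedule(commands):
    """Return a time-sorted list of (start_time, channel, name) tuples."""
    channel_free_at = {}
    timeline = []
    prev_start, prev_end = 0.0, 0.0
    for cmd in commands:
        # Parallel commands share the previous start; others wait for its end.
        start = prev_start if cmd.parallel else prev_end
        # Conflict avoidance: wait until this command's channel is free.
        start = max(start, channel_free_at.get(cmd.channel, 0.0))
        end = start + cmd.duration
        channel_free_at[cmd.channel] = end
        timeline.append((start, cmd.channel, cmd.name))
        prev_start, prev_end = start, end
    return sorted(timeline)


timeline = schedule([
    Command("greet", "voice", 2.0),
    Command("smile", "visual", 3.0, parallel=True),
    Command("nod", "action", 1.0),
    Command("ask", "voice", 1.5, parallel=True),
    Command("follow_up", "voice", 1.0, parallel=True),
])
```

Here `follow_up` is requested in parallel with `ask`, but both use the voice channel, so it is pushed back to 4.5 s when that channel becomes free.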

[0013] Optionally, the interactive operations include voice output channel performing voice broadcast operations, visual display channel performing interface presentation and image display operations, and action execution channel performing action feedback operations.

[0014] The beneficial effects of this invention are: This invention achieves unified modeling and collaborative utilization of multimodal emotional information by constructing a complete processing chain covering data acquisition, feature construction, emotion determination, decision generation, and multi-channel execution. Compared with existing technologies that rely on a single modality or simple fusion methods, this solution establishes a multimodal original feature set based on multi-source data such as speech, facial expressions, behavior, scene, and user attributes. Furthermore, it enhances semantic consistency between different modalities through cross-modal association modeling and fine-grained alignment mechanisms. This results in higher accuracy and expressive completeness of emotion determination results at the level of category, intensity, and semantic state, thereby enhancing the system's ability to recognize complex emotional states.

[0015] Building upon the emotion assessment results, this solution further incorporates device capability constraints, interaction scenario constraints, and user attribute constraints. By constructing an emotional interaction decision input set, it achieves a unified correlation between emotional states and actual execution conditions, ensuring that the generated interaction results not only meet the needs of emotional expression but also possess executability and adaptability. Compared to existing technologies that directly trigger interaction behavior based on emotional results, this solution avoids interaction outputs that do not conform to device capabilities or user characteristics, improves the rationality and stability of interaction behavior, and enhances the system's adaptability in diverse usage scenarios.

[0016] During the interactive execution phase, this solution maps the target interaction result to the voice output channel, visual display channel, and action execution channel, and generates a multi-channel collaborative interaction instruction set based on channel scheduling relationships. This achieves timing coordination and conflict avoidance among multi-channel outputs. Compared to traditional single-channel or simple parallel output methods, this solution establishes a unified collaborative execution mechanism among voice broadcasting, visual presentation, and action feedback, ensuring consistency in timing and resource allocation across various interaction forms. This enhances the overall continuity and immersion of the interaction, significantly improving the user experience.

Description of the Drawings

[0017] The accompanying drawings are provided to further illustrate the invention and form part of the specification. They are used in conjunction with embodiments of the invention to explain the invention and do not constitute a limitation thereof. In the drawings:
Figure 1 is a schematic diagram of the structure of the multimodal emotion interaction system suitable for companion devices proposed in this invention;
Figure 2 is a flowchart of outputting the emotion determination result in the multimodal emotion interaction system suitable for companion devices proposed in this invention;
Figure 3 is a flowchart of determining the target interaction result in the multimodal emotion interaction system suitable for companion devices proposed in this invention.

Detailed Implementation

[0018] The present invention will now be described in further detail with reference to the accompanying drawings. These drawings are simplified schematic diagrams, illustrating only the basic structure of the invention, and therefore only show the components relevant to the invention.

[0019] Referring to Figures 1-3, a multimodal emotion interaction system suitable for companion devices comprises:
a data acquisition module, used to collect multimodal raw data in the interaction scenarios of companion devices and preprocess it to generate a standardized multimodal input dataset;
a feature construction module, used to extract multimodal features from the standardized multimodal input dataset and establish the correspondence between the features of each modality to obtain the original multimodal feature set;
an emotion determination module, used to perform cross-modal correlation modeling and emotion determination processing on the original multimodal feature set through an improved MGCMA model and output the emotion determination result;
a decision construction module, used to determine the emotion interaction decision constraints and to associate and combine the emotion determination result with those constraints to obtain the set of emotion interaction decision inputs;
a result generation module, used to determine the target interaction result from a preset set of candidate interaction behaviors based on the set of emotion interaction decision inputs;
an instruction generation module, used to map the target interaction result to the voice output channel, visual display channel, and action execution channel of the companion device and generate a multi-channel collaborative interaction instruction set based on the channel scheduling relationship;
an instruction execution module, used to send the multi-channel collaborative interaction instruction set to the companion device and drive the device to perform the corresponding interactive operations.

[0020] In this embodiment, the multimodal raw data includes user voice data, facial expression image data, body behavior data, interaction scene data, and user attribute data. The preprocessing includes aligning the multimodal raw data based on timestamps, removing abnormal segments, completing missing segments, unifying the dimensions of each modality, and associating and organizing the multimodal raw data according to the interaction object identifier and the collection time sequence.
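The preprocessing steps listed in this paragraph can be sketched for a single scalar data stream as follows. This is a simplified illustration under assumptions the patent does not state: timestamps are snapped to a fixed grid, a reading outside a valid range counts as abnormal, missing grid points are filled with the last valid value, and "unifying the dimensions" is read as min-max scaling to [0, 1]. Organizing the data by interaction object identifier is omitted.

```python
def preprocess(samples, step, lo, hi):
    """samples: list of (timestamp_seconds, value_or_None) for one stream.

    Returns a list of (grid_time, normalized_value) on a regular grid.
    """
    grid = {}
    for ts, value in samples:
        # Abnormal-segment removal: drop missing or out-of-range readings.
        if value is None or not (lo <= value <= hi):
            continue
        # Timestamp alignment: snap to the nearest grid index.
        grid[round(ts / step)] = value

    if not grid:
        return []

    first, last = min(grid), max(grid)
    previous = grid[first]
    aligned = []
    for idx in range(first, last + 1):
        # Missing-segment completion: carry the last valid value forward.
        previous = grid.get(idx, previous)
        # Scale unification: min-max normalization into [0, 1].
        aligned.append((idx * step, (previous - lo) / (hi - lo)))
    return aligned


raw = [(0.02, 10.0), (0.49, 20.0), (1.01, 999.0), (1.52, 30.0)]
clean = preprocess(raw, step=0.5, lo=0.0, hi=40.0)
```

The out-of-range reading at 1.01 s is discarded and its grid slot is filled from the preceding valid value, so the output is evenly spaced at 0.5 s.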

[0021] In this embodiment, the feature construction module includes: Modality separation and structured segmentation are performed on standardized multimodal input datasets to generate speech time window sequences, facial expression image frame sequences, behavioral action segment sequences, scene state segment sequences, and attribute record sequences. Modality separation and structured segmentation include sliding windowing of user speech data according to a preset time window length, organizing facial expression image data into frame sequences according to the image acquisition order, segmenting limb behavior data into action segments according to continuous action change intervals, dividing interactive scene data into state segments according to the scene state change process, and forming attribute record sequences according to the interactive object identifiers. Speech information parsing and processing are performed on each speech time window in the speech time window sequence to obtain speech emotion features. Speech emotion features include fundamental frequency change features, short-time energy change features, formant distribution features, speech rate change features, pause interval features, and spectral envelope features corresponding to each speech time window. Based on each facial expression image frame in the facial expression image frame sequence, facial expression semantic parsing is performed to obtain facial expression semantic features, which include local texture features, contour deformation features, facial expression action unit intensity features, and region position change features. Behavioral state analysis is performed on each behavioral action segment in the sequence of behavioral action segments to obtain behavioral state features, which include head posture change features, limb swing amplitude features, trunk orientation change features, movement rhythm features, movement speed features, and movement stagnation features. 
By combining the scene state information in the scene state segment sequence, scene content recognition processing is performed to obtain scene context features. Scene content recognition includes determining the interaction environment category, ambient lighting status, ambient noise status, device location status, interaction time period status, and surrounding object status. Based on the attribute record information in the attribute record sequence, attribute content processing is performed to obtain user attribute characteristics. Attribute content processing includes determining user age stratification attributes, identity relationship attributes, historical preference attributes, interaction habit attributes, and sensitive response attributes. Based on a unified time reference and interaction object identifier, the time positions corresponding to voice emotion features, facial expression semantic features, behavioral state features, scene context features, and user attribute features are matched. Feature items with a time interval less than a preset time threshold and consistent interaction object identifiers are grouped into the same association group, and the correspondence between each feature is established to obtain the multimodal original feature set.
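The speech windowing step in this embodiment, together with one of the listed speech emotion features (short-time energy), can be sketched as below. The hop size is an assumption, since the text only mentions a preset time window length, and the mean-square definition of short-time energy is one common convention rather than something the patent specifies.

```python
def sliding_windows(samples, window_len, hop):
    """Split a sample sequence into fixed-length, possibly overlapping windows."""
    return [
        samples[start:start + window_len]
        for start in range(0, len(samples) - window_len + 1, hop)
    ]


def short_time_energy(window):
    """Mean squared amplitude of one window (assumed definition)."""
    return sum(x * x for x in window) / len(window)


signal = [0, 1, 2, 3, 4, 5]
windows = sliding_windows(signal, window_len=4, hop=2)
energies = [short_time_energy(w) for w in windows]
```

Trailing samples that do not fill a complete window are dropped in this sketch; a real pipeline might pad them instead.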

[0022] In this embodiment, the emotion determination module includes: The original multimodal feature set is input into the multi-branch input encoding layer of the improved MGCMA model, and encoding processing is performed through the speech encoding branch, expression encoding branch and behavior encoding branch to obtain speech emotion encoding result, expression semantic encoding result and behavior state encoding result, respectively. Compared to existing models, the improved MGCMA model structurally adjusts the input method of the original modal feature encoding layer to a multi-branch input encoding layer consisting of speech encoding, facial expression encoding, and behavioral encoding branches. This allows each modal feature to complete branch processing at the input stage, thereby reducing intermodal interference and improving the independence of feature representation. In terms of alignment mechanism, the original word-level alignment method is adjusted to a fine-grained cross-modal alignment unit oriented towards speech time window units, facial expression local region units, and behavioral action segment units. This refines the correspondence between different modalities from overall matching to unit-level matching, thereby improving the accuracy of cross-modal semantic association and enhancing the recognition ability in complex emotional scenarios. In terms of output structure, the fusion and classification output layer is adjusted to a structured decision output layer, enabling the model to simultaneously output emotion category labels, emotion intensity labels, and emotion semantic state labels, thereby improving the completeness of emotion expression and enhancing the model's adaptability in diverse interaction scenarios. The improved MGCMA model's loss function includes a multimodal consistency constraint, a fine-grained alignment constraint, and a structured decision constraint. 
The loss values corresponding to each constraint are weighted and summed to obtain the overall loss value. Based on the overall loss value, gradient backpropagation is used to calculate the gradient information of each layer's parameters, and parameter iterative updates are performed according to the gradient information. Simultaneously, the gradient values are amplitude-limited, and the learning rate is adaptively adjusted. During training, the overall loss value is continuously monitored. When the difference between two consecutive iterations of the overall loss value within a preset training period is less than a preset convergence threshold, the improved MGCMA model is considered to have converged, and parameter updates are stopped. The improved MGCMA model training parameters include: randomly initializing the parameters of each layer in the improved MGCMA model according to a uniform distribution range between 0.01 and 0.1; setting the training batch size to an integer value between 32 and 128, and setting the training period to an integer value between 50 and 200; setting the initial learning rate value to between 0.001 and 0.01, and performing piecewise decay of the learning rate according to a preset step size during training, and adaptively adjusting the learning rate according to the overall loss value change, and proportionally reducing the current learning rate to 0.1 to 0.5 of the original value when the preset training period node is reached; setting the gradient magnitude limit threshold to a value between 1 and 5, and truncating gradient values exceeding this threshold during parameter updates; setting the weight coefficients corresponding to the multimodal consistency constraint, fine-grained alignment constraint, and structured decision constraint to values between 0.1 and 1, and proportionally adjusting the weight coefficients according to the loss change magnitude of each constraint over several consecutive training periods; The process of 
obtaining the speech emotion coding result includes: sequentially organizing each speech time window corresponding to the speech emotion feature through the speech coding branch; arranging the feature data corresponding to each speech time window in chronological order to form a speech time window feature sequence; adding time position identification information to the speech time window feature sequence according to the time position of each speech time window in the overall speech sequence; and performing continuous correlation calculation on the feature changes between adjacent speech time windows in the speech time window feature sequence, so that each speech time window feature simultaneously contains the emotion change information of the preceding and following speech time windows, thus obtaining the speech emotion coding result. The results of facial expression semantic coding are obtained by: performing region expansion processing on each facial expression image frame corresponding to the facial expression semantic features through the facial expression coding branch; sequentially combining the features of each local facial expression region unit in each facial expression image frame according to the preset spatial distribution rules to form the facial expression image frame feature vector; adding spatial position identification information to the facial expression image frame feature vector according to the relative position relationship of each local facial expression region unit in the facial expression image frame; and further uniformly calculating the feature association relationship between each local facial expression region unit in the same facial expression image frame, so that the facial expression image frame features simultaneously reflect the semantic linkage relationship between each region, thus obtaining the facial expression semantic coding results. 
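The speech coding branch steps described above (chronological ordering, time position identification, and adjacent-window change calculation) can be sketched as follows. This is a simplified illustration under assumptions: the function name, the use of plain differences as the "continuous correlation calculation", and the normalized index used as the position identifier are all hypothetical choices, not the patent's specified operations.

```python
def encode_speech_windows(windows):
    """Encode speech time windows given as (start_time, feature_vector) pairs.

    Each output vector holds the window's own features, the change from the
    preceding window, the change to the following window, and a position value.
    """
    ordered = sorted(windows, key=lambda w: w[0])  # chronological order
    feats = [list(f) for _, f in ordered]
    n = len(feats)
    encoded = []
    for i, f in enumerate(feats):
        zeros = [0.0] * len(f)
        prev_delta = [a - b for a, b in zip(f, feats[i - 1])] if i > 0 else zeros
        next_delta = [b - a for a, b in zip(f, feats[i + 1])] if i < n - 1 else zeros
        position = i / (n - 1) if n > 1 else 0.0  # time position identifier
        encoded.append(f + prev_delta + next_delta + [position])
    return encoded

# Three one-dimensional windows supplied out of order.
encoded = encode_speech_windows([(2.0, [3.0]), (0.0, [1.0]), (1.0, [2.0])])
```

The expression and behavior branches follow the same pattern, with spatial positions of local regions or stage positions within an action segment taking the place of the time index.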
The process of obtaining the behavior state coding result includes: dividing, through the behavior coding branch, each behavior action segment corresponding to the behavior state feature into multiple continuous action stages according to the action execution process; sequentially combining the feature data corresponding to each stage to form a behavior action segment feature sequence; adding temporal position identification information to the behavior action segment feature sequence based on the time position of each behavior action segment in the overall behavior process; and performing cross-segment association calculation on the state evolution relationship between different behavior action segments, so that the behavior action segment features reflect the continuous state transition information in the action change process, thus obtaining the behavior state coding result.
Cross-modal distribution-level alignment processing is performed on the speech emotion coding results, facial expression semantic coding results, and behavioral state coding results using a distribution-level alignment unit to obtain the distribution-level alignment result. The cross-modal distribution-level alignment processing includes: grouping the speech emotion coding results, facial expression semantic coding results, and behavioral state coding results according to the interaction object identifier; arranging the multimodal coding results corresponding to the same interaction object in chronological order to form a multimodal sequence; calculating, on each feature dimension, the average, maximum, minimum, and dispersion values of the feature values for each modality coding result in the multimodal sequence; combining the average, maximum, minimum, and dispersion values as the distribution parameters of the corresponding modality coding result; normalizing the speech emotion coding results, facial expression semantic coding results, and behavioral state coding results based on the distribution parameters; and, based on the normalized coding results, calculating the difference between feature values of different modalities and performing numerical correction processing on the feature values corresponding to the difference according to preset adjustment rules, so as to align the coding results of different modalities within the same numerical range, resulting in the distribution-level alignment result. The preset adjustment rules include scaling and offset correction processing on the values of each normalized modal coding result in the corresponding feature dimension, based on the degree of difference between the distribution parameters corresponding to different modal coding results, so that the numerical range and variation amplitude of each modal coding result in the corresponding feature dimension remain consistent.
The distribution-level alignment results are processed using sample-level alignment units to achieve cross-modal sample-level alignment.
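To make the distribution-level alignment step concrete, the sketch below computes the per-dimension mean, maximum, minimum, and dispersion for each modality and then shifts and scales every dimension to a common range. It is an assumption-laden illustration: the function names are hypothetical, population standard deviation stands in for the "dispersion value", and zero-mean, unit-dispersion scaling stands in for the preset adjustment rules, which the text does not specify in detail.

```python
from statistics import mean, pstdev

def distribution_params(sequence):
    """Per-dimension distribution parameters for one modality.

    `sequence` is a list of equal-length feature vectors in chronological order.
    """
    dims = list(zip(*sequence))
    return [
        {"mean": mean(d), "max": max(d), "min": min(d), "disp": pstdev(d)}
        for d in dims
    ]

def normalize(sequence, params, eps=1e-8):
    """Offset and scale each dimension to zero mean and unit dispersion."""
    return [
        [(x - p["mean"]) / (p["disp"] + eps) for x, p in zip(vec, params)]
        for vec in sequence
    ]

# Two modalities whose raw values live on very different scales.
speech = [[0.2, 10.0], [0.4, 30.0], [0.6, 20.0]]
expression = [[100.0, 1.0], [300.0, 3.0], [200.0, 2.0]]
aligned = {
    name: normalize(seq, distribution_params(seq))
    for name, seq in {"speech": speech, "expression": expression}.items()
}
```

After this step, values from both modalities fall in the same numerical range, so later cross-modal comparisons are not dominated by whichever modality has the larger raw scale.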
The cross-modal sample-level alignment process includes: dividing the distribution-level alignment results into samples based on the interaction object identifier and a unified time reference; constructing a group of cross-modal sample units from the speech emotion coding results, facial expression semantic coding results, and behavioral state coding results corresponding to the same interaction object at the same time position; performing one-to-one matching processing on the different modal coding results within each cross-modal sample unit; aligning and correcting the time position of each modal coding result based on the matching relationship to ensure that each modal coding result within the same cross-modal sample unit has a consistent time index; and performing feature concatenation processing on each aligned cross-modal sample unit to obtain the sample-level alignment result.
Fine-grained cross-modal alignment processing is performed on speech time window units, facial expression local region units, and behavioral action segment units in the sample-level alignment results using fine-grained cross-modal alignment units to obtain fine-grained cross-modal alignment results. The fine-grained cross-modal alignment processing includes: finely dividing the speech time window units, facial expression local region units, and behavioral action segment units in the sample-level alignment results according to a unified time reference; constructing a candidate alignment unit set from the speech time window units, facial expression local region units, and behavioral action segment units corresponding to the same time position; interpolating, along the same feature dimension, the feature vectors corresponding to each speech time window unit, facial expression local region unit, and behavioral action segment unit in the candidate alignment unit set; applying absolute-value processing and weighted summation to the interpolation results, and using the weighted summation result as the correlation strength value between the corresponding units; matching and filtering the units in the candidate alignment unit set based on the correlation strength value, and retaining the combinations of units whose correlation strength value is greater than a preset threshold as the target alignment units; performing position correction processing on the speech time window units, facial expression local region units, and behavioral action segment units in the target alignment units to ensure that the associated positions of the units are consistent within the corresponding sample units; and performing feature concatenation processing on the position-corrected target alignment units to obtain the fine-grained cross-modal alignment results.
A joint embedding process is performed on the fine-grained cross-modal alignment results, scene context features, and user attribute features to obtain the emotion fusion representation result.
The joint embedding process includes: aligning and organizing the fine-grained cross-modal alignment results, scene context features, and user attribute features according to the interaction object identifier and a unified time reference; constructing a joint input unit from the fine-grained cross-modal alignment results, scene context features, and user attribute features corresponding to the same interaction object at the same time position; performing feature expansion processing on the fine-grained cross-modal alignment results in the joint input unit, and performing feature expansion processing on the scene context features and user attribute features to ensure that the three types of features are consistent in feature dimensions; concatenating the dimension-aligned fine-grained cross-modal alignment results, scene context features, and user attribute features to form a fused feature vector; weighting the feature values of each part in the fused feature vector based on a preset weighting rule; and performing nonlinear mapping processing on the weighted fused feature vector to obtain the emotion fusion representation result. The emotion fusion representation result is input into the structured judgment output layer to perform emotion category judgment, emotion intensity judgment, and emotion semantic state judgment, and output the emotion judgment result, which includes emotion category identifier, emotion intensity identifier, and emotion semantic state identifier. 
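The joint embedding steps (dimension matching, concatenation, weighting, and nonlinear mapping) can be illustrated with a short sketch. The choices here are assumptions made for illustration only: zero-padding is used as the "feature expansion", one weight per feature group stands in for the preset weighting rule, and `tanh` is used as the nonlinear mapping. The function names are hypothetical.

```python
import math

def expand(vec, dim):
    """Zero-pad a feature vector to `dim` entries (one possible feature expansion)."""
    return list(vec) + [0.0] * (dim - len(vec))

def joint_embed(aligned, scene, user, part_weights=(1.0, 0.5, 0.5)):
    """Fuse alignment, scene context, and user attribute features into one vector."""
    dim = max(len(aligned), len(scene), len(user))
    parts = [expand(p, dim) for p in (aligned, scene, user)]
    # Concatenate the three dimension-matched parts, weighting each part.
    fused = [w * x for w, part in zip(part_weights, parts) for x in part]
    # Nonlinear mapping of the weighted fused vector.
    return [math.tanh(x) for x in fused]

fusion = joint_embed([1.0, 2.0, 3.0], [0.5], [0.2, 0.4])
```

A trained model would typically replace the fixed weights and `tanh` with learned projection layers; the sketch only mirrors the order of operations in the text.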
The emotion category determination process includes: using the category determination unit in the structured determination output layer, multiplying each feature value in the emotion fusion representation result with the preset category determination parameters of the corresponding emotion category one by one, summing the product results, and using the summation result as the category score value of the corresponding emotion category; performing normalization processing on the category score values to convert each category score value into a category probability value; sorting and comparing each category probability value, selecting the emotion category corresponding to the category probability with the largest value as the target category, and outputting the corresponding emotion category label; The emotion intensity determination process includes: inputting the emotion fusion representation result into the intensity determination unit in the structured determination output layer; performing feature weighting calculation on the emotion fusion representation result according to the preset intensity determination parameters to obtain the emotion intensity value; comparing the emotion intensity value with the preset intensity threshold interval one by one to determine the interval range in which the emotion intensity value is located; and outputting the corresponding emotion intensity label according to the interval range. 
The emotion semantic state determination includes: using the semantic state determination unit in the structured determination output layer, multiplying each feature value in the emotion fusion representation result with the preset semantic state determination parameter corresponding to each emotion semantic state in a dimension-wise manner, summing the product results, and using the summation result as the semantic score value of the corresponding emotion semantic state; performing threshold filtering on the semantic score values, selecting semantic states that are greater than the preset semantic threshold as valid semantic states; combining and encoding the selected valid semantic states, and outputting the corresponding emotion semantic state identifier.
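
The three determinations made by the structured determination output layer can be sketched in a few lines of Python. This is a minimal illustration, assuming softmax as the normalization that converts category scores into probabilities and half-open intervals for the intensity thresholds; the function names, labels, and parameter values are hypothetical.

```python
import math

def dot(features, params):
    """Multiply feature values by parameters one by one and sum the products."""
    return sum(f * w for f, w in zip(features, params))

def category_label(features, class_params):
    """Score each category, normalize with softmax, and pick the most probable."""
    scores = {c: dot(features, ws) for c, ws in class_params.items()}
    top = max(scores.values())
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    probs = {c: e / total for c, e in exps.items()}
    return max(probs, key=probs.get), probs

def intensity_label(features, params, intervals):
    """Map the weighted intensity value to the interval [low, high) containing it."""
    value = dot(features, params)
    for label, low, high in intervals:
        if low <= value < high:
            return label
    return None

def semantic_states(features, state_params, threshold):
    """Keep the semantic states whose score is greater than the threshold."""
    return sorted(s for s, ws in state_params.items() if dot(features, ws) > threshold)

features = [0.9, 0.1, 0.4]
label, probs = category_label(features, {"sad": [1.0, 0.0, 0.5], "calm": [0.0, 1.0, 0.0]})
level = intensity_label(
    features, [1.0, 1.0, 1.0],
    [("low", 0.0, 0.5), ("medium", 0.5, 1.0), ("high", 1.0, float("inf"))],
)
states = semantic_states(
    features,
    {"lonely": [1.0, 0.0, 0.0], "tired": [0.0, 0.0, 1.0], "agitated": [0.0, 1.0, 0.0]},
    threshold=0.3,
)
```

Note that the category output is a single label, while the semantic state output may contain several states at once, which matches the "combining and encoding" step in the text.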

[0023] In this embodiment, the decision-making construction module operates as follows. Based on boundary processing of the voice output capability, visual display capability, action execution capability, and channel parallel scheduling capability of the companion device, the ranges of voice output, visual presentation, action feedback, and channel coordination are determined and aggregated to obtain the device capability constraints. Based on the scene context features, the environment adaptation conditions, time period adaptation conditions, location adaptation conditions, and object adaptation conditions corresponding to the current interaction state are determined and organized to obtain the interaction scene constraints. Based on the user attribute features, the age adaptation conditions, relationship adaptation conditions, preference adaptation conditions, habit adaptation conditions, and sensitive response adaptation conditions corresponding to the current interaction object are determined and organized to obtain the user attribute constraints. The emotion type information corresponding to the emotion category identifier, the strength level information corresponding to the emotion intensity identifier, and the semantic tendency information corresponding to the emotion semantic state identifier in the emotion judgment result are jointly organized to obtain the basic information for emotional decision-making. Constraint satisfaction values are calculated for the device capability constraints, interaction scene constraints, and user attribute constraints, and are then weighted and aggregated to obtain a comprehensive constraint satisfaction value.
Specifically, the device capability constraint satisfaction value is the ratio of the number of items on which the interaction content corresponding to the basic information for emotional decision-making matches the voice output range, visual presentation range, action feedback range, and channel coordination range to the total number of constraints. The interaction scene constraint satisfaction value is the ratio of the number of items on which the current interaction state matches the environment adaptation conditions, time period adaptation conditions, location adaptation conditions, and object adaptation conditions to the total number of constraints. The user attribute constraint satisfaction value is the ratio of the number of items on which the attributes of the current interaction object match the age adaptation conditions, relationship adaptation conditions, preference adaptation conditions, habit adaptation conditions, and sensitive response adaptation conditions to the total number of constraints. Candidate units for emotion interaction decisions are constructed by associating the basic information for emotional decision-making with the device capability constraints, interaction scene constraints, and user attribute constraints. Each candidate unit is matched with its corresponding comprehensive constraint satisfaction value, which is compared against a preset constraint threshold; candidate units that meet the threshold are retained, thus obtaining the set of emotion interaction decision inputs.
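
The ratio-based satisfaction values and their weighted aggregation can be shown with a small worked example. The weights, the threshold of 0.7, and the match flags below are hypothetical values chosen for illustration; the patent only states that the weights and threshold are preset.

```python
def satisfaction(matches):
    """Ratio of matching items to the total number of constraint items."""
    return sum(matches) / len(matches) if matches else 0.0

def comprehensive_satisfaction(device, scene, user, weights=(0.4, 0.3, 0.3)):
    """Weighted aggregate of the three constraint satisfaction values."""
    values = (satisfaction(device), satisfaction(scene), satisfaction(user))
    return sum(w * v for w, v in zip(weights, values))

candidate = {
    # voice output, visual presentation, action feedback, channel coordination
    "device": [True, True, True, False],
    # environment, time period, location, object
    "scene": [True, True, False, True],
    # age, relationship, preference, habit, sensitive response
    "user": [True, True, True, True, False],
}
score = comprehensive_satisfaction(candidate["device"], candidate["scene"], candidate["user"])
retained = score >= 0.7  # compare against the preset constraint threshold
```

Here the three ratios are 0.75, 0.75, and 0.8, giving a comprehensive value of 0.765, so this candidate unit would be retained.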

[0024] In this embodiment, the result generation module includes: Based on the set of emotion-based interactive decision inputs, the behavior mapping and sorting process is performed on each candidate interactive behavior in the preset set of candidate interactive behaviors to determine the candidate voice interaction content, candidate visual presentation content, candidate action feedback content, and candidate channel scheduling relationship corresponding to each candidate interactive behavior. Based on the candidate voice interaction content, candidate visual presentation content, candidate action feedback content, and candidate channel scheduling relationship for each candidate interactive behavior, the emotion matching degree, device adaptation degree, scene adaptation degree, and attribute adaptation degree are calculated respectively. A weighted aggregation process is then performed to obtain the comprehensive behavior score for each candidate interactive behavior. Specifically, the emotion matching degree is determined by the proportion of matching items between the candidate interactive behavior and the emotion category identifier, emotion intensity identifier, and emotion semantic state identifier to the total number of emotion judgment items. The device adaptation degree is determined by the proportion of matching items between the candidate voice interaction content, candidate visual presentation content, candidate action feedback content, and candidate channel scheduling relationship for the candidate interactive behavior and the device capability constraints to the total number of device constraints. The scene adaptation degree is determined by the proportion of matching items between the candidate interactive behavior and the interaction scene constraints to the total number of scene constraints. 
The attribute adaptation degree is determined by the proportion of matching items between the candidate interactive behavior and the user attribute constraints to the total number of attribute constraints. The comprehensive behavior score corresponding to each candidate interaction behavior is compared with the preset behavior retention threshold. Candidate interaction behaviors that reach the preset behavior retention threshold are retained, and candidate interaction behaviors that do not reach the preset behavior retention threshold are removed, thus obtaining the initial set of candidate interaction behaviors. Based on each candidate interactive behavior in the initial screening candidate interactive behavior set, cross-behavior compatibility judgment processing is performed to obtain a candidate interactive behavior combination set. The cross-behavior compatibility judgment processing includes determining content conflict relationships, channel conflict relationships, and temporal conflict relationships, and performing combination and sorting processing on candidate interactive behaviors that do not have content conflict relationships, channel conflict relationships, and temporal conflict relationships with each other. Based on each candidate interaction behavior combination in the candidate interaction behavior combination set, a combination matching score processing is performed to obtain the corresponding combination score value. The combination matching score processing includes determining the combination emotion consistency value, the combination constraint satisfaction value, and the combination execution feasibility value, and then weighting and summarizing them according to the preset combination weights. 
Among them, the combination emotion consistency value is determined according to the proportion of the number of matching items between each candidate interaction behavior in the candidate interaction behavior combination and the emotion interaction decision input set to the total number of combination emotion judgment items. The combination constraint satisfaction value is determined according to the proportion of the number of constraint items of the candidate interaction behavior combination that satisfy device capability constraints, interaction scenario constraints, and user attribute constraints to the total number of combination constraint items. The combination execution feasibility value is determined according to the proportion of the number of executable items of each candidate interaction behavior in the candidate interaction behavior combination on the voice output channel, visual display channel, and action execution channel to the total number of combination execution items. The candidate interactive behavior combination with the highest combined score is taken as the target interactive behavior combination. The target interactive behavior combination is then processed according to the channel scheduling relationship candidates, emotion expression priority value, and interaction response timeliness value corresponding to each candidate interactive behavior in the target interactive behavior combination. The target execution order is obtained by arranging the execution order. The emotion expression priority value is determined according to the completeness of each candidate interactive behavior's response to the emotion category identifier, emotion intensity identifier, and emotion semantic state identifier. 
The interaction response timeliness value is determined according to the combination relationship of the execution duration value, channel occupation duration value, and interaction start waiting duration value corresponding to each candidate interactive behavior. Based on the target interaction behavior combination and the target execution order, the corresponding aggregation processing is performed on the voice interaction content candidates, visual presentation content candidates, action feedback content candidates, and channel scheduling relationship candidates in the target interaction behavior combination to obtain the target interaction result, which includes voice interaction content, visual presentation content, action feedback content, and channel scheduling relationship.
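
The screening, compatibility check, and combination scoring described in this embodiment can be sketched as follows. This is a rough illustration under stated assumptions: conflicts are given as a precomputed set of behavior pairs, the combination emotion consistency value is approximated as the share of the three emotion judgment items covered by at least one behavior, and the other two values are simple averages. The behavior names, weights, and thresholds are all hypothetical.

```python
from itertools import combinations

EMOTION_ITEMS = 3  # category, intensity, semantic state

def conflict_free(combo, conflicts):
    """True if no pair of behaviors in the combination is marked as conflicting."""
    return all(frozenset(pair) not in conflicts for pair in combinations(combo, 2))

def combo_score(combo, metrics, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of emotion consistency, constraint satisfaction, and feasibility."""
    covered = set().union(*(metrics[b]["covers"] for b in combo))
    consistency = len(covered) / EMOTION_ITEMS
    constraint = sum(metrics[b]["constraint"] for b in combo) / len(combo)
    feasible = sum(metrics[b]["feasible"] for b in combo) / len(combo)
    return sum(w * v for w, v in zip(weights, (consistency, constraint, feasible)))

def select_target(behavior_scores, metrics, conflicts, retain_threshold=0.6):
    """Screen behaviors by score, then return the best conflict-free combination."""
    kept = sorted(b for b, s in behavior_scores.items() if s >= retain_threshold)
    combos = [
        c for size in range(1, len(kept) + 1)
        for c in combinations(kept, size)
        if conflict_free(c, conflicts)
    ]
    return max(combos, key=lambda c: combo_score(c, metrics)) if combos else ()

behavior_scores = {"soothing_speech": 0.9, "warm_light": 0.8, "loud_music": 0.4, "nod": 0.7}
metrics = {
    "soothing_speech": {"covers": {"category", "semantic"}, "constraint": 0.9, "feasible": 1.0},
    "warm_light": {"covers": {"intensity"}, "constraint": 0.9, "feasible": 1.0},
    "nod": {"covers": {"category"}, "constraint": 0.8, "feasible": 0.9},
}
conflicts = {frozenset({"warm_light", "nod"})}
target = select_target(behavior_scores, metrics, conflicts)
```

Exhaustive enumeration is only practical for small candidate sets; the ordering of the selected behaviors by emotion expression priority and response timeliness is not covered by this sketch.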

[0025] In this embodiment, the instruction generation module includes: Extract the voice interaction content, visual presentation content, and action feedback content from the target interaction results, and match them with the execution capabilities of the voice output channel, visual display channel, and action execution channel in the companion device to obtain voice interaction command items, visual presentation command items, and action feedback command items. Obtaining voice interaction command items includes: extracting the voice text content, voice emotion expression mode, voice output duration, and voice output start time from the voice interaction content; organizing the voice text content into voice broadcast data and matching it with voice broadcast capabilities; converting the voice emotion expression mode into voice timbre parameters and matching it with voice timbre capabilities; configuring the voice output duration as a playback duration control parameter and matching it with voice duration carrying capacity; configuring the voice output start time as a startup timing parameter and matching it with voice timing execution capabilities; and then integrating the voice broadcast data, voice timbre parameters, playback duration control parameters, and startup timing parameters to obtain voice interaction command items. 
Obtaining the visual presentation instruction item includes: reading the image display content, interface presentation content, visual dynamic effects, and visual display duration from the visual presentation content; forming screen display data based on the image display content and adapting it to the screen display capabilities; forming interface rendering parameters based on the interface presentation content and adapting them to the interface rendering capabilities; forming dynamic presentation control parameters based on the visual dynamic effects and adapting them to the dynamic presentation capabilities; forming display timing parameters based on the visual display duration and adapting them to the display timing execution capabilities; and then merging the screen display data, interface rendering parameters, dynamic presentation control parameters, and display timing parameters to obtain the visual presentation instruction item. Obtaining the motion feedback instruction item includes: acquiring the motion type, motion amplitude, motion duration, and motion start time from the motion feedback content; determining the motion driving parameters according to the motion type and matching them with the motion driving capability; determining the amplitude control parameters according to the motion amplitude and matching them with the motion amplitude control capability; determining the continuous execution parameters according to the motion duration and matching them with the motion continuous execution capability; determining the motion timing parameters according to the motion start time and matching them with the motion timing execution capability; and then arranging the motion driving parameters, amplitude control parameters, continuous execution parameters, and motion timing parameters to obtain the motion feedback instruction item. 
Based on the channel scheduling relationship in the target interaction result, the start order, parallel output relationship, duration connection relationship, and conflict avoidance relationship among the voice output channel, visual display channel, and action execution channel are determined to obtain the channel collaborative scheduling result. Based on the channel collaborative scheduling result, the voice interaction command items, visual presentation command items, and action feedback command items are allocated to channels: the voice interaction command items are written into the command queue corresponding to the voice output channel, the visual presentation command items are written into the command queue corresponding to the visual display channel, and the action feedback command items are written into the command queue corresponding to the action execution channel. The execution sequence of the commands in each command queue is adjusted according to the start order, parallel output relationship, duration connection relationship, and conflict avoidance relationship to obtain a multi-channel execution sequence. Based on the multi-channel execution sequence, the voice interaction command items are converted into voice output channel control commands, the visual presentation command items are converted into visual display channel control commands, and the action feedback command items are converted into action execution channel control commands. The control commands of each channel are then sorted and integrated to generate a multi-channel collaborative interaction command set.
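
The per-channel command queues and the merged execution sequence can be sketched with standard Python containers. This is a simplified illustration: commands are plain dictionaries with hypothetical `channel`, `start`, and `name` fields, ordering uses start time only, and conflict avoidance and duration connection are not modeled.

```python
from collections import deque

CHANNELS = ("voice", "visual", "action")

def build_queues(commands):
    """Write each command into its channel's queue, ordered by start time."""
    queues = {ch: deque() for ch in CHANNELS}
    for cmd in sorted(commands, key=lambda c: c["start"]):
        queues[cmd["channel"]].append(cmd)
    return queues

def execution_sequence(queues):
    """Merge the per-channel queues into one sequence ordered by start time.

    Commands that share a start time are intended to run in parallel on
    different channels; they are listed in channel order here.
    """
    merged = [cmd for q in queues.values() for cmd in q]
    return sorted(merged, key=lambda c: (c["start"], CHANNELS.index(c["channel"])))

commands = [
    {"channel": "action", "start": 0.5, "name": "lean_forward"},
    {"channel": "voice", "start": 0.0, "name": "greeting"},
    {"channel": "visual", "start": 0.0, "name": "smile_face"},
    {"channel": "voice", "start": 2.0, "name": "question"},
]
queues = build_queues(commands)
sequence = [c["name"] for c in execution_sequence(queues)]
```

In this example the greeting and the smiling face start together, the leaning motion follows half a second later, and the spoken question waits in the voice queue until the greeting's slot has passed.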

[0026] In this embodiment, the interactive operations include the voice output channel performing voice broadcast operations, the visual display channel performing interface presentation and image display operations, and the action execution channel performing action feedback operations.

[0027] Example 1: To verify the feasibility of this invention in practice, it was applied to elderly companionship devices in a home care scenario in a Shanghai community. The devices were deployed in the daily living environments of elderly people living alone to identify their emotional states and adjust responses during daily interactions. In such use, existing technologies rely mainly on voice recognition for emotion assessment. When a user's tone is calm but their expression is downcast or their movements are slow, the device often still gives a standard response and fails to identify potential negative emotions. The resulting interaction is simplistic and untargeted, and does not meet the emotional care needs that companionship devices are intended to serve.

[0028] In this scenario, the device collects user voice data, facial expression image data, and body behavior data, and combines this with environmental state information and user historical attribute information to form multimodal input data. After the data enters the system, time alignment and anomaly handling are first performed to establish correlations between various data types under a unified time benchmark. Subsequently, through multimodal feature extraction and cross-modal correlation modeling, the system comprehensively analyzes intonation changes in speech, local area changes in facial expressions, and movement rhythms in behavior to obtain emotion judgment results that include emotion category, emotion intensity, and semantic state. Based on this, the system combines device capabilities, current environmental conditions, and user preferences to filter and combine candidate interactive behaviors, and generates interactive instructions that coordinate voice broadcasting, visual presentation, and action feedback according to channel scheduling relationships, ensuring consistency and coherence in the device's output process.

[0029] During actual operation, the device records and analyzes user interactions at different times over a multi-day usage period. Records include the number of emotion recognition triggers, interaction response duration, synchronization between voice and visual output, and action feedback triggers. Compared to the operation records of devices using a single voice recognition method, it can be observed that this invention can identify more potential emotional change scenarios within the same time period and maintain stable multi-channel collaborative output performance in complex environments. Under different room environments, lighting conditions, and interaction times, the system maintains consistent emotion recognition logic and interaction response processes, demonstrating good adaptability and stability, thereby effectively improving the emotion perception capability and interaction quality of companion devices in human-computer interaction.

[0030] Table 1. Performance Comparison of the Invention and Traditional Multimodal Emotion Interaction Methods

[0031] As can be clearly seen from Table 1, the method of the present invention is superior to the traditional method in many indicators.

[0032] In terms of emotion recognition capabilities, this invention achieves an accuracy rate of 91.2%, a significant improvement over the traditional method's 86.4%. The accuracy rate for emotion intensity determination increases from 82.7% to 87.1%, and the accuracy rate for emotion semantic state determination increases from 79.8% to 84.0%. This improvement stems from the invention's introduction of a multimodal original feature set at the feature level, and the use of an improved MGCMA model for cross-modal association modeling. Simultaneously, fine-grained cross-modal alignment processing establishes a precise correspondence between speech time window units, facial expression local region units, and behavioral action segment units. This avoids the information loss problems caused by single-modality or coarse-grained fusion in traditional methods, thus enabling more complete and stable judgment results in complex emotional scenarios.

[0033] From the perspective of interactive execution performance, the interactive response latency of this invention is 376ms, significantly better than the 428ms of the traditional method; the multi-channel synchronization deviation is reduced from 143ms to 96ms. This improvement is mainly due to the fact that after the target interactive result is generated, this invention uniformly arranges the voice output channel, visual display channel, and action execution channel through channel scheduling relationships, ensuring consistency in the start-up order, parallel output relationship, and duration connection relationship of each channel. In the traditional method, each channel is usually triggered independently, which easily leads to latency superposition and execution misalignment. However, this invention controls the multi-channel execution sequence through unified control, effectively reducing the waiting time between channels and the problem of output asynchrony, thus resulting in a faster overall response speed and more coordinated output.

[0034] From the perspective of interaction performance metrics, the false trigger rate of this invention decreased from 6.8% to 4.9%, the interaction completion rate increased from 88.1% to 92.3%, the number of user interruptions decreased from 5.6 times / day to 4.1 times / day, and the number of effective interactions increased from 23.4 times / day to 26.1 times / day. These changes indicate that by introducing device capability constraints, interaction scenario constraints, and user attribute constraints in the emotion interaction decision-making stage, this invention can more accurately filter and combine candidate interaction behaviors, enabling the output content to achieve optimal performance in terms of emotion matching, environment adaptation, and user adaptation. This reduces unreasonable triggering, improves interaction continuity and user acceptance, and ultimately achieves an overall improvement in interaction quality.

[0035] The above are merely preferred embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitutions or modifications made by those skilled in the art within the scope of the technology disclosed in the present invention, based on the technical solution and inventive concept of the present invention, should be covered within the scope of protection of the present invention.

Claims

1. A multimodal emotion interaction system suitable for companion devices, characterized in that it comprises:
a data acquisition module, used to collect multimodal raw data in the interaction scenarios of a companion device and preprocess it to generate a standardized multimodal input dataset;
a feature construction module, used to extract multimodal features from the standardized multimodal input dataset and establish the correspondence between the features of each modality to obtain an original multimodal feature set;
an emotion determination module, used to perform cross-modal correlation modeling and emotion determination processing on the original multimodal feature set through an improved MGCMA model, and output an emotion determination result;
a decision-making construction module, used to determine emotion interaction decision constraints and to associate and combine the emotion determination result with those constraints to obtain a set of emotion interaction decision inputs;
a result generation module, used to determine a target interaction result from a preset set of candidate interaction behaviors based on the set of emotion interaction decision inputs;
an instruction generation module, used to map the target interaction result to the voice output channel, visual display channel, and action execution channel of the companion device, and generate a set of multi-channel collaborative interaction instructions based on the channel scheduling relationship; and
an instruction execution module, used to send the set of multi-channel collaborative interaction instructions to the companion device and drive the device to perform the corresponding interactive operations.

2. The multimodal emotion interaction system suitable for companion devices according to claim 1, characterized in that the multimodal raw data includes user voice data, facial expression image data, body behavior data, interaction scene data, and user attribute data; and the preprocessing includes aligning the multimodal raw data based on timestamps, removing abnormal segments, completing missing segments, unifying the dimensions of each modality, and associating and organizing the multimodal raw data according to the interaction object identifier and the collection time sequence.
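
The preprocessing steps named in claim 2 could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: each sample is reduced to a single scalar already scaled to the range 0 to 1 (so dimension unification is taken as given), abnormal segments are defined as out-of-range values, and missing segments are completed by forward-filling. The `Sample` class, the `preprocess` function, and the 100 ms grid are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    object_id: str            # interaction object identifier
    modality: str             # e.g. "voice", "face", "behavior"
    timestamp_ms: int         # collection time
    value: Optional[float]    # None marks a missing segment


def preprocess(samples, step_ms=100, valid_range=(0.0, 1.0)):
    """Align timestamps to a grid, drop abnormal values, fill gaps, and sort."""
    lo, hi = valid_range
    cleaned = []
    for s in samples:
        # Timestamp alignment: snap to the nearest grid point (integer math).
        ts = (s.timestamp_ms + step_ms // 2) // step_ms * step_ms
        # Abnormal-segment removal: out-of-range values are treated as missing.
        ok = s.value is not None and lo <= s.value <= hi
        cleaned.append(Sample(s.object_id, s.modality, ts, s.value if ok else None))

    # Missing-segment completion: forward-fill within each (object, modality) stream.
    cleaned.sort(key=lambda s: (s.object_id, s.modality, s.timestamp_ms))
    last = {}
    for s in cleaned:
        key = (s.object_id, s.modality)
        if s.value is None:
            s.value = last.get(key)
        else:
            last[key] = s.value

    # Organize by interaction object identifier and collection time sequence;
    # samples with nothing to fill from are dropped.
    filled = [s for s in cleaned if s.value is not None]
    filled.sort(key=lambda s: (s.object_id, s.timestamp_ms, s.modality))
    return filled
```

Forward-filling is only one possible completion strategy; interpolation or model-based imputation would fit the same slot.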

3. The multimodal emotion interaction system suitable for companion devices according to claim 1, characterized in that the feature construction module is configured to:
perform modality separation and structured segmentation on the standardized multimodal input dataset to generate a speech time window sequence, a facial expression image frame sequence, a behavioral action segment sequence, a scene state segment sequence, and an attribute record sequence;
perform speech information parsing processing on each speech time window in the speech time window sequence to obtain speech emotion features;
perform facial expression semantic parsing processing on each facial expression image frame in the facial expression image frame sequence to obtain facial expression semantic features;
perform behavioral state parsing on each behavioral action segment in the behavioral action segment sequence to obtain behavioral state features;
perform scene content recognition processing on the scene state information in the scene state segment sequence to obtain scene context features;
perform attribute content processing on the information of each attribute record in the attribute record sequence to obtain user attribute features; and
based on a unified time reference and the interaction object identifier, match the time positions corresponding to the speech emotion features, facial expression semantic features, behavioral state features, scene context features, and user attribute features, group feature items whose time interval is less than a preset time threshold and whose interaction object identifiers are consistent into the same association group, and establish the correspondence between the features to obtain the original multimodal feature set.
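
The final grouping step of claim 3 (same interaction object, time interval below a preset threshold) could be sketched as below. The claim does not say which pair of items the interval is measured between; this sketch assumes it is measured from the first item of each group. `FeatureItem`, `build_association_groups`, and the 200 ms default are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureItem:
    object_id: str   # interaction object identifier
    modality: str    # e.g. "speech", "face", "behavior", "scene", "attribute"
    time_ms: int     # position on the unified time reference


def build_association_groups(items, threshold_ms=200):
    """Group items that share an object id and lie within threshold_ms of the group start."""
    groups = []
    for item in sorted(items, key=lambda f: (f.object_id, f.time_ms)):
        current = groups[-1] if groups else None
        same_group = (
            current is not None
            and current[0].object_id == item.object_id
            and item.time_ms - current[0].time_ms < threshold_ms
        )
        if same_group:
            current.append(item)
        else:
            groups.append([item])
    return groups
```

Each returned list is one association group; the correspondence between modalities is then simply co-membership in a group.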

4. The multimodal emotion interaction system suitable for companion devices according to claim 1, characterized in that the emotion determination module is configured to:
input the original multimodal feature set into the multi-branch input encoding layer of the improved MGCMA model, and perform encoding processing through the speech encoding branch, expression encoding branch, and behavior encoding branch to obtain a speech emotion encoding result, an expression semantic encoding result, and a behavioral state encoding result, respectively;
perform cross-modal distribution-level alignment processing on the speech emotion encoding result, expression semantic encoding result, and behavioral state encoding result through a distribution-level alignment unit to obtain a distribution-level alignment result;
perform cross-modal sample-level alignment processing on the distribution-level alignment result through a sample-level alignment unit to obtain a sample-level alignment result;
perform fine-grained cross-modal alignment on the speech time window units, facial expression local region units, and behavioral action segment units in the sample-level alignment result through a fine-grained cross-modal alignment unit to obtain a fine-grained cross-modal alignment result;
perform joint embedding processing on the fine-grained cross-modal alignment result, the scene context features, and the user attribute features to obtain an emotion fusion representation result; and
input the emotion fusion representation result into the structured judgment output layer to perform emotion category judgment, emotion intensity judgment, and emotion semantic state judgment, and output the emotion determination result, which includes an emotion category identifier, an emotion intensity identifier, and an emotion semantic state identifier.

5. The multimodal emotion interaction system suitable for companion devices according to claim 1, characterized in that the decision-making construction module is configured to:
perform boundary processing on the voice output capability, visual display capability, action execution capability, and channel parallel scheduling capability of the companion device to determine the ranges of voice output, visual presentation, action feedback, and channel coordination, and aggregate these ranges to obtain device capability constraints;
based on the scene context features, determine the environment adaptation conditions, time period adaptation conditions, location adaptation conditions, and object adaptation conditions corresponding to the current interaction state, and organize these adaptation conditions to obtain interaction scenario constraints;
based on the user attribute features, determine the age adaptation conditions, relationship adaptation conditions, preference adaptation conditions, habit adaptation conditions, and sensitive response adaptation conditions corresponding to the current interaction object, and organize these adaptation conditions to obtain user attribute constraints;
jointly organize the emotion type information corresponding to the emotion category identifier, the intensity level information corresponding to the emotion intensity identifier, and the semantic tendency information corresponding to the emotion semantic state identifier in the emotion determination result to obtain basic emotion decision information;
calculate a constraint satisfaction level value for each of the device capability constraints, interaction scenario constraints, and user attribute constraints, and perform weighted summation to obtain a comprehensive constraint satisfaction value; and
associate the basic emotion decision information with the device capability constraints, interaction scenario constraints, and user attribute constraints to construct emotion interaction decision candidate units, match each candidate unit with its corresponding comprehensive constraint satisfaction value, compare that value with a preset constraint threshold, and retain the candidate units that meet the preset constraint threshold to obtain the set of emotion interaction decision inputs.
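
The weighted summation and threshold comparison in claim 5 could look roughly like the sketch below. The weights, the threshold, the dictionary layout of a candidate unit, and the function names are all assumptions made for illustration; the claim does not specify how the individual constraint satisfaction level values are computed, so they are taken here as given inputs in the range 0 to 1.

```python
def comprehensive_constraint_value(levels, weights):
    """Weighted sum of the per-constraint satisfaction level values."""
    return sum(weights[name] * levels[name] for name in weights)


def filter_candidates(candidates, weights, threshold):
    """Keep candidate units whose comprehensive value reaches the preset threshold."""
    kept = []
    for unit in candidates:
        value = comprehensive_constraint_value(unit["levels"], weights)
        if value >= threshold:
            # Attach the value so later stages can reuse it.
            kept.append({**unit, "constraint_value": value})
    return kept


if __name__ == "__main__":
    # Hypothetical weights over the three constraint families named in claim 5.
    weights = {"device": 0.4, "scene": 0.3, "user": 0.3}
    candidates = [
        {"id": "A", "levels": {"device": 0.9, "scene": 0.8, "user": 0.7}},
        {"id": "B", "levels": {"device": 0.5, "scene": 0.4, "user": 0.3}},
    ]
    for unit in filter_candidates(candidates, weights, threshold=0.6):
        print(unit["id"], round(unit["constraint_value"], 2))
```

With these example numbers only candidate A survives: its weighted sum is 0.81, while candidate B scores 0.41.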

6. The multimodal emotion interaction system suitable for companion devices according to claim 1, characterized in that the result generation module is configured to:
based on the set of emotion interaction decision inputs, perform behavior mapping and sorting processing on each candidate interaction behavior in the preset set of candidate interaction behaviors to determine the candidate voice interaction content, candidate visual presentation content, candidate action feedback content, and candidate channel scheduling relationship corresponding to each candidate interaction behavior;
based on the candidate voice interaction content, candidate visual presentation content, candidate action feedback content, and candidate channel scheduling relationship corresponding to each candidate interaction behavior, calculate an emotion matching degree value, a device adaptation degree value, a scene adaptation degree value, and an attribute adaptation degree value, and perform weighted summation to obtain the comprehensive behavior score corresponding to each candidate interaction behavior;
compare the comprehensive behavior score corresponding to each candidate interaction behavior with a preset behavior retention threshold, retain the candidate interaction behaviors that reach the threshold, and remove those that do not, to obtain an initially screened candidate interaction behavior set;
perform cross-behavior compatibility judgment processing on the candidate interaction behaviors in the initially screened candidate interaction behavior set to obtain a candidate interaction behavior combination set;
perform matching scoring processing on each candidate interaction behavior combination in the candidate interaction behavior combination set to obtain the corresponding combination score;
take the candidate interaction behavior combination with the highest combination score as the target interaction behavior combination;
based on the candidate channel scheduling relationship, the emotion expression priority value, and the interaction response timeliness value corresponding to each candidate interaction behavior in the target interaction behavior combination, perform execution order processing to obtain a target execution order; and
based on the target interaction behavior combination and the target execution order, aggregate the candidate voice interaction content, candidate visual presentation content, candidate action feedback content, and candidate channel scheduling relationship in the target interaction behavior combination to obtain the target interaction result.
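
The scoring, screening, combination, and ordering steps of claim 6 could be sketched as follows. Several choices here are assumptions, not taken from the claim: the weights in `WEIGHTS`, compatibility expressed as a list of forbidden pairs, the combination score taken as the plain sum of member scores, and ordering by priority first and timeliness second. The exhaustive enumeration is only practical for small candidate sets.

```python
from itertools import combinations

# Assumed weights for the four degree values; the claim does not fix them.
WEIGHTS = {"emotion": 0.4, "device": 0.2, "scene": 0.2, "attribute": 0.2}


def behavior_score(behavior):
    """Comprehensive behavior score: weighted sum of the four degree values."""
    return sum(w * behavior[key] for key, w in WEIGHTS.items())


def select_target(behaviors, retain_threshold, incompatible_pairs):
    """Return the best compatible combination, sorted into execution order."""
    # Initial screening against the behavior retention threshold.
    kept = [b for b in behaviors if behavior_score(b) >= retain_threshold]

    best, best_score = [], float("-inf")
    for size in range(1, len(kept) + 1):
        for combo in combinations(kept, size):
            names = {b["name"] for b in combo}
            # Cross-behavior compatibility: skip combos holding a forbidden pair.
            if any(pair <= names for pair in incompatible_pairs):
                continue
            # Placeholder combination score: sum of member scores (assumption).
            score = sum(behavior_score(b) for b in combo)
            if score > best_score:
                best, best_score = list(combo), score

    # Execution order: higher emotion-expression priority first, then timeliness.
    return sorted(best, key=lambda b: (-b["priority"], -b["timeliness"]))
```

Because the placeholder score is additive, larger compatible combinations tend to win; a real matching score would likely also penalize redundant or overlong responses.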

7. The multimodal emotion interaction system suitable for companion devices according to claim 1, characterized in that the instruction generation module is configured to:
extract the voice interaction content, visual presentation content, and action feedback content from the target interaction result, and match them with the execution capabilities of the voice output channel, visual display channel, and action execution channel of the companion device to obtain voice interaction command items, visual presentation command items, and action feedback command items;
based on the channel scheduling relationship in the target interaction result, determine the start order, parallel output relationship, duration connection relationship, and conflict avoidance relationship among the voice output channel, visual display channel, and action execution channel to obtain a channel collaborative scheduling result;
based on the channel collaborative scheduling result, perform channel allocation processing on the voice interaction command items, visual presentation command items, and action feedback command items, and adjust the execution sequence of the commands in each channel's command queue according to the start order, parallel output relationship, duration connection relationship, and conflict avoidance relationship to obtain a multi-channel execution sequence; and
based on the multi-channel execution sequence, convert the voice interaction command items into voice output channel control commands, the visual presentation command items into visual display channel control commands, and the action feedback command items into action execution channel control commands, and then sort and integrate the control commands of each channel to generate the set of multi-channel collaborative interaction instructions.
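
One way to picture the channel collaborative scheduling in claim 7 is a small scheduler that assigns a start time to each command. The sketch below is an assumption-laden illustration: the `Command` fields `starts_with` (parallel output relationship) and `starts_after` (duration connection relationship) are hypothetical, conflict avoidance is reduced to "one command at a time per channel", and commands must be listed in start order so that every referenced command has already been scheduled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Command:
    name: str
    channel: str                        # "voice", "visual", or "action"
    duration_ms: int
    starts_with: Optional[str] = None   # start together with this command
    starts_after: Optional[str] = None  # start when this command has ended


def schedule(commands):
    """Return (start_ms, channel, name) tuples sorted by start time."""
    start, end, channel_free = {}, {}, {}
    for c in commands:
        t = 0
        if c.starts_with is not None:
            t = start[c.starts_with]
        if c.starts_after is not None:
            t = max(t, end[c.starts_after])
        # Conflict avoidance: wait until the channel has finished its last command.
        t = max(t, channel_free.get(c.channel, 0))
        start[c.name] = t
        end[c.name] = t + c.duration_ms
        channel_free[c.channel] = end[c.name]
    return sorted((start[c.name], c.channel, c.name) for c in commands)
```

The sorted output plays the role of the multi-channel execution sequence; converting each entry into a channel-specific control command would be a device-dependent final step.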

8. The multimodal emotion interaction system suitable for companion devices according to claim 1, characterized in that the interactive operations include the voice output channel performing voice broadcast operations, the visual display channel performing interface presentation and image display operations, and the action execution channel performing action feedback operations.