Multi-modal dialogue sentiment recognition method based on identity-aware network
By explicitly modeling speaker identity changes using an identity-aware network, the problems of speaker emotion interference and insufficient cross-modal information mining in multimodal dialogue emotion recognition are solved, achieving more accurate and stable emotion recognition results.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- GUANGXI NORMAL UNIV
- Filing Date
- 2026-03-10
- Publication Date
- 2026-06-19
AI Technical Summary
Existing multimodal dialogue emotion recognition methods struggle to effectively distinguish between the continuation of the same speaker's emotions and the transfer of emotions between different speakers in multi-speaker interaction environments. Furthermore, their ability to mine cross-modal emotion information is limited, leading to discontinuous or confusing emotion evolution processes and affecting the stability and robustness of the model.
We adopt an identity-aware network-based approach, which explicitly models speaker identity changes through multimodal feature unified mapping, identity and context-aware network modules, cross-modal attention fusion networks, and label-guided identity transfer-assisted modeling. The model is optimized by combining the loss from the main emotion recognition task and the loss from identity transfer assistance.
It improves the accuracy and stability of multimodal dialogue emotion recognition, enhances the robustness of the model in complex dialogue scenarios, and improves the recognition accuracy and weighted F1 score.
Figure CN122241349A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of artificial intelligence technology, specifically to a multimodal dialogue emotion recognition method based on an identity-aware network.
Background Technology
[0002] With the rapid development of artificial intelligence, human-computer interaction, and affective computing technologies, multimodal dialogue emotion recognition technology has been widely applied in scenarios such as intelligent customer service, emotional companionship systems, and social behavior analysis. This technology typically utilizes multiple modalities, including text, speech, and vision, to automatically identify the emotional state of each utterance in a dialogue. However, in real-world dialogue scenarios, especially in multi-speaker interactive environments, the frequent switching of speaker identities and poor continuity of emotional expression make the emotional evolution process complex and variable, posing a significant challenge to multimodal dialogue emotion recognition.
[0003] Existing multimodal dialogue emotion recognition methods can be broadly categorized into two types: one is based on temporal modeling, which typically utilizes recurrent neural networks or attention mechanisms to model the dialogue context and capture the emotional dependencies between utterances; the other is based on graph structures, which construct graphs or hypergraphs containing utterance nodes and speaker nodes to model the contextual and speaker relationships in the dialogue. Some methods further introduce multimodal feature fusion strategies, integrating text, speech, and visual information into a unified representation space to improve the accuracy of emotion recognition.
[0004] While the aforementioned methods have made some progress in multimodal dialogue emotion recognition tasks, significant shortcomings remain. First, most existing methods employ a single-path, dialogue-level context modeling approach, in which the emotional states of different speakers can easily interfere with each other during information transmission. This is particularly problematic in dialogue scenarios where speakers switch frequently, leading to discontinuities or confusion in the emotion evolution process. Second, existing methods typically treat speaker identity as a static attribute or implicit feature, lacking explicit modeling of, and constraints on, the process of speaker identity change between utterances. This makes it difficult to distinguish between the continuation of emotion from the same speaker and the transfer of emotion between different speakers. Furthermore, multimodal emotional information differs in temporal granularity and expression. Existing cross-modal modeling methods have limited ability to collaboratively mine global emotional context and local emotional cues, which affects the stability and robustness of the model in complex dialogue scenarios.
Summary of the Invention
[0005] The purpose of this invention is to provide a multimodal dialogue emotion recognition method based on identity-aware networks. By jointly modeling multimodal semantic information, dialogue-level contextual dependencies, speaker individual emotional dynamics, and inter-utterance identity transfer relationships, it achieves refined modeling and accurate recognition of emotional states in dialogue.
[0006] To achieve the above objectives, this invention provides a multimodal dialogue emotion recognition method based on an identity-aware network, comprising the following steps:
[0007] Step 1: Unify the mapping of multimodal features extracted from the pre-trained model;
[0008] Step 2: Construct an identity and context-aware network module using recurrent networks and residual structures;
[0009] Step 3: Construct a cross-modal attention fusion network using multi-head attention;
[0010] Step 4: Introduce a label-guided identity transfer auxiliary modeling module to explicitly model speaker switching between utterances;
[0011] Step 5: Jointly optimize model training using the main sentiment recognition task loss and identity transfer auxiliary loss;
[0012] Step 6: Use the trained model to perform multimodal information fusion and sentiment classification.
[0013] Optionally, the expression for the multimodal feature unified mapping module in step 1 is as follows:
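[0014] $$\hat{X}_m = \mathrm{FC}(X_m)$$
[0015] where $m \in \{t, a, v\}$ indexes the text, acoustic, and visual modalities, $\hat{X}_m$ is the output of the feature unified mapping module for modality $m$, $\mathrm{FC}(\cdot)$ denotes a fully connected layer, and $X_m$ is the input of modality $m$.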
[0016] Optionally, the identity and context-aware network module includes a context-aware network, an identity-aware network, and a cross-residual network. The core component of the context-aware network is a context recursive encoder, which is used to model the temporal dependencies between utterances in a dialogue sequence. The core component of the identity-aware network is a speaker-level recursive encoder, which is used to characterize the emotional evolution process of different speakers. The cross-residual network is a cross-residual fusion network, which is used to fuse multi-source features.
[0017] Optionally, the cross-modal attention fusion network includes a global-local cross-modal attention fusion module and a multimodal information fusion network. The global-local cross-modal attention fusion module is responsible for realizing hierarchical interaction and fusion between the main modality and multiple auxiliary modalities. The multimodal information fusion network is used to perform the final fusion processing on the modal information output by the global-local cross-modal attention fusion module.
[0018] In step 3, self-attention modeling is first performed on the main modality features and each auxiliary modality feature to obtain the modality's internal context-enhanced representation. Then, the self-attention-enhanced main modality features are used as query terms and cross-attention calculation is performed with each auxiliary modality feature to achieve local cross-modal information fusion. During the execution process, a global attention network is introduced to model the auxiliary modality features as a whole, extract global multimodal association information, and cross-attention calculation is performed again under the guidance of global information to generate the global modality fusion feature.
[0019] Optionally, the execution process of step 4 includes the following steps:
[0020] Step 4.1: Use two cross-modal attention fusion networks with shared parameters to compute four sets of feature representations;
[0021] Step 4.2: Based on the four sets of feature representations, construct utterance-pair-level feature representations;
[0022] Step 4.3: Input the constructed utterance-pair-level feature representations into the identity transfer discrimination network to compute and output the speaker identity transfer probability distribution;
[0023] Step 4.4: Construct identity transfer supervision labels and calculate identity transfer auxiliary loss.
[0024] Optionally, the joint expression for the main emotion recognition task loss and the identity transfer auxiliary loss in step 5 is as follows:
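[0025] $$\mathcal{L} = \mathcal{L}_{emo} + \lambda_t \mathcal{L}_{it}^{t} + \lambda_a \mathcal{L}_{it}^{a} + \lambda_v \mathcal{L}_{it}^{v} + \lambda_f \mathcal{L}_{it}^{f}$$
[0026] where $\mathcal{L}_{emo}$ is the main emotion recognition task loss, $\mathcal{L}_{it}^{t}$, $\mathcal{L}_{it}^{a}$, and $\mathcal{L}_{it}^{v}$ are the identity transfer auxiliary losses based on the text, acoustic, and visual modalities, $\mathcal{L}_{it}^{f}$ is the joint-modality identity transfer auxiliary loss, and $\lambda_t$, $\lambda_a$, $\lambda_v$, and $\lambda_f$ are hyperparameters, all set to 1.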
[0027] This invention provides a multimodal dialogue emotion recognition method based on an identity-aware network. It maps multimodal features such as text, speech, and vision into the same space, then jointly models the dialogue context and speaker identity information through recurrent networks. Features are enhanced through multi-path residual fusion, and a global-local cross-modal attention network achieves hierarchical fusion of the multimodal features. Furthermore, label-guided identity transfer auxiliary modeling is introduced, and the model is optimized jointly with the main task loss and the auxiliary losses. Finally, the trained model is used for multimodal information fusion and emotion classification. By explicitly modeling speaker identity changes, the invention mitigates interference between the emotions of multiple speakers and fully exploits the complementarity of multimodal emotional cues. Verification shows that the invention outperforms existing state-of-the-art methods in both recognition accuracy and weighted F1 score on public datasets, significantly improving model robustness and generalization ability.
Attached Figure Description
[0028] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0029] Figure 1 is a flowchart of a multimodal dialogue emotion recognition method based on an identity-aware network according to the present invention.
[0030] Figure 2 is a schematic diagram of the overall model framework of a multimodal dialogue emotion recognition method based on an identity-aware network according to the present invention.
[0031] Figure 3 is a schematic diagram of the structure of the identity and context-aware network module based on recurrent networks of the present invention.
[0032] Figure 4 is a schematic diagram of the cross-modal attention fusion network structure of the present invention.
[0033] Figure 5 is a schematic diagram of the structure of the cross-modal global-local emotion context collaborative enhancement module of the present invention.
[0034] Figure 6 is a schematic diagram of the label-guided identity transfer auxiliary supervision module of the present invention.
[0035] Figure 7 is a schematic diagram comparing the emotion prediction results of the multimodal dialogue emotion recognition method based on an identity-aware network in an embodiment of the present invention with those of different ablation models.
Detailed Implementation
[0036] Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and intended to explain the present invention, and should not be construed as limiting the present invention.
[0037] Referring to Figure 1, this invention provides a multimodal dialogue emotion recognition method based on an identity-aware network, comprising the following steps:
[0038] S1: Unified mapping of multimodal features extracted from pre-trained models;
[0039] S2: Construct an identity and context-aware network module using recurrent networks and residual structures;
[0040] S3: Construct a cross-modal attention fusion network using multi-head attention;
[0041] S4: Introduce a label-guided identity transfer auxiliary modeling module to explicitly model speaker switching between utterances;
[0042] S5: Use the main sentiment recognition task loss and identity transfer auxiliary loss to jointly optimize model training;
[0043] S6: Use the trained model to perform multimodal information fusion and sentiment classification.
[0044] The following provides further explanation in conjunction with the specific implementation steps:
[0045] The overall model framework of the multimodal dialogue emotion recognition method based on identity-aware networks described in this invention is shown in Figure 2. The specific execution steps are as follows:
[0046] Step S1: Unified mapping of multimodal features
[0047] Since the features of different modalities are extracted by pre-trained models, their feature dimensions differ. This invention utilizes the feature mapping capability of fully connected networks to perform dimension unification processing on the features of each modality, so that the features of different modalities are mapped to the same dimensional space.
[0048] The expression for the multimodal feature unified mapping module is as follows:
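[0049] $$\hat{X}_m = \mathrm{FC}(X_m) \quad (1)$$
[0050] where $m \in \{t, a, v\}$ indexes the text, acoustic, and visual modalities, $\hat{X}_m$ is the output of the feature unified mapping module for modality $m$, $\mathrm{FC}(\cdot)$ denotes a fully connected layer, and $X_m$ is the input of modality $m$.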
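The unified mapping of step S1 can be illustrated with a minimal sketch in plain Python. The raw feature sizes, the shared dimension, the random initialization, and the helper names `make_linear`, `linear`, and `unify` are all assumptions made for illustration; the source does not specify them.

```python
import random

def make_linear(in_dim, out_dim, seed):
    """Return (weights, bias) for a fully connected layer with small random weights."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias

def linear(x, layer):
    """Apply y = W x + b to one feature vector x."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

# Hypothetical raw feature sizes per modality; the source does not state them.
RAW_DIMS = {"text": 6, "audio": 4, "visual": 5}
SHARED_DIM = 3  # assumed common dimension

# One fully connected layer per modality, each projecting into the shared space.
layers = {m: make_linear(d, SHARED_DIM, seed=i) for i, (m, d) in enumerate(RAW_DIMS.items())}

def unify(utterance):
    """Map each modality's feature vector of one utterance into the shared space."""
    return {m: linear(x, layers[m]) for m, x in utterance.items()}

utterance = {m: [1.0] * d for m, d in RAW_DIMS.items()}
mapped = unify(utterance)
```

After the call, every modality in `mapped` has a vector of length `SHARED_DIM`, which is what allows the later modules to combine modalities element-wise.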
[0051] Step S2: Construct an Identity and Context Aware Network Module (MSRM) based on a recursive network.
[0052] Figure 3 shows the structure of the identity and context-aware network module based on recurrent networks in this invention. It mainly consists of three parts: (i) a context-aware network, (ii) an identity-aware network, and (iii) a cross-residual network for coupling context information, speaker identity information, and original information.
[0053] The core component of the context-aware network is a context recursive encoder, which is used to model the temporal dependencies between utterances in a dialogue sequence. The core component of the identity-aware network is a speaker-level recursive encoder, which is used to characterize the emotional evolution process of different speakers. The cross-residual network is a cross-residual fusion network, which is used to fuse multi-source features.
[0054] In this invention, the specific execution process of the context recursive encoder includes the following steps:
[0055] In practice, the multimodal utterance features, after unified mapping, are first input into the recurrent neural network according to the original chronological order of the dialogue. At each time step, the recurrent network receives the current utterance features and, combined with the hidden states of preceding utterances, encodes the contextual information of the current moment, thereby gradually accumulating dialogue-level contextual semantic information. In this way, the encoder can capture the overall trend of emotional state evolution over time during the dialogue, generating a feature representation containing contextual information for each utterance.
[0056] The expression for the context recursive encoder is shown in equation (2):
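[0057] $$c_m^{i} = \mathrm{RNN}_{c}\left(\hat{x}_m^{i},\, c_m^{i-1}\right) \quad (2)$$
[0058] where $\hat{x}_m^{i}$ is the input feature of the $i$-th utterance of modality $m$ after unified mapping, $c_m^{i}$ is the context-aware hidden state output of the $i$-th utterance for modality $m$, and $\mathrm{RNN}_{c}$ is the context recurrent network.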
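The recurrence described for the context recursive encoder can be sketched as follows. The source says only that a recurrent neural network is used, so the Elman-style tanh cell, the zero initial state, and the hand-picked weights below are assumptions, not the patented configuration.

```python
import math

def rnn_step(x, h_prev, w_in, w_rec):
    """One Elman-style step: h = tanh(W_in x + W_rec h_prev)."""
    return [
        math.tanh(
            sum(wi * xi for wi, xi in zip(w_in[k], x))
            + sum(wr * hi for wr, hi in zip(w_rec[k], h_prev))
        )
        for k in range(len(h_prev))
    ]

def encode_context(features, w_in, w_rec):
    """Run the recurrence over utterances in dialogue order; return one state per utterance."""
    hidden = [0.0] * len(w_rec)  # assumed zero initial state
    states = []
    for x in features:
        hidden = rnn_step(x, hidden, w_in, w_rec)
        states.append(hidden)
    return states

# Toy 2-dimensional example with hypothetical weights.
W_IN = [[0.5, 0.0], [0.0, 0.5]]
W_REC = [[0.1, 0.0], [0.0, 0.1]]
dialogue = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context_states = encode_context(dialogue, W_IN, W_REC)
```

The second utterance has a zero in its first input coordinate, yet the first coordinate of its state is non-zero: that value is carried over from the preceding utterance, which is the accumulation of dialogue-level context described above.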
[0059] The speaker-level recursive encoder is specifically executed through the following steps:
[0060] In practice, the original dialogue sequence is first divided into multiple speaker subsequences based on the speaker's identity. Each subsequence contains only the utterance features of the same speaker. Then, each speaker subsequence is input into a corresponding recurrent neural network for modeling, enabling the network to capture the continuous changes in the emotional state of a single speaker during the dialogue. Since the time steps of the speaker subsequences are not consistent with the original dialogue sequence, after encoding, a time step remapping operation is performed to realign the hidden states obtained in the speaker subsequences to their corresponding utterance positions in the original dialogue sequence, thereby obtaining a speaker-level emotional representation consistent with the original time sequence.
[0061] The expression for the speaker-level recursive encoder is shown in equations (3) and (4):
[0062] $$p_{m,s}^{j} = \mathrm{RNN}_{p}\left(\hat{x}_{m,s}^{j},\, p_{m,s}^{j-1}\right) \quad (3)$$
[0063] $$p_m^{i} = \mathrm{Remap}\left(p_{m,s}^{j}\right) \quad (4)$$
[0064] where $j$ is the time step within the subsequence of speaker $s$, $p_{m,s}^{j}$ is the speaker-aware hidden state output of the $j$-th utterance of speaker $s$ in modality $m$, $\mathrm{RNN}_{p}$ is the identity-aware recurrent network, $\hat{x}_{m,s}^{j}$ is the subsequence input feature of speaker $s$ in modality $m$, and $\mathrm{Remap}(\cdot)$ is the time-step remapping operation, which maps the subsequence feature $p_{m,s}^{j}$ back to its corresponding time-step position $i$ in the original dialogue sequence to yield the identity-aware output $p_m^{i}$.
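The split, encode, and remap procedure of the speaker-level encoder can be sketched in a few lines. To keep the example readable, a running mean stands in for the recurrent network and the features are scalars; both are assumptions, as are the function names.

```python
from collections import defaultdict

def split_by_speaker(speakers):
    """Group utterance positions by speaker, preserving dialogue order."""
    positions = defaultdict(list)
    for index, speaker in enumerate(speakers):
        positions[speaker].append(index)
    return dict(positions)

def running_mean(sequence):
    """Stand-in for a recurrent encoder: the state at step j is the mean of inputs 0..j."""
    states, total = [], 0.0
    for step, value in enumerate(sequence, start=1):
        total += value
        states.append(total / step)
    return states

def speaker_level_encode(features, speakers):
    """Encode each speaker's subsequence, then remap states to the original positions."""
    remapped = [None] * len(features)
    for positions in split_by_speaker(speakers).values():
        subsequence = [features[i] for i in positions]
        states = running_mean(subsequence)
        for position, state in zip(positions, states):
            remapped[position] = state  # time-step remapping
    return remapped

speakers = ["A", "B", "A", "B", "A"]
features = [1.0, 10.0, 3.0, 20.0, 5.0]  # scalar features for readability
encoded = speaker_level_encode(features, speakers)
print(encoded)  # [1.0, 10.0, 2.0, 15.0, 3.0]
```

Each output position depends only on earlier utterances of the same speaker, so speaker B's large values never leak into speaker A's states, while the output stays aligned with the original dialogue order.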
[0065] The specific execution process of the cross-residual fusion network includes the following steps:
[0066] After obtaining the original utterance representation, the context recursive representation, and the speaker recursive representation, multiple residual fusion paths are constructed to jointly model this information.
[0067] Specifically, based on different feature combinations, multiple residual fusion paths are constructed, and the original utterance representation is added element-wise with the context recursive representation and the speaker recursive representation. The fusion results are then normalized to obtain multi-path intermediate residual fusion features.
[0068] Subsequently, the fusion features of each residual path are concatenated along the feature dimension, and fusion compensation features are generated through linear mapping and nonlinear activation operations.
[0069] Finally, the fusion compensation features are superimposed with the outputs of each residual path and the original utterance representation, and the superposition result is normalized to obtain the final output features of the multi-path identity perception recursive emotion modeling module.
[0070] The expression for the cross-residual fusion network is as follows:
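[0071] $$R_k = \mathrm{CRF}\left(\mathrm{Input}_1, \ldots, \mathrm{Input}_n\right), \quad k = 1, \ldots, K \quad (5)$$
[0072] $$F = \mathrm{FFN}\left(R_1 \oplus \cdots \oplus R_K\right) \quad (6)$$
[0073] $$O = \mathrm{RF}\left(F, R_1, \ldots, R_K, X\right) \quad (7)$$
[0074] where $\mathrm{Input}$ denotes an input, $n$ is the number of inputs, $\mathrm{CRF}$ is the cross-residual fusion, $R_k$ denotes the different residual output results, $K$ is the number of different residual combinations, $F$ is the output of the feedforward neural network $\mathrm{FFN}$, $X$ is the original input feature, $\mathrm{RF}$ is the residual fusion, $O$ is the residual output, and $\oplus$ denotes concatenation along the feature dimension.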
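A minimal sketch of the cross-residual fusion steps described above follows. The choice of three residual paths (original plus context, original plus speaker, and all three), the use of layer normalization, the ReLU activation, and the averaging weight matrix are assumptions; the source does not fix which feature combinations or which normalization are used.

```python
import math

def layer_norm(vector, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(vector) / len(vector)
    var = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(var + eps) for v in vector]

def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]

def cross_residual_fusion(x, context, speaker, weights):
    # Residual paths: element-wise addition followed by normalization.
    paths = [
        layer_norm(add(x, context)),
        layer_norm(add(x, speaker)),
        layer_norm(add(x, context, speaker)),
    ]
    # Concatenate path outputs along the feature dimension.
    concatenated = [v for path in paths for v in path]
    # Fusion compensation feature: linear mapping followed by ReLU.
    compensation = [max(0.0, sum(w * v for w, v in zip(row, concatenated))) for row in weights]
    # Superimpose compensation, path outputs, and the original input, then normalize.
    return layer_norm(add(compensation, x, *paths))

DIM = 4
# Hypothetical weights: each output dimension averages the matching dimension of the 3 paths.
WEIGHTS = [[1 / 3 if col % DIM == row else 0.0 for col in range(3 * DIM)] for row in range(DIM)]

x = [1.0, 2.0, 3.0, 4.0]
context = [0.5, 0.0, -0.5, 1.0]
speaker = [0.0, 1.0, 0.0, -1.0]
fused = cross_residual_fusion(x, context, speaker, WEIGHTS)
```

The output keeps the input dimension, so the module can be dropped between the recurrent encoders and the attention fusion network without any reshaping.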
[0075] Step S3: Construct a cross-modal attention fusion network
[0076] Figure 4 is a schematic diagram of the cross-modal attention fusion network structure provided by the present invention. The network mainly consists of two parts: (i) a global-local cross-modal attention fusion module, and (ii) a multimodal information fusion network.
[0077] In this invention, the global-local cross-modal attention fusion module (shown in Figure 5) achieves hierarchical interaction and fusion between the main modality and multiple auxiliary modalities. Specifically, self-attention modeling is first performed separately on the main modality feature $X_p$ and each auxiliary modality feature $X_{a_j}$ to obtain intra-modality context-enhanced representations. The self-attention-enhanced main modality feature is then used as the query in a cross-attention calculation with each auxiliary modality feature, achieving local cross-modal information fusion. On this basis, a global attention network models the auxiliary modality features as a whole to extract global multimodal association information, and cross-attention is calculated again under the guidance of this global information to generate the global modality fusion feature. The whole process can be expressed as the nested attention calculation in equation (8), which yields the main modality feature representation $\tilde{X}_p$ enhanced by auxiliary modality information; this representation is used for subsequent emotion modeling or classification.
[0078] The expression for the global-local cross-modal attention fusion module is as follows:
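[0079] $$\tilde{X}_p = \mathrm{CA}\Big(\big\{\mathrm{CA}\big(\mathrm{SA}(X_p),\, \mathrm{SA}(X_{a_j})\big)\big\}_{j=1}^{n},\ \mathrm{GA}\big(\mathrm{SA}(X_{a_1}), \ldots, \mathrm{SA}(X_{a_n})\big)\Big) \quad (8)$$
[0080] where $X_p$ is the main modality feature, $X_{a_j}$ is the $j$-th auxiliary modality feature, $n$ is the number of auxiliary modalities, $\mathrm{SA}$ is the self-attention network, $\mathrm{CA}$ is the cross-attention network, $\mathrm{GA}$ is the global attention network, and $\tilde{X}_p$ is the main modality feature enhanced by auxiliary modality information.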
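The nested local-then-global attention flow can be sketched with single-head scaled dot-product attention. This is an illustration under several assumptions: there are no learned projections or multiple heads, the local results are averaged across auxiliary modalities, and the global attention is modeled as self-attention over all auxiliary utterances pooled together. None of these choices is stated in the source.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention without learned projections."""
    scale = math.sqrt(len(keys[0]))
    output = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        output.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))])
    return output

def self_attend(sequence):
    return attend(sequence, sequence, sequence)

def global_local_fusion(main, auxiliaries):
    main_sa = self_attend(main)
    aux_sa = [self_attend(aux) for aux in auxiliaries]
    # Local fusion: the main modality queries each auxiliary modality separately.
    local = [attend(main_sa, aux, aux) for aux in aux_sa]
    # Average the local results across auxiliary modalities (assumed aggregation).
    local_mean = [[sum(vals) / len(local) for vals in zip(*rows)] for rows in zip(*local)]
    # Global context: self-attention over all auxiliary utterances pooled together.
    pooled = [vector for aux in aux_sa for vector in aux]
    global_context = self_attend(pooled)
    # Second cross-attention, guided by the global context.
    return attend(local_mean, global_context, global_context)

text = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]    # main modality, 3 utterances
audio = [[0.2, 0.8], [0.9, 0.1], [0.4, 0.6]]   # auxiliary modality 1
visual = [[1.0, 1.0], [0.0, 0.0], [0.3, 0.7]]  # auxiliary modality 2
enhanced = global_local_fusion(text, [audio, visual])
```

The result has one vector per main-modality utterance. Because every stage is a convex combination of auxiliary values, the toy output stays inside the range of the auxiliary features; a real implementation would add learned projections and residual connections.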
[0081] In one embodiment of the present invention, the multimodal information fusion network is used to perform final fusion processing on the modal information output by the global-local cross-modal attention fusion module.
[0082] The expression for the multimodal information fusion network is as follows:
[0083] $$Z = \mathrm{FC}\left(\tilde{X}_{p_1}, \ldots, \tilde{X}_{p_M}\right) \quad (9)$$
[0084] where $\mathrm{FC}$ is a fully connected network used for the final integration of multimodal features, $M$ is the number of main modalities, and $Z$ is the output of the entire cross-modal attention fusion network.
[0085] Step S4: Introduce the Label-Guided Identity Transfer Assisted Modeling (LSSM) module to explicitly model speaker switching between utterances.
[0086] In this invention, the label-guided identity transfer auxiliary modeling module (shown in Figure 6) operates as follows:
[0087] Specifically, two cross-modal attention fusion networks with shared parameters perform two forward computations on the same multimodal input features. With the model structure and parameters kept identical, this yields two feature outputs that differ in representation but agree in emotional semantics; together they form one set of fused-modality feature representations. Three further sets of single-modality feature representations are taken from the output of the global-local cross-modal attention fusion module, a sub-module of the two parameter-sharing cross-modal attention fusion networks. This gives four sets of feature representations in total.
[0088] Subsequently, the representation of each utterance is taken from each of these four sets. For any two utterances in the dialogue, their features $h_i$ and $h_j$ are concatenated along the feature dimension to form an utterance-pair-level feature representation $h_{ij}$. This representation is input into the identity transfer discrimination network $D$, which outputs the identity transfer probability distribution $\hat{y}_{ij}$ for the utterance pair. Identity transfer supervision labels $y_{ij}$ are constructed from the ground-truth speaker annotations of the utterances, and the identity transfer auxiliary loss $\mathcal{L}_{it}$ is calculated with the cross-entropy loss function.
[0089] The label-guided identity transfer auxiliary modeling module is expressed as follows:
[0090] $$h_{ij} = h_i \oplus h_j \quad (10)$$
[0091] $$\hat{y}_{ij} = D\left(h_{ij}\right) \quad (11)$$
[0092] $$y_{ij} = \begin{cases} 1, & s_i \neq s_j \\ 0, & s_i = s_j \end{cases} \quad (12)$$
[0093] $$\mathcal{L}_{it} = \mathrm{CE}\left(\hat{y}_{ij},\, y_{ij}\right) \quad (13)$$
[0094] where $h_i$ and $h_j$ are the feature representations of the $i$-th and $j$-th utterances, $\oplus$ denotes concatenation along the feature dimension, used to construct the joint utterance-pair representation $h_{ij}$, $D$ is the identity transfer discrimination network, $\hat{y}_{ij}$ is the predicted identity transfer probability distribution between the $i$-th and $j$-th utterances, $y_{ij}$ is the supervision label constructed from the ground-truth speaker annotations ($s_i$ and $s_j$ being the annotated speakers of the two utterances), $y_{ij} = 1$ indicates a change in speaker identity, $y_{ij} = 0$ indicates that the identity remains unchanged, and $\mathcal{L}_{it}$ is the identity transfer auxiliary loss, computed as the cross-entropy loss $\mathrm{CE}$.
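The pair construction, label construction, and cross-entropy loss of step S4 can be sketched as follows. The label convention (1 for a speaker change, 0 for no change), the averaging over pairs, and the uniform stand-in for the discrimination network's output are assumptions made for this illustration.

```python
import math
from itertools import combinations

def transfer_labels(speakers):
    """Label each utterance pair (i, j), i < j: 1 if the speaker differs, else 0 (assumed convention)."""
    return {
        (i, j): int(speakers[i] != speakers[j])
        for i, j in combinations(range(len(speakers)), 2)
    }

def pair_feature(features, i, j):
    """Concatenate two utterance representations along the feature dimension."""
    return features[i] + features[j]

def cross_entropy(probabilities, label):
    """Negative log-likelihood of the true class."""
    return -math.log(probabilities[label])

def auxiliary_loss(predictions, labels):
    """Mean cross-entropy over all labelled pairs (averaging is an assumption)."""
    return sum(cross_entropy(predictions[pair], y) for pair, y in labels.items()) / len(labels)

speakers = ["A", "B", "A"]
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
labels = transfer_labels(speakers)
print(labels)  # {(0, 1): 1, (0, 2): 0, (1, 2): 1}

# Hypothetical discriminator output: an uninformative 50/50 distribution per pair.
uniform = {pair: [0.5, 0.5] for pair in labels}
loss = auxiliary_loss(uniform, labels)  # equals ln 2 for uniform predictions
```

With uniform predictions the loss equals ln 2, the value of an uninformed binary classifier; training drives it lower only if the utterance-pair features actually encode who is speaking.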
[0095] Step S5: Jointly optimize model training using the main sentiment recognition task loss and identity transfer auxiliary loss.
[0096] During training, the identity transfer auxiliary loss and the main emotion recognition task loss are jointly optimized to constrain the model to explicitly model the speaker identity transfer relationship between utterances in the dialogue while learning emotion representation.
[0097] The joint expression for the main emotion recognition task loss and the identity transfer auxiliary loss is as follows:
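[0098] $$\mathcal{L} = \mathcal{L}_{emo} + \lambda_t \mathcal{L}_{it}^{t} + \lambda_a \mathcal{L}_{it}^{a} + \lambda_v \mathcal{L}_{it}^{v} + \lambda_f \mathcal{L}_{it}^{f}$$
[0099] where $\mathcal{L}_{emo}$ is the main emotion recognition task loss, $\mathcal{L}_{it}^{t}$, $\mathcal{L}_{it}^{a}$, and $\mathcal{L}_{it}^{v}$ are the identity transfer auxiliary losses based on the text, acoustic, and visual modalities, $\mathcal{L}_{it}^{f}$ is the joint-modality identity transfer auxiliary loss, and $\lambda_t$, $\lambda_a$, $\lambda_v$, and $\lambda_f$ are hyperparameters, all set to 1.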
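The joint objective is a weighted sum of the main loss and the four auxiliary losses, which the following sketch makes concrete. The function name, the dictionary keys, and the numeric loss values are hypothetical; only the default weight of 1 comes from the text.

```python
def joint_loss(emotion_loss, transfer_losses, weights=None):
    """Main emotion loss plus weighted identity transfer auxiliary losses."""
    if weights is None:
        weights = {name: 1.0 for name in transfer_losses}  # all weights set to 1
    return emotion_loss + sum(weights[name] * value for name, value in transfer_losses.items())

# Hypothetical loss values for one training batch.
aux = {"text": 0.3, "audio": 0.4, "visual": 0.5, "fused": 0.6}
total = joint_loss(1.2, aux)
```

With all weights at 1 the total is simply the sum of the five terms (here 3.0, up to floating-point rounding); passing a `weights` dictionary lets a single modality's auxiliary signal be scaled up or down.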
[0100] Step S6: Use the trained model to perform multimodal information fusion and sentiment classification.
[0101] Furthermore, to verify the feasibility of the proposed identity-aware network-based multimodal dialogue emotion recognition method, specific embodiments are provided. The method is trained and tested on the publicly available datasets MELD and IEMOCAP. MELD is the dataset proposed in the paper "MELD: A multimodal multi-party dataset for emotion recognition in conversations," and IEMOCAP is the dataset proposed in the paper "IEMOCAP: interactive emotional dyadic motion capture database." The experiments were run on an AMD EPYC 7642 48-Core Processor, 512GB of RAM, and an NVIDIA GeForce RTX 4090 graphics card. Two metrics widely used for multimodal dialogue emotion recognition were adopted: accuracy (ACC) and weighted F1 score (W-F1).
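For reference, the two evaluation metrics can be computed as in the sketch below. It uses the standard definitions (weighted F1 is the per-class F1 averaged with weights proportional to each class's number of true instances); the labels and predictions are made-up toy values, not results from the patent.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of utterances whose predicted emotion matches the true emotion."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = count - tp
        f1 = 2 * tp / (2 * tp + fp + fn)  # denominator > 0 because count >= 1
        total += count * f1
    return total / len(y_true)

# Toy labels for illustration only.
y_true = ["joy", "joy", "anger", "neutral"]
y_pred = ["joy", "anger", "anger", "neutral"]
acc = accuracy(y_true, y_pred)
wf1 = weighted_f1(y_true, y_pred)
```

In this toy case both metrics come out at 0.75. On imbalanced datasets such as MELD the two can diverge, which is why they are usually reported together.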
[0102] The identity-aware network-based multimodal dialogue emotion recognition method (IA-DEEM) described in this invention is compared with seven state-of-the-art methods: Speaker-Aware Cognitive network with Cross-Modal Attention (SACCMA), Novel Graph network based Multimodal Fusion Technique (GraphMFT), Masked Graph Learning with Recursive Alignment (MGLRA), Context-Aware Hierarchical Graph Fusion (CA-HGF), Speaker-centric multimodal fusion network (SCMFN), Hypergraph based Contextual Relationship Modeling Method (HyperCRM), and Identity and modality attributes driven multimodality fusion network (IMDNet). The results are shown in Table 1 (the best results are marked in bold). The method of this invention outperforms the compared multimodal dialogue emotion recognition methods.
[0103] Table 1: Comparison of quantitative experimental data on the IEMOCAP and MELD datasets
[0104]
[0105] To further verify the effectiveness of the method in real-world dialogue scenarios, typical dialogue segments from the IEMOCAP and MELD datasets were selected, and the emotion prediction results of the complete model and of different ablation models were compared and analyzed. The results are shown in Figure 7.
[0106] The comparison shows that when key modules are removed, the model tends to weaken emotions and confuse emotion categories in multi-turn dialogues; prediction stability drops markedly when speaker switching or implicit emotional expression is involved. With the complete method of this invention, the model tracks the evolution of emotion along the dialogue context more accurately and effectively distinguishes between semantically similar but emotionally intense utterances, giving more stable and reliable emotion recognition.
[0107] The above description discloses only one preferred embodiment of the present invention, and should not be construed as limiting the scope of the present invention. Those skilled in the art will understand that all or part of the processes of the above embodiments can be implemented, and equivalent changes made in accordance with the claims of the present invention are still within the scope of the invention.
Claims
1. A multimodal dialogue emotion recognition method based on an identity-aware network, characterized in that it comprises the following steps: Step 1: Unify the mapping of multimodal features extracted from the pre-trained model; Step 2: Construct an identity and context-aware network module using recurrent networks and residual structures; Step 3: Construct a cross-modal attention fusion network using multi-head attention; Step 4: Introduce a label-guided identity transfer auxiliary modeling module to explicitly model speaker switching between utterances; Step 5: Jointly optimize model training using the main sentiment recognition task loss and identity transfer auxiliary loss; Step 6: Use the trained model to perform multimodal information fusion and sentiment classification.
2. The multimodal dialogue emotion recognition method based on an identity-aware network as described in claim 1, characterized in that the expression for the multimodal feature unified mapping module in step 1 is as follows: $\hat{X}_m = \mathrm{FC}(X_m)$, where $m \in \{t, a, v\}$ indexes the text, acoustic, and visual modalities, $\hat{X}_m$ is the output of the feature unified mapping module for modality $m$, $\mathrm{FC}(\cdot)$ denotes a fully connected layer, and $X_m$ is the input of modality $m$.
3. The multimodal dialogue emotion recognition method based on an identity-aware network as described in claim 2, characterized in that the identity and context-aware network module includes a context-aware network, an identity-aware network, and a cross-residual network. The core component of the context-aware network is a context recursive encoder, which is used to model the temporal dependencies between utterances in a dialogue sequence. The core component of the identity-aware network is a speaker-level recursive encoder, which is used to characterize the emotional evolution process of different speakers. The cross-residual network is a cross-residual fusion network, which is used to fuse multi-source features.
4. The multimodal dialogue emotion recognition method based on an identity-aware network as described in claim 3, characterized in that the cross-modal attention fusion network includes a global-local cross-modal attention fusion module and a multimodal information fusion network. The global-local cross-modal attention fusion module is responsible for realizing hierarchical interaction and fusion between the main modality and multiple auxiliary modalities. The multimodal information fusion network is used to perform the final fusion processing on the modal information output by the global-local cross-modal attention fusion module. In step 3, self-attention modeling is first performed on the main modality features and each auxiliary modality feature to obtain intra-modality context-enhanced representations. Subsequently, the main modality feature enhanced by self-attention is used as the query, and cross-attention calculation is performed with each auxiliary modality feature to achieve local cross-modal information fusion. During execution, a global attention network is introduced to model the auxiliary modality features as a whole and extract global multimodal association information, and cross-attention calculation is performed again under the guidance of the global information to generate the main modality fusion feature with global modulation.
5. The multimodal dialogue emotion recognition method based on an identity-aware network as described in claim 4, characterized in that the execution process of step 4 includes the following steps: Step 4.1: Use two cross-modal attention fusion networks with shared parameters to compute four sets of feature representations; Step 4.2: Based on the four sets of feature representations, construct utterance-pair-level feature representations; Step 4.3: Input the constructed utterance-pair-level feature representations into the identity transfer discrimination network to compute and output the speaker identity transfer probability distribution; Step 4.4: Construct identity transfer supervision labels and calculate the identity transfer auxiliary loss.
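6. The multimodal dialogue emotion recognition method based on an identity-aware network as described in claim 5, characterized in that the joint expression for the main emotion recognition task loss and the identity transfer auxiliary loss in step 5 is as follows: $\mathcal{L} = \mathcal{L}_{emo} + \lambda_t \mathcal{L}_{it}^{t} + \lambda_a \mathcal{L}_{it}^{a} + \lambda_v \mathcal{L}_{it}^{v} + \lambda_f \mathcal{L}_{it}^{f}$, where $\mathcal{L}_{emo}$ is the main emotion recognition task loss, $\mathcal{L}_{it}^{t}$, $\mathcal{L}_{it}^{a}$, and $\mathcal{L}_{it}^{v}$ are the identity transfer auxiliary losses based on the text, acoustic, and visual modalities, $\mathcal{L}_{it}^{f}$ is the joint-modality identity transfer auxiliary loss, and $\lambda_t$, $\lambda_a$, $\lambda_v$, and $\lambda_f$ are hyperparameters, all set to 1.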