A multi-modal dialogue dynamic sentiment recognition method based on relationship subgraph interaction

The DEDNet model constructs emotional dependency subgraphs between speakers and within each speaker, and combines visual, auditory, and textual modal information. This addresses the difficulty of modeling emotional dependencies in diverse dialogue scenarios and improves the accuracy of emotion recognition.

CN118820844B · Active · Publication Date: 2026-06-12 · CHONGQING UNIV OF POSTS & TELECOMM

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: CHONGQING UNIV OF POSTS & TELECOMM
Filing Date: 2024-06-21
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

Existing technologies struggle to effectively model complex emotional dependencies in diverse dialogue scenarios, particularly the dynamic changes in emotional dependencies between speakers and within the speaker themselves, resulting in poor accuracy in emotion recognition.

Method used

A dynamic emotion recognition method for multimodal dialogue based on relational subgraph interaction is adopted. Multimodal dialogue data is input into the DEDNet model, which comprises a modality feature extractor, an audiovisual modality encoder, a relational subgraph interaction module, and an emotion classifier. Emotional dependency subgraphs between speakers and within each speaker are constructed, InterGAT and IntraGAT networks are used for progressive interactive learning, and emotion recognition is performed by combining visual, auditory, and textual modal information.

Benefits of technology

It improves the accuracy of emotion recognition in diverse dialogue scenarios, better captures and understands emotional dynamics, enhances the model's ability to recognize different relationships, and improves the accuracy and consistency of emotion recognition results.


Figure CN118820844B_ABST

Abstract

The application belongs to the field of multimodal emotion recognition and dialogue systems, and relates to a multimodal dialogue dynamic emotion recognition method based on relational subgraph interaction, comprising: obtaining a multimodal dialogue dataset; inputting data of the multimodal dialogue dataset into a modality feature extractor to extract the features of each modality, obtaining text modality features, auditory modality features, and visual modality features; inputting the auditory modality features and the visual modality features into an audiovisual modality encoder respectively, obtaining final auditory modality features and final visual modality features; inputting the text modality features into a relational subgraph interaction module, obtaining final text modality features; and inputting the final text modality features, auditory modality features, and visual modality features into an emotion classifier, obtaining emotion recognition results. The application models the dialogue as an inter-speaker emotional dependency subgraph and a speaker's own emotional dependency subgraph according to the emotional dependency relationships, so as to better capture and understand emotional dynamics in diverse dialogue scenarios.

Description

Technical Field

[0001] This invention belongs to the field of multimodal emotion recognition and dialogue systems, and relates to a multimodal dialogue dynamic emotion recognition method based on relational subgraph interaction.

Background Technology

[0002] Emotions are inherent in humans, guiding behavior and indicating underlying thought processes. Multimodal dialogue emotion recognition is significant for advancing human-computer interaction into a new era of affective computing. This task aims to determine the emotional state of each sentence in a continuous dialogue based on external information conveyed by two or more parties, including language, tone of voice, facial expressions, etc. Dialogue emotion recognition has been widely applied in fields such as medical diagnosis, opinion mining, and building empathy systems, and is gradually attracting more attention from researchers. In multimodal dialogue emotion recognition, dynamic emotion dependencies exist. These represent the changes in the interdependencies between emotional states during a dialogue, reflecting the propagation and interaction of emotions within the dialogue. Modeling these dependencies helps models better understand the interactions between emotional states and enhances the understanding of emotional dynamics.

[0003] A common scenario is the multi-party dialogue, in which multiple participants express and exchange emotions. In such scenarios, however, model performance often suffers and emotion recognition results are inaccurate. This is primarily due to the complexity of emotional dependencies in multi-party contexts: the emotional influence between different speakers and the emotional continuation within each speaker are intertwined, making them difficult for models to capture accurately. Unlike static emotion classification of individual statements, emotion in a conversation is a dynamic process that depends largely on context and is driven by both the speaker's own emotional influence and the emotional influence between speakers. By definition, inter-speaker emotional dependence is the process by which one person or group influences another person's or group's emotions or behaviors by consciously or unconsciously inducing emotional states and behavioral attitudes; a speaker's own emotional dependence can be defined as resistance to emotional change, formalized as the extent to which a person's current emotional state can be predicted from their previous emotional state. Both types of emotional dependence persist throughout the dialogue and guide emotional changes. Therefore, modeling dynamic emotional dependence is necessary to capture the constantly changing speaker states throughout the dialogue.

[0004] To achieve this goal, researchers have undertaken numerous studies. Dialogue modeling techniques can be broadly divided into three directions. The first is speaker embedding-based methods, which fuse speaker identity representations, as fixed utterance features, with multimodal features to obtain modality features carrying speaker emotional dependencies. The problem with this approach is that it captures speaker emotional dependencies statically and therefore fails to model dynamic emotional dependencies. The second is sequence-based methods, which dynamically generate or update speaker-specific information based on the dialogue history or historical speaker information. Recurrent Neural Networks (RNNs) or Long Short-Term Memory networks (LSTMs) are typically used to model the continuity and dynamics of emotions. A drawback of this approach is the difficulty of exchanging emotional information over long distances. The third is graph-based methods, which use graph structures to model the emotional dependencies in a dialogue. Graphs are a powerful data structure that compensates for the shortcomings of the previous two approaches. Representing a dialogue as a graph, with utterances as nodes and the emotional dependencies between them as edges, can express complex relationships and dependencies and enables dynamic interaction of speaker emotional information. The challenge of this approach lies in constructing a graph structure that accurately models the emotional dependencies in the dialogue.

[0005] Dynamic emotional dependence includes inter-speaker emotional dependence and a speaker's own emotional dependence. Most methods model both kinds of dependence together within a single graph structure. However, these are essentially two different mechanisms, and unifying them in a single graph can lead to confusion and poor emotion recognition performance.

Summary of the Invention

[0006] To address the aforementioned technical problems, this invention employs a multimodal dialogue dynamic emotion recognition method based on relational subgraph interaction, comprising: acquiring multimodal dialogue data, inputting the multimodal dialogue data into a trained DEDNet model, and obtaining emotion recognition results; wherein the DEDNet model includes: a modality feature extractor, an audiovisual modality encoder, a relational subgraph interaction module, and an emotion classifier; wherein, the DEDNet is a dynamic emotion dependency network;

[0007] The training process of the DEDNet model includes:

[0008] S1. Obtain a multimodal dialogue dataset. The multimodal dialogue dataset includes multiple multimodal dialogues. Each multimodal dialogue includes a multimodal utterance sequence and a speaker sequence. The modalities include: the text modality, the auditory modality, and the visual modality.

[0009] S2. Input the multimodal dialogue into the modality feature extractor to extract the features of each modality, obtaining text modality features, auditory modality features, and visual modality features;

[0010] S3. Input the auditory modality features and visual modality features into the audiovisual modality encoder respectively to obtain the final auditory modality features and visual modality features;

[0011] S4. Input the text modality features into the relation subgraph interaction module to obtain the final text modality features;

[0012] S5. Input the final text modality features, auditory modality features, and visual modality features into the emotion classifier to obtain the emotion recognition results;

[0013] S6. Calculate the loss function value based on the emotion recognition results, update the model parameters based on the loss function value, and complete the model training when the loss function value is minimized.

[0014] There is a mapping relationship $R = [u_i; s_j]$ between the utterance sequence and the speaker sequence, where $[u_i; s_j]$ indicates that the $i$-th utterance was spoken by the $j$-th speaker, $u_i$ is the $i$-th utterance, and $s_j$ is the $j$-th speaker.

[0015] The modal feature extractor includes: a RoBERTa Large model, an openSMILE model, a DenseNet model, and a linear layer. The modal feature extractor processes multimodal dialogue by: using the RoBERTa Large model to extract the initial text modal features of the multimodal dialogue, using the openSMILE model to extract the initial auditory modal features of the multimodal dialogue, using the DenseNet model to extract the initial visual modal features of the multimodal dialogue, and inputting the initial text modal features, initial auditory modal features, and initial visual modal features into the linear layer for dimensionality unification to obtain text modal features, auditory modal features, and visual modal features.

[0016] Both the visual modality encoder and the auditory modality encoder include: an embedding module, a Transformer encoder, and a GRU; the processing of visual modality features by the visual modality encoder includes:

[0017] S31. Embed the speaker sequence to obtain the speaker embedding vector;

[0018] S32. Perform position embedding on the utterance sequence to obtain the utterance position embedding vector;

[0019] S33. The speaker embedding vector and the utterance position embedding vector are fused with the visual modal features to obtain the fused features;

[0020] S34. Input the fused features into the Transformer encoder, and input the output of the Transformer encoder into the GRU to obtain the final visual modal features.

[0021] The relational subgraph interaction module includes: InterGAT (Inter-Speaker Emotional Dependency Graph Attention Network) and IntraGAT (Speaker-Specific Emotional Dependency Graph Attention Network). The processing of text modality features by the relational subgraph interaction module includes: constructing an inter-speaker emotional dependency subgraph and a speaker's own emotional dependency subgraph; inputting the text modality features and the inter-speaker emotional dependency subgraph into InterGAT to obtain updated text modality features; and then inputting the updated text modality features and the speaker's own emotional dependency subgraph into IntraGAT to obtain the final text modality features.

[0022] Constructing the inter-speaker emotional dependency subgraph includes:

[0023] Treat all utterances as nodes. If the utterances corresponding to node $v_i$ and node $v_j$ do not belong to the same speaker, construct two directed edges between node $v_i$ and node $v_j$; otherwise, node $v_i$ and node $v_j$ are not connected. If the utterance corresponding to node $v_i$ precedes the utterance corresponding to node $v_j$, set the type of the directed edge from node $v_i$ to node $v_j$ to inter-past and the type of the directed edge from node $v_j$ to node $v_i$ to inter-future, thereby obtaining the inter-speaker emotional dependency subgraph. Here, inter-past refers to inter-speaker past utterances and inter-future refers to inter-speaker future utterances.

[0024] Constructing the speaker's own emotional dependency subgraph includes:

[0025] Treat all utterances as nodes. If the utterances corresponding to node $v_i$ and node $v_j$ belong to the same speaker, construct two directed edges between node $v_i$ and node $v_j$; otherwise, node $v_i$ and node $v_j$ are not connected. If the utterance corresponding to node $v_i$ precedes the utterance corresponding to node $v_j$, set the type of the directed edge from node $v_i$ to node $v_j$ to intra-past and the type of the directed edge from node $v_j$ to node $v_i$ to intra-future, thereby obtaining the speaker's own emotional dependency subgraph. Here, intra-past refers to the speaker's own past utterances and intra-future refers to the speaker's own future utterances.

[0026] Inputting the text modality features and the inter-speaker emotional dependency subgraph into InterGAT includes:

[0027]

[0028] Where $r_{ij}$ is the type of the edge from node $v_j$ to node $v_i$ in the subgraph, $O(\cdot)$ denotes one-hot encoding, $R_{embedding}$ is a learnable embedding matrix, $d_r$ is the dimension of the edge-type embedding, $e_{ij}$ is the edge-type embedding vector, LeakyReLU and $\sigma$ are non-linear activation functions, the two node feature terms are the text modality features corresponding to node $v_i$ and node $v_j$ respectively, $W$ and $a$ are trainable parameters, $\alpha_{ij}$ is the weight of the edge from node $v_j$ to node $v_i$, $\|$ denotes the concatenation operation, $N_i$ is the set of neighbors of node $v_i$, and the output term is the updated text modality feature of node $v_i$.

[0029] The sentiment classifier processes the final text modality features, auditory modality features, and visual modality features in the following ways:

[0030] The final text modality, auditory modality, and visual modality features are processed using the softmax function to obtain the emotion recognition results for the text modality, auditory modality, and visual modality, respectively. The final text modality, auditory modality, and visual modality features are then fused to obtain multimodal fusion features. The multimodal fusion features are then processed using the softmax function to obtain the emotion recognition results for the multimodal fusion features.

[0031] The loss function is:

[0032]

[0033] Where $t$, $a$, and $v$ denote the text modality, auditory modality, and visual modality respectively, $m$ is the modality index, $\lambda_m$ is the weight of the loss function for modality $m$, and the remaining term is the loss function for the multimodal fusion features.

[0034] Beneficial effects:

[0035] 1. This invention models dialogues as two relational interaction subgraphs based on emotional dependence: an inter-speaker emotional dependency subgraph and a speaker's own emotional dependency subgraph, thereby better capturing and understanding emotional dynamics in diverse dialogue scenarios.

2. In the two relational interaction subgraphs, this invention sets learnable relational embeddings for the different types of edges to enhance the model's ability to recognize different relationships.

3. This invention implements a progressive interactive learning strategy to learn the emotional dependence present in dialogues hierarchically. First, the inter-speaker emotional dependency graph attention network InterGAT learns the emotional dependence between speakers based on the inter-speaker emotional dependency subgraph, capturing the transmission and interaction of emotions. Then, the speaker's own emotional dependency graph attention network IntraGAT learns the speaker's internal emotional dependence based on the speaker's own emotional dependency subgraph, capturing changes in individual emotional states. This hierarchical learning method accounts for the dynamics of emotions and lets the model understand the emotional complexity of a dialogue at both the global and the individual level, improving the accuracy of emotion recognition results.

4. To enrich the input features of dialogue emotion recognition, the model of this invention relies not only on textual modal information but also incorporates visual and auditory modal information, improving the accuracy of emotion recognition results.

5. This invention uses an auxiliary modality fusion loss strategy that combines the unimodal loss functions of the three modalities with the multimodal loss function, reducing heterogeneity between modalities and improving the emotional consistency of the different modality features.

Attached Figure Description

[0036] Figure 1 is a flowchart of a multimodal dialogue dynamic emotion recognition method based on relational subgraph interaction provided in an embodiment of the present invention;

[0037] Figure 2 is a structural diagram of the DEDNet model provided in an embodiment of the present invention;

[0038] Figure 3 is a flowchart of the relational subgraph construction provided in an embodiment of the present invention;

[0039] Figure 4 is a performance comparison chart of DEDNet and three baseline models provided in an embodiment of the present invention;

[0040] Figure 5 is a schematic diagram illustrating the prediction results of the DEDNet model and the baseline models provided in an embodiment of the present invention.

Detailed Implementation

[0041] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0042] As shown in Figures 1 and 2, this invention employs a multimodal dialogue dynamic emotion recognition method based on relational subgraph interaction, comprising:

[0043] Multimodal dialogue data is acquired and input into a trained DEDNet model to obtain emotion recognition results; the DEDNet model includes: a modal feature extractor, an audiovisual modal encoder, a relational subgraph interaction module, and an emotion classifier;

[0044] The training process of the DEDNet model includes:

[0045] S1. Obtain the multimodal dialogue dataset, which includes multiple multimodal dialogues. Each multimodal dialogue includes a multimodal utterance sequence and a speaker sequence. The modalities include: text, auditory, and visual.

[0046] Experiments for this invention were conducted on the IEMOCAP and MELD multimodal dialogue datasets. The IEMOCAP dataset consists of dyadic (two-speaker) dialogues from 12 hours of video recorded with 10 different speakers. Each utterance is labeled with one of six emotions: Happy, Sad, Neutral, Angry, Excited, and Frustrated. The MELD dataset contains 1433 dialogues with 13708 utterances and 304 different speakers. Unlike IEMOCAP, MELD includes dialogues with three or more speakers. Each utterance in a dialogue is labeled with one of seven emotion categories: anger, disgust, fear, joy, neutral, sad, and surprise.

[0047] The utterance sequence is $U = \{u_1, u_2, ..., u_N\}$ and the speaker sequence is $S = \{s_1, s_2, ..., s_M\}$, where $N$ is the number of utterances in the dialogue and $M$ is the number of speakers in the dialogue. There is a mapping relationship between the two sequences: the mapping $[u_i; s_j]$ indicates that the $i$-th utterance was spoken by the $j$-th speaker. The multimodal dialogue emotion recognition task aims to predict the emotion label of each utterance.

[0048] S2. Input the multimodal dialogue data into the modality feature extractor to extract the features of each modality, and obtain text modality features, auditory modality features and visual modality features;

[0049] In the text modality, this invention employs the RoBERTa Large model (a robust optimization pre-training method based on a Transformer bidirectional encoder) to obtain primary text features. To adapt the model to the sentiment classification task, this invention fine-tunes the RoBERTa model; finally, the [CLS] token embeddings from the last layer of the RoBERTa Large model are used as the initial text modality features, with a dimension set to 1024.

[0050] In the auditory modality, this invention uses openSMILE (open-source Speech and Music Interpretation by Large-space Extraction) to extract acoustic features and obtain the initial auditory modality features; the auditory feature dimensions for the IEMOCAP dataset and the MELD dataset are set to 1582 and 300, respectively.

[0051] In the visual modality, this invention uses a DenseNet (dense convolutional network) model pre-trained on the facial expression recognition dataset FER to extract visual features and obtain initial visual modality features.

[0052] The initial features extracted from the text modality, auditory modality, and visual modality are represented as $H = [H_a, H_t, H_v]$, where the feature of modality $m$ for the $i$-th utterance has dimension $d_m$, $m \in \{t, a, v\}$ denotes the text modality, auditory modality, and visual modality respectively, and $d_m$ is the initial feature dimension of modality $m$.

[0053] Features of different modalities have different dimensions. This invention uses a linear layer to achieve dimensional uniformity, enabling the model to calculate and process features more efficiently without having to consider the dimensional differences between different modalities.

[0054]

[0055] Where the output term denotes the modality-$m$ features produced by the linear layer, $H_m$ is the initial feature of modality $m$, $N$ is the dialogue length, and $d$ is the common dimension.
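The dimensionality unification step can be illustrated with a minimal sketch, assuming one independent linear layer per modality. The helper names (`make_linear`, `apply_linear`, `unify_dimensions`), the random initialization, and the toy dimensions are illustrative assumptions, not details given in the source; plain Python lists are used only to keep the example self-contained.

```python
import random

def make_linear(in_dim, out_dim, seed=0):
    """Return (weights, bias) for a linear layer; the random init is illustrative only."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias

def apply_linear(layer, features):
    """Project every utterance vector in an N x in_dim matrix to out_dim."""
    weights, bias = layer
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]
        for vec in features
    ]

def unify_dimensions(initial_features, common_dim):
    """Map each modality's N x d_m matrix to an N x common_dim matrix."""
    unified = {}
    for index, (modality, features) in enumerate(sorted(initial_features.items())):
        layer = make_linear(len(features[0]), common_dim, seed=index)
        unified[modality] = apply_linear(layer, features)
    return unified

# Toy dialogue with N = 4 utterances and different feature sizes per modality.
toy = {
    "t": [[1.0] * 8 for _ in range(4)],
    "a": [[0.5] * 5 for _ in range(4)],
    "v": [[0.2] * 3 for _ in range(4)],
}
unified = unify_dimensions(toy, common_dim=6)
```

After this step every modality has the same shape, so later modules can add or compare the features directly.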

[0056] S3. Input the auditory modality features and visual modality features into the audiovisual modality encoder respectively to obtain the final auditory modality features and visual modality features;

[0057] The audiovisual modal encoder includes a visual modal encoder and an auditory modal encoder; both the visual modal encoder and the auditory modal encoder include an embedding module, a Transformer encoder, and a GRU.

[0058] The embedding module includes speaker embedding and position embedding;

[0059] Speaker embedding includes: In order to distinguish speakers, the present invention embeds speakers into auditory modality features and visual modality features:

[0060]

[0061] Where the embedding matrix is learnable and the one-hot operator converts a speaker label into the corresponding one-hot vector; if utterance $u_i$ was spoken by speaker $s_j$, the resulting vector is the speaker embedding of speaker $s_j$, and SE denotes the speaker embedding of the entire dialogue.

[0062] Position embedding includes: Adding position embedding helps integrate the position and order information of the utterances in a dialogue into the auditory and visual modality features.

[0063]

[0064] Where pos is the utterance index and i is the dimension index; the speaker embedding and the position embedding are fused with the auditory modality features and the visual modality features respectively:

[0065]
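The embedding module can be sketched as follows. The equations themselves are not reproduced in the text, so this example rests on assumptions: the position embedding is taken to be the standard sinusoidal form (which matches the description of pos and i above), and the fusion is taken to be an element-wise sum. The function names are hypothetical.

```python
import math

def speaker_embedding(speaker_ids, embedding_matrix):
    """A one-hot vector times the matrix is equivalent to selecting that speaker's row."""
    return [list(embedding_matrix[s]) for s in speaker_ids]

def position_embedding(num_utterances, dim):
    """Sinusoidal embedding: pos indexes utterances, i indexes dimension pairs."""
    table = []
    for pos in range(num_utterances):
        row = []
        for k in range(dim):
            i = k // 2
            angle = pos / (10000 ** (2 * i / dim))
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def fuse(features, speaker_emb, position_emb):
    """Assumed fusion: element-wise sum of three N x d matrices."""
    return [
        [f + s + p for f, s, p in zip(f_row, s_row, p_row)]
        for f_row, s_row, p_row in zip(features, speaker_emb, position_emb)
    ]

# Three utterances spoken by speakers 0, 1, 0; feature dimension 4.
speaker_matrix = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
features = [[0.0] * 4 for _ in range(3)]
fused = fuse(features, speaker_embedding([0, 1, 0], speaker_matrix), position_embedding(3, 4))
```

The fused matrix has the same shape as the modality features, so it can be passed to the Transformer encoder unchanged.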

[0066] Transformer Encoder: To enhance the sequence representation of audiovisual modal features, this invention uses a Transformer encoder to establish global dependencies within the input sequence. The Transformer model employs a self-attention mechanism, which can identify dependencies between different parts of the input sequence. Through self-attention, the Transformer can simultaneously focus on different parts of the entire audiovisual sequence, helping to capture changes or continuity in emotional states.

[0067]

[0068] Where the output term denotes the output features of the Transformer encoder.

[0069] GRU: To filter irrelevant information out of the Transformer encoder output, this invention uses a gating mechanism, where $W_{gate}$ and $b$ are learnable parameters, $\sigma$ is the sigmoid function, the product is taken element-wise, and the gate term denotes the filter gate.

[0070]
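The gating step described above can be sketched under stated assumptions. The equation is not reproduced in the text, so the form used here (a sigmoid gate computed from a linear map of the input, multiplied element-wise with that input) is an assumption that fits the listed symbols; it is not a full GRU cell, and `filter_gate` is a hypothetical name.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def filter_gate(h, w_gate, b):
    """Compute gate = sigmoid(W_gate h + b) and return (h * gate element-wise, gate)."""
    gate = [
        sigmoid(sum(w * x for w, x in zip(row, h)) + bias)
        for row, bias in zip(w_gate, b)
    ]
    filtered = [x * g for x, g in zip(h, gate)]
    return filtered, gate

# With zero weights and bias the gate is 0.5 everywhere, so the features are halved.
filtered, gate = filter_gate([2.0, -4.0], [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
```

A gate value near 0 suppresses a feature and a value near 1 passes it through, which is how irrelevant information would be filtered.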

[0071] S4. Input the text modality features into the relation subgraph interaction module to obtain the final text modality features;

[0072] Dialogue is typically an alternating process in which participants take turns speaking. This invention proposes a progressive interactive learning strategy for the relational subgraph interaction module. The model first exchanges information among utterances of different speakers, capturing the emotional dependencies between speakers, and then exchanges information among utterances of the same speaker, capturing the speaker's own emotional dependencies. This staged approach better simulates the temporal characteristics of dialogue.

[0073] As shown in Figure 3, a complete dialogue $U = \{u_1, u_2, ..., u_N\}$ can be represented as a directed graph. Each utterance $u_i$ is represented as a node $v_i$ of the graph, which consists of the set of all nodes and the set of all edges $\varepsilon$. Each edge has two attributes: an edge type and an edge weight. Inter-speaker emotional dependence exists between different speakers, while a speaker's own emotional dependence exists within that speaker. On this basis, the present invention divides the graph into two relational subgraphs: the inter-speaker emotional dependency subgraph and the speaker's own emotional dependency subgraph.

[0074] Constructing the inter-speaker emotional dependency subgraph includes:

[0075] Treat all utterances as nodes. If the utterances corresponding to node $v_i$ and node $v_j$ do not belong to the same speaker, construct two directed edges between node $v_i$ and node $v_j$; otherwise, node $v_i$ and node $v_j$ are not connected. If the utterance corresponding to node $v_i$ precedes the utterance corresponding to node $v_j$ in time, set the type of the directed edge from node $v_i$ to node $v_j$ to inter-past (inter-speaker past utterance) and the type of the directed edge from node $v_j$ to node $v_i$ to inter-future (inter-speaker future utterance), thereby obtaining the inter-speaker emotional dependency subgraph. This subgraph consists of the set of all nodes, the set of all edges $\varepsilon_{inter}$, the edge types, and the edge weights.

[0076] Constructing the speaker's own emotional dependency subgraph includes:

[0077] Treat all utterances as nodes. If the utterances corresponding to node $v_i$ and node $v_j$ belong to the same speaker, construct two directed edges between node $v_i$ and node $v_j$; otherwise, node $v_i$ and node $v_j$ are not connected. If the utterance corresponding to node $v_i$ precedes the utterance corresponding to node $v_j$, set the type of the directed edge from node $v_i$ to node $v_j$ to intra-past (the speaker's own past utterance) and the type of the directed edge from node $v_j$ to node $v_i$ to intra-future (the speaker's own future utterance), thereby obtaining the speaker's own emotional dependency subgraph. This subgraph consists of the set of nodes, the set of edges $\varepsilon_{intra}$, the edge types, and the edge weights.
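The two construction rules can be sketched together in a few lines. The function name `build_relation_subgraphs` and the dictionary representation of edges are illustrative assumptions; the edge types and the connection rules follow the description above.

```python
def build_relation_subgraphs(speakers):
    """speakers[i] is the speaker of utterance i, listed in temporal order.

    Returns two dicts that map a directed edge (source, target) to its edge type.
    """
    inter_edges, intra_edges = {}, {}
    n = len(speakers)
    for i in range(n):
        for j in range(i + 1, n):  # utterance i precedes utterance j
            if speakers[i] != speakers[j]:
                inter_edges[(i, j)] = "inter-past"
                inter_edges[(j, i)] = "inter-future"
            else:
                intra_edges[(i, j)] = "intra-past"
                intra_edges[(j, i)] = "intra-future"
    return inter_edges, intra_edges

# A three-utterance dialogue: speaker A, then B, then A again.
inter, intra = build_relation_subgraphs(["A", "B", "A"])
```

In this toy dialogue utterances 0 and 2 share a speaker, so they are linked only in the intra subgraph, while utterance 1 is linked to both of them only in the inter subgraph.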

[0078] The relational subgraph interaction (RSI) module is built on the graph attention network (GAT) and constructs the inter-speaker emotional dependency graph attention network InterGAT and the speaker's own emotional dependency graph attention network IntraGAT;

[0079] The InterGAT inter-speaker affective dependency graph attention network, combined with edge type encoding, further integrates the dependency type and temporal information between utterances:

[0080]

[0081] Where $N_r$ is the number of edge types, $d_r$ is the dimension of the edge-type embedding, the embedding matrix is learnable, the one-hot operator converts an edge-type label into a one-hot vector, and $r_{ij}$ is the type of the edge from node $v_j$ to node $v_i$ in the subgraph. Compared with other graph neural networks, GAT dynamically computes the weights between nodes through an attention mechanism, which better differentiates the importance of nodes. Specifically, for a given center node $v_i$ and its neighboring node $v_j$, InterGAT (the inter-speaker emotional dependency graph attention network) aggregates information as follows:

[0082]

[0083] Where $\alpha_{ij}$ is the weight of the edge from neighbor $v_j$ to node $v_i$, LeakyReLU and $\sigma$ are non-linear activation functions, $W$ and $a$ are trainable parameters, $\|$ denotes the concatenation operation, $N_i$ is the set of neighbors of node $v_i$, and $h_{i,out}$ denotes the updated feature of node $v_i$, that is, the updated text modality feature.

[0084] The updated text modality features and the speaker's own emotional dependency subgraph are then input into IntraGAT, the speaker's own emotional dependency graph attention network, which applies the same processing steps as InterGAT to obtain the final text modality features.
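A single attention update of this kind can be sketched as follows. Because the equations are not reproduced in the text, several points are assumptions: the edge-type embedding is concatenated with the two projected node features inside the attention score, the activation $\sigma$ is taken to be tanh, and an isolated node keeps its own features. The function names are hypothetical.

```python
import math

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def gat_update(i, features, edges, W, a, edge_embeddings):
    """Update node i from its in-neighbours j, where edges[(j, i)] is the edge type."""
    neighbours = sorted(j for (j, target) in edges if target == i)
    if not neighbours:
        return list(features[i])  # assumption: isolated nodes keep their features
    wh_i = matvec(W, features[i])
    scores, messages = [], []
    for j in neighbours:
        wh_j = matvec(W, features[j])
        e_ij = edge_embeddings[edges[(j, i)]]  # lookup equals one-hot times matrix
        concat = wh_i + wh_j + e_ij            # list concatenation plays the role of ||
        scores.append(leaky_relu(sum(p * x for p, x in zip(a, concat))))
        messages.append(wh_j)
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    alphas = [value / total for value in exps]  # attention weights alpha_ij
    summed = [sum(al * m[k] for al, m in zip(alphas, messages)) for k in range(len(wh_i))]
    return [math.tanh(x) for x in summed]       # tanh stands in for sigma

features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edges = {(0, 1): "inter-past", (2, 1): "inter-future"}
identity = [[1.0, 0.0], [0.0, 1.0]]
embeddings = {"inter-past": [0.0], "inter-future": [0.0]}
updated = gat_update(1, features, edges, identity, [0.0] * 5, embeddings)
```

With a zero attention vector both neighbours receive weight 0.5, so the update is the activation applied to the mean of the projected neighbour features; trained parameters would weight neighbours differently depending on their features and edge types.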

[0085] S5. Input the final text modal features, auditory modal features, and visual modal features into the emotion classifier to obtain the emotion recognition results of the dialogue;

[0086] After obtaining the final features from each modality, multimodal feature fusion is performed. The multimodal sentiment classifier includes a softmax function to compute sentiment labels for each utterance.

[0087]

[0088] Where the input term denotes the low-dimensional features output after the three modality features are reduced in dimension through a linear layer, $W_\tau$ and $b_\tau$ are trainable parameters, and $c$ denotes the emotion category;

[0089] Multimodal fusion is performed using element-wise addition, where the subscript $f$ marks the fused features:

[0090]

[0091] Where the output denotes the predicted emotion labels of all utterances in the entire dialogue; each entry is an emotion probability vector representing the emotion recognition result for utterance $u_i$.
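The classifier can be sketched for a single utterance as follows. The use of a separate weight matrix and bias per branch (text, auditory, visual, fused) is an assumption suggested by the subscript on $W_\tau$ and $b_\tau$; the function names and toy parameters are hypothetical.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [value / total for value in exps]

def linear(vec, weights, bias):
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]

def classify_utterance(h_t, h_a, h_v, params):
    """Return emotion probability vectors for the t, a, v branches and the fused branch f.

    params maps a branch name to its (weights, bias) pair.
    """
    h_f = [t + a + v for t, a, v in zip(h_t, h_a, h_v)]  # fusion by element-wise addition
    inputs = {"t": h_t, "a": h_a, "v": h_v, "f": h_f}
    return {name: softmax(linear(vec, *params[name])) for name, vec in inputs.items()}

identity = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
params = {name: identity for name in ("t", "a", "v", "f")}
probs = classify_utterance([1.0, 0.0], [0.0, 0.0], [0.0, 0.0], params)
predicted = max(range(2), key=lambda c: probs["f"][c])
```

The predicted emotion for the utterance is the category with the highest probability in the fused branch; the three unimodal outputs are kept because the training loss also uses them.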

[0092] S6. Calculate the loss function value based on the emotion recognition results, update the model parameters based on the loss function value, and complete the model training when the loss function value is minimized.

[0093] The loss calculation is as follows:

[0094]

[0095] Cross-entropy is used as the loss function for model prediction, where, This represents the loss of three single-mode functions. The fusion loss is calculated for the fused multimodal features, where N represents the number of utterances in the dialogue and C represents the number of sentiment types. y represents the emotion recognition result calculated based on the fused multimodal features. i,j Indicates the true label, The emotion recognition result is calculated based on the m-modal features; λ m The learning factor is set to 0.1, which is the classification loss of modal features with poor classification performance, in order to prioritize learning modal features with poor classification performance.
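The loss can be sketched under stated assumptions. The equation is not reproduced in the text, so the combination shown here (fusion loss plus the three unimodal losses, each scaled by the same factor 0.1) and the averaging over utterances are assumptions; the source may weight modalities selectively. The function names are hypothetical.

```python
import math

def cross_entropy(probabilities, labels):
    """Mean negative log-likelihood of the true labels over the utterances of a dialogue."""
    eps = 1e-12  # guards against log(0)
    losses = [-math.log(max(p[y], eps)) for p, y in zip(probabilities, labels)]
    return sum(losses) / len(losses)

def total_loss(fused_probs, modality_probs, labels, lam=0.1):
    """Assumed combination: fusion loss plus lam-weighted unimodal losses."""
    loss = cross_entropy(fused_probs, labels)
    for m in ("t", "a", "v"):
        loss += lam * cross_entropy(modality_probs[m], labels)
    return loss

# Two utterances, two emotion categories, uninformative predictions everywhere.
uniform = [[0.5, 0.5], [0.5, 0.5]]
value = total_loss(uniform, {"t": uniform, "a": uniform, "v": uniform}, [0, 1])
```

In this toy case every branch has loss ln 2, so the total is 1.3 ln 2; the unimodal terms act as auxiliary signals alongside the fusion loss.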

[0096] Figure 4 shows the performance comparison of DEDNet with three baseline models, MM-DFN (Multimodal Dynamic Fusion Network), MultEMO (Attention-Based Association-Aware Multimodal Fusion Framework), and SDT (Transformer-Based Self-Distillation Model), on cross-speaker emotion dependency samples and same-speaker emotion dependency samples in the IEMOCAP test set; the evaluation metrics are Wa-F1 (weighted F1 score) and Wa-Acc (weighted accuracy). Figure 5 shows a test case of DEDNet and two baseline models, MMGCN (Multimodal Fusion Graph Convolutional Network) and SDT, on the IEMOCAP dataset. Figures 4 and 5 show that the DEDNet model outperforms the baseline models.

[0097] The above-described embodiments further illustrate the purpose, technical solution, and advantages of the present invention. It should be understood that the above-described embodiments are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made to the present invention within the spirit and principles of the present invention should be included within the protection scope of the present invention.

Claims

1. A multi-modal dialogue dynamic sentiment recognition method based on relationship subgraph interaction, characterized in that it comprises: acquiring multimodal dialogue data, and inputting the multimodal dialogue data into a trained DEDNet model to obtain emotion recognition results; the DEDNet model comprises a modality feature extractor, an audiovisual modality encoder, a relationship subgraph interaction module, and an emotion classifier, wherein DEDNet denotes a dynamic emotion dependency network; the training process of the DEDNet model comprises:

S1. Obtaining a multimodal dialogue dataset, wherein the multimodal dialogue dataset includes multiple multimodal dialogues, each multimodal dialogue includes a multimodal utterance sequence and a speaker sequence, and the modalities include a text modality, an auditory modality, and a visual modality;

S2. Inputting each multimodal dialogue into the modality feature extractor to extract the features of each modality, obtaining text modality features, auditory modality features, and visual modality features;

S3. Inputting the auditory modality features and the visual modality features into the audiovisual modality encoder respectively to obtain the final auditory modality features and the final visual modality features;

S4. Inputting the text modality features into the relationship subgraph interaction module to obtain the final text modality features; the relationship subgraph interaction module comprises an inter-speaker emotion dependency graph attention network InterGAT and an intra-speaker emotion dependency graph attention network IntraGAT; the processing of the text modality features by the relationship subgraph interaction module comprises: constructing an inter-speaker emotion dependency subgraph and an intra-speaker emotion dependency subgraph, inputting the text modality features and the inter-speaker emotion dependency subgraph into the InterGAT to obtain updated text modality features, and inputting the updated text modality features and the intra-speaker emotion dependency subgraph into the IntraGAT to obtain the final text modality features;

constructing the inter-speaker emotion dependency subgraph comprises: treating all utterances as nodes; if the utterances corresponding to two nodes do not belong to the same speaker, constructing two directed edges between the two nodes, and otherwise leaving the two nodes unconnected; if the utterance corresponding to one of the two nodes precedes the utterance corresponding to the other, setting the edge type of the directed edge in one direction between the two nodes to inter-past and the edge type of the directed edge in the opposite direction to inter-future, thereby obtaining the inter-speaker emotion dependency subgraph, wherein inter-past refers to a past utterance of another speaker and inter-future refers to a future utterance of another speaker;

constructing the intra-speaker emotion dependency subgraph comprises: treating all utterances as nodes; if the utterances corresponding to two nodes belong to the same speaker, constructing two directed edges between the two nodes, and otherwise leaving the two nodes unconnected; if the utterance corresponding to one of the two nodes precedes the utterance corresponding to the other, setting the edge type of the directed edge in one direction between the two nodes to intra-past and the edge type of the directed edge in the opposite direction to intra-future, thereby obtaining the intra-speaker emotion dependency subgraph, wherein intra-past refers to the speaker's own past utterances and intra-future refers to the speaker's own future utterances;

inputting the text modality features and the inter-speaker emotion dependency subgraph into the InterGAT comprises a computation defined in terms of the following quantities: the edge type of the edge from one node to another in the subgraph; the one-hot encoding of that edge type; a learnable embedding matrix; the dimension of the edge type embedding; the embedding vector of the edge type; two non-linear activation functions; the text modality features corresponding to the two nodes; trainable parameters; the edge weight from one node to the other; the concatenation operation; the neighborhood set of a node; and the updated text modality feature of that node;

S5. Inputting the final text modality features, auditory modality features, and visual modality features into the emotion classifier to obtain the emotion recognition results;

S6. Calculating the loss function value based on the emotion recognition results, updating the model parameters based on the loss function value, and completing the model training when the loss function value is minimized.
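The subgraph construction rule in claim 1 can be illustrated with a short sketch that derives both edge lists from a speaker sequence. The function name and the tuple layout are hypothetical. Because the node symbols were lost from the text, the edge direction attached to each "past" and "future" label is an assumption, and self-loops are omitted because the claim does not address them.

```python
def build_dependency_subgraphs(speakers):
    """Build inter- and intra-speaker edge lists from a speaker sequence.

    speakers[i] is the speaker of the i-th utterance. Each edge is a tuple
    (source, target, edge_type).
    """
    inter_edges, intra_edges = [], []
    n = len(speakers)
    for i in range(n):
        for j in range(i + 1, n):  # utterance i precedes utterance j
            if speakers[i] != speakers[j]:
                prefix, edges = "inter", inter_edges
            else:
                prefix, edges = "intra", intra_edges
            # Assumed direction: the edge leaving the earlier utterance is
            # typed "past"; the edge leaving the later one is typed "future".
            edges.append((i, j, f"{prefix}-past"))
            edges.append((j, i, f"{prefix}-future"))
    return inter_edges, intra_edges

# Three utterances spoken by A, B, A.
inter, intra = build_dependency_subgraphs(["A", "B", "A"])
```

Every pair of utterances lands in exactly one of the two subgraphs, so together they partition the fully connected dialogue graph by speaker relationship.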

2. The multi-modal dialogue dynamic sentiment recognition method based on relationship subgraph interaction according to claim 1, characterized in that a mapping relationship exists between the utterance sequence and the speaker sequence, wherein the mapping indicates that the i-th utterance u_i is spoken by the j-th speaker.

3. The multi-modal dialogue dynamic sentiment recognition method based on relationship subgraph interaction according to claim 1, characterized in that the modality feature extractor comprises a RoBERTa Large model, an openSMILE model, a DenseNet model, and a linear layer; the processing of a multimodal dialogue by the modality feature extractor comprises: using the RoBERTa Large model to extract the initial text modality features of the multimodal dialogue, using the openSMILE model to extract the initial auditory modality features of the multimodal dialogue, using the DenseNet model to extract the initial visual modality features of the multimodal dialogue, and inputting the initial text modality features, initial auditory modality features, and initial visual modality features into the linear layer to unify their dimensions, obtaining the text modality features, auditory modality features, and visual modality features.

4. The multi-modal dialogue dynamic sentiment recognition method based on relationship subgraph interaction according to claim 1, characterized in that the audiovisual modality encoder comprises a visual modality encoder and an auditory modality encoder; both the visual modality encoder and the auditory modality encoder comprise an embedding module, a Transformer encoder, and a GRU; the processing of the visual modality features by the visual modality encoder comprises: S31. Embedding the speaker sequence to obtain the speaker embedding vector; S32. Performing position embedding on the utterance sequence to obtain the utterance position embedding vector; S33. Fusing the speaker embedding vector and the utterance position embedding vector with the visual modality features to obtain the fused features; S34. Inputting the fused features into the Transformer encoder, and inputting the output of the Transformer encoder into the GRU to obtain the final visual modality features.

5. The multi-modal dialogue dynamic sentiment recognition method based on relationship subgraph interaction according to claim 1, characterized in that the processing of the final text modality features, auditory modality features, and visual modality features by the emotion classifier comprises: processing the final text modality features, auditory modality features, and visual modality features with the softmax function to obtain the emotion recognition results for the text modality, auditory modality, and visual modality, respectively; fusing the final text modality features, auditory modality features, and visual modality features to obtain multimodal fusion features; and processing the multimodal fusion features with the softmax function to obtain the emotion recognition results for the multimodal fusion features.

6. The multi-modal dialogue dynamic sentiment recognition method based on relationship subgraph interaction according to claim 1, characterized in that the loss function is given by a formula in which the three modality symbols denote the text modality, auditory modality, and visual modality, respectively; the per-modality term is the loss function of modality m; λ_m is its weight; m is the modality index; and the remaining term is the loss function of the fused multimodal features.