A Multimodal Sentiment Analysis Method and System Based on Graph Structure Optimization and Representation Separation

By employing graph structure optimization and representation separation methods, and utilizing techniques such as quantum long short-term memory networks and graph convolutional networks, the problems of insufficient intramodal representation learning and intermodal heterogeneity in multimodal sentiment analysis are solved, achieving more efficient extraction and prediction of sentiment information.

CN119514592B · Active · Publication Date: 2026-06-30 · ZHENGZHOU UNIVERSITY OF LIGHT INDUSTRY

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: ZHENGZHOU UNIVERSITY OF LIGHT INDUSTRY
Filing Date: 2024-10-29
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

In existing multimodal sentiment analysis techniques, intramodal representation learning is insufficient and intermodal heterogeneity is severe, resulting in inadequate mining of sentiment information and affecting the accuracy and robustness of the model.

Method used

We employ graph structure optimization and representation separation methods to extract multimodal representations through quantum long short-term memory networks, utilize graph convolutional networks and graph attention networks for inter-sample data augmentation, and combine quantum-inspired cross-modal attention mechanisms and positive and negative correlation attention networks to decouple emotional information and improve the extraction effect of intermodal correlation information.

Benefits of technology

It improves the accuracy and robustness of multimodal sentiment analysis, enhances the ability to mine sentiment-related information between modalities, and improves the accuracy and generalization ability of sentiment prediction.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a multimodal sentiment analysis method and system based on graph structure optimization and representation separation, belonging to the field of multimodal sentiment analysis technology. The method includes: processing raw video data to obtain multimodal representations carrying time-series information; performing inter-sample data augmentation on the multimodal representations; performing context interaction processing on the augmented multimodal representations to obtain context-interactive enhanced multimodal representations; performing sentiment representation decoupling processing on the context-interactive enhanced multimodal representations to obtain multimodal representations containing sentiment information; and performing sentiment prediction based on the sentiment-infused multimodal representations to obtain the final sentiment prediction result. This invention fully leverages intramodal representation learning, the heterogeneity of multimodal representations, and the extraction of sentiment information, improving the accuracy and robustness of multimodal sentiment analysis.

Description

Technical Field

[0001] This invention relates to the field of multimodal sentiment analysis technology, and more specifically to a multimodal sentiment analysis method and system based on graph structure optimization and representation separation.

Background Technology

[0002] As a key technology in the field of artificial intelligence, sentiment analysis can improve machine intelligence perception capabilities, enhance the quality of human-computer interaction, and drive the iterative upgrades of wearable smart devices, digital home smart terminal devices, and intelligent sensing and control devices, which is of great significance. Currently, the information provided by single-modal data can no longer meet the needs of intelligent systems to perceive and understand the world. Multimodal cognitive computing, which aims to simulate human cognition for efficient perception and comprehensive understanding of multimodal inputs, has become crucial for achieving general artificial intelligence. Sentiment analysis has also gradually evolved from its initial single-modal analysis to multimodal sentiment analysis.

[0003] Multimodal sentiment analysis (MMA) centers on establishing a reliable mapping between multimodal data and sentiment polarity. This task aims to learn human emotional information from multimodal sequences, including text, visual, and audio data, and to uncover emotional connections and complex relationships between modalities. Related technologies have attempted to fuse sentiment information from heterogeneous data by constructing complex deep learning-based models, promoting information interaction between multimodal data. While these methods have achieved some success, numerous challenges and technical bottlenecks remain.

[0004] First, learning effective representations is a major challenge. Many existing methods utilize attention mechanisms for modal representation learning, but this approach doesn't explicitly model the complex relationships within each modality, preventing the model from fully utilizing the complete representational information within each modality. Second, heterogeneity exists between different modal data, leading to differences in the distribution of sentiment information representations. This difference directly affects the extraction of sentiment information during modal interactions, thus impacting the multimodal fusion results. Furthermore, another major challenge is how to fully mine sentiment-related information while suppressing interference from irrelevant information. Existing methods focus more on removing noise information between modalities, neglecting the weakly relevant sentiment information contained in the modalities after interaction, which may lead to decreased model accuracy and generalization ability. In summary, the main problems facing current multimodal sentiment analysis technology are intramodal representation learning, heterogeneity of multimodal representations, and insufficient mining of sentiment information. These problems severely restrict the development and application of multimodal sentiment analysis technology.

[0005] Therefore, how to provide a multimodal sentiment analysis method and system based on graph structure optimization and representation separation that improves the extraction of intramodal and intermodal correlation information, mines sentiment-related information after modal interaction, and enhances the accuracy and robustness of multimodal sentiment analysis is a problem that those skilled in the art urgently need to solve.

Summary of the Invention

[0006] In view of this, the present invention provides a multimodal sentiment analysis method and system that optimizes graph structure and separates representations, improves the extraction of intramodal and intermodal correlation information, mines sentiment-related information after modal interactions, and enhances the accuracy and robustness of multimodal sentiment analysis. To achieve the above objectives, the present invention adopts the following technical solution:

[0007] A multimodal sentiment analysis method based on graph structure optimization and representation separation includes:

[0008] The raw video data is processed to obtain multimodal representations carrying time-series information;

[0009] Inter-sample data augmentation is performed on the multimodal representations;

[0010] Context interaction processing is performed on the inter-sample data-augmented multimodal representations to obtain context-interaction-enhanced multimodal representations;

[0011] Sentiment representation decoupling is performed on the context-interaction-enhanced multimodal representations to obtain multimodal representations containing sentiment information;

[0012] Sentiment prediction is performed based on the multimodal representations containing sentiment information to obtain the final sentiment prediction result.

[0013] Optionally, the multimodal representation includes: text modal representation, audio modal representation, and visual modal representation.

[0014] Optionally, obtaining the multimodal representation carrying time series information includes:

[0015] Divide the raw video data into multiple short video segments;

[0016] Extract discourse-level representations of text modality, audio modality, and visual modality from multiple short video clips;

[0017] We use quantum long short-term memory networks to analyze the temporal correlation information between the discourse-level representations of text modality, audio modality, and visual modality and other video segment representations, and then fuse the correlation information with the discourse-level representations to obtain the representations of each modality.

[0018] Optionally, the inter-sample data augmentation of the multimodal representation includes:

[0019] Visual modal representations are input into a graph convolutional network. The adjacency matrix is calculated based on the correlation between nodes. Then, the graph adjacency matrix is optimized using constraints to obtain visual modal representations with inter-sample data augmentation.

[0020] Text modality representations and audio modality representations are input into a graph attention network. By constraining the feature representations between low-correlation nodes through similarity calculation, a relationship network between samples is constructed to obtain data-enhanced text modality representations and data-enhanced audio modality representations between samples.

[0021] Optionally, constructing the relationship network between samples includes:

[0022] The correlation between samples can be obtained by calculating the cosine similarity based on the features of two samples.

[0023] The sentiment difference between samples is obtained by calculating the absolute difference between the labels of the two samples using their true labels.

[0024] Both the similarity threshold and the sentiment threshold are network parameters;

[0025] A relationship network between samples is established by comparing the inter-sample correlation with the similarity threshold and the inter-sample sentiment difference with the sentiment threshold.

[0026] Optionally, obtaining the context-interaction-enhanced multimodal representation includes:

[0027] By jointly inputting the text modal representation enhanced by inter-sample data and the audio modal representation enhanced by inter-sample data into a quantum-inspired cross-modal attention network, a context-interaction-enhanced audio modal representation is obtained.

[0028] By jointly inputting the text modal representation enhanced by inter-sample data and the visual modal representation enhanced by inter-sample data into a quantum-inspired cross-modal attention network, a context-interaction-enhanced visual modal representation is obtained.

[0029] The audio modality representation enhanced by inter-sample data and the visual modality representation enhanced by inter-sample data are concatenated and input, together with the text modality representation enhanced by inter-sample data, into a positive and negative correlation attention network to obtain the context-interaction-enhanced text modality representation.

[0030] Optionally, obtaining the multimodal representation containing emotional information includes:

[0031] By jointly inputting context-interactive enhanced text modality representation and context-interactive enhanced audio modality representation into an emotion feature fusion network, text-audio modality information is obtained.

[0032] By jointly inputting context-interactive enhanced text modal representations and context-interactive enhanced visual modal representations into an emotion feature fusion network, text-visual modal information is obtained.

[0033] By merging the text-audio modal information and the text-visual modal information, strongly emotion-related information is obtained.

[0034] The context-interaction-enhanced text, audio, and visual modal representations are each input, together with the strongly emotion-related information, into an independent encoder to obtain text, audio, and visual weakly emotion-related information, respectively.

[0035] The modal representations containing emotional information are obtained as a weighted sum of the strongly emotion-related information and the weakly emotion-related information of the text, audio, and visual modalities.

[0036] Optionally, obtaining the final sentiment prediction result includes:

[0037] The text modal representation containing emotional information, the audio modal representation containing emotional information, and the visual modal representation containing emotional information are concatenated to form a fusion vector;

[0038] The fusion vector is input into the first part of the prediction network, and the weights of each modality feature are learned through the neural network to obtain the weight fusion vector;

[0039] The weighted fusion vector is input into the second part of the prediction network to gradually reduce the dimensionality of the modal information, and finally obtain the sentiment prediction result of the video data.
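The three prediction steps above can be illustrated with a minimal NumPy sketch. The text does not specify the layers of either part of the prediction network, so everything concrete here is an assumption made for illustration: the softmax gate that produces one weight per modality, the single hidden layer, and the names `predict_sentiment`, `W_gate`, `W1`, and `W2` are all hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def predict_sentiment(f_t, f_a, f_v, params):
    """Sketch of the two-part prediction network (layer choices are assumptions)."""
    # Concatenate the three modality representations into the fusion vector.
    fusion = np.concatenate([f_t, f_a, f_v])
    # Part 1 (assumed form): learn one weight per modality from the fusion vector.
    modality_weights = softmax(params["W_gate"] @ fusion + params["b_gate"])
    weighted = np.concatenate(
        [w * f for w, f in zip(modality_weights, (f_t, f_a, f_v))]
    )
    # Part 2 (assumed form): reduce dimensionality step by step, 3d -> hidden -> 1.
    hidden = np.maximum(params["W1"] @ weighted + params["b1"], 0.0)
    score = params["W2"] @ hidden + params["b2"]
    return float(score[0]), modality_weights

# Usage with random, untrained parameters and 128-dimensional representations.
rng = np.random.default_rng(0)
d, hidden_dim = 128, 32
params = {
    "W_gate": rng.normal(size=(3, 3 * d)) * 0.01, "b_gate": np.zeros(3),
    "W1": rng.normal(size=(hidden_dim, 3 * d)) * 0.01, "b1": np.zeros(hidden_dim),
    "W2": rng.normal(size=(1, hidden_dim)) * 0.01, "b2": np.zeros(1),
}
f_t, f_a, f_v = (rng.normal(size=d) for _ in range(3))
score, weights = predict_sentiment(f_t, f_a, f_v, params)
```

In this sketch the modality weights always sum to 1, so the first part only redistributes emphasis among text, audio, and visual features before the second part maps the weighted vector to a single sentiment score.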

[0040] Optionally, a multimodal sentiment analysis system with graph structure optimization and representation separation includes:

[0041] Preprocessing module: Used to process raw video data to obtain multimodal representations carrying time-series information;

[0042] Inter-sample data augmentation module: used to perform inter-sample data augmentation on multimodal representations;

[0043] Context interaction processing module: used to perform context interaction processing on the inter-sample data-augmented multimodal representations to obtain context-interaction-enhanced multimodal representations;

[0044] Sentiment representation decoupling module: used to perform sentiment representation decoupling on the context-interaction-enhanced multimodal representations to obtain multimodal representations containing sentiment information;

[0045] Sentiment prediction module: Used to perform sentiment prediction based on multimodal representations containing sentiment information, and obtain the final sentiment prediction result.

[0046] As can be seen from the above technical solution, compared with the prior art, the present invention discloses a multimodal sentiment analysis method and system for graph structure optimization and representation separation, which has the following beneficial effects:

[0047] This invention discloses a multimodal sentiment analysis method and system based on graph structure optimization and representation separation. The method includes: processing raw video data to obtain multimodal representations carrying time-series information; performing inter-sample data augmentation on the multimodal representations; performing context interaction processing on the augmented multimodal representations to obtain context-interactive enhanced multimodal representations; performing sentiment representation decoupling processing on the context-interactive enhanced multimodal representations to obtain multimodal representations containing sentiment information; and performing sentiment prediction based on the sentiment-infused multimodal representations to obtain the final sentiment prediction result. This invention fully leverages intramodal representation learning, the heterogeneity of multimodal representations, and the extraction of sentiment information, thereby improving the accuracy and robustness of multimodal sentiment analysis.

Description of the Drawings

[0048] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.

[0049] Figure 1(a) is a schematic diagram of the multimodal sentiment analysis method for graph structure optimization and representation separation provided by the present invention;

[0050] Figure 1(b) is a framework diagram of a multimodal sentiment analysis method for graph structure optimization and representation separation provided by the present invention;

[0051] Figure 2(a) is a schematic diagram of a quantum LSTM provided by the present invention;

[0052] Figure 2(b) is a schematic diagram of a quantum circuit provided by the present invention;

[0053] Figure 3(a) is a schematic diagram of a quantum-inspired cross-modal attention network provided by the present invention;

[0054] Figure 3(b) is a schematic diagram of a positive and negative correlation attention network provided by the present invention;

[0055] Figure 4 is a schematic diagram of a decoupler implementation provided by the present invention.

Detailed Implementation

[0056] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0057] This invention discloses a multimodal sentiment analysis method based on graph structure optimization and representation separation, comprising:

[0058] The raw video data is processed to obtain multimodal representations carrying time-series information;

[0059] Inter-sample data augmentation is performed on the multimodal representations;

[0060] Context interaction processing is performed on the inter-sample data-augmented multimodal representations to obtain context-interaction-enhanced multimodal representations;

[0061] Sentiment representation decoupling is performed on the context-interaction-enhanced multimodal representations to obtain multimodal representations containing sentiment information;

[0062] Sentiment prediction is performed based on the multimodal representations containing sentiment information to obtain the final sentiment prediction result.

[0063] In a specific implementation, the multimodal sentiment analysis method based on graph structure optimization and representation separation is shown in Figure 1(a). It may include: a text representation extraction module 101, an audio representation extraction module 102, a visual representation extraction module 103, a representation enhancement module 104, a context interaction module 105, a sentiment representation decoupling module 106, and a sentiment prediction module 107.

[0064] The raw video data is input into the text representation extraction module 101, audio representation extraction module 102, and visual representation extraction module 103, respectively. After processing by these modules, corresponding text, audio, and visual representations are obtained. Specifically, the raw video data can be divided into multiple short video segments, and then the text, audio, and visual modal representations can be extracted from each short video segment using the text, audio, and visual representation modules 101, 102, and 103, respectively.

[0065] The obtained text representation, audio representation and visual representation are input into the representation enhancement module 104 to obtain the text representation, audio representation and visual representation after data enhancement between samples.

[0066] The inter-sample data-enhanced text, audio, and visual representations are input into the context interaction module 105 to obtain context-interaction-enhanced text, audio, and visual representations.

[0067] The text, audio, and visual representations enhanced by contextual interaction are input into the emotion representation decoupling module 106 to obtain text, audio, and visual representations containing emotional information.

[0068] The text representation, audio representation, and visual representation containing emotional information are input into the sentiment prediction module 107 to obtain the final multimodal sentiment prediction result.

[0069] In a specific implementation, a multimodal sentiment analysis method based on graph structure optimization and representation separation includes the following steps:

[0070] Step 1: Input the raw video data into the feature extraction module to obtain modal representations carrying time series information.

[0071] As shown in Figure 1(b), the raw video data is divided sentence by sentence into multiple short video segments. Feature extraction techniques are then used to extract discourse-level modal representations of the text, audio, and visual modalities from each short video segment. Specifically, the GloVe model, COVAREP model, and Facet model are used to encode the text, acoustic, and image signals in the video segments, respectively, yielding text modal features $f_t$, audio modal features $f_a$, and visual modal features $f_v$ of size 1×k. The text modal features are input into the BERT model, while the audio and visual modal features are input into a quantum long short-term memory (QLSTM) network model to mine contextual information between different video segments, yielding the modal representations $F_t$, $F_a$, and $F_v$ of the three modalities, each of size 1×128.

[0072] Figure 2(a) shows the structure of the QLSTM network model. As in the classic LSTM, each memory cell consists of a forget gate, an input gate, and an output gate. Unlike the classic LSTM, the QLSTM network model replaces the weight parameter matrices with five newly constructed variational quantum circuits (VQCs), each with its own parameters. Each VQC performs a different function depending on its position in the cell. A VQC is a quantum circuit with adjustable parameters that can be optimized iteratively, which extends the classic LSTM to the quantum domain. Figure 2(b) shows the structure of the VQC, which consists of an encoding layer, a variational layer, and a measurement layer.

[0073] In a specific implementation, during quantum circuit processing, the encoding layer is used to convert classical data into quantum states. This invention uses an angle encoding method to process the original data. The specific process is implemented by the RX gate, which can be expressed as formula (1). By using the RX gate, classical data can be effectively encoded into the rotation angle of the qubit, thereby utilizing the advantages of quantum computing to perform complex quantum state transformations and processing. The total operation of the encoding layer can be expressed as formula (2).

[0074]

[0075]

[0076] Here, $f_i$ denotes the different classical data features, $n$ is the number of qubits, $I$ is the identity matrix, $\otimes$ is the tensor product, $h_{t-1}$ is the hidden state at time step $t-1$, $h_t$ is the hidden state at time step $t$, $C_{t-1}$ is the cell state at time step $t-1$, $C_t$ is the cell state at time step $t$, $y_t$ is the output at time step $t$, and $X$ is the Pauli-X gate.
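For reference, the textbook definition of the RX gate and its tensor-product extension to n qubits can be written as below. This is offered only as a plausible reading of formulas (1) and (2), based on the symbols defined above ($f_i$, $n$, $I$, $X$, and the tensor product). The names $U_{\mathrm{enc}}$ and $|\psi'\rangle$ are assumptions introduced here, and $\mathrm{i}$ denotes the imaginary unit, as distinct from the qubit index $i$.

```latex
\begin{align}
R_X(f_i) &= e^{-\mathrm{i} f_i X / 2}
          = \cos\frac{f_i}{2}\, I - \mathrm{i}\,\sin\frac{f_i}{2}\, X \\
U_{\mathrm{enc}}(f) &= \bigotimes_{i=0}^{n-1} R_X(f_i),
\qquad |\psi'\rangle = U_{\mathrm{enc}}(f)\, |0\rangle^{\otimes n}
\end{align}
```

Under this reading, each classical feature $f_i$ sets the rotation angle of exactly one qubit, so n features require n qubits.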

[0077] Specifically, the variational layer optimizes the quantum state through parameterized quantum gates. First, an RX gate is applied to each qubit to rotate its state around the X-axis, with a trainable rotation angle θ; this transforms the input state into a new quantum state. Next, a Hadamard (H) gate is applied to each qubit so that its state becomes a superposition of $|0\rangle$ and $|1\rangle$. Finally, to generate entanglement, a CNOT gate is applied to each pair of adjacent qubits to conditionally flip the state of the target qubit. Through the trainable RX gates followed by the H and CNOT gates, the variational layer shapes and entangles the quantum state into a complex state adapted to the specific task. These operations change the amplitudes of the quantum state and introduce entanglement between qubits, which provides rich information for the subsequent measurement and calculation. The variational layer is given by formula (3).

[0078]

[0079]

[0080] Here, $H_i$ denotes the application of an H gate on the i-th qubit, $\theta_i$ is the rotation angle of the i-th qubit, $\mathrm{CNOT}_{i,i+1}$ denotes the application of a controlled-NOT gate on the i-th and (i+1)-th qubits, and $\mathrm{CNOT}_{n-1,0}$ denotes the application of a controlled-NOT gate on the last qubit and the first qubit.

[0081] Specifically, in the measurement layer, as shown in Equation (4), the quantum state produced by the encoding layer and the variational layer is measured to extract classical information. To retain as much information from the input data as possible, the present invention uses Pauli-Z basis measurement. The measurement result reflects the projection of the quantum state onto the Z-axis, thereby extracting useful features for further processing. The overall measurement result is shown in Equation (5).

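[0082] $\langle Z_i \rangle = \langle \psi'' | Z_i | \psi'' \rangle$ (4);

[0083] $\mathrm{Measurement} = \{ \langle Z_0 \rangle, \langle Z_1 \rangle, \ldots, \langle Z_{n-1} \rangle \}$ (5);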

[0084] Where i represents the i-th qubit, and n represents the number of qubits. In this invention, n = 8.
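The encoding and measurement layers can be illustrated with a small state-vector simulation. The sketch below assumes NumPy and the textbook RX gate; the function names `rx`, `angle_encode`, and `pauli_z_expectations` are hypothetical, and the trainable variational layer is deliberately left out, so this shows only angle encoding followed by Pauli-Z measurement on n = 8 qubits.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
PAULI_X = np.array([[0, 1], [1, 0]], dtype=complex)

def rx(angle):
    """Textbook RX gate: cos(angle/2) I - i sin(angle/2) X."""
    return np.cos(angle / 2) * I2 - 1j * np.sin(angle / 2) * PAULI_X

def angle_encode(features):
    """Return the product state RX(f_0)|0> (x) ... (x) RX(f_{n-1})|0>."""
    zero = np.array([1, 0], dtype=complex)
    state = np.array([1], dtype=complex)
    for f in features:
        # Qubit 0 ends up as the most significant index of the state vector.
        state = np.kron(state, rx(f) @ zero)
    return state

def pauli_z_expectations(state, n):
    """Return <Z_i> for every qubit i of an n-qubit state vector."""
    probs = (np.abs(state) ** 2).reshape([2] * n)
    result = []
    for q in range(n):
        other_axes = tuple(a for a in range(n) if a != q)
        p0, p1 = probs.sum(axis=other_axes)  # marginal P(q=0), P(q=1)
        result.append(p0 - p1)               # <Z> = P(0) - P(1)
    return np.array(result)

features = np.linspace(0.0, np.pi, 8)   # eight illustrative feature values
state = angle_encode(features)
measurement = pauli_z_expectations(state, 8)
```

Because the encoded state is a product state, each expectation value here equals cos(f_i), which makes the sketch easy to check by hand. Once entangling gates are added, the expectation values generally depend on several input features at once.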

[0085] Step 2: The modal representations carrying time-series information obtained in Step 1 are input into the representation enhancement module. The processing is divided into two parts using a graph neural network. In the first part, the visual modal representations are fed into a graph convolutional network to capture the local structure and boundary information of the image. In the second part, the textual and audio modal representations are fed into a graph attention network to aggregate the features of different samples on a sample-by-sample basis to enhance the expressive power of the samples. Cosine similarity and label difference are used to constrain the nodes and construct a relationship network between samples, thereby learning the modal representations between similar samples and obtaining the data-enhanced modal representations between samples.

[0086] In a specific implementation, the first part works as follows. To let different samples of the same modality exchange information, all samples within a batch are placed into an optimization pool when the graph is constructed, and these samples are built into a graph with each sample serving as a node. Two nodes are assumed to be connected if their feature embeddings are close. This invention uses a simple cross-node self-attention mechanism to determine the correlation between nodes, thereby defining the adjacency matrix $A_v$ of the graph, as shown in formula (6).

[0087]

[0088] Here, $A_v \in \mathbb{R}^{N \times N}$ is the adjacency matrix, $N_v \in \mathbb{R}^{N \times d}$ contains the embedding vectors of the N sample nodes, $f$ is the ReLU activation function, $I$ is the identity matrix, $d$ is the hidden-layer dimension, and the remaining symbols denote learnable parameters.

[0089] Specifically, as shown in formula (7), the graph structure optimization strategy optimizes the relationship between nodes by calculating similarity. The optimization strategy is as follows: if the similarity between two nodes is greater than a certain threshold and the absolute value of the difference between the true labels is less than a certain threshold, then the two nodes are determined to be neighbors. Then, all nodes in the optimization pool are traversed, the edges between nodes that do not meet the requirements are broken, and nodes with high similarity are connected, thereby dynamically adjusting the adjacency matrix and reconstructing the graph.

[0090]

[0091]

[0092]

[0093] Here, S is the similarity between two nodes, computed from the feature representations of the i-th and j-th samples; Y is the absolute value of the difference between the true labels of the i-th and j-th samples; U denotes the optimizer, which produces the updated adjacency matrix; and $\theta_v$ are the optimizer parameters.

[0094] Specifically, as shown in formula (8), the GCN algorithm is used to enrich the feature representation of each sample, help the model understand and utilize the interrelationships between samples, ensure that the node can accurately learn the features of all similar nodes, rather than relying solely on the information of a single sample, and improve the potential correlation between sample features.

[0095]

[0096] Here, $D \in \mathbb{R}^{N \times N}$ is the degree matrix and $X_v \in \mathbb{R}^{N \times d}$ contains the sample features after GCN enhancement.
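A single GCN propagation step over the sample graph can be sketched as follows. The sketch assumes the widely used symmetric normalization $D^{-1/2} \hat{A} D^{-1/2}$ with added self-loops and a ReLU activation, which may differ in detail from formula (8). The function name `gcn_layer` and the weight matrix `weight` are hypothetical.

```python
import numpy as np

def gcn_layer(adjacency, node_feats, weight):
    """One GCN step: ReLU(D^{-1/2} (A + I) D^{-1/2} N W), an assumed standard form."""
    # Add self-loops so that every node keeps part of its own features.
    adj_hat = adjacency + np.eye(adjacency.shape[0])
    degree = adj_hat.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(degree))
    norm_adj = d_inv_sqrt @ adj_hat @ d_inv_sqrt
    # Aggregate neighbour features, project them, and apply ReLU.
    return np.maximum(norm_adj @ node_feats @ weight, 0.0)

# Two connected samples: each output row becomes the average of both inputs.
adjacency = np.array([[0.0, 1.0], [1.0, 0.0]])
node_feats = np.array([[1.0, 0.0], [0.0, 1.0]])
enhanced = gcn_layer(adjacency, node_feats, np.eye(2))
```

With the identity weight matrix used here, both enhanced rows are [0.5, 0.5], which shows how each sample's representation is mixed with those of its neighbours.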

[0097] In a specific implementation, the second part includes treating each sample as a node in both the text and audio graphs. The optimization strategy is to select K nodes that are most similar to the sentiment expression of the node itself as its neighbor nodes, filter out other low-relevance nodes, construct a relationship network between samples, and strengthen the feature expression of the nodes in the graph.
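Selecting the K most similar nodes as neighbours can be sketched as below. This is a minimal illustration that assumes a precomputed pairwise similarity matrix; the function name `top_k_neighbors` is hypothetical, and the label-difference constraint that the method also applies is not included here.

```python
import numpy as np

def top_k_neighbors(similarity, k):
    """Keep, for every node, edges only to its k most similar other nodes."""
    sim = np.array(similarity, dtype=float)
    np.fill_diagonal(sim, -np.inf)              # a node is not its own neighbour
    nearest = np.argsort(-sim, axis=1)[:, :k]   # indices of the k largest per row
    adjacency = np.zeros(sim.shape)
    rows = np.arange(sim.shape[0])[:, None]
    adjacency[rows, nearest] = 1.0
    return adjacency

similarity = np.array([
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])
adjacency = top_k_neighbors(similarity, k=1)
```

The resulting adjacency matrix is directed: node 2 selects node 1 as its neighbour, but node 1 selects node 0.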

[0098] Specifically, neighbor selection is likewise based on the inter-sample similarity and the true-label difference: the K other samples most similar to a given sample are selected. As shown in formula (9), following the GAT formulation, for vertex i the similarity coefficient $e_{ij}$ between each neighbor j and itself is calculated one by one. First, a linear mapping with the shared parameter $W \in \mathbb{R}^{d \times d}$ transforms the vertex features; then the transformed features of vertices $h_i$ and $h_j$ are concatenated; finally, the concatenated high-dimensional feature is mapped to a real number by $b$. After softmax processing, the attention coefficient $\alpha_{ij}$ is obtained. The attention coefficients replace the elements of the initial adjacency matrix to give the final adjacency matrix. The classic GAT algorithm is then used to assign different weights to different neighbor nodes, yielding a richer augmented node representation.

[0099]

[0100]

[0101]

[0102] Here, $m \in \{t, a\}$ denotes the modality, $b \in \mathbb{R}^{2d \times 1}$ maps the concatenated features to a real number, $e_{ij}$ is the similarity coefficient, $X_m \in \mathbb{R}^{1 \times d}$ are the sample features enhanced by GAT, $h_i$ is the feature vector of the i-th sample, $h_j$ is the feature vector of the j-th sample, and $W$ is the linear layer.
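The attention computation described for the graph attention network can be sketched in NumPy as follows. The sketch follows the standard GAT formulation and assumes a LeakyReLU on the similarity coefficient, which the text does not mention. The function name `gat_layer` and the `slope` parameter are hypothetical, and every node is assumed to have at least one neighbour in `adjacency`.

```python
import numpy as np

def gat_layer(node_feats, adjacency, W, b, slope=0.2):
    """Return attention coefficients alpha and the attention-weighted features."""
    z = node_feats @ W                  # shared linear map, shape (N, d)
    d = z.shape[1]
    b = np.asarray(b, dtype=float).reshape(-1)   # accepts shape (2d,) or (2d, 1)
    # e[i, j] = b^T [z_i || z_j] splits into a source term and a target term.
    e = (z @ b[:d])[:, None] + (z @ b[d:])[None, :]
    e = np.where(e > 0, e, slope * e)            # LeakyReLU (assumption)
    e = np.where(adjacency > 0, e, -np.inf)      # keep only neighbour pairs
    e = e - e.max(axis=1, keepdims=True)         # numerical stability
    alpha = np.exp(e)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)   # softmax over neighbours
    return alpha, alpha @ z

node_feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
adjacency = np.array([[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
alpha, enhanced = gat_layer(node_feats, adjacency, np.eye(2), np.zeros(4))
```

With a zero vector `b`, every neighbour receives the same weight, so the example reduces to plain neighbourhood averaging. Trained values of `W` and `b` would make the weights differ between neighbours.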

[0103] The method of using cosine similarity and label difference to constrain node relationships is as follows: for each modality sample representation of the input graph neural network, the cosine similarity and the absolute difference of the true label are calculated pairwise. At the same time, thresholds are set for these two constraints. The relationship network between the samples is established only when the cosine similarity between two samples is greater than the similarity threshold and the absolute difference of the true label is less than the label threshold.
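The two-threshold rule above can be sketched as a small NumPy function. In the method both thresholds are network parameters; in this illustration they are passed as fixed arguments for simplicity, and the function name `build_relation_network` is hypothetical.

```python
import numpy as np

def build_relation_network(features, labels, sim_threshold, label_threshold):
    """Connect samples i and j when cosine similarity is high and labels are close."""
    feats = np.asarray(features, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Pairwise cosine similarity S[i, j].
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    unit = feats / np.clip(norms, 1e-12, None)
    similarity = unit @ unit.T
    # Pairwise absolute difference of the true labels Y[i, j].
    label_diff = np.abs(labels[:, None] - labels[None, :])
    # An edge requires both constraints to hold at once.
    adjacency = (similarity > sim_threshold) & (label_diff < label_threshold)
    np.fill_diagonal(adjacency, False)      # no self-edges in this sketch
    return adjacency.astype(float)

features = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
labels = [1.0, 1.2, -2.0]
graph = build_relation_network(features, labels, sim_threshold=0.9, label_threshold=0.5)
```

In this example only samples 0 and 1 are connected: their features point in nearly the same direction and their labels differ by 0.2, while sample 2 fails both constraints.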

[0104] Step 3: The modal representations augmented by inter-sample data are input into the interactor. A quantum-inspired cross-modal attention mechanism is used to leverage the dominant role of the augmented text modal representations, injecting richer emotional information into the augmented audio and visual modal representations. Considering the advantages of the augmented audio and visual modal representations in specific contexts, positive and negative correlation attention is used to balance the emotional information of the three modalities; thus, the enhanced modal representations are obtained. Figures 3(a) and 3(b) are schematic diagrams of a quantum-inspired cross-modal attention network and a positive and negative correlation attention network provided by embodiments of the present invention.

[0105] In a specific implementation, as shown in formula (10), taking the enhancement of the visual modality by the text modality as an example, the text and visual modalities are converted into queries (Q), keys (K), and values (V) through a linear layer; these serve as inputs to the attention module. The attention score is calculated as the inner product of Q and K and is then used to form a weighted sum over V, giving a new attention vector. To further enhance the model's ability to capture data complexity, a quantum layer is introduced after the attention calculation to increase expressiveness. The design of the quantum circuit is shown in Figure 2(b). Because multi-head attention allows the model to capture information at different levels of abstraction, the model applies multiple sets of W_q, W_k, and W_v to the input data to learn a multidimensional representation, and the outputs of the heads are finally concatenated to obtain a cross-modal representation, so that one modality can receive information from another.

[0106]

[0107] F_{att} = Concat(head_1, ..., head_h) W_o (10);

[0108] Here F_{att} is the sample feature vector after multi-head attention enhancement; head_i is computed from the Query, Key, and Value vectors of the i-th attention head; W_o is a learnable weight parameter; W_Q is used to transform the data into the appropriate dimensions for quantum processing; h is the number of attention heads; d_k is the feature dimension; and the inputs to the i-th attention head are its text vector and its visual vector.
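A classical sketch of the multi-head cross-modal attention around formula (10) is given below. Several points are assumptions made for illustration: the quantum layer is omitted (its position is marked by a comment), the scores are scaled by sqrt(d_k), and the visual sequence supplies the queries while the text sequence supplies keys and values. All function and variable names are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def cross_modal_attention(X_q, X_kv, W_q, W_k, W_v, W_o, num_heads):
    """Multi-head attention where X_q attends to X_kv.

    X_q: (T_q, d) target modality; X_kv: (T_kv, d) source modality.
    """
    d = X_q.shape[1]
    d_k = d // num_heads
    Q, K, V = X_q @ W_q, X_kv @ W_k, X_kv @ W_v
    heads = []
    for i in range(num_heads):
        s = slice(i * d_k, (i + 1) * d_k)
        weights = softmax(Q[:, s] @ K[:, s].T / np.sqrt(d_k))
        heads.append(weights @ V[:, s])   # a quantum layer would be applied here
    return np.concatenate(heads, axis=1) @ W_o   # Concat(head_1..head_h) W_o

rng = np.random.default_rng(0)
d = 8
X_v, X_t = rng.normal(size=(5, d)), rng.normal(size=(7, d))
W_q, W_k, W_v, W_o = (rng.normal(size=(d, d)) for _ in range(4))
F_att = cross_modal_attention(X_v, X_t, W_q, W_k, W_v, W_o, num_heads=2)
```

The output keeps the length of the query sequence (here the visual one), so it can be added back to that modality in a residual connection.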

[0109] Furthermore, as shown in Equation (11), to prevent degradation during deep network training, the result after multi-head attention processing is residually connected to the initial information. Simultaneously, a feedforward layer is added to enhance the model's nonlinear modeling and expressive capabilities.

[0110]

[0111]

[0112] Here the output is the non-text modality feature vector after text modality enhancement, for example a visual vector that carries textual information; FFN denotes the feedforward neural network layer, and LN denotes layer normalization.
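The residual connection, layer normalization, and feedforward layer described for Equation (11) can be sketched as below. Because the equation is not reproduced in the text, the ordering used here (residual, then LN, then FFN with a second residual and LN, as in a standard Transformer block) is an assumption, as are all names and the omission of learnable LN gain and bias.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each row to zero mean and unit variance (no learnable gain/bias).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ffn(x, W1, b1, W2, b2):
    # Two linear maps with a ReLU in between.
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

def residual_block(X, F_att, W1, b1, W2, b2):
    Z = layer_norm(X + F_att)                         # residual with the initial input
    return layer_norm(Z + ffn(Z, W1, b1, W2, b2))     # feedforward sub-layer

rng = np.random.default_rng(0)
X, F_att = rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
W1, b1 = rng.normal(size=(8, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 8)), np.zeros(8)
X_enh = residual_block(X, F_att, W1, b1, W2, b2)
```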

[0113] In a specific implementation, as shown in formula (12), the audio and visual modalities are concatenated as the key (K) and value (V) of the attention mechanism, and the text modality is regarded as the query (Q). A cross-modal attention mechanism is used to capture remote nonverbal emotional information and generate text-based nonverbal embeddings. Specifically, this invention designs a dual-channel interaction method. In the first channel, a positively correlated single-head attention mechanism is used to promote the interaction of contextual information expressing similar emotions in different modalities. In the other channel, a negatively correlated attention mechanism is used to focus on the differences in emotional expression between different modalities, thereby obtaining a cross-modal enhanced representation with complete emotional semantics.

[0114] F_{va} = Concat(X_v, X_a);

[0115]

[0116]

[0117]

[0118]

[0119] Here the projection matrices are learnable parameters, where p denotes positively correlated attention and n denotes negatively correlated attention, and the scale parameter is a scaling factor. During training, because the linguistic and non-linguistic modalities reside in two different feature spaces, their correlation is very small, which results in small elements in the weight matrix. To better facilitate model learning, the scaling factor is introduced when the weight coefficients are calculated in order to scale the matrix. The remaining symbols denote, in order, the single-head attention text query vector, the audio-visual keys, the audio-visual values, the audio-visual attention vectors, and the transpose of the audio-visual key.
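One way to picture the dual-channel design is the sketch below, in which the positive channel applies softmax to the scaled text-to-audio-visual scores and the negative channel applies softmax to their negation. This reading is an assumption: formula (12) is not reproduced in the text, so the sign flip, the multiplicative use of `scale`, the concatenation of X_v and X_a along the time axis, and every name here are illustrative only.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def pos_neg_attention(X_t, F_va, params_p, params_n, scale):
    """Single-head attention with text as query and audio-visual as key/value.

    params_p / params_n: (W_q, W_k, W_v) for the positive / negative channel.
    """
    outputs = []
    for sign, (W_q, W_k, W_v) in ((1.0, params_p), (-1.0, params_n)):
        Q, K, V = X_t @ W_q, F_va @ W_k, F_va @ W_v
        weights = softmax(sign * scale * (Q @ K.T))   # negated scores favour dissimilar items
        outputs.append(weights @ V)
    return outputs[0], outputs[1]

rng = np.random.default_rng(0)
d = 4
X_t = rng.normal(size=(5, d))
F_va = np.concatenate([rng.normal(size=(6, d)), rng.normal(size=(4, d))], axis=0)
params_p = tuple(rng.normal(size=(d, d)) for _ in range(3))
params_n = tuple(rng.normal(size=(d, d)) for _ in range(3))
H_p, H_n = pos_neg_attention(X_t, F_va, params_p, params_n, scale=2.0)
```

With this construction the positive channel concentrates on audio-visual steps whose keys align with the text query, while the negative channel concentrates on the least aligned steps.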

[0120] Furthermore, as shown in Equation (13), the positively correlated attention result vector and the negatively correlated attention result vector are summed, and time-series information is added. On this basis, a dynamic weighting operation combines the weighted fusion result with the original text vector. This process ensures that the text modality is effectively enhanced without losing textual information.

[0121]

[0122]

[0123]

[0124] Here k_{va} is the size of the convolution kernel; β is the adaptive factor; ||X_t||_2 and its counterpart for the audio-visual attention vector are the L2 norms of X_t and of that vector, respectively; μ is a learnable parameter; and δ is set to 10e-6 to prevent the denominator from being 0.
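The dynamic weighting step can be illustrated with a norm-ratio gate. The specific form used below, β = min(μ · ||X_t||_2 / (||H_va||_2 + δ), 1) followed by X_t + β · H_va, is an assumption modeled on common adaptive gating schemes, since Equation (13) is not reproduced in the text; the convolution that adds time-series information is left out, and all names are hypothetical.

```python
import numpy as np

def adaptive_fuse(X_t, H_va, mu=1.0, delta=10e-6):
    """Add the audio-visual vector to the text vector with a bounded weight.

    X_t, H_va: (T, d). delta follows the value quoted in the text.
    """
    norm_t = np.linalg.norm(X_t, axis=-1, keepdims=True)
    norm_va = np.linalg.norm(H_va, axis=-1, keepdims=True)
    beta = np.minimum(mu * norm_t / (norm_va + delta), 1.0)   # adaptive factor, capped at 1
    return X_t + beta * H_va

rng = np.random.default_rng(0)
X_t = rng.normal(size=(5, 8))
H_va = rng.normal(size=(5, 8))
X_t_enh = adaptive_fuse(X_t, H_va)
```

Capping β at 1 keeps a very large audio-visual vector from overwhelming the text vector, which matches the stated goal of enhancing the text modality without losing textual information.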

[0125] Step 4: Rearrange and combine the enhanced modal representations, extracting emotional information from both feature and temporal dimensions, and then concatenate them to generate strongly related emotional information. The enhanced modal representations and strongly related emotional information are then fed into the encoder to obtain weakly related emotional information independent of each modality. The strongly and weakly related emotional information are then dynamically summed to obtain the decoupled modal representations. Figure 4 is a schematic diagram of the decoupler implementation. The steps for implementing the decoupler are as follows:

[0126] As shown in formula (14), the text-visual modality and the text-audio modality are respectively input into two independent feedforward layers to promote the deep interactive fusion of different modalities in different dimensions.

[0127]

[0128]

[0129]

[0130] Here r ∈ {a, v}; CMLP and TMLP denote the channel feedforward layer and the time feedforward layer, respectively, each with its own parameters. The CMLP output is the vector after processing by the channel feedforward layer, and the TMLP output is the vector after processing by the time feedforward layer. The inputs are the text features after processing by the graph attention network and the non-text modality features after graph processing.
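The distinction between the channel feedforward layer and the time feedforward layer can be sketched as follows: the same two-layer MLP is applied once along the feature axis and once along the time axis (by transposing). How the text and non-text inputs are combined before this step is not specified in the text, so the single (T, d) input `F_pair` below, like every other name, is an assumption.

```python
import numpy as np

def mlp(x, W1, W2):
    # Two-layer feedforward map acting on the last axis of x.
    return np.maximum(x @ W1, 0.0) @ W2

def channel_time_mix(F_pair, Wc1, Wc2, Wt1, Wt2):
    """F_pair: (T, d) bimodal representation (hypothetical input format)."""
    F_c = mlp(F_pair, Wc1, Wc2)          # CMLP: mixes information across features
    F_t = mlp(F_pair.T, Wt1, Wt2).T      # TMLP: mixes information across time steps
    return F_c, F_t

rng = np.random.default_rng(0)
T, d = 6, 4
F_pair = rng.normal(size=(T, d))
Wc1, Wc2 = rng.normal(size=(d, 8)), rng.normal(size=(8, d))
Wt1, Wt2 = rng.normal(size=(T, 12)), rng.normal(size=(12, T))
F_c, F_t = channel_time_mix(F_pair, Wc1, Wc2, Wt1, Wt2)
```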

[0131] Furthermore, as shown in formula (15), the generated feature vectors are concatenated and then subjected to a nonlinear transformation to generate a gating factor. Then, the representations of the two modal pairs are multiplied by their respective gating vectors to calculate the sentiment correlation vector F_s.

[0132]

[0133] g_a = Sigmoid(W_{ga} [F_{ta}; F_{tv}] + b_{ga});

[0134] g_v = Sigmoid(W_{gv} [F_{ta}; F_{tv}] + b_{gv});

[0135] F_s = g_a · (W_{ta} F_{ta}) + g_v · (W_{tv} F_{tv}) (15);

[0136] Here W_{ga} and W_{gv} are learnable gating matrices; b_{ga} and b_{gv} are the corresponding biases; W_{ta} and W_{tv} are learnable parameters; F_{ta} is the text-audio bimodal vector after processing by the channel and time feedforward layers; F_{tv} is the text-visual bimodal vector after processing by the channel and time feedforward layers; g_a is the audio gating factor; and g_v is the visual gating factor.
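Formula (15) translates almost directly into code. The sketch below treats F_ta and F_tv as d-dimensional vectors and the gating matrices as d × 2d; these shapes, the elementwise reading of "·", and the function name are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(F_ta, F_tv, W_ga, b_ga, W_gv, b_gv, W_ta, W_tv):
    z = np.concatenate([F_ta, F_tv])         # [F_ta; F_tv]
    g_a = sigmoid(W_ga @ z + b_ga)           # audio gating factor
    g_v = sigmoid(W_gv @ z + b_gv)           # visual gating factor
    return g_a * (W_ta @ F_ta) + g_v * (W_tv @ F_tv)   # F_s

d = 3
F_ta = np.array([1.0, 2.0, 3.0])
F_tv = np.array([3.0, 2.0, 1.0])
zeros_w, zeros_b, eye = np.zeros((d, 2 * d)), np.zeros(d), np.eye(d)
F_s = gated_fusion(F_ta, F_tv, zeros_w, zeros_b, zeros_w, zeros_b, eye, eye)
```

With zero gating weights both gates equal 0.5, so this toy call simply averages the two bimodal vectors; trained gates would instead shift weight toward whichever pair is more informative.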

[0137] Furthermore, as shown in Equation (16), the strongly correlated sentiment vectors and the source representations of their respective modalities are fed into independent encoders to generate the weakly correlated sentiment vector F_w for each modality. This further refines and supplements the modality representation. Finally, the strongly correlated and weakly correlated sentiment vectors are combined using a dynamic weighting method to obtain the unimodal sentiment fusion vector F_y.

[0138] F_{wc} = E(F_s, F_c; θ_s), c ∈ {a, t, v};

[0139] F_{yc} = F_s + α F_{wc} (16);

[0140] Here E is the encoder used to generate the weakly related sentiment information, θ_s denotes the encoder parameters, and α is a hyperparameter.
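A compact sketch of formula (16) follows. The encoder architecture is not described in the text, so a single tanh layer over the concatenation of F_s and F_c stands in for E; that choice, the per-modality parameter dictionary, and all names are assumptions.

```python
import numpy as np

def encode(F_s, F_c, W_e, b_e):
    # Stand-in encoder E: one tanh layer over [F_s; F_c] (assumed architecture).
    return np.tanh(W_e @ np.concatenate([F_s, F_c]) + b_e)

def decouple(F_s, reps, params, alpha=0.5):
    """Return F_yc = F_s + alpha * F_wc for each modality c in reps."""
    return {c: F_s + alpha * encode(F_s, reps[c], *params[c]) for c in reps}

rng = np.random.default_rng(0)
d = 4
F_s = rng.normal(size=d)
reps = {c: rng.normal(size=d) for c in ("a", "t", "v")}
params = {c: (rng.normal(size=(d, 2 * d)), np.zeros(d)) for c in reps}
F_y = decouple(F_s, reps, params, alpha=0.5)
```

Setting `alpha` to 0 reduces every F_yc to the shared strongly related vector F_s, which shows that α controls how much modality-specific (weakly related) information is kept.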

[0141] Step 5: Input the decoupled modal representations into the sentiment prediction layer, which consists of a two-part multimodal sentiment prediction neural network. The first part receives the decoupled modal representations of each modality and learns the weights of each modal feature through the neural network to highlight the importance of different modalities. The second part consists of a three-layer neural network that gradually reduces the dimensionality of the modal information and finally outputs the sentiment prediction result.

[0142] Specifically, as shown in formula (17), the three modal feature vectors obtained above are concatenated, the weight matrix att is calculated, and att is then multiplied with the concatenated vector to obtain the three-modal fusion result. The final prediction is generated by passing this result through a linear layer, which outputs sentiment scores whose predicted values are distributed between -3 and 3.

[0143] h = Concat(F_{ya}, F_{yt}, F_{yv});

[0144] att = W2(ReLU(W1 * h));

[0145] h_f = att * h;

[0146]

[0147] Here h is the fused vector obtained by concatenating the three modalities, att is the attention weight matrix, h_f is the weighted fusion vector, W1, W2, and W_{out} are linear layers, and b_{out} is a bias term.
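The prediction layer can be sketched as below. The first three lines of the function follow the equations given for h, att, and h_f; the final step, a single linear map W_out h_f + b_out, is an assumption inferred from the symbol definitions, since that equation is not reproduced and the text also mentions a three-layer network. The function name and toy values are hypothetical.

```python
import numpy as np

def predict_sentiment(F_ya, F_yt, F_yv, W1, W2, W_out, b_out):
    h = np.concatenate([F_ya, F_yt, F_yv])     # h = Concat(F_ya, F_yt, F_yv)
    att = W2 @ np.maximum(W1 @ h, 0.0)         # att = W2(ReLU(W1 * h))
    h_f = att * h                              # elementwise re-weighting
    return float(W_out @ h_f + b_out)          # assumed linear output layer

# Toy example with one-dimensional modality vectors and identity weights.
eye = np.eye(3)
score = predict_sentiment(np.array([1.0]), np.array([2.0]), np.array([3.0]),
                          eye, eye, np.ones(3), -14.0)
```

In this toy call att equals h, so h_f is (1, 4, 9), the linear layer sums it to 14, and the bias brings the score to 0.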

[0148] In a specific embodiment, Table 1 compares the results of the method provided in this embodiment with several representative methods. The comparison was conducted on CMU-MOSEI, a public dataset in the field of multimodal sentiment analysis, and mainly tests the model's 2-class sentiment classification performance, that is, determining whether the sentiment category of a sample is "positive" or "negative". The evaluation metrics are 2-class accuracy, 7-class accuracy, and F1 score. Accuracy is the ratio of the number of correctly classified samples to the total number of samples; the higher the value, the better the prediction. The F1 score is the harmonic mean of precision and recall; the higher the value, the better the prediction. As shown in Table 1, compared with several classic models in the field of multimodal sentiment analysis, this embodiment of the invention achieves better performance in the 2-class sentiment test. Compared with the second-best model, MMML, the 2-class accuracy, 7-class accuracy, and F1 score improve by 1.41% / 4.13%, 11.5%, and 1.25% / 4.01%, respectively.
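For reference, the two binary metrics defined above can be computed as in the following sketch. The function name and the toy labels are illustrative; how continuous sentiment scores are binarized into "positive" and "negative" is not specified here, so the function takes already-binarized labels.

```python
import numpy as np

def accuracy_and_f1(y_true, y_pred):
    """Binary accuracy and F1, with 1 = positive and 0 = negative."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = int(np.sum(y_true & y_pred))
    fp = int(np.sum(~y_true & y_pred))
    fn = int(np.sum(y_true & ~y_pred))
    accuracy = float(np.mean(y_true == y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1

acc, f1 = accuracy_and_f1([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```

Here three of five predictions are correct (accuracy 0.6), and precision and recall are both 2/3, so their harmonic mean, the F1 score, is also 2/3.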

[0149] Table 1

[0150]

[0151]

[0152] The embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from the others; for parts that are the same or similar, the embodiments may be referred to one another. Because the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.

[0153] The above description of the disclosed embodiments enables those skilled in the art to make or use the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the invention is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A multimodal sentiment analysis method for graph structure optimization and representation separation, characterized in that it comprises:
processing raw video data to obtain multimodal representations carrying time-series information;
wherein obtaining the multimodal representations carrying time-series information comprises:
dividing the raw video data into multiple short video segments;
extracting discourse-level representations of the text modality, audio modality, and visual modality from the multiple short video segments;
using quantum long short-term memory networks to analyze the time-series correlation information between the discourse-level representations of the text modality, audio modality, and visual modality and the representations of the other video segments, and fusing the correlation information with the discourse-level representations to obtain the representation of each modality;
performing inter-sample data augmentation on the multimodal representations;
wherein the inter-sample data augmentation of the multimodal representations comprises:
inputting the visual modality representation into a graph convolutional network, calculating the adjacency matrix based on the correlation between nodes, and then optimizing the graph adjacency matrix using constraints to obtain the visual modality representation after inter-sample data augmentation;
inputting the text modality representation and the audio modality representation into a graph attention network, and constructing a relationship network between samples by using similarity calculation to constrain the feature representations between low-correlation nodes, to obtain the text modality representation and the audio modality representation after inter-sample data augmentation;
wherein constructing the relationship network between samples comprises:
obtaining the correlation between samples by calculating the cosine similarity of the features of two samples;
obtaining the sentiment difference between samples by calculating the absolute difference between the true labels of the two samples;
the similarity threshold and the sentiment threshold both being network parameters;
establishing the relationship network between samples by comparing the correlation between samples with the similarity threshold and the sentiment difference between samples with the sentiment threshold;
performing context interaction processing on the multimodal representations after inter-sample data augmentation to obtain context-interaction-enhanced multimodal representations;
wherein obtaining the context-interaction-enhanced multimodal representations comprises:
jointly inputting the text modality representation after inter-sample data augmentation and the audio modality representation after inter-sample data augmentation into a quantum-inspired cross-modal attention network to obtain a context-interaction-enhanced audio modality representation;
jointly inputting the text modality representation after inter-sample data augmentation and the visual modality representation after inter-sample data augmentation into a quantum-inspired cross-modal attention network to obtain a context-interaction-enhanced visual modality representation;
concatenating the audio modality representation after inter-sample data augmentation and the visual modality representation after inter-sample data augmentation, and inputting the result together with the text modality representation after inter-sample data augmentation into a positive and negative correlation attention network to obtain a context-interaction-enhanced text modality representation;
performing sentiment representation decoupling on the context-interaction-enhanced multimodal representations to obtain multimodal representations containing sentiment information; and
performing sentiment prediction based on the multimodal representations containing sentiment information to obtain the final sentiment prediction result.

2. The multimodal sentiment analysis method for graph structure optimization and representation separation according to claim 1, characterized in that the multimodal representations include a text modality representation, an audio modality representation, and a visual modality representation.

3. The multimodal sentiment analysis method for graph structure optimization and representation separation according to claim 1, characterized in that obtaining the multimodal representations containing sentiment information comprises:
jointly inputting the context-interaction-enhanced text modality representation and the context-interaction-enhanced audio modality representation into an emotion feature fusion network to obtain text-audio modality information;
jointly inputting the context-interaction-enhanced text modality representation and the context-interaction-enhanced visual modality representation into an emotion feature fusion network to obtain text-visual modality information;
merging the text-audio modality information and the text-visual modality information to obtain strongly related emotion information;
inputting the context-interaction-enhanced text modality representation, the context-interaction-enhanced audio modality representation, and the context-interaction-enhanced visual modality representation, each together with the strongly related emotion information, into independent encoders to obtain text weakly related emotion information, audio weakly related emotion information, and visual weakly related emotion information; and
performing a weighted sum of the strongly related emotion information with the text, audio, and visual weakly related emotion information, respectively, to obtain the modality representations containing sentiment information.

4. The multimodal sentiment analysis method for graph structure optimization and representation separation according to claim 1, characterized in that obtaining the final sentiment prediction result comprises:
concatenating the text modality representation containing sentiment information, the audio modality representation containing sentiment information, and the visual modality representation containing sentiment information to form a fusion vector;
inputting the fusion vector into the first part of the prediction network, and learning the weights of each modality feature through the neural network to obtain a weighted fusion vector; and
inputting the weighted fusion vector into the second part of the prediction network to gradually reduce the dimensionality of the modality information and finally obtain the sentiment prediction result for the video data.

5. A multimodal sentiment analysis system for graph structure optimization and representation separation, characterized in that it comprises:
a preprocessing module, used to process raw video data to obtain multimodal representations carrying time-series information;
wherein obtaining the multimodal representations carrying time-series information comprises:
dividing the raw video data into multiple short video segments;
extracting discourse-level representations of the text modality, audio modality, and visual modality from the multiple short video segments;
using quantum long short-term memory networks to analyze the time-series correlation information between the discourse-level representations of the text modality, audio modality, and visual modality and the representations of the other video segments, and fusing the correlation information with the discourse-level representations to obtain the representation of each modality;
an inter-sample data augmentation module, used to perform inter-sample data augmentation on the multimodal representations;
wherein the inter-sample data augmentation of the multimodal representations comprises:
inputting the visual modality representation into a graph convolutional network, calculating the adjacency matrix based on the correlation between nodes, and then optimizing the graph adjacency matrix using constraints to obtain the visual modality representation after inter-sample data augmentation;
inputting the text modality representation and the audio modality representation into a graph attention network, and constructing a relationship network between samples by using similarity calculation to constrain the feature representations between low-correlation nodes, to obtain the text modality representation and the audio modality representation after inter-sample data augmentation;
wherein constructing the relationship network between samples comprises:
obtaining the correlation between samples by calculating the cosine similarity of the features of two samples;
obtaining the sentiment difference between samples by calculating the absolute difference between the true labels of the two samples;
the similarity threshold and the sentiment threshold both being network parameters;
establishing the relationship network between samples by comparing the correlation between samples with the similarity threshold and the sentiment difference between samples with the sentiment threshold;
a context interaction processing module, used to perform context interaction processing on the multimodal representations after inter-sample data augmentation to obtain context-interaction-enhanced multimodal representations;
wherein obtaining the context-interaction-enhanced multimodal representations comprises:
jointly inputting the text modality representation after inter-sample data augmentation and the audio modality representation after inter-sample data augmentation into a quantum-inspired cross-modal attention network to obtain a context-interaction-enhanced audio modality representation;
jointly inputting the text modality representation after inter-sample data augmentation and the visual modality representation after inter-sample data augmentation into a quantum-inspired cross-modal attention network to obtain a context-interaction-enhanced visual modality representation;
concatenating the audio modality representation after inter-sample data augmentation and the visual modality representation after inter-sample data augmentation, and inputting the result together with the text modality representation after inter-sample data augmentation into a positive and negative correlation attention network to obtain a context-interaction-enhanced text modality representation;
a sentiment representation decoupling module, used to perform sentiment representation decoupling on the context-interaction-enhanced multimodal representations to obtain multimodal representations containing sentiment information; and
a sentiment prediction module, used to perform sentiment prediction based on the multimodal representations containing sentiment information to obtain the final sentiment prediction result.