A consistent and inconsistent multi-modal sentiment analysis method based on contrastive learning

By decomposing multimodal features into shared and specific knowledge representations through a contrastive learning approach and using a text-based attention mechanism for fine-grained modeling, the prediction bias caused by modality conflict in multimodal sentiment analysis is resolved, thereby improving the accuracy and stability of sentiment analysis.

CN122241566A · Pending · Publication Date: 2026-06-19 · NORTHWEST UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
NORTHWEST UNIV
Filing Date
2026-03-09
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing multimodal sentiment analysis methods are prone to misjudging sentiment polarity when faced with conflicting multimodal signals, leading to prediction bias. This is especially true when there are inconsistencies in the data collection process or differences in feature extraction methods, as the dominant modality inhibits the subordinate modality, resulting in increased ambiguity.

Method used

We employ a contrastive learning-based approach to decompose multimodal features into shared knowledge representations and specific knowledge representations. We also construct a dynamic interaction channel between text, speech, and visual modalities through a text prompt attention mechanism to perform fine-grained modeling and extraction of consistent and inconsistent information.

Benefits of technology

It effectively avoids the loss of modality-specific information during the fusion process, improves the accuracy and stability of multimodal sentiment analysis, and enables stable and effective sentiment feature modeling in complex multimodal sentiment scenarios.



Abstract

This application relates to a consistent and inconsistent multimodal sentiment analysis method based on contrastive learning, comprising: constructing a sentiment prediction model, the sentiment prediction model including a feature extraction module, a feature decomposition module, a text prompt attention module, and a multilayer perceptron; training the sentiment prediction model on a training dataset to obtain a trained sentiment prediction model; and inputting the multimodal data to be predicted into the trained sentiment prediction model to obtain multimodal sentiment prediction results. Under a unified contrastive loss constraint, this application decomposes multimodal representations into shared knowledge representations and specific knowledge representations, thereby avoiding the loss of modality-specific information during fusion. In addition, a text prompt attention mechanism is designed to construct a dynamic interaction channel between the text, speech, and visual modalities, enabling fine-grained modeling and extraction of consistent and inconsistent information across modalities and further improving the accuracy and stability of multimodal sentiment analysis.

Description

Technical Field

[0001] This application relates to the field of multimodal sentiment analysis, and more specifically, to a consistent and inconsistent multimodal sentiment analysis method based on contrastive learning.

Background Technology

[0002] Multimodal sentiment analysis simulates the comprehensive processing of multisensory information by humans during emotional expression. By jointly modeling heterogeneous modal information such as text, vision, and speech, it leverages the correlation and complementarity between different modalities to improve the accuracy of sentiment detection and analysis. Compared to text-based sentiment analysis methods, incorporating auxiliary modalities such as vision and speech can acquire more comprehensive emotional features, thereby enhancing the model's stability and robustness in complex application scenarios. This makes it suitable for applications such as human-computer interaction, public opinion monitoring, and intelligent recommendation.

[0003] Existing multimodal sentiment analysis methods mainly fall into two categories: one focuses on multimodal representation construction, and the other focuses on multimodal fusion. The former emphasizes refined modeling of features of each modality to improve the accuracy of sentiment polarity prediction; the latter, by introducing interactive learning mechanisms, characterizes the semantic relationships between different modalities and enhances single-modal representations through cross-modal information transfer and knowledge transfer, thereby improving the overall sentiment analysis performance.

[0004] While these methods improve overall model performance by integrating complementary cross-modal information, they can still misjudge sentiment polarity when faced with conflicting multimodal signals. Such conflicts may stem from inconsistencies in data acquisition or differences in feature extraction methods, leading to divergent sentiment predictions across modalities. In existing joint learning frameworks, the dominant modality (e.g., text) often suppresses the subordinate modalities (e.g., speech or vision), amplifying the ambiguity and leading to biased predictions.

Summary of the Invention

[0005] To overcome at least one deficiency in the prior art, this application provides a consistent and inconsistent multimodal sentiment analysis method based on contrastive learning.

[0006] In a first aspect, a consistent and inconsistent multimodal sentiment analysis method based on contrastive learning is provided, including:

Constructing a sentiment prediction model, the sentiment prediction model including a feature extraction module, a feature decomposition module, a text prompt attention module, and a multilayer perceptron;

Training the sentiment prediction model on a training dataset to obtain a trained sentiment prediction model, where the samples in the training dataset are multimodal data comprising text modal data, visual modal data, and speech modal data;

During training, the feature extraction module extracts features from the input multimodal data to obtain the features of each modality; the feature decomposition module decomposes the features of each modality to obtain a shared knowledge representation and a specific knowledge representation for each modality; the text prompt attention module performs consistency fusion on the shared knowledge representations of the modalities to obtain a consistent fused representation, and performs inconsistency fusion on the specific knowledge representations of the modalities to obtain an inconsistent fused representation; and the multilayer perceptron performs multimodal sentiment prediction based on the consistent fused representation, the inconsistent fused representation, and the shared and specific knowledge representations of the text modality, to obtain the multimodal sentiment prediction result;

During training, the contrastive loss is calculated from the shared and specific knowledge representations of each modality; the multimodal prediction loss is calculated from the multimodal sentiment prediction results and the multimodal ground truth labels; the shared and specific knowledge representations of each modality are input into a multilayer perceptron to obtain unimodal sentiment prediction results for each modality, and the unimodal prediction loss is calculated from the unimodal sentiment prediction results and the unimodal ground truth labels; and the total loss is obtained as a weighted sum of the contrastive loss, the multimodal prediction loss, and the unimodal prediction loss;

Inputting the multimodal data to be predicted into the trained sentiment prediction model to obtain the multimodal sentiment prediction results.

[0007] In one embodiment, the feature extraction module includes a pre-trained BERT model, a visual feature encoder, and a speech feature encoder. The pre-trained BERT model extracts features from the text modal data to obtain text modal features; the visual feature encoder extracts features from the visual modal data to obtain visual modal features; and the speech feature encoder extracts features from the speech modal data to obtain speech modal features.

[0008] In one embodiment, the feature decomposition module includes feature projectors corresponding to each modality, which decompose the features of each modality into a shared knowledge representation and a specific knowledge representation. Each feature projector consists of a layer normalization layer, a linear transformation layer with a Tanh activation function, and a Dropout layer connected in sequence.

[0009] In one embodiment, the consistency fusion performed by the text prompt attention module on the shared knowledge representations of the modalities includes: inputting the shared knowledge representation of the text modality into a text encoder for feature modeling to obtain a fine-grained shared knowledge text representation; using the fine-grained shared knowledge text representation as the query vector, computing attention against the shared knowledge representation of the speech modality and the shared knowledge representation of the visual modality, respectively, to obtain a first similarity weight matrix and a second similarity weight matrix; and, based on the first and second similarity weight matrices, performing weighted fusion of the shared knowledge representations of the speech modality and the visual modality to obtain the consistent fused representation.

[0010] In one embodiment, the inconsistency fusion performed by the text prompt attention module on the specific knowledge representations of the modalities includes: inputting the specific knowledge representation of the text modality into the text encoder for feature modeling to obtain a fine-grained specific knowledge text representation; using the fine-grained specific knowledge text representation as the query vector, computing attention against the specific knowledge representation of the speech modality and the specific knowledge representation of the visual modality, respectively, to obtain a third similarity weight matrix and a fourth similarity weight matrix; and, based on the third and fourth similarity weight matrices, performing weighted fusion of the specific knowledge representations of the speech modality and the visual modality to obtain the inconsistent fused representation.

[0011] In one embodiment, calculating the contrastive loss based on the shared knowledge representation and the specific knowledge representation of each modality includes: for the set of samples in a training batch, taking any sample as the current sample and calculating the cosine similarity between the current sample and every other sample; sorting the other samples by cosine similarity in descending order, forming a high-similarity sample set from a top proportion of the sorted samples and a low-similarity sample set from a bottom proportion of the sorted samples; selecting the samples in the high-similarity sample set that have the same label as the current sample to form a similar sample set SS, with the remaining samples in the high-similarity sample set forming an easily confused sample set SD; selecting the samples in the low-similarity sample set whose labels differ from that of the current sample to form a dissimilar sample set DD; and selecting a set number of samples from the similar sample set SS to form a positive sample set, and a set number of samples from the easily confused sample set SD and the dissimilar sample set DD to form a negative sample set. Within the same sample, in-sample positive pairs and in-sample negative pairs are constructed using the following formula:

[0012]

[0013] The symbols in this formula denote, respectively: the in-sample positive pairs and the in-sample negative pairs of the current sample; the shared knowledge representations of the text, visual, and speech modalities of the current sample; the shared knowledge representations of the text, visual, and speech modalities of another sample; the positive sample set and the negative sample set of the current sample; the specific knowledge representations of the text, visual, and speech modalities of the current sample; and the specific knowledge representations of the text, visual, and speech modalities of the other sample. Positive sample pairs and negative sample pairs between different samples are constructed using the following formula:

[0014]

[0015] The symbols in this formula denote, respectively: the inter-sample positive pairs and the inter-sample negative pairs of the current sample; and the shared knowledge representations of the text, visual, and speech modalities of a sample. The set of positive sample pairs and the set of negative sample pairs are then constructed:

[0016]

[0017] The two sets in this formula are the set of positive sample pairs and the set of negative sample pairs of the current sample. The contrastive loss is:

[0018] The symbols in this formula denote, respectively: the contrastive loss of the current sample; the cosine similarity function; the temperature parameter; an element of the set of positive sample pairs, together with the two values in that element; and an element of the set of negative sample pairs, together with the two values in that element.

[0019] In one embodiment, the multimodal prediction loss is:

[0020] The symbols in this formula denote, respectively: the multimodal prediction loss; the multimodal ground truth label of a sample; the multimodal sentiment prediction result for that sample; and the number of samples in a training batch.

[0021] In one embodiment, the unimodal prediction loss is:

[0022] The symbols in this formula denote, respectively: the unimodal prediction loss; the unimodal ground truth label of a sample; the unimodal sentiment prediction result for that sample; the number of samples in a training batch; and the square of the L2 norm.

[0023] In a second aspect, a consistent and inconsistent multimodal sentiment analysis apparatus based on contrastive learning is provided, comprising:

A model building module, used to construct the sentiment prediction model, the sentiment prediction model including a feature extraction module, a feature decomposition module, a text prompt attention module, and a multilayer perceptron;

A training module, used to train the sentiment prediction model on a training dataset to obtain a trained sentiment prediction model, where the samples in the training dataset are multimodal data comprising text modal data, visual modal data, and speech modal data;

During training, the feature extraction module extracts features from the input multimodal data to obtain the features of each modality; the feature decomposition module decomposes the features of each modality to obtain a shared knowledge representation and a specific knowledge representation for each modality; the text prompt attention module performs consistency fusion on the shared knowledge representations of the modalities to obtain a consistent fused representation, and performs inconsistency fusion on the specific knowledge representations of the modalities to obtain an inconsistent fused representation; and the multilayer perceptron performs multimodal sentiment prediction based on the consistent fused representation, the inconsistent fused representation, and the shared and specific knowledge representations of the text modality, to obtain the multimodal sentiment prediction result;

During training, the contrastive loss is calculated from the shared and specific knowledge representations of each modality; the multimodal prediction loss is calculated from the multimodal sentiment prediction results and the multimodal ground truth labels; the shared and specific knowledge representations of each modality are input into a multilayer perceptron to obtain unimodal sentiment prediction results for each modality, and the unimodal prediction loss is calculated from the unimodal sentiment prediction results and the unimodal ground truth labels; and the total loss is obtained as a weighted sum of the contrastive loss, the multimodal prediction loss, and the unimodal prediction loss;

A prediction module, used to input the multimodal data to be predicted into the trained sentiment prediction model to obtain the multimodal sentiment prediction results.

[0024] Compared to existing technologies, this application offers the following advantages: the consistent and inconsistent multimodal sentiment analysis method based on contrastive learning decomposes multimodal representations into shared knowledge representations and specific knowledge representations under a unified contrastive loss constraint, preventing the loss of modality-specific information during fusion. In addition, a text prompt attention mechanism is designed to construct a dynamic interaction channel between the text, speech, and visual modalities, enabling fine-grained modeling and extraction of consistent and inconsistent information across modalities and further improving the accuracy and stability of multimodal sentiment analysis.

Attached Figure Description

[0025] This application can be better understood by referring to the description given below in conjunction with the accompanying drawings, which, together with the detailed description below, are incorporated in and form part of this specification. In the drawings: Figure 1 shows a flowchart of the consistent and inconsistent multimodal sentiment analysis method based on contrastive learning.

Detailed Implementation

[0026] Exemplary embodiments of the present application will be described below with reference to the accompanying drawings. For clarity and brevity, not all features of the actual embodiments are described in the specification. However, it should be understood that many embodiment-specific decisions can be made in the development of any such actual embodiment to achieve the developer’s specific objectives, and these decisions may vary as the embodiments differ.

[0027] It should also be noted that, in order to avoid obscuring this application with unnecessary details, only the device structure closely related to the solution of this application is shown in the accompanying drawings, while other details that are not closely related to this application are omitted.

[0028] It should be understood that this application is not limited to the described embodiments by virtue of the following description with reference to the accompanying drawings. In this document, embodiments may be combined with each other, features may be substituted or borrowed between different embodiments, and one or more features may be omitted in one embodiment, where feasible.

[0029] In practical applications of multimodal sentiment analysis, consistent and inconsistent sentiment information often coexist across the text, speech, and visual modalities. For example, in some samples the textual sentiment expression is clear while the speech or visual cues are weak; in other samples, different modalities may express inconsistent sentiment. In such cases, simply fusing multimodal features can easily weaken or lose modality-specific information during fusion, reducing the accuracy of the sentiment analysis results.

[0030] This application introduces a contrastive representation reconstruction mechanism that decomposes multimodal features into shared knowledge representations and specific knowledge representations under a unified contrastive learning constraint. This allows the model to retain its ability to model cross-modal consistency while preserving the distinguishing information of each modality, which avoids information confusion and representation degradation. In addition, a text prompt attention mechanism guides the interactions between the features of different modalities, allowing the model to perform fine-grained modeling of consistent and inconsistent information around text semantics. The method in this application can therefore achieve stable and effective sentiment feature modeling in complex multimodal sentiment scenarios and improve the overall performance of multimodal sentiment analysis tasks.

[0031] This application provides a consistent and inconsistent multimodal sentiment analysis method based on contrastive learning. Figure 1 shows a flowchart of the method. Referring to Figure 1, the method mainly includes the following steps:

Step S1: Construct a sentiment prediction model. The sentiment prediction model includes a feature extraction module, a feature decomposition module, a text prompt attention module, and a multilayer perceptron.

[0032] Step S2: Train the sentiment prediction model based on the training dataset to obtain the trained sentiment prediction model; the samples in the training dataset are multimodal data. The multimodal data includes text modal data, visual modal data, and speech modal data.

[0033] During training, the feature extraction module extracts features from the input multimodal data to obtain the features of each modality; the feature decomposition module decomposes the features of each modality to obtain a shared knowledge representation and a specific knowledge representation for each modality; the text prompt attention module performs consistency fusion on the shared knowledge representations of the modalities to obtain a consistent fused representation, and performs inconsistency fusion on the specific knowledge representations of the modalities to obtain an inconsistent fused representation; and the multilayer perceptron performs multimodal sentiment prediction based on the consistent fused representation, the inconsistent fused representation, and the shared and specific knowledge representations of the text modality, to obtain the multimodal sentiment prediction result. During training, the contrastive loss is calculated from the shared and specific knowledge representations of each modality; the multimodal prediction loss is calculated from the multimodal sentiment prediction results and the multimodal ground truth labels; the shared and specific knowledge representations of each modality are input into a multilayer perceptron to obtain unimodal sentiment prediction results for each modality, and the unimodal prediction loss is calculated from the unimodal sentiment prediction results and the unimodal ground truth labels; and the total loss is obtained as a weighted sum of the contrastive loss, the multimodal prediction loss, and the unimodal prediction loss.

Step S3: Input the multimodal data to be predicted into the trained sentiment prediction model to obtain the multimodal sentiment prediction results.

[0034] In this embodiment, under a unified contrastive loss constraint, the multimodal representation is decomposed into shared knowledge representations and specific knowledge representations, thereby avoiding the loss of modality-specific information during fusion. In addition, a text prompt attention mechanism is designed to construct a dynamic interaction channel between the text, speech, and visual modalities. This mechanism enables fine-grained modeling and extraction of consistent and inconsistent information across modalities, further improving the accuracy and stability of multimodal sentiment analysis.

[0035] In one embodiment, the feature extraction module includes a pre-trained BERT model, a visual feature encoder, and a speech feature encoder, where each encoder is based on a Transformer structure.

[0036] The pre-trained BERT model extracts features from the text modal data to obtain text modal features; the visual feature encoder extracts features from the visual modal data to obtain visual modal features; and the speech feature encoder extracts features from the speech modal data to obtain speech modal features.

[0037] In one embodiment, the feature decomposition module includes feature projectors corresponding to each modality, which decompose the features of each modality into a shared knowledge representation and a specific knowledge representation, each indexed by modality and by sample index.

[0038] The feature projector consists of a layer normalization layer, a linear transformation layer with a Tanh activation function, and a Dropout layer connected in sequence.
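The projector described above (layer normalization, then a linear layer with Tanh, then Dropout) can be illustrated with a minimal standard-library sketch. Several details here are assumptions rather than part of the application: the use of two separate projectors per modality (one producing the shared representation, one the specific representation), the random weight initialization, the dropout rate, and the omission of the learnable scale and shift in layer normalization. The names `make_projector` and `decompose` are hypothetical.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance (no learnable scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear_tanh(x, weight, bias):
    """Linear transformation followed by a Tanh activation."""
    return [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weight, bias)]


def dropout(x, p, training, rng):
    """Inverted dropout: active only in training mode, identity otherwise."""
    if not training or p == 0.0:
        return list(x)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in x]


def make_projector(in_dim, out_dim, rng):
    """Build one projector: LayerNorm -> Linear + Tanh -> Dropout (weights are random placeholders)."""
    weight = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim

    def project(x, p=0.1, training=False):
        return dropout(linear_tanh(layer_norm(x), weight, bias), p, training, rng)

    return project


def decompose(features, shared_proj, specific_proj):
    """Assumed design: one projector per output, giving (shared, specific) representations."""
    return shared_proj(features), specific_proj(features)
```

In a real model the weights would be trained jointly with the rest of the network, and one such pair of projectors would exist for each of the text, visual, and speech modalities.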

[0039] In one embodiment, to fully utilize the complementary information between multimodal shared knowledge and modality-specific knowledge, text modality is used as the guiding core to perform consistency fusion and non-consistency fusion processing on the shared knowledge representation and the specific knowledge representation, respectively.

[0040] The text prompt attention module performs consistency fusion on the shared knowledge representations of the modalities as follows: the shared knowledge representation of the text modality is input into a text encoder for feature modeling to obtain a fine-grained shared knowledge text representation, where the text encoder is a Transformer encoder. Using the fine-grained shared knowledge text representation as the query vector, attention is computed against the shared knowledge representation of the speech modality and the shared knowledge representation of the visual modality, respectively, to obtain the first similarity weight matrix and the second similarity weight matrix, using the following formula:

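[0041] In this formula, the weight terms are learnable parameters, and the remaining symbols denote the dimension of each attention head and the transpose operation.

[0042] The remaining weight terms in the formula are likewise learnable parameters.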

[0043] Based on the first similarity weight matrix and the second similarity weight matrix, the shared knowledge representation of the speech modality and the shared knowledge representation of the visual modality are combined by weighted fusion to obtain the consistent fused representation, using the following formula:

[0044] The weight terms in this formula are learnable parameters.
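The text-guided attention described above can be sketched with scaled dot-product attention, where the text representation supplies the queries and the speech or visual representation supplies the keys and values. This is only an illustration: the learnable projection matrices are omitted (treated as identity), and combining the two attended outputs by element-wise addition is an assumption, since the fusion formula itself is not reproduced in this text. The function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Similarity weight matrix: softmax(Q K^T / sqrt(d)), one row per query position."""
    d = len(query[0])
    return [
        softmax([sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d) for k_row in keys])
        for q_row in query
    ]


def apply_weights(weights, values):
    """Weighted sum of value rows for every query position."""
    dim = len(values[0])
    return [[sum(w * v[c] for w, v in zip(w_row, values)) for c in range(dim)]
            for w_row in weights]


def text_prompt_fusion(text_query, speech, visual):
    """Attend from text to speech and to vision, then fuse (assumed: element-wise sum)."""
    w_speech = attention_weights(text_query, speech)
    w_visual = attention_weights(text_query, visual)
    att_speech = apply_weights(w_speech, speech)
    att_visual = apply_weights(w_visual, visual)
    fused = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(att_speech, att_visual)]
    return fused, w_speech, w_visual
```

Under these assumptions, the same routine would serve both branches: called with shared knowledge representations it yields the consistent fused representation, and called with specific knowledge representations it yields the inconsistent one.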

[0045] In one embodiment, the text prompt attention module performs inconsistency fusion on the specific knowledge representations of the modalities as follows: the specific knowledge representation of the text modality is input into the text encoder for feature modeling to obtain a fine-grained specific knowledge text representation, where the text encoder is a Transformer encoder.

[0046] Using the fine-grained specific knowledge text representation as the query vector, attention is computed against the specific knowledge representation of the speech modality and the specific knowledge representation of the visual modality, respectively, to obtain the third similarity weight matrix and the fourth similarity weight matrix, using the following formula:

[0047] The weight terms in this formula are learnable parameters.

[0048]

[0049] Based on the third similarity weight matrix and the fourth similarity weight matrix, the specific knowledge representation of the speech modality and the specific knowledge representation of the visual modality are combined by weighted fusion to obtain the inconsistent fused representation, using the following formula:

[0050] The weight terms in this formula are learnable parameters.

[0051] In one embodiment, after the consistent fused representation and the inconsistent fused representation are obtained, the shared knowledge representation and the specific knowledge representation of the text modality are passed through the text encoder to obtain the fine-grained shared knowledge text representation and the fine-grained specific knowledge text representation. These four representations are concatenated and input into the multilayer perceptron to obtain the multimodal sentiment prediction result.
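The concatenate-then-predict step can be sketched as follows. The sketch assumes each of the four representations has already been pooled to a fixed-length vector, and uses a two-layer perceptron with a ReLU hidden layer and a scalar output; the layer count, the activation, and the names `concat` and `mlp` are illustrative assumptions, not details taken from the application.

```python
def concat(*vectors):
    """Concatenate any number of feature vectors into one flat vector."""
    out = []
    for v in vectors:
        out.extend(v)
    return out


def mlp(x, w1, b1, w2, b2):
    """Two-layer perceptron: ReLU hidden layer, then a linear layer producing one score."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


# Toy usage: four one-dimensional representations and hand-picked weights.
features = concat([1.0], [2.0], [3.0], [4.0])
score = mlp(features,
            w1=[[1.0, 0.0, 0.0, 0.0], [0.0, -1.0, 0.0, 0.0]], b1=[0.0, 0.0],
            w2=[2.0, 5.0], b2=0.5)
```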

[0052] In one embodiment, the contrastive loss is calculated based on the shared knowledge representation and the specific knowledge representation of each modality, including: For a set of samples in a training batch, for any sample, denoted as the current sample, calculate the cosine similarity between the current sample and other samples using the following formula:

[0053] In this formula, the cosine similarity between the current sample and another sample is computed between two concatenated vectors: the concatenation of the text modal features, visual modal features, and speech modal features of the current sample, and the concatenation of the text modal features, visual modal features, and speech modal features of the other sample.

[0054] The other samples are sorted by cosine similarity in descending order. A high-similarity sample set is formed from a top proportion of the sorted samples, and a low-similarity sample set is formed from a bottom proportion of the sorted samples. The samples in the high-similarity sample set that have the same label as the current sample form a similar sample set SS, and the remaining samples in the high-similarity sample set form an easily confused sample set SD. The samples in the low-similarity sample set whose labels differ from that of the current sample form a dissimilar sample set DD. A set number of samples is selected from the similar sample set SS to form a positive sample set, and a set number of samples is selected from the easily confused sample set SD and the dissimilar sample set DD to form a negative sample set. The specific allocation ratio is not limited.
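The partitioning procedure above can be sketched in plain Python. The sketch assumes discrete labels, one concatenated feature vector per sample, a single `ratio` used for both the top and bottom proportions, and a simple "take the first n" rule for picking positives and negatives; all of these, along with the function names, are illustrative choices, since the application leaves the proportions and allocation open.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def partition_batch(features, labels, anchor, ratio=0.25):
    """Split the other samples in a batch into the SS, SD, and DD index sets for `anchor`."""
    others = [j for j in range(len(features)) if j != anchor]
    # Sort from most to least similar to the current sample.
    ranked = sorted(others, key=lambda j: cosine(features[anchor], features[j]), reverse=True)
    k = max(1, int(len(ranked) * ratio))
    high, low = ranked[:k], ranked[-k:]
    ss = [j for j in high if labels[j] == labels[anchor]]   # similar, same label
    sd = [j for j in high if labels[j] != labels[anchor]]   # similar, different label
    dd = [j for j in low if labels[j] != labels[anchor]]    # dissimilar, different label
    return ss, sd, dd


def pick_pos_neg(ss, sd, dd, n_pos=1, n_neg=2):
    """Choose positives from SS and negatives from SD and DD (assumed: first n of each pool)."""
    return ss[:n_pos], (sd + dd)[:n_neg]
```

With five samples and `ratio=0.5`, for instance, the two nearest neighbours of the current sample are split into SS and SD by label, and the two farthest contribute to DD only if their labels differ from the current sample's.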

[0055] Within the same sample, construct in-sample positive and in-sample negative pairs using the following formula:

[0056]

[0057] The symbols in this formula denote, respectively: the in-sample positive pairs and the in-sample negative pairs of the current sample; the shared knowledge representations of the text, visual, and speech modalities of the current sample; the shared knowledge representations of the text, visual, and speech modalities of another sample; the positive sample set and the negative sample set of the current sample; the specific knowledge representations of the text, visual, and speech modalities of the current sample; and the specific knowledge representations of the text, visual, and speech modalities of the other sample. Positive sample pairs and negative sample pairs between different samples are constructed using the following formula:

[0058]

[0059] The symbols in this formula denote, respectively: the inter-sample positive pairs and the inter-sample negative pairs of the current sample; and the shared knowledge representations of the text, visual, and speech modalities of a sample. The set of positive sample pairs and the set of negative sample pairs are then constructed:

[0060]

[0061] where the two symbols denote, respectively, the set of positive sample pairs and the set of negative sample pairs of the current sample. The contrastive loss is:

[0062] where the symbols denote, respectively: the contrastive loss of the current sample; the cosine similarity function; the temperature parameter; an element of the set of positive sample pairs, together with the two values in that element; and an element of the set of negative sample pairs, together with the two values in that element.

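[0063] For example, if an element of a pair set is a pair of two representations, the two values of that element are the first and the second representation of the pair.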
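As an illustration of how a loss of this kind can be computed from the two pair sets, the following Python sketch uses a common InfoNCE-style form: each positive pair is scored against the sum over all negative pairs, with cosine similarity divided by a temperature. The exact formula of this application is not reproduced here, so the form of the loss, the function name `contrastive_loss`, and the default `tau` are assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def contrastive_loss(pos_pairs, neg_pairs, tau=0.1):
    """InfoNCE-style loss for one sample (assumed form).

    Each element of `pos_pairs` / `neg_pairs` is a pair of two vectors,
    i.e. the "two values in the element".
    """
    if not pos_pairs:
        raise ValueError("at least one positive pair is required")
    # Similarity mass contributed by all negative pairs.
    neg_sum = sum(math.exp(cosine(a, b) / tau) for a, b in neg_pairs)
    total = 0.0
    for a, b in pos_pairs:
        pos = math.exp(cosine(a, b) / tau)
        total += -math.log(pos / (pos + neg_sum))
    return total / len(pos_pairs)


aligned = ([1.0, 0.0], [1.0, 0.0])
orthogonal = ([1.0, 0.0], [0.0, 1.0])
good = contrastive_loss([aligned], [orthogonal], tau=0.5)
bad = contrastive_loss([orthogonal], [aligned], tau=0.5)
```

The loss is small when positive pairs are more similar than negative pairs (`good`) and large in the opposite case (`bad`), which is the behaviour a contrastive constraint on shared and specific representations relies on.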

[0064] In one embodiment, the multimodal prediction loss is:

[0065] where the symbols denote, respectively: the multimodal prediction loss; the multimodal ground-truth label of a sample; the multimodal sentiment prediction result for that sample; and the number of samples in a training batch.

[0066] In one embodiment, to further constrain the semantic consistency between the shared knowledge representation and the specific knowledge representation, the shared knowledge representations and specific knowledge representations of the text, visual, and speech modalities are input into a weight-shared multilayer perceptron to obtain the corresponding unimodal sentiment prediction results:

[0067] The unimodal prediction loss is:

[0068] where the symbols denote, respectively: the unimodal prediction loss; the unimodal ground-truth label of a sample; the unimodal sentiment prediction result for that sample; the number of samples in a training batch; and the square of the L2 norm.
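A loss built from these quantities can be sketched as the batch average of the squared L2 distance between label and prediction. The averaging over the batch and the function name `unimodal_loss` are assumptions made for the example; how the terms of the different modalities are aggregated is not shown here.

```python
def unimodal_loss(labels, preds):
    """Mean over the batch of the squared L2 norm of (label - prediction).

    `labels` and `preds` are lists of equal-length vectors, one per sample.
    """
    n = len(labels)
    squared_norms = (
        sum((y - p) ** 2 for y, p in zip(label_vec, pred_vec))
        for label_vec, pred_vec in zip(labels, preds)
    )
    return sum(squared_norms) / n


loss = unimodal_loss([[1.0], [-2.0]], [[0.5], [-1.0]])
```

Here the two per-sample squared errors are 0.25 and 1.0, so the batch loss is their mean.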

[0069] Specifically, the total loss is:

[0070] where the two coefficients are hyperparameters representing the weights of the different loss terms.
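A weighted combination with two hyperparameters can be sketched as below. Which loss terms carry the two weights is not recoverable from this text, so the assignment used here (unweighted multimodal prediction loss, weighted contrastive and unimodal losses) and the names `alpha` and `beta` are assumptions made for the example.

```python
def total_loss(l_multi, l_contrast, l_uni, alpha=0.1, beta=0.1):
    """Weighted sum of the three loss terms (assumed weighting scheme)."""
    return l_multi + alpha * l_contrast + beta * l_uni


combined = total_loss(1.0, 2.0, 4.0, alpha=0.5, beta=0.25)
```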

[0071] Based on the same inventive concept as the consistent and inconsistent multimodal sentiment analysis method based on contrastive learning, this embodiment also provides a corresponding consistent and inconsistent multimodal sentiment analysis apparatus based on contrastive learning, comprising: a model building module, used to construct the sentiment prediction model, the sentiment prediction model including a feature extraction module, a feature decomposition module, a text prompting attention module, and a multilayer perceptron; a training module, used to train the sentiment prediction model based on the training dataset to obtain the trained sentiment prediction model, where the samples in the training dataset are multimodal data, and the multimodal data includes text modal data, visual modal data, and speech modal data.

[0072] During training, the feature extraction module extracts features from the input multimodal data to obtain the features of each modality; the feature decomposition module decomposes the features of each modality to obtain the shared knowledge representation and the specific knowledge representation of each modality; the text prompting attention module performs consistency fusion on the shared knowledge representations of the modalities to obtain a consistent fusion representation, and performs inconsistency fusion on the specific knowledge representations of the modalities to obtain an inconsistency fusion representation; the multilayer perceptron performs multimodal sentiment prediction based on the consistent fusion representation, the inconsistency fusion representation, and the shared and specific knowledge representations of the text modality, and obtains the multimodal sentiment prediction result. During training, the contrastive loss is calculated based on the shared and specific knowledge representations of each modality; the multimodal prediction loss is calculated based on the multimodal sentiment prediction results and the multimodal ground-truth labels; the shared and specific knowledge representations of each modality are input into the multilayer perceptron to obtain the unimodal sentiment prediction results for each modality, and the unimodal prediction loss is calculated based on the unimodal sentiment prediction results and the unimodal ground-truth labels; the total loss is obtained as a weighted combination of the contrastive loss, the multimodal prediction loss, and the unimodal prediction loss. The prediction module is used to input the multimodal data to be predicted into the trained sentiment prediction model to obtain the multimodal sentiment prediction result.
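The consistency fusion step uses the text representation as the query and attends over the speech and visual shared knowledge representations. A minimal Python sketch of that idea follows. It is illustrative only: scaled dot-product scoring, a single query vector, reusing the keys as values, summing the two attended outputs, and the function names `softmax`, `attend`, and `text_prompted_fusion` are all assumptions, and the text encoder that produces the fine-grained query is omitted.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def attend(query, keys):
    """Attention of one query vector over a sequence; keys double as values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    out = [sum(w * key[j] for w, key in zip(weights, keys)) for j in range(d)]
    return out, weights


def text_prompted_fusion(text_q, speech, visual):
    """Fuse speech and visual sequences using the text vector as the query."""
    a_out, a_w = attend(text_q, speech)  # weights over speech positions
    v_out, v_w = attend(text_q, visual)  # weights over visual positions
    fused = [a + v for a, v in zip(a_out, v_out)]  # assumption: simple sum
    return fused, a_w, v_w


text_q = [1.0, 0.0]
speech = [[1.0, 0.0], [0.0, 1.0]]
visual = [[2.0, 0.0], [2.0, 0.0]]
fused, a_w, v_w = text_prompted_fusion(text_q, speech, visual)
```

The same routine applied to the specific knowledge representations would give the inconsistency fusion; only the inputs change.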

[0073] The consistent and inconsistent multimodal sentiment analysis device based on contrastive learning in this embodiment has the same inventive concept as the consistent and inconsistent multimodal sentiment analysis method based on contrastive learning described above. Therefore, the specific implementation of this device can be found in the embodiment section of the consistent and inconsistent multimodal sentiment analysis method based on contrastive learning described above, and its technical effects correspond to the technical effects of the above method, so it will not be repeated here.

[0074] To verify the beneficial effects of this application, comparative experiments were conducted on several publicly available multimodal sentiment analysis datasets, comparing the proposed method with existing mainstream methods. Experimental results show that, under the same experimental conditions, the proposed method achieves superior performance across multiple evaluation metrics. Further analysis indicates that by introducing a contrastive representation reconstruction mechanism, the model can better maintain the discriminative power of features when facing multimodal data with both consistent and inconsistent sentiment information, mitigating the performance degradation caused by direct fusion. Simultaneously, the use of a text-based attention mechanism enables the model to more stably focus on key sentiment information during multimodal interaction.

[0075] In summary, this application solves the ambiguity problem caused by modal differences, and effectively utilizes complementary cross-modal information to achieve stable and effective emotional feature modeling in complex multimodal emotional scenarios, thereby improving the overall performance of multimodal emotion analysis tasks. It has good feasibility and beneficial effects.

[0076] The above descriptions are merely various embodiments of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.

Claims

1. A method for consistent and inconsistent multimodal sentiment analysis based on contrastive learning, characterized in that it comprises: constructing a sentiment prediction model, wherein the sentiment prediction model includes a feature extraction module, a feature decomposition module, a text prompting attention module, and a multilayer perceptron; training the sentiment prediction model based on a training dataset to obtain a trained sentiment prediction model, wherein the samples in the training dataset are multimodal data, and the multimodal data includes text modal data, visual modal data, and speech modal data; wherein, during training, the feature extraction module extracts features from the input multimodal data to obtain the features of each modality; the feature decomposition module decomposes the features of each modality to obtain the shared knowledge representation and the specific knowledge representation of each modality; the text prompting attention module performs consistency fusion on the shared knowledge representations of the modalities to obtain a consistent fusion representation, and performs inconsistency fusion on the specific knowledge representations of the modalities to obtain an inconsistency fusion representation; the multilayer perceptron performs multimodal sentiment prediction based on the consistent fusion representation, the inconsistency fusion representation, and the shared and specific knowledge representations of the text modality, to obtain the multimodal sentiment prediction result; during training, the contrastive loss is calculated based on the shared knowledge representation and the specific knowledge representation of each modality; the multimodal prediction loss is calculated based on the multimodal sentiment prediction results and the multimodal ground-truth labels; the shared knowledge representation and the specific knowledge representation of each modality are input into the multilayer perceptron to obtain the unimodal sentiment prediction results corresponding to each modality, and the unimodal prediction loss is calculated based on the unimodal sentiment prediction results and the unimodal ground-truth labels; the contrastive loss, the multimodal prediction loss, and the unimodal prediction loss are weighted and combined to obtain the total loss; and inputting the multimodal data to be predicted into the trained sentiment prediction model to obtain the multimodal sentiment prediction result.

2. The method as described in claim 1, characterized in that the feature extraction module includes a pre-trained BERT model, a visual feature encoder, and a speech feature encoder; the pre-trained BERT model is used to extract features from the text modal data to obtain text modal features; the visual feature encoder is used to extract features from the visual modal data to obtain visual modal features; and the speech feature encoder is used to extract features from the speech modal data to obtain speech modal features.

3. The method as described in claim 1, characterized in that the feature decomposition module includes feature projectors corresponding to the respective modalities, which are used to decompose the features of each modality to obtain the shared knowledge representation and the specific knowledge representation; each feature projector comprises a layer normalization layer, a linear transformation layer with a Tanh activation function, and a Dropout layer connected in sequence.

4. The method as described in claim 1, characterized in that the text prompting attention module performing consistency fusion on the shared knowledge representations of the modalities includes: inputting the shared knowledge representation of the text modality into a text encoder for feature modeling to obtain a fine-grained shared knowledge text representation; using the fine-grained shared knowledge text representation as the query vector, calculating attention with the shared knowledge representation of the speech modality and with the shared knowledge representation of the visual modality, respectively, to obtain a first similarity weight matrix and a second similarity weight matrix; and, based on the first similarity weight matrix and the second similarity weight matrix, weighting and fusing the shared knowledge representations of the speech modality and the visual modality to obtain the consistent fusion representation.

5. The method as described in claim 1, characterized in that the text prompting attention module performing inconsistency fusion on the specific knowledge representations of the modalities includes: inputting the specific knowledge representation of the text modality into a text encoder for feature modeling to obtain a fine-grained specific knowledge text representation; using the fine-grained specific knowledge text representation as the query vector, calculating attention with the specific knowledge representation of the speech modality and with the specific knowledge representation of the visual modality, respectively, to obtain a third similarity weight matrix and a fourth similarity weight matrix; and, based on the third similarity weight matrix and the fourth similarity weight matrix, weighting and fusing the specific knowledge representations of the speech modality and the visual modality to obtain the inconsistency fusion representation.

6. The method as described in claim 1, characterized in that calculating the contrastive loss based on the shared knowledge representation and the specific knowledge representation of each modality includes: for the set of samples in a training batch, taking any sample as the current sample and calculating the cosine similarity between the current sample and each of the other samples; sorting the samples in descending order of cosine similarity, forming a high-similarity sample set from the top-ranked samples and a low-similarity sample set from the bottom-ranked samples; selecting, from the high-similarity sample set, the samples with the same label as the current sample to form a similar sample set SS, the remaining samples in the high-similarity sample set forming an easily confused sample set SD; selecting, from the low-similarity sample set, the samples whose labels differ from that of the current sample to form a dissimilar sample set DD; selecting a set number of samples from the similar sample set SS to form a positive sample set, and selecting a set number of samples from the easily confused sample set SD and the dissimilar sample set DD to form a negative sample set; constructing, within the same sample, in-sample positive sample pairs and in-sample negative sample pairs using a formula whose symbols denote, respectively: the in-sample positive sample pairs of the current sample; the in-sample negative sample pairs of the current sample; the shared knowledge representations of the text, visual, and speech modalities of the current sample; the shared knowledge representations of the text, visual, and speech modalities of another sample; the positive sample set of the current sample; the negative sample set of the current sample; the specific knowledge representations of the text, visual, and speech modalities of the current sample; and the specific knowledge representations of the text, visual, and speech modalities of the other sample; constructing positive sample pairs and negative sample pairs between different samples using a formula whose symbols denote, respectively: the inter-sample positive sample pairs of the current sample; the inter-sample negative sample pairs of the current sample; and the shared knowledge representations of the text, visual, and speech modalities of a sample; constructing the set of positive sample pairs and the set of negative sample pairs of the current sample; and calculating the contrastive loss using a formula whose symbols denote, respectively: the contrastive loss of the current sample; the cosine similarity function; the temperature parameter; an element of the set of positive sample pairs, together with the two values in that element; and an element of the set of negative sample pairs, together with the two values in that element.

7. The method as described in claim 1, characterized in that the multimodal prediction loss is calculated by a formula whose symbols denote, respectively: the multimodal prediction loss; the multimodal ground-truth label of a sample; the multimodal sentiment prediction result for that sample; and the number of samples in a training batch.

8. The method as described in claim 1, characterized in that the unimodal prediction loss is calculated by a formula whose symbols denote, respectively: the unimodal prediction loss; the unimodal ground-truth label of a sample; the unimodal sentiment prediction result for that sample; the number of samples in a training batch; and the square of the L2 norm.

9. A device for consistent and inconsistent multimodal sentiment analysis based on contrastive learning, characterized in that it comprises: a model building module, used to build a sentiment prediction model, wherein the sentiment prediction model includes a feature extraction module, a feature decomposition module, a text prompting attention module, and a multilayer perceptron; a training module, used to train the sentiment prediction model based on a training dataset to obtain a trained sentiment prediction model, wherein the samples in the training dataset are multimodal data, and the multimodal data includes text modal data, visual modal data, and speech modal data; wherein, during training, the feature extraction module extracts features from the input multimodal data to obtain the features of each modality; the feature decomposition module decomposes the features of each modality to obtain the shared knowledge representation and the specific knowledge representation of each modality; the text prompting attention module performs consistency fusion on the shared knowledge representations of the modalities to obtain a consistent fusion representation, and performs inconsistency fusion on the specific knowledge representations of the modalities to obtain an inconsistency fusion representation; the multilayer perceptron performs multimodal sentiment prediction based on the consistent fusion representation, the inconsistency fusion representation, and the shared and specific knowledge representations of the text modality, to obtain the multimodal sentiment prediction result; during training, the contrastive loss is calculated based on the shared knowledge representation and the specific knowledge representation of each modality; the multimodal prediction loss is calculated based on the multimodal sentiment prediction results and the multimodal ground-truth labels; the shared knowledge representation and the specific knowledge representation of each modality are input into the multilayer perceptron to obtain the unimodal sentiment prediction results corresponding to each modality, and the unimodal prediction loss is calculated based on the unimodal sentiment prediction results and the unimodal ground-truth labels; the contrastive loss, the multimodal prediction loss, and the unimodal prediction loss are weighted and combined to obtain the total loss; and a prediction module, used to input the multimodal data to be predicted into the trained sentiment prediction model to obtain the multimodal sentiment prediction result.