Sentiment analysis method based on multi-modal progressive fusion and enhanced network
By using a multimodal progressive fusion and enhancement network, the method addresses the low computational efficiency and suboptimal cross-modal fusion of multimodal sentiment analysis, achieving efficient sentiment analysis and accurate feature fusion and improving the robustness and accuracy of the model.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- GUANGDONG UNIV OF TECH
- Filing Date
- 2026-03-06
- Publication Date
- 2026-06-12
AI Technical Summary
Existing multimodal sentiment analysis models are computationally inefficient, neglect unimodal features, and ignore the long-term contextual characteristics of modal sequences when fusing across modalities, resulting in suboptimal fusion results.
A multimodal progressive fusion and enhancement network is adopted. The original features of text, audio and video are extracted and aligned temporally and dimensionally. The multimodal progressive fusion module and the enhancement module are used for deep cross-modal interaction. The implicit and explicit alignment mechanisms are combined to calculate the fusion weights and perform weighted integration, and finally perform sentiment classification.
It improves the computational efficiency of long sequence modeling, enhances the robustness and universality of the model, achieves more complete and accurate cross-modal feature alignment, and improves the accuracy of sentiment judgment.
Description
Technical Field
[0001] This invention relates to the field of natural language processing, and more specifically to a sentiment analysis method based on a multimodal progressive fusion and enhancement network.
Background Technology
[0002] With the development of the internet, social media platforms such as Weibo, Douyin, and Xiaohongshu have become an integral part of people's daily lives. This has spurred the generation of multimodal data, including text, speech, and video. Multimodal data has become a way for people to express emotions, presenting both opportunities and challenges for multimodal sentiment computing. Compared with unimodal data, multimodal data allows multiple modalities to be integrated and analyzed together to produce more accurate results. Therefore, how to effectively utilize multimodal data to accurately predict sentiment deserves in-depth research.
[0003] However, certain shortcomings remain in multimodal sentiment analysis. Most recent sentiment analysis models use Transformers to model sequences, but this is computationally inefficient. Mamba, introduced recently, offers a more efficient solution for modeling long sequences. Secondly, existing methods directly output cross-modal fusion, neglecting to utilize the original single-modal features, potentially leading to suboptimal fusion results. Furthermore, mainstream methods generally rely on cross-modal attention mechanisms for feature alignment and fusion, often ignoring the long-term contextual characteristics of the modal sequences themselves and placing the alignment task entirely on the attention mechanism.
[0004] Chinese invention application No. 202210743773.3 discloses a "Text-Guided Hierarchical Adaptive Fusion Multimodal Sentiment Analysis Method," which includes: first, extracting features from three modalities (text, speech, and vision) separately; then, employing a cross-modal attention mechanism guided by text modal information to achieve pairwise representation between modalities, obtaining closely related speech and visual features; next, using a multimodal adaptive gating mechanism to effectively filter the three single-modal features with modality-related features, obtaining modality-specific features; then, employing a multimodal hierarchical fusion strategy to integrate multimodal features and modality-important information; and finally, using a linear transformation to predict sentiment polarity.
Summary of the Invention
[0005] To address the technical problems existing in the background art, this invention provides a sentiment analysis method based on multimodal progressive fusion and enhancement networks. The technical solution adopted by this invention is as follows:
[0006] The first aspect of this invention provides a sentiment analysis method based on multimodal progressive fusion and enhancement networks, the method comprising:
S1: Extract the original features of text, audio, and video sample data, and perform temporal and dimensional alignment and encoding to obtain the initial sequence features of each modality;
S2: Input the initial sequence features of each modality into a multimodal progressive fusion module and execute, in parallel, a fusion process containing N layers, where each layer includes a preliminary feature interaction stage and a subsequent feature fusion stage; by stacking N layers, output the fused features of each modality after deep cross-modal interaction;
S3: Input the fused features of each modality after deep cross-modal interaction into a multimodal enhancement module, where stacked enhancement layers further optimize the features of each modality to obtain the enhanced multimodal features of each modality;
S4: Based on the distribution differences between the initial sequence features of each modality, calculate the fusion weight between the initial sequence features of each modality and the enhanced multimodal features;
S5: Using the fusion weights, perform weighted integration of the pooled single-modal features and the enhanced fusion features to obtain the final multimodal fusion representation;
S6: Based on the final multimodal fusion representation, perform sentiment classification to obtain the classification result.
[0007] As a preferred embodiment, in step S1, extracting the original features of text, audio, and video sample data and aligning and encoding them by time and dimension to obtain the initial sequence features of each modality includes: using pre-trained text, audio, and video feature extraction models to extract the original features of the text, audio, and video sample data, respectively; applying one-dimensional convolutions to the original audio and video features to align them with the text sequence length; mapping the original features of the three modalities to the same dimension with linear layers; and learning the global relationships between feature sequences with a Transformer encoder to obtain the initial sequence features $x_m$ of each modality, where $m \in \{t, a, v\}$, $t$ denotes text, $a$ denotes audio, and $v$ denotes video.
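The alignment part of step S1 can be illustrated with a minimal, standard-library-only sketch. It is not the patented implementation. It assumes one simple way to match sequence lengths (a stride-1 convolution whose kernel size is chosen so the output length equals the text length), uses random weights, and omits the pre-trained extractors and the Transformer encoder entirely. All sizes and helper names are hypothetical.

```python
import random

def conv1d_align(seq, target_len, rng):
    """Shrink a [T, D] sequence to [target_len, D] with a valid 1D convolution.

    Assumption: kernel size k = T - target_len + 1 with stride 1, so the
    output length is exactly target_len. One random kernel is shared by all
    feature channels, purely for illustration.
    """
    T, D = len(seq), len(seq[0])
    k = T - target_len + 1
    if k < 1:
        raise ValueError("sequence is shorter than the target length")
    kernel = [rng.uniform(-1, 1) / k for _ in range(k)]
    return [[sum(kernel[i] * seq[t + i][d] for i in range(k)) for d in range(D)]
            for t in range(target_len)]

def linear(seq, weight):
    """Project every time step from D_in to D_out; weight is [D_in][D_out]."""
    return [[sum(x[i] * weight[i][o] for i in range(len(x)))
             for o in range(len(weight[0]))] for x in seq]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
text = rand_matrix(20, 8, rng)    # 20 tokens, 8-dim features (hypothetical)
audio = rand_matrix(50, 5, rng)   # 50 frames, 5-dim features (hypothetical)
video = rand_matrix(35, 6, rng)   # 35 frames, 6-dim features (hypothetical)

d_model = 4  # shared feature dimension (hypothetical)
aligned = {
    "t": linear(text, rand_matrix(8, d_model, rng)),
    "a": linear(conv1d_align(audio, len(text), rng), rand_matrix(5, d_model, rng)),
    "v": linear(conv1d_align(video, len(text), rng), rand_matrix(6, d_model, rng)),
}
shapes = {m: (len(x), len(x[0])) for m, x in aligned.items()}
```

After this step all three modalities share the text sequence length and the same feature dimension, which is the precondition for the later cross-modal stages.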
[0008] As a preferred embodiment, in step S2, the preliminary feature interaction stage is: using a sequence modeling network with shared parameters to implicitly align the features of all modalities; the subsequent feature fusion stage is: using the target modal feature of the current branch as the query, and using the other modal features after implicit alignment as the key and value, to perform cross-modal attention calculation to achieve explicit alignment, and then fusing the aligned information with the target modal feature.
[0009] As a preferred embodiment, the implicit alignment is achieved through a bidirectional state-space model with a shared transition matrix, the process of which is expressed as follows:
[0010] In this formula, the input is the feature of the (k-1)th layer; SA denotes the self-attention mechanism; the bidirectional state-space model operation uses a shared transition matrix; and the output is the feature after coarse-grained alignment.
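To make the idea of a shared bidirectional sequence model concrete, here is a small sketch under strong simplifying assumptions: a scalar linear recurrence h_t = a * h_{t-1} + b * x_t stands in for the selective state-space model (Mamba), the self-attention step is omitted, and the constants `SHARED_A` and `SHARED_B` are hypothetical. The only point it illustrates is that the same parameters process every modality in both directions.

```python
def scan(seq, a, b):
    """Linear recurrence h_t = a * h_{t-1} + b * x_t, applied per channel."""
    h = [0.0] * len(seq[0])
    out = []
    for x in seq:
        h = [a * hi + b * xi for hi, xi in zip(h, x)]
        out.append(h)
    return out

def bidirectional_scan(seq, a, b):
    """Sum of a left-to-right scan and a right-to-left scan over seq."""
    fwd = scan(seq, a, b)
    bwd = scan(seq[::-1], a, b)[::-1]
    return [[f + g for f, g in zip(fr, br)] for fr, br in zip(fwd, bwd)]

# One shared (a, b) pair is reused for every modality (hypothetical values).
SHARED_A, SHARED_B = 0.5, 1.0

features = {
    "t": [[1.0], [0.0], [0.0]],
    "a": [[0.0], [0.0], [1.0]],
    "v": [[0.0], [1.0], [0.0]],
}
aligned = {m: bidirectional_scan(x, SHARED_A, SHARED_B)
           for m, x in features.items()}
```

Because the scan runs in both directions, an impulse at any time step influences every other position, with influence decaying with distance.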
[0011] As a preferred embodiment, using the target modal feature of the current branch as the query and the implicitly aligned features of the other modalities as keys and values to perform cross-modal attention for explicit alignment, and then fusing the aligned information with the target modal feature, includes: for any target modality m, computing its cross-modal attention with the two auxiliary modalities j and k, where j and k are the two distinct modalities that remain in the set of three modalities after the target modality m is excluded. The cross-modal attention is calculated as follows:
[0012]
[0013] In these formulas, the key and value inputs are the auxiliary modal features after implicit alignment; the three projection matrices are learnable parameters; a scaling factor is applied to the attention scores; and the result denotes the fine-grained alignment features from auxiliary modality j to target modality m. The alignment features from the other auxiliary modality k to m are obtained in the same way. The fusion is then completed by the fusion enhancement module to obtain the current layer's fused feature output centered on modality m:
[0014] In this formula, the remaining terms denote a linear projection layer, a concatenation operation, and the fusion enhancement module.
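The explicit alignment step can be sketched as standard scaled dot-product attention, with the target modality as the query and an auxiliary modality as key and value. This is an illustrative sketch, not the patent's formula: identity projection matrices replace the learnable parameters, the scaling factor is assumed to be the square root of the key dimension, and only the concatenation is shown (the linear projection and the fusion enhancement module are left out).

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(target, aux, w_q, w_k, w_v):
    """Query from the target modality, key and value from the auxiliary one."""
    q, k, v = matmul(target, w_q), matmul(aux, w_k), matmul(aux, w_v)
    scale = math.sqrt(len(q[0]))  # assumed scaling factor: sqrt(d_k)
    scores = matmul(q, transpose(k))
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights

identity = [[1.0, 0.0], [0.0, 1.0]]      # stand-in for learnable projections
text = [[1.0, 0.0], [0.0, 1.0]]          # target modality m = t
audio = [[10.0, 0.0], [0.0, 10.0]]       # auxiliary modality j = a
video = [[0.0, 5.0], [5.0, 0.0]]         # auxiliary modality k = v

a_to_t, w_at = cross_attention(text, audio, identity, identity, identity)
v_to_t, w_vt = cross_attention(text, video, identity, identity, identity)

# Concatenate the target feature with both aligned auxiliary features per step.
fused_in = [t + a + v for t, a, v in zip(text, a_to_t, v_to_t)]
```

Each text time step attends most strongly to the auxiliary time step whose key points in the same direction, which is the sense in which the attention "aligns" the auxiliary sequence to the target.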
[0015] As a preferred embodiment, in step S3, each enhancement layer in the multimodal enhancement module sequentially performs feature transformation based on a bidirectional state-space model and feature transformation based on a feedforward network, and residual connection and layer normalization operations are performed after each transformation. The process is expressed as follows:
[0016] In these formulas, the input is the output of the (k-1)th enhancement layer, the two operators denote the bidirectional state-space model operation and the feedforward network operation, and the result is the output of the k-th enhancement layer.
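The structure of one enhancement layer (sequence transform, then residual connection and layer normalization, then feedforward network, then residual connection and layer normalization) can be sketched as below. The bidirectional state-space model is replaced by a toy bidirectional linear recurrence, layer normalization has no learnable scale or shift, and the weights `w1` and `w2` are arbitrary hypothetical values.

```python
import math

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def bi_scan(seq, a=0.5):
    """Toy stand-in for the bidirectional SSM: h_t = a * h_{t-1} + x_t, both ways."""
    def one_way(s):
        h, out = [0.0] * len(s[0]), []
        for x in s:
            h = [a * hi + xi for hi, xi in zip(h, x)]
            out.append(h)
        return out
    fwd, bwd = one_way(seq), one_way(seq[::-1])[::-1]
    return [[f + b for f, b in zip(fr, br)] for fr, br in zip(fwd, bwd)]

def ffn(seq, w1, w2):
    """Position-wise feedforward network: ReLU(x W1) W2."""
    out = []
    for x in seq:
        hidden = [max(0.0, sum(xi * w1[i][j] for i, xi in enumerate(x)))
                  for j in range(len(w1[0]))]
        out.append([sum(hj * w2[j][o] for j, hj in enumerate(hidden))
                    for o in range(len(w2[0]))])
    return out

def add_norm(seq, transformed):
    """Residual connection followed by layer normalization, per time step."""
    return [layer_norm([s + t for s, t in zip(sr, tr)])
            for sr, tr in zip(seq, transformed)]

def enhancement_layer(seq, w1, w2):
    seq = add_norm(seq, bi_scan(seq))
    return add_norm(seq, ffn(seq, w1, w2))

w1 = [[0.5, -0.5, 0.25], [0.1, 0.3, -0.2]]    # 2 -> 3 (hypothetical)
w2 = [[0.2, 0.1], [-0.3, 0.4], [0.5, -0.1]]   # 3 -> 2 (hypothetical)
x = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
for _ in range(2):                             # two stacked enhancement layers
    x = enhancement_layer(x, w1, w2)
```

The layer preserves the sequence shape, so any number of such layers can be stacked.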
[0017] As a preferred embodiment, the method for calculating the fusion weights includes: obtaining, through a variational autoencoder, the prior distribution of each single-modal initial feature $x_m$:
[0018] Calculate the KL divergence between the prior distributions of every pair of different modalities:
[0019] The average value of the KL divergence is input into the Sigmoid function to obtain the modality difference score, which is then used as the fusion weight.
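The score computation can be sketched as follows, under stated assumptions: each modality's distribution is a diagonal Gaussian given by a (mean, log-variance) pair, as a variational autoencoder's encoder would typically produce (the encoder itself is omitted), and "all different modality pairs" is read as all ordered pairs, since KL divergence is asymmetric. The function names are hypothetical.

```python
import math
from itertools import permutations

def kl_diag_gaussians(mu_p, logvar_p, mu_q, logvar_q):
    """KL(p || q) between two diagonal Gaussians in closed form."""
    total = 0.0
    for mp, lp, mq, lq in zip(mu_p, logvar_p, mu_q, logvar_q):
        total += lq - lp + (math.exp(lp) + (mp - mq) ** 2) / math.exp(lq) - 1.0
    return 0.5 * total

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def modality_difference_score(dists):
    """Average KL over all ordered modality pairs, squashed by a sigmoid."""
    pairs = list(permutations(dists, 2))
    kls = [kl_diag_gaussians(*dists[p], *dists[q]) for p, q in pairs]
    return sigmoid(sum(kls) / len(kls))

# Each entry is (mean vector, log-variance vector); values are made up.
identical = {m: ([0.0, 0.0], [0.0, 0.0]) for m in "tav"}
different = {
    "t": ([0.0, 0.0], [0.0, 0.0]),
    "a": ([1.0, -1.0], [0.0, 0.0]),
    "v": ([2.0, 0.5], [0.0, 0.0]),
}
score_same = modality_difference_score(identical)
score_diff = modality_difference_score(different)
```

Since KL divergence is non-negative, the score lies in [0.5, 1): it equals 0.5 when all modality distributions coincide and grows as they diverge.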
[0020] Average pooling is applied separately to the enhanced multimodal features and to the single-modal initial features $x_m$; the pooled features are then weighted by the modality difference score (score) and integrated to obtain the final multimodal fusion representation $f$:
[0021] In this formula, the remaining terms denote a gate function used to filter redundant information, a concatenation operation, and an average pooling operation.
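Since the fusion formula itself is not reproduced in this text, the following is only one plausible arrangement of the named ingredients (average pooling, score weighting, concatenation, gate). It assumes the score weights the pooled single-modal features and (1 - score) weights the pooled enhanced features, and it uses a hypothetical self-gating function in place of the unspecified gate.

```python
import math

def avg_pool(seq):
    """Average a [T, D] sequence over time to a D-dimensional vector."""
    T = len(seq)
    return [sum(step[d] for step in seq) / T for d in range(len(seq[0]))]

def gate(z):
    """Hypothetical gate: scales each entry by the sigmoid of itself."""
    return [v / (1.0 + math.exp(-v)) for v in z]

def fuse(initial, enhanced, score):
    """Assumed weighting: score on single-modal, (1 - score) on enhanced."""
    parts = []
    for m in ("t", "a", "v"):
        parts += [score * v for v in avg_pool(initial[m])]
        parts += [(1.0 - score) * v for v in avg_pool(enhanced[m])]
    return gate(parts)  # concatenated, then gated

initial = {m: [[1.0, 3.0], [3.0, 5.0]] for m in "tav"}
enhanced = {m: [[0.0, 2.0], [4.0, 2.0]] for m in "tav"}
f = fuse(initial, enhanced, score=0.75)
```

With three modalities and 2-dimensional features, the result is a 12-dimensional vector: one pooled single-modal part and one pooled enhanced part per modality.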
[0022] As a preferred embodiment, the method further includes the following step before step S1: Perform context modeling to extract contextual data features associated with the sample data; Step S6 also includes: The contextual data features are fused with the multimodal fusion representation, and then sentiment classification is performed to obtain the classification result.
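As a rough illustration of fusing context features with the fusion representation before classification, the sketch below concatenates the two vectors and applies a linear layer followed by softmax. The concatenation, the three-class label set, and all weights are assumptions made for the example; the text does not specify how the context features are fused or how many classes are used.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def classify(fusion_repr, context_feat, weight, bias):
    """Concatenate (assumed fusion), apply a linear layer, then softmax."""
    joint = fusion_repr + context_feat
    logits = [sum(w * x for w, x in zip(row, joint)) + b
              for row, b in zip(weight, bias)]
    probs = softmax(logits)
    return probs.index(max(probs)), probs

LABELS = ["negative", "neutral", "positive"]   # hypothetical label set
fusion_repr = [0.2, -0.1, 0.4]                 # final multimodal representation
context_feat = [0.5, 0.3]                      # context features
weight = [
    [-1.0, 0.0, -1.0, -1.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 1.0, 1.0],
]
bias = [0.0, 0.0, 0.0]
label_id, probs = classify(fusion_repr, context_feat, weight, bias)
```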
[0023] As a preferred embodiment, the method further includes: A composite loss function is used for model training, and the composite loss function is expressed as follows:
[0024] In this formula, the first three terms are the losses of the text, audio, and video modality subnetworks, respectively; the fourth is the loss of the context subnetwork; and the last is the multimodal fusion loss that incorporates the context.
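The composite loss combines five terms. The exact combination is not reproduced in this text, so the sketch below assumes a weighted sum with unit weights by default; the weights and loss values are hypothetical.

```python
def composite_loss(losses, weights=None):
    """Weighted sum of the five loss terms (unit weights are an assumption)."""
    names = ("text", "audio", "video", "context", "fusion")
    weights = weights or {n: 1.0 for n in names}
    return sum(weights[n] * losses[n] for n in names)

losses = {"text": 0.5, "audio": 0.25, "video": 0.75, "context": 0.5, "fusion": 1.0}
total = composite_loss(losses)
emphasized = composite_loss(
    losses,
    {"text": 1.0, "audio": 1.0, "video": 1.0, "context": 1.0, "fusion": 2.0},
)
```

Keeping the per-modality subnetwork losses alongside the fusion loss gives each single-modal branch its own training signal rather than relying on the fused prediction alone.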
[0025] A second aspect of the present invention provides a computer device including a storage medium, a processor, and a computer program stored in the storage medium and executable by the processor, wherein the computer program, when executed by the processor, implements the steps of the aforementioned sentiment analysis method based on multimodal progressive fusion and augmentation networks.
[0026] Compared with the prior art, the beneficial effects of this invention are as follows. This invention employs a hybrid sequence modeling structure combining Mamba and Transformer, significantly improving the computational efficiency of long sequence modeling while maintaining model representation capabilities, and better capturing the long-term contextual dependencies of each modality. By constructing a progressive fusion network in which each modality in turn serves as the center, it enhances the model's robustness when a particular modality is of poor quality and ensures that any modality can effectively guide cross-modal alignment, thus improving the method's universality and reliability. Through a two-stage fusion mechanism combining implicit and explicit alignment, it achieves progressive interaction from coarse-grained to fine-grained, resulting in more complete and accurate cross-modal feature alignment and thus richer fusion representations. By introducing a modal difference learning module based on KL divergence, it can dynamically evaluate and adaptively adjust the contribution weights of single-modal and multimodal fusion features, using the advantages of both to improve the accuracy of the final sentiment judgment.
Attached Figure Description
[0027] Figure 1 is a flowchart of the sentiment analysis method based on multimodal progressive fusion and enhancement networks provided in this embodiment.
[0028] Figure 2 is a network architecture diagram of the sentiment analysis method based on multimodal progressive fusion and enhancement networks provided in this embodiment; Figure 3 is a schematic diagram of the multimodal progressive fusion layer structure provided in this embodiment; Figure 4 is a schematic diagram of the multimodal enhancement layer structure provided in this embodiment.
Detailed Implementation
[0029] The accompanying drawings are for illustrative purposes only and should not be construed as limiting the invention. It should be understood that the described embodiments are merely some, not all, of the embodiments of this application. All other embodiments obtained by those skilled in the art based on the embodiments of this application without creative effort are within the scope of protection of the embodiments of this application.
[0030] The terminology used in the embodiments of this application is for the purpose of describing particular embodiments only and is not intended to limit the embodiments of this application. The singular forms “a,” “an,” and “the” used in the embodiments of this application and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term “and / or” as used herein refers to and includes any or all possible combinations of one or more of the associated listed items.
[0031] In the following description, when referring to the accompanying drawings, unless otherwise indicated, the same numbers in different drawings represent the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims. In the description of this application, it should be understood that the terms "first," "second," "third," etc., are used only to distinguish similar objects and are not necessarily used to describe a specific order or sequence, nor should they be construed as indicating or implying relative importance. Those skilled in the art can understand the specific meaning of the above terms in this application according to the specific circumstances.
[0032] Furthermore, in the description of this application, unless otherwise stated, "multiple" means two or more. "And / or" describes the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A alone, A and B simultaneously, or B alone. The character " / " generally indicates that the preceding and following related objects are in an "or" relationship.
[0033] The present invention will be further described below with reference to the accompanying drawings and embodiments.
[0034] Example 1
Referring to Figure 1, this embodiment provides a sentiment analysis method based on multimodal progressive fusion and enhancement networks, the method comprising:
S1: Extract the original features of text, audio, and video sample data, and perform temporal and dimensional alignment and encoding to obtain the initial sequence features of each modality;
In a specific embodiment, extracting the original features of text, audio, and video sample data in step S1 and aligning and encoding them by time and dimension to obtain the initial sequence features of each modality includes: using pre-trained text, audio, and video feature extraction models to extract the original features of the text, audio, and video sample data, respectively; applying one-dimensional convolutions to the original audio and video features to align them with the text sequence length; mapping the original features of the three modalities to the same dimension with linear layers; and learning the global relationships between feature sequences with a Transformer encoder to obtain the initial sequence features $x_m$ of each modality, where $m \in \{t, a, v\}$, $t$ denotes text, $a$ denotes audio, and $v$ denotes video.
[0035] In one specific embodiment, the method further includes the following step before step S1: Perform context modeling to extract contextual data features associated with the sample data.
[0036] S2: Input the initial sequence features of each modality into a multimodal progressive fusion module and execute a fusion process containing N layers in parallel. Each layer includes a preliminary feature interaction stage and a subsequent feature fusion stage. By stacking N layers, the fused features of each modality after deep cross-modal interaction are output. In a specific embodiment, in step S2, the preliminary feature interaction stage is: using a sequence modeling network with shared parameters to implicitly align the features of all modalities; the subsequent feature fusion stage is: using the target modal feature of the current branch as the query, and using the other modal features after implicit alignment as the key and value, to perform cross-modal attention calculation to achieve explicit alignment, and then fusing the aligned information with the target modal feature.
[0037] In one specific embodiment, the implicit alignment is achieved through a bidirectional state-space model with a shared transition matrix, the process of which is represented as follows:
[0038] In this formula, the input is the feature of the (k-1)th layer; SA denotes the self-attention mechanism; the bidirectional state-space model operation uses a shared transition matrix; and the output is the feature after coarse-grained alignment.
[0039] In a specific embodiment, using the target modal feature of the current branch as the query and the implicitly aligned features of the other modalities as keys and values to perform cross-modal attention for explicit alignment, and then fusing the aligned information with the target modal feature, includes: for any target modality m, computing its cross-modal attention with the two auxiliary modalities j and k, where j and k are the two distinct modalities that remain in the set of three modalities after the target modality m is excluded. The cross-modal attention is calculated as follows:
[0040]
[0041] In these formulas, the key and value inputs are the auxiliary modal features after implicit alignment; the three projection matrices are learnable parameters; a scaling factor is applied to the attention scores; and the result denotes the fine-grained alignment features from auxiliary modality j to target modality m. The alignment features from the other auxiliary modality k to m are obtained in the same way. The fusion is then completed by the fusion enhancement module to obtain the current layer's fused feature output centered on modality m:
[0042] In this formula, the remaining terms denote a linear projection layer, a concatenation operation, and the fusion enhancement module.
[0043] S3: Input the fused features of each modality after deep cross-modal interaction into a multimodal enhancement module. Through stacked enhancement layers, the features of each modality are further optimized to obtain the enhanced multimodal features of each modality. In a specific embodiment, in step S3, each enhancement layer in the multimodal enhancement module sequentially performs feature transformation based on a bidirectional state-space model and feature transformation based on a feedforward network, and after each transformation, residual connection and layer normalization operations are performed. The process is represented as follows:
[0044] In these formulas, the input is the output of the (k-1)th enhancement layer, the two operators denote the bidirectional state-space model operation and the feedforward network operation, and the result is the output of the k-th enhancement layer.
[0045] S4: Based on the distribution differences between the initial sequence features of each modality, calculate the fusion weight between the initial sequence features of each modality and the enhanced multimodal features; In one specific embodiment, the method for calculating the fusion weight includes: obtaining, through a variational autoencoder, the prior distribution of each single-modal initial feature $x_m$:
[0046] Calculate the KL divergence between the prior distributions of every pair of different modalities:
[0047] The average value of the KL divergence is input into the Sigmoid function to obtain the modality difference score, which is then used as the fusion weight.
[0048] Average pooling is applied separately to the enhanced multimodal features and to the single-modal initial features $x_m$; the pooled features are then weighted by the modality difference score (score) and integrated to obtain the final multimodal fusion representation $f$:
[0049] In this formula, the remaining terms denote a gate function used to filter redundant information, a concatenation operation, and an average pooling operation.
[0050] S5: Using the fusion weights, the pooled single-modal features and the enhanced fusion features are weighted and integrated to obtain the final multimodal fusion representation; S6: Perform sentiment classification based on the final multimodal fusion representation to obtain the classification result; In one specific embodiment, step S6 further includes: The contextual data features are fused with the multimodal fusion representation, and then sentiment classification is performed to obtain the classification result.
[0051] In one specific embodiment, the method further includes: A composite loss function is used for model training, and the composite loss function is expressed as follows:
[0052] In this formula, the first three terms are the losses of the text, audio, and video modality subnetworks, respectively; the fourth is the loss of the context subnetwork; and the last is the multimodal fusion loss that incorporates the context.
[0053] Example 2
Referring to Figure 1, this embodiment provides a sentiment analysis method based on multimodal progressive fusion and enhancement networks, the method comprising:
S1: Extract the original features of text, audio, and video sample data, and perform temporal and dimensional alignment and encoding to obtain the initial sequence features of each modality;
In a specific embodiment, extracting the original features of text, audio, and video sample data in step S1 and aligning and encoding them by time and dimension to obtain the initial sequence features of each modality includes: using pre-trained text, audio, and video feature extraction models to extract the original features of the text, audio, and video sample data, respectively; applying one-dimensional convolutions to the original audio and video features to align them with the text sequence length; mapping the original features of the three modalities to the same dimension with linear layers; and learning the global relationships between feature sequences with a Transformer encoder to obtain the initial sequence features $x_m$ of each modality, where $m \in \{t, a, v\}$, $t$ denotes text, $a$ denotes audio, and $v$ denotes video.
[0054] Specifically, the text feature extraction model is RoBERTa, the audio feature extraction model is Whisper, and the video feature extraction model is TimeSformer or OpenFace.
[0055] In one specific embodiment, the method further includes the following step before step S1: Perform context modeling to extract contextual data features associated with the sample data.
[0056] S2: Input the initial sequence features of each modality into a multimodal progressive fusion module and execute a fusion process containing N layers in parallel. Each layer includes a preliminary feature interaction stage and a subsequent feature fusion stage. By stacking N layers, the fused features of each modality after deep cross-modal interaction are output. In a specific embodiment, in step S2, the preliminary feature interaction stage is: using a sequence modeling network with shared parameters to implicitly align the features of all modalities; the subsequent feature fusion stage is: using the target modal feature of the current branch as the query, and using the other modal features after implicit alignment as the key and value, to perform cross-modal attention calculation to achieve explicit alignment, and then fusing the aligned information with the target modal feature.
[0057] In one specific embodiment, the implicit alignment is achieved through a bidirectional state-space model with a shared transition matrix, the process of which is represented as follows:
[0058] In this formula, the input is the feature of the (k-1)th layer; SA denotes the self-attention mechanism; the bidirectional state-space model operation uses a shared transition matrix; and the output is the feature after coarse-grained alignment.
[0059] In a specific embodiment, using the target modal feature of the current branch as the query and the implicitly aligned features of the other modalities as keys and values to perform cross-modal attention for explicit alignment, and then fusing the aligned information with the target modal feature, includes: for any target modality m, computing its cross-modal attention with the two auxiliary modalities j and k, where j and k are the two distinct modalities that remain in the set of three modalities after the target modality m is excluded. The cross-modal attention is calculated as follows:
[0060]
[0061] In these formulas, the key and value inputs are the auxiliary modal features after implicit alignment; the three projection matrices are learnable parameters; a scaling factor is applied to the attention scores; and the result denotes the fine-grained alignment features from auxiliary modality j to target modality m. The alignment features from the other auxiliary modality k to m are obtained in the same way. The fusion is then completed by the fusion enhancement module to obtain the current layer's fused feature output centered on modality m:
[0062] In this formula, the remaining terms denote a linear projection layer, a concatenation operation, and the fusion enhancement module.
[0063] It should be noted that, referring to Figure 2, after the initial sequence features of each modality are obtained, cross-modal interaction between the different modalities is required. A multimodal progressive fusion network is therefore proposed. As shown in Figure 2, the Multimodal Progressive Fusion Module (MPFM) contains three modal branches: the top branch aligns and fuses with audio as the center, the middle branch with text as the center, and the bottom branch with vision as the center. Each branch consists of N Multimodal Progressive Fusion Layers (MPFLs), as shown in Figure 3. The initial input of each branch is the set of initial sequence features of the three modalities. Taking the middle (text-centered) branch as an example, its input consists of the three modal features, where t, a, and v denote text, speech, and vision, respectively. Within an MPFL, the three modalities are first implicitly aligned using Bi-Mamba, then explicitly aligned through global cross-modal attention with text as the center, and finally fused and enhanced by the Fusion Enhancement Block, which produces the output of the k-th layer. The other two branches perform the same operations. In each branch, the input of each layer is the three modal features output by the previous layer. Finally, the Multimodal Progressive Fusion Module outputs the aligned and fused features of the three branches. In summary, this module first performs local alignment by passing the three modalities through Bi-Mamba with a shared transition matrix, and then strengthens the fusion through global cross-modal alignment with one modality as the center and the others as auxiliary modalities.
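The data flow of the three-branch, N-layer module can be sketched as a control-flow skeleton. The three helper functions are deliberately trivial placeholders (identity, pass-through, and element-wise mean), not the Bi-Mamba, cross-modal attention, or Fusion Enhancement Block themselves. The sketch also assumes one reading of the text: the m-centered outputs of all three branches at one layer together form the three inputs of every branch at the next layer.

```python
MODALITIES = ("t", "a", "v")

def implicit_align(feats):
    """Placeholder for the shared Bi-Mamba pass: returns features unchanged."""
    return {m: list(x) for m, x in feats.items()}

def cross_attend(target, aux):
    """Placeholder for explicit alignment: passes the auxiliary features through."""
    return list(aux)

def fuse(target, aligned_j, aligned_k):
    """Placeholder for the Fusion Enhancement Block: element-wise mean."""
    return [(x + y + z) / 3.0 for x, y, z in zip(target, aligned_j, aligned_k)]

def progressive_fusion(feats, num_layers):
    for _ in range(num_layers):
        aligned = implicit_align(feats)          # same parameters for all modalities
        nxt = {}
        for m in MODALITIES:                     # one branch per center modality
            j, k = [o for o in MODALITIES if o != m]
            nxt[m] = fuse(aligned[m],
                          cross_attend(aligned[m], aligned[j]),
                          cross_attend(aligned[m], aligned[k]))
        feats = nxt                              # outputs feed the next layer
    return feats

out = progressive_fusion({"t": [3.0], "a": [6.0], "v": [9.0]}, num_layers=2)
```

Because every modality takes a turn as the center, no single modality (such as text) is privileged as the sole guide for alignment.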
[0064] S3: Input the fused features of each modality after deep cross-modal interaction into a multimodal enhancement module. Through stacked enhancement layers, the features of each modality are further optimized to obtain the enhanced multimodal features of each modality. In a specific embodiment, in step S3, each enhancement layer in the multimodal enhancement module sequentially performs feature transformation based on a bidirectional state-space model and feature transformation based on a feedforward network, and after each transformation, residual connection and layer normalization operations are performed. The process is represented as follows:
[0065] In these formulas, the input is the output of the (k-1)th enhancement layer, the two operators denote the bidirectional state-space model operation and the feedforward network operation, and the result is the output of the k-th enhancement layer.
[0066] It should be noted that, in order to make effective use of the cross-modal interactive fusion features and learn more effective representations, a Multimodal Fusion Enhancement Module (MEM) is proposed. This module mainly consists of N Multimodal Fusion Enhancement Layers (MELs), each comprising Bi-Mamba and a feedforward network (FFN). As shown in Figure 2, the multimodal enhancement module also has three branches: the top branch enhances the fused audio features, the middle branch enhances the fused text features, and the bottom branch enhances the fused visual features. Each branch consists of N MELs. As shown in Figure 4, each MEL includes Bi-Mamba and an FFN, and the initial input of the MEM is the output of the MPFM. Each layer of each branch produces an enhanced feature for its modality, and the outputs of the three branches after N layers are the enhanced multimodal features of the three modalities.
[0067] S4: Based on the distribution differences between the initial sequence features of each modality, calculate the fusion weight between the initial sequence features of each modality and the enhanced multimodal features; In one specific embodiment, the method for calculating the fusion weight includes: obtaining, through a variational autoencoder, the prior distribution of each single-modal initial feature $x_m$:
[0068] Calculate the KL divergence between the prior distributions of every pair of different modalities:
[0069] The average value of the KL divergence is input into the Sigmoid function to obtain the modality difference score, which is then used as the fusion weight.
[0070] Average pooling is performed separately on the enhanced multimodal features and on the single-modal initial features x_m. Using the modality difference score (score), the pooled features are weighted and integrated to obtain the final multimodal fusion representation f.
[0071] The integration involves a gate function, which filters redundant information, a concatenation operation, and an average pooling operation.
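One possible reading of this integration step is sketched below. The exact formula is not available in the text, so the form used here is an assumption: for each modality, the pooled initial features are weighted by the score and the pooled enhanced features by one minus the score, the per-modality results are concatenated, and an element-wise sigmoid gate is applied. The gate parameters `gate_w` and `gate_b` and all function names are hypothetical, and the roles of the two weights could equally be swapped.

```python
import math

def avg_pool(seq):
    """Average a sequence of feature vectors over the time axis."""
    return [sum(col) / len(seq) for col in zip(*seq)]

def gate(vec, gate_w, gate_b):
    """Element-wise sigmoid gate: scales each feature by a factor in (0, 1)."""
    return [v / (1.0 + math.exp(-(w * v + b)))
            for v, w, b in zip(vec, gate_w, gate_b)]

def fuse(initial, enhanced, score, gate_w, gate_b):
    """Weight pooled features per modality, concatenate, then gate.

    initial / enhanced map modality name -> sequence of feature vectors.
    The convex combination with `score` is an assumed form.
    """
    fused = []
    for name in initial:
        x = avg_pool(initial[name])    # pooled single-modal features
        h = avg_pool(enhanced[name])   # pooled enhanced features
        fused.extend(score * xi + (1.0 - score) * hi
                     for xi, hi in zip(x, h))
    return gate(fused, gate_w, gate_b)
```

With three modalities of feature dimension d, the resulting representation has length 3d and can be passed to the classifier of step S6.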
[0072] S5: Using the fusion weights, perform weighted integration of the pooled single-modal features and the enhanced fusion features to obtain the final multimodal fusion representation.
S6: Perform sentiment classification based on the final multimodal fusion representation to obtain the classification result. In one specific embodiment, step S6 further includes: fusing the contextual data features with the multimodal fusion representation and then performing sentiment classification to obtain the classification result.
[0073] In one specific embodiment, the method further includes: training the model with a composite loss function.
[0074] The composite loss function combines five terms: the loss of the text-modality subnetwork, the loss of the audio-modality subnetwork, the loss of the video-modality subnetwork, the loss of the context after it passes through its subnetwork, and the multimodal fusion loss that incorporates the context.
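As an illustration only, the composite loss can be sketched as a weighted sum of its five terms. The text does not state how the terms are combined, so the summation, the default weight of 1.0 per term, and the key names are all assumptions.

```python
LOSS_TERMS = ("text", "audio", "video", "context", "fusion")

def composite_loss(losses, weights=None):
    """Combine the five loss terms as a weighted sum (assumed form).

    losses maps each name in LOSS_TERMS to a scalar loss value;
    weights, if given, maps the same names to scalar coefficients.
    """
    if weights is None:
        weights = {name: 1.0 for name in LOSS_TERMS}
    return sum(weights[name] * losses[name] for name in LOSS_TERMS)
```

Keeping the unimodal and context terms in the objective means each subnetwork receives its own training signal in addition to the signal from the fused prediction.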
[0075] Embodiment 3: This embodiment provides a computer device comprising a storage medium, a processor, and a computer program stored in the storage medium and executable by the processor. When executed by the processor, the computer program implements the steps of the sentiment analysis method based on multimodal progressive fusion and enhancement networks described in Embodiment 1 or Embodiment 2.
[0076] Obviously, the above embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the implementation of the present invention. Those skilled in the art can make other variations or modifications based on the above description. It is neither necessary nor possible to exhaustively describe all embodiments here. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention should be included within the scope of protection of the claims of the present invention.
Claims
1. A sentiment analysis method based on multimodal progressive fusion and enhancement networks, characterized in that the method includes:
S1: Extract the original features of text, audio and video sample data, and perform temporal and dimensional alignment and encoding to obtain the initial sequence features of each modality;
S2: Input the initial sequence features of each modality into a multimodal progressive fusion module and execute a fusion process containing N layers in parallel, wherein each layer includes a preliminary feature interaction stage and a subsequent feature fusion stage, and by stacking N layers, the fused features of each modality after deep cross-modal interaction are output;
S3: Input the fused features of each modality after deep cross-modal interaction into a multimodal enhancement module, wherein, through stacked enhancement layers, the features of each modality are further optimized to obtain the enhanced multimodal features of each modality;
S4: Based on the distribution differences between the initial sequence features of each modality, calculate the fusion weight between the initial sequence features of each modality and the enhanced multimodal features;
S5: Using the fusion weights, perform weighted integration of the pooled single-modal features and the enhanced fusion features to obtain the final multimodal fusion representation;
S6: Based on the final multimodal fusion representation, perform sentiment classification to obtain the classification result.
2. The sentiment analysis method based on multimodal progressive fusion and enhancement networks according to claim 1, characterized in that, in step S1, extracting the original features of text, audio, and video sample data and performing temporal and dimensional alignment and encoding to obtain the initial sequence features of each modality includes: using a pre-trained text feature extraction model, audio feature extraction model, and video feature extraction model to extract the original features of the text, audio, and video sample data, respectively; performing one-dimensional convolution on the original features corresponding to audio and video to align them with the text sequence length; mapping the original features corresponding to the three modalities to the same dimension through linear layers; and learning the global relationships between feature sequences with a Transformer encoder to obtain the initial sequence features x_m of each modality, where m denotes the text, audio, or video modality.
3. The sentiment analysis method based on multimodal progressive fusion and enhancement networks according to claim 1, characterized in that, in step S2, the preliminary feature interaction stage is: using a sequence modeling network with shared parameters to implicitly align the features of all modalities; and the subsequent feature fusion stage is: using the target modal feature of the current branch as the query and the other modal features after implicit alignment as the key and value to perform cross-modal attention calculation to achieve explicit alignment, and then fusing the aligned information with the target modal feature.
4. The sentiment analysis method based on multimodal progressive fusion and enhancement networks according to claim 3, characterized in that the implicit alignment is achieved through a bidirectional state-space model with a shared transition matrix. In this process, the features of the (k-1)-th layer are processed using a self-attention mechanism (SA) and the bidirectional state-space model operation with the shared transition matrix to obtain the features after coarse-grained alignment.
5. The sentiment analysis method based on multimodal progressive fusion and enhancement networks according to claim 3, characterized in that using the target modal feature of the current branch as the query and the other modal features after implicit alignment as the key and value to perform cross-modal attention calculation to achieve explicit alignment, and then fusing the aligned information with the target modal feature, includes: for any target modality m, calculating its cross-modal attention with two auxiliary modalities j and k, where j and k are the two distinct modalities that remain in the set of three modalities after the target modality m is excluded. In the cross-modal attention calculation, the query is derived from the target modality features and the key and value are derived from the auxiliary modality features after the implicit alignment, each through learnable parameters, and a scaling factor is applied; the result is the fine-grained alignment feature from auxiliary modality j to target modality m. Similarly, the alignment feature from the other auxiliary modality k to the target modality m is obtained. Subsequently, fusion is completed through the fusion enhancement module, using a linear projection layer and a concatenation operation, to obtain the fused feature output of the current layer centered on modality m.
6. The sentiment analysis method based on multimodal progressive fusion and enhancement networks according to claim 1, characterized in that, in step S3, each enhancement layer in the multimodal enhancement module sequentially performs feature transformation based on a bidirectional state-space model and feature transformation based on a feedforward network, and after each transformation, residual connection and layer normalization operations are performed. That is, the output of the (k-1)-th enhancement layer is passed through the bidirectional state-space model operation, followed by a residual connection and layer normalization; the result is then passed through the feedforward network operation, again followed by a residual connection and layer normalization, giving the output of the k-th enhancement layer.
7. The sentiment analysis method based on multimodal progressive fusion and enhancement networks according to claim 1, characterized in that the method for calculating the fusion weight includes: obtaining, through a variational autoencoder, the prior distribution of the initial features x_m of each single modality; calculating the KL divergence between the prior distributions for every pair of distinct modalities; inputting the average value of the KL divergence into the Sigmoid function to obtain the modality difference score, which is then used as the fusion weight; performing average pooling separately on the enhanced multimodal features and on the single-modal initial features x_m; and, using the modality difference score (score), weighting and integrating the pooled features to obtain the final multimodal fusion representation f, wherein the integration involves a gate function, which filters redundant information, a concatenation operation, and an average pooling operation.
8. The sentiment analysis method based on multimodal progressive fusion and enhancement networks according to claim 1, characterized in that, before step S1, the method further includes: performing context modeling to extract contextual data features associated with the sample data; and step S6 further includes: fusing the contextual data features with the multimodal fusion representation and then performing sentiment classification to obtain the classification result.
9. The sentiment analysis method based on multimodal progressive fusion and enhancement networks according to claim 8, characterized in that the method further includes: training the model with a composite loss function, wherein the composite loss function combines the loss of the text-modality subnetwork, the loss of the audio-modality subnetwork, the loss of the video-modality subnetwork, the loss of the context after it passes through its subnetwork, and the multimodal fusion loss that incorporates the context.
10. A computer device, characterized in that the device includes a storage medium, a processor, and a computer program stored in the storage medium and executable by the processor, wherein the computer program, when executed by the processor, implements the steps of the sentiment analysis method based on multimodal progressive fusion and enhancement networks as described in any one of claims 1 to 9.