A sentiment analysis method and device, electronic equipment and storage medium

By calculating the similarity between aspect words and target regions in text and images, and utilizing multi-head interactive attention mechanism and low-rank bilinear pooling, the problem of insufficient fusion of image and text features in existing models is solved, thereby improving the accuracy and reliability of sentiment analysis.

CN116541520B · Active · Publication Date: 2026-06-16 · EAST CHINA UNIV OF SCI & TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
EAST CHINA UNIV OF SCI & TECH
Filing Date
2023-03-20
Publication Date
2026-06-16


Abstract

Embodiments of the present application relate to the technical field of natural language processing, and disclose a sentiment analysis method and device, electronic equipment and a storage medium. In the present application, text in a data set and pictures corresponding to the text are obtained; the text contains at least one aspect word; the aspect word is part of a sentence in the text; at least one target region is obtained from the pictures; global similarity between the aspect word and the text and local similarity between the aspect word and the target region are calculated respectively, and a corresponding relationship between the aspect word and the target region is calculated according to the local similarity and the global similarity; and the sentiment polarity corresponding to the aspect word is determined according to the corresponding relationship and the text. Through the above manner, most visual noise can be filtered, and local information useful for sentiment analysis can be captured at the same time, picture noise information can be effectively filtered, sufficient information interaction can be performed at the text and picture fine granularity, and the determination of the sentiment polarity of the aspect word is accurately and reliably realized.

Description

Technical Field

[0001] This invention relates to the field of natural language processing technology, and in particular to a sentiment analysis method, apparatus, electronic device, and storage medium.

Background Technology

[0002] With the rise of online e-commerce platforms for lifestyle services, merchants on these platforms are committed to connecting consumers and merchants through artificial intelligence, striving to provide consumers with a higher quality experience. Restaurants and hotels, as core businesses of these platforms, meet users' consumption needs for food, accommodation, and entertainment while traveling. In serving millions of merchants and hundreds of millions of users, these platforms have accumulated massive amounts of user review data. As these platforms mature, users are increasingly using attached images to express their genuine experiences and opinions, making images a key data type for emotional expression. Effectively extracting key emotional polarities and opinions from rich text and image content can not only assist more users in making consumption decisions but also help merchants collect user feedback on their products to improve service quality and thus enhance their business performance.

[0003] Due to the emergence of a large amount of data from different modalities, multimodal aspect-level sentiment analysis has received increasing attention. Several deep learning methods have emerged for this task in recent years. Inspired by the advantage of attention mechanisms in capturing contextual information in other natural language processing tasks, Yu, Xu, and Liu designed different effective attention mechanisms to model the interaction between aspect words, text, and images. Yu and Jiang designed a model called TomBERT, combining pre-training and fine-tuning to adapt the existing pre-trained language model BERT to capture the interaction between text and images, achieving excellent results. Yu et al. proposed a fine-tuning method based on multimodal cues to solve sentiment prediction tasks at different granularities. Zhao et al. helped the model align text and images by extracting adjective-noun pairs from images. Fu proposed a Transformer-based model that translates images into auxiliary sentences, combining the original and auxiliary sentences for targeted sentiment classification. Yu et al. designed a hierarchical interactive multimodal transformer to capture the interaction information between text and images and eliminate the semantic differences between them. Ju et al. proposed an end-to-end approach to jointly extract aspect words and their sentiment polarities.

[0004] The inventors discovered at least the following problems in the related technologies: existing sentiment analysis models, including the above models, fuse the overall features of images and text without sufficient information interaction and cannot effectively filter out image noise information, which greatly reduces the accuracy of sentiment analysis for aspect words.

Summary of the Invention

[0005] The purpose of this invention is to provide a sentiment analysis method, device, electronic device, and storage medium that effectively filters image noise by utilizing the strong correspondence between aspect words and local image information, and performs fine-grained image-text fusion interaction, thereby improving the accuracy of sentiment classification based on aspect words.

[0006] To address the aforementioned technical problems, embodiments of the present invention provide a sentiment analysis method, comprising: acquiring text and images corresponding to the text from a dataset; wherein the text contains at least one aspect word; the aspect word is part of a sentence in the text; acquiring at least one target region from the image; calculating the global similarity between the aspect word and the text, and the local similarity between the aspect word and the target region, respectively; calculating the correspondence between the aspect word and the target region based on the local similarity and the global similarity; and determining the sentiment polarity corresponding to the aspect word based on the correspondence and the text.

[0007] An embodiment of the present invention also provides a sentiment analysis device, comprising: a data acquisition module, configured to acquire text and an image corresponding to the text from a dataset; wherein the text contains at least one aspect word; and to acquire at least one target region from the image; a data alignment module, configured to calculate the global similarity between the aspect word and the text, and the local similarity between the aspect word and the target region, and to calculate the correspondence between the aspect word and the target region based on the local similarity and the global similarity; and a sentiment analysis module, configured to determine the sentiment polarity corresponding to the aspect word based on the correspondence and the text.

[0008] Embodiments of the present invention also provide an electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the sentiment analysis method as described above.

[0009] Embodiments of the present invention also provide a computer-readable storage medium storing a computer program that, when executed by a processor, implements the above-described sentiment analysis method.

[0010] In this embodiment of the invention, text and corresponding images are obtained from a dataset; wherein the text contains at least one aspect word; the aspect word is part of a sentence in the text; at least one target region is obtained from the image; the global similarity between the aspect word and the text, and the local similarity between the aspect word and the target region are calculated respectively; the correspondence between the aspect word and the target region is calculated based on the local similarity and the global similarity; the sentiment polarity corresponding to the aspect word is determined based on the correspondence and the text. This method can filter out most visual noise while capturing local information useful for sentiment analysis. Because it emphasizes these local similarities and uses a reliable fine-grained alignment mechanism to effectively filter image noise, and because it does not completely use local similarity as the sole parameter when calculating the correspondence, it allows for sufficient information exchange between the text and image at a fine-grained level, avoiding overemphasis on local information and interference with the judgment of other aspect words, thus achieving a correct, accurate, and reliable judgment of the sentiment polarity of aspect words.

[0011] Furthermore, the step of calculating the correspondence between the aspect words and the visual features based on the local similarity and the global similarity includes: applying a confidence constraint to the local similarity based on the global similarity, and using the constrained local similarity to perform multi-layer self-attention calculation to obtain the correspondence between the aspect words and the visual features. Combining global similarity with local similarity makes the obtained similarity as a judgment of correspondence more accurate and reliable, increases the correlation after fine-grained alignment, accurately describes the detailed correspondence between aspect words and target regions, and performs visual semantic alignment between different modalities.

[0012] Furthermore, determining the sentiment polarity corresponding to the aspect word based on the correspondence and the text includes: calculating the multimodal vector corresponding to the aspect word through a multi-head interactive attention mechanism based on the context of the aspect word in the text and the correspondence; and inputting the multimodal vector into a normalized exponential function to determine the sentiment polarity corresponding to the aspect word. Based on fine-grained alignment between aspect words and target regions, the multimodal vector allows for full interaction, collaboration, and complementarity among aspect words, text, visual objects, and complete image information, primarily achieved through a multi-head attention mechanism.

[0013] Furthermore, the step of calculating the multimodal vector corresponding to the aspect word through a multi-head interaction attention mechanism based on the context corresponding to the aspect word in the text and the correspondence includes: obtaining the target region corresponding to the aspect word according to the correspondence; calculating the cross-modal fine-grained interaction information between the aspect word and the image, and between the target region corresponding to the aspect word and the context corresponding to the aspect word in the text, through a low-rank bilinear pooling method; fusing the cross-modal fine-grained interaction information between the aspect word and the image and the cross-modal fine-grained interaction information between the target region corresponding to the aspect word and the context corresponding to the aspect word in the text; and calculating the multimodal vector corresponding to the aspect word based on the fusion result. Compared to simple feature concatenation, this method, through the cross-modal fine-grained interactions between aspect words and images, and between visual entities and text, highlights the higher-order interactions between various information types.

[0014] In addition, the step of extracting at least one target region from the image includes: extracting multiple image regions from the image using a convolutional neural network model, and selecting at least one target region from the multiple image regions using a trained object detection model.

[0015] Furthermore, the convolutional neural network model is a residual network model. It exhibits good performance in image processing tasks and is able to capture high-level features useful for the task.

[0016] In addition, the value calculated using the cross-entropy loss function is used to determine whether the sentiment polarity corresponding to the aspect word is accurate. If the value calculated using the cross-entropy loss function is less than a preset threshold, the determination is considered accurate.

Attached Figure Description

[0017] One or more embodiments are illustrated by way of example with reference numerals in the accompanying drawings. These illustrations do not constitute a limitation on the embodiments. Elements with the same reference numerals in the drawings are denoted as similar elements. Unless otherwise stated, the figures in the drawings are not to be limited by scale.

[0018] Figure 1 is a flowchart of a sentiment analysis method according to an embodiment of the present invention;

[0019] Figure 2 is a data comparison diagram illustrating the technical effects produced by one embodiment of the present invention and other methods in the art;

[0020] Figure 3 is a data comparison diagram illustrating the technical effects produced by adjusting the value of k according to an embodiment of the present invention;

[0021] Figure 4 is a schematic diagram of a sentiment analysis device according to another embodiment of the present invention;

[0022] Figure 5 is a schematic diagram of the structure of an electronic device provided according to another embodiment of the present invention.

Detailed Implementation

[0023] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the various embodiments of the present invention will be described in detail below with reference to the accompanying drawings. However, those skilled in the art will understand that many technical details are presented in the various embodiments of the present invention to facilitate a better understanding of this application. However, the technical solutions claimed in this application can be implemented even without these technical details and various changes and modifications based on the following embodiments. The division of the various embodiments below is for ease of description and should not constitute any limitation on the specific implementation of the present invention. The various embodiments can be combined with and referenced by each other without contradiction.

[0024] One embodiment of the present invention relates to a sentiment analysis method that can be applied to terminal devices such as mobile phones and computers. In this embodiment, text and corresponding images are acquired from a dataset; wherein the text contains at least one aspect word; the aspect word is part of a sentence in the text; at least one target region is acquired from the image; the global similarity between the aspect word and the text, and the local similarity between the aspect word and the target region are calculated respectively; the correspondence between the aspect word and the target region is calculated based on the local and global similarities; the sentiment polarity corresponding to the aspect word is determined based on the correspondence and the text. This method can filter out most visual noise while capturing local information useful for sentiment analysis. A reliable fine-grained alignment mechanism effectively filters image noise information, and the fine-grained interaction between text and image allows for sufficient information exchange, enabling correct, accurate, and reliable determination of the sentiment polarity of aspect words. The implementation details of the sentiment analysis method of this embodiment are described below. The following content is only for ease of understanding and is not essential for implementing this solution.

[0025] As shown in Figure 1, in step 101, a terminal device such as a mobile phone or computer acquires the text and the corresponding images from the dataset. To clarify the object of sentiment analysis in this application, consider a sample as it appears after sentiment analysis, which is divided into three parts: image, text, and sentiment prediction. The dataset acquired in this step contains only the image and text parts. The image part holds the image corresponding to the text, and the text part holds text composed of sentences, each containing at least one aspect word that is part of the sentence. The sentiment prediction part holds the prediction result generated by the method of this application. Suppose the result is Positive; this result is a sentiment judgment for the aspect word. In this example, the sentiment tendency of the text toward the aspect word is neutral, but the image region containing the aspect word shows a certain positive sentiment tendency, so the analysis result is positive. If the content described by an aspect word shows no obvious sentiment tendency in the image and merely has matching elements, the sentiment analysis result is Neutral, and such a result is considered accurate.

[0026] In a specific example, when training the model, this application is given a training sample dataset D, where each sample d∈D contains a sentence T = {w_1, w_2, w_3, ..., w_n} with n words, an image I related to the sentence, and an aspect term A = {w_a, w_{a+1}, w_{a+2}, ..., w_{a+m-1}} containing m words. The aspect term is part of sentence T, where a is the position of the starting word of the aspect term in the text. Each aspect term in the sample has a corresponding sentiment polarity label y, y∈{positive, negative, neutral}. The task is to use D as the training dataset to train a model that can accurately determine the sentiment polarity of aspect term A in a sample based on T and I.
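As an illustration of the sample structure described above, the following minimal Python sketch models one sample d with its sentence T, image I, aspect term A, and label y. The class name `Sample`, the field names, the use of a file path for the image, the 0-based start index, and the example review are all assumptions made for illustration; they are not taken from the application.

```python
from dataclasses import dataclass
from typing import List

# The three sentiment polarity labels named in the text.
LABELS = ("positive", "negative", "neutral")


@dataclass
class Sample:
    tokens: List[str]   # sentence T with n words
    image_path: str     # image I (assumed here to be stored as a file path)
    aspect_start: int   # a: start position of the aspect term (0-based here)
    aspect_len: int     # m: number of words in the aspect term
    label: str          # y: sentiment polarity of the aspect term

    def __post_init__(self) -> None:
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label}")
        end = self.aspect_start + self.aspect_len
        if self.aspect_start < 0 or self.aspect_len < 1 or end > len(self.tokens):
            raise ValueError("aspect term must lie inside the sentence")

    @property
    def aspect(self) -> List[str]:
        # The aspect term A is a contiguous slice of the sentence T.
        end = self.aspect_start + self.aspect_len
        return self.tokens[self.aspect_start:end]


# Hypothetical example sample, not drawn from any real dataset.
sample = Sample(
    tokens="The pasta at this place was served cold".split(),
    image_path="review_001.jpg",
    aspect_start=1,
    aspect_len=1,
    label="negative",
)
```

Storing only the start position and length keeps the aspect term tied to its position in the sentence, which matters when the same word occurs more than once.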

[0027] In step 102, a terminal device such as a mobile phone or computer obtains at least one target region from the image. Image processing tasks require visual features that are useful for the task. These visual features reside in the target region; in other words, the calculated visual features help locate regions in the image, such as the multiple image regions framed by bounding boxes, among which the target region is the one useful for the task.

[0028] In one example, a method for extracting at least one target region from an image includes: using a convolutional neural network (CNN) model to extract multiple image regions from the image, where the CNN model can employ a more advanced ResNet model; and then using a trained object detection model to select at least one target region from the multiple image regions.

[0029] Deep CNN models perform well in most image processing tasks and capture high-level features useful for the task. In the specific example above, this application obtains a vectorized global representation of the text before processing the image. A pre-trained GloVe word embedding matrix provides a fixed initial word embedding vector for each word. Let the word embedding matrix be M, where d is the dimension of the word vector and the number of rows equals the dictionary size; each word in the text corresponds to a row in matrix M. The sentence is thereby converted into a sequence of n word vectors. This sequence is fed into a bidirectional LSTM to capture contextual dependencies, and the hidden states of the last layer are used as the final text vector representation. If an aspect term consists of multiple words, the average of the word embeddings of all its words is taken as the final vector representation of the aspect term. Based on these hidden states, a widely used attention mechanism is further employed to compute the global representation of the text, with a query vector supplied to the attention mechanism. The calculation process is as follows:

[0030]

[0031]

[0032]

[0033]

[0034]

[0035]

[0036] where the two hidden-state terms are the forward and backward hidden states of the k-th layer of the bidirectional LSTM, respectively, and each attention weight is the normalized similarity between a hidden state and the query vector.
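The text-side computation can be sketched as follows: the aspect term vector is the mean of its word embeddings, and the global text representation is an attention-weighted sum of the hidden states. This is a minimal sketch under stated assumptions: dot-product scoring followed by softmax is assumed for the normalized similarity, and the function names `mean_vector` and `attention_pool` are hypothetical. The scoring function of the application itself may differ.

```python
import math
from typing import List, Tuple

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    # Average the word embeddings of a multi-word aspect term.
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def softmax(scores: List[float]) -> List[float]:
    # Subtract the maximum for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states: List[Vector], query: Vector) -> Tuple[Vector, List[float]]:
    # Assumed scoring: dot product between each hidden state and the query.
    scores = [sum(h * q for h, q in zip(state, query)) for state in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    pooled = [
        sum(w * state[j] for w, state in zip(weights, hidden_states))
        for j in range(dim)
    ]
    return pooled, weights
```

The weights sum to one, so the pooled vector stays in the same range as the hidden states regardless of sentence length.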

[0037] After the global vector representation of the text is obtained, the visual representation of the image is processed. When processing the input image I, this application first resizes it to a fixed 224*224 format to meet the network input requirements. The resized image is then fed into the ResNet model, and the output of the model's last convolutional layer is used as the image's visual feature representation. A linear transformation then projects the visual features into the same space as the aforementioned text features. The calculation process is as follows:

[0038]

[0039]

[0040] Here, 49 is the number of image regions. However, an aspect term is strongly consistent with the objects in the image and unrelated to the other regions. Applying an attention mechanism across all regions not only introduces noise but also makes it harder for the model to extract useful features from the image. Therefore, to extract object-level image information, this application uses a pre-trained Faster R-CNN object detection model to detect salient regions in the image. Typically, only the more salient regions in an image are related to the textual information; therefore, only the top k image regions with the highest classification scores are selected, or more precisely, the top k visual entity regions after non-maximum suppression. ResNet is then used to encode the detected visual regions, and a linear projection transforms the encoded regions into the same vector space as the text, yielding the final fine-grained representation of image I. The max pooling result of this fine-grained representation is used as the global representation of the image.
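The selection of the top k regions after non-maximum suppression can be sketched as below. This is an illustrative greedy NMS over detector outputs, not the detector itself; the box format `(x1, y1, x2, y2)`, the function names, and the default IoU threshold of 0.5 are assumptions.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    # Intersection over union of two axis-aligned boxes.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def top_k_regions(boxes: List[Box], scores: List[float], k: int,
                  iou_threshold: float = 0.5) -> List[int]:
    # Visit boxes from highest to lowest classification score and keep a box
    # only if it does not overlap too much with any box already kept.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        if len(kept) == k:
            break
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7, 0.3]
selected = top_k_regions(boxes, scores, k=2)
```

In this toy input the second box overlaps the first with an IoU of 0.81, so it is suppressed and the third box is selected instead.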

[0041] The above text has provided a specific example of how to obtain the target region in the image in this application. The calculated text global representation and image global representation will play an important role in the detailed correspondence between terms and target regions in the following text.

[0042] In step 103, terminal devices such as mobile phones and computers calculate the global similarity between aspect words and text, and the local similarity between aspect words and target regions, respectively. The correspondence between aspect words and target regions is calculated based on the local and global similarities. This application uses a strong correspondence between aspect words and local image information rather than global information to reduce image noise. In one example, the method for calculating the correspondence between aspect words and visual features based on the local and global similarities can be: applying a confidence constraint to the local similarity based on the global similarity, and using the constrained local similarity to perform multi-layer self-attention calculation to obtain the correspondence between aspect words and visual features.

[0043] To describe the detailed correspondence between aspect terms and visual regions, we continue with the specific example above to show how the local similarity is confidence-constrained based on the global similarity, and how the constrained local similarity is then passed through multi-layer self-attention to obtain the correspondence between the aspect terms and the visual features. This application uses a normalized distance-based representation to reflect the semantic similarity between heterogeneous modalities. The local semantic similarity between a specific image region and the aspect word is calculated as follows:

[0044]

[0045] where the weight matrix is a learnable parameter matrix and p is a hyperparameter. This application further measures the global semantic similarity between the entire image and the full text; likewise, its weight matrix is a learnable parameter matrix.

[0046]
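One common form of a normalized distance-based similarity vector is sketched below. It is offered only as an assumption about what such a representation can look like: the squared element-wise difference of the two vectors is projected by a p×d matrix and L2-normalized. The function name `similarity_vector` and the exact formula are not confirmed by the text; the same function would be applied with separate matrices for the local (region, aspect word) and global (image, text) cases.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def similarity_vector(x: Vector, y: Vector, weight: Matrix) -> Vector:
    # Squared element-wise distance between the two d-dimensional inputs.
    diff_sq = [(xi - yi) ** 2 for xi, yi in zip(x, y)]
    # Project to p dimensions with the (assumed) learnable p x d matrix.
    projected = [sum(w * d for w, d in zip(row, diff_sq)) for row in weight]
    # L2-normalize so the result has unit length (or is all zeros).
    norm = math.sqrt(sum(v * v for v in projected))
    if norm == 0.0:
        return [0.0] * len(projected)
    return [v / norm for v in projected]


# Toy inputs with d = 3 and p = 2.
region = [1.0, 0.0, 2.0]
aspect = [0.0, 0.0, 0.0]
w_local = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
s_local = similarity_vector(region, aspect, w_local)
```

Representing similarity as a p-dimensional vector rather than a single scalar leaves room for the later confidence and self-attention steps to weigh individual dimensions.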

[0047] In this specific example, the normalized similarity between the global semantic similarity and each local semantic similarity is used as the matching confidence. The calculation method is as follows:

[0048]

[0049]

[0050] where the weight is a learnable parameter vector, the product of two vectors is taken element-wise, the gating function is the sigmoid activation function, and LayerNorm denotes the normalization operation. The key idea behind this confidence is to measure how much of the semantic similarity between an aspect word and a visual region is contained in the overall semantic similarity of the image and text; in other words, whether the visual region truly describes the aspect word when viewed from a global perspective. To filter out unreliable matches between visual regions and aspect words, the similarity of each visual region is multiplied by its corresponding confidence. The global semantic similarity and the confidence-constrained local similarities are then collected together as follows:

[0051]
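The confidence constraint can be sketched as follows. The exact formula is an assumption pieced together from the operations the text names (element-wise product, LayerNorm, a learnable vector, sigmoid): the confidence of each region is taken as sigmoid(w · LayerNorm(s_global ⊙ s_local)), and the collected result places the global similarity first, followed by the k gated local similarities. The function names and the epsilon value are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def layer_norm(v: Vector, eps: float = 1e-5) -> Vector:
    # Normalize to zero mean and unit variance (no learnable scale or shift).
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def sigmoid(x: float) -> float:
    # Numerically stable for large negative inputs.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def constrain_local(global_sim: Vector, local_sims: List[Vector],
                    w: Vector) -> Tuple[List[Vector], List[float]]:
    collected = [list(global_sim)]  # global similarity goes first
    confidences = []
    for local in local_sims:
        product = [g * l for g, l in zip(global_sim, local)]
        score = sum(wi * ni for wi, ni in zip(w, layer_norm(product)))
        confidence = sigmoid(score)
        confidences.append(confidence)
        # Gate the local similarity by its matching confidence.
        collected.append([confidence * l for l in local])
    return collected, confidences


collected, confidences = constrain_local(
    [0.6, 0.8], [[0.6, 0.8], [0.8, -0.6]], [0.5, -0.5]
)
```

Because the confidence lies strictly between 0 and 1, a region that disagrees with the global image-text similarity is attenuated rather than removed outright.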

[0052] Multi-layer self-attention computation is then performed on the collected similarities to enhance fine-grained information alignment between modalities:

[0053]

[0054] where the first two parameter matrices transform the query and key vectors at layer l, and a further parameter matrix maps the output dimension to one suitable for the input of layer l+1. Column-wise max pooling is then performed on the last k columns of the last layer's output; the index q of the maximum value of the pooled result is taken, and the feature representation of the image region with the same index is extracted. This is the output of the alignment module.
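The final selection step can be sketched as below, assuming the last layer's output is a matrix whose columns correspond to the global similarity followed by the k regions. That column layout, the function name `select_aligned_region`, and the toy values are assumptions made for illustration.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def select_aligned_region(output: Matrix,
                          region_features: List[Vector]) -> Tuple[int, Vector]:
    k = len(region_features)
    # Column-wise max pooling over the last k columns of the output matrix.
    pooled = [max(row[-k:][j] for row in output) for j in range(k)]
    # Index q of the largest pooled value picks the aligned region.
    q = max(range(k), key=lambda j: pooled[j])
    return q, region_features[q]


# Toy output with 2 rows and k + 1 = 4 columns (column 0 is the global entry).
toy_output = [
    [0.1, 0.2, 0.9, 0.3],
    [0.5, 0.4, 0.1, 0.6],
]
toy_regions = [[1.0], [2.0], [3.0]]
q, aligned_feature = select_aligned_region(toy_output, toy_regions)
```

Selecting a single region by argmax is what makes the alignment hard rather than soft: all other regions are discarded before the fusion stage.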

[0055] At this point, fine-grained visual-semantic alignment across the different modalities has been achieved, yielding the detailed correspondence between aspect terms and target regions.

[0056] In step 104, a terminal device such as a mobile phone or computer determines the sentiment polarity corresponding to the aspect words based on the correspondence and the text.

[0057] Since fine-grained alignment has been achieved through steps 101-103 above, vector representations could be generated directly and fed into the sentiment analysis model, which primarily uses the Softmax function to analyze the sentiment polarity of the aspect words in conjunction with the text. However, such a direct judgment is not very accurate, so the correspondence needs to be further combined with the text. In one example, the sentiment polarity of the aspect words is determined from the correspondence and the text as follows: based on the context of the aspect words in the text and the correspondence, the multimodal vector corresponding to the aspect words is calculated through a multi-head interactive attention mechanism; the multimodal vector is then input into a normalized exponential function to determine the sentiment polarity of the aspect words.

[0058] Based on fine-grained alignment of aspect terms and visual objects, and by allowing aspect terms, text, visual objects, and complete image information to fully interact, collaborate, and complement each other, we can effectively utilize textual contextual relationships for analysis, integrate information, and then comprehensively determine sentiment polarity. Jointly modeling the context of aspect terms is crucial for extracting relevant sentiment information; therefore, this application employs an attention mechanism to determine which parts of the text to focus on.

[0059] In one example, the step of calculating the multimodal vector corresponding to the aspect word through a multi-head interaction attention mechanism based on the context corresponding to the aspect word in the text and the correspondence includes: obtaining the target region corresponding to the aspect word according to the correspondence; calculating the cross-modal fine-grained interaction information between the aspect word and the image, and between the target region corresponding to the aspect word and the context corresponding to the aspect word in the text, through a multi-head interaction attention mechanism; fusing the cross-modal fine-grained interaction information between the aspect word and the image and the cross-modal fine-grained interaction information between the target region corresponding to the aspect word and the context corresponding to the aspect word in the text through a low-rank bilinear pooling; and calculating the multimodal vector corresponding to the aspect word based on the fusion result.

[0060] To facilitate understanding, we continue with the specific example described above. The alignment step outputs the feature representation of the aligned image region. Together with the corresponding complete image information R, it is used in an attention mechanism that helps the model focus only on visual patches related to visual entities. The formula for calculating the relevance score in the attention mechanism is as follows:

[0061]

[0062]

[0063]

[0064] where all parameters are trainable; this yields the attention score vector for the text. Similarly, the attention score vector for the image is calculated with R. Based on the two attention score vectors, the final vector representations of the text context and the visual context are calculated separately:

[0065]

[0066]

[0067] Cross-modal fine-grained interactions between aspect words and images, and between visual entities and text, can be achieved in various ways. Although many advanced methods use simple feature concatenation, this application does not adopt that approach because it ignores the higher-order interactions between them. Instead, this application uses a multi-head interaction attention mechanism to compute the cross-modal interaction information:

[0068]

[0069]

[0070]

[0071]

[0072] where m is the number of interactive attention heads; each head has its own weight matrices for the query vector, key, and value in the attention mechanism, and a further parameter matrix belongs to the multi-head interactive attention mechanism as a whole. The two results are the cross-modal fine-grained interaction information between aspect words and images, and between visual entities and text, respectively.
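A simplified view of multi-head cross-modal attention is sketched below. One modality supplies the query (for example, the aspect word vector) and the other supplies the keys and values (for example, image region features). To keep the sketch short, the learnable per-head query, key, and value matrices and the output matrix are omitted, so each head simply works on its own slice of the vectors; this simplification, the scaled dot-product scoring, and the function name are assumptions.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_cross_attention(query: Vector, keys: List[Vector],
                               values: List[Vector], num_heads: int) -> Vector:
    dim = len(query)
    if dim % num_heads != 0:
        raise ValueError("vector dimension must be divisible by num_heads")
    head_dim = dim // num_heads
    output: Vector = []
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        q = query[lo:hi]
        # Scaled dot-product scores between the query slice and each key slice.
        scores = [
            sum(a * b for a, b in zip(q, key[lo:hi])) / math.sqrt(head_dim)
            for key in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the value slices for this head.
        head = [
            sum(w * value[lo + j] for w, value in zip(weights, values))
            for j in range(head_dim)
        ]
        output.extend(head)  # concatenate the heads
    return output
```

Calling the function once with the aspect vector against the image features, and once with the aligned region against the text hidden states, would give the two pieces of interaction information that the text describes.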

[0073] Subsequently, low-rank bilinear pooling is used to fuse the two pieces of cross-modal interaction information; this retains the performance of the standard bilinear operator with fewer parameters. The calculation process is as follows:

[0074]

[0075] where all weights are trainable parameters and σ is the nonlinear transformation function tanh. The product symbol in the formula denotes element-wise multiplication of its two operands. The fused result and the other representations calculated above are then combined to yield the final multimodal vector representation.

[0076]
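Low-rank bilinear pooling in its usual Hadamard-product form can be sketched as z = P(tanh(Ux) ⊙ tanh(Vy)) + b. The matrix names U, V, P, the bias b, and the function names are assumptions; the text confirms only that the nonlinearity is tanh, that the product is element-wise, and that all parameters are trainable.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def low_rank_bilinear_pool(x: Vector, y: Vector, U: Matrix, V: Matrix,
                           P: Matrix, b: Vector) -> Vector:
    # Project both inputs to a shared rank-r space and apply tanh.
    hx = [math.tanh(t) for t in matvec(U, x)]
    hy = [math.tanh(t) for t in matvec(V, y)]
    # Element-wise (Hadamard) product captures the pairwise interactions.
    joint = [a * c for a, c in zip(hx, hy)]
    # Project the joint vector to the output dimension and add the bias.
    return [p + bias for p, bias in zip(matvec(P, joint), b)]


# Toy example with rank r = 2 and a one-dimensional output.
z = low_rank_bilinear_pool(
    x=[1.0, 0.0],
    y=[0.0, 1.0],
    U=[[1.0, 0.0], [0.0, 1.0]],
    V=[[0.0, 1.0], [1.0, 0.0]],
    P=[[1.0, 1.0]],
    b=[0.5],
)
```

A full bilinear operator over inputs of sizes d_x and d_y with output size d_o needs d_x·d_y·d_o weights, whereas this factorization needs r·(d_x + d_y + d_o), which is the parameter saving the text refers to.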

[0077] At this point, the fused multimodal vector representation is fed into a softmax function for aspect-level sentiment classification, and the label with the highest probability in the output is taken as the final result. The parameters of this classification layer are learnable.
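The classification step can be sketched as a linear layer followed by softmax and an argmax over the three labels. The label order, the function name `classify`, and the toy weights are assumptions for illustration.

```python
import math
from typing import List, Tuple

# Assumed label order; the application does not specify one.
LABELS = ("positive", "negative", "neutral")


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(multimodal_vector: List[float], weight: List[List[float]],
             bias: List[float]) -> Tuple[str, List[float]]:
    # One logit per label: a learnable linear map of the multimodal vector.
    logits = [
        sum(w * x for w, x in zip(row, multimodal_vector)) + b
        for row, b in zip(weight, bias)
    ]
    probs = softmax(logits)
    # The label with the highest probability is the final result.
    best = max(range(len(probs)), key=lambda i: probs[i])
    return LABELS[best], probs


label, probs = classify(
    [2.0, 1.0],
    weight=[[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],
    bias=[0.0, 0.0, 0.0],
)
```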

[0078] To optimize all parameters in the model, in one example, after validation, the cross-entropy loss function is used to determine the accuracy of the sentiment polarity corresponding to the aspect term. If the value calculated using the cross-entropy loss function is less than a preset threshold, the determination is considered accurate. The cross-entropy loss function used is as follows:

[0079]
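For reference, a minimal sketch of a cross-entropy loss and the threshold check described above is given here. It assumes the standard form averaged over samples; the function names, the small `eps` constant, the example probabilities, and the threshold value of 0.5 are hypothetical.

```python
import numpy as np

def cross_entropy(probs, labels, eps=1e-12):
    """Mean negative log-likelihood of the true labels.

    probs: (n, c) predicted class probabilities, labels: (n,) integer class ids.
    """
    picked = probs[np.arange(len(labels)), labels]  # probability of each true class
    return float(-np.mean(np.log(picked + eps)))

def is_accurate(loss, threshold):
    # The determination counts as accurate when the loss is below the preset threshold.
    return loss < threshold

probs = np.array([[0.1, 0.2, 0.7],
                  [0.8, 0.1, 0.1]])
labels = np.array([2, 0])
loss = cross_entropy(probs, labels)   # -(ln 0.7 + ln 0.8) / 2
accurate = is_accurate(loss, threshold=0.5)
```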

[0080] To verify the technical effect of this application, two real-world datasets, TWITTER-2015 and TWITTER-2017, are used for evaluation. They mainly contain multimodal user posts published in 2014-2015 and 2016-2017, respectively, and every aspect word belongs to one of four categories: person, place, organization, or other. Each sample includes text and a corresponding image, and is labeled with the target aspect words and the sentiment of the text and image toward each aspect word. Sentiment is annotated with three classes, and the data are divided into training, validation, and test sets in a 3:1:1 ratio. The table below shows the distribution of sentiment labels in the datasets.

[0081]

[0082] Figure 2 compares the accuracy of the proposed model with that of the baseline models. All experiments were run five times and the results averaged to reduce the effect of randomness during training and give a more objective picture of model performance. On both datasets, the TFGA model outperforms most baseline models in both ACC and F1. This is because TFGA performs fine-grained alignment of text and images and fully fuses the text, aspect words, images, and visual objects, which weakens the influence of image noise and lets the model extract useful key information. TD-LSTM, which models the context of the aspect word in the text separately, achieves only limited performance, indicating that the local context of aspect words cannot be ignored in sentiment analysis. Adding the visual modality improves performance to some extent, showing that images can indeed support the text and provide supplementary information. The Res-aspect model performs poorly, mainly because it does not make good use of contextual information. TomBERT performs better than the modified TomLSTM, which is reasonable because TomBERT uses a pre-trained language model whose feature extraction ability is superior to that of an LSTM. The MIMN model uses an attention mechanism to model the interaction between text and images and outperforms most models. However, MIMN fuses the complete image and text information into the final vector representation of the aspect word and then applies a widely used attention mechanism over these hidden states, which introduces noise from the image. Its performance is therefore inferior to that of the model in this application, which demonstrates the necessity of fine-grained alignment.
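The evaluation protocol above (ACC and F1 averaged over five runs) can be sketched as follows. The description does not state which F1 variant is used, so macro-averaged F1 is assumed here; the label arrays are toy values for illustration and are not experimental results.

```python
import numpy as np

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

def macro_f1(y_true, y_pred, n_classes=3):
    # Unweighted mean of per-class F1 scores (assumed variant).
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(f1s))

# Toy ground truth and predictions from five hypothetical training runs.
y_true = np.array([0, 1, 2, 2])
imperfect = np.array([0, 1, 2, 1])
perfect = np.array([0, 1, 2, 2])
run_preds = [imperfect, perfect, perfect, imperfect, perfect]

acc_mean = float(np.mean([accuracy(y_true, p) for p in run_preds]))
f1_mean = float(np.mean([macro_f1(y_true, p) for p in run_preds]))
```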

[0083] To further demonstrate the effectiveness of the fine-grained alignment step in this application, TomLSTM, TomLSTM+align, and TFGA were compared on sentiment classification over the image-target matching dataset, which was randomly sampled from TWITTER-2017 as proposed in the paper.

[0084] The experimental results are shown in the table below:

[0085]

[0086] First, the results show that the TFGA model outperforms the other two models, indicating that the fine-grained alignment mechanism proposed in this application is advantageous for aligning visual regions with aspect words, which helps improve accuracy on the multimodal aspect-based sentiment analysis (MABSA) task. Second, TomLSTM+align performs worse than TomLSTM. We speculate that this is because the visual features obtained using ResNet contain less visual target information and introduce some noise into the alignment process.

[0087] For the TFGA model described above, this application also compares model performance when different numbers of visual regions are extracted from the image. As shown in Figure 3, model accuracy increases with the parameter k (the number of visual regions) and peaks at k=8; as k increases further, accuracy gradually decreases. This is because most samples in the dataset contain no more than 4 aspect words, so excessively large values of k introduce noise and degrade performance.

[0088] In this embodiment, text and corresponding images are obtained from the dataset; wherein the text contains at least one aspect term; the aspect term is part of a sentence in the text; at least one target region is obtained from the image; the global similarity between the aspect term and the text, and the local similarity between the aspect term and the target region are calculated respectively, and the correspondence between the aspect term and the target region is calculated based on the local similarity and the global similarity; the sentiment polarity corresponding to the aspect term is determined based on the correspondence and the text. This method can filter out most visual noise while capturing local information useful for sentiment analysis. A reliable fine-grained alignment mechanism effectively filters image noise information, and fine-grained text-image interaction allows for sufficient information exchange, enabling correct, accurate, and reliable determination of the sentiment polarity of aspect terms.

[0089] The steps described above are for clarity only. In practice, they can be combined into one step or some steps can be broken down into multiple steps. As long as they include the same logical relationship, they are all within the scope of protection of this patent. Adding insignificant modifications or introducing insignificant designs to the algorithm or process, but without changing the core design of the algorithm and process, are also within the scope of protection of this patent.

[0090] Another embodiment of the present invention relates to a sentiment analysis device. As shown in Figure 4, the device includes: a data acquisition module 501, configured to obtain text and images corresponding to the text from a dataset, wherein the text contains at least one aspect word and the aspect word is part of a sentence in the text, and to obtain at least one target region from the image; a data alignment module 502, configured to calculate the global similarity between the aspect word and the text and the local similarity between the aspect word and the target region, and to calculate the correspondence between the aspect word and the target region based on the local similarity and the global similarity; and a sentiment analysis module 503, configured to determine the sentiment polarity corresponding to the aspect word based on the correspondence and the text.

[0091] In one example, calculating the correspondence between the aspect words and the visual features based on the local similarity and the global similarity includes: applying a confidence constraint to the local similarity based on the global similarity, and using the constrained local similarity to perform multi-layer self-attention calculation to obtain the correspondence between the aspect words and the visual features.
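The exact form of the confidence constraint is not recoverable from this passage, so the following is only one possible reading, sketched under stated assumptions: cosine similarity is used for both the global and local similarities, the global similarity is squashed into a gate in (0, 1) that scales the local similarities, and a single softmax stands in for the multi-layer self-attention. All function names and the gating formula are hypothetical.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def constrained_alignment(aspect, sentence, regions):
    """Return one weight per target region for a single aspect word.

    aspect: (d,), sentence: (d,), regions: (k, d)
    """
    global_sim = cosine(aspect, sentence)                         # aspect vs. whole text
    local_sim = np.array([cosine(aspect, r) for r in regions])    # aspect vs. each region
    confidence = 1.0 / (1.0 + np.exp(-global_sim))                # hypothetical gate in (0, 1)
    gated = confidence * local_sim                                # constrained local similarity
    weights = np.exp(gated - gated.max())                         # softmax stand-in for self-attention
    return weights / weights.sum()

aspect = np.array([1.0, 0.0])
sentence = np.array([1.0, 1.0])
regions = np.array([[1.0, 0.1],    # nearly parallel to the aspect vector
                    [0.0, 1.0],    # orthogonal
                    [-1.0, 0.0]])  # opposite
weights = constrained_alignment(aspect, sentence, regions)
```

In this toy case the first region receives the largest weight because it is the most similar to the aspect vector; the gate only rescales the local similarities and does not change their order.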

[0092] In one example, determining the sentiment polarity of the aspect word based on the correspondence and the text includes: calculating the multimodal vector corresponding to the aspect word through a multi-head interactive attention mechanism based on the context corresponding to the aspect word in the text and the correspondence; inputting the multimodal vector into a normalized exponential function to determine the sentiment polarity of the aspect word.

[0093] In one example, the step of calculating the multimodal vector corresponding to the aspect word through a multi-head interaction attention mechanism based on the context corresponding to the aspect word in the text and the correspondence includes: obtaining the target region corresponding to the aspect word according to the correspondence; calculating the cross-modal fine-grained interaction information between the aspect word and the image, and between the target region corresponding to the aspect word and the context corresponding to the aspect word in the text, through a multi-head interaction attention mechanism; fusing the cross-modal fine-grained interaction information between the aspect word and the image and the cross-modal fine-grained interaction information between the target region corresponding to the aspect word and the context corresponding to the aspect word in the text through a low-rank bilinear pooling; and calculating the multimodal vector corresponding to the aspect word based on the fusion result.

[0094] In one example, extracting at least one target region from the image includes: extracting multiple image regions from the image using a convolutional neural network model, and selecting at least one target region from the multiple image regions using a trained object detection model.
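As a rough illustration of selecting target regions from candidate image regions, the sketch below assumes that each candidate region already has a feature vector from the convolutional network and a confidence score from the object detection model, and that the k highest-scoring regions are kept. The selection rule, the function name, and the toy values are assumptions rather than details from this application.

```python
import numpy as np

def select_target_regions(region_features, detection_scores, k):
    """Keep the k candidate regions with the highest detection scores.

    region_features: (n, d) features from the CNN, detection_scores: (n,)
    Returns the selected indices (best first) and their features.
    """
    k = min(k, len(detection_scores))
    order = np.argsort(detection_scores)[::-1][:k]  # indices sorted by descending score
    return order, region_features[order]

# Toy candidates: 4 regions with 3-dimensional features.
features = np.arange(12, dtype=float).reshape(4, 3)
scores = np.array([0.1, 0.9, 0.5, 0.7])
indices, targets = select_target_regions(features, scores, k=2)
```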

[0095] In one example, the convolutional neural network model is a residual network model.

[0096] In one example, the device further includes a judgment and verification module, used to judge whether the sentiment polarity corresponding to the aspect word is accurate using the value calculated by the cross-entropy loss function; if the value calculated by the cross-entropy loss function is less than a preset threshold, the judgment is considered accurate.

[0097] In this embodiment, text and corresponding images are obtained from the dataset; wherein the text contains at least one aspect term; the aspect term is part of a sentence in the text; at least one target region is obtained from the image; the global similarity between the aspect term and the text, and the local similarity between the aspect term and the target region are calculated respectively, and the correspondence between the aspect term and the target region is calculated based on the local similarity and the global similarity; the sentiment polarity corresponding to the aspect term is determined based on the correspondence and the text. This method can filter out most visual noise while capturing local information useful for sentiment analysis. A reliable fine-grained alignment mechanism effectively filters image noise information, and fine-grained text-image interaction allows for sufficient information exchange, enabling correct, accurate, and reliable determination of the sentiment polarity of aspect terms.

[0098] It is not difficult to see that this embodiment is a device embodiment corresponding to the above method embodiment, and this embodiment can be implemented in conjunction with the above method embodiment. The relevant technical details mentioned in the above method embodiment are still valid in this embodiment, and will not be repeated here to reduce repetition. Accordingly, the relevant technical details mentioned in this embodiment can also be applied to the above method embodiment.

[0099] It is worth mentioning that all modules involved in this embodiment are logical modules. In practical applications, a logical unit can be a physical unit, a part of a physical unit, or a combination of multiple physical units. Furthermore, to highlight the innovative aspects of this invention, this embodiment does not introduce units that are not closely related to solving the technical problem proposed by this invention; however, this does not mean that other units are absent from this embodiment.

[0100] Another embodiment of the present invention relates to an electronic device. As shown in Figure 5, the device includes at least one processor 601 and a memory 602 communicatively connected to the at least one processor 601; the memory 602 stores instructions executable by the at least one processor 601, and the instructions are executed by the at least one processor 601 to enable the at least one processor 601 to perform the sentiment analysis method described above.

[0101] The memory 602 and processor 601 are connected via a bus, which may include any number of interconnecting buses and bridges. The bus connects various circuits of one or more processors 601 and memory 602 together. The bus can also connect various other circuits, such as peripheral devices, voltage regulators, and power management circuits, which are well known in the art and therefore will not be described further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver can be a single element or multiple elements, such as multiple receivers and transmitters, providing a unit for communicating with various other devices over a transmission medium. Data processed by processor 601 is transmitted over a wireless medium via an antenna, which further receives data and transmits it to processor 601.

[0102] The processor manages the bus and general processing, and also provides various functions, including timing, peripheral interfaces, voltage regulation, power management, and other control functions. Memory is used to store data used by the processor during operation.

[0103] Another embodiment of the present invention relates to a computer-readable storage medium storing a computer program. When executed by a processor, the computer program implements the above-described method embodiments.

[0104] That is, those skilled in the art will understand that all or part of the steps in the methods of the above embodiments can be implemented by a program instructing related hardware. This program is stored in a storage medium and includes several instructions to cause a device (which may be a microcontroller, chip, etc.) or processor to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a portable hard drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

[0105] Those skilled in the art will understand that the above embodiments are specific examples of implementing the present invention, and in practical applications, various changes in form and detail may be made without departing from the spirit and scope of the present invention.

Claims

1. A sentiment analysis method, characterized by comprising: obtaining text and images corresponding to the text from a dataset, wherein the text contains at least one aspect word and the aspect word is part of a sentence in the text; obtaining at least one target region from the image; calculating the global similarity between the aspect word and the text and the local similarity between the aspect word and the target region, respectively; calculating the correspondence between the aspect word and the visual features in the target region based on the local similarity and the global similarity; and determining the sentiment polarity corresponding to the aspect word based on the correspondence and the text; wherein calculating the correspondence between the aspect word and the visual features in the target region based on the local similarity and the global similarity includes: applying a confidence constraint to the local similarity based on the global similarity, and performing multi-layer self-attention calculation with the constrained local similarity to obtain the correspondence between the aspect word and the visual features in the target region; wherein determining the sentiment polarity corresponding to the aspect word based on the correspondence and the text includes: calculating, through a multi-head interactive attention mechanism, the multimodal vector corresponding to the aspect word based on the context corresponding to the aspect word in the text and the correspondence; and inputting the multimodal vector into a normalized exponential function to determine the sentiment polarity corresponding to the aspect word; and wherein calculating the multimodal vector corresponding to the aspect word through the multi-head interactive attention mechanism includes: obtaining the target region corresponding to the aspect word according to the correspondence; calculating, through the multi-head interactive attention mechanism, the cross-modal fine-grained interaction information between the aspect word and the image, and between the target region corresponding to the aspect word and the context corresponding to the aspect word in the text; fusing these two kinds of cross-modal fine-grained interaction information through low-rank bilinear pooling; and calculating the multimodal vector corresponding to the aspect word based on the fusion result.

2. The sentiment analysis method according to claim 1, characterized in that obtaining at least one target region from the image includes: extracting multiple image regions from the image using a convolutional neural network model, and selecting at least one target region from the multiple image regions using a trained object detection model.

3. The sentiment analysis method according to claim 2, characterized in that the convolutional neural network model is a residual network model.

4. The sentiment analysis method according to claim 1, characterized in that the method further includes: using the value calculated by a cross-entropy loss function to judge whether the sentiment polarity corresponding to the aspect word is accurate; if the value calculated by the cross-entropy loss function is less than a preset threshold, the determination is considered accurate.

5. A sentiment analysis device, characterized by comprising: a data acquisition module, configured to obtain text and images corresponding to the text from a dataset, wherein the text contains at least one aspect word and the aspect word is part of a sentence in the text, and to obtain at least one target region from the image; a data alignment module, configured to calculate the global similarity between the aspect word and the text and the local similarity between the aspect word and the target region, and to calculate the correspondence between the aspect word and the visual features in the target region based on the local similarity and the global similarity; and a sentiment analysis module, configured to determine the sentiment polarity corresponding to the aspect word based on the correspondence and the text; wherein calculating the correspondence between the aspect word and the visual features in the target region based on the local similarity and the global similarity includes: applying a confidence constraint to the local similarity based on the global similarity, and performing multi-layer self-attention calculation with the constrained local similarity to obtain the correspondence between the aspect word and the visual features in the target region; wherein determining the sentiment polarity corresponding to the aspect word based on the correspondence and the text includes: calculating, through a multi-head interactive attention mechanism, the multimodal vector corresponding to the aspect word based on the context corresponding to the aspect word in the text and the correspondence; and inputting the multimodal vector into a normalized exponential function to determine the sentiment polarity corresponding to the aspect word; and wherein calculating the multimodal vector corresponding to the aspect word through the multi-head interactive attention mechanism includes: obtaining the target region corresponding to the aspect word according to the correspondence; calculating, through the multi-head interactive attention mechanism, the cross-modal fine-grained interaction information between the aspect word and the image, and between the target region corresponding to the aspect word and the context corresponding to the aspect word in the text; fusing these two kinds of cross-modal fine-grained interaction information through low-rank bilinear pooling; and calculating the multimodal vector corresponding to the aspect word based on the fusion result.

6. An electronic device, characterized by comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the sentiment analysis method according to any one of claims 1 to 4.

7. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the sentiment analysis method according to any one of claims 1 to 4.