An emotion-enhanced continued training method combining knowledge distillation and contrastive learning
By combining knowledge distillation and contrastive learning in the emotion enhancement training method, and using multimodal emotion knowledge to train an image emotion analysis model, the accuracy and applicability issues of image emotion analysis in existing technologies are solved, achieving more efficient model training and more accurate emotion recognition.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING UNIV OF TECH
- Filing Date
- 2023-06-15
- Publication Date
- 2026-06-26
Figure CN117115505B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of computer vision, specifically involving technologies such as deep learning and sentiment analysis.
Background Technology
[0002] With the development of the internet and the proliferation of social channels, people are increasingly inclined to share their experiences and express their feelings by posting text, images, and other modalities on social media platforms such as WeChat Moments, Weibo, Twitter, and Facebook. Through this data, which contains emotional information, people can express their joy or vent their frustrations. Extracting emotional trends from massive amounts of images can help understand people's personal preferences and emotional states, playing a crucial role in many real-life applications, such as social media monitoring, personalized recommendations, and the prevention of potential mental health issues.
[0003] The primary mediums for expressing emotions are text, images, and videos. With the fast pace of life and the vastness of social media platforms, people increasingly value the ability of images to efficiently convey emotions in order to share them quickly and widely. Nowadays, people tend to express their emotions by posting one or more photos with a simple caption, or simply by posting an image. Therefore, research on image-based emotion is crucial. How to quickly, accurately, and automatically identify and analyze images containing emotional information has naturally become a research hotspot.
[0004] Current image sentiment analysis methods first select an existing backbone network and then focus on designing and fine-tuning a network on top of it to train a model with sentiment analysis capabilities. This approach can be summarized as a training paradigm of backbone network plus fine-tuning network. The backbone network plays a crucial role in the model structure. Traditional methods use backbone networks trained on object recognition tasks, with the object labels of images as supervision signals. However, such training learns only limited, shallow image semantics. Backbone networks with such limited knowledge are suited only to object recognition tasks, which reduces the model's generalization ability. For deep semantic understanding tasks such as image sentiment analysis, label-supervised pre-trained models are therefore a poor fit. Furthermore, such pre-trained models are expensive to train, because most of the training data must be labeled manually.
[0005] Recently, large-scale language pre-trained models (LPMs) have been proposed to mine knowledge from text and have achieved remarkable results. Some works have shown that these models can serve as a starting point for downstream tasks, significantly improving final experimental results. Among these models, the CLIP model has achieved great success in both the visual and language domains. CLIP uses information from text as a visual supervision signal, fusing textual and image information while retaining strong generalization capabilities. However, we found that applying this type of natural-language-supervised pre-trained model directly to sentiment analysis tasks does not yield ideal results because of the domain gap. This is attributed to the neglect of task-specific knowledge, such as sentiment knowledge, during training. A comparison of backbone networks is shown in Figure 1.
[0006] To address the aforementioned issues, this invention proposes a sentiment enhancement training method combining knowledge distillation and contrastive learning. A large number of image-text pairs with distinct sentiment polarities are selected and collected, and the knowledge distillation method is used to train the model on a large dataset to reduce the gap between domains. Specifically, model training, based on contrastive learning, delves into multi-granularity sentiment knowledge within the image-text space, mining both visual and textual modalities to gain a deeper understanding of the emotional semantic information in images. This helps the pre-trained model output accurate sentiment representations, enabling accurate image sentiment recognition in downstream tasks.
Summary of the Invention
[0007] The technical problem this invention aims to solve is to overcome the shortcomings of existing technologies and provide a sentiment enhancement and continuation training method that combines knowledge distillation and contrastive learning, making full use of multimodal sentiment knowledge within the domain. Its main objective is to address the issues of low prediction accuracy and poor applicability in sentiment analysis models.
[0008] This invention designs an emotion-enhanced continued training method that combines knowledge distillation and contrastive learning. The method consists of five stages: teacher training data acquisition, teacher online training, student training data acquisition, student online training, and downstream task testing.
[0009] This invention includes the following steps:
[0010] S1. Obtain teacher network training data; use existing text sentiment classification models to filter image-text pairs with sentiment polarity from the CC12M dataset, use the text sentiment as pseudo-labels for the images, and use a sentiment dictionary to label sentiment words in the text. The final dataset is named SR-CC12M (Sentiment Rich-CC12M). As shown in Figure 2, the first row contains images, the second row contains the corresponding text, the third row contains sentiment tags for the image-text pair, and the fourth row contains tagged words in the text that have a clear sentiment polarity.
[0011] S2. Teacher network training data preprocessing: The images and text in the training data are formatted and standardized, and sentiment masks are applied to the text to obtain original text samples, text mask samples, and image samples.
[0012] S3. Construct a teacher model; based on the input data, perform contrastive learning and sentiment knowledge learning on the teacher model. The overall structure of the teacher model is shown in Figure 3. The student model is initialized from the teacher model's visual encoder.
[0013] S4. Obtain student network training data; use the teacher network as an image sentiment classification tool to analyze the sentiment polarity of each small block in the image, record the location and sentiment category of image blocks with obvious sentiment polarity, obtain the original image block samples, and perform sentiment masking on the image blocks to obtain image block mask samples.
[0014] S5. Construct a student model; based on the image block samples, train the student model to learn image sentiment knowledge. The overall structure of the student model is shown in Figure 4. The student model is used as a pre-trained model for image sentiment analysis.
[0015] S6. Downstream Task Testing: After preprocessing the images in the test dataset using the same steps as in S2, the image encoder trained by the student network is applied to three commonly used image sentiment analysis datasets: FI, Twitter, and EmotionROI, for experiments on sentiment binary and multi-class classification. Experimental results are obtained under zero-shot, linear probe, and supervised settings.
[0016] Optionally, the process of processing teacher network training data in S2 is as follows:
[0017] Step 1: Scale and crop the input image to obtain a 224x224 pixel value matrix with three channels. Divide the image pixel values into smaller matrices and pass them to the visual embedding layer to obtain the corresponding image block encodings, which serve as image samples.
[0018] Step 2: Encode the input text into tokens using a text embedding layer and set mask probabilities. Based on the mask probabilities and text tokens, obtain the number of masks. Randomly sample the input text tokens according to the number of masks to obtain mask positions. Mask the tokens corresponding to the mask positions; the input text tokens are the original text samples, and the masked text is the mask sample.
[0019] Optionally, the process of masking the token corresponding to the mask position is as follows:
[0020] Step 1: Determine the tokens in the input text tokens that need to be masked based on the mask position;
[0021] Step 2: Replace the tokens that need to be masked in the input text tokens with the corresponding mask tokens according to the required mask tokens and masking strategy.
[0022] Optionally, the masking strategy in step 2 includes:
[0023] (1) There is an 80% probability that the tokens that need to be masked in the input text tokens will be replaced with the [MASK] token.
[0024] (2) There is a 10% probability that the tokens that need to be masked in the input text tokens will be replaced with random tokens from the CLIP pre-trained model vocabulary.
[0025] (3) There is a 10% probability that the tokens that need to be masked in the input text token will remain unchanged.
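The masking procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function names, the string `"[MASK]"` standing in for a tokenizer-specific mask id, the toy vocabulary, and the rounding rule used to turn the mask probability into a mask count are all assumptions.

```python
import random

MASK_TOKEN = "[MASK]"  # assumption: stands in for the tokenizer's real mask id


def sample_mask_positions(tokens, mask_prob, rng):
    """Derive the number of masks from the mask probability, then sample positions."""
    num_masks = max(1, round(len(tokens) * mask_prob))  # rounding rule is an assumption
    return sorted(rng.sample(range(len(tokens)), num_masks))


def apply_mask_strategy(tokens, mask_positions, vocab, rng):
    """Return (masked tokens, {position: original token}) using the 80/10/10 rule."""
    masked = list(tokens)
    labels = {}
    for pos in mask_positions:
        labels[pos] = tokens[pos]
        r = rng.random()
        if r < 0.8:
            masked[pos] = MASK_TOKEN         # 80%: replace with [MASK]
        elif r < 0.9:
            masked[pos] = rng.choice(vocab)  # 10%: replace with a random vocabulary token
        # remaining 10%: leave the token unchanged
    return masked, labels


rng = random.Random(0)
tokens = ["a", "happy", "dog", "on", "the", "beach"]
vocab = ["cat", "sad", "tree", "sun"]  # toy vocabulary, hypothetical
positions = sample_mask_positions(tokens, 0.15, rng)
masked, labels = apply_mask_strategy(tokens, positions, vocab, rng)
```

Here `tokens` plays the role of the original text sample and `masked` the role of the text mask sample; `labels` keeps the original tokens needed later for the cross-entropy target.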
[0026] Optionally, both the teacher model's visual encoder and text encoder consist of 12-layer Transformer models.
[0027] Optionally, the process of contrastive learning and sentiment knowledge learning of the teacher model based on input data in S3 is as follows:
[0028] Step 1: Contrastive learning; The original text samples corresponding to the image samples are used as positive training examples, and the remaining text samples in the same batch are used as negative samples; Features of the image and text samples are obtained through a transformer network, and the similarity between the image samples and the positive and negative examples in the same training batch is calculated using the following formula:
[0029] $\mathrm{sim}(f_I, f_T) = \cos(f_I, f_T)$
[0030] Where sim represents the similarity function; $f_I$ denotes the image sample features; $f_T$ denotes the text sample features; cos represents the cosine similarity function.
[0031] Step 2: Calculate the contrastive loss based on the similarity between each sample and its positive and negative examples respectively; the calculation formula is:
[0032]
[0033] Where $L_{cl}$ denotes the contrastive learning loss, N is the number of training image-text pairs in a batch, and sim represents the similarity function.
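As an illustration of how a batch-level contrastive loss can be built from the cosine similarity, the following Python sketch computes an image-to-text InfoNCE loss in which the j-th text is the positive for the j-th image and the other texts in the batch are negatives. The exact form used by the invention is not reproduced in this text, so the `temperature` parameter, the one-directional (image-to-text) form, and the function names are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(image_feats, text_feats, temperature=1.0):
    """Image-to-text InfoNCE averaged over a batch of N pairs (assumed form)."""
    n = len(image_feats)
    total = 0.0
    for j in range(n):
        logits = [cosine(image_feats[j], text_feats[k]) / temperature for k in range(n)]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[j]  # -log softmax at the positive pair
    return total / n


images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = contrastive_loss(images, aligned_texts)
loss_swapped = contrastive_loss(images, swapped_texts)
```

With matched pairs the loss is lower than with swapped pairs, which is the behavior the training objective relies on.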
[0034] Step 3: Text sentiment knowledge learning; use the text mask samples as training cases and obtain the mask token features through a transformer network; pass the feature vector of the last transformer layer through a fully connected layer that maps it to the CLIP vocabulary space to obtain the predicted token label at each mask position, and calculate the cross-entropy loss between the predicted token label and the real token label.
[0035] Step 4: Pass the token features output by the last transformer layer (excluding the class feature vector in the first position) through a fully connected layer that maps them to the sentiment space, obtain the predicted sentiment label at each mask position, and calculate the cross-entropy loss between the predicted sentiment label and the real sentiment label.
[0036] Optionally, the process of processing student network training data in S4 is as follows:
[0037] Step 1: Same as step one in S2, process the image to obtain an image sample.
[0038] Step 2: Mask image generation. Input the image samples into the teacher network's visual encoder to obtain the sentiment score for each image block. Set a threshold of 0.8; image blocks with a score greater than 0.8 are considered to have a strong sentiment polarity. Record their positions and set a random masking probability to obtain the mask positions. Mask the image blocks corresponding to the mask positions; the input image is the original image sample, and the masked image is the image mask sample.
[0039] Optionally, the process of masking the image block corresponding to the mask position is as follows:
[0040] Step 1: Determine the image blocks that need to be masked in the input image block encoding based on the mask position;
[0041] Step 2: Based on the image blocks to be masked and the masking strategy (masking 75% of the emotional image blocks), replace the encoding positions in the input image block encoding that need to be masked with the corresponding mask encoding.
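The selection and masking of emotional image blocks can be sketched as follows. This is a simplified illustration under stated assumptions: the per-block sentiment scores are taken as given (in the method they come from the teacher network), the block encodings are toy 4-dimensional vectors, and the helper names and the rounding of the 75% ratio are hypothetical.

```python
import random


def select_emotional_blocks(scores, threshold):
    """Positions of image blocks whose sentiment score exceeds the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


def choose_mask_positions(emotional_positions, mask_ratio, rng):
    """Randomly pick a fraction of the emotional blocks to mask."""
    k = round(len(emotional_positions) * mask_ratio)  # rounding rule is an assumption
    return sorted(rng.sample(emotional_positions, k))


def mask_blocks(encodings, mask_positions, mask_encoding):
    """Replace the encodings at the mask positions with the mask encoding."""
    masked = list(encodings)
    for pos in mask_positions:
        masked[pos] = mask_encoding
    return masked


rng = random.Random(0)
scores = [0.95, 0.40, 0.85, 0.90, 0.50, 0.99, 0.30, 0.81]  # toy teacher scores
encodings = [[float(i)] * 4 for i in range(len(scores))]    # toy block encodings
mask_encoding = [-1.0] * 4                                  # stands in for a learnable mask vector

emotional = select_emotional_blocks(scores, threshold=0.8)
mask_positions = choose_mask_positions(emotional, mask_ratio=0.75, rng=rng)
masked_encodings = mask_blocks(encodings, mask_positions, mask_encoding)
```

Only blocks judged emotional are candidates for masking, so non-emotional regions always remain visible as context.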
[0042] Optionally, the student model includes a visual encoder and a decoder, wherein the visual encoder is initialized by the teacher model visual encoder, and the decoder is a simple convolutional layer.
[0043] Optionally, the process of training the student model to learn sentiment knowledge based on input data in S5 is as follows:
[0044] Step 1: Image sentiment knowledge learning; use image mask samples as training cases, obtain mask prediction features through a transformer network; input the last layer mask prediction feature vector of the transformer into the decoder to obtain the predicted pixels, and calculate the L1 loss between the predicted pixels and the real pixels;
[0045] $L_{mim} = \frac{1}{\Omega(x_M)} \lVert y_M - x_M \rVert_1$
[0046] Where $L_{mim}$ represents the image mask reconstruction loss, $y_M$ denotes the predicted pixels, $x_M$ denotes the actual pixels, $\lVert \cdot \rVert_1$ denotes the L1 distance, and $\Omega(\cdot)$ denotes the number of elements.
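An L1 reconstruction loss restricted to masked image blocks can be sketched in plain Python as below. The data layout (each block as a flat list of pixel values plus a 0/1 mask flag per block) and the function name are assumptions chosen for illustration.

```python
def masked_l1_loss(predicted, target, mask):
    """Mean absolute pixel error over masked blocks only.

    predicted, target: lists of blocks, each a flat list of pixel values.
    mask: list of 0/1 flags, 1 meaning the block was masked.
    """
    total = 0.0
    count = 0
    for pred_block, true_block, flag in zip(predicted, target, mask):
        if flag:
            total += sum(abs(p - t) for p, t in zip(pred_block, true_block))
            count += len(true_block)  # number of masked pixel elements
    return total / count if count else 0.0


predicted = [[0.5, 0.5], [0.2, 0.8]]
target = [[1.0, 0.0], [0.2, 0.6]]
loss_first_only = masked_l1_loss(predicted, target, mask=[1, 0])
loss_both = masked_l1_loss(predicted, target, mask=[1, 1])
```

Dividing by the number of masked elements keeps the loss comparable across images with different numbers of masked blocks.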
[0047] Step 2: Pass the image block feature vectors of the last transformer layer (excluding the class feature vector in the first position) through a fully connected layer that maps them to the sentiment space, obtain the predicted sentiment label at each mask position, and calculate the cross-entropy loss between the predicted sentiment label and the real sentiment label.
Attached Figure Description
[0048] Figure 1 shows the comparison of backbone networks.
[0049] Figure 2 shows examples from the SR-CC12M dataset.
[0050] Figure 3 shows the teacher network structure.
[0051] Figure 4 shows the student network structure.
Detailed Implementation
[0052] This invention discloses a method for enhancing sentiment through knowledge distillation and contrastive learning for image sentiment classification tasks. The technical solution of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention. The specific implementation steps of this invention are as follows:
[0053] Step 1: Obtain teacher network training data
[0054] In step one, existing text sentiment classification models are used to filter and obtain image-text pairs with sentiment polarity from the CC12M dataset. Text sentiment is used as pseudo-labels for images, and a sentiment dictionary is used to label sentiment words in the text. Specifically, step one can be divided into two parts: sentence-level data processing and word-level data processing. For sentence-level data, the TWEETEVAL model is applied to filter and process texts T and their corresponding images I from the CC12M image-text dataset, assigning sentiment labels: 1 for positive and 0 for negative. This serves as a weakly labeled image-text dataset for model training. A prompt generator is also designed, with the text template "it is a photo of [sentiment label]", placing the sentiment labels positive and negative in the corresponding positions to obtain the labeled sentence L. To obtain fine-grained sentiment information, word-level data is processed. A word segmenter is used to extract independent words W from the text, and then a sentiment dictionary is used to determine the sentiment polarity of each independent word and assign it a sentiment label: 0 for negative, 1 for neutral, and 2 for positive. The final dataset contains 73,000 image-text pairs, of which 61,000 are positive and 12,000 are negative. This dataset was used directly for model training.
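The prompt generator and the word-level labeling described above can be illustrated with a short Python sketch. The prompt template and the label conventions (sentence level: 1 positive, 0 negative; word level: 0 negative, 1 neutral, 2 positive) follow the text; the toy lexicon, the whitespace split standing in for the word segmenter, and the function names are hypothetical.

```python
PROMPT_TEMPLATE = "it is a photo of {label}"
SENTENCE_LABELS = {1: "positive", 0: "negative"}

# Hypothetical toy lexicon; a real sentiment dictionary would be far larger.
TOY_LEXICON = {"happy": 2, "lovely": 2, "sad": 0, "awful": 0}


def build_prompt(sentence_label):
    """Fill the sentiment label into the text template."""
    return PROMPT_TEMPLATE.format(label=SENTENCE_LABELS[sentence_label])


def label_words(text, lexicon):
    """Assign 0 (negative), 1 (neutral) or 2 (positive) to each word."""
    words = text.lower().split()  # whitespace split stands in for a word segmenter
    return [(word, lexicon.get(word, 1)) for word in words]


positive_prompt = build_prompt(1)
word_labels = label_words("A happy dog", TOY_LEXICON)
```

Words missing from the lexicon default to neutral in this sketch, which is one simple way to handle out-of-dictionary words.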
[0055] Step 2: Teacher Network Data Preprocessing and Model Parameter Setting
[0056] The input images and text need to undergo data format standardization. Images are scaled and cropped to obtain a 224x224 three-channel result, which is then fed into the image encoder. Text is converted to its corresponding token codes using a tokenizer, and the length is fixed at 77: if the length is less than 77, zeros are appended to the end; any excess is removed. This result is then fed into the text encoder. The model input batch size is set to 128, and training is performed for 10 epochs. Gradient backpropagation is performed every 1500 steps to update parameters. The initial learning rate is set to $10^{-6}$ and decays to $10^{-7}$ over 3 epochs. An AdamW optimizer with a gradient descent scheduler is configured to learn the parameters of the image encoder in the model. In addition, two separate AdamW optimizers are configured to learn the parameters of the fully connected layers in the model.
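The fixed-length text formatting described above amounts to a pad-or-truncate step, sketched here in Python. The function name and the `pad_id` parameter are assumptions; a real CLIP tokenizer additionally inserts start and end tokens, which this sketch omits.

```python
CONTEXT_LENGTH = 77  # fixed token length stated in the text


def pad_or_truncate(token_ids, length=CONTEXT_LENGTH, pad_id=0):
    """Cut sequences longer than `length` and right-pad shorter ones with zeros."""
    trimmed = list(token_ids[:length])
    return trimmed + [pad_id] * (length - len(trimmed))


short_ids = pad_or_truncate([5, 6, 7])
long_ids = pad_or_truncate(list(range(100)))
```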
[0057] Step 3: Teacher Network Design
[0058] Step 3.1 Feature Extraction
[0059] Each training step uses N image-text pairs as model inputs: images to the visual encoder and text to the text encoder. After passing through a learnable visual embedding layer and a self-attention module, the overall image feature tensor $f_I \in \mathbb{R}^{N \times 512}$ is obtained. For text, the tokenizer module obtains the token numbers corresponding to the text, and after passing through a learnable text embedding layer and an attention module, the text feature tensor $f_T \in \mathbb{R}^{N \times 512}$, the text token feature tensor $f_{Tok} \in \mathbb{R}^{N \times 77 \times 512}$, and the image sentiment label features $f_L \in \mathbb{R}^{2 \times 512}$ are obtained.
[0060] Step 3.2 Teacher Online Training
[0061] Step 3.2.1 Contrastive Learning
[0062] On a large-scale image-text dataset, the teacher network is primarily trained using a contrastive learning approach, guided by a fine-grained text sentiment inference task, and supervised by sentiment-based natural language processing. The contrastive learning approach maps image and text features into the same space, integrating sentiment information from the text into the image. It increases semantic similarity by minimizing the distance between the two modalities, allowing the image to naturally contain textual information. Specifically, this is achieved by increasing the semantic similarity between the image and positive examples while decreasing the similarity with negative examples. For instance, in N image-text pairs, it brings the j-th image closer to its corresponding text while simultaneously increasing its distance from the remaining N-1 texts in the N image-text pairs. The specific training method is as follows:
[0063]
[0064] Where $\mathrm{sim}(f_I, f_T) = \cos(f_I, f_T)$ computes the cosine similarity, $f_I^j$ denotes the image features of the j-th image-text pair, and $f_T^j$ denotes the text features of the j-th image-text pair. The contrastive learning method is consistent with that used in CLIP, which helps the encoder retain as much of the mutual information between real pairs as possible.
[0065] Step 3.2.2 Fine-grained textual sentiment knowledge inference
[0066] Mapping text and images to the same space allows for knowledge mining of one modality within that space, which can also facilitate the representation of another modality. To this end, this invention designs a text sentiment knowledge mining task, focusing on fine-grained sentiment knowledge within text to further interpret sentiment information in images. This task is termed the Sentiment Text Masking Reasoning Task.
[0067] The objective is to train the encoding capability of the learnable embedding layer in the model's text encoder and to extend the embedding layer so that it can encode the new [MASK] token.
[0068] The specific processing of the input data involves constructing text-level masks. First, 15% of all tokens need to be masked, starting with words that have sentiment polarity. Because a word can be split into multiple tokens by the tokenizer, all tokens corresponding to a single word are masked together. Specifically, the original token number is replaced with the number corresponding to "[MASK]", and the position and the original token number are recorded as labels. This operation masks 10% of all sentiment words. If the final number of masked sentiment word tokens is less than 15% of all tokens, additional tokens are selected from the remaining tokens and masked. In this case, the masking method becomes: 80% probability of replacing with the number corresponding to "[MASK]", 10% probability of replacing with a random token from the dictionary, and 10% probability of leaving the original token unchanged. After the embedding layer encodes the input data, the self-attention mechanism in the transformer performs knowledge inference on each mask position from its context, that is, it computes the [MASK] position feature as a weighted sum of context features. The [MASK] position result is passed to a softmax layer, which yields a normalized probability distribution of the predicted [MASK] feature over the entire vocabulary. This vector has a size equal to the vocabulary length, and each dimension represents the probability that the predicted word is the corresponding word in the vocabulary. The model encoder is trained by reducing the difference between this probability distribution and the distribution of the true word label $w_i$.
[0069] To obtain word feature representations, token features corresponding to sentiment words are extracted from the last layer of the text encoder. Note that a word may correspond to multiple tokens; therefore, the token features are averaged to obtain the final word feature $g_\phi(x_W)$, which is passed into a learnable fully connected layer $h$. For example, let $g_\phi(x_W)$ represent the feature of the g-th word in the j-th sentence of the input text, with a size of [1×1×512]; the sentiment probability distribution of the word is then $h_\psi(g_\phi(x_W))$. This probability distribution represents the likelihood of the word expressing positive or negative emotion, where $\psi$ denotes the parameters of the fully connected layer $h$. The model is trained by reducing the difference between this probability distribution and the distribution of the emotion label $y_g$, as follows:
[0070]
[0071] Where H is the number of words in the j-th sentence, and N represents the number of image-text pairs in the input data. The number of words in different sentences varies.
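The word-level sentiment objective can be illustrated with a small Python sketch: sub-word token features are averaged into one word feature, passed through a fully connected layer, and scored with cross-entropy against the word's sentiment label. The averaging over all words in the batch, the toy 2-dimensional features, and the function names are assumptions; the exact normalization of the loss is not reproduced in this text.

```python
import math


def mean_pool(token_features):
    """Average the features of all tokens that belong to one word."""
    n = len(token_features)
    dim = len(token_features[0])
    return [sum(f[d] for f in token_features) / n for d in range(dim)]


def linear(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def cross_entropy(logits, label):
    """-log softmax(logits)[label], computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[label]


def word_sentiment_loss(word_features, labels, weights, bias):
    """Mean cross-entropy over all sentiment words (assumed normalization)."""
    losses = [cross_entropy(linear(f, weights, bias), y)
              for f, y in zip(word_features, labels)]
    return sum(losses) / len(losses)


word_feature = mean_pool([[1.0, 3.0], [3.0, 5.0]])  # a word split into two tokens
weights, bias = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]  # toy layer parameters
loss_match = word_sentiment_loss([[2.0, 0.0]], [0], weights, bias)
loss_mismatch = word_sentiment_loss([[2.0, 0.0]], [1], weights, bias)
```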
[0072] Step 4: Generating student network training data and setting model parameters
[0073] The teacher network is used as an image sentiment classification tool to analyze the sentiment polarity of each small patch in an image. The locations and sentiment categories of image patches with obvious sentiment polarities are recorded to obtain the original image patch samples. Sentiment masks are then applied to these image patches to obtain the masked image patch samples. Specifically, to keep model size and training time under control, the visual encoder module of the teacher network is used to initialize the student network. Images from SR-CC12M and their corresponding sentiment labels are used as training data. In the first branch, the teacher model outputs, after the last layer of the visual encoder, the sentiment features of the entire image and the sentiment features of all image patches. Sentiment classification is then performed on the sentiment features of all image patches. The teacher model has strong sentiment classification capabilities, which supports the reliability of the classification results. The classification results are the probability that the corresponding image patch expresses positive sentiment and the probability that it expresses negative sentiment. Image patches with a sentiment tendency higher than 0.7 are selected, and their locations and pseudo-sentiment labels are recorded. These locations and labels are then used as training data and fed into the student network. The second branch feeds the training data directly into the student network. The remaining model parameters are as follows: the batch size of the input data is set to 256, and training is conducted for 10 epochs; gradient backpropagation is performed every 1500 steps to update the parameters; the learning rate is initially set to $10^{-5}$ and decays to $10^{-6}$ over 3 epochs. An AdamW optimizer with a gradient descent scheduler is configured to learn the parameters of the image encoder in the model. In addition, two separate AdamW optimizers are configured to learn the parameters of the fully connected layers in the model.
[0074] Step 5: Student Network Design
[0075] Step 5.1 Feature Extraction
[0076] Each training step takes N images as input to the model. These images are input to the visual encoder, and after convolution in the learnable visual embedding layer, the encoded feature tensor $f \in \mathbb{R}^{N \times 50 \times 768}$ is obtained. After passing through the self-attention module, the overall image features $f_I \in \mathbb{R}^{N \times 1 \times 512}$ and the image patch features $f_P \in \mathbb{R}^{N \times 49 \times 512}$ are obtained.
[0077] Step 5.2 Student Online Training
[0078] This invention designs two training tasks to train the student network's ability to extract global and local image sentiment information. Unlike traditional methods that use fully connected (FC) layers for sentiment classification, this invention classifies images by calculating the cosine similarity between the features of each image and the features of the two sentiment labels. Reducing the difference between the resulting sentiment probability distribution of each image and the distribution of its sentiment label helps the model learn to capture global image sentiment.
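Classification by cosine similarity against label features, rather than through an FC classification head, can be sketched as follows. The ordering of the label features (index 0 negative, index 1 positive), the softmax without a temperature, and the toy 2-dimensional vectors are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def classify_by_similarity(image_feature, label_features):
    """Return (class probabilities, predicted class index) from cosine similarities."""
    sims = [cosine(image_feature, lf) for lf in label_features]
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs, probs.index(max(probs))


label_features = [[1.0, 0.0], [0.0, 1.0]]  # toy features: index 0 negative, index 1 positive
probs, predicted = classify_by_similarity([0.2, 0.9], label_features)
```

The probabilities produced this way can be compared with the image's sentiment label through a cross-entropy loss, which matches the idea of reducing the difference between the two distributions.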
[0079] To enable the model to capture local emotional information in images, this invention designs an image region emotional mask reconstruction task and a reconstructed emotion prediction task. Since images are rich in information, extensive region masking is required. In this invention, emotional image blocks are masked. Specifically, 75% of the positions of all emotional image blocks are randomly selected, and the positions and original RGB pixel values of these image blocks are recorded as labels. A learnable mask matrix (specifically, a 1×1×768 tensor) is randomly initialized to mask the image block feature encoding matrix. Through the self-attention mechanism in the transformer, knowledge inference about the mask positions is performed from the context to obtain the predicted emotional features $f_P$ of the image blocks. The mask position features output by the encoder are then input to the decoder to obtain the corresponding predicted pixel values. In practice, the decoder is a convolutional layer. The L1 loss between the predicted and ground truth pixels is calculated.
[0080] $L_{mim} = \frac{1}{\Omega(x_M)} \lVert y_M - x_M \rVert_1$
[0081] Where $L_{mim}$ represents the image mask reconstruction loss, $y_M$ denotes the predicted pixels, $x_M$ denotes the actual pixels, $\lVert \cdot \rVert_1$ denotes the L1 distance, and $\Omega(\cdot)$ denotes the number of elements.
[0082] The predicted sentiment features $f_P$ of the image patches are fed into a learnable fully connected layer to obtain the predicted labels for the image patches, and their distribution is made to approximate the distribution of the true sentiment labels of the image patches. This is implemented using cross-entropy loss.
[0083] The model enhances its understanding of regions by increasing the similarity between predicted pixel values and labeled pixel values. Simultaneously, performing sentiment prediction on mask locations can further train the model's ability to capture emotional details in images.
[0084] Step 6: Downstream task testing
[0085] After undergoing the same preprocessing steps as in step 4, the images in the test dataset are input into the model trained in step 4. Experimental results are obtained under zero-shot and supervised settings. Currently, the best published results for binary and six-class sentiment classification on the EmotionROI dataset and for binary sentiment classification on the FI dataset are reported in the 2020 IEEE Transactions on Multimedia paper "WSCNet: Weakly Supervised Coupled Networks for Visual Sentiment Classification and Detection," with scores of 0.8510, 0.6041, and 0.9097 respectively. The best result for eight-class sentiment classification on the FI dataset is reported in the 2019 paper "Multi-level region-based convolutional neural network for image emotion classification," with a score of 0.7546. The best result for binary classification on the Twitter dataset is reported in the 2021 paper "Discovering sentimental interaction via graph convolutional network for visual sentiment prediction," with a score of 0.8965. The model of this invention improves on these results, achieving scores of 0.8805 and 0.6751 for binary and six-class sentiment classification on the EmotionROI dataset, 0.9375 and 0.7833 for binary and eight-class sentiment classification on the FI dataset, and 0.9016 for binary classification on the Twitter dataset.
Claims
1. An emotion-enhanced continued training method combining knowledge distillation and contrastive learning, characterized in that it includes the following steps: S1. Obtain teacher network training data; through existing text sentiment classification models, select image-text pairs with sentiment polarity in the CC12M dataset, use text sentiment as image pseudo-labels, and use a sentiment dictionary to label sentiment words in the text; the final dataset is named SR-CC12M; the first row is the image, the second row is the corresponding text, the third row is the image-text pair sentiment label, and the fourth row is the labeled words in the text with obvious sentiment polarity. S2. Teacher network training data preprocessing; the images and text in the training data are formatted and standardized, and a sentiment mask is applied to the text to obtain the original text sample, the text mask sample, and the image sample. S3. Construct the teacher model; based on the input data, perform contrastive learning and sentiment knowledge learning on the teacher model; the student model is initialized by the visual encoder of the teacher model; S4. Obtain student network training data; use the teacher network as an image sentiment classification tool to analyze the sentiment polarity of each small block in the image, record the position and sentiment category of the image block with obvious sentiment polarity, obtain the original image block sample, and perform sentiment masking on the image block to obtain the image block mask sample.
In S3, the process of contrastive learning and sentiment knowledge learning of the teacher model based on input data is as follows: Step 1: Contrastive learning; the original text samples corresponding to the image samples are used as positive training examples, and the remaining text samples in the same batch are used as negative samples; features of the image and text samples are obtained through a transformer network, and the similarity between the image samples and the positive and negative examples in the same training batch is calculated using the following formula: $\mathrm{sim}(f_I, f_T) = \cos(f_I, f_T)$; where sim represents the similarity function; $f_I$ denotes the image sample features; $f_T$ denotes the text sample features; cos represents the cosine similarity function; Step 2: Calculate the contrastive loss based on the similarity between each sample and its positive and negative examples respectively; the calculation formula is: ; where $L_{cl}$ represents the contrastive learning loss, N is the number of training image-text pairs in a batch, and sim represents the similarity function; Step 3: Text sentiment knowledge learning; use text mask samples as training cases to obtain mask token features through a transformer network; pass the feature vector of the last transformer layer through a fully connected layer that maps it to the CLIP vocabulary space to obtain the predicted token label at the mask position, and calculate the cross-entropy loss between the predicted token label and the real token label. Step 4: Pass the token features output from the last transformer layer through a fully connected layer that maps them to the sentiment space, obtain the predicted sentiment label at the mask position, and calculate the cross-entropy loss between the predicted sentiment label and the real sentiment label. S5. Construct a student model; train the student model to learn image sentiment knowledge based on image patch samples; use the student model as a pre-trained model for image sentiment analysis.
S6. Downstream task testing: After the images in the test dataset undergo the same preprocessing steps as in S2, the image encoder trained in the student network is applied to downstream image sentiment analysis datasets to perform binary and multi-class sentiment classification experiments; the corresponding experimental results are obtained under zero-shot, linear-probe, and supervised settings.
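The batch-wise contrastive step in S3 can be illustrated with a minimal sketch. The similarity is cosine similarity, as the claim states. The exact loss formula is not legible in the source text, so the image-to-text InfoNCE form used here (softmax over the batch, no temperature term) is an assumption, as are the function names `cosine` and `contrastive_loss`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(image_feats, text_feats):
    """Assumed image-to-text InfoNCE loss over one batch of N pairs.

    text_feats[i] is the positive example for image_feats[i]; every other
    text in the batch is treated as a negative example.
    """
    n = len(image_feats)
    total = 0.0
    for i in range(n):
        sims = [cosine(image_feats[i], text_feats[j]) for j in range(n)]
        log_denominator = math.log(sum(math.exp(s) for s in sims))
        # Negative log-softmax probability of the positive pair.
        total += log_denominator - sims[i]
    return total / n


# Matching pairs should score a lower loss than mismatched pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the aligned texts give a smaller loss than the swapped ones, which is the behavior the contrastive objective is meant to reward.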
2. The method according to claim 1, characterized in that the teacher network training data is processed in S2 as follows:
Step 1: Scale and crop the input image to obtain a 224x224 pixel value matrix with three channels; divide the image pixel values into smaller matrices and pass them into the visual embedding layer to obtain the corresponding image block encodings, which serve as the image sample.
Step 2: Encode the input text into tokens through the text embedding layer; set the mask probability; obtain the number of masks from the mask probability and the number of text tokens; and randomly sample the input text tokens according to the number of masks to obtain the mask positions. Mask the tokens at the mask positions. The input text tokens are the original text sample, and the text after masking is the text mask sample.
The tokens at the mask positions are masked as follows:
Step 1.1: Determine which of the input text tokens need to be masked based on the mask positions.
Step 1.2: Replace the tokens that need to be masked with the corresponding mask tokens according to the masking strategy.
The masking strategy is:
(1) with 80% probability, a token that needs to be masked is replaced with the [MASK] token;
(2) with 10% probability, a token that needs to be masked is replaced with a random token from the CLIP pre-trained model vocabulary;
(3) with 10% probability, a token that needs to be masked remains unchanged.
3. The method according to claim 1, characterized in that: The teacher model's visual encoder and text encoder both consist of 12-layer Transformer models.
4. The method according to claim 1, characterized in that the student network training data is processed in S4 as follows:
Step 1: As in Step 1 of S2, process the image to obtain an image sample.
Step 2: Mask image generation. Input the image sample into the visual encoder of the teacher network to obtain the sentiment score of each image block. Set the threshold to 0.8; image blocks with a sentiment polarity greater than 0.8 are recorded. Set the random masking probability to obtain the mask positions, and mask the image blocks at the mask positions. The input image is the original image sample, and the image after masking is the image mask sample.
The image blocks at the mask positions are masked as follows:
Step 4.1: Determine which image blocks in the input image block encoding need to be masked based on the mask positions.
Step 4.2: Based on the image blocks to be masked and the masking strategy (i.e., masking 75% of the emotional image blocks), replace the encoding positions that need to be masked in the input image block encoding with the corresponding mask encoding.
The student model consists of a visual encoder and a decoder, where the visual encoder is initialized from the teacher model's visual encoder, and the decoder is a convolutional layer.
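A minimal sketch of the emotion-guided image block masking in claim 4 is given below. The 0.8 threshold and the 75% ratio come from the claim; the helper names `select_mask_positions` and `apply_mask`, the truncating rule for the number of masked blocks, and the use of plain Python lists for block encodings are assumptions made for illustration.

```python
import random


def select_mask_positions(block_scores, threshold=0.8, mask_ratio=0.75, rng=None):
    """Pick mask positions among image blocks whose score exceeds the threshold."""
    rng = rng or random.Random()
    # Record the blocks whose sentiment score is greater than the threshold.
    emotional = [i for i, score in enumerate(block_scores) if score > threshold]
    # Mask 75% of the recorded blocks (truncating to an integer is an assumption).
    num_masks = int(len(emotional) * mask_ratio)
    return sorted(rng.sample(emotional, num_masks))


def apply_mask(block_encodings, positions, mask_encoding):
    """Return a copy of the block encodings with masked positions replaced."""
    masked = list(block_encodings)
    for pos in positions:
        masked[pos] = mask_encoding
    return masked


# Toy sentiment scores for eight image blocks (hypothetical values).
scores = [0.90, 0.10, 0.85, 0.95, 0.50, 0.99, 0.30, 0.81]
blocks = ["b0", "b1", "b2", "b3", "b4", "b5", "b6", "b7"]
positions = select_mask_positions(scores, rng=random.Random(0))
masked_blocks = apply_mask(blocks, positions, "MASK")
```

Here five blocks exceed the threshold, so three of them are masked; low-scoring blocks are never selected, which keeps the reconstruction task focused on emotionally salient regions.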
5. The method according to claim 1, characterized in that, in S5, the student model learns sentiment knowledge from the input data as follows:
Step 1: Image sentiment knowledge learning. Image mask samples are used as training examples, and mask prediction features are obtained through a transformer network. The mask prediction feature vector from the last transformer layer is input into the decoder to obtain the predicted pixels, and the L1 loss between the predicted pixels and the real pixels is calculated:
$L_{rec} = \frac{1}{n} \lVert \hat{x} - x \rVert_1$
where $L_{rec}$ represents the image mask reconstruction loss, $\hat{x}$ represents the predicted pixels, $x$ represents the real pixels, $\lVert \cdot \rVert_1$ represents the L1 distance between features, and $n$ represents the number of elements.
Step 2: A fully connected layer is applied to the image block feature vector from the last transformer layer to map it into the sentiment space, yielding the predicted sentiment label at each mask position; the cross-entropy loss between the predicted sentiment label and the real sentiment label is then calculated.
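The reconstruction loss in Step 1 of claim 5 is an L1 loss averaged over the number of elements. A minimal sketch follows, assuming the predicted and real pixels are given as flat lists of floats; the function name `l1_reconstruction_loss` is hypothetical.

```python
def l1_reconstruction_loss(predicted, target):
    """Mean absolute error between predicted and real pixel values."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    if not predicted:
        raise ValueError("inputs must not be empty")
    n = len(predicted)  # number of elements
    return sum(abs(p - t) for p, t in zip(predicted, target)) / n


# Toy example: absolute errors are 0, 2 and 3, averaged over 3 elements.
loss = l1_reconstruction_loss([1.0, 2.0, 4.0], [1.0, 0.0, 1.0])
```

In a full training loop this value would typically be computed only over the masked image blocks and combined with the sentiment cross-entropy loss from Step 2; how the two terms are weighted is not stated in the claim.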