A method, system, device, and medium for early detection of fake news.
By applying contrastive learning between learnable prompt vectors and multimodal representations on top of the multimodal pre-trained model CLIP, the method addresses the weakness of existing fake news detection in small-sample scenarios and achieves more efficient multimodal information fusion and early detection.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NAT UNIV OF DEFENSE TECH
- Filing Date
- 2023-12-26
- Publication Date
- 2026-06-30
AI Technical Summary
Existing fake news detection methods perform poorly in small sample scenarios. In particular, supervised methods require a large amount of labeled data, while weakly supervised methods ignore semantic information in visual modalities, and existing models do not fully utilize the knowledge of pre-trained models.
Based on the multimodal pre-trained model CLIP, learnable prompt vectors are designed and trained by contrastive learning against multimodal representations. Textual and visual information is fused through a co-attention layer, and detection is performed within the multimodal prompt learning framework.
The method achieves better early detection of fake news with small samples, reduces training costs, makes full use of the knowledge of pre-trained models, and improves classification accuracy and recall.
Figure CN117874607B_ABST
Description
Technical Field
[0001] This invention relates to the field of multimodal fake news detection and prompt learning with pre-trained models, specifically to a method, system, device, and medium for early detection of fake news.
Background Technology
[0002] Social media messages, multimodal information composed of short texts and visual data, have become a popular way for people to receive and publish news. Their vivid visual appeal attracts readers to browse news on social media platforms. As information spreads rapidly through these platforms, fake news with negative or even malicious intent leverages this dramatic visual background to spread quickly, potentially causing serious consequences. Therefore, it is crucial to determine its authenticity at an early stage to avoid severe consequences. Existing research has found that in the absence of an effective dissemination network and sufficient labeled data, explicit warnings about fake news can reduce the adverse consequences of fake news; therefore, early detection of fake news is essential.
[0003] The increasing prevalence of social media news, encompassing not only natural language content but also visual content such as images and videos, offers a new perspective for early detection of fake news using multimodal data. Existing fake news detection methods can be categorized into two types based on the availability of sufficient labeled training data: supervised methods for scenarios with ample data and weakly supervised methods for scenarios with limited data.
[0004] Supervised multimodal fake news detection methods utilize news information from both textual and visual modalities and rely heavily on high-quality labeled data. With the rise of deep neural networks and pre-trained models, many powerful feature extractors have emerged, such as text feature extractors like BERT and Transformer, and visual feature extractors like VGG and ResNet. Singhal et al. used a visual feature extractor to extract visual information and a text feature extractor to extract textual features, then concatenated and fused the visual and textual information for fake news detection. Wang et al. designed an auxiliary task, event identification, to measure the differences between events and learn event-invariant features of news. This auxiliary task helps the model better understand multimodal information, thus aiding fake news detection. Khattar et al. proposed an end-to-end multimodal variational autoencoder, using a bimodal variational autoencoder and a binary classifier for fake news detection. Qian et al. fed the obtained text and image representations into a multimodal contextual attention network to fuse intramodal and intermodal relationships and designed a hierarchical encoding network to capture rich semantic information. Wu et al. extracted spatial and frequency domain features from images and textual features from text, and stacked multiple co-attention layers to fuse the multimodal features and learn the dependencies between modalities.
[0005] Weakly supervised fake news detection methods are capable of prediction in scenarios with small sample sizes. Many models rely on graph structures or pseudo-labels to utilize partially labeled data for fake news detection. Jiang et al. proposed a method that uses a pre-trained language model to learn prompts to guide fake news detection. For applications with small sample sizes, Jiang et al. proposed a multimodal fake news detection model that fuses multimodal features generated by the CLIP model with the text representation of a pre-trained language model to aid in prompt learning for fake news detection.
[0006] However, data annotation requires significant human and material resources, and real-world annotations are often incomplete (some data is labeled, some is not), imprecise (labels are often coarse-grained), and inaccurate (labels may contain errors). Therefore, models need to be able to detect fake news in small sample scenarios. The first type of approach, supervised multimodal fake news detection, requires a large amount of labeled data and performs poorly with small samples, making it difficult to apply to early-stage fake news detection. The second type of approach can be applied to small sample scenarios, but some of these models focus only on textual information for single-modal fake news detection, ignoring the semantic information contained in the visual modality. Building on existing work, Jiang et al. recently proposed a multimodal fake news detection model for small sample applications, fusing multimodal features generated by the CLIP model with the text representation of a pre-trained language model to aid prompt learning for fake news detection. However, this model does not fully align with the original CLIP pre-training method, so CLIP's advantages are not fully realized.
[0007] In summary, existing methods are not designed for real-world scenarios involving small sample sizes and multimodal early detection of fake news, and they do not fully align with existing pre-training methods to maximize the use of knowledge learned during pre-training.
Summary of the Invention
[0008] To address the aforementioned problems, this invention proposes an early fake news detection method based on multimodal prompt learning for small-sample scenarios. Building on the multimodal pre-trained model CLIP, it replaces CLIP's manually designed prompt templates with learnable prompt vectors and trains them by contrastive learning against multimodal representations, enabling the detection of multimodal fake news with limited samples.
[0009] The technical solution adopted in this invention is as follows:
[0010] A method for early detection of fake news, characterized by the following steps:
[0011] Step 1: Obtain the multimodal information to be detected;
[0012] Step 2: Construct a multimodal prompt learning framework;
[0013] Step 3: Input the multimodal information to be detected into the multimodal prompt learning framework;
[0014] Step 4: Perform multimodal learning using the multimodal prompt learning framework, detect the multimodal information to be detected, and output the detection results.
[0015] Furthermore, the multimodal prompt learning framework constructed in step 2 includes a feature extraction module, a multimodal feature fusion module, a learnable prompt design module, and a similarity calculation module;
[0016] The feature extraction module is used to extract visual and textual features of the input multimodal information to be detected using a pre-trained CLIP model.
[0017] The multimodal feature fusion module is used to fuse the extracted visual features and text features to obtain the multimodal features of the multimodal information to be detected.
[0018] The learnable Prompt design module is used to replace the original manually designed prompt template with a set of learnable vectors, and to obtain category features through the learnable vectors;
[0019] The similarity calculation module is used to calculate the cosine similarity between multimodal features and category features, and to classify the multimodal information to be detected.
[0020] Furthermore, the multimodal feature fusion module includes two parallel co-attention modules and a fully connected layer. Each co-attention module includes a multi-head attention layer, a residual connection & normalization layer, a fully connected feedforward network layer, and a residual connection & normalization layer connected in sequence.
[0021] The two co-attention modules are used to obtain text features with visual information and visual features with text information, respectively, based on the input text features and visual features.
[0022] The fully connected layer is used to obtain fully fused multimodal features based on the input text features with visual information and the visual features with text information.
[0023] Furthermore, step 4 includes the following specific steps:
[0024] Step 41: Keep the CLIP model parameters frozen, input the multimodal information x to be detected into the feature extraction module, and use the pre-trained CLIP image encoder and text encoder to extract the visual features $H_I$ and text features $H_T$ of x, respectively;
[0025] Step 42: Input $H_T$ and $H_I$ into the multimodal feature fusion module. After passing through the two co-attention modules, text features with visual information and visual features with text information are obtained; these are fused in the fully connected layer to yield the multimodal features $H_M$;
[0026] Step 43: Using the learnable Prompt design module, obtain the learnable prompt vector p, and input p into the pre-trained CLIP text encoder g(·) to obtain the category features $H_C$: $H_C = g(p)$;
[0027] Step 44: The similarity calculation module calculates the cosine similarity between the multimodal features $H_M$ and the category features $H_C$, classifies x based on this similarity, and outputs the classification results.
[0028] Furthermore, step 42, which involves calculating the multimodal features, includes:
[0029] Step 421: Based on the input text features $H_T$, text features with visual information are obtained through equation (4):
[0030] $H'_T = H_T + \mathrm{MA}(H_T, H_I, H_I)$
[0031]
[0032] Wherein, FFN represents a fully connected feedforward network layer, and MA represents a multi-head attention function;
[0033] Step 422: Based on the input visual features $H_I$, visual features with textual information are obtained through equation (5):
[0034] $H'_I = H_I + \mathrm{MA}(H_I, H_T, H_T)$
[0035]
[0036] Step 423: Use formula (6) to fuse the text features with visual information and the visual features with text information, obtaining the multimodal features $H_M$ of x:
[0037]
[0038] Where W represents the attention matrix.
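Formulas (4) to (6) are only partly preserved in this text. A plausible reconstruction is sketched below, assuming a standard Transformer-style co-attention block in which Norm denotes layer normalization and the two outputs are concatenated before the fully connected layer. The tilde symbols, the Norm operator, and the concatenation $[\,\cdot\,;\,\cdot\,]$ are illustrative choices and are not notation confirmed by the source.

```latex
\begin{align}
H'_T &= H_T + \mathrm{MA}(H_T, H_I, H_I), &
\tilde{H}_T &= \mathrm{Norm}\bigl(H'_T + \mathrm{FFN}(H'_T)\bigr) \tag{4}\\
H'_I &= H_I + \mathrm{MA}(H_I, H_T, H_T), &
\tilde{H}_I &= \mathrm{Norm}\bigl(H'_I + \mathrm{FFN}(H'_I)\bigr) \tag{5}\\
H_M &= W\,[\tilde{H}_T ; \tilde{H}_I] \tag{6}
\end{align}
```

Under this reading, $\tilde{H}_T$ is the text feature enriched with visual information, $\tilde{H}_I$ is the visual feature enriched with text information, and W is the matrix of the fully connected layer.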
[0039] Furthermore, the specific steps of step 44 include:
[0040] Step 441: Calculate the cosine similarity between the multimodal features $H_M$ and the category features $H_C$ to obtain the probability p that x belongs to class i:
[0041]
[0042] Where τ represents the temperature parameter learned by CLIP, and k represents the number of categories;
[0043] Step 442: Classify x according to probability p and output the early detection results of fake news.
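The probability formula referred to in Step 441 is missing from this text. Assuming the usual CLIP-style softmax over temperature-scaled cosine similarities, it would take roughly the following form, where $H_C^{(i)}$ (an illustrative notation, not taken from the source) denotes the category feature of class i:

```latex
p(y = i \mid x) =
\frac{\exp\bigl(\cos(H_M, H_C^{(i)}) / \tau\bigr)}
     {\sum_{j=1}^{k} \exp\bigl(\cos(H_M, H_C^{(j)}) / \tau\bigr)}
```

With k = 2 categories, the predicted label in Step 442 is simply the class with the larger probability.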
[0044] A fake news early detection system, based on the aforementioned fake news early detection method, is characterized by including a detection information input module, a multimodal prompt learning framework construction module, a detection text input module, and a multimodal learning module;
[0045] The detection information input module is used to acquire the multimodal information to be detected;
[0046] The multimodal prompt learning framework construction module is used to build the multimodal prompt learning framework;
[0047] The detection text input module is used to input the multimodal information to be detected into the multimodal prompt learning framework;
[0048] The multimodal learning module performs multimodal learning through the multimodal prompt learning framework, detects the multimodal information to be detected, and outputs the detection results.
[0049] An electronic device is characterized by comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the aforementioned method for early detection of fake news.
[0050] A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the aforementioned method for early detection of fake news.
[0051] The beneficial effects of this invention are:
[0052] First, the present invention effectively integrates semantic information from the text and visual modalities by employing a co-attention layer.
[0053] Second, the present invention uses learnable prompt vectors. Compared with manually written templates, letting the model learn the prompt vectors itself better captures the semantic relationship between categories and multimodal embeddings.
[0054] Third, this invention fits the original CLIP pre-training method through contrastive learning. Compared with only using CLIP's encoder to extract features and then using an additional classification network for classification, this invention can make fuller use of the knowledge learned in the CLIP pre-training stage and achieve better classification results.
[0055] Fourth, this invention uses prompt learning instead of fine-tuning, keeping the CLIP parameters frozen. This not only reduces training costs but also allows the model to perform better in small sample situations, making it more suitable for early detection of fake news.
Attached Figure Description
[0056] Figure 1 is a flowchart of the detection method proposed in this invention.
[0057] Figure 2 is a schematic diagram of the multimodal prompt learning framework of the present invention.
Detailed Implementation
[0058] To enable those skilled in the art to better understand the technical solutions of the present invention, the technical solutions of the present invention will be further described below in conjunction with the accompanying drawings and embodiments.
[0059] This invention addresses the multimodal fake news detection task in limited sample scenarios. Based on the multimodal pre-trained model CLIP, it proposes a CLIP-based multimodal prompt learning framework (MPL) with a continuous prompt design, which is used for early detection of fake news when samples are limited.
[0060] 1. Prompt learning
[0061] Fine-tuning and prompt learning are two typical paradigms for using pre-trained models. Fine-tuning adapts the model parameters to the downstream task, while prompt learning wraps the input in a specific template, recasting the task in a form that the pre-trained model can better exploit. Consider using BERT to detect fake news from text, i.e., outputting whether a piece of information is true or false. The traditional fine-tuning paradigm uses a large amount of labeled data to train the model parameters, extracts features with the fine-tuned model, and then feeds them into a classifier. The prompt learning paradigm, by contrast, requires no training of the pre-trained model; its parameters stay frozen. Prompt learning adds a prompt before or after the text, such as "This information is [MASK]", and lets BERT fill in the blank with "true" or "false" to classify the information. This cloze-style prompt template brings the task closer to BERT's pre-training objective, namely masked language modeling (MLM). Prompt learning can therefore make full use of the knowledge learned during pre-training, significantly reducing training costs and performing well even with few samples.
[0062] Prompt learning can be divided into two types: manually designed templates and automatically learned templates. The former involves manually writing templates for different tasks and datasets, which requires additional manpower and knowledge, and the results are unstable; changing even a single word can cause drastic fluctuations. The latter lets the model learn suitable templates automatically and is further divided into discrete prompts and continuous prompts. Discrete prompts are automatically generated prompts composed of natural language words, so their search space is discrete. Continuous prompts remove the constraints of natural language and search directly in a continuous embedding space; the learned prompt is a sequence of vectors rather than a sentence. Continuous prompts have achieved excellent results in many text classification and image classification tasks, demonstrating the effectiveness of this approach for classification.
[0063] 2. CLIP
[0064] CLIP is a large-scale, multimodal, pre-trained vision-language model based on contrastive learning. Unlike the label-based representation learning methods commonly used in computer vision, CLIP's training data consists of text-image pairs, each an image and its corresponding text description. To learn a wide range of concepts and transfer more easily to various downstream tasks, CLIP was trained on a large dataset of 400 million image-text pairs.
[0065] CLIP consists of two encoders: an Image Encoder and a Text Encoder. The Image Encoder extracts features from the image, using ResNet or Vision Transformers of different sizes. The Text Encoder extracts features from the text, employing the text transformer model commonly used in NLP.
[0066] CLIP performs contrastive learning on the extracted text and image features. For a training batch containing N text-image pairs, the CLIP model pairs the N text features with the N image features and predicts the similarity of all N² possible text-image pairs, i.e., the cosine similarity between the text and image features. There are N positive samples, which are the text-image pairs that truly belong together, while the remaining N²-N text-image pairs are negative samples. The training objective of CLIP is to maximize the similarity of the positive samples while minimizing the similarity of the negative samples.
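The batch objective described above can be illustrated with a small NumPy sketch. It assumes the symmetric cross-entropy form commonly associated with CLIP and a fixed temperature of 0.07; the function name, the temperature value, and the random stand-in features are illustrative assumptions, not details given in this document.

```python
import numpy as np

def clip_contrastive_loss(text_feats, image_feats, temperature=0.07):
    """Symmetric cross-entropy over the N x N cosine-similarity matrix."""
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    v = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    logits = t @ v.T / temperature          # (N, N); diagonal holds the N positives

    def cross_entropy_diag(z):
        z = z - z.max(axis=1, keepdims=True)            # numerical stability
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the text-to-image and image-to-text directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))                         # stand-in features, N = 4
aligned = clip_contrastive_loss(feats, feats)           # every pair matches
shuffled = clip_contrastive_loss(feats, feats[::-1])    # every pair mismatched
```

Here `aligned` is lower than `shuffled`, since the loss rewards high similarity on the diagonal (positive pairs) and low similarity elsewhere (the N²-N negatives).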
[0067] The pre-trained encoders only produce text and image features, so CLIP uses prompt templates to transfer the pre-trained model to downstream tasks. CLIP can directly perform zero-shot image classification, meaning it can classify on downstream tasks without any training data. First, prompt templates are constructed by creating a descriptive text for each category from the task's classification labels: "A photo of {label}". These texts are then fed into the Text Encoder to obtain the corresponding text features; if the number of categories is N, then N text features are obtained. Next, the image to be predicted is fed into the Image Encoder to obtain its image features, and the cosine similarity between these features and each of the N text features is calculated. The image is then assigned to the category with the highest similarity score.
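The zero-shot procedure above can be sketched as follows. The encoders are not invoked here: `text_feats` and `image_feat` are hand-made stand-ins for the outputs of CLIP's Text Encoder and Image Encoder, and the helper names are hypothetical.

```python
import numpy as np

def build_prompts(labels, template="A photo of {label}"):
    """Fill the hand-written template once per class label."""
    return [template.format(label=label) for label in labels]

def zero_shot_classify(image_feat, text_feats):
    """Return the index of the class whose text feature is most similar."""
    img = image_feat / np.linalg.norm(image_feat)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    return int(np.argmax(txt @ img))        # highest cosine similarity wins

labels = ["cat", "dog", "car"]
prompts = build_prompts(labels)
# Stand-in features; in practice these come from CLIP's text and image encoders.
text_feats = np.eye(3)
image_feat = np.array([0.1, 0.9, 0.2])
predicted = labels[zero_shot_classify(image_feat, text_feats)]
```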
[0068] This invention classifies multimodal news based on CLIP. In order to fit the pre-training method of CLIP contrastive learning, it improves the manually defined templates of CLIP and designs continuous prompts to perform contrastive learning with multimodal representations, thereby achieving multimodal fake news detection with few samples.
[0069] 3. The detection method proposed in this invention
[0070] (1) Problem definition and symbol system
[0071] This invention defines the fake news detection task as a binary classification problem over two modalities, text and image, which are the most popular information carriers on online social media platforms. A multimodal news item is denoted as an image-text pair x = (T, I), where T is the text and I is the image; a striking image I is used to illustrate the main text T of the news item. The goal of multimodal fake news detection is to assign a label y ∈ {0, 1} to the input news item, where 0 represents real news and 1 represents fake news.
[0072] (2) CLIP-based Multimodal Prompt Learning Framework (MPL)
[0073] As shown in Figure 2, the MPL framework employs a co-attention layer to fuse multimodal features and learnable prompts to automate prompt design, thereby fully utilizing multimodal information and the knowledge of pre-trained models. The MPL framework includes a feature extraction module, a multimodal feature fusion module, a learnable prompt design module, and a similarity calculation module. Each module is described below:
[0074] ①Feature extraction module
[0075] This module uses the pre-trained CLIP Image Encoder and CLIP Text Encoder to extract visual features $H_I$ and text features $H_T$ from multimodal news;
[0076] ② Multimodal feature fusion module
[0077] This module fuses the extracted visual and textual features to obtain the multimodal features $H_M$ of multimodal news;
[0078] This invention designs a multimodal fusion module based on co-attention. In traditional transformer self-attention, queries, keys, and values all come from the same input, while in co-attention, queries come from one input and keys and values come from another, and the residual connection carries only the query input.
[0079] The multimodal feature fusion module includes two parallel co-attention modules and a fully connected layer. Each co-attention module includes a multi-head attention layer, residual connection & normalization, a fully connected feedforward network layer, and residual connection & normalization connected in sequence.
[0080] The d×1-dimensional inputs Q, K, and V, drawn from different sources, are fed into a multi-head attention layer containing m attention heads. The multi-head attention function MA is calculated as follows:
[0081] $\mathrm{MA}(Q, K, V) = hW^{O}$
[0082]
[0083]
[0084] $Q_i = QW_i^{Q}, \quad K_i = KW_i^{K}, \quad V_i = VW_i^{V} \quad (2)$
[0085] where $W^{O}$ represents the linear transformation matrix, A represents the attention calculation function, $Q_i$, $K_i$, and $V_i$ represent the inputs of the i-th head, $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ represent the projection matrices of the i-th head, and $d_h = d / m$ represents the output dimension of each head.
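Two of the formula lines for MA are missing from this text. The sketch below fills them with the standard choice, assuming that h is the concatenation of the m head outputs and that the attention function is scaled dot-product attention, softmax(Q_i K_iᵀ / √d_h) V_i. Inputs are handled as row vectors of shape (1, d) rather than d×1 columns, and all weights are random stand-ins.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo):
    """Q: (n_q, d); K, V: (n_k, d); Wq/Wk/Wv: (m, d, d_h); Wo: (m * d_h, d)."""
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(Wq, Wk, Wv):
        Q_i, K_i, V_i = Q @ Wq_i, K @ Wk_i, V @ Wv_i     # equation (2)
        d_h = Q_i.shape[-1]
        weights = softmax(Q_i @ K_i.T / np.sqrt(d_h))    # assumed attention function
        heads.append(weights @ V_i)
    h = np.concatenate(heads, axis=-1)                   # (n_q, m * d_h)
    return h @ Wo                                        # MA(Q, K, V) = h W^O

d, m = 768, 8
d_h = d // m                                             # d_h = d / m = 96
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(scale=0.02, size=(m, d, d_h)) for _ in range(3))
Wo = rng.normal(scale=0.02, size=(m * d_h, d))
text = rng.normal(size=(1, d))     # stand-in for H_T
image = rng.normal(size=(1, d))    # stand-in for H_I
out = multi_head_attention(text, image, image, Wq, Wk, Wv, Wo)
```

The final call corresponds to MA(H_T, H_I, H_I): the query comes from the text and the keys and values come from the image.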
[0086] A fully connected feedforward network layer consists of two linear transformations and a ReLU activation function:
[0087] $\mathrm{FFN}(x) = \max(0, xW_1)W_2 \quad (3)$
[0088] where x represents the input of this layer, $\max(0, xW_1)$ represents the ReLU activation function, and $W_1$ and $W_2$ represent linear transformation matrices;
[0089] This invention fuses the visual features $H_I$ with the text features $H_T$ by feeding them into two parallel co-attention modules. First, with the text features $H_T$ as Q and the visual features $H_I$ as K and V, the multi-head attention layer computes $\mathrm{MA}(H_T, H_I, H_I)$; after residual connection and normalization, $H'_T$ is obtained. $H'_T$ is then passed through the fully connected feedforward network layer to obtain $\mathrm{FFN}(H'_T)$. Finally, after a second residual connection and normalization, the text features with visual information are obtained. That is:
[0090] $H'_T = H_T + \mathrm{MA}(H_T, H_I, H_I)$
[0091]
[0092] Second, with the visual features $H_I$ as Q and the text features $H_T$ as K and V, the multi-head attention layer computes $\mathrm{MA}(H_I, H_T, H_T)$; after residual connection and normalization, $H'_I$ is obtained. $H'_I$ is then passed through the fully connected feedforward network layer to obtain $\mathrm{FFN}(H'_I)$. Finally, after a second residual connection and normalization, the visual features with textual information are obtained. That is:
[0093] $H'_I = H_I + \mathrm{MA}(H_I, H_T, H_T)$
[0094]
[0095] Finally, the text features with visual information and the visual features with textual information are fed into a fully connected layer to obtain the multimodal features $H_M$, which fully integrate textual and visual information:
[0096]
[0097] where W is the projection matrix.
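The fusion module described above can be sketched end to end in NumPy. This is an illustration under several assumptions not confirmed by the source: scaled dot-product attention inside each head, layer normalization without learned scale or bias, no bias terms, and concatenation of the two co-attention outputs before the fully connected layer. The dimensions d = 768, m = 8, and d_ff = 1536 follow the experimental setup; all weights and features are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, d_ff = 768, 8, 1536
d_h = d // m

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def init(*shape):
    return rng.normal(scale=0.02, size=shape)

def make_block():
    """Randomly initialised weights for one co-attention module."""
    return {"Wq": init(m, d, d_h), "Wk": init(m, d, d_h), "Wv": init(m, d, d_h),
            "Wo": init(d, d), "W1": init(d, d_ff), "W2": init(d_ff, d)}

def co_attention(query_src, kv_src, p):
    """Queries from one modality; keys and values from the other."""
    heads = [softmax((query_src @ p["Wq"][i]) @ (kv_src @ p["Wk"][i]).T / np.sqrt(d_h))
             @ (kv_src @ p["Wv"][i]) for i in range(m)]
    attended = np.concatenate(heads, axis=-1) @ p["Wo"]
    h = layer_norm(query_src + attended)                 # residual + normalisation
    ffn = np.maximum(0.0, h @ p["W1"]) @ p["W2"]         # equation (3)
    return layer_norm(h + ffn)                           # residual + normalisation

H_T, H_I = rng.normal(size=(1, d)), rng.normal(size=(1, d))
text_block, image_block = make_block(), make_block()
text_with_visual = co_attention(H_T, H_I, text_block)    # text queries image
visual_with_text = co_attention(H_I, H_T, image_block)   # image queries text

W = init(2 * d, d)                                       # fully connected fusion layer
H_M = np.concatenate([text_with_visual, visual_with_text], axis=-1) @ W
```

Note that the residual connection in `co_attention` adds back only `query_src`, which matches the statement that the residual carries the query side of the co-attention.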
[0098] ③ Learnable prompt vector module
[0099] CLIP relies on hand-designed templates for small-sample experiments, but these templates require additional knowledge and manpower, and their performance is limited by template quality. To better leverage the knowledge of the pre-trained CLIP model, this invention designs learnable prompt vectors: a set of learnable vectors replaces the original hand-designed prompt template, and these vectors are concatenated with the category to obtain the learnable prompt vector p:
[0100] $p = [V_1][V_2] \dots [V_{16}][\mathrm{class}] \quad (7)$
[0101] where $V_i$ (i = 1, 2, ..., 16) represents a learnable word embedding, and class represents the label assigned to the detected text in the prompt, which can be "true" or "false".
[0102] For the fake news detection dataset, the 16 learnable word embeddings $V_i$ are obtained by training on the training set and serve as context prompt vectors for the categories true and false.
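Equation (7) can be pictured as a small array construction. The sketch below assumes that the 16 context vectors are shared between the two classes, that each class word maps to a single token embedding, and that the embedding width is 768; none of these details is stated in the source. The initialization from a zero-mean Gaussian with standard deviation 0.02 follows the experimental setup, and the class embeddings are random stand-ins for CLIP's frozen token embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
n_ctx, embed_dim = 16, 768          # 16 context vectors; embedding width assumed

# Shared learnable context [V1]...[V16], initialised from N(0, 0.02^2).
context = rng.normal(loc=0.0, scale=0.02, size=(n_ctx, embed_dim))

# Stand-ins for the frozen token embeddings of the class words.
class_embeddings = {"true": rng.normal(size=(1, embed_dim)),
                    "false": rng.normal(size=(1, embed_dim))}

def build_prompt(class_name):
    """p = [V1][V2]...[V16][class] as a (17, embed_dim) sequence."""
    return np.concatenate([context, class_embeddings[class_name]], axis=0)

prompts = {name: build_prompt(name) for name in class_embeddings}
```

During training only `context` would receive gradient updates; each prompt sequence is then passed through the frozen text encoder g(·) to produce the category feature for its class.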
[0103] ④ Similarity Calculation Module
[0104] This module is used to calculate the cosine similarity between multimodal features and category features, thereby enabling multimodal news classification.
[0105] (3) Early detection method of fake news based on MPL framework
[0106] The detection method of the present invention mainly includes the following steps:
[0107] Step 1: For a multimodal news item x consisting of an image I and text T, keep the CLIP parameters frozen;
[0108] Step 2: As shown in Equation (1), use the pre-trained CLIP image encoder f(·) to extract visual features $H_I$ from the image, and the pre-trained CLIP text encoder g(·) to extract text features $H_T$ from the text:
[0109] $H_I = f(I), \quad H_T = g(T) \quad (1)$;
[0110] Step 3: Use the multimodal feature fusion module to fuse $H_I$ and $H_T$, obtaining the multimodal features $H_M$ of x;
[0111] Step 4: Obtain the learnable prompt vector p through the learnable prompt vector module, and input p into the pre-trained CLIP text encoder g(·) to obtain the category features $H_C$:
[0112] $H_C = g(p)$;
[0113] Step 5: Calculate the cosine similarity between the multimodal features $H_M$ and the category features $H_C$ and use it to classify the multimodal news x, where the probability that news x belongs to category i is:
[0114]
[0115] Where τ represents the temperature parameter learned by CLIP, and k represents the number of categories.
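Step 5 can be sketched numerically as follows, assuming the probability is a softmax over cosine similarities divided by τ. The value τ = 0.01 and the three-dimensional toy features are illustrative assumptions; in the method, τ is the temperature learned by CLIP and the features come from the fusion module and the text encoder.

```python
import numpy as np

def class_probabilities(H_M, H_C, tau=0.01):
    """Softmax over temperature-scaled cosine similarities."""
    h = H_M / np.linalg.norm(H_M)
    c = H_C / np.linalg.norm(H_C, axis=1, keepdims=True)
    logits = (c @ h) / tau
    logits -= logits.max()               # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

H_C = np.array([[1.0, 0.0, 0.0],     # category feature for "real" (label 0)
                [0.0, 1.0, 0.0]])    # category feature for "fake" (label 1)
H_M = np.array([0.2, 0.9, 0.1])      # fused multimodal feature of one news item
probs = class_probabilities(H_M, H_C)
label = int(np.argmax(probs))        # index of the most probable category
```

Because the toy feature points mostly along the second category feature, the item is assigned label 1 (fake).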
[0116] Example
[0117] To verify the effectiveness of our method, experiments were conducted on three benchmark datasets, and the results were compared with supervised, fully trained multimodal fake news detection methods and state-of-the-art prompt-based few-shot fake news detection methods under both sufficient and few-shot conditions.
[0118] 1. Experimental setup
[0119] This embodiment uses a pre-trained CLIP (ViT-L/14@336px) model as both the text encoder and image encoder, keeping all its parameters frozen. Continuous prompts are randomly initialized by sampling from a zero-mean Gaussian distribution with a standard deviation of 0.02. In the multimodal feature fusion module, d = 768, m = 8, and $d_{ff}$ = 1536. The model parameters were optimized with an SGD optimizer at a learning rate of 0.001. The model was trained for 20 epochs, and the checkpoint with the best validation performance was selected for testing. To fully demonstrate the effectiveness of this method, two sets of comparative experiments were conducted, one fully supervised and the other few-shot, comparing against fully trained multimodal fake news detection methods and prompt-based few-shot fake news detection methods, respectively.
[0120] (1) Comparative Experiment 1 (Supervised Scenario)
[0121] To compare with fully trained multimodal fake news detection methods, we followed existing approaches and divided the dataset into training and test sets in an 8:2 ratio. When data is plentiful, we used all the training data to train our model and baseline; when data is scarce, we trained our model using a small sample of data from the training set. Specifically, we trained with 16 instances from each category. Since the small sample training set significantly impacts model performance, we repeated the data sampling five times using different random seeds, and took the average score from the five experiments, discarding the highest and lowest scores, as the result of the few-sample experiment.
[0122] (2) Comparative Experiment 2 (Small Sample Scenario)
[0123] To further demonstrate the advantages of our method in few-shot scenarios, we compared it with state-of-the-art prompt-based few-shot fake news detection methods. First, a small number of instances were randomly sampled from the dataset for training, with k instances drawn from each category, where k ∈ {2, 4, 8, 16}. A validation set of the same size as the training set was also sampled, and the remaining instances were used for testing. To reduce the impact of the particular training and validation samples on model performance, we repeated the sampling with five random seeds, discarded the highest and lowest scores, and averaged the rest as the experimental result.
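The aggregation rule used in both comparative experiments (five seeds, drop the highest and lowest score, average the rest) is a trimmed mean. A minimal sketch, with made-up accuracy values that are not results from this document:

```python
def trimmed_mean(scores):
    """Drop the single highest and lowest score, then average the rest."""
    if len(scores) < 3:
        raise ValueError("need at least three runs to trim both extremes")
    kept = sorted(scores)[1:-1]
    return sum(kept) / len(kept)

# Hypothetical accuracies from five random seeds (not results from the source).
runs = [0.71, 0.78, 0.74, 0.69, 0.80]
result = trimmed_mean(runs)   # averages 0.71, 0.74 and 0.78
```

Trimming makes the reported score less sensitive to one unusually lucky or unlucky draw of the k training instances.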
[0124] 2. Dataset
[0125] This embodiment uses three multimodal benchmark fake news detection (FND) datasets to evaluate the performance of the proposed method: Twitter, GossipCop, and PolitiFact. All three are real-world datasets collected from social media platforms. The Twitter dataset consists of tweets containing textual, visual, and social context information. The PolitiFact and GossipCop datasets are two English-language datasets from the FakeNewsNet repository, collected from the political and entertainment domains, respectively. PolitiFact is a dataset of political news that experts have labeled as either fake or real. GossipCop rates entertainment stories on a scale from 0 to 10; the authors of FakeNewsNet treat stories with scores below 5 as fake news.
[0126] To reduce redundancy, only the most relevant image was retained for news items with multiple images. This image was selected by the cosine similarity between the text and each image, computed with a pre-trained CLIP model. News items without images or with invalid image URLs were excluded. Statistical information for each dataset is shown in Table 1.
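The image selection step can be sketched as an argmax over cosine similarities. The feature vectors below are toy stand-ins for CLIP text and image embeddings, and the function name is hypothetical.

```python
import numpy as np

def most_relevant_image(text_feat, image_feats):
    """Index of the image whose feature has the highest cosine similarity to the text."""
    if len(image_feats) == 0:
        return None                      # news item has no usable image
    t = text_feat / np.linalg.norm(text_feat)
    imgs = np.asarray(image_feats, dtype=float)
    imgs = imgs / np.linalg.norm(imgs, axis=1, keepdims=True)
    return int(np.argmax(imgs @ t))

text_feat = np.array([1.0, 0.0, 1.0])
image_feats = [[0.0, 1.0, 0.0], [2.0, 0.1, 1.9], [1.0, 1.0, 0.0]]
best = most_relevant_image(text_feat, image_feats)
```

A return value of `None` marks items that would be excluded from the dataset for lack of a valid image.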
[0127] Table 1 Statistical Information
[0128]
| | Twitter | PolitiFact | GossipCop |
|---|---|---|---|
| # of fake news | 8,011 | 164 | 2,581 |
| # of real news | 6,200 | 321 | 10,259 |
| # of images | 477 | 485 | 12,816 |
[0129] 3. Comparison Methods
[0130] As described above, this embodiment conducted two sets of comparative experiments. To ensure fairness, the original metrics of these methods were used for comparison. To more intuitively illustrate the comparison of few-shot models, we used a few-shot method for specific comparisons. When comparing with fully trained multimodal fake news detection methods, seven classic fully trained multimodal models were selected as baselines:
[0131] 1) EANN uses an event recognizer to capture news event information and extract news features unrelated to the events to assist in the detection of fake news;
[0132] 2) MVAE uses a variational autoencoder coupled with a binary classifier to learn shared representations of text and images;
[0133] 3) SpotFake uses VGG and BERT to extract image and text features respectively, and then concatenates them for classification;
[0134] 4) SAFE extracts multimodal (textual and visual) features of news content and the relationships between them, and detects fake news through a similarity-aware multimodal method;
[0135] 5) MCAN extracts frequency-domain and spatial-domain features and stacks multiple co-attention layers to fuse multimodal features;
[0136] 6) LIIMR identifies and suppresses information from weaker modalities and extracts relevant information from stronger modalities on a per-sample basis;
[0137] 7) CAFE is an ambiguity-aware multimodal fake news detection method that adaptively aggregates unimodal features and cross-modal correlations.
[0138] Furthermore, this embodiment is compared with two advanced prompt-based fake news detection methods, one unimodal and the other multimodal:
[0139] 1) KPL extracts features with the pre-trained language model RoBERTa and incorporates entity knowledge extracted from a knowledge graph into prompt learning to guide fake news detection;
[0140] 2) SAMPLE integrates multimodal features generated by the CLIP model with the text representation of the pre-trained language model RoBERTa, and uses the standardized cosine similarity generated by CLIP to adjust the strength of the integrated multimodal features to assist prompt learning for fake news detection.
[0141] To evaluate the performance of this model and the baselines, Accuracy, Precision, Recall, and F1 were chosen as evaluation metrics. Accuracy is the proportion of correct predictions among all predictions, giving a direct indication of overall performance. Fake-precision is the proportion of samples predicted as fake news that are actually fake. Fake-recall is the proportion of actual fake news items that the method detects; a higher fake-recall means more fake news is caught, which matters for early detection. Real-precision and real-recall are defined analogously for the real-news class. F1 is the harmonic mean of precision and recall, so it is high only when both are high.
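The metric definitions above can be made concrete with a short sketch. The label strings, function names, and toy predictions below are illustrative assumptions, not data from the experiments.

```python
def accuracy(y_true, y_pred):
    """Proportion of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def class_metrics(y_true, y_pred, positive):
    """Precision, recall, and F1 with `positive` treated as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels (assumed): 3 fake and 5 real items, two of them misclassified.
y_true = ["fake", "fake", "fake", "real", "real", "real", "real", "real"]
y_pred = ["fake", "fake", "real", "fake", "real", "real", "real", "real"]

acc = accuracy(y_true, y_pred)
fake_p, fake_r, fake_f1 = class_metrics(y_true, y_pred, "fake")
real_p, real_r, real_f1 = class_metrics(y_true, y_pred, "real")
```

Calling `class_metrics` once per class gives the fake-precision/fake-recall and real-precision/real-recall pairs described in the paragraph.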
[0142] 4. Discussion of Results
[0143] 1) The comparison results with supervised multimodal fake news detection methods are shown in Table 2. Table 2 shows the results of this model under conditions of sufficient data and small sample size. MPL-Full means training with the entire training set; MPL-16 means training with only 16 data points per class.
[0144] Table 2 Results compared with fully trained methods in the fully trained setting.
[0145]
[0146] As shown in Table 2, this invention achieves the best performance among the compared models when sufficient data is available. It achieves the highest accuracy on all three datasets, and, except on the GossipCop dataset, its F1 scores for both real and fake news are also the highest. Furthermore, the CLIP backbone of this model keeps its parameters frozen; only the parameters of the continuous prompt and the multimodal feature fusion module are updated during training. The number of trainable parameters is therefore much smaller than in the other models, which lowers training cost and time. Even with fewer trainable parameters and less training time, the model still outperforms the other models, demonstrating the effectiveness of the multimodal prompt learning method.
[0147] In the small-sample setting, the MPL model achieved strong results using only 16 training samples per class. On the Twitter and PolitiFact datasets, it was not significantly worse than state-of-the-art (SOTA) models and even surpassed several classic models trained on all the training data. This demonstrates the model's effectiveness when data is limited: it achieves good classification results by learning from only a small amount of labeled data. On the GossipCop dataset, although the model's accuracy was not satisfactory, it achieved the highest recall for fake news, indicating that the method can find more fake news, which is significant for early detection.
[0148] 2) To further verify the advantages of this invention in small-sample settings, it was compared with prompt-based fake news detection methods on PolitiFact and GossipCop. For each class, k samples were selected for training, k ∈ {2, 4, 8, 16}; an equal number of samples was selected as the validation set for model selection, and the rest were used as the test set.
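A k-shot split of this kind can be sketched as below. This is one plausible reading, assuming that "an equal number" means k validation samples per class; the function name, the random seed, and the toy label list are hypothetical.

```python
import random
from collections import defaultdict

def k_shot_split(labels, k, seed=0):
    """Return (train, val, test) index lists with k train and k val items per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, val, test = [], [], []
    for label, indices in by_class.items():
        if len(indices) < 2 * k:
            raise ValueError(f"class {label!r} has fewer than {2 * k} samples")
        rng.shuffle(indices)
        train.extend(indices[:k])        # k training samples for this class
        val.extend(indices[k:2 * k])     # k validation samples (assumed reading)
        test.extend(indices[2 * k:])     # everything else is held out for testing
    return train, val, test

# Toy label list (assumed): 40 fake and 60 real items.
labels = ["fake"] * 40 + ["real"] * 60
splits = {k: k_shot_split(labels, k) for k in (2, 4, 8, 16)}
```

With two classes, each k yields 2k training items and 2k validation items, and the test set shrinks accordingly as k grows.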
[0149] Table 3 shows the results of the few-shot method on the PolitiFact dataset.
[0150]
[0151] Table 4 shows the results of the few-shot method on the GossipCop dataset.
[0152]
[0153] As shown in Table 3, on the PolitiFact dataset, this invention achieved the best results on all metrics in every k-shot experiment, demonstrating the superiority of the method in small-sample scenarios.
[0154] As shown in Table 4, on the GossipCop dataset, the present invention achieved the best accuracy (ACC) at k=8 and k=16, demonstrating the effectiveness of the method.
[0155] 5. Ablation Experiments
[0156] To verify the effectiveness of the modules proposed in this invention, ablation experiments were conducted on the three datasets under a 16-shot setting. For w/o image, the visual features H_I are removed, and only the text features H_T are used to calculate similarity with the category features H_C; for w/o text, the text features H_T are removed, and only the visual features H_I are used to calculate similarity with the category features H_C; for w/o fusion, the multimodal feature fusion module is removed, and the visual features H_I are instead directly concatenated with the text features H_T; for w/o learnable prompt, the learnable prompt vectors are removed and replaced with the manually defined prompt template [According to the image and text, this news is][class]; for w/o similarity, similarity calculation is not used for classification prediction; instead, the multimodal features H_M are concatenated with the category features H_C to obtain the feature H_F, which is input into a linear classifier for classification prediction; for w/o frozen, the parameters of CLIP are not frozen, and the model is fine-tuned during training.
[0157] The results of the ablation experiments are shown in Table 5. As can be seen from the table, each module contributes, and removing any module degrades performance. The results of "w/o image" and "w/o text" show that the model makes use of information from both modalities; using only one modality degrades performance. "w/o fusion" shows that simply concatenating features from the two modalities cannot fully exploit the multimodal context. The multimodal feature fusion method of this invention lets the features of each modality learn the semantics of the other before they are fused, thereby integrating semantic information from both modalities. "w/o learnable prompt" shows the improvement brought by continuous prompts: compared with manually written templates, letting the model learn the prompt vectors itself better captures the semantics of the categories and of the multimodal features, making the model better suited to the downstream task. "w/o similarity" shows that the model fits the CLIP pre-training objective: compared with using the CLIP encoders only to extract features and then adding a separate classification network, this method makes fuller use of the knowledge learned during CLIP pre-training and achieves better classification results. "w/o frozen" shows the improvement of prompt learning over fine-tuning in small-sample scenarios, confirming the effectiveness of the method when labeled data is scarce.
[0158] Table 5 Ablation study results for the three datasets
[0159]
[0160] The foregoing has shown and described the basic principles, main features, and advantages of the present invention. Those skilled in the art should understand that the present invention is not limited to the above embodiments. The embodiments and descriptions in the specification are merely illustrative of the principles of the invention. Various changes and modifications can be made to the invention without departing from its spirit and scope, and all such changes and modifications fall within the scope of the present invention as claimed. The scope of protection of this invention is defined by the appended claims and their equivalents.
Claims
1. A method for early detection of fake news, characterized in that it includes the following steps:
Step 1: Obtain the multimodal information to be detected;
Step 2: Construct a multimodal prompt learning framework;
Step 3: Input the multimodal information to be detected into the multimodal prompt learning framework;
Step 4: Perform multimodal learning using the multimodal prompt learning framework, detect the multimodal information to be detected, and output the detection results;
The multimodal prompt learning framework constructed in Step 2 includes a feature extraction module, a multimodal feature fusion module, a learnable prompt design module, and a similarity calculation module;
The feature extraction module is used to extract visual and textual features of the input multimodal information to be detected using a pre-trained CLIP model;
The multimodal feature fusion module is used to fuse the extracted visual features and text features to obtain the multimodal features of the multimodal information to be detected;
The learnable prompt design module is used to replace the original manually designed prompt template with a set of learnable vectors, and to obtain category features through the learnable vectors;
The similarity calculation module is used to calculate the cosine similarity between the multimodal features and the category features, and to classify the multimodal information to be detected;
Step 4 includes the following specific steps:
Step 41: Keep the CLIP model parameters frozen and input the multimodal information x to be detected into the feature extraction module, which uses the pre-trained CLIP image encoder and text encoder to extract the visual features H_I and the text features H_T of x, respectively;
Step 42: Input H_I and H_T into the multimodal feature fusion module; after passing through the two co-attention modules, text features with visual information and visual features with text information are obtained, and these are fused in the fully connected layer to yield the multimodal features H_M;
Step 43: Using the learnable prompt design module, obtain the learnable prompt vector p, and input p into the pre-trained CLIP text encoder to obtain the category features H_C;
Step 44: The similarity calculation module calculates the cosine similarity between the multimodal features H_M and the category features H_C, classifies x based on this similarity, and outputs the classification result.
2. The method for early detection of fake news as described in claim 1, characterized in that the multimodal feature fusion module includes two parallel co-attention modules and a fully connected layer. Each co-attention module includes a multi-head attention layer, a residual connection & normalization layer, a fully connected feedforward network layer, and a residual connection & normalization layer connected in sequence. The two co-attention modules are used to obtain text features with visual information and visual features with text information, respectively, based on the input text features and visual features. The fully connected layer is used to obtain fully fused multimodal features based on the input text features with visual information and the visual features with text information.
3. The method for early detection of fake news as described in claim 1, characterized in that Step 42, which involves calculating the multimodal features, includes:
Step 421: Based on the input text features H_T, obtain text features with visual information through equation (4), which uses a fully connected feedforward network layer and a multi-head attention function;
Step 422: Based on the input visual features H_I, obtain visual features with text information through equation (5);
Step 423: Use equation (6) to fuse the text features with visual information and the visual features with text information, obtaining the multimodal features H_M of x; the matrix in equation (6) represents the attention matrix.
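The bodies of equations (4) to (6) are not reproduced in this text. As a reading aid only, one form consistent with the layer order stated in claim 2 (multi-head attention, residual connection & normalization, feedforward network, residual connection & normalization, then a fully connected fusion layer) is sketched below. The symbols A_T, A_I, \tilde{H}_T, \tilde{H}_I, W, and b, the query/key/value argument order of MHAtt, and the concatenation in the last line are all assumptions and may differ from the patent's actual equations.

```latex
\begin{aligned}
A_T &= \mathrm{Norm}\big(H_T + \mathrm{MHAtt}(H_T, H_I, H_I)\big), &
\tilde{H}_T &= \mathrm{Norm}\big(A_T + \mathrm{FFN}(A_T)\big), \\
A_I &= \mathrm{Norm}\big(H_I + \mathrm{MHAtt}(H_I, H_T, H_T)\big), &
\tilde{H}_I &= \mathrm{Norm}\big(A_I + \mathrm{FFN}(A_I)\big), \\
H_M &= W\,[\tilde{H}_T ; \tilde{H}_I] + b .
\end{aligned}
```

Here the text features act as the query over the visual features in the first row and the roles are swapped in the second, which is the usual way a co-attention pair lets each modality attend to the other.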
4. The method for early detection of fake news as described in claim 3, characterized in that Step 44 includes the following specific steps:
Step 441: Calculate the cosine similarity between the multimodal features H_M and the category features H_C to obtain the probability that x belongs to category i, where the calculation uses the temperature parameter learned by CLIP and the number of categories;
Step 442: Classify x according to the probability p and output the early detection result for fake news.
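The probability formula itself is not reproduced in this text. A common form for this kind of similarity-based prediction, assumed here rather than taken from the patent, is a softmax over temperature-scaled cosine similarities: p(y = i | x) = exp(cos(H_M, H_C^i) / τ) / Σ_j exp(cos(H_M, H_C^j) / τ), with the sum running over all categories. The sketch below implements that assumed form; the function names, the toy vectors, and the temperature value 0.01 are placeholders.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def class_probabilities(h_m, class_features, temperature=0.01):
    """Softmax over cosine similarities divided by a temperature (assumed form)."""
    logits = [cosine_similarity(h_m, h_c) / temperature for h_c in class_features]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy features (assumed): one multimodal vector and two category vectors.
h_m = [0.2, 0.9, 0.1]
h_c = [[0.1, 1.0, 0.0], [1.0, 0.1, 0.3]]  # index 0 and 1 are arbitrary class slots
probs = class_probabilities(h_m, h_c)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

A small temperature sharpens the distribution, so even modest differences in cosine similarity lead to a confident prediction.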
5. A fake news early detection system, implemented based on the fake news early detection method as described in claim 1, characterized in that it includes a detection information input module, a multimodal prompt learning framework construction module, a detection text input module, and a multimodal learning module; The detection information input module is used to acquire the multimodal information to be detected; The multimodal prompt learning framework construction module is used to construct a multimodal prompt learning framework; The detection text input module is used to input the multimodal information to be detected into the multimodal prompt learning framework; The multimodal learning module performs multimodal learning through the multimodal prompt learning framework, detects the multimodal information to be detected, and outputs the detection results.
6. An electronic device, characterized in that it includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the fake news early detection method as described in any one of claims 1 to 4.
7. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the method for early detection of fake news as described in any one of claims 1 to 4.