Multimodal fake news detection method
By extracting multi-granular semantic features from news images and text using a visual language pre-trained model and a text pre-trained model, and fusing them using a hybrid heterogeneous expert network, the problem of insufficient image semantic understanding and text feature fusion in fake news detection is solved, and more efficient fake news detection is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHANDONG NORMAL UNIV
- Filing Date
- 2023-12-04
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies suffer from insufficient image semantic understanding and inadequate text feature fusion in fake news detection, resulting in low detection efficiency.
A visual language pre-trained model is used to obtain descriptive text for news images. A text pre-trained model is combined to extract semantic features at multiple granular levels. These features are then fused using a hybrid heterogeneous expert network and finally concatenated with image features for classification.
It improves the efficiency of fake news detection, enhances semantic understanding and feature representation capabilities, and improves model performance and prediction accuracy.
Figure CN117609765B_ABST
Description
Technical Field
[0001] This invention discloses a multimodal fake news detection method, belonging to the field of information detection technology.
Background Technology
[0002] With the development of the internet and social media, users can easily share and disseminate information. Much of this information lacks authoritative verification and often caters to the public's curiosity, allowing fake news to stand out from the crowd. This fake news is increasing in quantity, spreading faster, and affecting a wider audience. In recent years, the relationship between fake news and major public events has become increasingly close, often appearing during such events. The public's high level of attention to these events makes them more receptive to and prone to disseminating unverified information, which severely damages media credibility, misleads the public, and can lead to social chaos.
[0003] With the development of multimedia technology, news is no longer presented solely as text; most news articles now include images or videos. This is partly because images can attract readers' interest and attention, allowing articles to stand out from the vast sea of news, leading to more clicks and shares, thus expanding the reach. On the other hand, news images can add a layer of visual evidence to fake news, making it appear more authentic and credible. Since people are more receptive to visual information, fake news with images can better mimic the appearance of real news, making it easier to deceive readers. Therefore, researching multimodal fake news detection methods is essential.
[0004] Previous methods have achieved good results in fake news detection tasks, but they still have the following two drawbacks:
[0005] (1) Although pre-trained models such as VGG can extract image features, they have certain deficiencies in semantic understanding and interpretability, and cannot fully understand the meaning of the image.
[0006] (2) Single-structure expert networks limit the fusion and representation of text modalities, and cannot make full use of word-level and sentence-level semantic features of text. The combination of multiple identical network structures cannot effectively utilize multi-granular semantic features and cannot deeply explore the semantics of text.
[0007] Therefore, a multimodal fake news detection method that addresses the above problems, comprehensively uses the semantic features of news text and images, and achieves high recognition efficiency has important research and application value.
Summary of the Invention
[0008] To address the aforementioned problems, this invention provides a multimodal fake news detection method. This method utilizes a visual language pre-trained model to obtain news image description text with stronger semantic understanding and interpretability. It also utilizes a text pre-trained model to obtain multi-granularity hierarchical semantic features from both the news text and the news image description text. These features are effectively fused using a hybrid heterogeneous expert network and then concatenated with image features obtained from an image pre-trained model to obtain multimodal news features, which are finally classified. The method includes the following steps:
[0009] Step 1: Use a visual language pre-trained model to obtain descriptive text for news images;
[0010] Step 2: Based on the news text, news images, and the descriptive text of the obtained news images, obtain the text features of the news text, the image features of the news images, and the text features of the descriptive text of the news images;
[0011] Step 3: Input the text features of the news text into the hybrid heterogeneous expert network module to obtain text fusion features, and input the text features of the news image description text into the hybrid heterogeneous expert network module to obtain description fusion features. The hybrid heterogeneous expert network consists of experts with two different network structures. Different weights are assigned to the two experts through a gate network. Finally, the weighted outputs of the two expert networks are summed to obtain the output features of the hybrid heterogeneous expert network.
[0012] Step 4: Concatenate the text fusion features, description fusion features, and image features of the news image to obtain multimodal news features. Input the multimodal news features into the classifier to determine whether the news to be detected is fake news.
[0013] Furthermore, in step one, the news image to be detected is input into a pre-trained visual language model to obtain the news image description text. For news without an accompanying image, the news image description text is set to "null".
[0014] Furthermore, step two is specifically implemented using the following formula:
[0015] (1) Text feature extraction of news texts
[0016]
[0017]
[0018]
[0019]
[0020] (2) Text feature extraction of news photo descriptions
[0021]
[0022]
[0023]
[0024]
[0025] (3) Image feature extraction from news photos
[0026]
[0027] Where Text represents the news text, Caption represents the news image description text, Image represents the news image, n1, m1, n2, m2, and m3 represent the data sequence numbers, token represents the sequence obtained by processing the text using BertTokenizer, e^w is the output of the last_hidden_state layer of the BERT model, where w denotes word-level text features, e^s is the output of the pooler_output layer of the BERT model, where s denotes sentence-level text features, and i represents the image features extracted by the ViT model.
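The feature-extraction step can be sketched in formula form from the symbol definitions above. This is an assumed reading rather than the patent's typeset formulas: the subscripts t (news text) and c (caption) and the grouping of e^w and e^s into T_t and T_c are assumptions, and the indices n1, m1, n2, m2, and m3 are left out because their exact role is not stated.

```latex
\begin{aligned}
\mathit{token}_t &= \mathrm{BertTokenizer}(\mathit{Text}), \\
T_t = \{e^w_t,\; e^s_t\} &= \mathrm{BERT}(\mathit{token}_t), \\
\mathit{token}_c &= \mathrm{BertTokenizer}(\mathit{Caption}), \\
T_c = \{e^w_c,\; e^s_c\} &= \mathrm{BERT}(\mathit{token}_c), \\
i &= \mathrm{ViT}(\mathit{Image}).
\end{aligned}
```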
[0028] Furthermore, the hybrid heterogeneous expert network module in step three includes two expert networks: expert 1 is a TextCNN network, and expert 2 is an LSTM network. The outputs of the two expert networks are weighted by a gate network. Expert 1 is responsible for processing word-level text features e^w, and expert 2 is responsible for processing sentence-level text features e^s. The outputs of the two experts are concatenated and input into the gate network to obtain the weights of the two experts. These weights are then multiplied by the corresponding expert outputs to obtain the weighted expert output. The hybrid heterogeneous expert network is implemented using the following formula:
[0029]
[0030]
[0031]
[0032]
[0033]
[0034] Where φ represents the learning parameters of the TextCNN network, τ represents the learning parameters of the LSTM network, Gate represents the gate network, ψ represents the parameters learned by the Gate network, and weight represents the weights output by the gate network.
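The five relations of the hybrid heterogeneous expert network can be sketched as follows. The names F_t, F_l, and F_cat follow the detailed implementation later in the description; the output name F and the indexing weight_1, weight_2 of the two gate outputs are assumptions made for illustration.

```latex
\begin{aligned}
F_t &= \mathrm{TextCNN}(e^w;\, \varphi), \\
F_l &= \mathrm{LSTM}(e^s;\, \tau), \\
F_{cat} &= \mathrm{cat}(F_t,\, F_l), \\
\mathit{weight} &= \mathrm{Gate}(F_{cat};\, \psi), \\
F &= \mathit{weight}_1 \cdot F_t + \mathit{weight}_2 \cdot F_l .
\end{aligned}
```

Because the gate ends in a Softmax layer, the two weights are non-negative and sum to 1, so F is a convex combination of the two expert outputs.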
[0035] Furthermore, the gate network in step three consists of a fully connected layer, a ReLU activation layer, a fully connected layer, and a Softmax activation layer, while the classifier in step four consists of a fully connected layer, a LeakyReLU activation layer, a fully connected layer, a LeakyReLU activation layer, a fully connected layer, a LeakyReLU activation layer, a fully connected layer, a LeakyReLU activation layer, a fully connected layer, and a Sigmoid activation layer.
[0036] Furthermore, in step four, the obtained text fusion features, description fusion features, and image features are fused to obtain multimodal news features V. V is then input into a classifier to obtain the classification result, which is achieved through the following formula:
[0037]
[0038] MLP denotes the classifier, whose output is a probability value between 0 and 1. Depending on this value, the news item is considered either real news or fake news. The model is trained by minimizing the cross-entropy loss, computed with the following formula:
[0039]
[0040] N represents the number of news items in the dataset; the other two symbols denote the true label of the i-th news item and the predicted label of the i-th news item, respectively.
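A sketch of the classification and loss formulas, under stated assumptions: the symbols y_i (true label) and ŷ_i (predicted probability) are introduced here for illustration, the names F_text and F_desc for the two fused features are hypothetical, and the binary form of cross-entropy is assumed because the classifier ends in a single Sigmoid output.

```latex
\begin{aligned}
V &= \mathrm{cat}(F_{\mathit{text}},\, F_{\mathit{desc}},\, i), \\
\hat{y} &= \mathrm{MLP}(V), \\
\mathcal{L} &= -\frac{1}{N} \sum_{i=1}^{N}
  \Bigl[\, y_i \log \hat{y}_i + (1 - y_i) \log\bigl(1 - \hat{y}_i\bigr) \Bigr].
\end{aligned}
```

In the first line, i denotes the image features as defined earlier; in the loss, i is the index of the news item.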
[0041] In summary, the method of the present invention has the following beneficial effects:
[0042] (1) The image and text features extracted by the pre-trained model provide richer semantic understanding and enhance the representation ability of the features.
[0043] (2) Fusing the extracted features through a hybrid heterogeneous expert network effectively improves the performance and predictive ability of the model.
Attached Figure Description
[0044] Figure 1 is a flowchart illustrating the multimodal fake news detection method according to the present invention;
[0045] Figure 2 is a schematic diagram illustrating the principle of the multimodal fake news detection method according to the present invention.
Detailed Implementation
[0046] As an example, Figure 1 is a flowchart of the multimodal fake news detection method, which includes steps S110 to S140.
[0047] S110. Generate descriptive text for news images using a visual language model.
[0048] The news information to be detected includes news text and news images. Traditional multimodal fake news detection models typically use both text and images for detection. This invention's multimodal fake news detection method utilizes news images to generate descriptive text, thereby acquiring more semantic information from the news images. In the specific implementation process, the news images are preprocessed and then input into the BLIP-2 model to obtain the descriptive text for the news images. The descriptive text for the news images does not exceed 50 characters. For news information without images, the descriptive text for the news images is set to "null".
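The caption handling in S110 can be illustrated with a small helper. This is a minimal sketch, not the patent's implementation: the BLIP-2 call itself is not shown, the function name `prepare_caption` is hypothetical, and enforcing the 50-character limit by truncation is an assumption (the text only states that the description does not exceed 50 characters).

```python
from typing import Optional

MAX_CAPTION_CHARS = 50      # length limit stated in the description
NO_IMAGE_CAPTION = "null"   # placeholder stated for news without an image


def prepare_caption(caption: Optional[str]) -> str:
    """Return the caption text that is passed on to the text encoder.

    `caption` is the output of the visual language model, or None when the
    news item has no image. Truncation is an assumed way of enforcing the limit.
    """
    if caption is None or not caption.strip():
        return NO_IMAGE_CAPTION
    return caption.strip()[:MAX_CAPTION_CHARS]


with_image = prepare_caption("a crowd of people standing in front of a building")
without_image = prepare_caption(None)
```

Here `without_image` is the literal string "null", so items with and without images follow the same downstream path.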
[0049] S120. Use the pre-trained model ViT to generate image features of news images, and use the pre-trained model BERT to generate text features of news text and text features of news image description text, respectively.
[0050] This invention uses the pre-trained model ViT to extract image features from news images. For news articles without images, a blank image is generated as the input. In the specific implementation, the news images are first preprocessed: their dimensions are uniformly adjusted to 224 × 224, and they are normalized with per-channel means [0.485, 0.456, 0.406] and standard deviations [0.229, 0.224, 0.225]. The image, converted to a Tensor, is then input into the ViT model to obtain an output vector of length 1000. A fully connected layer then converts the vector length to 320, yielding the image features of the news image. This invention uses the pre-trained model BERT to extract text features from news text. In the specific implementation, the news text is first input into BertTokenizer to obtain a word sequence, where the news text is limited to no more than 170 words and the news image description text to no more than 50 words. The word sequence is then input into BERT to obtain the word-level text features e^w and the sentence-level text features e^s. This can be achieved using the following formula:
[0051] (1) Text feature extraction of news texts
[0052]
[0053]
[0054]
[0055]
[0056] (2) Text feature extraction of news photo descriptions
[0057]
[0058]
[0059]
[0060]
[0061] (3) Image feature extraction from news photos
[0062]
[0063] Where Text represents the news text, Caption represents the news image description text, Image represents the news image, n1, m1, n2, m2, and m3 represent the data sequence numbers, token represents the sequence obtained by processing the text using BertTokenizer, e^w is the output of the last_hidden_state layer of the BERT model, where w denotes word-level text features, e^s is the output of the pooler_output layer of the BERT model, where s denotes sentence-level text features, and i represents the image features extracted by the ViT model. In the text features T_t, e^w has dimensions (batch_size, 170, 768) and e^s has dimension (batch_size, 170); in the text features T_c, e^w has dimensions (batch_size, 50, 768) and e^s has dimension (batch_size, 50).
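The image preprocessing in S120 can be sketched without any deep learning framework. The sketch below assumes that the two parameter lists are per-channel means and standard deviations applied after scaling 8-bit pixel values to [0, 1] (the usual convention when an image is converted to a Tensor), and that the blank image for news without a picture is white; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

MEAN = (0.485, 0.456, 0.406)   # per-channel parameters given in the description
STD = (0.229, 0.224, 0.225)
SIZE = 224                     # target height and width

Pixel = Tuple[int, int, int]


def normalize_pixel(rgb: Sequence[int]) -> Tuple[float, ...]:
    """Scale 8-bit RGB to [0, 1], then apply (x - mean) / std per channel."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, MEAN, STD))


def blank_image(size: int = SIZE, value: int = 255) -> List[List[Pixel]]:
    """Stand-in input for news without an image (white is an assumption)."""
    return [[(value, value, value)] * size for _ in range(size)]


def normalize_image(img: List[List[Pixel]]) -> List[List[Tuple[float, ...]]]:
    """Normalize every pixel of an already resized image."""
    return [[normalize_pixel(p) for p in row] for row in img]


normalized = normalize_image(blank_image())
```

Resizing is not shown; the sketch starts from an image that is already 224 × 224.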
[0064] S130. Input the text features of the news text and the text features of the news image description text into the hybrid heterogeneous expert network to obtain text fusion features and description fusion features.
[0065] The text features of the news text and the text features of the news image description text obtained in S120 are input into the hybrid heterogeneous expert network module to obtain text fusion features and description fusion features. In the specific implementation, the word-level text features e^w are input into Expert 1, the TextCNN network, and the sentence-level text features e^s are input into Expert 2, the LSTM network, to obtain feature vectors F_t and F_l, each with dimensions (batch_size, 320). F_t and F_l are then concatenated to obtain F_cat, and F_cat is input into the gate network to obtain weights. These weights are multiplied by the outputs of the corresponding experts, and the results are summed to obtain the weighted text fusion features. The hybrid heterogeneous expert network is implemented using the following formula:
[0066]
[0067]
[0068]
[0069]
[0070]
[0071] Where φ represents the learning parameters of the TextCNN network, τ represents the learning parameters of the LSTM network, Gate represents the gate network, ψ represents the parameters learned by the gate network, and weight represents the weights output by the gate network. The text fusion features and description fusion features can be obtained using the above formula. The gate network consists of a fully connected layer, a ReLU activation layer, a fully connected layer, and a Softmax activation layer.
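The gating and weighted sum in S130 can be sketched in plain Python. This is an illustrative sketch under stated assumptions: the expert outputs F_t and F_l are replaced by random 320-dimensional stand-in vectors (the TextCNN and LSTM experts are not implemented), the hidden width of the gate (`HIDDEN = 64`) is hypothetical, and the weights are random rather than learned.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]   # one row of weights per output unit


def linear(w: Matrix, b: Vector, x: Vector) -> Vector:
    """Fully connected layer: y = W x + b."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def softmax(x: Vector) -> Vector:
    m = max(x)                                # subtract the max for stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def gate(f_cat: Vector, w1: Matrix, b1: Vector, w2: Matrix, b2: Vector) -> Vector:
    """FC -> ReLU -> FC -> Softmax, producing one weight per expert."""
    return softmax(linear(w2, b2, relu(linear(w1, b1, f_cat))))


def fuse(f_t: Vector, f_l: Vector, w1: Matrix, b1: Vector,
         w2: Matrix, b2: Vector) -> Vector:
    """Concatenate expert outputs, gate them, and return the weighted sum."""
    f_cat = f_t + f_l                         # list concatenation, length 640
    weight = gate(f_cat, w1, b1, w2, b2)
    return [weight[0] * a + weight[1] * b for a, b in zip(f_t, f_l)]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
DIM, HIDDEN = 320, 64                         # HIDDEN is a hypothetical width
f_t = [rng.uniform(-1.0, 1.0) for _ in range(DIM)]   # stand-in for TextCNN output
f_l = [rng.uniform(-1.0, 1.0) for _ in range(DIM)]   # stand-in for LSTM output
w1, b1 = random_matrix(HIDDEN, 2 * DIM, rng), [0.0] * HIDDEN
w2, b2 = random_matrix(2, HIDDEN, rng), [0.0, 0.0]

weights = gate(f_t + f_l, w1, b1, w2, b2)
fused = fuse(f_t, f_l, w1, b1, w2, b2)
```

The same `fuse` routine would be applied once to the news text features and once to the caption features, giving the text fusion features and the description fusion features.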
[0072] S140. After concatenating the obtained text fusion features, description fusion features, and image features of the news images, input them into the classifier for classification to obtain the classification result of the news.
[0073] In the specific implementation process, the text fusion features, description fusion features, and image features of the news images obtained in parts S120 and S130 are concatenated to obtain multimodal news features with a dimension of (batch_size, 960). The multimodal news features are then input into a classifier to obtain the classification result, which is achieved through the following formula:
[0074]
[0075] MLP denotes the classifier, whose output is a probability value between 0 and 1. Depending on this value, the news item is considered either real news or fake news. The model is trained by minimizing the cross-entropy loss, computed with the following formula:
[0076]
[0077] N represents the number of news items in the dataset; the other two symbols denote the true label of the i-th news item and the predicted label of the i-th news item, respectively.
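The loss computation can be sketched as follows, assuming the binary form of cross-entropy averaged over N items (consistent with a single Sigmoid output). The function name and the clipping constant `eps` are hypothetical; which label value stands for fake news is not specified here, so the sketch makes no assumption about it.

```python
import math
from typing import Sequence


def binary_cross_entropy(y_true: Sequence[float], y_pred: Sequence[float],
                         eps: float = 1e-7) -> float:
    """Mean binary cross-entropy over N news items.

    y_true holds the true labels (0 or 1); y_pred holds the classifier's
    probabilities. Predictions are clipped to [eps, 1 - eps] to avoid log(0).
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return -total / len(y_true)


uninformative = binary_cross_entropy([1, 0], [0.5, 0.5])     # equals ln 2
confident = binary_cross_entropy([1, 0], [0.99, 0.01])       # close to 0
```

A classifier that always outputs 0.5 incurs a loss of ln 2 per item, which gives a convenient baseline when monitoring training.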
[0078] The classifier consists of a fully connected layer, a LeakyReLU activation layer, a fully connected layer, a LeakyReLU activation layer, a fully connected layer, a LeakyReLU activation layer, a fully connected layer, and a Sigmoid activation layer.
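The concatenation and the classifier described in this paragraph can be sketched in plain Python. The layer widths (960, 256, 64, 16, 1) are hypothetical, since only the input width of 960 and the layer sequence are given; the LeakyReLU slope of 0.01 is an assumed default; the weights are random and the three input features are stand-in vectors.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Layer = Tuple[List[Vector], Vector]   # (weight rows, bias)


def linear(w: List[Vector], b: Vector, x: Vector) -> Vector:
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def leaky_relu(x: Vector, slope: float = 0.01) -> Vector:
    return [v if v > 0.0 else slope * v for v in x]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    w = [[rng.uniform(-0.05, 0.05) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out


def classify(v: Vector, layers: List[Layer]) -> float:
    """FC + LeakyReLU for every layer except the last, then FC + Sigmoid."""
    h = v
    for w, b in layers[:-1]:
        h = leaky_relu(linear(w, b, h))
    w, b = layers[-1]
    return sigmoid(linear(w, b, h)[0])


rng = random.Random(0)
text_fused = [rng.uniform(-1.0, 1.0) for _ in range(320)]   # stand-in features
desc_fused = [rng.uniform(-1.0, 1.0) for _ in range(320)]
image_feat = [rng.uniform(-1.0, 1.0) for _ in range(320)]

v = text_fused + desc_fused + image_feat                    # length 960
sizes = [960, 256, 64, 16, 1]                               # hypothetical widths
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
prob = classify(v, layers)
```

With four entries in `layers`, the forward pass matches the sequence of four fully connected layers, three LeakyReLU activations, and a final Sigmoid.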
[0079] Figure 2 is a schematic diagram illustrating the principle of the multimodal fake news detection method according to the present invention. As shown in Figure 2, in the specific implementation, the first step is to use the visual language pre-trained model BLIP-2 to generate news image description text from the news images in the news information. The second step is to preprocess the news text and news image description text to extract text features, and to preprocess the news images to extract image features. The third step is to input the extracted text features of the news text and of the news image description text into the hybrid heterogeneous expert network: the word-level text features e^w are input into Expert 1, a TextCNN network, and the sentence-level text features e^s are input into Expert 2, an LSTM network. The outputs of the two experts are concatenated and fed into a gate network to obtain weights. These weights are multiplied by the outputs of the two experts, and the results are summed to obtain the fused features. In the fourth step, the obtained text fusion features, description fusion features, and image features of the news image are concatenated and fed into a classifier to obtain the classification result.
[0080] The multimodal fake news detection method of this invention was used to detect fake news on the Chinese dataset Weibo21. The results are shown in Table 1.
[0081] Table 1: Detection results of the multimodal fake news detection model of the present invention on the Weibo21 dataset.
[0082]
[0083] As can be seen from Table 1, the multimodal fake news detection method of the present invention achieves an average accuracy of 93% on the Chinese dataset Weibo21. The overall classification accuracy, as well as the F1 score for real news and the F1 score for fake news, surpass other models.
Claims
1. A multimodal fake news detection method, which utilizes a visual language pre-trained model to obtain news image description text with stronger semantic understanding and interpretability, utilizes a text pre-trained model to obtain multi-granularity hierarchical semantic features of the news text and news image description text, effectively fuses them through a hybrid heterogeneous expert network, concatenates them with image features obtained using an image pre-trained model to obtain multimodal news features, and finally performs classification. The method includes the following steps:
Step 1: Use a visual language pre-trained model to obtain descriptive text for news images;
Step 2: Based on the news text, news images, and the descriptive text of the obtained news images, obtain the text features of the news text, the image features of the news images, and the text features of the descriptive text of the news images;
Step 3: Input the text features of the news text into the hybrid heterogeneous expert network module to obtain text fusion features, and input the text features of the news image description text into the hybrid heterogeneous expert network module to obtain description fusion features. The hybrid heterogeneous expert network consists of experts with two different network structures. Different weights are assigned to the two experts through a gate network. Finally, the weighted outputs of the two expert networks are summed to obtain the output features of the hybrid heterogeneous expert network. The hybrid heterogeneous expert network module includes two expert networks: Expert 1 is a TextCNN network, and Expert 2 is an LSTM network. The outputs of the two expert networks are weighted by a gate network. Expert 1 is responsible for processing word-level text features e^w, and Expert 2 is responsible for processing sentence-level text features e^s. The outputs of the two experts are concatenated and input into the gate network to obtain the weights of the two experts. These weights are then multiplied by the corresponding expert's output to obtain the weighted expert's output. The hybrid heterogeneous expert network is implemented using the following formula, where φ represents the learning parameters of the TextCNN network, τ represents the learning parameters of the LSTM network, Gate represents the gate network, ψ represents the parameters learned by the Gate network, and weight represents the weights output by the gate network.
Step 4: Concatenate the text fusion features, description fusion features, and image features of the news image to obtain multimodal news features. Input the multimodal news features into the classifier to determine whether the news to be detected is fake news.
2. The multi-modal fake news detection method of claim 1, wherein: In step one, the news image to be detected is input into a pre-trained visual language model to obtain the news image description text. For news without an accompanying image, the news image description text is set to "null".
3. The multi-modal fake news detection method of claim 1, wherein: Step two is specifically implemented using the following formula: (1) Text feature extraction of news texts; (2) Text feature extraction of news photo description text; (3) Image feature extraction from news photos. Where Text represents the news text, Caption represents the news image description text, Image represents the news image, n1, m1, n2, m2, and m3 represent the data sequence numbers, token represents the sequence obtained by processing the text using BertTokenizer, e^w is the output of the last_hidden_state layer of the BERT model, where w denotes word-level text features, e^s is the output of the pooler_output layer of the BERT model, where s denotes sentence-level text features, and i represents the image features extracted by the ViT model.
4. The multi-modal fake news detection method of claim 1, wherein: The gate network in step three consists of a fully connected layer, a ReLU activation layer, a fully connected layer, and a Softmax activation layer. The classifier in step four consists of a fully connected layer, a LeakyReLU activation layer, a fully connected layer, a LeakyReLU activation layer, a fully connected layer, a LeakyReLU activation layer, a fully connected layer, a LeakyReLU activation layer, a fully connected layer, and a Sigmoid activation layer.
5. The multi-modal fake news detection method of claim 1, wherein: In step four, the obtained text fusion features, description fusion features, and image features are fused to obtain multimodal news features V. V is then input into a classifier to obtain the classification result, which is achieved through the following formula, in which MLP denotes the classifier and its output is a probability value between 0 and 1; depending on this value, the news is considered either real news or fake news. The model is trained by minimizing the cross-entropy loss, computed with the following formula, in which N represents the number of news items in the dataset and the other two symbols denote the true label and the predicted label of the i-th news item, respectively.