A method and system for identifying fake news based on data clustering
By constructing a K-means clustering model based on propagation features, the problem of insufficient recognition ability of fake news detection across modalities and languages is solved, and efficient and stable fake news detection is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SUN YAT SEN UNIV
- Filing Date
- 2023-06-27
- Publication Date
- 2026-06-16
AI Technical Summary
Existing fake news detection technologies have weak cross-modal and cross-linguistic recognition capabilities and poor model recognition stability, making it difficult to cope with the development of deep synthesis technologies for machine-generated text and images.
A news dataset is constructed, and dissemination features such as dissemination timeline, user characteristics, feedback features, and popularity features are extracted. The K-means algorithm is used for clustering, and a fake news detection model is built. The authenticity of news is judged by the dissemination feature vector.
It achieves cross-modal and cross-language fake news detection, improves the stability and efficiency of model recognition, reduces costs, and has better clustering results than other algorithms.
Figure CN116719941B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the technical field of content detection and data clustering, and specifically to a method for detecting fake news using data clustering technology.
Background Technology
[0002] With the widespread use of social media, an increasing amount of misinformation is also rampant on social media platforms. This fake news includes rumors, defamation, and other false information, negatively impacting people's access to information and their decision-making.
[0003] Compared with traditional fake news, fake news on social media platforms is characterized by intelligent production and dissemination: it can use AI forgery technology, which makes content difficult to trace or verify; many photos and videos are so realistic that they are indistinguishable from genuine ones; and its dissemination often relies heavily on social bots or online trolls. Therefore, efficient and stable automated methods are needed to detect fake news on social media.
[0004] Current fake news detection methods primarily target specific text types, platforms, and languages. These methods have weak cross-modal recognition capabilities. Furthermore, existing technologies mainly focus on the content differences between fake and real news, extracting content features for identification. However, with the advancement of fake news generation technology, the content differences between fake and real news are becoming increasingly smaller, leading to unstable fake news detection results.
[0005] One existing technology is "Fake News Detection Method, Device, Electronic Equipment and Readable Storage Medium" (CN202210161193.3), which is a method for constructing a fake news recognition model, a fake news recognition method and device based on neural network technology. Its approach is as follows: 1) Extract textual and image features from news articles and combine them into multimodal features for each news article; 2) Train a fake news recognition model using the method proposed in 1), and then use this model to identify fake news. The drawbacks of this method and device are that: current deep synthesis technologies related to text and images are developing rapidly, and the content features of text and images may change quickly, leading to poor recognition stability of the model trained using this method. Furthermore, it performs poorly in recognizing video and audio news.
[0006] The second existing technology is "A Fake News Explainability Detection System and Method Based on Evidence Inference Network" (CN202110045012.6), which is a fake news detection system and method based on an evidence-aware hierarchical interactive attention network. Its approach is as follows: 1) The system includes an input embedding module, an internal interaction module, a global inference module, and a task learning module; 2) The system predicts explainability of fake news by detecting suspicious parts in the fake news. The drawback of this system and method is that it can only capture suspicious and contradictory parts of text data. Therefore, news of image, video, and audio types often needs to be converted into text for recognition, resulting in poor cross-modal recognition performance.
Summary of the Invention
[0007] The purpose of this invention is to overcome the shortcomings of existing methods and propose a method and system for identifying fake news based on data clustering. The main problems addressed by this invention are: 1) how to overcome the defects of existing fake news identification technologies and achieve detection of cross-modal and cross-language news; 2) how to improve the stability of the model's recognition and detection performance under the condition that machine-generated text and image deep synthesis technology is developing rapidly and newly generated text or images often do not have the features of existing text and images.
[0008] To address the aforementioned problems, this invention proposes a method for identifying fake news based on data clustering, the method comprising:
[0009] To construct a news dataset, a media web crawler was designed to collect news data from the media and store the collected data. The data of each news item was classified and cleaned, and the authenticity of each news item in the dataset was distinguished by manual annotation. Finally, a news dataset containing real news and fake news was formed.
[0010] In the news dataset, the dissemination features of each news item are extracted. The extracted dissemination features include: dissemination time sequence features, dissemination user features, dissemination feedback features, and dissemination popularity features. Then, these four features are used to construct the dissemination feature vector of each news item.
[0011] A news detection model is constructed, and the K-means method is used to cluster the news dataset. This includes treating the propagation feature vector of each news item as a point, selecting a real news item and a fake news item in the dataset as the center points of the two clusters, calculating the distance from each sample news item to the two center points for each sample news item to be clustered, assigning each sample news item to the category corresponding to the center point with the shortest distance based on the distance from the two center points, recalculating the center point of each category, and using the new center point as the new center point of the category. The above process is iterated repeatedly until the minimum error change is reached, and finally the news detection model is obtained.
[0012] Input the news data to be detected, extract the propagation feature vector of the news to be detected through feature extraction, import these propagation feature vectors into the news detection model, calculate the authenticity score of the news, and output the news detection result.
[0013] Preferably, the news data on the media specifically includes:
[0014] The specific content of the news, the media in which the news was disseminated, the content of the news comments and the number of likes on the comments, the number of reposts and views of the news, and the topic tags to which the news belongs.
[0015] Preferably, the propagation timing feature specifically includes:
[0016] The number of news dissemination peaks: by statistically analyzing the relationship between the dissemination time and the number of reposts of each news item, the number of repost peaks of each news item is calculated.
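The text does not define what counts as a peak. The following is a minimal sketch, assuming that reposts have already been bucketed into equal time intervals (for example hourly counts) and that a peak is a bucket strictly higher than both neighbours. The function name `count_repost_peaks` and the `min_height` parameter are hypothetical and not part of the patent.

```python
from typing import Sequence


def count_repost_peaks(counts: Sequence[int], min_height: int = 1) -> int:
    """Count strict local maxima in a series of per-interval repost counts.

    Assumption (not from the patent): a peak is an interval whose count is
    strictly greater than both neighbours and at least `min_height`.
    Flat plateaus are therefore not counted as peaks.
    """
    # Treat the time before and after the observed series as zero reposts.
    padded = [0, *counts, 0]
    peaks = 0
    for i in range(1, len(padded) - 1):
        if (padded[i] >= min_height
                and padded[i] > padded[i - 1]
                and padded[i] > padded[i + 1]):
            peaks += 1
    return peaks
```

A stricter definition (for example, requiring a minimum gap between peaks) could be swapped in without changing how the resulting count is used in the feature vector.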
[0017] Preferably, the user characteristics being disseminated specifically include:
[0018] News disseminators and news forwarding users are categorized. News disseminators are identified and mapped to corresponding numerical sequences. For news forwarding users, social bot identification technology is used to determine if they are social bots. A score of 0 to 1 is used as the news social bot score; a score of 0 indicates that all news disseminators are real accounts, while a score of 1 indicates that all news disseminators are social bots.
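The text fixes only the two endpoints of the social bot score (0 when all accounts are real, 1 when all are bots) and does not say how intermediate values are obtained. The sketch below assumes the score is simply the fraction of forwarding users flagged as bots by some external detector, and that disseminators are mapped to integer codes in order of first appearance. Both function names and the codebook scheme are illustrative assumptions.

```python
from typing import Dict, Iterable


def social_bot_score(is_bot_flags: Iterable[bool]) -> float:
    """Fraction of forwarding users flagged as social bots, in [0, 1].

    Assumption: the per-user flags come from a separate bot detector,
    which is outside the scope of this sketch.
    """
    flags = list(is_bot_flags)
    if not flags:
        return 0.0  # assumption: no forwarding users means no bot evidence
    return sum(flags) / len(flags)


def encode_disseminator(name: str, codebook: Dict[str, int]) -> int:
    """Map a disseminator name to a stable integer code.

    New names receive the next unused integer; known names keep their code.
    """
    return codebook.setdefault(name, len(codebook))
```

For example, `social_bot_score([True, False, False, True])` evaluates to 0.5 under this assumption.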
[0019] Preferably, the news feedback feature specifically includes:
[0020] The sentiment tendency and degree of sentiment homogeneity of news comments are determined by extracting and segmenting comments and reposts from news articles, and then performing sentiment analysis on these comments and reposts using the SnowNLP module. The result is a sentiment score for each comment; a higher score indicates a more positive sentiment, and vice versa. By calculating the overall sentiment score of the text comments, the sentiment tendency of the news comments is obtained, and the probability distribution of sentiment types in the news comments is statistically analyzed as a representation of the degree of sentiment homogeneity.
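As a rough illustration of this step, the sketch below scores each comment, averages the scores into a sentiment tendency, and bins the scores into sentiment types to obtain their distribution. Several choices are assumptions rather than details from the patent: the three sentiment types and the `neutral_band` thresholds, the use of the largest share as a single homogeneity number, and the fallback values for an item with no comments. The default scorer relies on SnowNLP's `sentiments` property, which yields a value between 0 and 1, and any other scorer with the same range can be passed in instead.

```python
from collections import Counter
from statistics import mean
from typing import Callable, Dict, Sequence, Tuple


def default_scorer(text: str) -> float:
    """Score one comment with SnowNLP (closer to 1 means more positive)."""
    from snownlp import SnowNLP  # imported lazily; requires the snownlp package
    return SnowNLP(text).sentiments


def sentiment_features(
    comments: Sequence[str],
    scorer: Callable[[str], float] = default_scorer,
    neutral_band: Tuple[float, float] = (0.4, 0.6),  # hypothetical thresholds
) -> Tuple[float, float, Dict[str, float]]:
    """Return (sentiment tendency, homogeneity, sentiment-type distribution)."""
    if not comments:
        return 0.5, 0.0, {}  # assumption: neutral tendency when no comments exist
    scores = [scorer(c) for c in comments]
    low, high = neutral_band
    labels = [
        "negative" if s < low else "positive" if s > high else "neutral"
        for s in scores
    ]
    counts = Counter(labels)
    distribution = {label: n / len(labels) for label, n in counts.items()}
    # Assumption: homogeneity is summarised as the share of the dominant type.
    homogeneity = max(distribution.values())
    return mean(scores), homogeneity, distribution
```

Passing the scorer as a parameter keeps the feature logic testable without the sentiment library installed.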
[0021] Preferably, the propagation heat characteristic specifically includes:
[0022] The number of likes on news articles and the total number of likes on news comments are used to calculate the ratio between the two, thus obtaining the news's popularity ratio.
[0023] Preferably, the propagation feature vector of each news item is specifically:
[0024] The numerical representations of each feedback feature are aggregated into a vector, and the propagation feature vector of each news item is represented as follows:
[0025] $N_a = \{n_1, n_2, n_3, n_4, n_5, n_6\}$
[0026] where $N_a$ is a specific news item, $n_1$ is the number of repost peaks of the news, $n_2$ is the media through which the news was disseminated, $n_3$ is the social bot score of the news, $n_4$ is the sentiment score of the news comments, $n_5$ is the degree of sentiment homogeneity of the news, and $n_6$ is the popularity ratio of the news.
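One possible way to hold the six components is sketched below. The class and field names are hypothetical. The direction of the popularity ratio (news likes divided by comment likes) and the fallback of 0.0 when there are no comment likes are assumptions, since the text only says that the ratio between the two is calculated.

```python
from dataclasses import astuple, dataclass
from typing import Tuple


def popularity_ratio(news_likes: int, comment_likes: int) -> float:
    """Ratio of likes on the news item to total likes on its comments.

    Assumption: returns 0.0 when there are no comment likes, to avoid
    division by zero; the patent does not specify this case.
    """
    if comment_likes == 0:
        return 0.0
    return news_likes / comment_likes


@dataclass(frozen=True)
class PropagationFeatures:
    n1_repost_peaks: float          # number of repost peaks
    n2_media_code: float            # numeric code of the disseminating media
    n3_bot_score: float             # social bot score in [0, 1]
    n4_sentiment_score: float       # overall sentiment score of comments
    n5_sentiment_homogeneity: float # degree of sentiment homogeneity
    n6_popularity_ratio: float      # popularity ratio

    def as_vector(self) -> Tuple[float, ...]:
        """Return the six components in order n1..n6 as floats."""
        return tuple(float(x) for x in astuple(self))
```

For instance, `PropagationFeatures(2, 1, 0.5, 0.6, 0.5, popularity_ratio(30, 10)).as_vector()` gives a six-element tuple whose last component is 3.0.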
[0027] Preferably, the distance from each sample news item to the two center points is specifically:
[0028]
[0029] where $L$ is the distance from the sample news item to the center point, $n_{a1}, n_{a2}, \dots, n_{a6}$ are the component values of the feature vector of the sample news item, and $n_{b1}, n_{b2}, \dots, n_{b6}$ are the component values of the feature vector of the center point.
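The formula itself is not reproduced in this text. Assuming the distance is the ordinary Euclidean distance usually paired with K-means, which is consistent with the symbols defined above but is not confirmed by the source, it would read:

```latex
L = \sqrt{\sum_{i=1}^{6} \left( n_{ai} - n_{bi} \right)^{2}}
```

Under that assumption, a sample with vector (2, 1, 0.5, 0.6, 0.5, 3) and a center at (2, 1, 0.5, 0.6, 0.5, 0) would be at distance $L = 3$, since only the sixth component differs.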
[0030] Preferably, the news authenticity score is as follows:
[0031] Calculate the distances between the feature vector of the news to be detected and the two center points, and calculate a score for whether it is fake news based on these two distances. The higher the score, the higher the authenticity of the news. Output the score:
[0032]
[0033] Where G is the output news authenticity score, L1 is the distance from the center point of fake news, and L2 is the distance from the center point of real news. The value of G ranges from 0 to 100. A value of 0 indicates that the news completely matches the characteristics of fake news and is judged as fake news, while a value of 100 indicates that the news is highly authentic and is judged as real news.
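The scoring formula is not reproduced in this text. One expression that matches the stated behaviour (0 when the news coincides with the fake-news center, 100 when it coincides with the real-news center) is $G = 100 \cdot L_1 / (L_1 + L_2)$. The sketch below uses that expression together with Euclidean distance; both are assumptions, as are the function name and the value returned when the two distances are both zero.

```python
import math
from typing import Sequence


def authenticity_score(
    vector: Sequence[float],
    fake_center: Sequence[float],
    real_center: Sequence[float],
) -> float:
    """Score in [0, 100]; higher means closer to the real-news center.

    Assumed formula (not confirmed by the source): G = 100 * L1 / (L1 + L2),
    where L1 is the distance to the fake-news center and L2 is the distance
    to the real-news center.
    """
    l1 = math.dist(vector, fake_center)
    l2 = math.dist(vector, real_center)
    if l1 + l2 == 0:
        return 50.0  # assumption: undecided when both centers coincide with the point
    return 100.0 * l1 / (l1 + l2)
```

With a fake-news center at (0, 0) and a real-news center at (4, 0), a point at (1, 0) scores 25.0 under this formula, since it is three times closer to the fake-news center.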
[0034] Accordingly, the present invention also provides a system for identifying fake news based on data clustering, comprising:
[0035] The data acquisition unit is used to construct a news dataset, design a media web crawler to collect news data from the media, store the collected data, classify and clean the data of each news item, and distinguish the authenticity of each news item in the dataset through manual annotation, ultimately forming a news dataset containing real news and fake news.
[0036] The feature extraction unit is used to extract the dissemination features of each news item in the news dataset. The extracted dissemination features include: dissemination time sequence features, dissemination user features, dissemination feedback features, and dissemination popularity features. Then, these four features are used to construct the dissemination feature vector of each news item.
[0037] The model construction unit is used to build a news detection model. It uses the K-means method to cluster the news dataset. This includes taking the spread feature vector of the news as a point, selecting a real news and a fake news in the dataset as the center points of the two clusters, calculating the distance of each sample news to the two center points for each sample news to be clustered, assigning each sample news to the category corresponding to the center point with the shortest distance based on the distance of each sample news to the two center points, recalculating the center point of each category, and taking the new center point as the new center point of the category. The above process is iterated repeatedly until the minimum error change is reached, and finally the news detection model is obtained.
[0038] The news detection unit is used to input news data to be detected, extract features to obtain the propagation feature vectors of the news data to be detected, import these propagation feature vectors into the news detection model, calculate the authenticity score of the news, and output the detection result of the news data to be detected.
[0039] Implementing this invention has the following beneficial effects:
[0040] This invention does not use content features such as text or image features. Instead, it constructs corresponding vectors and models based on the propagation characteristics of fake news. Compared to content features, propagation characteristics are easier to quantify, fewer in number, and more stable. Therefore, this invention is lower in cost, more efficient, and more stable in identification than other inventions. Furthermore, due to the inherent propagation purpose of news, its propagation characteristics are reflected in news across different modalities and languages. This invention, based on propagation characteristics, constructs a fake news identification model that can achieve cross-modal and cross-linguistic fake news detection. In addition, the data clustering method used in this invention employs the K-means algorithm. Compared to other algorithms, the K-means algorithm has superior clustering results, is easier to implement, maintain, and adjust, and has stronger interpretability.
Description of the Drawings
[0041] Figure 1 is an overall flowchart of a method for identifying fake news based on data clustering according to an embodiment of the present invention;
[0042] Figure 2 is a flowchart of the feature extraction module in an embodiment of the present invention;
[0043] Figure 3 is a flowchart of the news clustering algorithm according to an embodiment of the present invention;
[0044] Figure 4 is a flowchart of the news detection module according to an embodiment of the present invention;
[0045] Figure 5 is a structural diagram of a system for identifying fake news based on data clustering, according to an embodiment of the present invention.
Detailed Implementation
[0046] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0047] Figure 1 is a flowchart of the overall process of a method for identifying fake news based on data clustering according to this embodiment of the invention. As shown in Figure 1, the method includes:
[0048] S1. Construct a news dataset. The collected data is preprocessed and manually labeled. Each news item is categorized and cleaned, and its authenticity is determined through manual labeling, ultimately forming a news dataset containing both real and fake news.
[0049] S2, Feature Extraction Method. Input the news data processed in step S1, extract the propagation features of each news data point, and output the propagation feature vector for each news data point.
[0050] S3. The news dataset is clustered using the K-means method. After inputting the news feature vectors output in step S2, initial cluster centroids are constructed. For each sample news item, the distance to the centroid is calculated. The sample news items are classified according to the results, and the centroids are recalculated. The above process is iterated repeatedly until the minimum error change is reached, and finally, the news detection model is output.
[0051] S4, a news detection method, takes the raw data of the news to be detected as input and outputs the authenticity score of the news.
[0052] Step S1 is as follows:
[0053] S1-1: Obtaining the dataset mainly involves designing a media web crawler based on the Python language to collect news data from the media and storing the collected data in a MySQL database.
[0054] S1-2: The news data in the dataset includes the specific content of the news, the media in which the news was disseminated, the content of the comments and the number of likes on the comments, the number of reposts and views of the news, and the topics or tags to which the news belongs.
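The fields listed in S1-2 could be held in a record such as the one below. This is only an illustrative sketch: all class and field names are hypothetical, the `likes` field is added because the popularity feature in a later step needs it, and `is_fake` stands for the manual label from step S1.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Comment:
    text: str
    likes: int = 0


@dataclass
class NewsRecord:
    content: str                      # specific content of the news
    media: str                        # media in which the news was disseminated
    reposts: int = 0
    views: int = 0
    likes: int = 0                    # assumption: collected for the popularity ratio
    comments: List[Comment] = field(default_factory=list)
    topic_tags: List[str] = field(default_factory=list)
    is_fake: Optional[bool] = None    # manual annotation; None until labeled

    def total_comment_likes(self) -> int:
        """Sum of likes over all comments on this news item."""
        return sum(c.likes for c in self.comments)
```

Keeping the manual label on the record makes it straightforward to pick one real and one fake item as initial cluster centers later on.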
[0055] Step S2 is as follows:
[0056] S2-1, the workflow of the feature extraction method is shown in Figure 2. The news data processed in step S1 is input, the dissemination features of the news are extracted, and the dissemination feature vector of each news item is output.
[0057] S2-2, the characteristics of news dissemination include: dissemination time sequence characteristics, dissemination user characteristics, dissemination feedback characteristics, and dissemination popularity characteristics.
[0058] S2-2-1, the extracted dissemination time sequence features include: the number of news dissemination peaks, which are calculated by statistically analyzing the relationship between the dissemination time and the number of reposts of each news item.
[0059] S2-2-2, the extracted dissemination user features include: the news disseminator category and the news forwarding user category. News disseminators are identified and mapped to corresponding numerical sequences; for news forwarding users, social bot identification technology is used to determine whether they are social bots. A score of 0 to 1 is used as the news social bot score; a score of 0 indicates that all news disseminators are real accounts, while a score of 1 indicates that all news disseminators are social bots.
[0060] S2-2-3, the extracted news feedback features include: the sentiment tendency and the degree of sentiment uniformity of news comments. Comments and reposts under the news are extracted, segmented, and the SnowNLP module is used to perform sentiment analysis on the comments and reposts, obtaining the sentiment score of the comments. A higher score indicates a more positive sentiment, and vice versa. By calculating the overall sentiment score of the text comments, the sentiment tendency of the news comments is obtained, and the probability distribution of the sentiment types of the news comments is statistically analyzed as a representation of the degree of sentiment uniformity.
[0061] S2-2-4, the features of dissemination popularity extracted include: the number of likes on the news and the total number of likes on the news comments. By counting the number of likes under the news and the total number of likes on the news comments, the ratio between the two is calculated to obtain the popularity ratio of the news.
[0062] S2-3, aggregate the numerical representations of each feedback feature into a vector, with each news item represented as:
[0063] $N_a = \{n_1, n_2, n_3, n_4, n_5, n_6\}$
[0064] where $N_a$ is a specific news item, $n_1$ is the number of repost peaks of the news, $n_2$ is the media through which the news was disseminated, $n_3$ is the social bot score of the news, $n_4$ is the sentiment score of the news comments, $n_5$ is the degree of sentiment homogeneity of the news, and $n_6$ is the popularity ratio of the news.
[0065] Step S3 is as follows:
[0066] S3-1, the flowchart of the news clustering algorithm required for the news detection model is shown in Figure 3. The news feature vectors output from step S2 are the input, and the news detection model is the output.
[0067] S3-2 uses the news dissemination feature vector as a point, and selects a real news item and a fake news item from the dataset as the center points of two clusters.
[0068] S3-3, For each sample news item, calculate the distance from the news item to the two center points using the following formula:
[0069]
[0070] where $L$ is the distance from the sample news item to the center point, $n_{a1}, n_{a2}, \dots, n_{a6}$ are the component values of the feature vector of the sample news item, and $n_{b1}, n_{b2}, \dots, n_{b6}$ are the component values of the feature vector of the center point.
[0071] S3-4: Based on the distance of the sample news to the two center points, assign the sample news to the category corresponding to the center point with the shortest distance, and recalculate the center point of each category, and use the new center point as the new center point of the category.
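Steps S3-2 to S3-4 can be sketched as a two-cluster K-means loop seeded with one labeled fake item and one labeled real item. The sketch assumes Euclidean distance and interprets "until the minimum error change is reached" as stopping once no center moves by more than a small tolerance. The function names and the `tol` and `max_iter` parameters are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def mean_vector(points: Sequence[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def seeded_two_means(
    points: Sequence[Vector],
    fake_seed: Vector,
    real_seed: Vector,
    tol: float = 1e-6,
    max_iter: int = 100,
) -> Tuple[Vector, Vector]:
    """Return (fake_center, real_center) after iterating assign/recompute."""
    centers: List[Vector] = [tuple(fake_seed), tuple(real_seed)]  # 0 = fake, 1 = real
    for _ in range(max_iter):
        clusters: List[List[Vector]] = [[], []]
        # Assign every sample to the nearer of the two centers.
        for p in points:
            d_fake = math.dist(p, centers[0])
            d_real = math.dist(p, centers[1])
            clusters[0 if d_fake <= d_real else 1].append(p)
        # Recompute each center; keep the old one if its cluster is empty.
        new_centers = [
            mean_vector(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # assumption: convergence measured by center movement
            break
    return centers[0], centers[1]
```

Because the clusters start from labeled seeds, the first returned center can be read as the fake-news center and the second as the real-news center, which is what the detection step needs.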
[0072] Step S4 is as follows:
[0073] S4-1, the workflow of the news detection module is shown in Figure 4. The feature extraction module obtains the dissemination feature vector of the news to be detected, which is imported into the news detection model to output the news detection result.
[0074] S4-2, calculate the distances between the feature vector of the news to be detected and the two center points, and calculate a score for whether it is fake news based on these two distances. The higher the score, the higher the authenticity of the news. Output the score:
[0075]
[0076] Where G is the output news authenticity score, L1 is the distance from the center point of fake news, and L2 is the distance from the center point of real news. The value of G ranges from 0 to 100. A value of 0 indicates that the news completely matches the characteristics of fake news and is judged as fake news, while a value of 100 indicates that the news is highly authentic and is judged as real news.
[0077] Accordingly, the present invention also provides a system for identifying fake news based on data clustering which, as shown in Figure 5, includes:
[0078] Data Acquisition Unit 1 is used to construct a news dataset. A media web crawler designed using Python is used to collect news data from media outlets, and the collected data is stored in a MySQL database. The collected data undergoes preprocessing and manual annotation. Each news item is categorized and cleaned, and manual annotations are used to distinguish the authenticity of each news item in the dataset, ultimately forming a news dataset containing both real and fake news.
[0079] Feature extraction unit 2 is used to extract the dissemination features of each news item from the news dataset. The extracted dissemination features include: dissemination time sequence features, dissemination user features, dissemination feedback features, and dissemination popularity features. These four features are then used to construct the dissemination feature vector for each news item.
[0080] Model building unit 3 is used to construct the news detection model. The K-means method is used to cluster the news dataset. This includes treating the propagation feature vector of each news item as a point, selecting one real news item and one fake news item from the dataset as the center points of their respective clusters, calculating the distance from each sample news item to the two center points for each sample news item to be clustered, assigning each sample news item to the category corresponding to the center point with the shortest distance based on its distance to the two center points, recalculating the center point of each category, and using the new center point as the new center point of the category. This process is iterated repeatedly until the minimum error change is reached, ultimately obtaining the news detection model.
[0081] The news detection unit 4 is used to input the news data to be detected, extract the propagation feature vector of the news data to be detected through feature extraction, import these propagation feature vectors into the news detection model, calculate the authenticity score of the news, and output the detection result of the news data to be detected.
[0082] Therefore, this invention does not use content features such as text or image features. Instead, it constructs corresponding vectors and models based on the propagation characteristics of fake news. Compared to content features, propagation characteristics are easier to quantify, fewer in number, and more stable. Thus, this invention is lower in cost, more efficient, and more stable in identification than other inventions. Furthermore, due to the inherent propagation purpose of news, its propagation characteristics are reflected in news across different modalities and languages. This invention constructs a fake news identification model based on propagation characteristics, enabling cross-modal and cross-linguistic fake news detection. In addition, the data clustering method in this invention uses the K-means algorithm. Compared to other algorithms, the K-means algorithm has superior clustering results, is easier to implement, maintain, and adjust, and has stronger interpretability.
[0083] The foregoing has provided a detailed description of a method and system for identifying fake news based on data clustering, as provided in the embodiments of the present invention. Specific examples have been used to illustrate the principles and implementation methods of the present invention. The descriptions of the embodiments above are only for the purpose of helping to understand the method and core ideas of the present invention. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of the present invention. Therefore, the content of this specification should not be construed as a limitation of the present invention.
Claims
1. A method for identifying fake news based on data clustering, characterized in that, The method includes: To construct a news dataset, a media web crawler was designed to collect news data from the media and store the collected data. The data of each news item was classified and cleaned, and the authenticity of each news item in the dataset was distinguished by manual annotation. Finally, a news dataset containing real news and fake news was formed. In the news dataset, the dissemination features of each news item are extracted. The extracted dissemination features include: dissemination time sequence features, dissemination user features, dissemination feedback features, and dissemination popularity features. Then, these four features are used to construct the dissemination feature vector of each news item. A news detection model is constructed, and the K-means method is used to cluster the news dataset. The steps are as follows: First, the propagation feature vector of each news item is used as a point, and a real news item and a fake news item in the dataset are selected as the centroids of the two clusters. Second, for each sample news item to be clustered, the distance from it to the two centroids is calculated. Third, based on the distance from each sample news item to the two centroids, each sample news item is assigned to the category corresponding to the centroid with the shortest distance, and the centroid of each category is recalculated, and the new centroid is used as the new centroid of the category. The process of the second and third steps is iterated repeatedly until the minimum error change is reached, and finally the news detection model is obtained. Input the news data to be detected, extract the propagation feature vector of the news to be detected through feature extraction, import these propagation feature vectors into the news detection model, calculate the authenticity score of the news, and output the news detection result.
2. The method for identifying fake news based on data clustering as described in claim 1, characterized in that, The news data on the media are specifically as follows: The specific content of the news, the media in which the news was disseminated, the content of the news comments and the number of likes on the comments, the number of reposts and views of the news, and the topic tags to which the news belongs.
3. The method for identifying fake news based on data clustering as described in claim 1, characterized in that, The dissemination time sequence features, dissemination user features, dissemination feedback features, and dissemination popularity features are specifically as follows: The dissemination time sequence features include the number of news dissemination peaks. By statistically analyzing the relationship between the dissemination time and the number of reposts of each news item, the number of repost peaks of each news item is calculated. The dissemination user features include the categories of news disseminators and news forwarding users. News disseminators are identified and mapped to corresponding numerical sequences. For news forwarding users, social bot identification technology is used to determine whether they are social bots. A value in the range 0 to 1 is used as the news social bot score: 0 means that all users disseminating the news are real accounts, and 1 means that all users disseminating the news are social bots. The dissemination feedback features include the sentiment tendency and sentiment homogeneity of news comments. The comments and reposted content under the news are extracted and segmented, and sentiment analysis is performed on them with the SnowNLP module to obtain the sentiment score of the comments. The more positive the sentiment, the higher the score, and vice versa. By calculating the overall sentiment score of the text comments, the sentiment tendency of the news comments is obtained, and the probability distribution of the sentiment types of the news comments is statistically analyzed as a representation of the sentiment homogeneity. The dissemination popularity features include the number of likes on the news item and the total number of likes on the news comments. By counting the number of likes on the news item and the total number of likes on the news comments, the ratio between the two is calculated to obtain the popularity ratio of the news item.
4. The method for identifying fake news based on data clustering as described in claim 1, characterized in that, The propagation feature vector for each news item is specifically as follows: The numerical representations of each feedback feature are aggregated into a vector, and the propagation feature vector of each news item is represented as $N_a = \{n_1, n_2, n_3, n_4, n_5, n_6\}$; where $N_a$ is a specific news item, $n_1$ is the number of repost peaks of the news, $n_2$ is the media through which the news was disseminated, $n_3$ is the social bot score of the news, $n_4$ is the sentiment score of the news comments, $n_5$ is the degree of sentiment homogeneity of the news, and $n_6$ is the popularity ratio of the news.
5. The method for identifying fake news based on data clustering as described in claim 1, characterized in that the distance L from each sample news item to each of the two center points is calculated from the term values of the feature vector of the sample news item and the corresponding term values of the feature vector of the center point.
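The distance formula itself does not survive in this text. The sketch below assumes the Euclidean distance, which is the usual choice for K-means; the function name and the length check are illustrative additions.

```python
import math
from typing import Sequence


def distance_to_center(sample: Sequence[float], center: Sequence[float]) -> float:
    """Euclidean distance between a sample's feature vector and a center point.

    Assumption: the claimed distance L is Euclidean, computed term by term.
    """
    if len(sample) != len(center):
        raise ValueError("sample and center must have the same number of terms")
    return math.sqrt(sum((x - c) ** 2 for x, c in zip(sample, center)))
```

With this definition, a sample at `[0, 0]` and a center at `[3, 4]` are 5.0 apart.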
6. The method for identifying fake news based on data clustering as described in claim 1, characterized in that the authenticity score of the news is obtained as follows: the distances between the feature vector of the news to be detected and the two center points are calculated, and a score indicating whether the news is fake is computed from these two distances and output; the higher the score, the higher the authenticity of the news. Here G is the output news authenticity score, computed from the distance to the center point of the fake news and the distance to the center point of the real news. The value of G ranges from 0 to 100. A value of 0 indicates that the news completely matches the characteristics of fake news and is judged as fake news, while a value of 100 indicates that the news is highly authentic and is judged as real news.
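The scoring formula is likewise missing from this text. One formula that satisfies the stated endpoints (0 when the news coincides with the fake-news center point, 100 when it coincides with the real-news center point) is the normalized distance ratio sketched below. It is an assumption for illustration, not necessarily the formula in the patent; the function name and the tie-breaking value of 50 are also assumptions.

```python
def authenticity_score(dist_to_fake: float, dist_to_real: float) -> float:
    """Map the two center-point distances to a score G in [0, 100].

    Assumed form: G = 100 * d_fake / (d_fake + d_real).
    A larger distance from the fake-news center yields a higher score.
    """
    if dist_to_fake < 0 or dist_to_real < 0:
        raise ValueError("distances must be non-negative")
    total = dist_to_fake + dist_to_real
    if total == 0:
        # Both centers coincide with the sample; no evidence either way.
        return 50.0
    return 100.0 * dist_to_fake / total
```

Under this assumed form, a news item three times closer to the fake-news center than to the real-news center scores 25.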
7. A system for identifying fake news based on data clustering, characterized in that the system includes:
A data acquisition unit, used to construct a news dataset: a media web crawler is designed to collect news data from the media; the collected data is stored; the data of each news item is classified and cleaned; and the authenticity of each news item in the dataset is labeled manually, ultimately forming a news dataset containing real news and fake news.
A feature extraction unit, used to extract the propagation features of each news item in the news dataset. The extracted features include propagation timing features, propagation user features, propagation feedback features, and propagation popularity features, and these four kinds of features are used to construct the propagation feature vector of each news item.
A model construction unit, used to build a news detection model by clustering the news dataset with the K-means method, including: first, treating each news propagation feature vector as a point and selecting one real news item and one fake news item from the dataset as the centroids (center points) of two clusters; second, calculating, for each sample news item to be clustered, its distance to the two centroids; third, assigning each sample news item to the category of the nearest centroid and recalculating the centroid of each category, which then serves as that category's new centroid; and iterating the second and third steps until the change in error is minimized, ultimately obtaining the news detection model.
A news detection unit, used to receive news data to be detected, extract features to obtain the propagation feature vector of that news data, import the propagation feature vector into the news detection model, calculate the authenticity score of the news, and output the detection result for the news data to be detected.
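The three clustering steps of the model construction unit can be sketched as a two-cluster K-means seeded with one real and one fake news item. This is a minimal illustration under stated assumptions: Euclidean distance, the sum of squared distances as the error, and the `tol` and `max_iter` parameters are not specified in the claim, and all names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def _dist(a: Vector, b: Vector) -> float:
    """Euclidean distance (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _mean(points: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def two_means(
    samples: List[Vector],
    real_seed: Vector,
    fake_seed: Vector,
    tol: float = 1e-6,
    max_iter: int = 100,
) -> Tuple[List[float], List[float], List[int]]:
    """Cluster samples into two groups; label 0 = real cluster, 1 = fake cluster."""
    # Step 1: one real and one fake news item serve as the initial centroids.
    centers = [list(real_seed), list(fake_seed)]
    labels: List[int] = []
    prev_error = math.inf
    for _ in range(max_iter):
        # Steps 2 and 3a: compute both distances and assign the nearer centroid.
        labels = [
            0 if _dist(s, centers[0]) <= _dist(s, centers[1]) else 1
            for s in samples
        ]
        # Step 3b: recompute each centroid; keep the old one if a cluster is empty.
        for k in (0, 1):
            members = [s for s, lab in zip(samples, labels) if lab == k]
            if members:
                centers[k] = _mean(members)
        # Stop once the clustering error no longer changes appreciably.
        error = sum(_dist(s, centers[lab]) ** 2 for s, lab in zip(samples, labels))
        if abs(prev_error - error) < tol:
            break
        prev_error = error
    return centers[0], centers[1], labels
```

The two returned centroids are what the news detection unit would then compare a new feature vector against.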
8. The system for identifying fake news based on data clustering as described in claim 7, characterized in that the propagation timing features, propagation user features, propagation feedback features, and propagation popularity features are specifically as follows:
The propagation timing features include the number of dissemination peaks of a news item. The relationship between dissemination time and repost count is analyzed statistically for each news item, and the number of peaks in its repost count is calculated.
The propagation user features include the categories of news disseminators and of the users who forward the news. News disseminators are identified and mapped to corresponding numerical sequences. For forwarding users, social robot identification technology is used to determine whether each is a social robot, and a value in the range 0 to 1 is used as the social robot score of the news item: 0 means that all users disseminating the news are real accounts, and 1 means that all of them are social robots.
The propagation feedback features include the sentiment tendency and sentiment homogeneity of news comments. The comments and forwarded content under the news item are extracted and segmented into words, and sentiment analysis is performed on them with the SnowNLP module to obtain a sentiment score for each comment; the more positive the sentiment, the higher the score, and vice versa. The overall sentiment score of the text comments gives the sentiment tendency of the news comments, and the probability distribution of the sentiment types of the comments is computed as the representation of sentiment homogeneity.
The propagation popularity features include the number of likes on the news item and the total number of likes on its comments. Both are counted, and the ratio between the two is calculated to obtain the popularity ratio of the news item.
9. The system for identifying fake news based on data clustering as described in claim 7, characterized in that the propagation feature vector of each news item is specifically as follows: the numerical representations of the features are aggregated into a vector, so that each specific news item is represented by a propagation feature vector whose components are: the number of repost peaks of the news item; the media that disseminated the news; the social robot score of the news; the sentiment score of the news comments; the degree of sentiment homogeneity of the news; and the popularity ratio of the news.
10. The system for identifying fake news based on data clustering as described in claim 7, characterized in that the distance L from each sample news item to each of the two center points is calculated from the term values of the feature vector of the sample news item and the corresponding term values of the feature vector of the center point.
11. The system for identifying fake news based on data clustering as described in claim 7, characterized in that the authenticity score of the news is obtained as follows: the distances between the feature vector of the news to be detected and the two center points are calculated, and a score indicating whether the news is fake is computed from these two distances and output; the higher the score, the higher the authenticity of the news. Here G is the output news authenticity score, computed from the distance to the center point of the fake news and the distance to the center point of the real news. The value of G ranges from 0 to 100. A value of 0 indicates that the news completely matches the characteristics of fake news and is judged as fake news, while a value of 100 indicates that the news is highly authentic and is judged as real news.