
Word distribution and document feature based automatic classification method for spam comments

A technology for classifying spam comments based on word distribution and document features, applied in special data processing applications, instruments, electronic digital data processing, etc. It addresses problems such as reliance on a single feature, poor scalability, and the lack of joint consideration of word distribution features and document features.

Active Publication Date: 2015-12-23
NANJING UNIV

AI Technical Summary

Problems solved by technology

At present, research on the automatic identification of spam comments online still has limitations: 1) poor scalability: most classification methods target specific application scenarios and are difficult to extend; 2) narrow features: existing classification methods only measure the similarity of comments and do not jointly consider word distribution features and document features; 3) heavy dependence on the data set: a large number of labeled comments is required. As a result, these methods cannot meet the need for automatic classification of spam comments on the Internet.



Embodiment Construction

[0062] Figure 1 shows the overall framework of the automatic spam comment classification method based on word distribution and document features. The input is a small number of labeled online comments (comments manually labeled as normal or spam, forming the labeled set) and a large number of unlabeled comments to be classified (forming the target set). The output is a class for each online comment: normal comments are marked 0 and spam comments are marked 1. The method of the present invention comprises four main steps: 1) collecting network comments and segmenting them into words to obtain a keyword set; 2) establishing a word distribution matrix, training a language model, and calculating the probability that each unlabeled network comment is a normal comment or a spam comment; 3) extracting the document features of network comments, training the Bayes classifier based on pro...
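Steps 1 and 2 can be illustrated with a minimal sketch. The source does not give the form of the word distribution matrix or the language model, so everything here is an assumption made for illustration: `tokenize` is a whitespace split standing in for real word segmentation, the "matrix" is a per-class word count table, and the language model is a unigram model with add-one smoothing and equal class priors. The function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(comment):
    # Assumption: a lowercase whitespace split stands in for word segmentation.
    return comment.lower().split()


def build_word_distribution(labeled):
    """Count word occurrences per class.

    labeled: list of (comment, label) pairs, label 0 = normal, 1 = spam.
    Returns (vocab, counts) where counts[label][word] is an occurrence count.
    """
    counts = {0: Counter(), 1: Counter()}
    for comment, label in labeled:
        counts[label].update(tokenize(comment))
    vocab = set(counts[0]) | set(counts[1])
    return vocab, counts


def spam_probability(comment, vocab, counts):
    """Estimate P(spam | comment) with a smoothed unigram model.

    Assumptions: add-one smoothing, equal class priors, and words
    outside the vocabulary are ignored.
    """
    log_like = {}
    for label in (0, 1):
        total = sum(counts[label].values()) + len(vocab)
        log_like[label] = sum(
            math.log((counts[label][word] + 1) / total)
            for word in tokenize(comment)
            if word in vocab
        )
    # Normalize in log space to avoid underflow on long comments.
    top = max(log_like.values())
    normal = math.exp(log_like[0] - top)
    spam = math.exp(log_like[1] - top)
    return spam / (normal + spam)
```

A comment made up only of unseen words gets a probability of 0.5 under these assumptions, since neither class model has evidence about it.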


Abstract

The invention discloses an automatic classification method for spam comments based on word distribution and document features. The method comprises: first, collecting network comments and performing word segmentation on the comments to obtain a keyword set; second, establishing a word distribution matrix, training a language model, and calculating the probability that each unlabeled network comment is a normal comment or a spam comment; third, extracting document features of the network comments and calculating the classification probability of the unlabeled network comments; and finally, calculating a weighted average of the classification probabilities, and repeating the steps until the classification probabilities calculated in two successive iterations are the same or a given number of iterations is reached. The method jointly considers word distribution features and document features in the network comments, completes network comment classification automatically through a self-learning policy, and assists in identifying spam comments among network comments. The method is computationally simple, general, and extensible; it can classify a large number of comments in real time using only a small number of labeled network comments, and it meets the application demand of quickly identifying spam comments in continuously updated network comments.
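The iteration described in the abstract can be sketched as a loop over two probability estimators. This is a rough outline under stated assumptions, not the patented procedure: the abstract does not say how the self-learning policy feeds results back into training, so the sketch assumes unlabeled comments are pseudo-labeled at a 0.5 threshold and added to the training data on each pass. The function name, the `weight`, `max_iter`, and `tol` parameters, and the `fit_word_model` / `fit_doc_model` callables are all hypothetical.

```python
def self_learning_classify(labeled, unlabeled, fit_word_model, fit_doc_model,
                           weight=0.5, max_iter=10, tol=1e-6):
    """Iteratively estimate P(spam) for each unlabeled comment.

    labeled: list of (comment, label) pairs, label 0 = normal, 1 = spam.
    unlabeled: list of comments to classify.
    fit_word_model, fit_doc_model: callables that take a list of
        (comment, label) pairs and return a function comment -> P(spam).
    weight: share given to the word-distribution estimate (assumption).
    """
    if max_iter < 1:
        raise ValueError("max_iter must be at least 1")

    training = list(labeled)
    previous = None
    for _ in range(max_iter):
        word_model = fit_word_model(training)
        doc_model = fit_doc_model(training)
        # Weighted average of the two classification probabilities.
        probs = [
            weight * word_model(c) + (1 - weight) * doc_model(c)
            for c in unlabeled
        ]
        # Stop when two successive passes give the same probabilities.
        if previous is not None and all(
            abs(p - q) <= tol for p, q in zip(probs, previous)
        ):
            break
        previous = probs
        # Assumption: pseudo-label the target set and retrain on the union.
        training = list(labeled) + [
            (c, 1 if p >= 0.5 else 0) for c, p in zip(unlabeled, probs)
        ]
    return probs
```

Passing the estimators in as callables keeps the loop independent of how the word-distribution and document-feature models are actually built, which the abstract leaves open.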

Description

Technical field

[0001] The invention relates to the field of computer applications, in particular to an automatic classification method that assists in identifying spam comments among massive volumes of Internet comments.

Technical background

[0002] The rapid development of Internet technology has given rise to a variety of new forms of online communication. Internet users can post comments conveniently and quickly. With its freedom, immediacy, and convenience, online communication is gradually changing the way people communicate.

[0003] The development of network technology has two sides. The freedom of users to post comments and the powerful dissemination ability of the Internet are often exploited by some users to post commercial advertisements or malicious information in online comments. In recent years, the spread of spam comments on the Internet has intensified, and various commercial advertisements based on spam comments and...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27, G06F17/30
Inventor: 王建翔, 顾庆, 喻黎霞, 陈道蓄
Owner: NANJING UNIV