Method and device for spam filtering based on short text

A short-text spam filtering technology, applied in the field of short-text-based spam filtering devices, can solve the problems of interference with text classification, wrong classification results, and emails not being read in time, and achieves the effect of strengthening word segmentation results and reducing possibility

Active Publication Date: 2013-12-11
LUNKR TECH GUANGZHOU CO LTD

AI Technical Summary

Problems solved by technology

However, for short messages with very little body content, using Zipf's law to extract the most important feature words and discarding the unimportant ones leaves very little information. Classifying text directly on so little information may give wrong results; in severe cases, normal emails may be classified as spam, so that users never read them or do not read them in time.
In addition, a large amount of current spam is wrapped in HTML and interferes with text classification by adding large amounts of invisible text or text in different font sizes. Therefore, the patent discussed here is not suitable for filtering spam with very short body content.

Examples

Embodiment 1

[0022] Figure 1 is a flow chart of the first embodiment of the short-text-based spam filtering method of the present invention, which includes:

[0023] S100. Perform word segmentation processing on the text in the email and obtain a word segmentation result.

[0024] When word segmentation is performed on the text in the email, it is necessary to separate the HTML tags, Chinese characters and English characters, and then perform word segmentation on the Chinese characters and English characters respectively to obtain word segmentation results.
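The separation of Chinese and English characters described in [0024] could be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes HTML tags have already been removed, that "Chinese characters" means the CJK Unified Ideographs range U+4E00 to U+9FFF, and that "English characters" means ASCII letters. The name `split_scripts` is hypothetical.

```python
import re

# Assumption: Chinese characters are CJK Unified Ideographs (U+4E00-U+9FFF)
# and English characters are ASCII letters; the source does not specify ranges.
CHINESE_RUN = re.compile(r"[\u4e00-\u9fff]+")
ENGLISH_RUN = re.compile(r"[A-Za-z]+")


def split_scripts(text):
    """Return (chinese_text, english_text) extracted from mixed text."""
    # Chinese runs are concatenated; a Chinese segmenter would split them later.
    chinese = "".join(CHINESE_RUN.findall(text))
    # English runs are joined with spaces so word boundaries survive.
    english = " ".join(ENGLISH_RUN.findall(text))
    return chinese, english


if __name__ == "__main__":
    print(split_scripts("免费free优惠offer now"))
```

Each of the two returned strings can then be passed to its own word segmentation step.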

[0025] S101. Use TF-IDF technology to sort the word segmentation results to obtain a word segmentation list.

[0026] After extracting the word segmentation results (Chinese word segmentation and English word segmentation) from the email, use the TF-IDF algorithm to sort the word segmentation results from high to low according to the discrimination ability, and obtain the word segmentation list after sorting.
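A minimal sketch of the TF-IDF ordering in [0026] is given below. The source does not state the exact TF-IDF formula or which reference collection is used, so the smoothed IDF, the `corpus` argument, and the name `rank_by_tfidf` are all assumptions made for illustration.

```python
import math
from collections import Counter


def rank_by_tfidf(doc_tokens, corpus):
    """Sort the distinct tokens of one email by TF-IDF, highest score first.

    doc_tokens: list of tokens from the email being ranked.
    corpus: list of token lists (a hypothetical reference set of emails).
    """
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for token, count in tf.items():
        # Document frequency: number of corpus emails containing the token.
        df = sum(1 for doc in corpus if token in doc)
        # Smoothed IDF (an assumption; the source gives no formula).
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[token] = (count / len(doc_tokens)) * idf
    # Ties are broken alphabetically so the order is deterministic.
    return sorted(scores, key=lambda t: (-scores[t], t))


if __name__ == "__main__":
    corpus = [["free", "offer", "hello"], ["hello", "meeting"], ["hello", "free"]]
    print(rank_by_tfidf(["free", "offer", "hello"], corpus))
```

Tokens that appear in few emails of the reference set rank first, which matches the idea of ordering by discrimination ability.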

[0027] It should be noted...

Embodiment 2

[0036] Figure 2 is a flow chart of the second embodiment of the short-text-based spam filtering method of the present invention, which includes:

[0037] S200. Preprocess the text and extract the Chinese text and/or the English text.

[0038] In operation, the email is first fetched and its text is preprocessed. For Hypertext Markup Language (HTML) documents, the HTML tags are extracted and processed separately; in the remaining content, Chinese characters and English characters are separated, producing one text containing only English characters and one text containing only Chinese characters.
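The HTML part of the preprocessing in [0038] might look like the sketch below, which uses Python's standard `html.parser` module to collect tag names separately from the visible text. How the patent actually processes the extracted tags is not stated here, so this only shows the separation itself; `TagTextSplitter` and `split_html` are hypothetical names.

```python
from html.parser import HTMLParser


class TagTextSplitter(HTMLParser):
    """Collect start-tag names and text content in separate lists."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names are kept so they can be processed separately later.
        self.tags.append(tag)

    def handle_data(self, data):
        # Keep only non-empty text between tags.
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)


def split_html(html):
    """Return (list_of_tag_names, remaining_text) for an HTML string."""
    parser = TagTextSplitter()
    parser.feed(html)
    parser.close()
    return parser.tags, " ".join(parser.text_parts)


if __name__ == "__main__":
    print(split_html("<p>Hello <b>世界</b></p>"))
```

The remaining text can then be split into its Chinese-only and English-only parts.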

[0039] S201. Perform word segmentation processing on the Chinese text and the English text respectively, and obtain word segmentation results.

[0040] For English text, the traditional word segmentation method is used to obtain word segmentation results (words are delimited by punctuation marks and spaces).
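The English segmentation in [0040] can be illustrated with a short sketch that splits on ASCII punctuation and whitespace. Lowercasing the tokens is an added assumption not stated in the source, and `tokenize_english` is a hypothetical name.

```python
import re
import string

# Delimiters: any run of ASCII punctuation marks or whitespace.
_DELIMS = re.compile("[" + re.escape(string.punctuation) + r"\s]+")


def tokenize_english(text):
    """Split English text into words on punctuation and whitespace."""
    # Lowercasing is an assumption made here to merge case variants.
    return [token for token in _DELIMS.split(text.lower()) if token]


if __name__ == "__main__":
    print(tokenize_english("Win a FREE prize, now!"))
```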

[0041] For Chinese text, the words are separated from the ...

Abstract

The invention discloses a short-text-based spam filtering method comprising the following steps: performing word segmentation on the text of each email to obtain word segmentation results; sorting the word segmentation results using TF-IDF to obtain a word segmentation list; calculating an email fingerprint for each email from the word segmentation results; clustering the emails according to their fingerprints to obtain a clustering result; and filtering spam according to the clustering result. The invention further discloses a short-text-based spam filtering device. With this method and device, word segmentation and TF-IDF sorting are applied to the email text, which filters out noise; depending on the length of each email's text, its fingerprint is calculated with one or more BKDR hash functions, which effectively strengthens the role of the word segmentation results; and, after normalization, the emails are clustered by comparing the similarity of their fingerprints, thereby achieving spam filtering.
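The abstract names BKDR hashing for the email fingerprint but this excerpt does not say how tokens are combined or how text length selects the number of hash functions. The sketch below therefore shows only the widely used BKDR string hash (multiply by a seed such as 31 or 131, add the next byte) and a purely hypothetical `fingerprint` helper that hashes the joined top-ranked tokens once per seed. The seeds, the `top_k` parameter, and the joining scheme are assumptions, not the patent's method.

```python
def bkdr_hash(text, seed=131):
    """Classic BKDR hash over the UTF-8 bytes of text, as a 31-bit integer."""
    value = 0
    for byte in text.encode("utf-8"):
        # Emulate 32-bit unsigned overflow as in the usual C version.
        value = (value * seed + byte) & 0xFFFFFFFF
    return value & 0x7FFFFFFF


def fingerprint(ranked_tokens, seeds=(31, 131), top_k=5):
    """Hypothetical fingerprint: one BKDR hash per seed over the top tokens."""
    # Assumption: only the top_k highest-ranked tokens contribute.
    key = "|".join(ranked_tokens[:top_k])
    return tuple(bkdr_hash(key, seed) for seed in seeds)


if __name__ == "__main__":
    print(bkdr_hash("ab"))
    print(fingerprint(["offer", "free", "hello"]))
```

Under these assumptions, two emails whose top-ranked tokens match produce identical fingerprints, which is what makes fingerprint comparison usable as a clustering key.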

Description

Technical field

[0001] The invention relates to the technical field of the Internet, in particular to a short-text-based spam filtering method and a short-text-based spam filtering device.

Background

[0002] With the wide adoption of the Internet, e-mail has become an efficient mass communication medium, favored for being fast, simple, and cheap. At the same time, large volumes of useless email pour into people's mailboxes and seriously disrupt their study and daily life. Spam is what users dislike most: it wastes their time, money, and network bandwidth and clutters their mailboxes. Some messages are even harmful, for example those containing pornographic content or viruses. According to relevant research reports, more than 10% of the world's email each day is spam. An effective method for intercepting and filtering spam is therefore needed.

[0003] At present, there are many meth...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): H04L12/58, G06F17/27
CPC: G06Q10/10, G06Q10/107, H04L51/212
Inventor: 林延中, 潘庆峰
Owner: LUNKR TECH GUANGZHOU CO LTD