Method for filtering Chinese junk mail based on Logistic regression

A logistic regression-based spam filtering technology, applied in electrical components, transmission systems, office automation, etc. It addresses Chinese spam filtering, which earlier methods do not cover, improving operating efficiency and classification quality, reducing the size of the feature space, and avoiding the limitations of earlier approaches.

Inactive Publication Date: 2008-07-23
ZHEJIANG UNIV
Cites: 7. Cited by: 47.

AI Technical Summary

Problems solved by technology

The spam filtering technology adopted in the above patents does no...


Examples


Embodiment Construction

[0026] The main principle of the present invention is as follows:

[0027] 1) The email preprocessing stage comprises email parsing and word segmentation. JavaMail is used to extract the title, the text content of the body, and information about the attachments, pictures, audio, video and other items contained in the email. Non-Chinese text is segmented at natural delimiters such as punctuation and spaces, and Chinese text is segmented using the maximum matching method.
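The segmentation step can be sketched as follows. This is a minimal illustration, assuming forward maximum matching (the text only says "maximum matching", so the direction is an assumption) and a tiny hypothetical dictionary; the names `DICTIONARY`, `forward_max_match` and `segment` are illustrative and do not come from the patent.

```python
import re

# Hypothetical toy lexicon; a real system would load a large Chinese dictionary.
DICTIONARY = {"免费", "发票", "代开", "公司", "通知", "研究", "研究生", "生命"}
MAX_WORD_LEN = max(len(word) for word in DICTIONARY)


def forward_max_match(text, dictionary=DICTIONARY, max_len=MAX_WORD_LEN):
    """Greedy forward maximum matching over a run of Chinese characters."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary
        # matches; a single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def segment(text):
    """Split mixed text: Chinese runs by maximum matching, the rest by delimiters."""
    tokens = []
    # Runs of CJK characters or ASCII alphanumerics; punctuation and spaces
    # act as natural delimiters and are dropped.
    for run in re.findall(r"[\u4e00-\u9fff]+|[A-Za-z0-9]+", text):
        if "\u4e00" <= run[0] <= "\u9fff":
            tokens.extend(forward_max_match(run))
        else:
            tokens.append(run.lower())
    return tokens


if __name__ == "__main__":
    print(segment("免费代开发票, call 13800 now!"))
```

Because the matcher is greedy, it prefers the longest dictionary entry at each position, so "研究生命" is split as "研究生" + "命" rather than "研究" + "生命". This is a known weakness of maximum matching, and a larger lexicon does not remove it.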

[0028] 2) At the feature level, all the words in the email sample set form a feature space, and each email is mapped to a vector in that space. An improved feature value calculation method is adopted, introducing a weight factor to reflect the text features of the email. Word frequency is used as the feature selection criterion to perform dimensionality reduction, reducing the size of the feature space.
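A rough sketch of the feature stage is given below. It assumes a plain smoothed TF-IDF weighting and a simple total-frequency threshold for feature selection. The patent's improved feature value calculation and its weight factor are not specified in this text, so they are deliberately left out; the function names and the `min_count` parameter are hypothetical.

```python
import math
from collections import Counter


def select_features(tokenized_emails, min_count=2):
    """Keep terms whose total frequency in the sample set is at least min_count."""
    totals = Counter(tok for email in tokenized_emails for tok in email)
    kept = sorted(term for term, count in totals.items() if count >= min_count)
    return {term: index for index, term in enumerate(kept)}


def idf_table(tokenized_emails, vocab):
    """Smoothed inverse document frequency for every retained term."""
    n_docs = len(tokenized_emails)
    doc_freq = Counter()
    for email in tokenized_emails:
        doc_freq.update(term for term in set(email) if term in vocab)
    return {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}


def tfidf_vector(tokens, vocab, idf):
    """Map one email to a sparse, L2-normalised {feature index: weight} vector."""
    counts = Counter(t for t in tokens if t in vocab)
    vec = {vocab[t]: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {i: v / norm for i, v in vec.items()} if norm else {}


if __name__ == "__main__":
    emails = [["免费", "发票", "免费"], ["会议", "通知"], ["免费", "会议"]]
    vocab = select_features(emails)
    idf = idf_table(emails, vocab)
    for email in emails:
        print(tfidf_vector(email, vocab, idf))
```

Dropping rare terms shrinks the feature space before any weights are computed, which is the dimensionality reduction the paragraph describes. The threshold value is a tuning choice, not something the source states.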

[0029] 3) At the model level, Logistic is used for trai...



Abstract

The invention discloses a Chinese junk email filtering method based on Logistic regression. The method comprises the following steps: first, parsing emails and extracting the email title, the email body, and attachment-related information; second, segmenting the extracted text into words; third, counting the frequency of each term in an email, calculating word weights using TF-IDF, and representing the email as a weighted feature vector; fourth, using the LIBLINEAR toolkit to train on the email samples to obtain a Logistic regression model; fifth, using the Logistic regression model to classify new emails, obtaining the probability that each email is junk. The Logistic regression model has the advantages of a simple model, few parameters, and high classification accuracy on data sets in which both the number of texts and the number of features are large. Dimensionality reduction and the improved feature value calculation method improve the accuracy and efficiency of junk email filtering, and the problem of choosing model training parameters in junk email filtering is effectively solved.
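The training and classification steps can be illustrated with a small self-contained sketch. The patent trains with the LIBLINEAR toolkit; the code below instead uses plain stochastic gradient descent on the logistic loss, purely so that the example runs without external dependencies. The hyperparameters (`epochs`, `lr`, `l2`) and the toy vectors are assumptions, not values from the source.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, vec):
    """Probability that a sparse feature vector {index: value} is junk."""
    return sigmoid(bias + sum(weights[i] * v for i, v in vec.items()))


def train(samples, labels, dim, epochs=200, lr=0.5, l2=0.01):
    """Fit logistic regression by SGD; labels are 1 (junk) or 0 (legitimate)."""
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        for vec, y in zip(samples, labels):
            err = predict_proba(weights, bias, vec) - y
            # Gradient of the log loss plus an L2 penalty, applied only to
            # the features present in this sample.
            for i, v in vec.items():
                weights[i] -= lr * (err * v + l2 * weights[i])
            bias -= lr * err
    return weights, bias


if __name__ == "__main__":
    # Toy data: feature 0 appears in junk, feature 1 in legitimate email.
    samples = [{0: 1.0}, {1: 1.0}, {0: 0.7, 2: 0.7}, {1: 0.7, 2: 0.7}]
    labels = [1, 0, 1, 0]
    weights, bias = train(samples, labels, dim=3)
    print(predict_proba(weights, bias, {0: 1.0}))
```

The model outputs a probability rather than a hard label, so a deployment would still need to pick a decision threshold; 0.5 is a common default, but the source does not state which value is used.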

Description

Technical field

[0001] The invention relates to a junk mail filtering method, in particular to a Logistic regression-based Chinese junk mail filtering method.

Background

[0002] With the proliferation of spam, various spam filtering technologies have emerged in response. Currently, content-based intelligent mail filtering methods have become the mainstream technology, among which support vector machine (SVM), dynamic Markov modeling (DMM), Winnow and other machine learning methods have been successfully applied to the field of mail classification. The basic idea of these methods is to treat spam filtering as a binary classification problem, to learn a classifier from sample emails, and to use the classifier to predict the class of unknown emails.

[0003] Generally, machine learning techniques can be divided into two types: discriminative models (such as Logistic regression and SVM) and generative models (such as Bayes). Practice has proved ...


Application Information

IPC(8): H04L12/58, H04L29/06, G06F17/30, G06Q10/00, G06Q10/10
Inventor: 徐从富, 王庆幸, 彭鹏
Owner: ZHEJIANG UNIV