Bayesian Spam Filtering Method Based on Improved Feature Evaluation Function

A spam filtering and feature evaluation function technology, applied in special data processing applications, electrical digital data processing, instruments, etc. It addresses three problems of existing methods: the number of occurrences of a term is not considered, feature items at different positions contribute differently to category definition, and negative correlation is poorly expressed. The result is efficient and accurate filtering.

Active Publication Date: 2017-05-24
LIAONING UNIVERSITY
4 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

[0002] The most common feature selection method in Bayesian spam filtering is the "mutual information" method, which effectively expresses the degree of dependence between words in text classification. In the feature selection stage of spam filtering, however, the following problems become prominent and degrade the performance of the entire filtering method:

1. Positive and negative correlation: the correlation between a feature item and a text category is either positive or negative. The intended effect is that positive correlation expresses the category strongly and negative correlation expresses it weakly. In the formula, however, positive and negative terms offset each other, so negative correlation works against the score, which is contrary to the original intention.

2. Ignoring word frequency and favoring low-frequency words: the mutual information feature selection method assumes that each category contains roughly the same amount of text. In addition, it considers only whether a term occurs, not how many times it occurs in a document. Feature words that occur more often (that is, with higher word frequency) are usually regarded as more related to the category and more representative of it, so the method works to the disadvantage of feature items that appear frequently in an email.

3. Position-dependent contribution: feature items extracted from the email title and from the email body contribute very differently to classification. In actual spam filtering, users can often judge whether an email is normal mail or spam from its title alone.

At present there is no improved method that addresses these problems.
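
The first two problems can be reproduced with a small sketch. It assumes the baseline is the commonly used class-averaged form MI(t) = sum over j of P(c_j) * log(P(t|c_j) / P(t)), estimated from document presence only; the patent's own formula is not shown in this text, so the function names and the toy data below are illustrative assumptions.

```python
import math

def mi_contributions(docs, labels, term):
    """Per-class terms P(c) * log(P(t|c) / P(t)) of the baseline MI score.

    docs   : list of token lists, one per mail
    labels : class label of each mail
    Only presence or absence of `term` in a mail is counted.
    """
    n = len(docs)
    p_t = sum(term in d for d in docs) / n
    contributions = {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        p_c = len(class_docs) / n
        p_t_given_c = sum(term in d for d in class_docs) / len(class_docs)
        # A class in which the term never occurs is given 0 to avoid log(0).
        if p_t > 0 and p_t_given_c > 0:
            contributions[c] = p_c * math.log(p_t_given_c / p_t)
        else:
            contributions[c] = 0.0
    return contributions

def mutual_information(docs, labels, term):
    """Sum of the per-class contributions."""
    return sum(mi_contributions(docs, labels, term).values())

# Toy data (assumed for illustration only).
docs = [
    ["free", "free", "free", "offer"],  # spam, "free" repeated three times
    ["offer", "free"],                  # spam
    ["free", "meeting"],                # ham
    ["meeting", "agenda"],              # ham
]
labels = ["spam", "spam", "ham", "ham"]
parts = mi_contributions(docs, labels, "free")
```

In this toy data the spam contribution for "free" is positive and the ham contribution is negative, so the two partly cancel in the sum. Repeating "free" three times in one mail leaves the score unchanged, because only presence is counted.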


Examples


Embodiment Construction

[0031] The Bayesian spam filtering method based on the improved feature evaluation function is characterized in that the steps are as follows:

[0032] 1) Preprocess the training mail set: divide each mail into two sub-text sets S_1 and S_2, the mail header and the body, and perform word segmentation on each to form two feature item sets T_1 and T_2; 2) in the two feature sets T_1 and T_2, use the stop word list to delete prepositions, pronouns, adverbs, auxiliary words, and conjunctions, and also delete words whose word frequency is lower than a given threshold p; the processed feature item sets are recorded as T_1' and T_2';

[0033] 3) In the feature item sets T_1' and T_2', use the improved feature evaluation function to calculate the mutual information value MI(t_k)':

[0034] 3a) Let the feature vector set be T = {t_k, k = 1, 2, ..., n}, and obtain the training set category set C = {c_j, j = 1, 2, ..., r} from the network file text database;

[0035] 3b) Use the formula (1) to cal...
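
Steps 1) and 2) above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes English text split with a regular expression in place of a real word segmenter, a tiny hypothetical stop word list, and corpus-wide counts as the word frequency compared against the threshold p.

```python
import re
from collections import Counter

# Hypothetical stop word list; a real one would be much larger.
STOP_WORDS = {"the", "a", "of", "to", "and", "in", "is", "it", "very"}

def tokenize(text):
    """Stand-in for word segmentation: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_feature_sets(mails, p=2, stop_words=STOP_WORDS):
    """Return (T1', T2') for a list of (header, body) string pairs.

    Header and body are segmented separately (T1, T2); stop words and
    words whose total frequency is below p are then deleted.
    """
    header_counts = Counter()
    body_counts = Counter()
    for header, body in mails:
        header_counts.update(tokenize(header))
        body_counts.update(tokenize(body))

    def prune(counts):
        return {t for t, c in counts.items() if c >= p and t not in stop_words}

    return prune(header_counts), prune(body_counts)

# Toy training mails (assumed for illustration only).
mails = [
    ("Free offer", "Claim the free prize now"),
    ("Free prize", "the prize is free"),
    ("Meeting agenda", "the agenda of the meeting"),
]
t1_pruned, t2_pruned = build_feature_sets(mails, p=2)
```

Keeping the header and body sets separate is what later allows feature items from the two positions to be weighted differently.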


Abstract

Disclosed is a Bayesian spam filtering method based on an improved feature evaluation function. The method includes the steps of: 1) preprocessing a training mail set by dividing each mail into a header part and a body part; 2) deleting prepositions, pronouns, adverbs, auxiliary words, conjunctions, and words with a word frequency lower than the given threshold p from the two feature sets T_1 and T_2; 3) calculating the mutual information value MI(t_k)' in each of the feature sets T_1 and T_2 using the improved feature evaluation function; 4) in the training set, sorting the MI(t_k)' values in descending order and selecting the feature items corresponding to the first n values as the representation of the training set; 5) performing spam filtering on the samples to be tested with a Bayes classifier in the classification phase. With this method, mails can be classified with high accuracy and spam can be filtered out.
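
Steps 4) and 5) can be sketched as below. The abstract only says "a Bayes classifier", so the multinomial naive Bayes model with add-one smoothing used here is an assumption, and the scores passed to `select_top_n` are made-up stand-ins for the MI(t_k)' values, which this text does not define.

```python
import math
from collections import Counter

def select_top_n(scores, n):
    """Sort terms by score in descending order and keep the first n."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:n]]

class NaiveBayes:
    """Multinomial naive Bayes restricted to the selected feature items."""

    def __init__(self, features):
        self.features = set(features)

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.priors = {c: labels.count(c) / len(labels) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for tokens, y in zip(docs, labels):
            self.counts[y].update(t for t in tokens if t in self.features)
        return self

    def predict(self, tokens):
        vocab_size = len(self.features)
        best_class, best_logp = None, -math.inf
        for c in self.classes:
            total = sum(self.counts[c].values())
            logp = math.log(self.priors[c])
            for t in tokens:
                if t in self.features:
                    # Add-one (Laplace) smoothing avoids zero probabilities.
                    logp += math.log((self.counts[c][t] + 1) / (total + vocab_size))
            if logp > best_logp:
                best_class, best_logp = c, logp
        return best_class

# Hypothetical scores and toy training data, for illustration only.
scores = {"free": 0.9, "meeting": 0.8, "prize": 0.7, "the": 0.1}
features = select_top_n(scores, 3)
train_docs = [["free", "prize", "free"], ["free", "offer"],
              ["meeting", "agenda"], ["meeting", "prize"]]
train_labels = ["spam", "spam", "ham", "ham"]
model = NaiveBayes(features).fit(train_docs, train_labels)
```

Calling `model.predict(tokens)` on a segmented test mail returns the class with the highest posterior log-probability; tokens outside the selected feature items are ignored.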

Description

Technical field

[0001] The invention relates to a Bayesian spam filtering method based on an improved feature evaluation function.

Background

[0002] The most common feature selection method in Bayesian spam filtering is the "mutual information" method, which effectively expresses the degree of dependence between words in text classification. In the feature selection stage of spam filtering, however, the following problems become prominent and degrade the performance of the entire filtering method: 1. Positive and negative correlation: the correlation between a feature item and a text category is either positive or negative. The intended effect is that positive correlation expresses the category strongly and negative correlation expresses it weakly, but in the formula the positive and negative terms offset each other, that is, the negative correlation has the o...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/27, G06F17/30
Inventors: 王青松, 魏如玉, 温翠娟, 张黎
Owner: LIAONING UNIVERSITY