Improved feature evaluation function based Bayesian spam filtering method

A Bayesian spam filtering technique built on an improved feature evaluation function, applicable to special data processing applications, electrical digital data processing, instruments, and similar fields. It targets the weak expressive power of negatively correlated features, the resulting shortfall in filtering performance, and the differing contributions that feature items make to category definition.

Active Publication Date: 2015-06-24

LIAONING UNIVERSITY

4 Cites, 8 Cited by


Problems solved by technology

[0002] The most common feature selection method in Bayesian spam filtering is the "mutual information" method. It effectively expresses the degree of dependence between words in text classification. However, when it is used in the feature selection stage of spam filtering, the following problems become prominent and degrade the performance of the entire filtering method:

1. Positive and negative correlation. The correlation between a feature item and a text category is either positive or negative. Both cases indicate that the feature item helps define the category: positive correlation expresses the category strongly, and negative correlation expresses it weakly. In the formula, however, positive and negative values offset each other, so negative correlation works against performance, which is contrary to the original intention.

2. Ignoring word frequency and favoring low-frequency words. The mutual information feature selection method assumes that the amount of text in each category is roughly equal. It also considers only whether a term appears in a document, not how many times it appears. Feature words that occur more often (that is, with higher word frequency) are usually regarded as more closely related to the category and more representative of it, so ignoring frequency distorts the scores of feature items that appear frequently in an email.

3. Position-dependent contribution. Feature items in different positions contribute differently to category definition: items extracted from the email title and items extracted from the body can differ greatly in their ability to contribute to classification. In actual spam filtering, users can often judge whether an email is legitimate or spam from its title alone.

However, no existing method currently addresses these problems.
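The first two problems can be illustrated with a minimal sketch. It assumes the textbook document-count form of mutual information, MI(t, c) = log(A·N / ((A + C)(A + B))), combined across categories as a prior-weighted sum; the source does not reproduce its own formula, so this form, the function names, and the sample counts are all assumptions made for illustration.

```python
import math


def mutual_information(a, b, c, n):
    """Textbook MI between a term t and a category (an assumed form).

    a: documents in the category that contain t
    b: documents outside the category that contain t
    c: documents in the category that do not contain t
    n: total number of documents
    """
    if a == 0:
        return float("-inf")  # t never co-occurs with the category
    return math.log((a * n) / ((a + c) * (a + b)))


def averaged_mi(per_category):
    """Combine per-category scores as sum_j P(c_j) * MI(t, c_j)."""
    return sum(prior * mi for prior, mi in per_category)


# Hypothetical corpus: 1000 mails, 500 spam and 500 legitimate.
rare = mutual_information(a=2, b=0, c=498, n=1000)          # in 2 spam mails only
frequent = mutual_information(a=400, b=100, c=100, n=1000)  # in 400 spam, 100 legitimate

# A term found in 50 spam mails and 450 legitimate mails, scored per class.
vs_spam = mutual_information(a=50, b=450, c=450, n=1000)
vs_ham = mutual_information(a=450, b=50, c=50, n=1000)
combined = averaged_mi([(0.5, vs_spam), (0.5, vs_ham)])

print(rare > frequent, vs_spam < 0, combined < vs_ham)
```

With these made-up counts, the term seen in only two mails outscores the term seen in 500, which is the low-frequency bias of problem 2. The negative score against the spam class pulls the averaged score below the score against the legitimate class, which is the offsetting effect of problem 1.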

Examples

Embodiment Construction

[0031] The Bayesian spam filtering method based on an improved feature evaluation function is characterized by the following steps:

[0032] 1) Preprocess the training mail set: divide each mail into two sub-text sets S_1 and S_2 (the mail head part and the text part), and perform word segmentation on each separately to form two feature item sets T_1 and T_2;

2) In each of the two feature item sets T_1 and T_2, use the stop vocabulary table to delete prepositions, pronouns, adverbs, auxiliary words, and conjunctions, and delete words whose frequency is lower than a given threshold p; the processed feature item sets are denoted T_1' and T_2';

[0033] 3) In each of the feature item sets T_1' and T_2', use the improved feature evaluation function to calculate the mutual information value MI(t_k)':

[0034] 3a) Set the feature vector set T = {t_k, k = 1, 2, ..., n}, and obtain the training set category set C = {c_j, j = 1, 2, ..., r} from the network file text library;

[0035] 3b) Use formula (1) to calculate the correction coefficient λ:

[0036] ...
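Steps 1) and 2) can be sketched roughly as follows, using Python's standard `email` package. This is an illustration under several assumptions, not the patent's implementation: the Subject header stands in for the mail head part, messages are single-part plain text, a regular expression stands in for word segmentation, and `STOP_WORDS` is a hypothetical stand-in for the stop vocabulary table.

```python
import re
from collections import Counter
from email import message_from_string

# Hypothetical stop-word list; the patent's stop vocabulary table is not given.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "or", "you", "it", "is"}


def tokenize(text):
    """Very simple word segmentation for English text (an assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def split_mail(raw):
    """Split one raw mail into (head text, body text).

    Assumes a single-part plain-text message; multipart mails are treated
    as having an empty body in this sketch.
    """
    msg = message_from_string(raw)
    head = msg.get("Subject", "")
    payload = msg.get_payload()
    body = payload if isinstance(payload, str) else ""
    return head, body


def build_feature_sets(raw_mails, p=2):
    """Return (T_1', T_2') as term -> frequency maps for head and body."""
    head_counts, body_counts = Counter(), Counter()
    for raw in raw_mails:
        head, body = split_mail(raw)
        head_counts.update(tokenize(head))
        body_counts.update(tokenize(body))

    def prune(counts):
        # Drop stop words and words whose frequency is below the threshold p.
        return {t: f for t, f in counts.items()
                if t not in STOP_WORDS and f >= p}

    return prune(head_counts), prune(body_counts)
```

Keeping the head and body counts separate means later steps can score, and potentially weight, the two positions independently.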


Abstract

Disclosed is a Bayesian spam filtering method based on an improved feature evaluation function. The method includes the steps of: 1) preprocessing the training mail set by dividing each mail into a mail head part and a text part; 2) deleting prepositions, pronouns, adverbs, auxiliary words, conjunctions, and words with a word frequency lower than the given threshold p from each of the two feature sets T_1 and T_2; 3) calculating the mutual information value MI(t_k)' in each of the feature sets T_1 and T_2 with the improved feature evaluation function; 4) in the training set, sorting the MI(t_k)' values in descending order and selecting the feature items corresponding to the first n values to represent the training set; 5) in the classification phase, performing spam filtering on the samples to be tested with a Bayes classifier. With this method, mails can be classified with high accuracy and spam can be filtered out.
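Steps 4) and 5) of the abstract can be sketched as below. The scores passed to `select_top_n` are taken as given, because the improved evaluation function is not reproduced in the source. The classifier is a standard multinomial naive Bayes with Laplace smoothing, chosen here as an assumption about what "Bayes classifier" means; all names are hypothetical.

```python
import math
from collections import Counter


def select_top_n(scores, n):
    """Return the n terms with the largest scores, in descending order."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:n]]


class NaiveBayes:
    """Multinomial naive Bayes restricted to a selected feature set."""

    def __init__(self, features):
        self.features = set(features)
        self.term_counts = {}        # label -> Counter of selected terms
        self.doc_counts = Counter()  # label -> number of training mails

    def fit(self, docs, labels):
        for tokens, label in zip(docs, labels):
            self.doc_counts[label] += 1
            counts = self.term_counts.setdefault(label, Counter())
            counts.update(t for t in tokens if t in self.features)
        return self

    def predict(self, tokens):
        total_docs = sum(self.doc_counts.values())
        vocab = len(self.features)
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.doc_counts.items():
            counts = self.term_counts[label]
            total = sum(counts.values())
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                if t in self.features:
                    # Laplace-smoothed log likelihood of the term
                    score += math.log((counts[t] + 1) / (total + vocab))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical scores and training mails, for illustration only.
features = select_top_n({"cash": 0.9, "meeting": 0.8, "hello": 0.1}, n=2)
clf = NaiveBayes(features).fit(
    [["cash", "prize", "cash"], ["meeting", "agenda"],
     ["cash", "offer"], ["meeting", "notes", "hello"]],
    ["spam", "ham", "spam", "ham"],
)
print(features, clf.predict(["cash", "now"]), clf.predict(["meeting"]))
```

Tokens outside the selected feature set are ignored at both training and prediction time, so the size n directly controls how much evidence the classifier sees.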

Description

Technical field

[0001] The invention relates to a Bayesian spam filtering method based on an improved feature evaluation function.

Background technique

[0002] The most common feature selection method in Bayesian spam filtering is the "mutual information" method. It effectively expresses the degree of dependence between words in text classification, but when it is used in the feature selection stage of spam filtering, the following problems become prominent and degrade the performance of the entire filtering method: 1. Positive and negative correlation: the correlation between a feature item and a text category is either positive or negative. Both cases indicate that the feature item helps define the category: positive correlation expresses the category strongly, and negative correlation expresses it weakly, but the meaning expressed by the formula...


Application Information


Patent Type & Authority: Applications (China)

IPC(8): G06F17/27, G06F17/30

Inventors: 王青松, 魏如玉, 温翠娟, 张黎

Owner: LIAONING UNIVERSITY
