
Feature extraction method and spam filter based on byte-level n-gram

A spam filtering technology based on n-grams, applied in electrical components, digital transmission systems, data processing applications, etc. It addresses the inability of existing methods to handle feature extraction and identification of multilingual text at the same time, with the effect of improving robustness, simplifying feature extraction, and improving efficiency.

Inactive Publication Date: 2016-08-03
HEILONGJIANG INST OF TECH +1

AI Technical Summary

Problems solved by technology

[0010] To solve the problem that existing text feature extraction methods require thesaurus support and cannot simultaneously adapt to the feature extraction and identification of multilingual text (such as English and Chinese), graphics, and other forms of information, the present invention proposes a feature extraction method and a spam filter based on byte-level n-grams.

Method used



Examples


Specific Embodiment 1

[0023] Specific Embodiment 1: The feature extraction method based on byte-level n-grams described in this embodiment is as follows: slide a window of size n bytes over the information to be processed, and take the resulting sequence of m segments, each n bytes long, as the feature information.

[0024] The feature selection method in this embodiment chooses a sliding window n bytes long and uses it to select m consecutive information fragments (grams), each n bytes long, from the information as features. The (i+1)-th byte segment starts at the second byte of the i-th byte segment, where i is an integer greater than 0 and i < m.
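The sliding-window step in [0024] can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name `byte_ngrams`, the window size of 4, and the sample message are all assumptions chosen for the example.

```python
def byte_ngrams(data: bytes, n: int) -> list[bytes]:
    """Slide an n-byte window over `data`, advancing one byte per step.

    Each returned gram starts at the second byte of the previous one,
    so consecutive grams overlap by n - 1 bytes.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    # If the input is shorter than n bytes, the range is empty.
    return [data[i:i + n] for i in range(len(data) - n + 1)]


# Hypothetical mixed Chinese/English message, handled as raw bytes.
message = "垃圾mail".encode("utf-8")
grams = byte_ngrams(message, 4)  # n = 4 is an assumed window size
```

Because the window operates on bytes rather than words, no thesaurus or word segmentation is involved, and a gram may straddle the boundary between a multi-byte Chinese character and ASCII text.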

[0025] The feature information extraction method in this embodiment can extract the first m information fragments (n-grams) with a length of n bytes as feature information, and can also extract the last m information fragments (n-grams) with a length of n bytes. Information fragments (n-grams) are used a...
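As a rough sketch of the choice described in [0025], the function below keeps either the first m or the last m byte-level n-grams of a message. The name `select_ngrams` and the `from_end` flag are hypothetical, introduced only for this illustration; the source text is truncated, so any further selection options it may describe are not covered here.

```python
def select_ngrams(data: bytes, n: int, m: int, from_end: bool = False) -> list[bytes]:
    """Return at most m byte-level n-grams from the start or the end of `data`."""
    if n <= 0 or m < 0:
        raise ValueError("n must be positive and m non-negative")
    total = max(len(data) - n + 1, 0)   # number of n-grams available
    count = min(m, total)               # short inputs yield fewer than m grams
    start = total - count if from_end else 0
    return [data[i:i + n] for i in range(start, start + count)]


first_three = select_ngrams(b"abcdef", 2, 3)                 # from the head
last_three = select_ngrams(b"abcdef", 2, 3, from_end=True)   # from the tail
```

Capping the number of grams at m bounds the per-message cost regardless of message length.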

Specific Embodiment 2

[0028] Specific Embodiment 2: This embodiment describes a spam filter built on the byte-level n-gram feature extraction method of Specific Embodiment 1. It consists of a classifier, a feature weight library, and a trainer:

[0029] The classifier performs feature extraction on received mail to obtain feature information, and classifies the received mail as spam or normal mail according to that feature information and the feature information in the feature weight library. The feature extraction method used is the byte-level n-gram feature extraction method;

[0030] The feature weight library stores the features and weights of spam and updates the feature information in real time according to the information provided by the trainer; the user is a spam filter user who can feed back spam information, including the actual users of the spam filter, that is, the service objects...
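The three components named above (classifier, feature weight library, trainer) can be sketched together as follows. The abstract states that the classifier uses a logistic regression model and that the trainer learns online with TONE (Train On or Near Error); everything else here is an assumption made for illustration: the class name `ByteNgramSpamFilter`, the values of `n`, `max_grams`, `learning_rate` and `margin`, the 0.5 decision threshold, and the use of binary (present/absent) features.

```python
import math


class ByteNgramSpamFilter:
    """Toy logistic-regression spam filter over byte-level n-grams."""

    def __init__(self, n=4, max_grams=3000, learning_rate=0.02, margin=0.25):
        self.n = n                        # window size in bytes (assumed)
        self.max_grams = max_grams        # m: grams kept per message (assumed)
        self.learning_rate = learning_rate
        self.margin = margin              # "near error" band around 0.5 (assumed)
        self.weights = {}                 # feature weight library: gram -> weight

    def extract(self, raw: bytes) -> set:
        """Classifier step 1: first m byte-level n-grams, as binary features."""
        total = max(len(raw) - self.n + 1, 0)
        return {raw[i:i + self.n] for i in range(min(total, self.max_grams))}

    def score(self, raw: bytes) -> float:
        """Classifier step 2: logistic function of the summed feature weights."""
        z = sum(self.weights.get(g, 0.0) for g in self.extract(raw))
        z = max(-30.0, min(30.0, z))      # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    def is_spam(self, raw: bytes) -> bool:
        return self.score(raw) > 0.5

    def train(self, raw: bytes, spam: bool) -> bool:
        """Trainer: TONE update. Returns True if the weights were changed."""
        p = self.score(raw)
        wrong = (p > 0.5) != spam
        near = abs(p - 0.5) < self.margin
        if not (wrong or near):
            return False                  # confidently correct: skip the update
        step = self.learning_rate * ((1.0 if spam else 0.0) - p)
        for g in self.extract(raw):       # gradient step on each active feature
            self.weights[g] = self.weights.get(g, 0.0) + step
        return True


# Usage with made-up messages and user feedback labels.
spam_filter = ByteNgramSpamFilter()
for _ in range(20):
    spam_filter.train(b"win free money now", spam=True)
    spam_filter.train(b"meeting notes attached", spam=False)
```

Under TONE, a message triggers an update only when it is misclassified or scored within the margin of the decision boundary, so the weight library stops growing for messages the filter already handles confidently.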

Specific Embodiment 3

[0081] Specific Embodiment 3: This embodiment presents the method and conclusions of testing the spam filter described in Specific Embodiment 2 on all existing public Chinese spam test sets (TREC06c, SEWM07 and SEWM08).

[0082] The performance of the filter is verified on all existing public Chinese spam test sets (TREC06c, SEWM07 and SEWM08). Table 1 shows the test data. The test set whose name begins with TREC is provided by TREC (Text REtrieval Conference); the TREC evaluation is sponsored by the US Defense Advanced Research Projects Agency (DARPA) and the National Institute of Standards and Technology (NIST). The test sets whose names begin with SEWM (Search Engine and Web Mining) are provided by South China University of Technology; the SEWM spam filtering evaluation is sponsored by the China Computer Federation.

[0083] Table 1 Spam filtering test set

[0084]

[0085] The test set used in the ...


Abstract

A feature extraction method and a spam filter based on byte-level n-grams, relating to the technical field of information processing, including spam filtering technology. The invention solves the problem that existing text feature extraction methods require thesaurus support and cannot simultaneously adapt to the feature extraction and identification of English text, Chinese characters, graphics, and other forms of information. The feature information extracted by the feature extraction method of the present invention is a sequence of m information fragments, each n bytes long. The classifier in the spam filter of the present invention uses this method to extract the feature information of a mail as the basis for judgment, and adopts a discriminative learning model, the logistic regression model, which theoretically ensures good filtering performance. The trainer in the spam filter of the present invention adopts an online learning method and uses the TONE (Train On or Near Error) method to adjust the feature weights. The spam filter of the present invention is especially suitable for filtering Chinese spam.

Description

Technical field

[0001] The invention relates to the field of information processing including spam filtering technology, and specifically relates to the fields of information filtering, information pushing and pattern recognition.

Background technique

[0002] When the processing object is an information unit containing multiple types of information (such as web pages and emails), the user's specific information needs have two manifestations: information filtering and information pushing. They have the same essence: the user's information needs remain unchanged, and it is necessary to identify the attributes of the information from the incoming information, that is, whether the user needs the information. Since the processing object is an information unit containing multiple types of information, language is an important carrier of information, and information filtering and information push mainly rely on text information. However, a large amount of valuable information is ...

Claims


Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): H04L12/58, G06Q10/10
Inventor: 齐浩亮, 何晓宁, 杨沐昀, 韩咏, 李生, 雷国华, 李军, 安波
Owner: HEILONGJIANG INST OF TECH