Feature extraction method and spam filter based on byte-level n-gram
Spam filtering and n-gram technology, applied in electrical components, digital transmission systems, data processing applications, etc. It addresses the inability of existing methods to handle multilingual text extraction and identification at the same time, with the effect of improving robustness, simplifying feature extraction, and improving efficiency.
Examples
Specific Embodiment 1
[0023] Specific Embodiment 1: The feature extraction method based on byte-level n-grams described in this embodiment performs a sliding-window operation with a window size of n on the information of the object to be extracted, and obtains a sequence of m segments, each of length n, as the feature information.
[0024] The feature selection method in this embodiment selects a sliding window n bytes long and then uses it to select m consecutive information fragments (grams), each n bytes long, from the information as features. The (i+1)-th byte fragment starts at the second byte of the i-th byte fragment, where i is an integer greater than 0 and i < m.
[0025] The feature extraction method in this embodiment can extract the first m information fragments (n-grams), each n bytes long, as feature information, and can also extract the last m information fragments (n-grams), each n bytes long. Information fragments (n-grams) are used a...
Specific Embodiment 2
[0028] Specific Embodiment 2: This embodiment describes a spam filter built on the byte-level n-gram feature extraction method of Specific Embodiment 1. It consists of a classifier, a feature weight library, and a trainer:
[0029] The classifier performs feature extraction on each received mail to obtain feature information, and classifies the mail as junk mail or normal mail according to that feature information and the feature information in the feature weight library. Feature extraction uses the byte-level n-gram feature extraction method;
[0030] The feature weight library stores the features and weights of spam and updates the feature information in real time according to the information provided by the trainer; the user is a spam filter user who can feed back spam information, including the actual users of the spam filter, that is, the service objects...
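The division of labor among classifier, feature weight library, and trainer can be sketched as below. The text shown here does not give the scoring rule, the update rule, or the values of n and m, so everything concrete in this sketch is an assumption: an additive weight score compared against a threshold, a perceptron-style update on misclassified feedback, and illustrative defaults `n=4`, `m=3000`. All class and method names are hypothetical.

```python
from collections import defaultdict


class FeatureWeightLibrary:
    """Stores one weight per n-gram feature; unseen features weigh 0."""

    def __init__(self) -> None:
        self.weights: defaultdict[bytes, float] = defaultdict(float)

    def score(self, features: set[bytes]) -> float:
        return sum(self.weights.get(f, 0.0) for f in features)

    def update(self, features: set[bytes], delta: float) -> None:
        for f in features:
            self.weights[f] += delta


class Classifier:
    """Extracts byte-level n-grams and labels mail using the library."""

    def __init__(self, library: FeatureWeightLibrary, n: int = 4,
                 m: int = 3000, threshold: float = 0.0) -> None:
        self.library, self.n, self.m = library, n, m
        self.threshold = threshold  # assumed decision rule

    def extract(self, mail: bytes) -> set[bytes]:
        # First m overlapping n-byte fragments, deduplicated.
        count = min(self.m, max(len(mail) - self.n + 1, 0))
        return {mail[i:i + self.n] for i in range(count)}

    def is_spam(self, mail: bytes) -> bool:
        return self.library.score(self.extract(mail)) > self.threshold


class Trainer:
    """Turns user feedback into weight updates (assumed perceptron rule)."""

    def __init__(self, classifier: Classifier, rate: float = 1.0) -> None:
        self.classifier, self.rate = classifier, rate

    def feedback(self, mail: bytes, is_spam: bool) -> None:
        # Update only when the current prediction disagrees with the user.
        if self.classifier.is_spam(mail) != is_spam:
            delta = self.rate if is_spam else -self.rate
            features = self.classifier.extract(mail)
            self.classifier.library.update(features, delta)


if __name__ == "__main__":
    clf = Classifier(FeatureWeightLibrary())
    trainer = Trainer(clf)
    trainer.feedback(b"buy cheap pills now", is_spam=True)
    print(clf.is_spam(b"buy cheap pills now"))       # True
    print(clf.is_spam(b"meeting at noon tomorrow"))  # False
```

Keeping the weights in a separate library object mirrors the description: the classifier only reads from it, while the trainer is the sole writer, so feedback can update the weights in real time without changing the classification path.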
Specific Embodiment 3
[0081] Specific Embodiment 3: This embodiment gives the method and conclusions of testing the spam filter described in Specific Embodiment 2 on all existing public Chinese spam test sets (TREC06c, SEWM07 and SEWM08).
[0082] The performance of the filter is verified on all existing public Chinese spam test sets (TREC06c, SEWM07 and SEWM08). Table 1 shows the test data. The test set whose name starts with TREC is provided by TREC (Text REtrieval Conference); the TREC evaluation is sponsored by the US Defense Advanced Research Projects Agency (DARPA) and the National Institute of Standards and Technology (NIST). The test sets whose names start with SEWM (Search Engine and Web Mining) are provided by South China University of Technology, and the SEWM spam filtering evaluation is sponsored by the China Computer Federation.
[0083] Table 1 Spam filtering test set
[0084]
[0085] The test set used in the ...