Junk image filtering method based on semi-supervision

A junk image filtering technology, applied in the fields of instruments, character and pattern recognition, and computer components. It addresses the slow response of existing methods to text and font changes while keeping the amount of calculation modest, so as to improve accuracy and efficiency and save program running time and space.

Inactive Publication Date: 2012-09-12
NANJING UNIV OF POSTS & TELECOMM

AI Technical Summary

Problems solved by technology

Detecting image spam with edge features can reach an accuracy of 80%. The advantage of this type of classification algorithm is that edge features capture the regular shapes of dense text at a modest computational cost; the disadvantage is that it responds slowly to changes in the text and fonts of the template.
Obtaining labeled samples takes considerable manpower and material resources, whereas unlabeled samples are comparatively easy to obtain.
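
To illustrate why edge features are cheap to compute, the following is a minimal sketch of one possible edge feature: the fraction of pixels with a strong intensity gradient. The function name `edge_density`, the list-of-rows grayscale input, and the threshold of 50 are assumptions made for this illustration; the source does not say which edge features the cited methods use.

```python
def edge_density(gray, threshold=50):
    """Fraction of interior pixels whose gradient magnitude exceeds threshold.

    gray is a list of rows of 0-255 intensities (a hypothetical input format).
    """
    height, width = len(gray), len(gray[0])
    edges = 0
    total = 0
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gray[y][x + 1] - gray[y][x - 1]  # horizontal central difference
            gy = gray[y + 1][x] - gray[y - 1][x]  # vertical central difference
            if (gx * gx + gy * gy) ** 0.5 > threshold:
                edges += 1
            total += 1
    return edges / total if total else 0.0
```

Each pixel needs only two subtractions and one comparison, which is consistent with the low computational cost noted above. Images dense with rendered text would be expected to score higher than smooth photographs.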



Examples


Embodiment Construction

[0038] Step 1) Initial sample selection:

[0039] Form the sample set from three sources: image spam downloaded from image spam databases shared on the Internet, image spam collected from private mailboxes, and images collected from normal mail.

[0040] Step 2) Text feature extraction:

[0041] Step 2.1) Use optical character recognition (OCR) to batch-process the images in the file and obtain the text features of each image.

[0042] Step 2.2) Save the text extraction results of step 2.1): store the text of each image in its own .txt file, and place the files in the junk image folder or the normal image folder according to the class of the image.
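
Steps 2.1) and 2.2) can be sketched as a small batch routine. This is only an illustration under stated assumptions: the function name `extract_texts`, the list of image suffixes, and the `ocr` parameter are hypothetical. The `ocr` argument is any callable that takes an image path and returns the recognized text, so a real OCR engine would be wrapped in such a callable; the source does not name the engine used.

```python
from pathlib import Path

# Assumed set of image file extensions; adjust to the actual sample set.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".bmp"}

def extract_texts(image_dir, out_dir, ocr):
    """Run `ocr` on every image in image_dir and save one .txt file per image.

    `ocr` is a stand-in for the OCR engine: a callable that takes an image
    path and returns a string.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for image_path in sorted(Path(image_dir).iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip files that are not images
        text = ocr(image_path)
        txt_path = out / (image_path.stem + ".txt")
        txt_path.write_text(text, encoding="utf-8")
        written.append(txt_path)
    return written
```

Calling the routine once for the junk image folder and once for the normal image folder, with separate output folders, keeps the two classes apart as step 2.2) describes.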

[0043] Step 2.3) Use the Waikato Environment for Knowledge Analysis (WEKA) to normalize the results of step 2.2) into an .arff file. Each line of the file corresponds to one image: the first column holds the text of the image and the second column holds its label. This file serves as the text feature vector of the images.
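
The two-column layout of step 2.3) can be sketched as follows. This is a hedged illustration that writes the ARFF text by hand rather than through WEKA itself; the function names, the relation name `image_text`, and the label values `spam` and `normal` are assumptions, not taken from the source.

```python
def escape_arff(text):
    """Quote a string for ARFF: escape backslashes and quotes, flatten whitespace."""
    cleaned = text.replace("\\", "\\\\").replace("'", "\\'")
    cleaned = " ".join(cleaned.split())  # OCR output often contains newlines
    return "'" + cleaned + "'"

def to_arff(samples, relation="image_text"):
    """Build ARFF text from (text, label) pairs.

    The first attribute is the OCR text of an image, the second its label.
    """
    labels = sorted({label for _, label in samples})
    lines = [
        "@relation " + relation,
        "",
        "@attribute text string",
        "@attribute class {" + ",".join(labels) + "}",
        "",
        "@data",
    ]
    for text, label in samples:
        lines.append(escape_arff(text) + "," + label)
    return "\n".join(lines) + "\n"
```

For example, `to_arff([("Buy now", "spam"), ("Bob's photo", "normal")])` produces a header declaring one string attribute and one nominal class attribute, followed by one data line per image.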

[0044] Step 3) Use the R-value feature selection method to rank the f...


Abstract

To detect and identify image spam, the semi-supervised junk image filtering method extracts text and image features and processes them. The resulting classification model is used for detection and classification; newly labeled samples are added continuously and the classifier is retrained, which improves classification accuracy and greatly reduces misjudgments. Tests on a large amount of experimental data show that the method yields an efficient junk mail webpage filtering system that maintains a high accuracy rate while greatly improving processing efficiency and shortening webpage detection time.
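
The loop described in the abstract (classify, add newly labeled samples, retrain) resembles self-training, a common semi-supervised scheme. The sketch below is an illustration under stated assumptions, not the patented procedure: it uses a small linear SVM trained by subgradient descent on the hinge loss as a stand-in for the support vector machine, and the function names, the confidence threshold of 1.0, and the number of rounds are all hypothetical.

```python
def train_linear_svm(X, y, epochs=200, lr=0.1, reg=0.01):
    """Fit w, b by subgradient descent on the L2-regularized hinge loss.

    X is a list of feature lists, y a list of labels in {-1, +1}.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # Regularization shrinks w on every step.
            w = [wj - lr * reg * wj for wj in w]
            if margin < 1:
                # Hinge loss is active: move toward classifying xi correctly.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def decision(w, b, x):
    """Signed score of x; positive means the +1 class (here, spam)."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def self_train(X_lab, y_lab, X_unlab, confidence=1.0, rounds=5):
    """Repeatedly pseudo-label confident unlabeled samples and retrain."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    w, b = train_linear_svm(X_lab, y_lab)
    for _ in range(rounds):
        confident = [x for x in pool if abs(decision(w, b, x)) >= confidence]
        if not confident:
            break  # nothing left that the model is sure about
        for x in confident:
            X_lab.append(x)
            y_lab.append(1 if decision(w, b, x) > 0 else -1)
        pool = [x for x in pool if abs(decision(w, b, x)) < confidence]
        w, b = train_linear_svm(X_lab, y_lab)
    return w, b, len(X_lab)
```

Only samples scored outside the margin are pseudo-labeled, so ambiguous images stay in the unlabeled pool instead of adding noisy labels to the training set. In practice the threshold would need tuning against a held-out labeled set.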

Description

Technical field

[0001] The present invention is a semi-supervised learning method that uses labeled image samples to train a support vector machine model and detect image spam. It mainly addresses problems of current image spam detection such as low detection efficiency and low recall rate, and belongs to the field of data mining and machine learning.

Background

[0002] The continuous improvement of text-based spam filtering drives spammers to explore new ways of producing spam. As a result, image spam has become a popular medium for spam today. According to a McAfee report from 2007, image spam accounts for about 30% of all spam. Image spam embeds advertisements and other unwanted information in images as rendered text, and is sent in bulk to email clients either as an attachment or directly as the body of the email.

[0003] In 2007, Battista Biggio et al. propos...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06K9/62
Inventor: 张卫丰, 胡文婷, 张迎周, 周国强, 王慕妮, 钱小燕, 许碧欢, 陆柳敏
Owner: NANJING UNIV OF POSTS & TELECOMM