Similarity-based semi-supervised learning spam page detection method

A semi-supervised learning technique for spam web page detection, applied in fields such as special data processing applications, instruments, and electrical digital data processing, with the effect of simplifying the calculation steps.

Inactive Publication Date: 2010-08-25
NANJING UNIV OF POSTS & TELECOMM

AI Technical Summary

Problems solved by technology

Technical problem: the purpose of the present invention is to design a similarity-based semi-supervised learning spam

Abstract

The invention relates to a similarity-based semi-supervised learning method for detecting spam pages, which addresses problems that arise when semi-supervised learning is carried out over page links. The method builds a hidden 'link' graph based on page similarity and comprises the following steps: 1, extracting page features based on content and links; 2, performing a second round of feature extraction on the features from step 1 using principal component analysis (PCA); 3, building a hidden 'link' graph according to page similarity; 4, building a Gaussian random field model on the 'link' graph and carrying out semi-supervised learning through harmonic functions; and 5, combining the classification results of the model built in step 4 with those of other classifiers to improve classification performance. In the graph, each link between pages is weighted according to the similarity of the pages it connects. The Gaussian random field model is then built on this weighted graph and harmonic functions are used for semi-supervised learning, which improves the semi-supervised learning capacity.
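Steps 2 to 4 of the abstract can be illustrated with a minimal sketch. It is not the patented implementation. The Gaussian (RBF) similarity kernel, the `sigma` parameter, the toy feature matrix, the label coding (1 = spam, 0 = normal), and the function names `pca_reduce`, `similarity_graph`, and `harmonic_scores` are all assumptions made for illustration. The harmonic solution uses the standard closed form for Gaussian random fields, f_u = (D_uu - W_uu)^(-1) W_ul f_l. Step 1 (feature extraction from real pages) and step 5 (combining with other classifiers) are left out.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project rows of X onto the top principal components (PCA via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def similarity_graph(Z, sigma=1.0):
    """Dense weight matrix of the hidden 'link' graph.
    Assumption: a Gaussian kernel on Euclidean distance measures similarity."""
    sq_dist = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # no self-links
    return W

def harmonic_scores(W, labeled_idx, labels):
    """Harmonic-function solution on the graph: solve L_uu f_u = W_ul f_l."""
    n = W.shape[0]
    labeled_idx = np.asarray(labeled_idx)
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)
    laplacian = np.diag(W.sum(axis=1)) - W
    L_uu = laplacian[np.ix_(unlabeled_idx, unlabeled_idx)]
    W_ul = W[np.ix_(unlabeled_idx, labeled_idx)]
    f_l = np.asarray(labels, dtype=float)
    f_u = np.linalg.solve(L_uu, W_ul @ f_l)
    return unlabeled_idx, f_u

# Hypothetical toy data: 10 "normal" and 10 "spam" pages with 5 features each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (10, 5)), rng.normal(3.0, 0.3, (10, 5))])

Z = pca_reduce(X, n_components=2)
W = similarity_graph(Z, sigma=1.0)

# Only one page per class is labeled (the small-sample setting).
unlabeled_idx, f_u = harmonic_scores(W, labeled_idx=[0, 10], labels=[0, 1])
predicted_spam = unlabeled_idx[f_u > 0.5]
```

Each harmonic score is a weighted average of the scores of a page's neighbors in the graph, so it lies between the labeled values and can be thresholded (here at 0.5) or passed on as one input to a combined classifier. The dense pairwise computation is only suitable for small examples; a real page collection would need a sparse graph.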

Description

A Similarity-Based Semi-Supervised Learning Spam Detection Method

Technical field

The invention relates to a method for detecting spam web pages in search engines. It mainly addresses spam page detection when only a small number of labeled samples is available, and belongs to the fields of search engines and semi-supervised machine learning.

Background

Search engines enable users to find the content they are interested in among a large number of web pages. But the prevalence of spam has damaged the credibility of search engines and eroded the trust of their users. Finding an effective way to reduce the impact of web page spam and improve the quality of search engine ranking is very important if users are to find relevant and correct web pages quickly. Initially, search engines used traditional information extraction algorithms, such as TF-IDF (Term Frequency-Inverse Document Frequency) [1], to rank the results returned by queries submitted to t...

Application Information

IPC(8): G06F17/30
Inventors: 张卫丰, 朱丹梅, 周国强, 张迎周, 陆柳敏, 许碧娣, 刘霞
Owner: NANJING UNIV OF POSTS & TELECOMM