Content based junk webpage detecting method and detecting apparatus thereof

A spam web page detection technology, applied in website content management, network data retrieval, other database retrieval, etc. It addresses problems such as spam pages ranking highly and degrading the relevance and accuracy of search results, and aims to improve relevance.

Active Publication Date: 2015-12-23
TIANJIN UNIV
5 Cites, 14 Cited by

AI Technical Summary

Problems solved by technology

It can be seen from this that the PageRank algorithm only considers the links between webpages and ignores the correlation between the content of the webpage and the topic, so

Examples

Embodiment 1

[0055] A content-based spam web page detection method (see Figure 1) includes the following steps:

[0056] 101: Select several spam web pages as seed spam web pages;

[0057] Assume that there are N web pages in total, of which x have been labeled as spam and stored in the set X. Randomly select m spam web pages from X as the sample set M; M denotes the seed spam web pages.
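The seed selection in step 101 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `select_seeds`, the `rng_seed` parameter and the page identifiers are hypothetical.

```python
import random


def select_seeds(labeled_spam, m, rng_seed=None):
    """Randomly draw m seed spam pages (the sample set M) from the labeled set X."""
    pool = sorted(labeled_spam)  # sort so a fixed rng_seed gives a repeatable draw
    if m > len(pool):
        raise ValueError("m cannot exceed the number of labeled spam pages")
    return set(random.Random(rng_seed).sample(pool, m))


# Hypothetical labeled spam set X with x = 5 pages.
X = {"spam-a", "spam-b", "spam-c", "spam-d", "spam-e"}
M = select_seeds(X, m=2, rng_seed=0)
```

Sampling without replacement keeps M a subset of X with exactly m distinct pages.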

[0058] 102: For every web page, calculate the maximum content similarity between that page and the seed spam web pages, and generate a similarity set S;

[0059] First, the features of all web pages are extracted by statistical methods, and the extracted features are then assembled into vectors using the vector space model (VSM). Finally, cosine similarity in the vector space is used to calculate the similarity between the content of every web page and the content of the seed spam web pages.
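Step 102 can be sketched as below. The text does not say which statistical features are extracted, so this sketch assumes raw term frequencies over whitespace-separated tokens; the names `tf_vector`, `cosine_similarity` and `max_similarity_set` are hypothetical.

```python
import math
from collections import Counter


def tf_vector(text):
    """Assumed feature extraction: raw term frequencies of lowercase tokens."""
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def max_similarity_set(pages, seed_ids):
    """Build S: for each page, its highest similarity to any seed spam page."""
    vectors = {pid: tf_vector(text) for pid, text in pages.items()}
    return {
        pid: max(cosine_similarity(vec, vectors[s]) for s in seed_ids)
        for pid, vec in vectors.items()
    }


# Hypothetical toy corpus; "a" is the only seed spam page.
pages = {
    "a": "cheap pills cheap",
    "b": "cheap pills cheap",
    "c": "weather report today",
}
S = max_similarity_set(pages, seed_ids=["a"])
```

In this sketch a seed page has similarity 1 with itself, so seeds always appear in S with the maximum value.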

[0060] 103: Use the PageRank algorithm to sort all web pages; and set the sorted web pages...

Embodiment 2

[0067] The scheme in Embodiment 1 is described in detail below with specific calculation formulas and examples:

[0068] 201: Select several spam web pages as seed spam web pages;

[0069] Here, a spam web page is a web page containing malicious or worthless content. In the embodiment of the present invention, seed spam web pages are selected as follows: suppose there are N web pages in total, of which x have been labeled as spam and stored in the set X. Randomly select m spam web pages from X as the sample set M; M denotes the seed spam web pages.

[0070] 202: Use a statistical method to extract features from the web pages, and then use the vector space model (VSM) to form feature vectors from the extracted features;

[0071] The innovation of the embodiment of the present invention is based on the traditional PageRank algorithm, adding the calculation of the co...

Embodiment 3

[0102] The feasibility of the schemes in Embodiments 1 and 2 is verified below with a specific example:

[0103] In the embodiments of the present invention, recall is used to evaluate the experimental results. Recall is the ratio of the number of pages in the intersection of the detected spam web page set and the labeled spam web page set to the number of pages in the labeled spam web page set.
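The recall definition above translates directly into a set computation. The function name `recall` and the page identifiers below are hypothetical.

```python
def recall(detected, labeled):
    """|detected ∩ labeled| / |labeled|, following the definition above."""
    detected, labeled = set(detected), set(labeled)
    if not labeled:
        raise ValueError("the labeled spam set must not be empty")
    return len(detected & labeled) / len(labeled)


# Two of the four labeled spam pages were detected.
r = recall(detected={"a", "b", "c"}, labeled={"b", "c", "d", "e"})
```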

[0104] When calculating the experimental results, the capacity of the set of detected spam web pages is set to 20,000 web pages. The similarity threshold s is set to five values, 0.91, 0.93, 0.95, 0.97 and 0.99, to observe the recall rate.

[0105] Comparing the experimental results of this method with the traditional PageRank results, it is found that the number and recall rate of spam web pages detected by this method (Sim-PageRank) are higher than those of the traditional PageRank algorithm. When the similarity threshol...

Abstract

The present invention discloses a content-based spam web page detection method and a detection apparatus thereof. The method comprises: calculating, for each web page, the maximum content similarity between that page and the seed spam web pages, and generating a similarity set; sorting all web pages in descending order using the PageRank algorithm; based on the sorting result, looking up in the similarity set the content similarity between each web page and the seed spam web pages; and comparing that similarity with a similarity threshold to detect spam web pages, adding the detected pages to a spam web page set. The apparatus comprises a generation module, a sorting module, a search module and a detection module. Building on the conventional PageRank algorithm, the method adds a web page content similarity check, so that both the links and the content of web pages are used in detecting spam web pages, thereby improving the accuracy and efficiency of spam web page detection.
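The detection loop described in the abstract might look like the sketch below. Two points are assumptions, not statements from the text: a page is flagged when its similarity is at or above the threshold, and the optional `capacity` argument stops the scan once that many pages have been flagged. The name `detect_spam` and all scores are hypothetical.

```python
def detect_spam(pagerank, similarity, threshold, capacity=None):
    """Scan pages in descending PageRank order and flag those similar to the seeds."""
    ranked = sorted(pagerank, key=pagerank.get, reverse=True)
    spam = []
    for pid in ranked:
        if capacity is not None and len(spam) >= capacity:
            break  # assumed meaning of the detected-set capacity
        # Assumption: similarity >= threshold marks the page as spam.
        if similarity.get(pid, 0.0) >= threshold:
            spam.append(pid)
    return spam


# Hypothetical PageRank scores and similarity set S.
pr = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
S = {"a": 0.50, "b": 0.96, "c": 0.99, "d": 0.97}
flagged = detect_spam(pr, S, threshold=0.95)
```

Scanning in PageRank order means that, under a capacity limit, the highest-ranked suspicious pages are flagged first.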

Description

Technical field

[0001] The invention relates to the fields of data mining, text mining and search engines, and in particular to a content-based spam web page detection method and a detection device thereof.

Background

[0002] Page ranking algorithms can be used to detect spam web pages. Among them, PageRank is the method Google uses to rate the level or importance of web pages, and is the only standard used by Google to measure the quality of a website.

[0003] The calculation of PageRank is based on the following two basic assumptions:

[0004] Quantity assumption: in the web graph model, the more incoming links a page node receives from other web pages, the more important the page is.

[0005] Quality assumption: the inbound links pointing to page A differ in quality, and high-quality pages pass more weight to other pages through their links. So the more high-quality pages point to page A, the more important page A is.

[0006] So PageRank implements the ...
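The quantity and quality assumptions can be illustrated with a standard power-iteration form of PageRank. This is a generic textbook sketch, not the formulation in the patent text, which is truncated here. The damping factor of 0.85, the iteration count and the handling of pages without outgoing links are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping page -> list of linked pages."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in nodes}
        for p in nodes:
            targets = links.get(p, [])
            if targets:
                # Quality assumption: a page passes its own weight through its links.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumed handling of pages with no outgoing links: spread evenly.
                share = damping * rank[p] / n
                for t in nodes:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical graph: "c" receives two inbound links, "a" one, "b" none.
scores = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

In this toy graph "c" ranks highest because it has the most inbound links, and "a" outranks "b" because its single inbound link comes from the high-ranked page "c".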

Application Information

IPC(8): G06F17/30
CPC: G06F16/9535, G06F16/958, G06F16/972, G06F2216/03
Inventor: 喻梅, 孟莹, 于瑞国, 周静, 雷霆, 田逸尘
Owner: TIANJIN UNIV