Method and device for confirming web structure similarity

A technique for determining webpage structure similarity, applied in the computer field, that addresses the inability of prior methods to calculate the similarity of webpage structures.

Active Publication Date: 2010-04-14
BEIJING BAIDU NETCOM SCI & TECH CO LTD

AI Technical Summary

Problems solved by technology

[0006] The embodiment of the present invention provides a method and device for determining similarity of webpage structure, which is used to solve...



Embodiment Construction

[0020] The technical solutions of the embodiments of the present invention will be further described below in conjunction with the accompanying drawings and specific embodiments.

[0021] According to an embodiment of the present invention, a method for determining webpage structure similarity is provided. Figure 1 is a flow chart of this method. As shown in Figure 1, the method includes:

[0022] Step 101: determine the template feature vector of the webpage according to the DOM tree of the webpage;

[0023] Step 102: calculate the webpage structure similarity from the template feature vectors, and perform search or clustering.
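Steps 101 and 102 can be illustrated with a minimal Python sketch. The text shown here does not say how the template feature vector is built from the DOM tree or which similarity measure is used, so the sketch makes two assumptions: the vector counts root-to-node tag paths, and similarity is the cosine of two such vectors. The names `TagPathCollector`, `template_feature_vector`, and `structure_similarity` are hypothetical and are not taken from the patent.

```python
from collections import Counter
from html.parser import HTMLParser
import math

# Elements that never take a closing tag, so they must not stay on the path stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Counts root-to-node tag paths (an assumed feature choice, not the patent's)."""

    def __init__(self):
        super().__init__()
        self.stack = []         # currently open tags, outermost first
        self.paths = Counter()  # tag path -> number of occurrences

    def handle_starttag(self, tag, attrs):
        self.paths["/".join(self.stack + [tag])] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray closers.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def template_feature_vector(html):
    """Step 101 (sketch): derive a structural feature vector from the page markup."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def structure_similarity(vec_a, vec_b):
    """Step 102 (sketch): cosine similarity between two feature vectors."""
    dot = sum(count * vec_b[path] for path, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Because only tag paths are counted, two pages that share a template but carry different text produce the same vector, which is the property the method relies on when comparing page structure rather than page content.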

[0024] Through the above processing, all cheating websites with the same webpage structure can be found by looking for homepage templates with similar templa...


Abstract

The invention provides a method and a device for determining webpage structure similarity. The method includes determining template feature vectors of webpages according to the DOM trees of the webpages, calculating webpage structure similarity from the template feature vectors, and then performing search or clustering. This overcomes the shortcoming that prior-art methods cannot calculate webpage structure similarity: when operators find a cheating website, they can find other cheating websites with the same webpage structure by searching for homepages with similar template feature vectors. In addition, groups of cheating websites can be found automatically and quickly by clustering the template feature vectors of all homepage templates.
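The two uses described in the abstract, searching from a known cheating site and grouping all homepages, can be sketched as follows. This is an illustration under assumptions, not the patent's procedure: feature vectors are plain dictionaries mapping a structural feature to a count, similarity is cosine similarity, clustering is a greedy single pass, and the 0.9 threshold and the function names `find_similar` and `cluster` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def find_similar(seed, candidates, threshold=0.9):
    """Search: names of candidates whose vector is close to the seed vector."""
    return [name for name, vec in candidates.items()
            if cosine(seed, vec) >= threshold]


def cluster(vectors, threshold=0.9):
    """Greedy single-pass clustering against each cluster's first member."""
    clusters = []  # list of (representative_vector, [member names])
    for name, vec in vectors.items():
        for rep, members in clusters:
            if cosine(rep, vec) >= threshold:
                members.append(name)
                break
        else:
            # No existing cluster is similar enough, so start a new one.
            clusters.append((vec, [name]))
    return [members for _, members in clusters]
```

A greedy pass is order-dependent and compares only against the first member of each cluster; it is chosen here for brevity, and any standard clustering method could be substituted.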

Description

Technical field

[0001] The embodiments of the present invention relate to the field of computer technology, and in particular to a method and device for determining webpage structure similarity.

Background art

[0002] In the prior art, the main objects processed by search engines are webpages. In addition to analyzing and processing the content of a webpage, the search engine also needs to compare the similarity between two or more webpages, for example, the similarity of webpage content and/or the similarity of webpage structure.

[0003] Similarity of webpage content refers to the following: when the same article is copied and reproduced by different websites, the layout of each website is different but the content of the article is the same. In this case, the search engine does not need to present all the webpages containing the article to the user, because this will make it difficult for the user to find other different content, and the search eng...


Application Information

IPC(8): G06F17/30
Inventor: 李景阳, 张波
Owner: BEIJING BAIDU NETCOM SCI & TECH CO LTD