Judging system and judging method for web page repeating

A technology for web pages and web page content, applied to systems for judging repeated web pages, that addresses problems such as high time complexity and time-consuming calculation.

Active Publication Date: 2015-04-29
BEIJING BAIDU NETCOM SCI & TECH CO LTD

AI Technical Summary

Problems solved by technology

In the prior art, the similarity between two pages is determined by comparing the contents and nodes of the two pages. This is more accurate, but its time complexity is too high and the calculation is time-consuming.



Embodiment Construction

[0041] The present invention will be described in detail below in conjunction with the accompanying drawings and embodiments.

[0042] Figure 1 is a flow chart of the method of the present invention for judging repeated web pages.

[0043] In step 10, multiple web pages are obtained. In this step, a web crawler (spider) may be used to crawl a large number of web pages from the Internet.

[0044] In step 11, the text of each web page is extracted. Many methods can be used to extract the text from a web page; a specific embodiment of step 11 is described in detail below with reference to Figure 2.

[0045] Figure 2 is a sub-flow chart of step 11 in Figure 1.

[0046] In step 111, the web page is divided into blocks. As shown in Figure 4, the web page content displayed by the browser can be divided into multiple content blocks, including: navigation block, web page location block,...
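The block division in step 111 can be illustrated with a minimal sketch. It assumes that a new block starts at each block-level HTML tag and that the text block is the longest block that is not dominated by link text. The tag set `BLOCK_TAGS`, the helper names, and the link-density heuristic are all hypothetical and are not taken from the patent.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries; this set is an illustrative assumption.
BLOCK_TAGS = {"div", "nav", "header", "footer", "section", "article", "ul", "table", "p"}


class BlockSplitter(HTMLParser):
    """Collects visible text into blocks, cutting at each block-level tag."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks: {"text": str, "link_chars": int}
        self._text = []        # text fragments of the block being built
        self._link_chars = 0   # characters of the current block inside <a>
        self._in_link = 0      # nesting depth inside <a>
        self._skip = 0         # nesting depth inside <script>/<style>

    def _flush(self):
        # Close the current block, normalizing whitespace; drop empty blocks.
        text = " ".join(" ".join(self._text).split())
        if text:
            self.blocks.append({"text": text, "link_chars": self._link_chars})
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())

    def close(self):
        super().close()
        self._flush()


def split_into_blocks(html):
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    return parser.blocks


def pick_text_block(blocks):
    """Hypothetical heuristic: longest block in which under half the text is link text."""
    if not blocks:
        return ""
    candidates = [b for b in blocks if b["link_chars"] < 0.5 * len(b["text"])]
    return max(candidates or blocks, key=lambda b: len(b["text"]))["text"]
```

Navigation-style blocks consist almost entirely of link text, so a link-density filter is one simple way to set them aside before choosing the body text.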


Abstract

The invention discloses a system and a method for judging repeated web pages. The judging method includes the following steps: obtaining multiple web pages; extracting the text of each web page; extracting one or more sentences from the text of each web page and computing a sentence signature of the text from those sentences; clustering the web pages according to their sentence signatures; computing additional signatures for the web pages in the same cluster; and judging whether the web pages in the same cluster are repeated according to the additional signatures. With this system and method, whether web pages are repeated can be judged effectively and quickly using multi-dimensional signatures that include the sentence signatures of the web page text.
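The steps in the abstract can be sketched as a two-pass pipeline. This is a minimal illustration, not the patented implementation. The text shown here does not say which sentences are chosen or what the additional signatures are, so the sketch assumes the two longest sentences are hashed with MD5 and that a hash of the normalized page title serves as the additional signature. All function names and the `pages` record layout are hypothetical.

```python
import hashlib
import re
from collections import defaultdict


def md5_hex(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def split_sentences(text):
    # Split on common Chinese and Western sentence terminators.
    parts = re.split(r"[。！？!?.]+", text)
    return [" ".join(p.split()) for p in parts if p.strip()]


def sentence_signature(text, top_n=2):
    """Assumed rule: hash the top_n longest sentences, kept in original order."""
    sentences = split_sentences(text)
    longest = sorted(range(len(sentences)), key=lambda i: -len(sentences[i]))[:top_n]
    chosen = [sentences[i] for i in sorted(longest)]
    return md5_hex("\n".join(chosen))


def additional_signature(page):
    """Hypothetical second dimension: a hash of the normalized page title."""
    return md5_hex(" ".join(page["title"].lower().split()))


def find_repeated(pages):
    # Pass 1: cluster pages whose text has the same sentence signature.
    clusters = defaultdict(list)
    for page in pages:
        clusters[sentence_signature(page["text"])].append(page)

    # Pass 2: inside each cluster, compare the additional signatures.
    repeated = []
    for members in clusters.values():
        by_extra = defaultdict(list)
        for page in members:
            by_extra[additional_signature(page)].append(page["url"])
        repeated.extend(urls for urls in by_extra.values() if len(urls) > 1)
    return repeated
```

The first pass is a cheap dictionary lookup per page, so the finer check only runs among pages that already share a sentence signature.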

Description

Technical field

[0001] The invention relates to the field of the Internet, and in particular to a system for judging repeated web pages and a judging method thereof.

Background technique

[0002] In this era of highly developed technology, the Internet has become the main way for people to obtain information. However, today's Internet is full of repeated content, which causes great trouble for users. Service providers therefore need to judge whether web pages are repeated and select only some high-quality pages from among the repeated ones for users to browse.

[0003] However, in the prior art, the similarity between two pages is generally determined by comparing the contents and nodes of the two pages. This method is more accurate, but its time complexity is too high and the calculation is time-consuming. By signing some important information in a page, and then comparing the signatures of two pages to calculate the similarity, this m...
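The cost concern in paragraph [0003] can be made concrete with a simple count. The symbol n (the number of crawled pages) and the example value are introduced here for illustration and do not appear in the source. Comparing every pair of pages directly requires

```latex
\binom{n}{2} = \frac{n(n-1)}{2} \ \text{comparisons}, \qquad
n = 10^{6} \;\Rightarrow\; \frac{10^{6}\,(10^{6}-1)}{2} \approx 5 \times 10^{11}.
```

By contrast, computing one signature per page and grouping pages by signature in a hash table takes on the order of n signature computations and n table insertions, assuming constant-time lookups.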


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/30
Inventor: 吴一璞
Owner: BEIJING BAIDU NETCOM SCI & TECH CO LTD