Method for judging whether web page content is identical or not

A method for judging whether web pages have the same content, applied in the field of search engines. It addresses the problems that web pages with identical content are neither identified nor filtered, which makes search results inconvenient to use, and it achieves convenient, quick viewing with fewer redundant result pages.

Inactive Publication Date: 2009-01-21
胡辉
Cites: 0. Cited by: 56.

AI Technical Summary

Problems solved by technology

[0002] At present, search engines rank all pages related to the query keywords according to their own algorithms and then display them to the user. However, many websites on the Internet reprint articles, news, and similar content, and web pages with the same content are not filtered out, so the search engine returns many results with the same article content, and the user has to find useful results among a large number of redundant ones.



Examples


Detailed description of the embodiments

[0077] The present invention is described in further detail below with reference to the accompanying drawings and an example.

[0078] Take the following two webpages A and B as an example, where:

[0079] For the URL of web page A, see 410 in Figure 4; for how web page A is displayed in the IE browser, see 420 in Figure 4; for the abbreviated content of the HTML source file of web page A, see Figure 6;

[0080] For the URL of web page B, see 510 in Figure 5; for how web page B is displayed in the IE browser, see 520 in Figure 5; for the abbreviated content of the HTML source file of web page B, see Figure 7;

[0081] As shown in Figure 1, we first calculate the title similarity of web pages A and B.

[0082] Step 110 extracts the title of the web page from its HTML source file. The extraction finds the <title> and </title> tags in the source file (case-insensitive), and the content between these two ...
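
The title extraction in step 110 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the tags in question are the HTML title tags, and the function name extract_title and the whitespace normalization are hypothetical choices made only for this example.

```python
import re
from typing import Optional

# Assumption: the tags referred to in step 110 are the HTML title tags.
# re.IGNORECASE covers <TITLE>, <Title>, etc.; re.DOTALL lets the title
# span several lines of the source file.
_TITLE_RE = re.compile(
    r"<title(?:\s[^>]*)?>(.*?)</title\s*>",
    re.IGNORECASE | re.DOTALL,
)


def extract_title(html: str) -> Optional[str]:
    """Return the text between the title tags, or None if there is none.

    Hypothetical helper for illustration only.
    """
    match = _TITLE_RE.search(html)
    if match is None:
        return None
    # Collapse runs of whitespace (including newlines) into single spaces.
    return " ".join(match.group(1).split())
```

A regular expression is enough for this narrow task; a page with malformed markup might call for a real HTML parser instead.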


Abstract

The invention relates to a method for judging whether web pages have the same content, which can be used in the field of search engines to filter query results that share the same content. The method comprises: calculating the similarity of the web page titles and the similarity of the web page text contents, and judging from these two similarities whether the pages have the same content. If both the title similarity and the text content similarity reach a certain threshold, the web pages are determined to have the same content; otherwise they are determined to have different content.
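
The decision rule described in the abstract can be sketched as below. The excerpt does not say how the similarities are computed or what the thresholds are, so everything specific here is an assumption: difflib.SequenceMatcher stands in for the similarity measure, the 0.8 defaults are placeholders, and the names similarity and same_content are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the patent's own measure is not shown."""
    return SequenceMatcher(None, a, b).ratio()


def same_content(
    title_a: str,
    text_a: str,
    title_b: str,
    text_b: str,
    title_threshold: float = 0.8,  # placeholder value, not from the source
    text_threshold: float = 0.8,   # placeholder value, not from the source
) -> bool:
    """Pages count as the same only if BOTH similarities reach their threshold."""
    title_ok = similarity(title_a, title_b) >= title_threshold
    text_ok = similarity(text_a, text_b) >= text_threshold
    return title_ok and text_ok
```

Requiring both conditions means two pages with the same headline but different bodies, or the same body under unrelated headlines, are treated as different content.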

Description

Technical field

[0001] The invention relates to a method for judging whether web pages have the same content, which can help filter duplicate search results in search engines.

Background art

[0002] At present, search engines rank all pages related to the query keywords according to their own algorithms and then display them to the user. However, many websites on the Internet reprint articles, news, and similar content, and web pages with the same content are not filtered out, so the search engine returns many results with the same article content, and the user has to find useful results among a large number of redundant ones, which is inconvenient. Some search engines group related web pages from the same website (same domain name, different URLs) into one group of results and display them next to each other, but they cannot identify and filter out web pages, such as articles and news, that come from different websites and have the same content.

Contents of the inve...


Application Information

IPC(8): G06F17/30
Inventor 胡辉
Owner 胡辉