Method and device for recognizing similar webpages

A technique for recognizing similar web pages, applied in special data processing applications and electrical digital data processing. It addresses the long time consumption, errors, and low efficiency of existing similarity judgment approaches.

Active Publication Date: 2013-04-17
任芳坤
2 Cites, 14 Cited by

AI Technical Summary

Problems solved by technology

[0005] Collecting the features of web pages manually takes too long and is inefficient; when judging the similarity of web pages through histograms, since the position of the color information in each picture in the ...


Examples

Embodiment 1

[0064] This embodiment of the present invention provides a method for identifying similar web pages. Referring to Figure 1, the method flow includes:

[0065] 101: Obtain Hypertext Markup Language (HTML) element information of a first webpage to be classified and HTML element information of a second webpage whose category information is known;

[0066] 102: Calculate the similarity between the first webpage and the second webpage according to the HTML element information of the first webpage and the second webpage;

[0067] 103: When the degree of similarity is greater than a preset similarity threshold, determine that the first webpage and the second webpage are similar webpages.

[0068] The embodiment of the present invention obtains the HTML element information of the first webpage to be classified and the second webpage of the known category, and calculates the similarity according to the HTML element information corresponding to the two webpages to determine whether the two webpag...
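
Steps 101 to 103 can be illustrated with a minimal sketch. The text shown here does not specify what counts as "HTML element information" or how the similarity is computed, so the sketch assumes tag counts as the element information and a multiset Jaccard ratio as the similarity. The names `element_info`, `similarity`, `is_similar`, and the default threshold of 0.8 are hypothetical and are not taken from the patent.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Counts start tags; an assumed stand-in for 'HTML element information'."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1


def element_info(html: str) -> Counter:
    # Step 101: obtain HTML element information (assumption: per-tag counts).
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def similarity(first: Counter, second: Counter) -> float:
    # Step 102: multiset Jaccard similarity (an illustrative choice only).
    union = sum((first | second).values())
    if union == 0:
        return 0.0
    return sum((first & second).values()) / union


def is_similar(first_html: str, second_html: str, threshold: float = 0.8) -> bool:
    # Step 103: similar only when the similarity exceeds the preset threshold.
    score = similarity(element_info(first_html), element_info(second_html))
    return score > threshold
```

Under these assumptions, two pages with the same tag layout but different text score 1.0, while pages built from different elements score low and fall below the threshold.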

Embodiment 2

[0070] This embodiment of the present invention provides a method for identifying similar web pages. Referring to Figure 2, the method flow includes:

[0071] 201: Obtain Hypertext Markup Language (HTML) element information of the first webpage to be classified and HTML element information of the second webpage whose category information is known.

[0072] Step 201 may specifically include:

[0073] 2011: Acquire the Document Object Model (DOM) structure information of the first webpage according to the Uniform Resource Locator (URL) of the first webpage to be classified.

[0074] A web crawler fetches the webpage information of the first webpage to be classified at the specified URL. The webpage information is the HTML code of the webpage, and the DOM structure information of the first webpage is obtained from that HTML code.

[0075] 2012: Acquire the DOM structure information of the second webpage with known category information from t...
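
Sub-step 2011 (fetch the HTML code for a URL, then derive the DOM structure from it) can be sketched as follows. This is a simplified illustration using only the Python standard library, not the crawler described in the patent. The nested-dictionary tree format and the names `DomBuilder`, `dom_from_html`, `fetch_html`, and `dom_from_url` are assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# Elements that never have children or an end tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DomBuilder(HTMLParser):
    """Builds a simplified DOM tree of nested {'tag', 'children'} dicts."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#document", "children": []}
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}
        self._stack[-1]["children"].append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i]["tag"] == tag:
                del self._stack[i:]
                break


def dom_from_html(html: str) -> dict:
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def fetch_html(url: str, timeout: float = 10.0) -> str:
    # Stand-in for the web crawler: download the HTML code for one URL.
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def dom_from_url(url: str) -> dict:
    return dom_from_html(fetch_html(url))
```

Keeping the parsing separate from the download means the same `dom_from_html` function could also be applied to HTML that has already been stored, for example pages of known category.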

Embodiment 3

[0117] This embodiment of the present invention provides a device for identifying similar webpages. Referring to Figure 4, the device includes:

[0118] The first obtaining module 401 is configured to obtain Hypertext Markup Language (HTML) element information of the first webpage to be classified and HTML element information of the second webpage whose category information is known;

[0119] The calculation module 402 is configured to calculate the similarity between the first webpage and the second webpage according to HTML element information of the first webpage and the second webpage;

[0120] The determining module 403 is configured to determine that the first webpage and the second webpage are similar webpages when the similarity is greater than a preset similarity threshold.

[0121] The embodiment of the present invention obtains the HTML element information of the first webpage to be classified and the second webpage of the known category, and calculates the...
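
The three-module structure of the device can be sketched as a small class whose parts map onto modules 401, 402, and 403. This is a hedged illustration of the composition only: the class name `SimilarPageDevice` and its fields are hypothetical, and the concrete obtaining and calculation functions are supplied by the caller because the text shown here does not define them.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

Info = TypeVar("Info")  # whatever form the HTML element information takes


@dataclass
class SimilarPageDevice(Generic[Info]):
    # First obtaining module 401: webpage HTML -> HTML element information.
    obtain: Callable[[str], Info]
    # Calculation module 402: two sets of element information -> similarity.
    calculate: Callable[[Info, Info], float]
    # Preset similarity threshold used by the determining module 403.
    threshold: float

    def determine(self, first_page: str, second_page: str) -> bool:
        """Determining module 403: True when similarity exceeds the threshold."""
        first_info = self.obtain(first_page)
        second_info = self.obtain(second_page)
        return self.calculate(first_info, second_info) > self.threshold
```

Passing the modules in as functions keeps the comparison rule (strictly greater than the preset threshold) independent of how element information is extracted or scored.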

Abstract

The invention discloses a method and device for recognizing similar webpages and belongs to the technical field of computers. The method includes: respectively obtaining hypertext markup language (HTML) element information of a first webpage to be classified and HTML element information of a second webpage whose class information is known; calculating similarity of the first webpage and the second webpage according to the HTML element information of the first webpage and the second webpage; and when the similarity is larger than a preset similarity threshold, determining the first webpage and the second webpage to be similar webpages. Whether the two webpages are similar or not is determined by obtaining the HTML element information of the first webpage to be classified and the classified second webpage and calculating the similarity according to the HTML element information corresponding to the two webpages, so that defects of low efficiency of manual webpage similarity judgment and higher misjudgment rate of webpage similarity judgment by histograms are overcome.

Description

Technical field

[0001] The present invention relates to the field of computer technology, and in particular to a method and device for identifying similar web pages.

Background art

[0002] With the popularization and development of the Internet, both the number of websites and the number of web pages on each website have grown explosively. This has given rise to many new Internet services, such as web page clustering and web page classification. These services classify web pages based on the information presented on them, thereby providing a better user experience. Classifying a web page requires similarity judgments: web pages of known categories that are similar to the web page to be classified are found, and they are used to determine the category of the web page to be classified.

[0003] A web page is composed of HTML (Hypertext Markup Language) element information, so certain combinations of HTML element informa...

Application Information

IPC(8): G06F17/30
Inventor: 李鹏
Owner: 任芳坤