Clustering method and device for webpage

A web page clustering method and device in the field of data processing, addressing the low efficiency of existing web page clustering.

Active Publication Date: 2015-04-08
BEIJING GRIDSUM TECH CO LTD

AI Technical Summary

Problems solved by technology

[0006] The main purpose of the present invention is to provide a web page clustering method and device that solve the problem of low web page clustering efficiency in the prior art.



Examples

Embodiment 1

[0026] According to an embodiment of the present invention, a method embodiment is provided that can be used to implement the device embodiment of the present application. It should be noted that the steps shown in the flow charts of the drawings can be performed in a computer system, for example as a set of computer-executable instructions. Although a logical order is shown in the flow charts, in some cases the steps shown or described may be performed in a different order.

[0027] According to an embodiment of the present invention, a web page clustering method is provided. Figure 1 is a flow chart of a web page clustering method according to an embodiment of the present invention. As shown in Figure 1, the method includes the following steps S102 to S108:

[0028] S102: Obtain the first element of the web page to be compared.

[0029] S104: According to the first element and the second element contained in each page category ...
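Step S102 can be illustrated with a minimal sketch. The source does not define what a page's "element" consists of; the sketch assumes that an element is a block-level HTML tag identified by its nesting path among other block tags. The tag set `BLOCK_TAGS`, the class `BlockElementCollector`, and the function `extract_block_elements` are hypothetical names introduced only for illustration.

```python
from html.parser import HTMLParser

# Assumption: these tags count as "block elements"; the source does not list them.
BLOCK_TAGS = {"div", "p", "table", "ul", "ol", "section",
              "article", "header", "footer", "nav", "form"}


class BlockElementCollector(HTMLParser):
    """Collects a path signature such as 'div/ul' for each block-level start tag."""

    def __init__(self):
        super().__init__()
        self._stack = []       # block tags that are currently open
        self.elements = set()  # path signatures seen on the page

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._stack.append(tag)
            self.elements.add("/".join(self._stack))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self._stack:
            while self._stack and self._stack.pop() != tag:
                pass


def extract_block_elements(html):
    """Return the set of block-element signatures for one page (hypothetical helper)."""
    collector = BlockElementCollector()
    collector.feed(html)
    collector.close()
    return collector.elements
```

Representing a page as a set of signatures keeps the later comparison against each page type cheap, but any other representation of block elements would fit the step equally well.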

Embodiment 2

[0068] According to an embodiment of the present invention, a web page clustering device is also provided. The device is mainly used to implement the web page clustering method described in the above embodiment of the present invention. The web page clustering device is described in detail below:

[0069] Figure 2 is a schematic diagram of a web page clustering device according to an embodiment of the present invention. As shown in Figure 2, the device mainly includes an acquisition unit 10, a calculation unit 20, a first processing unit 30 and a second processing unit 40, wherein:

[0070] The acquisition unit 10 is configured to acquire the first element of the web page to be compared.

[0071] The calculation unit 20 is used to sequentially calculate the similarity index value between the web page to be compared and each page ...

Abstract

The invention discloses a web page clustering method and device. The method comprises the following steps: acquiring a first block element of a web page to be compared; sequentially calculating a similarity index value between the web page to be compared and each page type in a page type set, according to the first block element and the second block element of each page type; when the similarity index value between the web page to be compared and a current page type is greater than a preset threshold, clustering the web page into the current page type and updating the second block element of the current page type to obtain an updated page type; and when the similarity index value between the web page to be compared and every page type in the page type set is smaller than the preset threshold, adding the web page to the page type set as a new page type. The method addresses the low web page clustering efficiency of the prior art and improves web page clustering efficiency.
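The loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the source does not say how the similarity index value is computed or how a page type's block elements are updated, so the sketch assumes Jaccard similarity over sets of block elements and an update by set union. The names `jaccard` and `cluster_page` and the default threshold of 0.5 are hypothetical.

```python
def jaccard(a, b):
    """Assumed similarity index: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_page(page_elements, page_types, threshold=0.5):
    """Assign one page to a page type and return that type's index.

    page_elements: set of block elements of the page to be compared.
    page_types: list of sets, one set of block elements per known page type;
                it is modified in place.
    """
    # Compare against each page type in turn, stopping at the first match.
    for index, type_elements in enumerate(page_types):
        if jaccard(page_elements, type_elements) > threshold:
            # Assumed update rule: merge the page's elements into the type.
            page_types[index] = type_elements | page_elements
            return index
    # No page type exceeded the threshold: the page starts a new page type.
    # (A similarity exactly equal to the threshold is not covered by the
    # source; this sketch treats it as "no match".)
    page_types.append(set(page_elements))
    return len(page_types) - 1
```

For example, starting from an empty list of page types, calling `cluster_page` once per page builds the page type set incrementally. Each page is compared against the existing types rather than against every other page.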

Description

Technical field

[0001] The present invention relates to the field of data processing, and in particular to a web page clustering method and device.

Background

[0002] Internet big data analysis makes the collection of web page information increasingly important. However, differences in page code format between websites, and between sections of the same website, make information collection harder. Before web page information is collected, web pages with different coding styles need to be clustered. Clustering groups web pages with similar structural code so that they can be processed together, which reduces the difficulties that code format differences cause during collection.

[0003] The current web page clustering method is to generate a code tag tree through structured HTML code, and compare the shortest edit distance between two tag trees to judge the degree of page similarity and fina...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/951, G06F18/23, G06F18/22
Inventor: 侯明午
Owner: BEIJING GRIDSUM TECH CO LTD