
Web page clustering method and device

A web page clustering technique, applied in the field of data processing, that addresses problems such as low web page clustering efficiency.

Active Publication Date: 2017-11-21
BEIJING GRIDSUM TECH CO LTD

AI Technical Summary

Problems solved by technology

[0006] The main purpose of the present invention is to provide a web page clustering method and device that solve the problem of low web page clustering efficiency in the prior art.



Examples


Embodiment 1

[0026] According to an embodiment of the present invention, a method embodiment is provided that can be used to implement the device embodiment of the present application. It should be noted that the steps shown in the flowcharts of the drawings can be performed in a computer system, such as one running a set of computer-executable instructions. Although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.

[0027] According to an embodiment of the present invention, a web page clustering method is provided. Figure 1 is a flowchart of a web page clustering method according to an embodiment of the present invention. As shown in Figure 1, the method includes the following steps S102 to S108:

[0028] S102: Obtain the first element of the web page to be compared.

[0029] S104: According to the first element and the second element contained in each page category ...
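Steps S102 and S104 can be illustrated with a minimal sketch. The text above does not say what an "element" is or how the comparison is scored, so the following assumes, purely for illustration, that an element is the set of root-to-node tag paths in a page's HTML and that the similarity index value is the Jaccard coefficient of two such sets. The names `extract_element` and `similarity` are hypothetical and do not come from the patent.

```python
from html.parser import HTMLParser

# Void tags never receive a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class _TagPathCollector(HTMLParser):
    """Collects every root-to-node tag path seen while parsing (assumed feature)."""

    def __init__(self):
        super().__init__()
        self._stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self._stack + [tag]))
        if tag not in VOID_TAGS:
            self._stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: record the path, keep the stack as is.
        self.paths.add("/".join(self._stack + [tag]))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self._stack:
            while self._stack and self._stack.pop() != tag:
                pass


def extract_element(html: str) -> frozenset:
    """S102 (assumed form): derive the page's element as a set of tag paths."""
    collector = _TagPathCollector()
    collector.feed(html)
    collector.close()
    return frozenset(collector.paths)


def similarity(first: frozenset, second: frozenset) -> float:
    """S104 (assumed metric): Jaccard coefficient of two tag-path sets."""
    if not first and not second:
        return 1.0
    return len(first & second) / len(first | second)
```

Under these assumptions, two pages built from the same template share most tag paths and score close to 1, while pages with unrelated layouts score close to 0.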

Embodiment 2

[0068] According to an embodiment of the present invention, a web page clustering device for implementing the above web page clustering method is also provided. The clustering device is mainly used to implement the clustering method described in the above embodiment of the present invention. The web page clustering device provided by the embodiment of the present invention is described in detail below:

[0069] Figure 2 is a schematic diagram of a web page clustering device according to an embodiment of the present invention. As shown in Figure 2, the device mainly includes an acquisition unit 10, a calculation unit 20, a first processing unit 30 and a second processing unit 40, wherein:

[0070] The acquisition unit 10 is configured to acquire the first element of the web page to be compared.

[0071] The calculation unit 20 is used to sequentially calculate the similarity index value between the web page to be compared and each page ...


Abstract

The invention discloses a web page clustering method and device. The web page clustering method includes: obtaining the first element of the web page to be compared; sequentially calculating, according to the first element and the second element contained in each page category in the page category set, the similarity index value between the web page to be compared and each page category; when the similarity index value between the web page to be compared and the current page category is calculated to be greater than the preset threshold, classifying the web page to be compared into the current page category and updating the second element contained in the current page category to obtain the updated page category of the current page category; and when the similarity index value between the web page to be compared and every page category in the page category set is less than the preset threshold, adding the web page to be compared to the page category set as a new page category. The present invention solves the problem of low web page clustering efficiency in the prior art and thereby improves web page clustering efficiency.
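The procedure in the abstract can be sketched as a single-pass loop over the page category set. This is an illustration under stated assumptions, not the patented implementation: elements are modelled as sets of strings, the similarity index is a Jaccard coefficient, and a category is updated by taking the union of its element with the page's element. None of these choices is specified in the abstract. A value exactly equal to the threshold is also left open there, and this sketch treats it as a non-match. `cluster_page` and `jaccard` are hypothetical names.

```python
from typing import Callable, FrozenSet, List

Element = FrozenSet[str]


def jaccard(a: Element, b: Element) -> float:
    """Assumed similarity index: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_page(first: Element,
                 categories: List[Element],
                 threshold: float,
                 sim: Callable[[Element, Element], float] = jaccard) -> int:
    """Assign one page to a category and return that category's index.

    `categories` holds the second element of each page category and is
    modified in place.
    """
    for i, second in enumerate(categories):
        if sim(first, second) > threshold:
            # Assumed update rule: merge the page's element into the category.
            categories[i] = second | first
            return i  # stop at the first category above the threshold
    # No category exceeded the threshold: the page becomes a new category.
    categories.append(first)
    return len(categories) - 1


# Usage with hand-written elements (hypothetical tag paths).
categories: List[Element] = []
pages = [
    frozenset({"html/body/div", "html/body/div/p"}),
    frozenset({"html/body/div", "html/body/div/p", "html/body/div/a"}),
    frozenset({"html/body/table"}),
]
labels = [cluster_page(p, categories, threshold=0.5) for p in pages]
print(labels)  # [0, 0, 1]
```

Each page is compared against one representative per category and the loop stops at the first match, so the number of comparisons per page is bounded by the number of categories rather than by the number of pages already seen.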

Description

Technical field

[0001] The present invention relates to the field of data processing, and in particular to a web page clustering method and device.

Background art

[0002] Due to the needs of Internet big data analysis, the collection of web page information is becoming more and more important. However, differences in page code formats between websites and between columns of the same website increase the difficulty of information collection. Before collecting web page information, it is necessary to cluster web pages with different coding styles. Through web page clustering, web pages with similar structural code can be aggregated for centralized processing, which reduces the difficulties caused by differences in code formats when collecting information.

[0003] The current web page clustering method generates a code tag tree from structured HTML code and compares the shortest edit distance between two tag trees to judge the degree of page similarity and fina...
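The prior-art comparison in [0003] relies on an edit distance between two tag trees. A true tree edit distance is considerably more involved, so the sketch below substitutes a simpler technique and says so plainly: it computes the Levenshtein distance between two flattened tag sequences and normalizes it into a similarity score. It is meant only to show the kind of pairwise comparison being described. The function names and the normalization are assumptions.

```python
from typing import Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two tag sequences (stand-in for tree edit distance)."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def structural_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Assumed normalization: 1 minus the distance divided by the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest
```

Each call costs time proportional to the product of the two sequence lengths, and comparing every page against every other page multiplies that by the number of page pairs.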


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/30
CPC: G06F16/951, G06F18/23, G06F18/22
Inventor: 侯明午
Owner: BEIJING GRIDSUM TECH CO LTD