Webpage similarity determination method and device, webpage clustering method and device and electronic equipment

A web page similarity determination method and a web page clustering method, applied in the computer field, that address the high time complexity of existing approaches, which makes them unsuitable for processing massive numbers of web pages.

Pending Publication Date: 2021-01-05
CHINA CONSTRUCTION BANK

AI Technical Summary

Problems solved by technology

Traditional web page similarity calculation is based on the tree edit distance, which extends the text edit distance by taking the parent-child relationships between nodes into account in order to compute the edit distance between two trees. However, this calculation has a high time complexity and is not suitable for processing massive numbers of web pages.



Examples


Embodiment 1

[0092] The embodiment of the present application provides a method for determining web page similarity based on path similarity. As shown in Figure 1, the method may include the following steps:

[0093] Step S101: Determine the first leaf node path set of the first web page and the second leaf node path set of the second web page;

[0094] Step S102: Determine the similarity between the first web page and the second web page based on the similarity between the leaf node paths in the first leaf node path set and the leaf node paths in the second leaf node path set.
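
Step S102 can be illustrated with a minimal Python sketch. The excerpt above does not state how the similarity between two leaf node paths is computed, so the metric here is an assumption made purely for illustration: each path is taken to be a tuple of tag names from the root to a leaf, path similarity is the `difflib.SequenceMatcher` ratio of the two tag sequences, and page similarity is the average best-match score taken in both directions. The names `path_similarity` and `page_similarity` are hypothetical.

```python
from difflib import SequenceMatcher


def path_similarity(path_a, path_b):
    """Similarity of two leaf node paths, in the range [0, 1].

    Assumption: a path is a tuple of tag names (root first), and the
    score is the SequenceMatcher ratio of the two tag sequences. The
    patent's own path metric is not shown in the excerpt.
    """
    return SequenceMatcher(None, path_a, path_b).ratio()


def page_similarity(paths_a, paths_b):
    """Average best-match path similarity, taken in both directions."""
    if not paths_a or not paths_b:
        # Two empty pages are treated as identical; one empty page as unrelated.
        return 1.0 if not paths_a and not paths_b else 0.0
    best_a = [max(path_similarity(p, q) for q in paths_b) for p in paths_a]
    best_b = [max(path_similarity(q, p) for p in paths_a) for q in paths_b]
    return (sum(best_a) + sum(best_b)) / (len(best_a) + len(best_b))


first = {("html", "body", "div", "p"), ("html", "body", "div", "img")}
second = {("html", "body", "div", "p"), ("html", "body", "span")}
print(page_similarity(first, second))
```

Each path is compared against every path of the other page, so the cost of this sketch grows with the product of the two set sizes rather than with a tree edit distance over the full DOM trees.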

[0095] Specifically, determining the first leaf node path set of the first webpage and the second leaf node path set of the second webpage includes:

[0096] Determining the DOM tree corresponding to the first webpage and the DOM tree corresponding to the second webpage;

[0097] Based on the determined DOM tree of the first webpage and the DOM tree corresponding to the second webpage, a fir...
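
As a rough illustration of deriving a leaf node path set from a page's DOM structure, the sketch below uses Python's standard `html.parser` module. It is not the procedure described in the application, whose remaining steps are truncated above. It assumes reasonably well-formed HTML, treats an element as a leaf when it contains no child elements, and represents each path as a tuple of tag names from the root. The names `LeafPathCollector`, `leaf_paths`, and `VOID_TAGS` are hypothetical.

```python
from html.parser import HTMLParser

# Void elements never have an end tag, so they are always leaves.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class LeafPathCollector(HTMLParser):
    """Collects root-to-leaf tag paths while the HTML is parsed."""

    def __init__(self):
        super().__init__()
        self.stack = []      # tags of the currently open elements, root first
        self.has_child = []  # parallel to stack: has the element had a child element?
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        if self.has_child:
            self.has_child[-1] = True  # the enclosing element is not a leaf
        if tag in VOID_TAGS:
            self.paths.add(tuple(self.stack) + (tag,))
            return
        self.stack.append(tag)
        self.has_child.append(False)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag: ignore it
        # Close elements until the matching start tag has been popped.
        while self.stack:
            top = self.stack.pop()
            childless = not self.has_child.pop()
            if childless:
                self.paths.add(tuple(self.stack) + (top,))
            if top == tag:
                break


def leaf_paths(html):
    """Return the set of leaf node paths found in an HTML string."""
    collector = LeafPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


page = "<html><body><div><p>a</p><img src='x'></div><span>b</span></body></html>"
for path in sorted(leaf_paths(page)):
    print("/".join(path))
```

Because the paths are kept in a set, repeated structures such as many identical list items collapse into a single path. Whether the application counts duplicates is not visible in the excerpt.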

Embodiment 2

[0126] The embodiment of the present application provides a web page clustering method. As shown in Figure 1, the method may include the following steps:

[0127] Step S201: Randomly select a web page from the web pages to be classified and treat it as a category;

[0128] Step S202: Use the web page similarity determination method based on path similarity to determine the similarity between the selected web page and each of the other web pages to be classified;

[0129] Step S203: Assign the web pages whose similarity is within the threshold range to the same category as the selected web page, and keep the remaining web pages as web pages to be classified;

[0130] Step S204: Repeat steps S201 to S203 until all web pages to be classified have been classified.

[0131] Specifically, a single-pass clustering algorithm may be used to cluster web pages. The single-pass clustering algorithm is to take a webpage from the webpages to be cla...
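
The loop in steps S201 to S203, repeated as described in S204, can be sketched in Python as follows. This is an illustration under stated assumptions, not the application's implementation: "within the threshold range" is interpreted as a similarity greater than or equal to a single threshold, the similarity function is supplied by the caller, and the names `cluster_pages`, `similarity`, `threshold`, and `rng` are hypothetical.

```python
import random


def cluster_pages(pages, similarity, threshold, rng=None):
    """Group pages by repeatedly picking a random page as a new category.

    `similarity(a, b)` returns a score; pages scoring at least `threshold`
    against the picked page join its category (an assumed reading of
    "within the threshold range").
    """
    rng = rng or random.Random()
    remaining = list(pages)
    clusters = []
    while remaining:                                         # S204: repeat until none remain
        seed = remaining.pop(rng.randrange(len(remaining)))  # S201: random page as a category
        cluster = [seed]
        unassigned = []
        for page in remaining:
            if similarity(seed, page) >= threshold:          # S202: compare with the seed
                cluster.append(page)                         # S203: same category
            else:
                unassigned.append(page)                      # S203: still to be classified
        clusters.append(cluster)
        remaining = unassigned
    return clusters


# Toy usage: pages are labels, and two pages match when their first letters agree.
pages = ["a1", "a2", "b1", "b2", "c1"]
same_letter = lambda x, y: 1.0 if x[0] == y[0] else 0.0
print(cluster_pages(pages, same_letter, 0.5, random.Random(0)))
```

Every page is compared only against the seed of the current category, so each pass costs one similarity computation per remaining page. The result can depend on which seeds are drawn when the similarity relation is not transitive.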

Embodiment 3

[0150] Figure 3 shows the device for determining web page similarity based on path similarity provided by the embodiment of the present application. The device 30 includes:

[0151] A device for determining web page similarity based on path similarity is provided, including:

[0152] The first determination module is configured to determine the first leaf node path set of the first webpage and the second leaf node path set of the second webpage;

[0153] The second determining module is configured to determine the similarity between the first web page and the second web page based on the similarity between the leaf node paths in the first leaf node path set and the leaf node paths in the second leaf node path set.

[0154] Optionally, the first determination module includes:

[0155] The first determining unit is configured to determine the DOM tree corresponding to the first web page and the DOM tree corresponding to the second web page;

[0156] The second determi...


Abstract

The invention provides a web page similarity determination method and device based on path similarity, a web page clustering method and device, and electronic equipment, and is applied to the technical field of computers. In the method, the similarity of web pages is determined based on the similarity of the leaf node paths of the web pages. Compared with the traditional method, in which web page similarity is determined by a tree edit distance, the disclosed method has a simple similarity calculation process and low time complexity, while also improving the accuracy of the web page similarity calculation. In addition, performing web page clustering with the path-similarity-based web page similarity determination method reduces the time complexity of clustering, so that a large number of web pages can be processed rapidly.

Description

Technical field

[0001] The present application relates to the field of computer technology, and in particular to a method for determining web page similarity based on path similarity, a web page clustering method, corresponding devices, and electronic equipment.

Background technique

[0002] With the development of data mining technology, the demand for and importance of data continue to increase. Web pages are the main carrier of data, and a large amount of data is presented through them, which makes the automatic extraction of web data an important technology. HTML pages combine data stored in a back-end database with HTML content templates, and most of the web pages within a website are generated from the same set of content templates. Therefore, clustering web pages and then extracting data from the pages generated by the same template will greatly improve the accuracy of extraction.

[0003] Web page similarity calculat...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/906, G06F16/955, G06F40/143
CPC: G06F16/906, G06F16/955, G06F40/143
Inventor: 王一洲, 洪毅清, 吕文栋, 蔡淑莲, 钟文杰
Owner: CHINA CONSTRUCTION BANK