Webpage similarity determination method and device, webpage clustering method and device and electronic equipment
A webpage similarity determination method and a webpage clustering method, applied in the computer field, which address the problems of high time complexity and unsuitability for processing.
Examples
Embodiment 1
[0092] The embodiment of the present application provides a method for determining webpage similarity based on path similarity. As shown in Figure 1, the method may include the following steps:
[0093] Step S101, determining the first leaf node path set of the first webpage and the second leaf node path set of the second webpage;
[0094] Step S102, determining the similarity between the first webpage and the second webpage based on the similarity between the leaf node paths in the first leaf node path set and the leaf node paths in the second leaf node path set.
[0095] Specifically, determining the first leaf node path set of the first webpage and the second leaf node path set of the second webpage includes:
[0096] Determining the DOM tree corresponding to the first webpage and the DOM tree corresponding to the second webpage;
[0097] Based on the determined DOM tree of the first webpage and the DOM tree corresponding to the second webpage, a fir...
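Steps S101 and S102 can be illustrated with a minimal Python sketch. The excerpt above is truncated before it states how leaf node paths are compared, so the comparison here is an assumption: each leaf node path is written as a slash-joined chain of tag names from the root, and page similarity is taken as the Jaccard index of the two path sets. The names `LeafPathCollector`, `leaf_paths`, and `page_similarity` are hypothetical and do not come from the patent.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML; they are always leaves.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class LeafPathCollector(HTMLParser):
    """Collects root-to-leaf tag paths while the parser walks the markup."""

    def __init__(self):
        super().__init__()
        self.stack = []     # entries are [tag, has_child_element]
        self.paths = set()  # leaf node path set, e.g. {"html/body/div/p"}

    def handle_starttag(self, tag, attrs):
        if self.stack:
            self.stack[-1][1] = True  # the parent now has a child element
        if tag in VOID_TAGS:
            self.paths.add("/".join([t for t, _ in self.stack] + [tag]))
        else:
            self.stack.append([tag, False])

    def handle_endtag(self, tag):
        # Ignore end tags of void elements and stray, unmatched end tags.
        if tag in VOID_TAGS or not self.stack or self.stack[-1][0] != tag:
            return
        path = "/".join(t for t, _ in self.stack)
        _, has_child = self.stack.pop()
        if not has_child:
            self.paths.add(path)


def leaf_paths(html):
    """Step S101 (sketch): return the leaf node path set of one page."""
    collector = LeafPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def page_similarity(html_a, html_b):
    """Step S102 (sketch): Jaccard index of the two leaf node path sets.

    Assumption: the patent's own path-level similarity measure is not
    visible in this excerpt, so exact path matching is used instead.
    """
    paths_a, paths_b = leaf_paths(html_a), leaf_paths(html_b)
    if not paths_a and not paths_b:
        return 1.0
    return len(paths_a & paths_b) / len(paths_a | paths_b)
```

Because the paths are held in a set, repeated structures such as many sibling `<li>` elements count once, which keeps the comparison insensitive to how many items a list happens to contain.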
Embodiment 2
[0126] The embodiment of the present application provides a webpage clustering method. As shown in Figure 1, the method may include the following steps:
[0127] Step S201, randomly selecting one webpage from the webpages to be classified and treating it as a category;
[0128] Step S202, determining the similarity between the selected webpage and each of the other webpages to be classified, using the webpage similarity determination method based on path similarity;
[0129] Step S203, assigning the webpages whose similarity is within the threshold range to the same category as the selected webpage, and keeping the remaining webpages as webpages to be classified;
[0130] Step S204, repeating steps S201 to S203 until all webpages to be classified have been classified.
[0131] Specifically, a single-pass clustering algorithm may be used to cluster web pages. The single-pass clustering algorithm is to take a webpage from the webpages to be cla...
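The loop in steps S201 to S204 can be sketched as follows. This is an illustrative reading of those four steps, not the patent's implementation: the function name `cluster_pages`, its parameters, and the interpretation of "within the threshold range" as "similarity greater than or equal to `threshold`" are all assumptions. The similarity function is passed in, so any page-level measure can be used.

```python
import random


def cluster_pages(pages, similarity, threshold, rng=None):
    """Group pages by repeatedly picking a random seed page (sketch).

    pages      -- iterable of page objects (any type the similarity accepts)
    similarity -- callable(page_a, page_b) -> float
    threshold  -- assumed lower bound for joining the seed's category
    rng        -- optional random.Random instance for reproducible runs
    """
    rng = rng or random.Random()
    remaining = list(pages)
    clusters = []
    while remaining:                                  # S204: repeat until empty
        # S201: randomly pick one page to represent a new category.
        seed = remaining.pop(rng.randrange(len(remaining)))
        cluster, rest = [seed], []
        for page in remaining:
            # S202: compare the seed with every page still unclassified.
            if similarity(seed, page) >= threshold:
                cluster.append(page)                  # S203: same category
            else:
                rest.append(page)                     # S203: still unclassified
        clusters.append(cluster)
        remaining = rest
    return clusters
```

Each pass compares the seed against the pages that are still unclassified, so the number of similarity computations shrinks as categories fill up. The resulting grouping can depend on which seeds are drawn, which is why the sketch accepts an `rng` for repeatable runs.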
Embodiment 3
[0150] Figure 3 shows the device 30 for determining webpage similarity based on path similarity provided by the embodiment of the present application.
[0151] The device for determining webpage similarity based on path similarity includes:
[0152] The first determination module is configured to determine the first leaf node path set of the first webpage and the second leaf node path set of the second webpage;
[0153] The second determining module is configured to determine the similarity between the first web page and the second web page based on the similarity between the leaf node paths in the first leaf node path set and the leaf node paths in the second leaf node path set.
[0154] Optionally, the first determination module includes:
[0155] The first determining unit is configured to determine the DOM tree corresponding to the first web page and the DOM tree corresponding to the second web page;
[0156] The second determi...