Uniform resource locator (URL) de-duplication method and device
A uniform resource locator deduplication method and device, applied in the network field. The technology addresses problems such as increased operating costs, reduced efficiency of URL security vulnerability detection, and degraded website server performance, and it improves detection efficiency.
Examples
Embodiment 1
[0073] See Figure 3, which is a flowchart of a uniform resource locator deduplication method according to an embodiment of the present invention. The method includes the following steps:
[0074] S301. Preset a deduplication rule base according to the structure of uniform resource locators. The rule base stores multiple deduplication rules, each rule corresponds to a different uniform resource locator structure, and each rule carries a rewrite flag indicating the rewritten segment parameter in the corresponding uniform resource locator.
[0075] S302. Obtain the URL data to be deduplicated from the website access data.
[0076] S303. Match the URLs to be deduplicated against the deduplication rules in the deduplication rule base according to the structure and segment parameters of the uniform resource locators.
[0077] S304. Filter the matched uniform resource locators corresponding to the same deduplication rule, and re...
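Steps S301 to S304 can be illustrated with a minimal Python sketch. It is not the patented implementation: the `DedupRule` fields, the choice to represent the rewrite flag as path-segment indices, and the policy of keeping the first URL per masked structure are all assumptions made for illustration, since the source text of S304 is truncated.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class DedupRule:
    # All fields are hypothetical; the source only says a rule matches a URL
    # structure and carries a rewrite flag for the rewritten segment parameter.
    host: str             # host the rule applies to
    segment_count: int    # number of "/"-separated path segments
    rewrite_flags: tuple  # indices of path segments treated as rewritten parameters


def path_segments(url):
    """Split the URL path into its non-empty "/"-separated segments."""
    return [s for s in urlsplit(url).path.split("/") if s]


def match_rule(url, rules):
    """Return the first rule whose host and segment count fit the URL (S303)."""
    host = urlsplit(url).hostname
    segs = path_segments(url)
    for rule in rules:
        if host == rule.host and len(segs) == rule.segment_count:
            return rule
    return None


def dedup_key(url, rule):
    """Mask the flagged segments so URLs differing only there share a key."""
    segs = path_segments(url)
    masked = ["*" if i in rule.rewrite_flags else s for i, s in enumerate(segs)]
    return (rule.host, "/".join(masked))


def deduplicate(urls, rules):
    """Keep one URL per masked structure (assumed reading of S304)."""
    seen = set()
    kept = []
    for url in urls:
        rule = match_rule(url, rules)
        # URLs that match no rule are only deduplicated against exact copies.
        key = dedup_key(url, rule) if rule else (None, url)
        if key not in seen:
            seen.add(key)
            kept.append(url)
    return kept
```

With a rule such as `DedupRule("www.example.com", 2, (1,))`, two URLs that differ only in their second path segment collapse to a single entry, while URLs matching no rule pass through unchanged.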
Embodiment 2
[0099] See Figure 5, which is a flowchart of the first method for generating deduplication rules in an embodiment of the present invention. The method includes the following steps:
[0100] S501. Obtain uniform resource locator data under the domain name for which the deduplication rule is to be generated. That is, URLs sharing the same host part are read from the data center of the website server, and the URLs read must not have been rewritten.
[0101] S502. Perform clustering on the acquired uniform resource locators. Specifically, the URLs read in step S501 are clustered according to length and lexicographic character order. Clustering the URLs speeds up subsequent processing and improves the efficiency of deduplication rule generation.
[0102] The length refers to the segment length of the URL, or the number of segments separated by the "/" symbol in the URL, for example, "http://www.qq.com/news/getNews?type=sports&date=20131120&id=1" and "http://www.qq.com/news/getNews?type=scien...
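The clustering described in step S502 can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, and counting only the path segments (excluding the scheme, host, and query string) is an assumption, because the source's definition of length is truncated.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def segment_count(url):
    """Count non-empty "/"-separated path segments (assumed meaning of length)."""
    return len([s for s in urlsplit(url).path.split("/") if s])


def cluster_urls(urls):
    """Group URLs by segment count, then sort each group lexicographically."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[segment_count(url)].append(url)
    # Order the clusters by length so similar URLs end up adjacent.
    return {n: sorted(group) for n, group in sorted(clusters.items())}
```

Grouping by length first and sorting second places structurally similar URLs next to each other, which is what makes later comparison between neighbours cheap.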
Embodiment 3
[0124] See Figure 7, which is a flowchart of a second method for generating deduplication rules in an embodiment of the present invention. The method includes the following steps:
[0125] S701. Obtain an existing deduplication rule in a preset deduplication rule base, where the structure of the deduplication rule includes a domain name parameter part, a suffix part, a segment number part and a rewriting rule part.
[0126] S702. Obtain multiple uniform resource locator data under the domain name for which the deduplication rule is to be generated. The number of URLs obtained should not be too small, generally not fewer than 5000.
[0127] S703. Using the suffix part and the rewriting rule part of the existing deduplication rule, match multiple uniform resource locators under the domain name to generate the deduplication rule.
[0128] S704. When the number of matched uniform resource locators is greater than the set threshold, replace the domain name parameter part in the...
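The flow of steps S701 to S704 can be sketched in Python under stated assumptions. The `Rule` fields mirror the four parts named in S701, but representing the rewriting rule as a regular expression over the URL path is an assumption, and so is what happens after the threshold check: the source text of S704 is truncated, and this sketch assumes the domain name part of the existing rule is swapped for the new domain.

```python
import re
from dataclasses import dataclass, replace
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Rule:
    domain: str           # domain name parameter part
    suffix: str           # suffix part, e.g. ".html"
    segment_count: int    # segment number part
    rewrite_pattern: str  # rewriting rule part (assumed here to be a path regex)


def matches(url, rule):
    """Check a URL against the suffix and rewriting rule of an existing rule (S703)."""
    path = urlsplit(url).path
    return path.endswith(rule.suffix) and re.fullmatch(rule.rewrite_pattern, path) is not None


def derive_rule(existing, new_domain, urls, threshold):
    """Return a rule for new_domain if enough of its URLs match, else None (S704)."""
    hits = sum(
        1 for u in urls
        if urlsplit(u).hostname == new_domain and matches(u, existing)
    )
    if hits > threshold:
        # Assumed completion of the truncated step: reuse the existing rule
        # with only its domain part replaced.
        return replace(existing, domain=new_domain)
    return None
```

Reusing an existing rule this way avoids regenerating a rule from scratch when a new domain follows the same rewriting convention as a known one.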