Uniform resource locator (URL) de-duplication method and device

A uniform resource locator de-duplication technology, applied in the network field, that addresses problems such as increased operating costs, reduced efficiency of URL security vulnerability detection, and degraded website server performance, and thereby improves detection efficiency.

Active Publication Date: 2015-09-23
TENCENT TECH (SHENZHEN) CO LTD

AI Technical Summary

Problems solved by technology

[0014] The purpose of the embodiments of the present invention is to provide a uniform resource locator deduplication method and device to solve the problem that UR


Examples

Embodiment 1

[0073] See Figure 3, which is a flowchart of a uniform resource locator deduplication method according to an embodiment of the present invention. The method includes the following steps:

[0074] S301. Preset a deduplication rule base according to the structure of the uniform resource locators. Multiple deduplication rules are stored in the rule base, each deduplication rule corresponds to a different uniform resource locator structure, and each rule contains a rewrite flag that indicates the rewritten segment parameter in the corresponding uniform resource locator.

[0075] S302. Obtain the URL data to be deduplicated from the website access data.

[0076] S303. Match the URLs to be deduplicated against the deduplication rules in the deduplication rule base according to the structure and segment parameters of the URLs.

[0077] S304. Filter the matched uniform resource locators corresponding to the same deduplication rule, and re...
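Steps S301 to S304 can be illustrated with a minimal Python sketch. The patent text shown here does not give the concrete rule encoding, so the `DedupRule` layout, the `*` rewrite flag, and the example URLs are all assumptions made for illustration; the four fields simply mirror the rule parts named in Embodiment 3.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit

REWRITE_FLAG = "*"  # assumed marker for a rewritten segment parameter


@dataclass(frozen=True)
class DedupRule:
    """Hypothetical rule layout; the patent's concrete encoding is not shown."""
    domain: str          # domain name part
    suffix: str          # suffix part, e.g. ".html"
    segment_count: int   # number of path segments
    rewrite: tuple       # one entry per segment; REWRITE_FLAG marks a rewritten one


def split_segments(url, suffix):
    """Return the path segments of url with the suffix stripped, or None."""
    path = urlsplit(url).path
    if not path.endswith(suffix):
        return None
    return [s for s in path[: len(path) - len(suffix)].split("/") if s]


def matches(rule, url):
    """S303: compare structure (domain, suffix, segment count) and segments."""
    if urlsplit(url).hostname != rule.domain:
        return False
    segs = split_segments(url, rule.suffix)
    if segs is None or len(segs) != rule.segment_count:
        return False
    return all(p == REWRITE_FLAG or p == s for p, s in zip(rule.rewrite, segs))


def deduplicate(urls, rules):
    """S304: keep the first URL seen for each matched rule."""
    kept = {}
    unmatched = []
    for url in urls:
        rule = next((r for r in rules if matches(r, url)), None)
        if rule is None:
            unmatched.append(url)  # assumption: URLs with no rule pass through
        else:
            kept.setdefault(rule, url)
    return list(kept.values()) + unmatched
```

With a rule such as `DedupRule("www.qq.com", ".html", 4, ("news", "*", "*", "*"))`, two hypothetical rewritten news URLs that differ only in category, date, and ID collapse to a single representative, so a scanner would visit the underlying CGI only once.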

Embodiment 2

[0099] See Figure 5, which is a flowchart of the first method for generating deduplication rules in an embodiment of the present invention. The method includes the following steps:

[0100] S501. Obtain uniform resource locator data under the domain name for which the deduplication rule is to be generated. That is, URLs with the same host part are read from the data center of the website server, and the URLs read must not be rewritten.

[0101] S502. Cluster the acquired uniform resource locators. Specifically, the URLs read in step S501 are clustered according to length and lexicographic character order. Clustering the URLs speeds up subsequent processing and improves the efficiency of deduplication rule generation.

[0102] The length refers to the segment length of the URL, that is, the number of segments separated by the "/" symbol in the URL, for example, "http://www.qq.com/news/getNews?type=sports&date=20131120&id=1" and "http://www.qq.com/news/getNews?type=scien...
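The clustering in S502 might be sketched as follows. This is only an illustration under stated assumptions: the paragraph above is cut off, so counting only the non-empty path segments (and ignoring the scheme, host, and query string) is an assumption, and `segment_count` and `cluster_urls` are hypothetical helper names.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def segment_count(url):
    """Number of non-empty path segments separated by '/' (assumed definition)."""
    return len([s for s in urlsplit(url).path.split("/") if s])


def cluster_urls(urls):
    """Group URLs by segment count, then sort each group lexicographically."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[segment_count(url)].append(url)
    # Sorting places URLs that share a long common prefix next to each other,
    # which makes later comparison of neighbouring URLs cheaper.
    return {n: sorted(group) for n, group in sorted(clusters.items())}
```

Under this definition, "http://www.qq.com/news/getNews?type=sports&date=20131120&id=1" has two segments (`news` and `getNews`), so it lands in the same cluster as any other URL served by the same CGI.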

Embodiment 3

[0124] See Figure 7, which is a flowchart of a second method for generating deduplication rules in an embodiment of the present invention. The method includes the following steps:

[0125] S701. Obtain an existing deduplication rule in a preset deduplication rule base, where the structure of the deduplication rule includes a domain name parameter part, a suffix part, a segment number part and a rewriting rule part.

[0126] S702. Obtain multiple uniform resource locator data under the domain name for which the deduplication rule is to be generated. The number of URLs to be obtained should not be too small, generally not less than 5000.

[0127] S703. Using the suffix part and the rewriting rule part of the existing deduplication rule, match multiple uniform resource locators under the domain name to generate the deduplication rule.

[0128] S704. When the number of matched uniform resource locators is greater than the set threshold, replace the domain name parameter part in the...
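A rough Python sketch of steps S701 to S704 follows. The text of S704 is cut off above, so what happens once the threshold is exceeded is an assumption here: the sketch copies the existing rule and swaps in the target domain. The `DedupRule` layout, the `*` rewrite flag, and the function names are likewise hypothetical.

```python
from dataclasses import dataclass, replace
from urllib.parse import urlsplit

REWRITE_FLAG = "*"  # assumed marker for a rewritten segment parameter


@dataclass(frozen=True)
class DedupRule:
    """Hypothetical layout mirroring the four rule parts named in S701."""
    domain: str          # domain name parameter part
    suffix: str          # suffix part
    segment_count: int   # segment number part
    rewrite: tuple       # rewriting rule part, one entry per path segment


def fits(rule, url):
    """S703: test a URL against the suffix and rewriting-rule parts only."""
    path = urlsplit(url).path
    if not path.endswith(rule.suffix):
        return False
    segs = [s for s in path[: len(path) - len(rule.suffix)].split("/") if s]
    return len(segs) == rule.segment_count and all(
        p == REWRITE_FLAG or p == s for p, s in zip(rule.rewrite, segs)
    )


def derive_rule(existing, target_domain, urls, threshold):
    """S704 (assumed reading): reuse the rule for target_domain if enough URLs fit."""
    hits = sum(
        1 for u in urls
        if urlsplit(u).hostname == target_domain and fits(existing, u)
    )
    if hits > threshold:
        return replace(existing, domain=target_domain)
    return None
```

In practice the sample passed as `urls` would be large (S702 suggests at least 5000 URLs), and the threshold would be tuned so that a handful of coincidental matches does not produce a spurious rule.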



Abstract

The present invention provides a uniform resource locator (URL) de-duplication method and device. The URL de-duplication method comprises: presetting a de-duplication rule base according to the structures of URLs; acquiring the URL data to be de-duplicated from website access data; matching the URLs to be de-duplicated against the de-duplication rules in the rule base according to the structures and segment parameters of the URLs; and filtering the matched URLs that correspond to the same de-duplication rule, retaining one URL for each de-duplication rule. With this method and device, a massive amount of URL data is filtered and de-duplicated through the de-duplication rules, so a security flaw scanner no longer scans the same common gateway interface (CGI) repeatedly during URL security flaw detection, which improves the efficiency of security flaw detection.

Description

Technical field

[0001] The present invention relates to the field of network technology, in particular to a method and device for deduplication of uniform resource locators, and a corresponding method and device for generating deduplication rules.

Background

[0002] URL rewriting (URL Rewrite) is a technique for rewriting URLs (Uniform Resource Locators) on the Internet. It first obtains the URL request sent by the client to access the website and then rewrites it into another URL that the website can handle; what the user gets is the content returned for the processed URL.

[0003] For example, many news websites sort news into categories such as sports and technology, release new items every day classified by date, and assign each day's items a news index ID. When the client accesses a sports news page, after receiving the access request, the website server will form an intermediate URL address through CGI ...


Application Information

IPC(8): G06F17/30, H04L29/06
Inventor 何双宁
Owner TENCENT TECH (SHENZHEN) CO LTD