URL merging processing method and device

A URL merging processing method and device, applied in the field of information processing, that address the bandwidth, storage, and other resource consumption caused by duplicate URLs, and achieve the effect of reducing bandwidth and storage consumption.

Active Publication Date: 2016-11-09
BAIDU ONLINE NETWORK TECH (BEIJING) CO LTD

AI Technical Summary

Problems solved by technology

For example, search engines must repeatedly crawl documents with similar content during web crawling, which consumes a great deal of bandwidth and storage. When link-based web page ranking algorithms are used, these similar URLs affect the calculation of the ranking score of each link. In addition, during website security testing, a large number of web pages with similar structures are checked repeatedly, which also consumes considerable resources.



Examples


Embodiment 1

[0025] Figure 1a is a flowchart of a URL merging processing method provided in the first embodiment of the present invention. The method in this embodiment can be executed by a URL merging processing device, which can be implemented in hardware and/or software and can generally be integrated into a server that performs the URL merging processing function. The method of this embodiment specifically includes:

[0026] 110. Obtain a set of URLs corresponding to the target website.

[0027] Generally speaking, a website is a collection of multiple web pages, and each web page corresponds to an independent URL address. In the prior art, all URL addresses corresponding to a target website (for example, www.baidu.com) are obtained mainly by crawling the network with a web crawler. The URL set includes at least one URL address corresponding to a web page in the target website.

[0028] However, o...

Embodiment 2

[0066] Figure 2a is a flowchart of a URL merging processing method according to the second embodiment of the present invention. This embodiment is optimized on the basis of the above embodiments. In this embodiment, obtaining the URL set corresponding to the target website is specified as: obtaining the URL set corresponding to the target website according to the browsing log information of users. The method also preferably includes: sequentially obtaining one of the URL merging clusters as a verification cluster; obtaining at least two URLs from the verification cluster as verification URLs; downloading the web page content of the at least two verification web pages corresponding to the verification URLs; and, if the web page content shows that the verification web pages differ in web page structure, unmerging the URLs in the verification cluster;

[0067] In addition, according to the content of the webpage, identify...
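The verification step of this embodiment can be sketched as follows. This is only an illustration under stated assumptions, not the patent's implementation: the source does not say how web page structure is compared, so the sketch compares the sequence of HTML start tags with `difflib.SequenceMatcher`. The names `tag_sequence`, `same_structure`, and `verify_cluster`, the `fetch` callable that returns page HTML, the choice of the first URLs as verification URLs, and the 0.9 threshold are all hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects the names of start tags in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the start-tag sequence of a page (assumed structure signature)."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def same_structure(html_a, html_b, threshold=0.9):
    """Treat two pages as structurally alike if their tag sequences are similar.

    The threshold is an assumption, not a value from the source.
    """
    ratio = SequenceMatcher(None, tag_sequence(html_a), tag_sequence(html_b)).ratio()
    return ratio >= threshold


def verify_cluster(cluster, fetch, sample_size=2, threshold=0.9):
    """Return [cluster] if the sampled pages match, else one cluster per URL.

    `fetch` is a hypothetical callable mapping a URL to its HTML text.
    """
    if len(cluster) < 2:
        return [cluster]
    # Assumption: take the first `sample_size` URLs as verification URLs.
    pages = [fetch(url) for url in cluster[:sample_size]]
    for i in range(len(pages)):
        for j in range(i + 1, len(pages)):
            if not same_structure(pages[i], pages[j], threshold):
                return [[url] for url in cluster]  # unmerge the cluster
    return [cluster]
```

In practice `fetch` would wrap an HTTP download; passing it in as a parameter keeps the comparison logic testable without network access.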

Embodiment 3

[0088] Fig. 3 is a flowchart of a URL merging processing method according to the third embodiment of the present invention. This embodiment is optimized on the basis of the above embodiments. In this embodiment, obtaining the generalization identifiers among the structure identifiers according to the data characteristics of the structure values corresponding to the structure identifiers is specified as follows: according to the feature set corresponding to each URL in the URL set, generating a set of structure values corresponding to each of the structure identifiers; according to the data characteristics of each structure value in the set of structure values, calculating a generalization weight value for the structure identifier corresponding to that set of structure values; and, according to the generalization weight values corresponding to the structure identifiers, obtaining the generalization identifiers among the structure identifiers;

[0089] At the same time, acco...
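The weight calculation of this embodiment might look roughly like the sketch below. The source text available here does not state which data characteristics are used or how they are combined, so the two characteristics (share of distinct values, share of purely numeric values), their equal weighting, and the 0.6 threshold are assumptions chosen for illustration. All function names are hypothetical.

```python
from collections import defaultdict


def structure_value_sets(feature_sets):
    """Group structure values by structure identifier across all URLs.

    Each feature set is a list of (structure identifier, structure value) pairs.
    """
    values = defaultdict(list)
    for features in feature_sets:
        for ident, value in features:
            values[ident].append(value)
    return values


def generalization_weight(values):
    """Score how strongly an identifier's values vary (assumed formula)."""
    distinct_ratio = len(set(values)) / len(values)
    numeric_ratio = sum(v.isdigit() for v in values) / len(values)
    return 0.5 * distinct_ratio + 0.5 * numeric_ratio


def generalization_identifiers(feature_sets, threshold=0.6):
    """Return identifiers whose weight reaches the (assumed) threshold."""
    value_sets = structure_value_sets(feature_sets)
    return {
        ident
        for ident, values in value_sets.items()
        if generalization_weight(values) >= threshold
    }
```

With three URLs whose first path segment is always `item` and whose second segment is a different number each time, only the second segment's identifier is selected, because its values are all distinct and all numeric.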


Abstract

The embodiments of the invention disclose a URL merging processing method and device. The method comprises: obtaining a URL set corresponding to a target website; splitting the structure of each URL in the URL set according to the composition specification of URLs to generate a feature set corresponding to each URL; obtaining the generalization identifiers among the structure identifiers contained in the feature sets according to the data characteristics of the structure values corresponding to those structure identifiers; and merging the URLs in the URL set according to the generalization identifiers to generate at least one URL merging cluster. This technical scheme merges the URLs that correspond to web pages with similar structures, greatly reduces bandwidth and storage consumption, and provides a simpler and faster basis for web page sorting and merging techniques.
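The splitting and merging steps described in the abstract can be illustrated with a minimal sketch. It assumes that structure identifiers are the host, the position of each path segment, and the key of each query parameter; the source does not fix this representation. The names `split_url` and `merge_urls`, the identifier labels, and the `*` placeholder are hypothetical, and the set of generalization identifiers is simply passed in.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlsplit


def split_url(url):
    """Split a URL into (structure identifier, structure value) pairs."""
    parts = urlsplit(url)
    features = [("host", parts.netloc)]
    # Each non-empty path segment is identified by its position.
    segments = [s for s in parts.path.split("/") if s]
    for index, segment in enumerate(segments):
        features.append((f"path{index}", segment))
    # Each query parameter is identified by its key; sorting ignores order.
    for key, value in sorted(parse_qsl(parts.query, keep_blank_values=True)):
        features.append((f"query:{key}", value))
    return features


def merge_urls(urls, generalization_ids, placeholder="*"):
    """Group URLs that are identical once generalized values are masked."""
    clusters = defaultdict(list)
    for url in urls:
        pattern = tuple(
            (ident, placeholder if ident in generalization_ids else value)
            for ident, value in split_url(url)
        )
        clusters[pattern].append(url)
    return dict(clusters)
```

For example, with `path1` and `query:sid` treated as generalization identifiers, `http://example.com/item/1?sid=a` and `http://example.com/item/2?sid=b` fall into one cluster, while `http://example.com/about` stays in its own.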

Description

Technical field

[0001] Embodiments of the present invention relate to information processing technologies, and in particular to a URL merging processing method and device.

Background

[0002] With the advent of Web 2.0, Internet data is growing explosively, and one prominent sign of this is the growth in the number of URLs (Uniform Resource Locators). To further enhance the user experience or to record session information when a user clicks, websites generate many duplicate URLs. These duplicate URLs differ in only a few strings, yet they correspond to the same or similar web page content.

[0003] The large number of duplicate URLs poses great challenges for web page crawling and parsing. For example: search engines need to repeatedly crawl documents with similar content in the process of web crawling, which greatly occupies bandwidth and storage resources; ano...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/9566
Inventors: 马宇峰, 王晓元, 叶峻, 邓鸣捷
Owner: BAIDU ONLINE NETWORK TECH (BEIJING) CO LTD