URL (uniform resource locator) purifying method and device

A URL and template technology, applied in network data retrieval, network data indexing, and special data processing applications, addresses the wasted storage space of the URL scheduling module and the low practical efficiency of crawlers, thereby saving resources and improving the crawlers' ability to crawl effective web pages.

Active Publication Date: 2014-05-14
BEIJING QIHOO TECH CO LTD
4 Cites, 13 Cited by

AI Technical Summary

Problems solved by technology

[0008] As website promotion methods grow more varied, most websites apply extra processing to their URLs in order to track the traffic sources of the current URL: some append extra information after the URL body, and some change the form of the URL itself. These additional forms improve the efficiency of the website, but they are a nightmare for search engine crawlers. Prior-art crawlers do not actively distinguish this additional information when crawling, so each altered URL is crawled separately even though the crawled content points to the same web page.
For crawlers, this wastes the storage space, bandwidth, and computing resources of the URL scheduling module, which makes crawlers inefficient in practice.

Examples


Embodiment 1

[0066] Embodiment 1: the command word goodsid. This command word suits sites whose overall URL form is highly variable: the rules must be summarized, the essential parts identified, and the final form of the URL spliced together from them.

[0067] For example, the link forms on some B2C websites are not standardized, and several forms of link coexist on the same site, as follows:

[0068] http://www.eggcoo.com/page_product_527393_0.html

[0069] http://www.eggcoo.com/product.shtml?method=detailView&id=527393&cv=0

[0070] Both links come from the Golden Egg marketplace (eggcoo.com). They take two different forms, but in fact they point to the same product.
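
The goodsid idea for the two eggcoo.com links quoted above can be sketched as follows. This is only an illustration under stated assumptions: the two regular expressions, the function name `purify_goodsid`, and the choice of the `page_product_<id>_0.html` form as the canonical output are all hypothetical, since the source does not show the actual rule for this site.

```python
import re

# Hypothetical patterns for the two eggcoo.com link forms; each captures
# the product id in group 1. They are not taken from the patent text.
GOODSID_PATTERNS = [
    re.compile(r"^/page_product_([0-9]+)_[0-9]+\.html$"),
    re.compile(r"^/product\.shtml\?(?:.*&)?id=([0-9]+)(?:&.*)?$"),
]

SITE = "http://www.eggcoo.com"
# Assumed canonical form; the source does not say which form is spliced out.
CANONICAL = SITE + "/page_product_{goodsid}_0.html"


def purify_goodsid(url: str) -> str:
    """Splice a single canonical URL from whichever link form matches."""
    if not url.startswith(SITE + "/"):
        return url  # not a URL of this site: leave unchanged
    path = url[len(SITE):]
    for pattern in GOODSID_PATTERNS:
        match = pattern.match(path)
        if match:
            return CANONICAL.format(goodsid=match.group(1))
    return url  # no pattern matched: leave unchanged
```

With these assumptions, both link forms reduce to the same string, so a crawler that keys its visited set on the purified URL would fetch the product page once.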

[0071] As another example, some large, long-established B2C sites are redesigned from time to time, and the same situation arises:

[0072] http://www.amazon.cn/gp/product/B0019DBU60?ver=gp&uid=476-6816060-6082564&pageletid=taiwan (from the list page)

[0073] http://www.amazon.cn/mn/detailApp/ref=sr_1_1?_encodin...

Embodiment 2

[0089] Embodiment 2: the command word truncate. This command word applies when the URL is followed by additional information. Many websites now append extra parameters to the URL to mark the traffic source or to gather statistics. This form is common and easy to handle, for example:

[0090] http://www.vancl.com/Product_0006984/BaiHeHuaLianYiQun%20HongSeYinHua.html?Source=eqf&SourceSunInfo=96845|yqftid_12783880711284196186

[0091] To purify this type of URL, use the command word truncate (truncate and merge): wrap every part that must be preserved in a capture group (a pair of parentheses), and return only the grouped results, as in the following rule:

[0092] {"www.vancl.com", "^(/Product_[0-9]+/[\w]+\.html).*.*$", "truncate", null}

[0093] Applying this rule to the link above returns:

[0094] http://www.vancl.com/Product_0006984/BaiHeHuaLianYiQun%20HongSeYinHua.html
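
A minimal sketch of how the truncate rule in [0092] might be applied is shown below. The function name `truncate` and the tuple layout are illustrative assumptions. One deliberate deviation is labeled in the code: `%` is added to the character class so that the percent-encoded space (`%20`) in the example path matches, because `[\w]+` alone does not match `%`.

```python
import re

# Rule modeled on paragraph [0092]: (domain, path regex, command word, extra).
# Assumption: "%" is added to the character class so that "%20" in the
# example path matches; the rule as printed uses [\w]+ only.
RULE = ("www.vancl.com", r"^(/Product_[0-9]+/[\w%]+\.html).*$", "truncate", None)


def truncate(url: str, rule=RULE) -> str:
    """Keep only the captured groups of the path; otherwise return url as-is."""
    domain, pattern, command, _extra = rule
    scheme, sep, rest = url.partition("://")
    host, slash, tail = rest.partition("/")
    if not sep or host != domain or command != "truncate":
        return url
    match = re.match(pattern, slash + tail)
    if match is None:
        return url
    # Merge the preserved groups and drop everything outside them.
    return f"{scheme}://{host}{''.join(match.groups())}"
```

Because only the grouped part of the path is returned, the tracking parameters after `?` are dropped regardless of their names or order.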

Embodiment 3

[0095] Embodiment 3: the command word group. This command word applies to websites whose URLs are not case-sensitive. Some websites ignore the case of the URL, but to a crawler the uppercase and lowercase URLs are different links. In this case, the group command can be used to convert a given capture group uniformly to lowercase or to uppercase.

[0096] For example, the URL of Dangdang.com:

[0097] http://product.dangdang.com/product.aspx?product_id=22799821

[0098] http://product.dangdang.com/Product.aspx?product_id=22799821
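
The case-folding behaviour described for the group command can be sketched as follows. The regular expression, the choice of group 1 as the part to fold, and the function name `purify_case` are hypothetical; the source does not show the actual rule for this site.

```python
import re

SITE = "http://product.dangdang.com"
# Hypothetical rule: group 1 is the page name whose case is folded,
# group 2 is the query string, which is kept exactly as given.
PATTERN = re.compile(r"^(/product\.aspx)(\?product_id=[0-9]+)$", re.IGNORECASE)


def purify_case(url: str) -> str:
    """Lowercase the first capture group so case variants collapse to one URL."""
    if not url.startswith(SITE + "/"):
        return url
    match = PATTERN.match(url[len(SITE):])
    if match is None:
        return url
    return SITE + match.group(1).lower() + match.group(2)
```

Under these assumptions, the two Dangdang links above purify to the same lowercase form.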


Abstract

The invention provides a URL (uniform resource locator) purifying method comprising the steps of: matching an original URL against the domain names in a set of purifiable domain names; locating the corresponding URL template set according to the successfully matched domain name; matching the original URL against the regular expressions of the URL templates in that set; judging whether the template whose regular expression matched contains a command word; if so, processing the URL according to the command word and proceeding to the step of outputting the purified new URL; otherwise, returning the original URL; and finally outputting the purified new URL. The invention further provides a URL purifying device. Once URLs of various forms have been purified, it can be judged whether they have already been crawled; if so, they need not be crawled again. The crawlers' capacity for crawling effective web pages is therefore markedly improved and various resources are saved.
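
The steps in the abstract can be sketched end to end as below. This is an illustration under stated assumptions, not the patented implementation: the table layout, the handler functions, and the two sample rules are hypothetical, and the sketch assumes that a URL matching no template is also returned unchanged, which the abstract does not state.

```python
import re


def _truncate(match):
    # Keep only the captured groups, merged in order.
    return "".join(match.groups())


def _lower_first_group(match):
    # Assumed behaviour of "group": lowercase group 1, keep the rest as-is.
    return match.group(1).lower() + "".join(match.groups()[1:])


# Hypothetical command-word table and per-domain template sets.
COMMANDS = {"truncate": _truncate, "group": _lower_first_group}
TEMPLATES = {
    "www.vancl.com": [
        (re.compile(r"^(/Product_[0-9]+/[\w%]+\.html).*$"), "truncate"),
    ],
    "product.dangdang.com": [
        (re.compile(r"^(/product\.aspx)(\?product_id=[0-9]+)$", re.I), "group"),
    ],
}


def purify(url: str) -> str:
    scheme, sep, rest = url.partition("://")
    host, slash, tail = rest.partition("/")
    # Match the domain and locate its template set.
    templates = TEMPLATES.get(host) if sep else None
    if templates is None:
        return url
    path = slash + tail
    # Match the URL against each template's regular expression.
    for pattern, command in templates:
        match = pattern.match(path)
        if match is None:
            continue
        # A template without a known command word returns the original URL.
        handler = COMMANDS.get(command)
        if handler is None:
            return url
        return f"{scheme}://{host}{handler(match)}"
    return url  # assumption: no template matched, so keep the original
```

A crawler could then keep a set of purified URLs and skip any link whose purified form is already in that set.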

Description

Technical field

[0001] The invention relates to a URL purification method and device, in particular to a method for purifying URLs on websites that use many URL forms.

Background technique

[0002] A URL (Uniform Resource Locator) is the address of a network resource, also known as a web address. In the present invention, the Chinese term "web address" and the English abbreviation "URL" denote the same concept. A URL consists of the following parts, from left to right:

[0003] Internet resource type (scheme): indicates the tool the WWW client program uses to operate. For example, "http://" denotes a WWW server, "ftp://" an FTP server, "gopher://" a Gopher server, and "news:" a Usenet newsgroup.

[0004] Server address (host): the domain name of the server where the WWW page is located.

[0005] Port (port): sometimes (not always required), for access to certain resources, it is necessary to give the correspon...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/9566, G06F16/951
Inventor: 周雷高扬姜鑫牛杏媛蒋英雪
Owner: BEIJING QIHOO TECH CO LTD