Website purification method and device

A URL purification method and device, applied in the fields of network data retrieval, network data indexing, and special data processing applications. It addresses the wasted storage space of the URL scheduling module and the low practical efficiency of crawlers, thereby saving resources and improving the crawler's ability to fetch valid web pages.

Active Publication Date: 2016-08-31
BEIJING QIHOO TECH CO LTD

AI Technical Summary

Problems solved by technology

[0008] With the growing variety of website promotion methods, most websites apply extra processing to their URLs in order to track traffic sources: some append extra information after the URL body, and others change the form of the URL altogether. These variants improve the website's efficiency, but they are a nightmare for search engine crawlers. Prior-art crawlers do not actively distinguish this additional information when crawling, so each changed URL is crawled separately even though the crawled content points to the same web page. For the crawler, this wastes the storage space, bandwidth, and computing resources of the URL scheduling module and lowers crawling efficiency.



Examples


Embodiment 1

[0066] Embodiment 1: the command word goodsid. This command word suits websites whose overall URL form varies widely. The rules must be summarized, the key parts identified, and the final form of the URL spliced together from them.

[0067] For example, some B2C websites do not standardize their link forms, so several forms of link coexist on the same site, as follows:

[0068] http://www.eggcoo.com/page_product_527393_0.html

[0069] http://www.eggcoo.com/product.shtml?method=detailView&id=527393&cv=0

[0070] Both links come from the Golden Egg mall (eggcoo.com). They take two different forms, but in fact they point to the same product.
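The idea behind goodsid can be sketched in Python as follows. This is only an illustration under stated assumptions: the source does not show the goodsid rule syntax, so the two patterns, the `CANONICAL` form, and the function name `purify_goodsid` are all hypothetical. The sketch extracts the product ID from either link form and splices it into a single final form.

```python
import re

SITE = "http://www.eggcoo.com"

# Hypothetical patterns, one per link form; each captures the product ID.
EGGCOO_PATTERNS = [
    re.compile(r"^/page_product_(\d+)_\d+\.html$"),
    re.compile(r"^/product\.shtml\?(?:.*&)?id=(\d+)(?:&.*)?$"),
]

# Assumed final form; the source does not say which form is canonical.
CANONICAL = SITE + "/page_product_{goods_id}_0.html"


def purify_goodsid(url):
    """Return the spliced canonical URL, or the original URL if no pattern matches."""
    if not url.startswith(SITE + "/"):
        return url
    path_and_query = url[len(SITE):]
    for pattern in EGGCOO_PATTERNS:
        match = pattern.match(path_and_query)
        if match:
            return CANONICAL.format(goods_id=match.group(1))
    return url
```

With these assumed patterns, both links in [0068] and [0069] reduce to the same string, so a crawler could recognize the second as already crawled.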

[0071] As another example, some large, long-established B2C sites are redesigned from time to time, and the same situation arises:

[0072] http://www.amazon.cn/gp/product/B0019DBU60?ver=gp&uid=476-6816060-6082564&pageletid=taiwan (from the list page)

[0073] http://www.amazon.cn/mn/detailApp/ref=sr_1_1?_encodin...

Embodiment 2

[0089] Embodiment 2: the command word truncate. This command word applies when a URL is followed by additional information. Many websites now append extra parameters to the URL to mark the traffic source or to gather statistics. This form is common and easy to handle, for example:

[0090] http://www.vancl.com/Product_0006984/BaiHeHuaLianYiQun%20HongSeYinHua.html?Source=eqf&SourceSunInfo=96845|yqftid_12783880711284196186

[0091] To purify this kind of URL, use the command word truncate: enclose every part that must be preserved in a group (a pair of parentheses) and return only the grouped results, as in the following rule:

[0092] {"www.vancl.com", "^(/Product_[0-9]+/[\w]+\.html).*.*$", "truncate", null}

[0093] Applying this rule to the link above returns:

[0094] http://www.vancl.com/Product_0006984/BaiHeHuaLianYiQun%20HongSeYinHua.html
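A minimal Python sketch of how a truncate rule might be applied is shown below. It keeps the four-field rule layout from [0092], but the function name `apply_truncate` and the way the rule is stored are assumptions. One deliberate change is also an assumption: the character class is widened from `[\w]` to `[\w%]`, because the sample path contains `%20`, which `\w` alone does not match in Python's `re` module.

```python
import re

# Rule layout follows [0092]: (domain, regex, command word, extra argument).
# Assumption: [\w%] instead of [\w] so the "%20" in the sample path matches.
RULE = ("www.vancl.com", r"^(/Product_[0-9]+/[\w%]+\.html).*$", "truncate", None)


def apply_truncate(url, rule):
    """Keep only the grouped parts of the URL; return the original URL on no match."""
    domain, pattern, command, _extra = rule
    prefix = "http://" + domain
    if command != "truncate" or not url.startswith(prefix + "/"):
        return url
    match = re.match(pattern, url[len(prefix):])
    if match is None:
        return url
    # Concatenate the preserved groups, skipping any optional group that did not match.
    kept = "".join(group for group in match.groups() if group is not None)
    return prefix + kept
```

Everything outside the parentheses, here the `Source` and `SourceSunInfo` parameters, is discarded because it is not part of any group.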

Embodiment 3

[0095] Embodiment 3: the group command. This command word applies to websites whose URLs are not case-sensitive. Such websites treat uppercase and lowercase URLs alike, but to a crawler the two spellings are different links. In this case, the group command can uniformly convert a given group to lowercase or to uppercase.

[0096] For example, the URL of Dangdang.com:

[0097] http://product.dangdang.com/product.aspx?product_id=22799821

[0098] http://product.dangdang.com/Product.aspx?product_id=22799821
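The case conversion can be sketched in Python as below. The source does not show the syntax of the group command, so the rule layout here (a fourth field mapping a group number to `"lower"` or `"upper"`) and the function name `apply_group` are hypothetical.

```python
import re

# Hypothetical rule: the fourth field maps a group number to a case conversion.
RULE = (
    "product.dangdang.com",
    r"^(/product\.aspx)(\?product_id=[0-9]+)$",
    "group",
    {1: "lower"},
)


def apply_group(url, rule):
    """Normalize the case of selected groups; return the original URL on no match."""
    domain, pattern, command, conversions = rule
    prefix = "http://" + domain
    if command != "group" or not url.startswith(prefix + "/"):
        return url
    # Match case-insensitively, since the site itself ignores case.
    match = re.match(pattern, url[len(prefix):], re.IGNORECASE)
    if match is None:
        return url
    parts = []
    for index, text in enumerate(match.groups(), start=1):
        mode = conversions.get(index)
        if mode == "lower":
            text = text.lower()
        elif mode == "upper":
            text = text.upper()
        parts.append(text)
    return prefix + "".join(parts)
```

Under this assumed rule, both Dangdang links above normalize to the lowercase `product.aspx` spelling, so they compare equal when the crawler checks for duplicates.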


Abstract

Disclosed is a URL purification method including the steps of: matching an original URL against the domain names in a set of domain names that can be purified; locating, for a successfully matched domain name, the corresponding URL template set; matching the original URL against the regular expression of each URL template in that set; determining whether the template whose regular expression matched includes a command word; if yes, processing the URL according to the command word, and if not, returning the original URL; and outputting the purified new URL. The disclosure further discloses a URL purification device. Once a URL that takes many forms has been purified, it can be determined whether the URL has already been crawled, and a URL that has been crawled is not crawled again. This significantly improves the crawler's capability to crawl valid web pages and saves various resources.
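The steps in the abstract can be sketched as a small dispatch loop in Python. This is an illustration under assumptions, not the patented implementation: the `TEMPLATES` dictionary, the `COMMANDS` table, the function names, and the `shop.example` domain are hypothetical, and only the truncate command is implemented (with `[\w%]` assumed in the pattern so that percent-encoded paths match).

```python
import re
from urllib.parse import urlsplit

# Hypothetical store: purifiable domain -> list of (regex, command word or None).
TEMPLATES = {
    "www.vancl.com": [(r"^(/Product_[0-9]+/[\w%]+\.html).*$", "truncate")],
    "shop.example": [(r"^/about\.html$", None)],  # template without a command word
}


def truncate(match):
    """Keep only the grouped parts of the matched path and query."""
    return "".join(group for group in match.groups() if group is not None)


COMMANDS = {"truncate": truncate}


def purify(url):
    parts = urlsplit(url)
    # Steps 1 and 2: match the domain and locate its URL template set.
    templates = TEMPLATES.get(parts.netloc)
    if templates is None:
        return url
    target = parts.path + ("?" + parts.query if parts.query else "")
    # Step 3: match the URL against each template's regular expression.
    for pattern, command in templates:
        match = re.match(pattern, target)
        if match is None:
            continue
        # Step 4: no command word means the original URL is returned.
        handler = COMMANDS.get(command)
        if handler is None:
            return url
        # Step 5: output the purified URL.
        return f"{parts.scheme}://{parts.netloc}{handler(match)}"
    return url
```

A crawler could then keep a set of purified URLs and skip any URL whose purified form is already in the set.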

Description

Technical field

[0001] The invention relates to a URL purification method and device, in particular to a method for purifying URLs on websites whose URLs take many forms.

Background technique

[0002] A URL (Uniform Resource Locator) is the address of a network resource, also known as a web address. In the present invention, the Chinese term "web address" and the English abbreviation "URL" denote the same concept. A URL consists of the following parts, from left to right:

[0003] Internet resource type (scheme): indicates the tool the WWW client program uses to operate. For example, "http://" denotes a WWW server, "ftp://" an FTP server, "gopher://" a Gopher server, and "news:" a newsgroup.

[0004] Server address (host): the domain name of the server where the WWW page is located.

[0005] Port (port): sometimes (not always required), for access to certain resources, it is necessary to give the correspon...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/30
CPC: G06F16/9566, G06F16/951
Inventor: 周雷, 高扬, 姜鑫, 牛杏媛, 蒋英雪
Owner: BEIJING QIHOO TECH CO LTD