A method and device for intercepting web crawlers

A web crawler interception technique, applied in the network field, that addresses two problems with existing approaches: normal users are mistaken for web crawlers, and the interception rate for web crawlers is low. It aims to support high concurrency, reduce pressure, and increase the interception rate.

Active Publication Date: 2020-09-01
BEIJING JINGDONG SHANGKE INFORMATION TECH CO LTD
7 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

[0003] In the prior art, to protect access for normal users, some websites intercept web crawler traffic by filtering on the client IP or on specific values of the User-Agent header of the HTTP request. In some cases, when many normal users share the same IP, these users are mistaken for web crawlers and filtered out.
On the other hand, according to the HTTP protocol specification, the value of the User-Agent header can be set arbitrarily, so many web crawlers set their User-Agent header to match that of an ordinary browser to avoid filtering. As a result, the interception of web crawlers is not efficient.
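The two failure modes described in [0003] can be illustrated with a minimal sketch. The blocklist markers, the per-IP threshold, and the sample User-Agent strings below are hypothetical values chosen for illustration; they are not taken from the patent.

```python
from collections import Counter

# Hypothetical blocklist and threshold, chosen only for illustration.
BLOCKED_AGENT_MARKERS = ("bot", "spider", "crawler")
MAX_REQUESTS_PER_IP = 3


def blocked_by_user_agent(user_agent: str) -> bool:
    """Block a request whose User-Agent contains a known crawler marker."""
    lowered = user_agent.lower()
    return any(marker in lowered for marker in BLOCKED_AGENT_MARKERS)


def blocked_ips(request_ips: list) -> set:
    """Block every IP that sent more than MAX_REQUESTS_PER_IP requests."""
    counts = Counter(request_ips)
    return {ip for ip, n in counts.items() if n > MAX_REQUESTS_PER_IP}


# A crawler that copies a browser's User-Agent string passes the header filter.
honest = "ExampleSpider/1.0"
spoofed = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
print(blocked_by_user_agent(honest))   # True
print(blocked_by_user_agent(spoofed))  # False

# Five normal users behind one shared address look like one busy client.
shared = ["203.0.113.7"] * 5 + ["198.51.100.2"]
print(blocked_ips(shared))             # {'203.0.113.7'}
```

The header filter only catches crawlers that identify themselves, while the IP filter cannot tell one aggressive client from several ordinary users behind a shared address.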

Examples

Embodiment 1

[0036] Embodiment 1, in one embodiment,

[0037] 1) The browser sends an HTTP request to the server, requesting the first page of the current category;

[0038] The server generates an image URL path containing the cookie value and embeds it in the first page;

[0039] The server presets pages 1-10 as the range of pages that may be accessed directly. Because the first page falls within this range, the server returns the first page, including the image URL path, to the browser;

[0040] The browser automatically downloads the picture using the image URL path contained in the returned first page of the current category, parses the picture with a JS method, extracts the cookie value, and saves it. The browser carries this cookie value on later page turns.
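Step 1) above can be sketched as follows. The source does not say how the cookie value is encoded in the picture or its URL path, so this sketch assumes the simplest possible layout: the value is the file name in a hypothetical `/img/t_<value>.png` path. The names `render_page` and `extract_cookie_value` are also hypothetical, and the Python extraction function only stands in for the JS parsing step that the patent assigns to the browser.

```python
import re
import secrets
from typing import Optional, Tuple

# Hypothetical path layout; the source does not specify the encoding.
IMAGE_PATH_TEMPLATE = "/img/t_{token}.png"
IMAGE_PATH_PATTERN = re.compile(r'<img src="/img/t_([0-9a-f]+)\.png"')


def render_page(page_number: int) -> Tuple[str, str]:
    """Server side: issue a fresh value and embed its image URL path in the page."""
    token = secrets.token_hex(16)
    image_path = IMAGE_PATH_TEMPLATE.format(token=token)
    html = (
        f"<html><body><h1>Category page {page_number}</h1>"
        f'<img src="{image_path}" width="1" height="1"></body></html>'
    )
    return html, token


def extract_cookie_value(html: str) -> Optional[str]:
    """Client side: stand-in for the JS step that recovers the value."""
    match = IMAGE_PATH_PATTERN.search(html)
    return match.group(1) if match else None


html, issued = render_page(1)
cookie_value = extract_cookie_value(html)
print(cookie_value == issued)  # True
```

A client that renders the page and runs its scripts obtains the value as a side effect of normal browsing; a client that only fetches the HTML and follows page links does not, unless it replicates the extraction step.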

[0041] 2) The browser sends an HTTP request carrying a cookie value to the server, requesting page 10 of the current category;

[0042] The serv...

Embodiment 2

[0050] Embodiment 2, in another embodiment,

[0051] If the browser receives a link to page 10 of the category, then,

[0052] The browser sends an HTTP request to the server, requesting page 10 of the current category;

[0053] The server generates an image URL path containing the cookie value and embeds it in page 10;

[0054] The server presets pages 1-10 as the range of pages that may be accessed directly, and determines that page 10 falls within this range. Therefore, although the HTTP request does not contain a cookie value at this point, the server directly returns page 10, including the image URL path, to the browser.

[0055] The browser automatically downloads the picture using the image URL path contained in the returned page 10 of the current category, parses the picture with a JS method, extracts the cookie value, and saves it; carries the cookie value when turning the pa...

Embodiment 3

[0056] Embodiment 3, in another embodiment,

[0057] If the browser receives a link to page 11 of the category, then,

[0058] The browser sends an HTTP request to the server, requesting page 11 of the current category;

[0059] The server generates an image URL path containing the cookie value and embeds it in page 11;

[0060] The server determines that page 11 does not fall within the direct access range, so it further checks whether the HTTP request contains a cookie value. Because the browser opened the link directly, the HTTP request does not contain a cookie value. Therefore, the server returns the first page of the current category to the browser.

[0061] Next, if you want to continue to visit other pages, you can repeat the operation in Embodiment 1 to achieve normal page visits.
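The server-side decision that the three embodiments walk through can be summarized in a short sketch. The 1-10 range comes from the embodiments; the function name `page_to_return` is hypothetical, and the sketch only checks whether a cookie value is present, not whether it is valid.

```python
from typing import Optional

# The embodiments use pages 1-10 as the direct-access range.
DIRECT_ACCESS_PAGES = range(1, 11)
FIRST_PAGE = 1


def page_to_return(requested_page: int, cookie_value: Optional[str]) -> int:
    """Return the page number the server sends back for one request."""
    if requested_page in DIRECT_ACCESS_PAGES:
        return requested_page   # Embodiments 1 and 2: no cookie value needed
    if cookie_value:
        return requested_page   # deep page, value carried from an earlier page
    return FIRST_PAGE           # Embodiment 3: deep link without a value


print(page_to_return(10, None))      # 10
print(page_to_return(11, None))      # 1
print(page_to_return(11, "abc123"))  # 11
```

A user who lands on page 1 after following a deep link picks up a cookie value there, so the next request for page 11 passes the second check.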

Abstract

Proposed are a crawler interception method and device, a server, and a medium. The method comprises the following steps. After receiving an access request for a page from a client, the server generates a current field value used to recognize crawlers, generates a picture attribute value that stores the field value in a picture, and saves a picture uniform resource locator (URL) path containing the picture attribute value in the requested page. The server then determines whether the requested page is one for which direct access is allowed. If so, it returns the requested page to the client. If not, it further determines whether the access request contains a valid field value for recognizing crawlers. If the request contains a valid field value, the server returns the requested page to the client. If the request contains no such field value, or the field value it contains is invalid, the server determines that the client is a crawler and returns the first page of the category of the requested page. By means of the present invention, crawler access can be effectively intercepted.
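The abstract requires the field value to be valid but the text available here does not say how validity is decided. One possible scheme, shown purely as an assumption, is a timestamp signed with a server-side key and accepted only within a fixed time window. `SECRET_KEY`, `MAX_AGE_SECONDS`, and both function names are hypothetical.

```python
import hashlib
import hmac
import time
from typing import Optional

SECRET_KEY = b"demo-secret"  # hypothetical; a real key would be kept private
MAX_AGE_SECONDS = 600        # hypothetical validity window


def _sign(issued_at: str) -> str:
    """Compute the HMAC-SHA256 signature of a timestamp string."""
    return hmac.new(SECRET_KEY, issued_at.encode(), hashlib.sha256).hexdigest()


def issue_field_value(now: Optional[float] = None) -> str:
    """Create a value of the form '<timestamp>.<signature>'."""
    issued_at = str(int(time.time() if now is None else now))
    return f"{issued_at}.{_sign(issued_at)}"


def is_valid_field_value(value: Optional[str], now: Optional[float] = None) -> bool:
    """Accept a value only if its signature matches and it has not expired."""
    if not value or "." not in value:
        return False
    issued_at, signature = value.split(".", 1)
    # Compare as bytes in constant time; a forged signature fails here.
    if not hmac.compare_digest(signature.encode(), _sign(issued_at).encode()):
        return False
    current = time.time() if now is None else now
    return 0 <= current - int(issued_at) <= MAX_AGE_SECONDS


value = issue_field_value(now=1000)
print(is_valid_field_value(value, now=1100))            # True
print(is_valid_field_value(value, now=1601))            # False
print(is_valid_field_value("1000.deadbeef", now=1000))  # False
```

With a signed value the server does not need to store every issued value, which fits the stated goal of high concurrency, but this is a design choice of the sketch and not something the source confirms.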

Description

Technical field

[0001] The invention relates to network technology, in particular to a method and device for intercepting web crawlers.

Background

[0002] Web crawlers are a fundamental component of search engine technology. A web crawler starts from the URL (Uniform Resource Locator) of one or several initial web pages and obtains the URLs found on those pages. It keeps extracting new URLs from the pages it visits and adding them to a queue until some stopping condition is met, then stores the captured web page information on the search engine's server.

[0003] In the prior art, to protect access for normal users, some websites intercept web crawler traffic by filtering on the client IP or on specific values of the User-Agent header of the HTTP request. Under normal circumstances, when many normal users share the same IP, these normal users will be mistaken for web crawlers and filtered out. ...
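The crawl loop described in [0002] can be sketched as a breadth-first traversal over a queue of URLs. To keep the example runnable without network access, it assumes a small hypothetical in-memory link graph (`LINKS`) in place of real page fetches, and uses a page budget as the stopping condition.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the URLs found on that page.
LINKS = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": ["https://example.com/"],
    "https://example.com/c": [],
}


def crawl(seed_urls: list, max_pages: int) -> list:
    """Visit pages breadth-first from the seeds until max_pages are fetched."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    visited = []
    while queue and len(visited) < max_pages:  # stopping condition
        url = queue.popleft()
        visited.append(url)                    # stand-in for fetch and store
        for link in LINKS.get(url, []):
            if link not in seen:               # enqueue each URL only once
                seen.add(link)
                queue.append(link)
    return visited


print(crawl(["https://example.com/"], max_pages=3))
# ['https://example.com/', 'https://example.com/a', 'https://example.com/b']
```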

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/951, G06F16/955
CPC: G06F16/951, G06F16/9566, G06F16/00
Inventor: 王向维, 韩笑跃, 王飞, 谢刚, 费艳茹, 韩勇, 马顺风
Owner: BEIJING JINGDONG SHANGKE INFORMATION TECH CO LTD