Method for automatically identifying web crawler

An automatic crawler-identification technique, applied in the field of web crawlers, that prevents crawlers from collecting site information

Inactive Publication Date: 2017-02-15
成都知道创宇信息技术有限公司

AI Technical Summary

Problems solved by technology

Existing identification methods have drawbacks: a hidden link used as a honeypot can be recognized by the crawler, and evaluating header information consumes additional resources.

Embodiment Construction

[0017] The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments. JavaScript embedded in the web page redirects to the same page one or more times while a status code is returned; because a crawler deduplicates URLs, it cannot crawl the page normally. The cookie or badcookie set by the JavaScript code executed in onload is then used to identify whether a request comes from a crawler.
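The deduplication effect described in [0017] can be illustrated with a small simulation. This is a minimal sketch, not the patented implementation: the URL `http://example.test/`, the cookie name `js_ok`, and the `serve`, `DedupCrawler`, and `browser_visit` names are all hypothetical, and the toy server stands in for a real web server.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple

HOME = "http://example.test/"  # hypothetical URL, for illustration only


def serve(url: str, cookies: Dict[str, str]) -> Tuple[str, Optional[str]]:
    """Toy server: return (body, url_the_script_would_navigate_to).

    Without the cookie it answers with a JS-only page whose script would
    set the cookie and then navigate back to the same URL.
    """
    if cookies.get("js_ok") == "1":  # hypothetical cookie name
        return "REAL HOME PAGE", None
    return "<script>/* set cookie, then window.location = same URL */</script>", url


@dataclass
class DedupCrawler:
    """Static crawler: no JS engine, and it never fetches the same URL twice."""
    seen: Set[str] = field(default_factory=set)
    pages: List[str] = field(default_factory=list)

    def crawl(self, url: str) -> None:
        if url in self.seen:  # deduplication: already-visited URLs are skipped
            return
        self.seen.add(url)
        body, target = serve(url, cookies={})  # script never runs, so no cookie
        self.pages.append(body)
        if target is not None:
            # Even if the crawler extracts the target, it is the same URL
            # and is dropped by the dedup check above.
            self.crawl(target)


def browser_visit(url: str) -> str:
    """Browser-like client: executes the script, so the cookie exists on reload."""
    cookies: Dict[str, str] = {}
    body, target = serve(url, cookies)
    while target is not None:
        cookies["js_ok"] = "1"  # effect of running the onload script
        body, target = serve(target, cookies)
    return body
```

In this sketch, `DedupCrawler().crawl(HOME)` collects only the JS-only page, while `browser_visit(HOME)` reaches the real home page after one reload.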

[0018] The server's home page first returns a page containing only JS (JavaScript) code. The code sits in the onload function and runs after the page has fully loaded. It uses some algorithm, taking the IP address, header, and other information as parameters, to set a cookie field, and then uses window.location to jump back to the home page (the same page). If the server finds that the cookie is legal, it returns another piece of JS, which uses another algorit...
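The cookie-setting page in [0018] might be generated roughly as follows. This is a sketch under stated assumptions: the source does not name the algorithm, so HMAC-SHA256 over the IP address and User-Agent is used purely as a stand-in; the key `SECRET`, the cookie names `ck1`/`ck2`, and all function names are hypothetical. As a simplification, the server computes the value and embeds it in the script, whereas the source has the JS code itself run the algorithm.

```python
import hashlib
import hmac
import json
from typing import Dict

SECRET = b"demo-secret"  # hypothetical server-side key


def expected_token(ip: str, user_agent: str, stage: int) -> str:
    """Stand-in for the unspecified algorithm: HMAC-SHA256 over stage, IP, header."""
    msg = f"{stage}|{ip}|{user_agent}".encode("utf-8")
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()


def challenge_page(ip: str, user_agent: str, stage: int) -> str:
    """Build a JS-only page: set the stage cookie in onload, then reload the same URL."""
    name = f"ck{stage}"  # hypothetical cookie name
    value = expected_token(ip, user_agent, stage)
    cookie = json.dumps(f"{name}={value}; path=/")  # JSON string doubles as a JS literal
    return (
        "<html><head><script>"
        "window.onload = function () {"
        f"document.cookie = {cookie};"
        "window.location = window.location.href;"
        "};"
        "</script></head><body></body></html>"
    )


def cookie_is_legal(cookies: Dict[str, str], ip: str, user_agent: str, stage: int) -> bool:
    """Compare the presented cookie with the expected value in constant time."""
    presented = cookies.get(f"ck{stage}", "")
    expected = expected_token(ip, user_agent, stage)
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))
```

Using a different `stage` value for the second script gives each round its own cookie, which loosely mirrors the "another algorithm" mentioned in the text; a real deployment would also need to decide how long a token stays valid.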

Abstract

The invention discloses a method for automatically identifying a web crawler. The method comprises the following steps. Step 1: the server's home page returns a page containing only JS code; the code is located in an onload function and is executed after the page has loaded completely. Step 2: the JS code from step 1 uses a certain algorithm to set a cookie field and then uses window.location to jump to the home page; when the server detects that the cookie is legal, it returns another piece of JS code, which uses another algorithm to set a further cookie field. Step 3: when all cookie fields are legal, the normal home page URL is returned. Step 4: when a client performs no redirection, or its cookie value is incorrect, a badcookie is set and the client is marked as a crawler. The method can block the access of most static crawlers: a crawler that cannot execute the JS code of the home page can only fetch the JS-only page returned by the server, so it cannot obtain the real home page.
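The four steps of the abstract can be read as a small decision function on the server. The following is one possible interpretation, not the claimed implementation: the cookie names in `STAGES`, the action strings, and the `is_legal` callback are hypothetical, and detecting a client that never performs the redirect is left out because the source does not say how the server observes it.

```python
from typing import Callable, Dict, Tuple

STAGES = ("ck1", "ck2")  # hypothetical cookie names, one per JS challenge


def handle_home(
    cookies: Dict[str, str],
    is_legal: Callable[[str, str], bool],
) -> Tuple[str, Dict[str, str]]:
    """Return (action, cookies_to_set) for one request to the home page."""
    if "badcookie" in cookies:
        # Client was already marked in an earlier request.
        return "treat_as_crawler", {}
    for name in STAGES:
        if name not in cookies:
            # Steps 1 and 2: send the JS-only page for the first missing stage.
            return f"send_js_challenge:{name}", {}
        if not is_legal(name, cookies[name]):
            # Step 4: incorrect cookie value, so set badcookie and mark the client.
            return "treat_as_crawler", {"badcookie": "1"}
    # Step 3: every cookie field is legal, so serve the normal home page.
    return "send_real_home_page", {}
```

A client that never runs the script keeps receiving `send_js_challenge:ck1` in this sketch, which matches the abstract's point that such a crawler only ever sees the JS-only page.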

Description

Technical field

[0001] The invention relates to the field of web crawlers, in particular to a method for automatically identifying web crawlers.

Background

[0002] Currently, websites identify web crawlers in various ways. The most effective and widely used method is to present an interactive component, such as a verification code, to determine whether a client is a real user or a web crawler. However, this method degrades the user's online experience to some extent.

[0003] When crawling a website's pages, a crawler will crawl the home page. At the same time, crawlers usually do not crawl pages with the same URL repeatedly, and this behavior can be used to identify whether a request comes from a crawler program. In the prior art, a hidden link is placed in a page as a honeypot to identify crawlers, or the crawler's feature information (HTTP header, etc.) is used as the basis for identification. However, the hidden link can be identified...

Application Information

IPC(8): H04L29/06, G06F17/30
CPC: G06F16/951, G06F16/958, H04L63/1466
Inventor: 周雨晨
Owner: 成都知道创宇信息技术有限公司