Method and device for automatically identifying web crawlers

An automatic web crawler identification technology, applied in the field of network security

Status: Inactive. Publication Date: 2010-12-01
BEIJING VENUS INFORMATION TECH +1
2 Cites, 40 Cited by

AI Technical Summary

Problems solved by technology

Most of these security problems in Web application systems stem from defects in the program code of the Web application system itself. These defects introduce Web security vulnerabilities that hackers can exploit.



Examples


Embodiment 1

[0058] Embodiment 1: a method for automatically identifying a web crawler, which determines from the observed web page request sequence whether the remote host r is operating as a web crawler. The method includes:

[0059] Obtain the web page requests sent from the remote host r to the Web server s within a period of time; judge whether the time interval between each pair of adjacent web page requests is greater than or equal to a predetermined adjacent-request time interval threshold δ; and determine, according to whether the judgment results meet a preset condition, whether the remote host r is operating as a web crawler.

[0060] As shown in Figure 1, the method specifically includes the following steps:

[0061] A1, collect the web page request sequence from the remote host r to the web server s within a period of time;

[0062] A2, for the collected web page request sequence W from the remote host r to the Web server s, which includes n web page requests (each element in the sequence u...

Example 1

[0091] Assume that, according to step A1 of the automatic web crawler identification method, 10 web page requests from the remote host r to the Web server s are collected, and that the initiation times of these 10 web page requests are as shown in Table 1.

[0092] Table 1

[0093]

[0094] According to step A2 of the automatic web crawler identification method, the time interval sequence T of adjacent web page requests, which has 9 elements, is calculated, as shown in Table 2.

[0095] Table 2

[0096]
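The interval calculation in step A2 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `adjacent_intervals` and the timestamps are hypothetical (the values of Table 1 are not reproduced here), and times are assumed to be in milliseconds.

```python
from typing import List


def adjacent_intervals(request_times_ms: List[int]) -> List[int]:
    """Return the gaps (in ms) between consecutive request times.

    n request times yield n - 1 intervals, matching the 10 requests
    and 9 intervals described in the example.
    """
    ordered = sorted(request_times_ms)  # assumption: order by initiation time
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


# Hypothetical initiation times in milliseconds (NOT the Table 1 data).
times = [0, 500, 1200, 1500, 6000, 6400, 7000, 7300, 7900, 8200]
T = adjacent_intervals(times)
print(T)
```

Sorting first is a defensive choice for logs that may arrive out of order; if the request sequence is already ordered by initiation time, it has no effect.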

[0097] According to step A3 of the automatic web crawler identification method and the preset adjacent-request time interval threshold δ = 3000 milliseconds, the basic event sequence E shown in Table 3 is obtained.

[0098] Table 3

[0099]
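One way to picture step A3 is as a threshold test applied to each interval. The sketch below is an assumption-laden illustration: the function name `basic_events` is hypothetical, and encoding each basic event as a boolean (True when the interval is greater than or equal to δ) is a choice made here, since the encoding used in Table 3 is not shown.

```python
from typing import List


def basic_events(intervals_ms: List[int], delta_ms: int = 3000) -> List[bool]:
    """Map each interval to a basic event.

    True  -> interval >= delta (a "long" pause between requests)
    False -> interval <  delta (a "short" pause)
    The boolean encoding is an assumption for illustration only.
    """
    return [interval >= delta_ms for interval in intervals_ms]


# Hypothetical intervals in milliseconds (NOT the Table 2 data).
E = basic_events([500, 700, 300, 4500, 400, 600, 300, 600, 300])
print(E)
```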

[0100] According to step A4 of the automatic web crawler identification method, with the preset lower limit threshold η0 = 0.132 and the preset upper limit threshold η1 = 7.59, first calculate the likelihood ratio of basic e...
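The text of step A4 is truncated here, so the following is only a sketch of how a likelihood ratio could be compared against the two thresholds, in the style of a sequential probability ratio test. Everything beyond η0 = 0.132 and η1 = 7.59 is an assumption: the function name `sprt_classify`, the per-event probabilities, the definition of the ratio as P(events | crawler) / P(events | not crawler), and the rule that crossing the upper threshold means "crawler".

```python
from typing import Iterable


def sprt_classify(
    events: Iterable[bool],
    p_long_crawler: float,   # assumed P(interval >= delta | crawler)
    p_long_human: float,     # assumed P(interval >= delta | not crawler)
    eta0: float = 0.132,     # lower limit threshold from the example
    eta1: float = 7.59,      # upper limit threshold from the example
) -> str:
    """Update a running likelihood ratio event by event and stop early
    once it crosses either threshold (assumed decision rule)."""
    ratio = 1.0
    for is_long in events:
        if is_long:
            ratio *= p_long_crawler / p_long_human
        else:
            ratio *= (1.0 - p_long_crawler) / (1.0 - p_long_human)
        if ratio >= eta1:
            return "crawler"
        if ratio <= eta0:
            return "not crawler"
    return "undecided"


# Hypothetical probabilities: crawlers rarely pause for 3000 ms or more.
print(sprt_classify([False, False, False], p_long_crawler=0.2, p_long_human=0.8))
```

With these hypothetical probabilities each short interval multiplies the ratio by about 4, so two consecutive short intervals already exceed η1. Real probabilities would have to be estimated from traffic data.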

Example 2

[0103] Assume that, according to step A1 of the automatic web crawler identification method, 10 web page requests from the remote host r to the Web server s are collected, and that the initiation times of these 10 web page requests are as shown in Table 4.

[0104] Table 4

[0105] Web page request sequence number

[0106] According to step A2 of the automatic web crawler identification method, the time interval sequence T of adjacent web page requests, which has 9 elements, is calculated, as shown in Table 5.

[0107] Table 5

[0108] Element number

[0109] According to step A3 of the automatic web crawler identification method and the preset adjacent-request time interval threshold δ = 3000 milliseconds, the basic event sequence E shown in Table 6 is obtained.

[0110] Table 6

[0111] Element number

[0112] According to step A4 of the automatic web crawler identification method and the preset lower limit threshold ...


Abstract

The invention discloses a method and a device for automatically identifying web crawlers. The device comprises an acquisition unit, a judgment unit and an identification unit. The acquisition unit acquires the web page requests sent from a remote host to a Web server within a certain period of time. The judgment unit judges whether the time interval between each pair of adjacent web page requests is greater than or equal to a predetermined adjacent-request time interval threshold δ. The identification unit determines whether the remote host is operating as a web crawler according to whether the judgment results meet a preset condition. The method and the device can quickly detect web crawlers of various types, which gains valuable response time for the subsequent security response.

Description

Technical field

[0001] The invention relates to the technical field of network security, in particular to a method and device for automatically identifying web crawlers.

Background art

[0002] Because Web services are convenient and easy to use, more and more network services have shifted from the traditional mode of a dedicated client and a dedicated server (C/S mode) to the browser and Web server mode (B/S mode), in which a standard Web browser acts as the client. Network services that use the B/S mode are generally called Web application systems. While Web application systems bring convenience to people, they also bring many security problems. The more common security problems include web page Trojan horses, SQL injection attacks, XSS attacks and so on. Most of these security problems are rooted in program code defects in the Web application system itself, which introduce Web security holes, so that hackers can tak...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): H04L29/06, H04L12/24
Inventor: 叶润国 (Ye Runguo), 胡振宇 (Hu Zhenyu), 周涛 (Zhou Tao)
Owner BEIJING VENUS INFORMATION TECH