Web crawler identification method and web crawler identification device

A web crawler identification method and device, applied in the field of computer technology, that addresses the high false negative rate, high false positive rate, and large limitations of existing identification approaches, and achieves good identification results.

Active Publication Date: 2018-08-21
TENCENT TECH (SHENZHEN) CO LTD

AI Technical Summary

Problems solved by technology

[0005] The purpose of the present invention is to provide a method and device for identifying web crawlers, so as to solve the technic



Examples


Embodiment 1

[0034] This embodiment is described from the perspective of the web crawler identification device. Referring to Figure 1b, which details the web crawler identification method provided by this embodiment of the present invention, the method may include:

[0035] S101. Generate a crawler identification instruction.

[0036] In this embodiment, the trigger condition for generating the crawler identification instruction can be determined according to actual needs. It can be a specified time or a specified amount of data, where the specified time and the specified amount of data can be set by the user or can be the server's factory default settings. Specifically, when the trigger condition is a specified time, the server may be triggered to generate a crawler identification instruction when the specified time is reached. When the trigger condition is the specified amount of data, it is necessary to c...
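The trigger check described above can be sketched as follows. This is only an illustration under stated assumptions: the names `TriggerConfig` and `should_generate_instruction` are hypothetical, the "specified time" is modeled as an elapsed interval in seconds, and the "specified amount of data" is modeled as a count of newly stored access records. The source text is truncated before it describes how the data amount is actually measured.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TriggerConfig:
    """Hypothetical trigger settings (user-set or factory defaults)."""
    interval_seconds: Optional[float] = None  # assumed form of the "specified time"
    record_threshold: Optional[int] = None    # assumed form of the "specified amount of data"


def should_generate_instruction(cfg: TriggerConfig,
                                seconds_since_last_run: float,
                                records_since_last_run: int) -> bool:
    """Return True when either configured trigger condition is met."""
    # Time-based trigger: enough time has passed since the last identification run.
    if cfg.interval_seconds is not None and seconds_since_last_run >= cfg.interval_seconds:
        return True
    # Data-based trigger: enough new access records have accumulated (assumption).
    if cfg.record_threshold is not None and records_since_last_run >= cfg.record_threshold:
        return True
    return False
```

Keeping both thresholds optional lets the same check serve a deployment that uses only one of the two trigger conditions.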

Embodiment 2

[0103] Based on the method described in Embodiment 1, an example is given below for further detailed description.

[0104] This embodiment is described in detail by taking as an example a web crawler identification device integrated in a server, together with a first terminal and a second terminal.

[0105] As shown in Figure 2a and Figure 2b, the specific process of a web crawler identification method may be as follows:

[0106] S201. The server acquires and parses the user access request sent by the first terminal, obtains the user ID, access address and access time of the current user, and stores them in a first preset database.
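Step S201 can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the request is modeled as a dictionary with hypothetical keys `user_id`, `url`, and `time` (epoch seconds), and an in-memory SQLite table named `access_log` stands in for the "first preset database". The source does not specify the request format or the storage engine.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory stand-in for the first preset database (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE access_log ("
        "user_id TEXT NOT NULL, "
        "address TEXT NOT NULL, "
        "access_time REAL NOT NULL)"
    )
    return conn


def record_request(conn: sqlite3.Connection, request: dict) -> None:
    """Extract user ID, access address, and access time, then store them."""
    row = (str(request["user_id"]), str(request["url"]), float(request["time"]))
    conn.execute("INSERT INTO access_log VALUES (?, ?, ?)", row)
    conn.commit()


def access_times(conn: sqlite3.Connection, user_id: str) -> list:
    """Return one user's stored access times in ascending order."""
    cur = conn.execute(
        "SELECT access_time FROM access_log WHERE user_id = ? ORDER BY access_time",
        (user_id,),
    )
    return [r[0] for r in cur.fetchall()]
```

Storing the access time as a plain number keeps the later interval calculation a simple subtraction.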

[0107] For example, a user's access request can be acquired in roughly two ways: one is through a traffic bypass copy operation from the switch via an optical splitting device, and the other is from the web server, which reports real-time traffic data through a data sending queue. The u...

Embodiment 3

[0142] Based on the methods described in Embodiment 1 and Embodiment 2, this embodiment provides further description from the perspective of a web crawler identification device, which may be integrated in a server.

[0143] Referring to Figure 3a, which details the web crawler identification device provided by the third embodiment of the present invention, the device may include: a generation module 10, an acquisition module 20, a calculation module 30 and an identification module 40, wherein:

[0144] (1) Generation module 10

[0145] The generation module 10 is configured to generate a crawler identification instruction.

[0146] In this embodiment, the trigger condition for generating the crawler identification instruction can be determined according to actual needs. It can be a specified time or a specified amount of data, where the specified time and the specified amount of data can be set by the user, or can be It i...


Abstract

The invention discloses a web crawler identification method and a web crawler identification device. The method comprises the following steps: a crawler identification instruction is generated; according to the crawler identification instruction, a user identification set is acquired, together with the access time set corresponding to each user identification in that set within a preset time period; the interval between each pair of adjacent access times in the access time set is calculated to obtain an interval length set; and the web crawler is identified from the user identification set according to the interval length set. In this way the web crawler can be identified accurately, the false negative rate and false positive rate are reduced, and the identification effect is good.
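The interval calculation in the abstract can be sketched as below. The grouping and interval steps follow the abstract directly. The decision rule, however, is an assumption added purely for illustration: the visible text does not say how crawlers are picked out of the interval length set, so this sketch flags users whose intervals are unusually regular (low coefficient of variation). The function names and the `min_requests` and `max_cv` parameters are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def interval_sets(records, start, end):
    """Map each user ID to the intervals between its adjacent access times.

    `records` is an iterable of (user_id, access_time_in_seconds) pairs;
    only accesses with start <= time < end (the preset period) are used.
    """
    times = defaultdict(list)
    for user_id, t in records:
        if start <= t < end:
            times[user_id].append(t)
    result = {}
    for user_id, ts in times.items():
        ts.sort()
        result[user_id] = [later - earlier for earlier, later in zip(ts, ts[1:])]
    return result


def looks_like_crawler(intervals, min_requests=5, max_cv=0.1):
    """Illustrative rule only (assumption, not the patent's stated rule)."""
    if len(intervals) + 1 < min_requests:
        return False  # too few accesses to judge
    avg = mean(intervals)
    if avg == 0:
        return True  # every access shares one timestamp
    # Coefficient of variation: near zero means machine-like regularity.
    return pstdev(intervals) / avg <= max_cv


def identify_crawlers(records, start, end):
    """Return the set of user IDs flagged by the illustrative rule."""
    return {
        user_id
        for user_id, intervals in interval_sets(records, start, end).items()
        if looks_like_crawler(intervals)
    }
```

Any other rule over the interval length set could replace `looks_like_crawler` without changing the grouping and interval steps.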

Description

Technical field

[0001] The invention relates to the field of computer technology, in particular to a method and device for identifying a web crawler.

Background

[0002] A web crawler is a program that automatically obtains the content of web pages. For a website, a large number of requests from malicious crawlers consumes server performance and can even cause the server to crash. Moreover, in industries such as literature, film and television, and e-commerce, malicious crawlers can easily be used to pull and copy public or semi-public information in bulk, which seriously affects the security of the website server.

[0003] Existing web crawlers can be divided into high-frequency script crawlers and collector crawlers according to differences in crawling targets, countermeasures, and performance requirements. Among them, high-frequency script crawlers aim to obtain, in the shortest possible time, the site's updated content and full amount of...


Application Information

IPC(8): H04L29/06, G06F17/30
CPC: G06F16/951, G06F16/958, H04L63/10, H04L63/1416, H04L63/1466
Inventor: 唐文韬, 郑云文, 胡珀, 郑兴, 郭晶, 张强, 范宇河, 王放, 杨勇
Owner TENCENT TECH (SHENZHEN) CO LTD