
A low-frequency crawler identification method and device

An identification method and crawler technology, applied in the Internet field, that can solve problems such as false blocking caused by the proxy IP library, update delays, and a limited identification recall rate.

Active Publication Date: 2018-09-18
BEIJING SHU AN XINYUN TECH CO LTD
7 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

[0004] (1) The identification recall rate is limited by the coverage of the proxy IP library. There are currently hundreds of millions of Internet proxy IPs, and the mobile phone proxy IP library can cover only a small part of them;
[0005] (2) Proxy IPs are not static, so the proxy IP library must be updated frequently. Customers generally resist online updates, while offline updates suffer from update delays;
[0006] (3) Proxy IPs obtained through ADSL community broadband disconnect-and-redial and multicast are more concealed. Such an IP is also used by many real users, so a proxy IP library faces problems such as false blocking and an inability to identify crawlers accurately.

Method used



Examples


Specific Embodiment

[0067] Collect the network application logs of each user IP for a given month, and calculate the behavior feature vector of each user IP for that month. The behavior feature vectors of the user IPs are clustered to obtain two clusters.
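The clustering step in [0067] can be sketched as follows. The source does not name a clustering algorithm, so plain k-means with k = 2 is an assumption made here for illustration. The IP addresses, the feature values, and the ordering of the three numbers in each vector are also hypothetical.

```python
import math
from typing import Dict, List, Sequence


def kmeans(vectors: Sequence[Sequence[float]], k: int = 2,
           iterations: int = 50) -> List[int]:
    """Return a cluster label (0..k-1) for each vector using plain k-means.

    Assumption: k-means stands in for the unspecified clustering method.
    """
    if len(vectors) < k:
        raise ValueError("need at least k vectors")

    # Deterministic seeding: start from the first vector, then repeatedly
    # add the vector that is farthest from the centroids chosen so far.
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors, key=lambda v: min(math.dist(v, c) for c in centroids)
        )
        centroids.append(list(farthest))

    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(v, centroids[j]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, new_labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
    return labels


# Hypothetical monthly behavior feature vectors, one per user IP.
features_by_ip: Dict[str, List[float]] = {
    "198.51.100.1": [0.98, 0.70, 0.90],
    "198.51.100.2": [0.97, 0.65, 0.85],
    "203.0.113.5": [0.20, 0.10, 0.95],
    "203.0.113.6": [0.25, 0.15, 0.99],
}

ips = list(features_by_ip)
labels = kmeans([features_by_ip[ip] for ip in ips], k=2)

# Group the user IPs by cluster label.
clusters: Dict[int, List[str]] = {}
for ip, label in zip(ips, labels):
    clusters.setdefault(label, []).append(ip)
```

With real logs the feature vectors would usually be normalized before clustering so that no single feature dominates the distance; that step is omitted here because the source does not describe it.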

[0068] The inspection rules are determined as follows: the three target behavior features are the maximum Referer similarity ratio, the request path set space ratio, and the proportion of 2XX status codes.

[0069] For the maximum Referer similarity ratio, the judgment logic is "greater than" and the threshold is 95%.

[0070] For the request path set space ratio, the judgment logic is "greater than" and the threshold is 50%.

[0071] For the proportion of 2XX status codes, the judgment logic is "greater than" and the threshold is 50%.

[0072] Calculate the average value of each of these three target behavior features over all user IPs in each of the two clusters, and the average values of these three target beha...
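A minimal sketch of how the rules in [0068] to [0071] might be checked against per-cluster averages is shown below. Paragraph [0072] is truncated in this copy, so two points are assumptions rather than the patent's stated method: that each rule is compared with the cluster average, and that a cluster must satisfy all three rules to be flagged. The feature names and sample values are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass(frozen=True)
class Rule:
    """One inspection rule: the cluster average must be greater than the threshold."""
    feature: str      # hypothetical feature key
    threshold: float  # threshold expressed as a fraction (95% -> 0.95)


# Thresholds taken from [0069]-[0071]; the judgment logic is "greater than".
RULES = [
    Rule("referer_max_similarity_ratio", 0.95),
    Rule("request_path_set_space_ratio", 0.50),
    Rule("status_2xx_ratio", 0.50),
]


def cluster_means(members: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each target behavior feature over all user IPs in one cluster."""
    return {r.feature: mean(m[r.feature] for m in members) for r in RULES}


def meets_rules(means: Dict[str, float]) -> bool:
    """Assumption: a cluster is flagged only if every rule is satisfied."""
    return all(means[r.feature] > r.threshold for r in RULES)


# Hypothetical clusters, each a list of per-IP feature dictionaries.
cluster_a = [
    {"referer_max_similarity_ratio": 0.98,
     "request_path_set_space_ratio": 0.70,
     "status_2xx_ratio": 0.90},
    {"referer_max_similarity_ratio": 0.96,
     "request_path_set_space_ratio": 0.60,
     "status_2xx_ratio": 0.80},
]
cluster_b = [
    {"referer_max_similarity_ratio": 0.20,
     "request_path_set_space_ratio": 0.10,
     "status_2xx_ratio": 0.95},
    {"referer_max_similarity_ratio": 0.30,
     "request_path_set_space_ratio": 0.20,
     "status_2xx_ratio": 0.99},
]

a_is_crawler = meets_rules(cluster_means(cluster_a))
b_is_crawler = meets_rules(cluster_means(cluster_b))
```

Under these assumptions every user IP in a flagged cluster would be treated as a crawler, which matches the abstract's statement that each user IP in a cluster meeting the inspection rule is determined to be a crawler.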


Abstract

Disclosed herein are a low-frequency crawler identification method, a device, a readable storage medium, and equipment. The method comprises: computing a behavior feature vector for each user IP within a preset time slot according to the network application log of that user IP; clustering the behavior feature vectors of the user IPs to obtain a plurality of clusters; and determining an inspection rule, determining a cluster that meets the corresponding inspection rule, and determining each user IP in that cluster to be a crawler. The embodiments of the present invention can effectively identify low-frequency crawlers and can address group threats, low-frequency threats, associated threats, and persistent threats, which traditional security products cannot identify. Public cloud and private cloud deployment are supported, and threat identification and blocking can be performed without changing the network topology and without embedding any code. Custom blocking interfaces can be integrated, and even if the deployment environment is switched off completely in an extreme case, the normal operation of the original service is not affected.

Description

Technical field

[0001] The invention relates to the technical field of the Internet, in particular to a low-frequency crawler identification method and device.

Background technique

[0002] The Internet is full of crawlers, and as anti-crawler measures develop, crawlers keep evolving as well. Their evolution includes three stages: primary crawlers, browser crawlers, and low-frequency crawlers. A primary crawler crawls the target page without disguising itself and can be accurately identified by features such as the user agent (User-agent) and request frequency. A browser crawler disguises its User-agent as that of a browser such as Firefox, Opera, or Chrome, and its behavior resembles that of a normal user; it can be identified by features such as access frequency and timeline. Low-frequency crawlers use a large number of proxy IP pools to imitate ordinary users for...

Claims


Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): H04L29/06
CPC: H04L63/1416, H04L63/1425, H04L63/145
Inventor: 胡志磊刘鑫琪陈峰汪海陈哲从磊
Owner: BEIJING SHU AN XINYUN TECH CO LTD