Malicious web crawler detection method based on hidden Markov model (HMM)

A hidden Markov model and web crawler technology, applied in data exchange networks, special data processing applications, instruments, etc. It addresses problems such as the exposure of business secrets, the difficulty of distinguishing benign web crawlers from malicious ones, and crawling that disregards its adverse impact on the website.

Status: Inactive
Publication Date: 2017-07-18
广东亿荣电子商务有限公司 +1

AI Technical Summary

Problems solved by technology

Malicious web crawlers aim to grab useful information from websites while disregarding the negative impact of their crawling on the website. They may even violate the website’s data protection statement and forcibly grab sensitive information, causing adverse consequences such as user privacy leaks and the exposure of commercial secrets.
Existing web crawler detection methods can only distinguish crawler traffic from general user traffic; they have difficulty distinguishing benign web crawlers from malicious ones.


Examples

Embodiment Construction

[0041] Implementation process

[0042] Step 1: Preprocess the training data to generate a training data set for web crawler detection;

[0043] Step 2: Use the forward-backward algorithm to estimate the model parameters and obtain the HMM-based HTTP traffic model;

[0044] Step 3: Calculate the entropy of the monitored sequence using the trained model;

[0045] Step 4: Calculate the traffic anomaly detection quantity |μ − μ₀|;

[0046] Step 5: Identify web crawler traffic by checking whether |μ − μ₀| ≥ 3σ₀ holds;

[0047] Step 6: Extract the training data set for benign crawler detection;

[0048] Step 7: Estimate the model parameters of the benign web crawler using the forward-backward algorithm;

[0049] Step 8: Use the trained benign web crawler model to calculate the entropy of the web crawler sequence;

[0050] Step 9: Calculate the anomaly detection quantity |μ − μ₀|;

[0051] Step 10: By checking whether |μ − μ₀| ≥ 3σ₀ holds, identify malicious...
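
The entropy calculation and the 3σ test in the steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the listing does not define the entropy, μ, μ₀ or σ₀, so the sketch assumes that the entropy is the average negative log-likelihood per observation under a discrete HMM, that μ is the entropy of the monitored sequence, and that μ₀ and σ₀ are the mean and standard deviation of the entropy over the training sequences. The HMM parameters are taken as already estimated (Steps 2 and 7), and all function names and toy values are hypothetical.

```python
import math
import statistics


def sequence_entropy(obs, start, trans, emit):
    """Average negative log-likelihood per observation (assumed entropy measure).

    obs   : list of observation symbol indices
    start : start[i]    = P(first state is i)
    trans : trans[i][j] = P(next state is j | current state is i)
    emit  : emit[i][k]  = P(symbol k | state i)
    Uses the scaled forward algorithm to avoid numeric underflow.
    """
    if not obs:
        raise ValueError("observation sequence must not be empty")
    n = len(start)
    # Initialisation: alpha_1(i) = start_i * emit_i(o_1)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Induction: propagate one step, then weight by the emission.
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][obs[t]]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return math.inf  # the sequence is impossible under this model
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return -log_lik / len(obs)


def reference_stats(training_entropies):
    """Return (mu0, sigma0): mean and population std of training entropies."""
    return statistics.mean(training_entropies), statistics.pstdev(training_entropies)


def is_anomalous(mu, mu0, sigma0):
    """3-sigma rule from Steps 5 and 10: flag when |mu - mu0| >= 3 * sigma0."""
    return abs(mu - mu0) >= 3 * sigma0


if __name__ == "__main__":
    # Hypothetical two-state model over two request symbols (toy values only).
    start = [0.5, 0.5]
    trans = [[0.9, 0.1], [0.1, 0.9]]
    emit = [[0.9, 0.1], [0.1, 0.9]]
    training = [[0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0], [0, 0, 1, 1, 1, 1]]
    mu0, sigma0 = reference_stats(
        [sequence_entropy(s, start, trans, emit) for s in training]
    )
    monitored = [0, 1, 0, 1, 0, 1]
    mu = sequence_entropy(monitored, start, trans, emit)
    print("flagged:", is_anomalous(mu, mu0, sigma0))
```

Under these assumptions the same test would be applied twice: once against the user HTTP traffic model to separate crawler traffic from user traffic (Steps 3 to 5), and once against the benign crawler model for the sequences flagged as crawlers (Steps 8 to 10).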


Abstract

The invention provides a malicious web crawler detection method based on a hidden Markov model (HMM), and belongs to the technical field of computer software. A malicious web crawler captures sensitive information and private data from a website without permission, and its aggressive traffic pattern degrades the website’s service quality. Existing web crawler detection methods cannot accurately recognize malicious web crawlers and have a relatively high misjudgment rate. The malicious web crawler detection method based on the HMM provided by the invention comprises the following steps: (1) modeling user HTTP (Hypertext Transfer Protocol) traffic with an HMM; and (2) modeling web crawler behavior based on HTTP traffic.

Description

Technical field

[0001] The invention belongs to the technical field of computer software.

Background art

[0002] Benign web crawlers are an integral part of search engines. They generally take into account their impact on the website’s service quality and strictly abide by the website’s data crawling rules. Malicious web crawlers, however, aim to grab useful information from websites while disregarding the negative impact of their crawling on the website. They may even violate the website’s data protection statement and forcibly grab sensitive information, causing adverse consequences such as user privacy leaks and the exposure of commercial secrets. Existing web crawler detection methods can only distinguish crawler traffic from general user traffic; they have difficulty distinguishing benign web crawlers from malicious ones.

Contents of the invention

[0003] The purpose of the invention is to propose a malicious web crawle...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): H04L29/06, H04L12/24, H04L29/08, G06F17/30
Inventor: 罗日红, 蔡君
Owner: 广东亿荣电子商务有限公司