Internet data collection method and system based on man-in-the-middle

A data collection method and system, applied in the field of web crawlers, that addresses the difficulty of data acquisition, the differing resource consumption of the crawl stages, and the resulting problems with crawler efficiency and stability. It aims to improve data capture efficiency, reduce collection difficulty, and allow flexible configuration of the analysis.

Pending Publication Date: 2020-02-11
INST OF COMPUTING TECH CHINESE ACAD OF SCI

AI Technical Summary

Problems solved by technology

With the spread of the mobile Internet, more traffic is delivered directly through terminal applications. These applications often provide no web access, or restrict some of their data in the web version, which makes data collection very difficult.
[0003] The crawling process of a web crawler has five stages: obtaining the request URL, sending a web request to download the page, parsing structured data from the page, filtering duplicate data, and processing seed tasks. Each stage consumes different resources, and each ...

Embodiment Construction

[0024] To make the features and effects of the present invention clearer and easier to understand, specific embodiments are described in detail below with reference to the accompanying drawings.

[0025] The present invention uses the AnyProxy proxy tool to proxy all HTTP/HTTPS traffic of client applications, and makes it possible to decrypt HTTPS-encrypted data by pre-installing a security certificate on the corresponding collection devices.
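The client side of this setup can be illustrated with a minimal sketch. It is not the patent's implementation: it uses a Python client as a stand-in for a collection device, and the proxy address `http://127.0.0.1:8001` and the helper name `build_proxied_opener` are assumptions made for illustration. The sketch routes both HTTP and HTTPS through one proxy and trusts a given root certificate so that responses re-signed by the proxy would still validate.

```python
import ssl
import urllib.request
from typing import Optional


def build_proxied_opener(
    proxy_url: str, ca_cert_path: Optional[str] = None
) -> urllib.request.OpenerDirector:
    """Return an opener that sends HTTP and HTTPS requests through proxy_url.

    ca_cert_path is a hypothetical path to the proxy's root certificate in
    PEM format. With None, only the system trust store is used.
    """
    # Route both schemes through the same man-in-the-middle proxy.
    proxy_handler = urllib.request.ProxyHandler(
        {"http": proxy_url, "https": proxy_url}
    )
    # Trust the proxy's root certificate so re-signed HTTPS traffic validates.
    context = ssl.create_default_context(cafile=ca_cert_path)
    https_handler = urllib.request.HTTPSHandler(context=context)
    return urllib.request.build_opener(proxy_handler, https_handler)


# Assumed local proxy address; no request is sent by building the opener.
opener = build_proxied_opener("http://127.0.0.1:8001")
```

On a real phone or tablet the same two settings (proxy address and trusted root certificate) would be made in the device's network and certificate settings rather than in code.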

[0026] The technical scheme of the present invention is as follows:

[0027] A data collection method based on a "man-in-the-middle attack", comprising the following steps:

[0028] 1) The applications and devices to be collected from are the main body of the collection. They only need to be configured to use the man-in-the-middle proxy, have the proxy's certificate installed, and access any page in the target application to initialize.

[0029] 2) The man-in-the-middle proxy module is main...

Abstract

The invention provides an Internet data collection method and system based on a man-in-the-middle. In the method, a man-in-the-middle proxy certificate is installed on the webpage information collection device and a man-in-the-middle is established for that device, so that when the device accesses webpage information on the Internet, the man-in-the-middle proxy handles all of the device's network traffic. The man-in-the-middle acquires a collection task containing a URL regular expression for the webpages to be collected, captures the traffic that matches the URL regular expression as intermediate traffic, injects the collection task into the HTML page of the intermediate traffic to obtain a to-be-analyzed page, and stores the to-be-analyzed page in a first database. The analysis module distributes each to-be-analyzed page to an analyzer instance according to the page's URL information in the first database, obtains from the analyzer instance a webpage collection result containing the structured data, and stores the result in a second database. The invention can support data collection from all applications that provide information by integrating browser kernel functions.
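The flow described in this abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not the patented implementation: the names `CollectionTask`, `inject_task`, `intercept`, and `analyze` are hypothetical, the two databases are plain Python lists, the injected task is a script tag placed before `</body>`, and the analyzer is chosen by matching a regular expression against the stored URL.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CollectionTask:
    url_pattern: str  # regular expression for the pages to collect
    script: str       # task code to inject into matching HTML pages


def inject_task(html: str, task: CollectionTask) -> str:
    """Insert the task script before </body>, or append it if there is none."""
    tag = f"<script>{task.script}</script>"
    match = re.search(r"</body\s*>", html, re.IGNORECASE)
    if match is None:
        return html + tag
    return html[: match.start()] + tag + html[match.start():]


def intercept(url: str, html: str, tasks: List[CollectionTask],
              first_db: List[dict]) -> str:
    """Proxy-side hook: capture matching traffic, inject the task, store the page."""
    for task in tasks:
        if re.search(task.url_pattern, url):
            page = inject_task(html, task)
            first_db.append({"url": url, "html": page})
            return page
    return html  # traffic that matches no task passes through unchanged


def analyze(first_db: List[dict], parsers: Dict[str, Callable[[str], dict]],
            second_db: List[dict]) -> None:
    """Dispatch each stored page to the first parser whose pattern matches its URL."""
    for record in first_db:
        for pattern, parser in parsers.items():
            if re.search(pattern, record["url"]):
                second_db.append({"url": record["url"],
                                  "data": parser(record["html"])})
                break


def parse_title(html: str) -> dict:
    """Toy analyzer: extract the first <h1> text as structured data."""
    m = re.search(r"<h1>(.*?)</h1>", html)
    return {"title": m.group(1) if m else None}


tasks = [CollectionTask(r"^https://example\.com/item/\d+$", "console.log('task');")]
first_db: List[dict] = []
second_db: List[dict] = []
out = intercept("https://example.com/item/42",
                "<html><body><h1>Widget</h1></body></html>", tasks, first_db)
skipped = intercept("https://example.com/about", "<html></html>", tasks, first_db)
analyze(first_db, {r"/item/\d+": parse_title}, second_db)
```

Keeping capture and analysis as separate steps joined only by the stored pages mirrors the two-database split in the abstract, and would let analyzers be added or changed without touching the proxy hook.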

Description

Technical field

[0001] The present invention relates to the field of web crawlers, and in particular to a data collection method and system based on a "man-in-the-middle attack", which uses a man-in-the-middle proxy to modify traffic data so as to continuously inject different task code into different applications, complete requests for different pages, and obtain the relevant data.

Background technique

[0002] A web crawler is a program that makes effective use of existing resources to automatically grab large amounts of web page information from the Internet; it is sometimes called a "web spider". However, with the spread of the mobile Internet, more traffic is delivered directly through terminal applications, which often provide no web access or restrict some of their data in the web version. This makes data collection very difficult.

[0003] In the crawling process of the web crawler, it is necessary to obtain the request URL, se...

Application Information

IPC(8): G06F16/951, G06F16/955
CPC: G06F16/9566, G06F16/951
Inventor: 程学旗, 史存会, 胡耀康, 朱运昌, 俞晓明, 刘悦
Owner: INST OF COMPUTING TECH CHINESE ACAD OF SCI