Crawler system based on HTTP proxy and implementation method thereof

A crawler system and implementation method, applied in the information field. It addresses the problems that existing approaches are unsuitable for large-scale data acquisition and cannot cope with web-page font anti-crawling measures. Its effects are simplified browser operation, improved concealment, and avoidance of detection by websites.

Pending Publication Date: 2021-05-14

SHANGHAI INST OF TECH

9 Cites, 2 Cited by

AI Technical Summary

Problems solved by technology

[0003] However, the implementation of most of these solutions requires a lot of manual operations on the browser, which is not suitable for large-scale data acquisition.

Automated crawler browser plug-ins such as Web Scraper require only a small amount of work from the user, but they too are unsuitable for crawling large volumes of data and cannot cope with font-based anti-crawling measures on web pages.

Embodiment Construction

[0044] The present invention will be described in detail below in conjunction with specific embodiments. The following examples will help those skilled in the art to further understand the present invention, but do not limit the present invention in any form. It should be noted that those skilled in the art can make several changes and improvements without departing from the concept of the present invention. These all belong to the protection scope of the present invention.

[0045] The object of the present invention is to provide an HTTP proxy-based crawler system and its implementation method, aiming at solving the problems of weak concealment and many manual operations in the existing crawler system.

[0046] In the present invention, the HTTP proxy mainly refers to an ordinary proxy, which plays the role of a middleman: to the client that connects to it, it acts as the server; to the server it connects to, it acts as the client. It is responsible for transmitti...
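The dual role described in [0046] can be illustrated with a small in-memory sketch. It is not the patent's implementation: the `Request`, `Response`, and `MiddlemanProxy` names and the `fake_origin` function are hypothetical, and the upstream call is an injected function rather than a real network connection.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Request:
    method: str
    url: str
    headers: Dict[str, str] = field(default_factory=dict)


@dataclass
class Response:
    status: int
    body: str


class MiddlemanProxy:
    """Hypothetical proxy: a server to its client, a client to the origin."""

    def __init__(self, send_upstream: Callable[[Request], Response]):
        # send_upstream stands in for a real network call to the origin server.
        self._send_upstream = send_upstream
        self.log: List[str] = []

    def handle(self, request: Request) -> Response:
        # Server role: accept the request from the connected client.
        self.log.append(f"client -> proxy: {request.method} {request.url}")
        # Client role: issue an equivalent request to the origin server.
        upstream_request = Request(request.method, request.url, dict(request.headers))
        response = self._send_upstream(upstream_request)
        self.log.append(f"proxy <- origin: {response.status}")
        # Server role again: hand the origin's response back to the client.
        return response


def fake_origin(request: Request) -> Response:
    # Assumed stand-in for a website server, used only for this illustration.
    return Response(200, f"content of {request.url}")


proxy = MiddlemanProxy(fake_origin)
reply = proxy.handle(Request("GET", "http://example.com/page"))
```

Because every request passes through `handle`, the proxy is also the natural place to inspect or alter a request before it reaches the origin.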

Abstract

The invention provides a crawler system based on an HTTP proxy and an implementation method thereof. The system comprises a crawler task creation module, a crawler feature processing module, a crawler task execution module, an information storage module and a browser. The crawler task creation module determines a crawler task from the crawler task seed information and constructs the corresponding URL; the crawler task execution module acquires and executes crawler tasks and extracts website page information; the crawler feature processing module modifies the request header information when the browser sends a request to the website server; and the information storage module stores the extracted website page information. With this method, important information in the HTTP request header can be modified, so that the characteristics of a third-party headless browser are hidden when one is used and the website is prevented from detecting it. This further improves the concealment of the browser crawler, simplifies browser operation, and makes the method suitable for crawling data on a large scale.
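Two of the modules named in the abstract, task creation and feature processing, lend themselves to a short sketch. Everything below is an assumption made for illustration: the function names, the seed format (a base URL, keywords, and a page count), the `q` and `page` query parameters, and the idea that the telltale header is a `User-Agent` containing the token `HeadlessChrome`. The source does not state which header fields the system rewrites.

```python
from typing import Dict, List
from urllib.parse import urlencode


def build_task_urls(base_url: str, keywords: List[str], pages: int) -> List[str]:
    """Task creation (hypothetical): expand seed information into concrete URLs."""
    return [
        f"{base_url}?{urlencode({'q': keyword, 'page': page})}"
        for keyword in keywords
        for page in range(1, pages + 1)
    ]


def mask_headless_headers(headers: Dict[str, str]) -> Dict[str, str]:
    """Feature processing (hypothetical): rewrite headers that reveal a headless browser."""
    masked = dict(headers)  # leave the caller's dictionary untouched
    user_agent = masked.get("User-Agent", "")
    # Assumption: the headless browser advertises itself with this token.
    masked["User-Agent"] = user_agent.replace("HeadlessChrome", "Chrome")
    # Assumption: an ordinary browser would also send a language preference.
    masked.setdefault("Accept-Language", "en-US,en;q=0.9")
    return masked


urls = build_task_urls("https://example.com/search", ["proxy", "crawler"], pages=2)
headers = mask_headless_headers(
    {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) HeadlessChrome/90.0 Safari/537.36"}
)
```

In a proxy-based design, a function like `mask_headless_headers` would run inside the proxy on each outgoing request, so the browser itself needs no modification.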

Description

Technical field

[0001] The invention relates to the field of information technology, in particular to an HTTP proxy-based crawler system and an implementation method thereof.

Background

[0002] To improve the concealment of the crawler, many current crawler solutions deploy the crawler in the browser, and the user operates the browser to crawl information.

[0003] However, most of these solutions require a lot of manual operation of the browser, which makes them unsuitable for large-scale data acquisition. Automated crawler browser plug-ins such as Web Scraper require only a small amount of work from the user, but they too are unsuitable for crawling large volumes of data and cannot cope with font-based anti-crawling measures on web pages.

Summary of the invention

[0004] Aiming at the defects in the prior art, the object of the present invention is to provide a crawler system based on HTTP proxy and ...

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G06F16/951, G06F16/955

CPC: G06F16/951, G06F16/9566

Inventor: 李宗伟, 童晓玲

Owner: SHANGHAI INST OF TECH
