
Crawler system based on HTTP proxy and implementation method thereof

A crawler system and implementation method, applied in the information technology field, that addresses the unsuitability of existing crawlers for large-scale data acquisition and their inability to cope with web page font anti-crawling measures. It simplifies browser operations, improves concealment, and prevents detection by websites.

Pending Publication Date: 2021-05-14
SHANGHAI INST OF TECH

AI Technical Summary

Problems solved by technology

[0003] However, the implementation of most of these solutions requires a lot of manual operations on the browser, which is not suitable for large-scale data acquisition.
Automated crawler browser plug-ins such as Web Scraper, which require only a small amount of work by the user, are likewise unsuitable for crawling large volumes of data and cannot cope with the font anti-crawling measures used by web pages.



Examples


Embodiment Construction

[0044] The present invention will be described in detail below in conjunction with specific embodiments. The following examples will help those skilled in the art to further understand the present invention, but do not limit the present invention in any form. It should be noted that those skilled in the art can make several changes and improvements without departing from the concept of the present invention. These all belong to the protection scope of the present invention.

[0045] The object of the present invention is to provide an HTTP proxy-based crawler system and its implementation method, aiming at solving the problems of weak concealment and many manual operations in the existing crawler system.

[0046] In the present invention, the HTTP proxy mainly refers to an ordinary proxy, which plays the role of a middleman: for the client connected to it, it is the server; for the server it connects to, it is the client. It is responsible for transmitti...
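The middleman role described in [0046] can be illustrated with a minimal sketch. It is not taken from the patent: the names `Request`, `Response`, `ForwardingProxy` and `fake_site` are hypothetical, and the "website server" is a plain Python function rather than a real network endpoint, so only the two-sided role of the proxy is shown.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Request:
    method: str
    url: str
    headers: Dict[str, str] = field(default_factory=dict)


@dataclass
class Response:
    status: int
    body: str


# An "upstream" is anything that turns a Request into a Response,
# for example a real website server. Here it is a plain callable.
Upstream = Callable[[Request], Response]


class ForwardingProxy:
    """Hypothetical middleman: server to the client, client to the website."""

    def __init__(self, upstream: Upstream) -> None:
        self.upstream = upstream

    def handle(self, request: Request) -> Response:
        # Server role: accept the request sent by the client (the browser).
        # Client role: pass it on to the website and relay the reply back.
        return self.upstream(request)


def fake_site(request: Request) -> Response:
    # Stand-in for a website server; it echoes what it was asked for.
    return Response(200, f"{request.method} {request.url}")


proxy = ForwardingProxy(fake_site)
reply = proxy.handle(Request("GET", "http://example.com/page"))
print(reply.status, reply.body)
```

Because every request passes through `handle`, this is also the natural place for a proxy to inspect or change the request before forwarding it.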


Abstract

The invention provides a crawler system based on an HTTP proxy and an implementation method thereof. The system comprises a crawler task creation module, a crawler feature processing module, a crawler task execution module, an information storage module and a browser. The crawler task creation module determines a crawler task from the crawler task seed information and constructs the corresponding URL; the crawler task execution module acquires and executes crawler tasks and extracts website page information; the crawler feature processing module modifies the request header information when the browser sends a request to the website server; and the information storage module stores the extracted website page information. With this method, important information in the HTTP request header can be modified, so that the characteristics of a third-party headless browser are hidden when one is used and the browser is not detected by the website. This further improves the concealment of the browser crawler, simplifies browser operation, and makes the method suitable for crawling data on a large scale.
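Two of the steps in the abstract, building URLs from seed information and modifying the request header, can be sketched as below. This is an illustration under stated assumptions, not the patented implementation: the abstract does not say which headers are changed or how URLs are formed, so the function names `build_urls` and `rewrite_headers`, the `q` query parameter, and the choice of the `User-Agent` and `Accept-Language` headers are all hypothetical.

```python
from typing import Dict, List
from urllib.parse import urlencode


def build_urls(base_url: str, keywords: List[str]) -> List[str]:
    """Task creation (assumed form): one search URL per seed keyword."""
    return [f"{base_url}?{urlencode({'q': kw})}" for kw in keywords]


def rewrite_headers(headers: Dict[str, str]) -> Dict[str, str]:
    """Feature processing (assumed form): hide headless-browser markers."""
    cleaned = dict(headers)  # copy, so the caller's dict is left untouched
    user_agent = cleaned.get("User-Agent", "")
    # Headless Chrome identifies itself as "HeadlessChrome" in User-Agent.
    cleaned["User-Agent"] = user_agent.replace("HeadlessChrome", "Chrome")
    # Assumption: supply a language header if the request lacks one.
    cleaned.setdefault("Accept-Language", "en-US,en;q=0.9")
    return cleaned


urls = build_urls("http://example.com/search", ["web crawler"])
original = {"User-Agent": "Mozilla/5.0 HeadlessChrome/90.0"}
modified = rewrite_headers(original)
print(urls)
print(modified)
```

In a proxy-based design, a function like `rewrite_headers` would run inside the proxy on each outgoing request, so the browser itself needs no modification.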

Description

Technical field

[0001] The invention relates to the field of information technology, in particular to an HTTP proxy-based crawler system and an implementation method thereof.

Background art

[0002] In order to improve the concealment of the crawler, many current crawler technical solutions choose to deploy the crawler in the browser, and the user operates the browser to crawl information.

[0003] However, the implementation of most of these solutions requires a lot of manual operations on the browser, which is not suitable for large-scale data acquisition. Automated crawler browser plug-ins such as Web Scraper, which require only a small amount of work by the user, are likewise unsuitable for crawling large volumes of data and cannot cope with the font anti-crawling measures used by web pages.

Summary of the invention

[0004] Aiming at the defects in the prior art, the object of the present invention is to provide a crawler system based on HTTP proxy and ...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/951, G06F16/955
CPC: G06F16/951, G06F16/9566
Inventor: 李宗伟, 童晓玲
Owner: SHANGHAI INST OF TECH