IP restricted controlled source information capture method based on agent pools

A proxy-pool technique for capturing information from controlled sources, applied in electrical digital data processing, special data processing applications, other database retrieval, and related fields. It addresses the problems that proxy IPs fail over time and that a proxy pool cannot be left unchanged, thereby improving capture speed and efficiency and overcoming the effect of high cost.

Inactive Publication Date: 2017-11-24
BEIJING INSTITUTE OF TECHNOLOGY

AI Technical Summary

Problems solved by technology

However, using an IP proxy pool does not guarantee that the server providing the data will not identify a proxy server's requests as illegitimate access and block its IP.
In addition, because proxy servers are unstable in many ways, the proxies in the pool of a real crawler system cannot be left unchanged; otherwise, the IPs in the proxy pool gradually become invalid over time.


Examples


Embodiment 1

[0047] This embodiment describes the overall process of the proxy pool-based method of the present invention for capturing information from IP-restricted controlled sources.

[0048] Figure 1 is a schematic flow diagram of the proxy pool-based method for capturing IP-restricted controlled source information. As the figure shows, the method has three main parts: the proxy pool initialization module initializes the proxy pool, the available-proxy test module tests which proxies are usable, and data capture proceeds together with dynamic maintenance of the pool.

Embodiment 2

[0050] This embodiment describes the operation of the proxy pool initialization module in the proxy pool-based method of the present invention for capturing IP-restricted controlled source information.

[0051] Figure 2 is a schematic diagram of the operation of the proxy pool initialization module in the method.

Embodiment 3

[0054] This embodiment describes the operation of the available-proxy test module in the proxy pool-based method of the present invention for capturing IP-restricted controlled source information.

[0055] Figure 3 illustrates the operation of the available-proxy test module in the method. The specific implementation shown in the figure is as follows:

[0056] Each proxy in proxy pool A is tested in turn. In each test, one proxy is taken from the pool, and the crawler system uses it as a proxy server to send N requests to the selected crawling source, where N >= 1. By default, the crawling source is the root directory of the target website. Whether a request succeeded is judged from the status code returned by the server, and the corresponding operation is performed: if the first r...
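The test loop in [0056] can be sketched in Python roughly as follows. This is a minimal illustration, not the patent's implementation: the names `fetch_status`, `is_usable` and `filter_usable` are hypothetical, and because the acceptance rule is cut off in the text above, the sketch simply assumes that a proxy passes only when all N requests return status 200.

```python
import http.client
import urllib.error
import urllib.request
from typing import Callable, Iterable, List


def fetch_status(proxy: str, url: str, timeout: float = 5.0) -> int:
    """Fetch `url` through `proxy`; return the HTTP status code, or 0 on failure."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    )
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (e.g. 403 when blocked).
        return err.code
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, malformed response, and similar failures.
        return 0


def is_usable(
    proxy: str,
    source_root: str,
    n: int = 3,
    fetch: Callable[[str, str], int] = fetch_status,
) -> bool:
    """Send up to n requests to the crawl source root through the proxy.

    Assumed rule (not from the source): every request must return 200.
    all() stops at the first failing request.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    return all(fetch(proxy, source_root) == 200 for _ in range(n))


def filter_usable(
    pool_a: Iterable[str],
    source_root: str,
    n: int = 3,
    fetch: Callable[[str, str], int] = fetch_status,
) -> List[str]:
    """Test each proxy in pool A in turn and keep those that pass."""
    return [p for p in pool_a if is_usable(p, source_root, n, fetch)]
```

Passing `fetch` as a parameter keeps the status-code check separate from the network call, so the loop can be exercised with a stub before it is pointed at real proxies.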



Abstract

The invention relates to a proxy pool-based method for capturing information from IP-restricted controlled sources, and belongs to the field of data acquisition and processing in computer science. The method comprises the following steps: maintaining two proxy pools, A and B, and storing all proxies in proxy pool A; testing for available proxies and initializing proxy pool B with them; randomly selecting proxies from proxy pool B for use during capture; triggering an update of proxy pool B through a dynamic maintenance operation during capture; and thereby achieving efficient capture from an IP-restricted controlled information source using the proxy pool algorithm. By maintaining the two proxy pools and dynamically regulating the available proxies during information capture, the method solves the problems of unstable proxies and low capture efficiency, and expands the resources available for all kinds of data-driven experiments.
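As a rough illustration of the two-pool scheme summarized above, the following Python sketch keeps all proxies in pool A, keeps tested proxies in pool B, picks randomly from B, and rebuilds B during capture. The class name `TwoPoolManager`, the `is_usable` callback and the `min_available` threshold are assumptions made for this example; the abstract does not state what condition triggers the update of pool B.

```python
import random
from typing import Callable, Iterable, List, Optional


class TwoPoolManager:
    """Pool A holds every known proxy; pool B holds those that passed testing."""

    def __init__(
        self,
        all_proxies: Iterable[str],
        is_usable: Callable[[str], bool],
        min_available: int = 1,
        rng: Optional[random.Random] = None,
    ) -> None:
        self.pool_a: List[str] = list(all_proxies)
        self.is_usable = is_usable
        # Assumed trigger: rebuild pool B when it shrinks below this size.
        self.min_available = min_available
        self.rng = rng or random.Random()
        self.pool_b: List[str] = []
        self.refresh()

    def refresh(self) -> None:
        """Re-test every proxy in pool A and rebuild pool B from the survivors."""
        self.pool_b = [p for p in self.pool_a if self.is_usable(p)]

    def pick(self) -> str:
        """Randomly select a proxy from pool B for the next capture request."""
        if not self.pool_b:
            raise RuntimeError("no usable proxy available")
        return self.rng.choice(self.pool_b)

    def report_failure(self, proxy: str) -> None:
        """Dynamic maintenance: drop a failed proxy and refresh B if it runs low."""
        if proxy in self.pool_b:
            self.pool_b.remove(proxy)
        if len(self.pool_b) < self.min_available:
            self.refresh()
```

A crawler would call `pick()` before each request and `report_failure()` whenever a request through that proxy is rejected, so proxies blocked mid-run leave pool B while proxies that have recovered can re-enter it on the next refresh.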

Description

Technical field

[0001] The present invention relates to an information collection method that overcomes limits on how often a single IP may access a server within a unit of time, that is, a method for designing, maintaining and updating a proxy pool, and in particular to a proxy pool-based method for capturing information from IP-restricted controlled sources. The method belongs to the technical field of data acquisition and processing in computer science.

Background art

[0002] Data collection is the cornerstone of data acquisition in natural language processing, machine learning and other disciplines. The data obtained by information collection systems is widely used in natural language processing, user profiling, public opinion analysis and other fields. The integrity, accuracy and scale of the collected data directly affect the performance of the applications that depend on it.

[0003] The Internet generates a large amount of data every day, ...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): H04L29/12, H04L29/08, H04L12/26, G06F17/30
CPC: H04L43/16, H04L43/50, H04L61/30, G06F16/951, H04L67/02, H04L61/5061, H04L67/56
Inventors: 史树敏, 杨旋, 赵蒙
Owner: BEIJING INSTITUTE OF TECHNOLOGY