Multi-thread network crawler processing method based on connection proxy optimal management

A web crawler processing technology, applied to the field of novel web crawler processing with optimized management of connection proxies

Inactive Publication Date: 2014-07-02
FUDAN UNIV
3 Cites, 37 Cited by
AI Technical Summary

Problems solved by technology

The method is reasonably adaptable and addresses the problem of crawlers being rejected by servers when fetching web pages.

Examples

Embodiment Construction

[0025] The present invention will be further described in detail below with reference to the drawings and embodiments.

[0026] Figure 1 further describes the process of the present invention. In the figure, the steps in dashed box A are the initial work needed to build the crawler and need to be executed only once. The steps in dashed box B are the crawler's page-fetching process, which is repeated until crawling ends.

[0027] (1) Obtain a proxy server and store it in the proxy server pool.

[0028] (2) Test the network connection performance of the proxy server.

[0029] (3) Create a number of threads determined by the performance of the proxy servers.

[0030]-[0031] (4) Convert the crawler's target address into an Http request, obtain a valid proxy server from the proxy server pool, and set the Http request to be executed through that proxy server.

[0032] (5) Add the Http request to an Http request que...
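Steps (1) to (5) can be sketched roughly as follows. This is a minimal illustration, not the patented procedure. The source text is truncated at step (5) and does not show how the thread count is calculated, so the `thread_count` heuristic, the `max_latency` cutoff, the round-robin pairing, and every name here (`Proxy`, `measure_proxies`, `crawl`, the injected `probe` and `fetch` callables) are assumptions made for the example.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass
class Proxy:
    address: str                    # placeholder such as "host:port"
    latency: float = float("inf")   # seconds; inf means untested


def measure_proxies(proxies, probe, max_latency=2.0):
    """Step (2): time each proxy with the caller-supplied probe(address)
    and keep those that respond within max_latency (assumed cutoff)."""
    usable = []
    for proxy in proxies:
        try:
            proxy.latency = probe(proxy.address)
        except OSError:
            continue  # unreachable proxies are dropped from the pool
        if proxy.latency <= max_latency:
            usable.append(proxy)
    return usable


def thread_count(usable, per_proxy=2, cap=32):
    """Step (3): assumed heuristic, NOT the patent's formula:
    a fixed number of threads per usable proxy, bounded by cap."""
    return max(1, min(cap, per_proxy * len(usable)))


def crawl(urls, usable, fetch, n_threads):
    """Steps (4)-(5): pair each URL with a proxy, queue the requests,
    and let worker threads drain the queue via fetch(url, address)."""
    if not usable:
        raise ValueError("no usable proxy servers")
    requests = queue.Queue()
    for i, url in enumerate(urls):
        requests.put((url, usable[i % len(usable)]))  # simple rotation

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url, proxy = requests.get_nowait()
            except queue.Empty:
                return  # queue drained, thread exits
            body = fetch(url, proxy.address)
            with lock:
                results[url] = body

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Passing `probe` and `fetch` in as callables keeps the sketch independent of any particular HTTP library; in practice they would wrap whatever client performs the timed test request and the proxied page fetch.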

Abstract

The invention belongs to the technical field of information processing and relates to a multi-thread web crawler processing method based on optimal management of connection proxies. The method first obtains public proxy servers from the network, tests their network connection performance, and derives the optimal number of threads from that performance. It then manages a proxy server pool, assigns a valid proxy server to each Http request, and finally executes the access request for the Web page. Because the number of threads is calculated, resources are used as fully as possible without waste. The number of uses of each usable proxy server is balanced, which effectively avoids frequent access being detected on the server side.
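The balancing of proxy usage mentioned in the abstract could be approximated with a least-used selection rule, sketched below. The source does not state which balancing rule the invention uses, so the rule itself and the names `BalancedProxyPool`, `acquire`, `discard`, and `usage` are hypothetical.

```python
import threading


class BalancedProxyPool:
    """Hands out the proxy with the fewest uses so far (assumed rule)."""

    def __init__(self, addresses):
        self._uses = {addr: 0 for addr in addresses}
        self._lock = threading.Lock()   # crawler threads share the pool

    def acquire(self):
        """Return the least-used proxy address and count the use."""
        with self._lock:
            if not self._uses:
                raise LookupError("no valid proxy server left in the pool")
            addr = min(self._uses, key=self._uses.get)
            self._uses[addr] += 1
            return addr

    def discard(self, addr):
        """Remove a proxy that has stopped working."""
        with self._lock:
            self._uses.pop(addr, None)

    def usage(self):
        """Snapshot of how often each proxy has been handed out."""
        with self._lock:
            return dict(self._uses)


pool = BalancedProxyPool(["a:1", "b:2", "c:3"])   # placeholder addresses
picked = [pool.acquire() for _ in range(7)]
```

With this rule the use counts of any two proxies in the pool never differ by more than one, which spreads requests evenly across the proxies that remain valid.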

Description

Technical Field

[0001] The invention relates to the technical field of information processing, in particular to a novel web page information acquisition method, and more specifically to a novel web crawler processing method that builds on the existing web crawler principle and adds optimal management of connection proxies.

Background

[0002] With the rapid development of the network, it has become the carrier of a large amount of information, and extracting this information effectively has become a huge challenge.

[0003] The web crawler is a very important part of a search engine system. It is responsible for collecting web pages and information from the Internet. This web page information is used to build an index that supports the search engine, so the crawler's performance directly affects the effectiveness of the search engine [1]. As the amount of network information increases geometrically, the requirements for the performance and efficiency of web c...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F9/50, G06F17/30, H04L29/08
Inventors: 罗邦慧, 曾剑平
Owner: FUDAN UNIV