Distributed crawler task scheduling method based on weighted round-robin algorithm

A weighted round-robin task-scheduling technology applied in the field of network search. It addresses the problem that crawling capability cannot keep up with the growth rate of information on the Internet, and achieves good scalability, flexible extensibility and fault tolerance for the crawler nodes while ensuring load balance.

Active Publication Date: 2014-06-18
TONGJI UNIV
View PDF · Cites: 5 · Cited by: 24

AI Technical Summary

Problems solved by technology

With the rapid development of the network, information is increasing rapidly. The crawling capabilities of traditional simple stand-alone web crawlers and centralized web crawlers can no longer keep up with the growth rate of information on the Internet.




Embodiment Construction

[0036] The present invention adopts a master-slave crawler architecture. The master control node contains a node table, three URL queues, a scheduling module and a crawler feedback module. The node table records the information of each crawler node, including node number, weight, etc.; it must be updated dynamically to stay consistent with the actual set of crawler nodes. The update can be triggered each time a crawler node reports back on a URL task, or it can run at a fixed interval, as suits the specific deployment. The scheduling module first takes a URL from the to-crawl queue, then reads each node's information from the node table, selects a crawler node for scheduling, assigns the URL to that node, and stores the URL in the allocated-URLs queue. When a crawler node completes the crawl of a URL, the crawler feedback module looks up the URL in the allocated-URLs queue, deletes it from that queue, and updates the node's weight accordingly.
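The master-node workflow above can be sketched as follows. This is a minimal illustration, not the patented implementation: the patent publishes no source code, so the class and method names, the use of Python, and the choice of a smooth weighted round-robin counter for node selection are all assumptions made for the example.

```python
from collections import deque


class MasterNode:
    """Sketch of the master control node: a node table (node id -> weight),
    a to-crawl queue, an allocated-URLs queue, and a done list standing in
    for the three URL queues described in the embodiment."""

    def __init__(self):
        self.node_table = {}     # node id -> current weight
        self.to_crawl = deque()  # URLs waiting to be scheduled
        self.allocated = {}      # URL -> node id it was assigned to
        self.done = []           # URLs whose crawl was confirmed
        self._counters = {}      # per-node weighted round-robin state

    def register(self, node_id, initial_weight=1):
        """A crawler node connecting for the first time gets an initial weight."""
        self.node_table[node_id] = initial_weight
        self._counters[node_id] = 0

    def schedule_one(self):
        """Take one URL, pick a node by weighted round-robin, and record
        the assignment in the allocated-URLs queue."""
        if not self.to_crawl or not self.node_table:
            return None
        url = self.to_crawl.popleft()
        # Smooth weighted round-robin: bump every counter by its weight,
        # pick the highest counter, then charge it the total weight.
        for node_id, weight in self.node_table.items():
            self._counters[node_id] += weight
        chosen = max(self._counters, key=self._counters.get)
        self._counters[chosen] -= sum(self.node_table.values())
        self.allocated[url] = chosen
        return url, chosen

    def feedback(self, url, new_weight=None):
        """Crawler feedback: delete the URL from the allocated-URLs queue
        and optionally update the reporting node's weight."""
        node_id = self.allocated.pop(url)
        self.done.append(url)
        if new_weight is not None:
            self.node_table[node_id] = new_weight


master = MasterNode()
master.register("A", initial_weight=2)  # faster node, higher weight
master.register("B", initial_weight=1)
master.to_crawl.extend(["http://example.com/1",
                        "http://example.com/2",
                        "http://example.com/3"])
while master.to_crawl:
    url, node = master.schedule_one()
    # ... node crawls url, then reports back ...
    master.feedback(url)
```

With weights 2 and 1, node "A" receives two of the three URLs and node "B" one, matching the proportional-share intent of weighted round-robin.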



Abstract

A distributed crawler task scheduling method based on a weighted round-robin algorithm includes at least the following steps:
(1) According to scale, web crawlers are divided into five types: stand-alone multi-thread crawlers, homogeneous centralized crawlers, heterogeneous centralized crawlers, small-scale distributed crawlers and large-scale distributed crawlers.
(2) A master-slave architecture is deployed.
(3) When a crawler node connects to the master node for the first time, the master node assigns it an initial weight.
(4) Following the weighted round-robin scheduling algorithm, the master node repeatedly chooses a crawler node and assigns it a URL (Uniform Resource Locator) task to be crawled.
(5) Each time a crawler node finishes crawling a URL task, it returns the result to the master node, which then updates the node's weight; and so on.
The distributed crawler scheduling policy based on the weighted round-robin algorithm put forward by the invention is designed for small-scale distributed crawlers; it ensures the load balance of the crawler nodes as well as their flexible scalability and fault tolerance.
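The node selection in step (4) can be illustrated with the "smooth" weighted round-robin variant (the scheme popularized by Nginx's upstream balancing). This is a sketch under assumptions: the patent does not publish its exact selection formula, and the node names and weights below are invented for the example.

```python
def smooth_wrr(weights, n_picks):
    """Smooth weighted round-robin: on each pick, add every node's weight
    to its running counter, choose the node with the highest counter, then
    subtract the total weight from the chosen node's counter.  Over time
    each node is picked in proportion to its weight, without long bursts."""
    current = {node: 0 for node in weights}
    total = sum(weights.values())
    picks = []
    for _ in range(n_picks):
        for node, weight in weights.items():
            current[node] += weight
        chosen = max(current, key=current.get)
        current[chosen] -= total
        picks.append(chosen)
    return picks


# Three crawler nodes with weights 3, 1, 1: over 5 picks, "A" is chosen
# 3 times and the picks are interleaved rather than consecutive.
order = smooth_wrr({"A": 3, "B": 1, "C": 1}, 5)
print(order)
```

Updating a node's weight after each feedback (step 5) simply changes the increments used on subsequent picks, so faster or healthier nodes gradually receive a larger share of URL tasks.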

Description

Technical Field

[0001] The invention relates to the technical field of network search.

Background Technique

[0002] A search engine can be divided into several parts such as crawler, indexer, retriever and user interface. Among them, crawlers are responsible for continuously searching and collecting information on the Internet, and play an important role in search engines. With the rapid development of the network, information is increasing rapidly. The crawling capabilities of traditional simple stand-alone web crawlers and centralized web crawlers can no longer keep up with the growth rate of information on the Internet. Today, when the concept of distributed computing is mentioned more and more, distributed crawlers have naturally become a solution to the problem of large data volumes. Distributed crawlers are composed of multiple nodes deployed in a wide area network, and can crawl in parallel to meet people's needs for crawler capabilities. Due to the different crawling...

Claims


Application Information

IPC(8): G06F9/48, G06F9/50
Inventor: 蒋昌俊 (Jiang Changjun), 陈闳中 (Chen Hongzhong), 闫春钢 (Yan Chungang), 丁志军 (Ding Zhijun), 王鹏伟 (Wang Pengwei), 孙海春 (Sun Haichun), 邓晓栋 (Deng Xiaodong), 葛大劼 (Ge Dajie)
Owner: TONGJI UNIV