Distributed crawler task scheduling method based on weighted round-robin algorithm

A weighted round-robin task-scheduling technology applied in the field of network search. It addresses the problem that crawling capability cannot keep up with the growth rate of information on the Internet, and achieves good scalability, flexible extensibility and fault tolerance for the crawler nodes while ensuring load balance.

Active Publication Date: 2014-06-18
TONGJI UNIV
View PDF · Cites: 5 · Cited by: 24

AI Technical Summary

Problems solved by technology

With the rapid development of the network, information is increasing rapidly. The crawling capabilities of traditional simple stand-alone web crawlers and centralized web crawlers can no longer keep up with the growth rate of information on the Internet.




Embodiment Construction

[0036] The present invention adopts a master-slave crawler architecture. The master control node contains a node table, three URL queues, a scheduling module and a crawler feedback module. The node table records the information of each crawler node, including node number, weight, etc.; it must be updated dynamically to stay consistent with the actual set of crawler nodes. The update can be triggered each time a crawler node reports back on a URL task, or it can run at a fixed interval, as suits the specific deployment. The scheduling module first takes a URL from the to-crawl queue, then reads each node's information from the node table, selects a crawler node for scheduling, assigns the URL to that node, and stores the URL in the allocated-URLs queue. When a crawler node completes the crawl of a URL, the crawler feedback module looks up the URL in the allocated-URLs queue, deletes it from that queue, and updates the node's weight accordingly.
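The master-node workflow above can be sketched as follows. This is a minimal illustration, not the patented implementation: the patent publishes no source code, so the class and method names, the use of Python, and the choice of a smooth weighted round-robin counter for node selection are all assumptions made for the example.

```python
from collections import deque


class MasterNode:
    """Sketch of the master control node: a node table (node id -> weight),
    a to-crawl queue, an allocated-URLs queue, and a done list standing in
    for the three URL queues described in the embodiment."""

    def __init__(self):
        self.node_table = {}     # node id -> current weight
        self.to_crawl = deque()  # URLs waiting to be scheduled
        self.allocated = {}      # URL -> node id it was assigned to
        self.done = []           # URLs whose crawl was confirmed
        self._counters = {}      # per-node weighted round-robin state

    def register(self, node_id, initial_weight=1):
        """A crawler node connecting for the first time gets an initial weight."""
        self.node_table[node_id] = initial_weight
        self._counters[node_id] = 0

    def schedule_one(self):
        """Take one URL, pick a node by weighted round-robin, and record
        the assignment in the allocated-URLs queue."""
        if not self.to_crawl or not self.node_table:
            return None
        url = self.to_crawl.popleft()
        # Smooth weighted round-robin: bump every counter by its weight,
        # pick the highest counter, then charge it the total weight.
        for node_id, weight in self.node_table.items():
            self._counters[node_id] += weight
        chosen = max(self._counters, key=self._counters.get)
        self._counters[chosen] -= sum(self.node_table.values())
        self.allocated[url] = chosen
        return url, chosen

    def feedback(self, url, new_weight=None):
        """Crawler feedback: delete the URL from the allocated-URLs queue
        and optionally update the reporting node's weight."""
        node_id = self.allocated.pop(url)
        self.done.append(url)
        if new_weight is not None:
            self.node_table[node_id] = new_weight


master = MasterNode()
master.register("A", initial_weight=2)  # faster node, higher weight
master.register("B", initial_weight=1)
master.to_crawl.extend(["http://example.com/1",
                        "http://example.com/2",
                        "http://example.com/3"])
while master.to_crawl:
    url, node = master.schedule_one()
    # ... node crawls url, then reports back ...
    master.feedback(url)
```

With weights 2 and 1, node "A" receives two of the three URLs and node "B" one, matching the proportional-share intent of weighted round-robin.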



Abstract

A distributed crawler task scheduling method based on a weighted round-robin algorithm includes at least the following steps:
(1) According to scale, web crawlers are divided into five types: stand-alone multi-thread crawlers, homogeneous centralized crawlers, heterogeneous centralized crawlers, small-scale distributed crawlers and large-scale distributed crawlers.
(2) A master-slave architecture is deployed.
(3) When a crawler node connects to the master node for the first time, the master node assigns it an initial weight.
(4) Following the weighted round-robin scheduling algorithm, the master node repeatedly chooses a crawler node and assigns it a URL (Uniform Resource Locator) task to be crawled.
(5) Each time a crawler node finishes crawling a URL task, it returns the result to the master node, which then updates the node's weight; and so on.
The distributed crawler scheduling policy based on the weighted round-robin algorithm put forward by the invention is designed for small-scale distributed crawlers; it ensures the load balance of the crawler nodes as well as their flexible scalability and fault tolerance.
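The node selection in step (4) can be illustrated with the "smooth" weighted round-robin variant (the scheme popularized by Nginx's upstream balancing). This is a sketch under assumptions: the patent does not publish its exact selection formula, and the node names and weights below are invented for the example.

```python
def smooth_wrr(weights, n_picks):
    """Smooth weighted round-robin: on each pick, add every node's weight
    to its running counter, choose the node with the highest counter, then
    subtract the total weight from the chosen node's counter.  Over time
    each node is picked in proportion to its weight, without long bursts."""
    current = {node: 0 for node in weights}
    total = sum(weights.values())
    picks = []
    for _ in range(n_picks):
        for node, weight in weights.items():
            current[node] += weight
        chosen = max(current, key=current.get)
        current[chosen] -= total
        picks.append(chosen)
    return picks


# Three crawler nodes with weights 3, 1, 1: over 5 picks, "A" is chosen
# 3 times and the picks are interleaved rather than consecutive.
order = smooth_wrr({"A": 3, "B": 1, "C": 1}, 5)
print(order)
```

Updating a node's weight after each feedback (step 5) simply changes the increments used on subsequent picks, so faster or healthier nodes gradually receive a larger share of URL tasks.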

Description

Technical Field

[0001] The invention relates to the technical field of network search.

Background Technique

[0002] A search engine can be divided into several parts such as crawler, indexer, retriever and user interface. Among them, crawlers are responsible for continuously searching and collecting information on the Internet, and play an important role in search engines. With the rapid development of the network, information is increasing rapidly. The crawling capabilities of traditional simple stand-alone web crawlers and centralized web crawlers can no longer keep up with the growth rate of information on the Internet. Today, when the concept of distributed computing is mentioned more and more, distributed crawlers have naturally become a solution to the problem of large data volumes. Distributed crawlers are composed of multiple nodes deployed in a wide area network, and can crawl in parallel to meet people's needs for crawler capabilities. Due to the different crawling...

Claims


Application Information

IPC(8): G06F9/48, G06F9/50
Inventor: 蒋昌俊 (Jiang Changjun), 陈闳中 (Chen Hongzhong), 闫春钢 (Yan Chungang), 丁志军 (Ding Zhijun), 王鹏伟 (Wang Pengwei), 孙海春 (Sun Haichun), 邓晓栋 (Deng Xiaodong), 葛大劼 (Ge Dajie)
Owner: TONGJI UNIV