Method and system for scheduling tasks of distributed network crawlers

A task scheduling technique for distributed web crawlers that addresses the complexity of existing distributed web crawler scheduling mechanisms.

Status: Inactive
Publication Date: 2014-01-15
SHENZHEN COSHIP ELECTRONICS CO LTD
1 Cites, 18 Cited by

AI Technical Summary

Problems solved by technology

[0005] The purpose of the present invention is to provide a distributed web crawler task scheduling method and system that address the technical problem that existing distributed web crawler scheduling mechanisms are complex.

Examples

Embodiment Construction

[0020] In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit the present invention.

[0021] Referring to Figure 1, Figure 1 is a flowchart of a method for scheduling distributed web crawler tasks provided by an embodiment of the present invention. As shown in Figure 1, the distributed web crawler task scheduling method 100 includes:

[0022] Step S101: Configure the distributed web crawler cluster. Specifically, the central server configures the total number of web crawler clusters, configures the serial number of each web crawler (hereinafter referred to as: crawler), and performs the same task configuration for each crawler according to the crawler configuration file. Wherein, ...
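
The configuration in step S101 can be pictured with a small sketch. The names `ClusterConfig`, `CrawlerConfig`, and the `max_depth` setting are hypothetical and do not come from the patent text; the sketch only assumes that the central server knows the total number of crawlers, numbers them consecutively from 0, and gives every crawler the same task settings.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CrawlerConfig:
    serial_number: int   # serial number assigned by the central server (assumed 0-based)
    task_settings: dict  # task configuration, identical for every crawler


@dataclass
class ClusterConfig:
    total_crawlers: int
    task_settings: dict
    crawlers: list = field(default_factory=list)

    def __post_init__(self):
        # Number the crawlers 0..N-1 and copy the same settings to each one.
        self.crawlers = [
            CrawlerConfig(serial_number=i, task_settings=dict(self.task_settings))
            for i in range(self.total_crawlers)
        ]


# Hypothetical usage: a three-crawler cluster sharing one task configuration.
cluster = ClusterConfig(total_crawlers=3, task_settings={"max_depth": 2})
```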

Abstract

The invention belongs to the technical field of Internet search engines and provides a method and system for scheduling tasks of distributed web crawlers. The method comprises the following steps: configuring a distributed web crawler cluster; a first crawler parsing the web page corresponding to a first-layer link and extracting the second-layer links present in that page; distributing the crawl task corresponding to each second-layer link according to a consistent hashing algorithm; if a second-layer link is assigned to a crawler other than the first crawler, recording the corresponding crawl task in the crawl task file of the crawler with the corresponding serial number; packaging the crawl task files and uploading them to a shared directory at preset time intervals; and each crawler periodically fetching and executing its corresponding crawl tasks from the shared directory. The invention realizes cooperative scheduling of distributed web crawler tasks through the shared directory, so that tasks can be distributed evenly among the crawlers.
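
The distribution step in the abstract can be illustrated with a minimal consistent-hashing sketch. The patent text shown here does not specify the hash function, the use of virtual nodes, or the layout of the task files, so the MD5 hash, the `replicas` parameter, and the names `HashRing` and `distribute` are all assumptions made for illustration. Packaging the task files and uploading them to the shared directory is left out.

```python
import bisect
import hashlib
from collections import defaultdict


def _hash(key: str) -> int:
    """Map a string to a point on the ring (MD5 is an arbitrary, assumed choice)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring that maps URLs to crawler serial numbers."""

    def __init__(self, total_crawlers: int, replicas: int = 100):
        points = []
        for serial in range(total_crawlers):
            # Several virtual points per crawler smooth out the distribution.
            for r in range(replicas):
                points.append((_hash(f"crawler-{serial}#{r}"), serial))
        points.sort()
        self._keys = [p for p, _ in points]
        self._owners = [s for _, s in points]

    def owner(self, url: str) -> int:
        # First ring point clockwise from the URL's hash, wrapping around.
        idx = bisect.bisect_right(self._keys, _hash(url)) % len(self._keys)
        return self._owners[idx]


def distribute(links, my_serial, ring):
    """Split second-layer links into local tasks and per-crawler task lists."""
    local, outbound = [], defaultdict(list)
    for url in links:
        target = ring.owner(url)
        if target == my_serial:
            local.append(url)  # this crawler keeps the task
        else:
            # Would be recorded in the task file of crawler `target`.
            outbound[target].append(url)
    return local, dict(outbound)
```

In this sketch, crawler 0 would call `distribute(second_layer_links, 0, HashRing(total_crawlers))`, crawl the `local` list itself, and write each `outbound` list to the task file for the matching serial number before the next scheduled upload.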

Description

Technical field

[0001] The invention belongs to the technical field of Internet search engines, and in particular relates to a method and system for scheduling distributed web crawler tasks.

Background art

[0002] With the rapid development of Internet technology, search engines play an increasingly important role in Internet search services. The web crawler is a very important part of the search engine system. It is responsible for collecting web pages from the Internet, and these pages are used to build the indexes that support search engines. Given the explosive growth of information on the network, centralized stand-alone crawlers have long been unable to keep up with the current scale of Internet information, so high-performance distributed web crawler systems have become a focus of research in the field of information collection.

[0003] The overall design of a distributed web crawler focuses on how the crawlers communicate. At present, dist...

Application Information

IPC(8): G06F17/30
CPC: G06F16/951
Inventor: 何学敏
Owner: SHENZHEN COSHIP ELECTRONICS CO LTD