Web crawler system and web crawler multitask executing and scheduling method

A web crawler system and scheduling method, applied in the field of search engines, that addresses the problems of long time consumption and low efficiency, improving crawling speed and efficiency and avoiding low system reliability.

Active Publication Date: 2014-02-26
TCL CORPORATION
3 Cites, 40 Cited by

AI Technical Summary

Problems solved by technology

[0004] In view of the above-mentioned deficiencies in the prior art, the object of the present invention is to provide a web crawler system and a web crawler multi-task execution and scheduling method, aiming to solve the low efficiency and long running time of current web crawler data collection methods.

Examples

Embodiment Construction

[0039] The present invention provides a web crawler system and a web crawler multi-task execution and scheduling method. The web crawler referred to here is an artificial intelligence software program that executes a given task without interruption. To make the objectives, technical solutions and effects of the present invention clearer, the invention is described in further detail below. It should be understood that the specific embodiments described herein only explain the present invention and do not limit it.

[0040] As shown in Figure 1, a preferred embodiment of the web crawler multi-task execution and scheduling method includes:

[0041] S100. According to content and website characteristics, the content to be crawled is split at a fine granularity, a crawler parsing template file is created for each piece of split content, and the web crawler ...
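As a rough illustration of step S100, the sketch below pairs a crawler with one parsing template per narrow slice of content. It is not the patent's implementation: the `ParseTemplate` and `AcquisitionModule` classes, the regex-based template format, the canned page and the example URL are all assumptions made for this example.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class ParseTemplate:
    """Hypothetical parsing template covering one fine-grained slice of content."""
    name: str
    fields: Dict[str, str]  # field name -> regex with exactly one capture group

    def parse(self, html: str) -> Dict[str, Optional[str]]:
        record = {}
        for field_name, pattern in self.fields.items():
            match = re.search(pattern, html, re.DOTALL)
            record[field_name] = match.group(1).strip() if match else None
        return record


@dataclass
class AcquisitionModule:
    """Hypothetical acquisition module: a fetch function combined with one template."""
    template: ParseTemplate
    fetch: Callable[[str], str]

    def run(self, url: str) -> Dict[str, Optional[str]]:
        return self.template.parse(self.fetch(url))


# Canned page standing in for a real HTTP response (assumption: no network access).
PAGE = '<h1> Example title </h1><span class="price">19.90</span>'


def fake_fetch(url: str) -> str:
    return PAGE


# One template per slice of content, so each slice becomes its own small task.
title_tpl = ParseTemplate("title", {"title": r"<h1>(.*?)</h1>"})
price_tpl = ParseTemplate("price", {"price": r'<span class="price">(.*?)</span>'})

modules = [AcquisitionModule(tpl, fake_fetch) for tpl in (title_tpl, price_tpl)]
results = [module.run("http://example.invalid/item/1") for module in modules]
print(results)
```

Because each module handles only one slice, the resulting tasks are small and independent of one another, which is what allows them to be scheduled concurrently.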

Abstract

The invention discloses a web crawler system and a web crawler multitask executing and scheduling method. The method includes: A, according to content and website characteristics, subjecting the content to be crawled to fine-grained segmentation, creating a crawler parsing template file for each segment, and combining web crawlers with the respective crawler parsing template files to form acquisition modules that execute crawling tasks; B, deploying the web crawlers on multiple node servers, each of which is provided with a scheduler for scheduling the crawling tasks; C, having the schedulers invoke the associated acquisition modules according to a predefined scheduling strategy to execute the crawling tasks and acquire data. Fine-grained segmentation of the crawled content enables highly concurrent task execution; a load balancing strategy makes full use of server resources and clearly improves crawling efficiency; in addition, the low system reliability caused by single-machine faults is avoided, so the system runs with high reliability.
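Steps B and C can be sketched as a least-loaded dispatcher that skips failed node servers. This is only one possible load balancing strategy, chosen for illustration; the abstract does not say which strategy the invention uses. The `Node` class, the `dispatch` function and the node and task names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Hypothetical node server with its own queue of assigned crawling tasks."""
    name: str
    alive: bool = True
    assigned: List[str] = field(default_factory=list)


def dispatch(tasks: List[str], nodes: List[Node]) -> Dict[str, List[str]]:
    """Assign each task to the live node that currently holds the fewest tasks."""
    for task in tasks:
        live = [node for node in nodes if node.alive]
        if not live:
            raise RuntimeError("no live node servers available")
        # min() returns the first of several equally loaded nodes, so ties are deterministic.
        target = min(live, key=lambda node: len(node.assigned))
        target.assigned.append(task)
    return {node.name: list(node.assigned) for node in nodes}


# node-b is marked as failed; its share of the work goes to the remaining nodes.
nodes = [Node("node-a"), Node("node-b", alive=False), Node("node-c")]
plan = dispatch(["t1", "t2", "t3", "t4"], nodes)
print(plan)
```

In this sketch a failed node receives no tasks and the others absorb the load, which mirrors the abstract's claim that a single-machine fault need not stop the system.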

Description

Technical field

[0001] The invention relates to the technical field of search engines, in particular to a web crawler system and a web crawler multi-task execution and scheduling method.

Background

[0002] With the explosive growth of Internet information, the traditional way of collecting data with web crawlers has gradually shown its disadvantages. Traditional web crawlers do not split tasks at a fine granularity when collecting data, so collection takes a long time. Because of the limits of server CPU, memory and network bandwidth, data crawling efficiency is relatively low, and single points of failure are prone to occur.

[0003] Therefore, the existing technology still needs to be improved and developed.

Contents of the invention

[0004] In view of the above-mentioned deficiencies in the prior art, the purpose of the present invention is to provide a web crawler system and a web crawler multi-task execution and scheduling method, aiming at solving the problems...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F9/46
CPC: G06F9/5083, G06F16/951
Inventor: 宋轲, 刘世才, 毛海涛
Owner: TCL CORPORATION