
Distributed crawler system

A distributed crawler system, applied in the field of web crawlers, that addresses the problems of traditional centralized crawlers: bottlenecks in web page coverage and crawl time, insufficient system scheduling capability, and poor crawling performance.

Status: Inactive
Publication date: 2019-05-31
ZHONGXIANGBOQIAN INFORMATION TECH CO LTD
Cites: 5; cited by: 2

AI Technical Summary

Problems solved by technology

[0003] In view of this, the purpose of the present invention is to provide a distributed crawler system that overcomes the limitations of traditional centralized web crawlers: bottlenecks in web page coverage and crawl time, insufficient system scheduling capability, and poor crawling performance.



Examples


Detailed Description of the Embodiments

[0034] To make the purpose, technical solution, and advantages of the present invention clearer, the technical solution is described in detail below. The described embodiments are only some of the embodiments of the present invention, not all of them. All other implementations obtained by persons of ordinary skill in the art on the basis of these embodiments, without creative effort, fall within the protection scope of the present invention.

[0035] Figure 1 is a structural diagram of Embodiment 1 of the distributed crawler system of the present invention. As shown in Figure 1, the distributed crawler system of this embodiment may include: a Uniform Resource Locator (URL) reading and writing module 11, a URL grabbing module 12, a document parsing module 13, and a persistence module 14.

[0036] Specifically, the distributed crawler system of this embodi...


Abstract

The invention relates to a distributed crawler system comprising a URL read-write module, a URL capture module, a document analysis module, and a persistence module. The URL read-write module runs on the Map side of MapReduce, reading URLs from the input stream and writing them to the output stream. The URL capture module takes each URL written to the output stream as an access address and, on the Map side, downloads the target document at that address using a preset network access mode. The document analysis module, on the Map side, extracts target data from the target document according to a preset mode. The persistence module, on the Map side, stores the target data in the Hadoop Distributed File System according to a preset path and persistence rule. In this scheme the distributed crawler system is modularized and the modules interact by passing data to one another, which improves the extensibility, usability, and maintainability of the system, improves its scheduling capability, and allows its crawling performance to be fully exploited.
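The four-stage data flow in the abstract can be sketched as a single map-side task. This is a minimal illustration under stated assumptions, not the patented implementation and not Hadoop API code: every function name is hypothetical, an injected `downloader` callable stands in for the "preset network access mode", a regular expression stands in for the "preset mode" of extraction, and a local directory with SHA-1 file names stands in for HDFS and the "preset path and persistence rule".

```python
import hashlib
import io
import re
import tempfile
from pathlib import Path
from typing import Callable, List, TextIO


def read_write_urls(input_stream: TextIO, output_stream: TextIO) -> None:
    """URL read-write stage: copy non-empty URL lines from input to output."""
    for line in input_stream:
        url = line.strip()
        if url:
            output_stream.write(url + "\n")


def capture(url: str, downloader: Callable[[str], str]) -> str:
    """URL capture stage: download the document at the URL.

    The downloader is injected (assumption) so the network access mode
    can be swapped without touching the other stages.
    """
    return downloader(url)


def analyse(document: str, pattern: str) -> List[str]:
    """Document analysis stage: extract target data with a regex (assumption)."""
    return re.findall(pattern, document)


def persist(url: str, data: List[str], root: Path) -> Path:
    """Persistence stage: one file per URL under root.

    A local directory stands in for HDFS; naming files by the SHA-1 of
    the URL is an illustrative persistence rule, not the patent's.
    """
    root.mkdir(parents=True, exist_ok=True)
    name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".txt"
    path = root / name
    path.write_text("\n".join(data), encoding="utf-8")
    return path


def map_task(input_stream: TextIO, downloader: Callable[[str], str],
             pattern: str, root: Path) -> List[Path]:
    """Chain the four stages; each stage only sees the previous stage's data."""
    url_stream = io.StringIO()
    read_write_urls(input_stream, url_stream)
    url_stream.seek(0)
    written = []
    for line in url_stream:
        url = line.strip()
        document = capture(url, downloader)
        written.append(persist(url, analyse(document, pattern), root))
    return written


if __name__ == "__main__":
    # Canned pages replace real network access for the demonstration.
    pages = {"http://example.com/a": "<title>Page A</title>"}
    with tempfile.TemporaryDirectory() as tmp:
        for p in map_task(io.StringIO("http://example.com/a\n"),
                          pages.__getitem__, r"<title>(.*?)</title>", Path(tmp)):
            print(p.name, p.read_text(encoding="utf-8"))
```

Because the stages communicate only through the data they pass along, any one of them can be replaced (for example, a different downloader or storage backend) without changing the others, which is the modularity the abstract describes.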

Description

Technical field

[0001] The invention relates to the technical field of web crawlers, and in particular to a distributed crawler system.

Background

[0002] The advent of the Internet era has brought a rapid expansion in the amount of information, and big data and cloud computing have emerged in response. Internet companies, large telecommunications companies, and sales companies generate huge volumes of logs and user behavior information every day. The characteristics of big data, such as huge data volume, complex data types, low value density, and fast processing speed, leave traditional centralized web crawlers limited by bottlenecks in web page coverage and crawl time and by insufficient system scheduling capability, resulting in poor crawling performance.

Summary of the invention

[0003] In view of this, the purpose of the present invention is to provide a distributed crawler system to overcome the limitations of the current...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/951, G06F16/953, G06F16/9535, G06F16/182
Inventor: 张跃进, 胡勇, 喻蒙, 王猛, 王娟, 杜飞
Owner ZHONGXIANGBOQIAN INFORMATION TECH CO LTD