Distributed crawler architecture based on Kafka and Quartz and implementation method thereof

A distributed crawler technology, applied in database retrieval, program startup/switching, inter-program communication, etc.

Active Publication Date: 2016-06-15
INSPUR SOFTWARE CO LTD

AI Technical Summary

Problems solved by technology

[0005] For distributed crawlers, two key problems must be solved: first, distributing crawl-queue messages across multiple nodes; second, timed (scheduled) crawling. Different distributed crawler architectures solve these two problems in different ways. At present, because the distributed crawler architecture is often a core secret of each company, the specific implementation details of distributed crawlers are generally not disclosed.
Commonly used open-source distributed crawlers include GoogleCrawler, Mercator, Nutch, etc., but open-source distributed crawlers offer limited customization and cannot adequately meet changing crawling needs.

Examples

Embodiment 1

[0034] In the distributed crawler architecture based on Kafka and Quartz of the present invention, the crawler architecture includes basic crawler components, a URL storage queue, a Kafka-based URL message distribution mechanism, a Quartz-based crawler job scheduling mechanism, and a front-end console. The basic crawler components are open-source stand-alone crawler components comprising page parsing and URL generation, a URL filter, and page crawling. Page parsing and URL generation is responsible for extracting URL links from the current page. The URL filter is responsible for filtering the generated URL links according to the crawling rules to obtain the URL links that satisfy those rules. Page crawling is responsible for crawling the URL links that satisfy the crawling rules and for customizing the content crawled from each page. The URL storage queue uses an in-memory database, which stores the queue of URL messages to be crawled and the URLs that have already been crawled, so as to realize t...
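The three basic crawler components and the URL storage queue described in [0034] can be illustrated with a minimal sketch. It is not the patented implementation: the names `LinkExtractor`, `extract_urls`, `filter_urls`, and `UrlStore` are hypothetical, the crawling rule is assumed to be a regular expression, and a Python deque and set stand in for the in-memory database.

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Page parsing step: collects the href of every <a> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL.
                    self.links.append(urljoin(self.base_url, value))


def extract_urls(base_url, html):
    """Return all links found on the page, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def filter_urls(urls, allow_pattern):
    """URL filter step: keep only links that match the crawling rule
    (assumed here to be a regular expression)."""
    rule = re.compile(allow_pattern)
    return [url for url in urls if rule.search(url)]


class UrlStore:
    """Stand-in for the in-memory database: a queue of URLs to crawl
    plus the set of URLs that were already queued or crawled."""

    def __init__(self):
        self.pending = deque()
        self.known = set()

    def offer(self, url):
        """Queue the URL unless it was seen before; report whether it was new."""
        if url in self.known:
            return False
        self.known.add(url)
        self.pending.append(url)
        return True
```

A page is first passed through `extract_urls`, the result through `filter_urls`, and each surviving link is handed to `UrlStore.offer`, which silently drops duplicates.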

Embodiment 2

[0037] An implementation method of a distributed crawler architecture based on Kafka and Quartz, which uses the distributed crawler architecture of Embodiment 1 and comprises the following steps:

[0038] (1) Set the crawling entry, crawling rules, crawl-result storage method, and scheduling rules through the page of the front-end console, and select the cluster nodes on which to deploy;

[0039] (2) Triggered by the Quartz-based job scheduling mechanism, the producer job calls the basic crawler components to extract crawl URL links starting from the crawling entry, deduplicates them, and stores them in the to-crawl queue;

[0040] (3) Triggered by the Quartz-based job scheduling mechanism, the consumer job on each node calls the basic crawler components, obtains the URL link messages distributed to that node by the Kafka-based message distribution mechanism, parses and crawls the URL links, and stores the results in the system;

[...
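Steps (2) and (3) above can be sketched as a producer job and per-node consumer jobs sharing a partitioned queue. This is an illustration under stated assumptions, not the patented code: `PartitionedTopic` is a hypothetical in-memory stand-in for a Kafka topic, `producer_job` and `consumer_job` are hypothetical names, each node is assumed to own exactly one partition, and the jobs are called directly where Quartz would fire them on a schedule.

```python
import zlib
from collections import deque


class PartitionedTopic:
    """In-memory stand-in for a Kafka topic: one FIFO queue per partition."""

    def __init__(self, num_partitions):
        self.partitions = [deque() for _ in range(num_partitions)]

    def send(self, key, value):
        # A deterministic hash of the key selects the partition, so the
        # same key always reaches the same node. Kafka's Java client hashes
        # keys with murmur2; crc32 is used here only for illustration.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append(value)
        return index

    def poll(self, partition):
        """Return the next message of a partition, or None if it is empty."""
        queue = self.partitions[partition]
        return queue.popleft() if queue else None


def producer_job(candidate_urls, seen, topic):
    """Step (2): deduplicate extracted URLs and publish the new ones."""
    published = 0
    for url in candidate_urls:
        if url not in seen:
            seen.add(url)
            topic.send(url, url)
            published += 1
    return published


def consumer_job(node_id, topic, fetch, results):
    """Step (3): drain this node's partition, crawl each URL, store the result."""
    handled = 0
    while True:
        url = topic.poll(node_id)
        if url is None:
            break
        results[url] = fetch(url)  # fetch is a caller-supplied crawl function
        handled += 1
    return handled
```

Because the partition is chosen from the URL itself, every URL is handled by exactly one node, and adding nodes only requires adding partitions.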

Abstract

The invention discloses a distributed crawler architecture based on Kafka and Quartz and an implementation method thereof, and belongs to the technical field of computer data mining. The technical problem addressed is how to meet the requirements of a distributed crawler by combining a stand-alone crawler architecture with distributed tools, so that multi-node distribution of crawl-queue messages and timed crawling are achieved. According to the technical scheme, the distributed crawler architecture based on Kafka and Quartz comprises basic crawler components, a URL storage queue, a Kafka-based URL message distribution mechanism, a Quartz-based crawler job scheduling mechanism, and a front-end console. The implementation method comprises the following steps: the crawling entry, crawling rules, crawl-result storage method, and scheduling rules are set through the page of the front-end console, and the cluster nodes on which to deploy are selected.

Description

Technical field

[0001] The invention relates to the technical field of computer data mining, in particular to a distributed crawler architecture based on Kafka and Quartz and an implementation method thereof.

Background technique

[0002] Web crawlers are a fundamental part of search engine technology. A web crawler starts from the URLs (Uniform Resource Locators) of one or several initial web pages and obtains the URLs found on those pages. While crawling web page information, it continuously extracts new URLs from the current page according to the crawling strategy and puts them in the queue until some stop condition is met. The captured web page information is then stored on the search engine's servers, thereby speeding up user searches.

[0003] With the explosive growth of the Internet, the amount of data carried by the network has far exceeded people's imagination. In the era of big data, ...
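The basic crawl loop described in [0002] can be sketched as follows. The sketch assumes a breadth-first crawling strategy and a page-count limit as the stop condition; the source fixes neither. `crawl`, `get_links`, and `max_pages` are hypothetical names, and `get_links` stands for whatever fetches a page and returns the URLs found on it.

```python
from collections import deque


def crawl(seed_urls, get_links, max_pages):
    """Visit pages starting from the seeds, queueing newly found URLs,
    until the queue is empty or max_pages pages have been visited."""
    queue = deque(dict.fromkeys(seed_urls))  # drop duplicate seeds, keep order
    seen = set(queue)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return visited
```

On a single machine the queue and the seen-set live in one process; the two problems named in the summary arise once they must be shared across nodes and the loop must run on a schedule.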

Application Information

Patent Type & Authority: Applications (China)
IPC (8): G06F17/30, G06F9/48, G06F9/54, H04L29/08
CPC: G06F9/4843, G06F9/546, G06F16/90, G06F16/9535, G06F2209/483, H04L67/02
Inventors: 甄教明, 王茂帅, 于文才, 高峰, 柳廷娜
Owner: INSPUR SOFTWARE CO LTD