Distributed crawler architecture based on Kafka and Quartz and implementation method thereof

A distributed crawler technology, used in other database retrieval, program startup/switching, inter-program communication, etc.

Active Publication Date: 2016-06-15

INSPUR SOFTWARE CO LTD

1 Cites, 29 Cited by

Problems solved by technology

[0005] For distributed crawlers, two key issues must be solved: first, the multi-node distribution of crawl-queue messages; second, timed crawling. Different distributed crawler architectures solve these two problems in different ways. Because the distributed crawler architecture is often a core trade secret of each company, the specific implementation details of distributed crawlers are generally not disclosed.

Commonly used open-source distributed crawlers include GoogleCrawler, Mercator, Nutch, etc., but open-source distributed crawlers offer limited customization and cannot adequately meet changing crawling needs.

Examples

Embodiment 1

[0034] A distributed crawler architecture based on Kafka and Quartz according to the present invention comprises a basic crawler component, a URL storage queue, a Kafka-based URL message distribution mechanism, a Quartz-based crawler job scheduling mechanism, and a front-end console. The basic crawler component is an open-source stand-alone crawler component made up of three parts: URL generation by page parsing, a URL filter, and page crawling. URL generation by page parsing extracts URL links from the current page. The URL filter filters the generated URL links according to the crawling rules to obtain the links that satisfy them. Page crawling crawls the URL links that satisfy the crawling rules and allows the crawled page content to be customized. The URL storage queue uses an in-memory database, which stores the queue of URL messages to be crawled and the URLs that have already been crawled, so as to realize t...
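The three parts of the basic crawler component and the URL storage queue described in [0034] can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the names `LinkExtractor`, `extract_urls`, `filter_urls`, and `UrlStore` are hypothetical, the crawling rule is reduced to "same host and path prefix", and plain Python collections stand in for the in-memory database. It is not the patent's implementation.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects href values of <a> tags, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def extract_urls(base_url, html):
    """Page parsing: return every link found in the HTML, in page order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def filter_urls(urls, allowed_host, path_prefix="/"):
    """URL filter with a hypothetical rule: same host and a given path prefix."""
    kept = []
    for url in urls:
        parts = urlparse(url)
        if parts.netloc == allowed_host and parts.path.startswith(path_prefix):
            kept.append(url)
    return kept


class UrlStore:
    """In-memory stand-in for the URL storage queue (to-crawl and crawled)."""

    def __init__(self):
        self.pending = deque()   # URLs waiting to be crawled
        self.crawled = set()     # URLs already handed out for crawling
        self._known = set()      # every URL ever queued, for de-duplication

    def offer(self, url):
        """Queue a URL unless it was queued before; return True if queued."""
        if url in self._known:
            return False
        self._known.add(url)
        self.pending.append(url)
        return True

    def take(self):
        """Pop the next URL to crawl and record it as crawled, or None."""
        if not self.pending:
            return None
        url = self.pending.popleft()
        self.crawled.add(url)
        return url


# Small demonstration on a hypothetical page.
page_url = "http://site.example/index.html"
html = ('<a href="/news/1">one</a> <a href="http://other.example/x">ext</a> '
        '<a href="/about">about</a> <a href="/news/1">again</a>')
store = UrlStore()
for url in filter_urls(extract_urls(page_url, html), "site.example", "/news"):
    store.offer(url)
print(list(store.pending))
```

In this sketch the duplicate link and the off-site link never reach the pending queue, which is the role the text assigns to the filter and to the storage queue's record of crawled URLs.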

Embodiment 2

[0037] An implementation method of the distributed crawler architecture based on Kafka and Quartz, using the architecture described in Embodiment 1, comprises the following steps:

[0038] (1) On the front-end console page, set the crawl entry, the crawling rules, the storage method for crawl results, and the scheduling rules, then select the cluster nodes to deploy to and deploy;

[0039] (2) Under the Quartz-based job scheduling mechanism, the producer job calls the basic crawler component to extract the URL links to crawl from the crawl entry, de-duplicates them, and stores them in the to-crawl queue;

[0040] (3) Under the Quartz-based job scheduling mechanism, the consumer job on each node calls the basic crawler component, obtains the URL link messages distributed to that node by the Kafka-based message distribution mechanism, parses and crawls the URL links, and stores the results in the system;

[...
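Steps (2) and (3) above can be sketched as a producer job and per-node consumer jobs. This is a simulation under stated assumptions, not the patented system: the `Broker` class is a hypothetical in-memory stand-in for a Kafka topic with one partition per node, `partition_for` uses a CRC32 hash to mimic keyed partitioning in spirit only, the jobs are called directly instead of being triggered by a Quartz schedule, and `fake_fetch` replaces real page crawling.

```python
import zlib
from collections import deque

NUM_NODES = 3  # hypothetical cluster size


def partition_for(url, num_partitions):
    """Stable hash partitioning: the same URL always maps to the same node."""
    return zlib.crc32(url.encode("utf-8")) % num_partitions


class Broker:
    """In-memory stand-in for a message topic with one partition per node."""

    def __init__(self, num_partitions):
        self.partitions = [deque() for _ in range(num_partitions)]

    def send(self, url):
        index = partition_for(url, len(self.partitions))
        self.partitions[index].append(url)

    def poll(self, node_id):
        queue = self.partitions[node_id]
        return queue.popleft() if queue else None


def producer_job(entry_links, seen, broker):
    """Step (2): de-duplicate links from the crawl entry and queue them."""
    queued = 0
    for url in entry_links:
        if url not in seen:
            seen.add(url)
            broker.send(url)
            queued += 1
    return queued


def consumer_job(node_id, broker, fetch, results):
    """Step (3): drain the URLs distributed to this node and store results."""
    handled = 0
    while (url := broker.poll(node_id)) is not None:
        results[url] = fetch(url)
        handled += 1
    return handled


def fake_fetch(url):
    """Hypothetical replacement for real page crawling."""
    return "content of " + url


# One scheduling round: the producer runs once, then every node consumes.
entry_links = [
    "http://site.example/a",
    "http://site.example/b",
    "http://site.example/a",  # duplicate, dropped by the producer
    "http://site.example/c",
]
seen, results = set(), {}
broker = Broker(NUM_NODES)
queued = producer_job(entry_links, seen, broker)
handled = [consumer_job(n, broker, fake_fetch, results) for n in range(NUM_NODES)]
print(queued, handled)
```

Because each URL is hashed to exactly one partition, no two nodes crawl the same link within a round, which is the property the multi-node distribution step needs. In a real deployment the scheduler would trigger both jobs repeatedly according to the configured scheduling rules.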

Abstract

The invention discloses a distributed crawler architecture based on Kafka and Quartz and an implementation method thereof, and belongs to the technical field of computer data mining. It addresses the technical problem of how to meet the requirements of a distributed crawler by combining a stand-alone crawler architecture with distributed tools, and achieves multi-node distribution of crawl-queue messages and timed crawling. In the technical scheme, the distributed crawler architecture based on Kafka and Quartz comprises a basic crawler component, a URL storage queue, a Kafka-based URL message distribution mechanism, a Quartz-based crawler job scheduling mechanism, and a front-end console. The implementation method comprises steps in which the crawl entry, the crawling rules, the crawl-result storage method, and the scheduling rules are set through the front-end console page, and the cluster nodes to deploy to are selected and deployed.

Description

Technical field

[0001] The invention relates to the technical field of computer data mining, in particular to a distributed crawler architecture based on Kafka and Quartz and an implementation method thereof.

Background

[0002] Web crawlers are a fundamental part of search engine technology. A web crawler starts from the URLs (Uniform Resource Locators) of one or several initial web pages and obtains the URLs found on those pages. While crawling web page information, it continuously extracts new URLs from the current page according to its crawling strategy and puts them in the queue until some stop condition is met. The captured web page information is then stored on the search engine's servers, which speeds up user searches.

[0003] With the explosive growth of the Internet, the amount of data carried by the network has far exceeded people's imagination. In the era of big data, ...
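The general crawl loop described in [0002] (start from seed URLs, keep pulling new URLs into a queue, stop on some condition) can be sketched as follows. The breadth-first order, the page-count stop condition, the `crawl` function, and the `LINKS` graph are all assumptions chosen for illustration; the source does not specify a particular crawling strategy.

```python
from collections import deque

# Hypothetical link graph standing in for real web pages.
LINKS = {
    "http://site.example/a": ["http://site.example/b", "http://site.example/c"],
    "http://site.example/b": ["http://site.example/c", "http://site.example/d"],
    "http://site.example/c": ["http://site.example/a"],
    "http://site.example/d": [],
}


def crawl(seeds, get_links, max_pages):
    """Visit pages breadth-first from the seeds, up to max_pages pages."""
    queue = deque(seeds)
    seen = set(seeds)   # URLs already queued, so each page is visited once
    visited = []
    while queue and len(visited) < max_pages:  # stop condition
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


def lookup(url):
    """Return the outgoing links of a page in the hypothetical graph."""
    return LINKS.get(url, [])


order = crawl(["http://site.example/a"], lookup, max_pages=10)
print(order)
```

Swapping the deque for a priority queue or a stack would change the crawling strategy without changing the rest of the loop.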

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G06F17/30, G06F9/48, G06F9/54, H04L29/08

CPC: G06F9/4843, G06F9/546, G06F16/90, G06F16/9535, G06F2209/483, H04L67/02

Inventor: 甄教明, 王茂帅, 于文才, 高峰, 柳廷娜

Owner: INSPUR SOFTWARE CO LTD
