Distributed spider system and periodical increment capture method

A crawler system built on distributed technology, applied in the field of efficient collection of Internet big data. It addresses multi-node task distribution, web page repetition, and periodic incremental crawling, with the effects of reduced development cost, improved availability, and a simple architecture.

Active Publication Date: 2017-09-22
NANJING UNIV

AI Technical Summary

Problems solved by technology

[0004] Compared with a traditional stand-alone crawler, a distributed crawler can significantly improve crawling efficiency, but it also introduces new problems: multi-node task distribution in a distributed environment, load balancing, web page repetition, and periodic incremental crawling.



Embodiment Construction

[0038] In order to better understand the technical content of the present invention, specific embodiments are given together with the accompanying drawings for further description.

[0039] Figure 1 is an architecture diagram of the distributed crawler system of the present invention. The system includes three parts: a ZooKeeper-based distributed service, the system components, and the databases. The ZooKeeper-based distributed service provides distributed coordination for each system component. The system components are the system monitoring component Monitor, the coordination component Coordinator, the log collection component Logger, and the basic crawler component Spider. The databases include a Redis in-memory database and other databases that store the captured web pages; the distributed URL task queue and the distributed BloomFilter are stored in the Redis in-memory database.
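
The pairing of a URL task queue with a BloomFilter described in [0039] can be illustrated with a minimal Python sketch. It is not the patent's implementation: the source keeps both structures in Redis, while this sketch uses in-memory stand-ins so it runs without a server. The class names `BloomFilter` and `UrlFrontier`, the bit-array size, and the hash scheme are all assumptions made for illustration.

```python
import hashlib
from collections import deque


class BloomFilter:
    """Fixed-size Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive k bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


class UrlFrontier:
    """In-memory stand-in (assumption) for the shared queue plus BloomFilter."""

    def __init__(self):
        self.queue = deque()
        self.seen = BloomFilter()

    def push(self, url: str) -> bool:
        """Enqueue url unless the filter reports it as probably seen."""
        if url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def pop(self):
        """Return the next URL, or None when the queue is empty."""
        return self.queue.popleft() if self.queue else None
```

Because a Bloom filter can report false positives, a frontier built this way may occasionally skip a URL it has not actually seen; that is the usual trade-off for the small, fixed memory footprint.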

[0040] The distributed service based on ZooKeeper coordinates with each system compon...


Abstract

The invention discloses a distributed spider system. The system comprises three parts: a ZooKeeper-based distributed service, system components, and databases. The system components are the system monitoring component Monitor, the coordination component Coordinator, the log collection component Logger, and the basic spider component Spider. The databases include a Redis in-memory database, which stores data as key-value pairs and holds the distributed URL task queue and the distributed BloomFilter. The invention further discloses a periodic incremental capture method based on the system. In this method, the Coordinator component periodically imports tasks into the distributed URL task queue and wakes any dormant Spider component, and each Spider component either goes dormant or performs periodic incremental capture, depending on the execution state of the current distributed URL task queue. The system and method effectively combine stand-alone spiders into a distributed spider with high availability, high stability, and high throughput in a cluster environment, and realize periodic incremental capture.
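
The dormancy and wake-up cycle described in the abstract can be sketched as a small single-process simulation. This is only an illustration under stated assumptions: the class shapes, the method names `start_cycle`, `wake`, and `step`, and the use of a plain `deque` in place of the Redis-backed queue are hypothetical, and a real deployment would trigger `start_cycle` from a timer and coordinate through ZooKeeper. Deduplication is left out.

```python
from collections import deque


class Spider:
    """Toy spider: processes queued URLs, goes dormant when the queue is empty."""

    def __init__(self, queue: deque):
        self.queue = queue
        self.dormant = False
        self.fetched = []  # URLs this spider has processed, in order

    def wake(self) -> None:
        self.dormant = False

    def step(self) -> None:
        """Process one URL, or go dormant if there is nothing to do."""
        if self.dormant:
            return
        if not self.queue:
            self.dormant = True
            return
        # A real spider would download and parse the page here.
        self.fetched.append(self.queue.popleft())


class Coordinator:
    """Toy coordinator: re-imports seed tasks each period and wakes spiders."""

    def __init__(self, queue: deque, seeds, spiders):
        self.queue = queue
        self.seeds = list(seeds)
        self.spiders = list(spiders)

    def start_cycle(self) -> None:
        # Import this period's tasks, then wake every dormant spider.
        self.queue.extend(self.seeds)
        for spider in self.spiders:
            spider.wake()
```

Calling `start_cycle()` once per period and then repeatedly calling `step()` on each spider reproduces the pattern in the abstract: spiders drain the queue, go dormant, and resume when the next batch of tasks is imported.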

Description

Technical field

[0001] The invention relates to the technical field of efficient collection of Internet big data, in particular to a distributed crawler system and a periodic incremental crawling method.

Background technique

[0002] A web crawler starts from the URLs (Uniform Resource Locators) of one or several initial web pages and obtains the URLs found on those pages. It keeps extracting new URLs and putting them into the task queue until the stop condition of the system is met.

[0003] With the rapid development of the Internet, network data has grown explosively, and network data sources are becoming more and more diverse. Faced with such large and varied Internet data, it is very important to improve the crawling efficiency of web crawlers and to implement customizable crawling strategies for different data sources.

[0004] Compared with the traditional stand-alone crawler, the distributed crawler can significantly impro...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/951
Inventors: 张雷, 韩建军, 张文哲, 谭龙海, 王崇骏
Owner: NANJING UNIV