Dual-cycle crawler system based on Spark Streaming and running method thereof

A crawler system using dual-cycle technology, applied in special data processing applications, instruments, electrical digital data processing, etc., that achieves strong versatility and stable operation and addresses the difficulty of scaling the crawler.

Inactive Publication Date: 2018-09-11
HOHAI UNIV

AI Technical Summary

Problems solved by technology

[0004] Purpose of the invention: To address the problems of crawler systems in the prior art, the present invention provides a dual-cycle crawler system and its operation method based on SparkStrea


Examples


Embodiment Construction

[0028] The present invention is further illustrated below in conjunction with specific embodiments. It should be understood that these embodiments are only used to illustrate the present invention and are not intended to limit its scope. After reading the present invention, modifications of various equivalent forms made by those skilled in the art all fall within the scope defined by the appended claims of the present application.

[0029] As shown in Figure 1, the dual-cycle crawler system based on Spark Streaming includes: a page download module, a DNS cache module, a URL distribution scheduling module, a URL extraction module, a URL deduplication module, a page scheduling module, a page analysis module, a page extraction module, a storage system and a web background.

[0030] (1) Page download module

[0031] The page download module is responsible for downloading pages. When downloading, it calls the DNS information cached in the DNS ...
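The idea of reusing cached DNS answers during downloading can be sketched as follows. This is only an illustration under stated assumptions, not the patent's implementation: the class name `DnsCache`, the `ttl_seconds` parameter and the injectable `resolver` and `clock` arguments are all hypothetical, and the source does not say how entries expire.

```python
import socket
import time
from typing import Callable, Dict, Tuple


class DnsCache:
    """Hypothetical in-memory DNS cache; entries expire after a fixed TTL."""

    def __init__(
        self,
        resolver: Callable[[str], str] = socket.gethostbyname,
        ttl_seconds: float = 300.0,  # assumed value, not from the source
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._resolver = resolver
        self._ttl = ttl_seconds
        self._clock = clock
        # host -> (resolved address, expiry time on the injected clock)
        self._entries: Dict[str, Tuple[str, float]] = {}

    def resolve(self, host: str) -> str:
        """Return a cached address if still fresh, otherwise resolve and cache."""
        now = self._clock()
        cached = self._entries.get(host)
        if cached is not None and cached[1] > now:
            return cached[0]
        address = self._resolver(host)
        self._entries[host] = (address, now + self._ttl)
        return address
```

A downloader would call `cache.resolve(host)` before opening a connection, so repeated requests to the same host skip the DNS lookup until the entry expires.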



Abstract

The invention discloses a dual-cycle crawler system based on Spark Streaming and a running method thereof. In the system, a page download module is responsible for downloading pages; during downloading it uses the DNS information cached in a DNS cache module, which accelerates DNS resolution and shortens page download time. A URL distribution scheduling module is responsible for calling a URL extraction module to extract URLs from newly downloaded pages, calling a URL deduplication module to filter out duplicate URLs, and distributing the remaining URLs to the page download module. A page scheduling module is responsible for calling a page analysis module to determine whether the current page matches an extraction model and for distributing data extraction tasks to a page extraction module, which extracts data from the current page and saves it in a distributed storage system. Parameter configuration and monitoring management of the whole crawler are both controlled from a web background. The system and its running method solve the horizontal scaling problem of traditional crawlers and have the advantages of clear module division, a stable and efficient running mechanism, concise crawler rule configuration and high generality.
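The two cycles described in the abstract can be sketched in plain Python. This is a minimal single-process illustration, not the patented implementation: a simple loop stands in for Spark Streaming micro-batches, and the function names (`run_crawler`, `extract_urls`), the callback parameters (`download`, `matches_model`, `extract_data`) and the regex-based link extraction are all assumptions made for the example.

```python
import re
from collections import deque
from typing import Callable, Deque, Iterable, List, Optional, Set, Tuple

HREF = re.compile(r'href="([^"]+)"')


def extract_urls(html: str) -> List[str]:
    """Stand-in for the URL extraction module: collect href targets."""
    return HREF.findall(html)


def run_crawler(
    seeds: Iterable[str],
    download: Callable[[str], Optional[str]],   # page download module (assumed signature)
    matches_model: Callable[[str, str], bool],  # page analysis module (assumed signature)
    extract_data: Callable[[str, str], dict],   # page extraction module (assumed signature)
    max_batches: int = 10,
) -> List[dict]:
    seed_list = list(seeds)
    seen: Set[str] = set(seed_list)             # state of the URL deduplication module
    url_queue: Deque[str] = deque(seed_list)
    page_queue: Deque[Tuple[str, str]] = deque()
    storage: List[dict] = []                    # stand-in for the distributed storage system

    for _ in range(max_batches):                # each iteration plays the role of one micro-batch
        if not url_queue and not page_queue:
            break
        # Download every URL queued for this batch.
        batch = list(url_queue)
        url_queue.clear()
        for url in batch:
            html = download(url)
            if html is not None:
                page_queue.append((url, html))
        pages = list(page_queue)
        page_queue.clear()
        for url, html in pages:
            # Cycle 1 (URL cycle): extract, deduplicate, hand back to the downloader.
            for link in extract_urls(html):
                if link not in seen:
                    seen.add(link)
                    url_queue.append(link)
            # Cycle 2 (page cycle): analyze the page, extract data, store it.
            if matches_model(url, html):
                storage.append(extract_data(url, html))
    return storage
```

In a real deployment the two queues would be distributed streams and the deduplication set would live in shared storage; the sketch only shows how newly downloaded pages feed both the URL cycle and the page cycle.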

Description

Technical field

[0001] The invention relates to a dual-cycle crawler system based on Spark Streaming and an operating method thereof, belonging to the technical field of distributed crawlers.

Background technique

[0002] A crawler system collects massive, scattered Internet data and is the basis of a search engine system. Big data has developed rapidly in recent years, not only because of the large volume of data but also because of the emphasis on analyzing full-sample data. Internet data contains a lot of valuable information and is an important data source for big data. Its organizational forms are flexible and diverse, and most websites adopt anti-crawler strategies such as dynamic loading, which poses great challenges to the information collection and storage of traditional crawlers. Traditional crawlers run on a single node, which has poor horizontal scalability. The number of crawler threads and data st...


Application Information

IPC(8): G06F17/30
Inventor: 连晓颖, 张雪洁, 王乐进, 王睿, 朱云
Owner: HOHAI UNIV