Data acquisition system and method based on scrapy crawler framework

A data acquisition system based on crawler technology, applied in the fields of network data indexing, network data retrieval, and other database retrieval. It addresses the problems of slow crawling speed and exhaustion of single-machine memory, so as to ensure reliability, widen crawling breadth, and improve crawling stability.

Pending Publication Date: 2020-05-29
QINGDAO NAT LAB FOR MARINE SCI & TECH DEV CENT

AI Technical Summary

Problems solved by technology

Therefore, it is likely to cause the memory of the stand-alon

Example Embodiment

[0023] The specific embodiments of the present invention will be described in further detail below in conjunction with the accompanying drawings.

[0024] As shown in Figure 1, the data acquisition system based on the scrapy crawler framework proposed by the present invention includes a crawler queue module 1, a crawler execution module 2, and a task scheduling module 3. The crawler queue module 1 includes a crawler seed queue 11, a crawler seed processing unit 12, and a crawler task queue 13; the crawler execution module 2 includes a web page download unit 21 and a URL mining unit 22; the task scheduling module 3 includes a crawler process queue 31 and a process manager 32.
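The module layout in paragraph [0024] can be pictured as a small set of nested containers. The following is only a sketch under stated assumptions: the class and field names are hypothetical, the download and URL mining units are modelled as plain callables, and nothing here is Scrapy's API or the patent's actual implementation. The process manager 32 is left out because this excerpt does not describe its behaviour.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List


@dataclass
class CrawlerQueueModule:
    """Module 1. Unit 12 (seed processing) moves tasks from seed_queue to task_queue."""
    seed_queue: Deque[str] = field(default_factory=deque)  # 11: raw tasks, may repeat
    task_queue: Deque[str] = field(default_factory=deque)  # 13: tasks ready to run


@dataclass
class CrawlerExecutionModule:
    """Module 2, with its two units modelled as callables (an assumption)."""
    download: Callable[[str], str]              # 21: URL -> page content
    mine_urls: Callable[[str, str], List[str]]  # 22: (URL, page) -> new URLs


@dataclass
class TaskSchedulingModule:
    """Module 3. Only the process queue 31 is modelled here."""
    process_queue: deque = field(default_factory=deque)  # 31
    # Process manager 32 is intentionally omitted (not described in this excerpt).


@dataclass
class DataAcquisitionSystem:
    queues: CrawlerQueueModule
    execution: CrawlerExecutionModule
    scheduling: TaskSchedulingModule


# Wire the three modules together with placeholder units.
system = DataAcquisitionSystem(
    queues=CrawlerQueueModule(),
    execution=CrawlerExecutionModule(
        download=lambda url: "",
        mine_urls=lambda url, html: [],
    ),
    scheduling=TaskSchedulingModule(),
)
system.queues.seed_queue.append("https://example.com/")
```

Keeping the two queues as separate fields mirrors the reference numerals 11 and 13, so that a deduplication step can sit between them.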

[0025] The crawler seed queue 11 stores crawler tasks, including but not limited to crawler tasks issued by users and new crawler tasks submitted by the crawler execution module 2; the crawler seed processing unit 12 takes the crawler tasks in the crawler seed queue 11 and applies deduplication screening and pr...
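As an illustration of the deduplication screening performed between the seed queue and the task queue, here is a minimal sketch. It assumes that tasks are plain URL strings, that URLs differing only in their `#fragment` count as duplicates, and that an in-memory `set` is an acceptable record of seen URLs; none of these choices come from the patent text, and `process_seeds` is a hypothetical name.

```python
from collections import deque
from urllib.parse import urldefrag


def process_seeds(seed_queue, task_queue, seen):
    """Move unseen URLs from seed_queue to task_queue and return how many moved."""
    moved = 0
    while seed_queue:
        url = seed_queue.popleft()
        # Assumption: the fragment does not identify a different page.
        key, _fragment = urldefrag(url)
        if key in seen:
            continue  # duplicate task, drop it
        seen.add(key)
        task_queue.append(key)
        moved += 1
    return moved


seeds = deque([
    "https://example.com/a",
    "https://example.com/a#top",   # same page as the first entry
    "https://example.com/b",
    "https://example.com/a",       # exact repeat
])
tasks = deque()
seen = set()
moved = process_seeds(seeds, tasks, seen)
```

Here four seeds collapse to two tasks. The `seen` set is passed in rather than created inside the function so that it persists across calls, which matters because new tasks keep arriving from the crawler execution module.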

Abstract

The invention discloses a data acquisition system and method based on the scrapy crawler framework. The data acquisition system comprises a crawler queue module and a crawler execution module. The crawler queue module comprises a crawler seed queue, a crawler seed processing unit, and a crawler task queue: the crawler seed queue stores crawler tasks, and the crawler seed processing unit performs deduplication screening on the crawler tasks in the crawler seed queue and stores the screened tasks in the crawler task queue. The crawler execution module comprises a web page download unit and a URL mining unit: the web page download unit reads the crawler task currently due for execution from the crawler task queue and downloads the web page for that task, and the URL mining unit extracts new URL links from the downloaded web page and stores them in the crawler seed queue as new crawler tasks. This enables deep mining of website domain names in a specific field and widens the crawling range of the system.
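The loop described in the abstract (seed queue, deduplication, task queue, download, URL mining, back to the seed queue) can be sketched end to end with the Python standard library. This is an illustration under assumptions, not the patented implementation and not Scrapy code: `LinkExtractor`, `mine_urls`, `crawl`, and the in-memory `FAKE_WEB` dictionary standing in for real HTTP downloads are all hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def mine_urls(base_url, html):
    """URL mining unit: return the links found in one downloaded page."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical three-page site used instead of real network access.
FAKE_WEB = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/">home</a> <a href="/b">B</a>',
    "https://example.com/b": '<a href="/a">A</a>',
}


def crawl(start_url, download, max_pages=100):
    seed_queue = deque([start_url])
    task_queue = deque()
    seen = set()
    visited = []
    while (seed_queue or task_queue) and len(visited) < max_pages:
        # Seed processing: keep only URLs that were never queued before.
        while seed_queue:
            url = seed_queue.popleft()
            if url not in seen:
                seen.add(url)
                task_queue.append(url)
        if not task_queue:
            break
        # Download unit: take the next task and fetch its page.
        url = task_queue.popleft()
        html = download(url)
        visited.append(url)
        # URL mining unit: feed discovered links back as new seeds.
        seed_queue.extend(mine_urls(url, html))
    return visited


visited = crawl("https://example.com/", lambda url: FAKE_WEB.get(url, ""))
```

Because every mined link passes through the deduplication step before it reaches the task queue, the mutual links between the three pages do not cause the crawl to loop. The `max_pages` cap is an added safeguard for the sketch.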

Description

Technical field

[0001] The invention belongs to the technical field of data collection, and in particular relates to a data collection system and method based on a scrapy crawler framework.

Background

[0002] The rapid development of information network technology has brought about an exponential increase in the amount of network information. Under the condition of sufficient network information resources, in order to obtain relevant network information quickly and pertinently, the search engine was born.

[0003] A search engine refers to the use of specific computer programs to automatically collect information from the Internet according to a certain strategy, organize and process the information, and provide users with retrieval services. The process of search engines collecting information from the Internet depends on the crawling of relevant website information by web spiders. A web spider is a program that automatically browses the web and analyzes web co...

Application Information

IPC(8): G06F16/951, G06F16/955
CPC: G06F16/951, G06F16/9566
Inventor: 魏志强, 贾东宁, 聂为之, 刘安安, 苏育挺
Owner: QINGDAO NAT LAB FOR MARINE SCI & TECH DEV CENT