Data acquisition system and method based on scrapy crawler framework

A data acquisition system based on crawler technology, applicable to network data indexing, network data retrieval, and other database retrieval. It addresses slow crawling speed and the exhaustion of single-machine memory, with the aim of ensuring reliability and improving crawling breadth and crawling stability.

Pending Publication Date: 2020-05-29

QINGDAO NAT LAB FOR MARINE SCI & TECH DEV CENT

AI Technical Summary

Problems solved by technology

A crawler running on a single stand-alone machine is likely to exhaust that machine's memory, which in turn slows crawling.

Embodiment Construction

[0023] Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

[0024] As shown in Figure 1, the data acquisition system based on the scrapy crawler framework proposed by the present invention includes a crawler queue module 1, a crawler execution module 2 and a task scheduling module 3. The crawler queue module 1 includes a crawler seed queue 11, a crawler seed processing unit 12 and a crawler task queue 13; the crawler execution module 2 includes a webpage download unit 21 and a URL mining unit 22; the task scheduling module 3 includes a crawler process queue 31 and a process manager 32.

[0025] The crawler seed queue 11 is used to store crawler tasks, including but not limited to crawler tasks sent by users and new crawler tasks submitted by the crawler execution module 2; the crawler seed processing unit 12 is used to deduplicate the crawler tasks in the crawle...
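The deduplication step that moves tasks from the crawler seed queue to the crawler task queue might look roughly like the sketch below. It is an illustration in plain Python, not the patent's implementation: the class name `SeedProcessor`, the in-memory set of SHA-1 fingerprints, and the normalization rule (ignoring URL fragments) are all assumptions made for the example.

```python
import hashlib
from collections import deque
from urllib.parse import urldefrag


class SeedProcessor:
    """Hypothetical stand-in for the crawler seed processing unit 12."""

    def __init__(self):
        self.seed_queue = deque()  # crawler seed queue 11: raw, possibly duplicated tasks
        self.task_queue = deque()  # crawler task queue 13: deduplicated tasks
        self._seen = set()         # fingerprints of URLs already accepted

    @staticmethod
    def fingerprint(url):
        # Assumed normalization: drop the #fragment, since it names the same page.
        clean, _fragment = urldefrag(url.strip())
        return hashlib.sha1(clean.encode("utf-8")).hexdigest()

    def submit(self, url):
        """Store a task from a user or from the crawler execution module."""
        self.seed_queue.append(url)

    def process(self):
        """Drain the seed queue; return how many new tasks reached the task queue."""
        moved = 0
        while self.seed_queue:
            url = self.seed_queue.popleft()
            fp = self.fingerprint(url)
            if fp in self._seen:
                continue  # duplicate task: screened out
            self._seen.add(fp)
            self.task_queue.append(url)
            moved += 1
        return moved
```

Keeping fingerprints rather than full URLs bounds the memory used per entry. A distributed deployment would presumably hold this set in a shared store rather than in one process, but that choice is outside what the text above specifies.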

Abstract

The invention discloses a data acquisition system and method based on the scrapy crawler framework. The data acquisition system comprises a crawler queue module and a crawler execution module. The crawler queue module comprises a crawler seed queue, a crawler seed processing unit and a crawler task queue: the crawler seed queue stores crawler tasks, and the crawler seed processing unit deduplicates and screens the crawler tasks in the crawler seed queue and stores the tasks that pass into the crawler task queue. The crawler execution module comprises a webpage downloading unit and a URL mining unit: the webpage downloading unit reads the crawler task currently due for execution from the crawler task queue and downloads a webpage based on that task, and the URL mining unit extracts new URL links from the downloaded webpage and stores them in the crawler seed queue as new crawler tasks. In this way, deep mining of website domain names in a specific field is realized and the crawling range of the system is improved.
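As a rough illustration of the URL mining unit, the sketch below extracts links from an already downloaded page and appends them to a seed queue as new tasks. It uses only the Python standard library so that it runs without Scrapy installed. The function name `mine_urls`, the restriction to one `allowed_domain`, and the sample HTML are assumptions for the example and are not taken from the patent.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def mine_urls(page_url, html, allowed_domain):
    """Return absolute http(s) links found in html that stay on allowed_domain."""
    collector = _LinkCollector()
    collector.feed(html)
    found = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links against the page
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        host = parts.hostname or ""
        # Assumed policy: keep the domain itself and its subdomains only.
        if host == allowed_domain or host.endswith("." + allowed_domain):
            found.append(absolute)
    return found


# Hypothetical downloaded page; mined links go back to the seed queue as new tasks.
seed_queue = deque()
html = (
    '<a href="/news">News</a> '
    '<a href="https://other.org/x">Other</a> '
    '<a href="mailto:a@example.com">Mail</a>'
)
seed_queue.extend(mine_urls("https://example.com/index.html", html, "example.com"))
```

Here only `https://example.com/news` is queued; the off-domain link and the `mailto:` link are dropped. Duplicates are deliberately not filtered at this stage, since the abstract assigns deduplication to the crawler seed processing unit.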

Description

Technical field

[0001] The invention belongs to the technical field of data collection, and in particular relates to a data collection system and method based on the scrapy crawler framework.

Background art

[0002] The rapid development of information network technology has brought about an exponential increase in the amount of network information. With network information resources this abundant, search engines emerged as a way to obtain relevant information quickly and in a targeted manner.

[0003] A search engine uses specific computer programs to automatically collect information from the Internet according to a certain strategy, organizes and processes that information, and provides users with retrieval services. The collection step depends on web spiders crawling the relevant websites. A web spider is a program that automatically browses the web and analyzes web co...

Application Information

IPC(8): G06F16/951, G06F16/955

CPC: G06F16/951, G06F16/9566

Inventor: 魏志强, 贾东宁, 聂为之, 刘安安, 苏育挺

Owner: QINGDAO NAT LAB FOR MARINE SCI & TECH DEV CENT
