Distributed data crawling system and method and storage medium

A distributed data crawling technology, applied in network data indexing, network data retrieval, other database retrieval, etc. It addresses the performance and resource waste, reduced real-time performance, and wake-up delay of distributed crawler systems, with the effects of reducing energy consumption, avoiding performance waste, and responding to requests promptly.

Status: Inactive. Publication Date: 2019-11-15
SHENZHEN LEXIN SOFTWARE TECH CO LTD
Cites: 4. Cited by: 3.

AI Technical Summary

Problems solved by technology

[0003] Real-time crawling of unknown tasks involves a large number of crawling tasks, more than a single crawling node can bear. A distributed data crawling system is therefore often built on the scrapy crawler framework with redis storage. Because the tasks in this architecture are not known in advance, the crawler system is often idle, which wastes performance and resources. If a task request arrives suddenly, the distributed crawler system is slow to wake up and cannot respond to the request in real time, which reduces real-time performance.

Examples

Embodiment 1

[0021] Figure 1 is a schematic structural diagram of a distributed data crawling system provided by Embodiment 1 of the present invention. This embodiment applies to network data crawling. Referring to Figure 1, the system can include:

[0022] A master node 10 and at least one slave node 11. The master node 10 manages each of the slave nodes 11 and sends the crawling tasks it receives to each of the slave nodes 11. Each slave node 11 includes a crawler framework 12, and the slave node 11 crawls data under the crawler framework 12 according to the crawling task it receives.

[0023] The master node 10 can be the entry point for crawling tasks in the distributed data crawling system: the crawling tasks that the system needs to complete are obtained by the master node 10. The master node 10 can be a server or a server cluster, and it can manage the other slave nodes 11, the master node 1...
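
The master/slave relationship described in Embodiment 1 can be illustrated with a minimal in-process sketch. Everything here is an assumption made for illustration: the class names `MasterNode` and `SlaveNode`, the `register` and `dispatch` methods, and the `fake_crawler` stand-in for the crawler framework 12 do not come from the patent, and no real network transport or scrapy integration is shown.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SlaveNode:
    """Hypothetical slave node 11 wrapping a crawler framework callable."""
    name: str
    # Stand-in for the crawler framework 12 (assumption: a plain callable).
    crawler: Callable[[str], str]
    results: List[str] = field(default_factory=list)

    def receive(self, task_url: str) -> None:
        # Crawl data for the received task under the crawler framework.
        self.results.append(self.crawler(task_url))


@dataclass
class MasterNode:
    """Hypothetical master node 10 that manages slaves and forwards tasks."""
    slaves: Dict[str, SlaveNode] = field(default_factory=dict)

    def register(self, slave: SlaveNode) -> None:
        # The master keeps track of every slave node it manages.
        self.slaves[slave.name] = slave

    def dispatch(self, task_url: str) -> None:
        # Per the text, a received task is sent to each slave node.
        for slave in self.slaves.values():
            slave.receive(task_url)


def fake_crawler(url: str) -> str:
    # Placeholder with no network access; it only labels the URL.
    return f"crawled:{url}"


master = MasterNode()
master.register(SlaveNode("slave-1", fake_crawler))
master.register(SlaveNode("slave-2", fake_crawler))
master.dispatch("http://example.com/page")
```

In a real deployment the `receive` call would be a network message to a separate machine; the sketch only shows the direction of control, with the master as the single entry point for tasks.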

Embodiment 2

[0033] Figure 3 is a flow chart of the steps of a distributed data crawling method provided in Embodiment 2 of the present invention. This embodiment applies to network data crawling. The method can be performed by the distributed data crawling system of the embodiments of the present invention, which can be implemented in software and/or hardware. Referring to Figure 3, the method comprises:

[0034] Step 301: When the master node detects a crawling task, it acquires the stored routing information table.

[0035] The crawling task can be a request to crawl network data and can include the address information of the target web page. The routing information table can be a data table that stores information about the slave nodes; the master node can search the routing information table to obtain the network address of each slave node.

[0036] Specifically, the master node can conti...
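
As a rough illustration of Step 301, the routing information table can be modelled as a mapping from slave node identifiers to network addresses. The table layout, the addresses, and the helper names `slave_addresses` and `plan_dispatch` are all hypothetical; the patent text only says that the table stores slave node information and that the master looks up each slave's network address in it.

```python
from typing import Dict, List, Tuple

# Hypothetical routing information table: slave node id -> network address.
routing_table: Dict[str, str] = {
    "slave-1": "10.0.0.11:6800",
    "slave-2": "10.0.0.12:6800",
}


def slave_addresses(table: Dict[str, str]) -> List[str]:
    """Look up the network address of every slave node in the table."""
    return list(table.values())


def plan_dispatch(task: Dict[str, str], table: Dict[str, str]) -> List[Tuple[str, Dict[str, str]]]:
    """Pair the crawling task with each slave address the master would send it to."""
    return [(address, task) for address in slave_addresses(table)]


# A crawling task carrying the address of the target web page.
crawl_task = {"url": "http://example.com/page"}
plan = plan_dispatch(crawl_task, routing_table)
```

Each entry of `plan` is an (address, task) pair; the actual sending step is left out because the source does not specify a transport.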

Embodiment 3

[0045] Figure 4 is a flow chart of the steps of a distributed data crawling method provided by Embodiment 3 of the present invention. This embodiment builds on the embodiments above. Referring to Figure 4, the method includes:

[0046] Step 401: The master node checks the data interface.

[0047] The data interface may be an interface for receiving crawling tasks. When a user requests data crawling, the crawling task may be sent to the data interface of the master node.

[0048] Specifically, the master node can check the data interface in real time, for example by judging whether the data interface contains data.

[0049] Step 402: If a crawling address is detected in the data interface, determine that the master node has detected a crawling task, and use the crawling address as the crawling task; otherwise, determine ...
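
Steps 401 and 402 can be sketched as a non-blocking check on a queue. This is only an illustration under stated assumptions: the data interface is modelled as a standard-library `queue.Queue`, the function name `detect_task` is hypothetical, and the truncated "otherwise" branch is assumed to mean that no crawling task has been detected.

```python
import queue
from typing import Optional


def detect_task(data_interface: queue.Queue) -> Optional[str]:
    """Return the crawling address if one is waiting, else None.

    Assumption: an empty interface means "no crawling task detected".
    """
    try:
        # Step 401: check whether the data interface holds any data.
        address = data_interface.get_nowait()
    except queue.Empty:
        return None
    # Step 402: a non-empty crawling address is used as the crawling task.
    return address or None


interface = queue.Queue()
nothing_yet = detect_task(interface)          # no task has been submitted
interface.put("http://example.com/page")      # a user submits a crawl request
detected = detect_task(interface)
```

A master node would call a function like this in a loop or from an event handler; polling frequency and blocking behaviour are design choices the source does not specify.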

Abstract

The invention discloses a distributed data crawling system, method, and storage medium. The system comprises a master node and at least one slave node. The master node manages each slave node and sends the crawling tasks it receives to each slave node. Each slave node comprises a crawler framework and crawls data under that framework according to the crawling tasks it receives. In this technical scheme, the master node acquires the crawling task and allocates it to the slave nodes, which then crawl the data. As a result, data crawling requests are answered promptly, the delay of waking the system is reduced, the performance waste caused by an idling distributed crawling system is avoided, and energy consumption is reduced.

Description

Technical field

[0001] The embodiments of the present invention relate to the technical field of computer applications, and in particular to a distributed data crawling system, method, and storage medium.

Background

[0002] With the advent of the big data era, data has become more and more valuable, and how to obtain it has become a focus of research in the industry. Existing methods for obtaining data from the network include batch crawling of fixed tasks and real-time crawling of unknown tasks.

[0003] Real-time crawling of unknown tasks involves a large number of crawling tasks, more than a single crawling node can bear. A distributed data crawling system is therefore often built on the scrapy crawler framework with redis storage. Because the tasks in this architecture are not known in advance, the crawler system is often idle, which wastes performance and resources. If a task r...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/951
CPC: G06F16/951, Y02D10/00
Inventors: 肖淋峰, 吴志坚
Owner: SHENZHEN LEXIN SOFTWARE TECH CO LTD