Distributed data crawling system, method and device, equipment and storage medium

A distributed data crawling technology, applied in the fields of network data indexing, network data retrieval, and other database retrieval. It addresses the problems of fixed job tasks that are hard to change and poor data crawling efficiency, making the task volume easy to change and speeding up crawling.

Inactive Publication Date: 2019-11-12
SHENZHEN LEXIN SOFTWARE TECH CO LTD

AI Technical Summary

Problems solved by technology

[0003] In the prior art, web crawlers are often used to crawl network data. Common web crawler implementation methods include: using Python's Requests library or Trep library to build a crawler system, using the Scrapy framework to build a single-process multi-threaded crawler system, or using th

Examples

Embodiment 1

[0039] Figure 1 is a schematic diagram of the architecture of a distributed data crawling system provided by Embodiment 1 of the present invention. This embodiment is applicable to network data crawling. Referring to Figure 1, the system may include a task queue cluster 10 and a data crawling cluster 11.

[0040] The task queue cluster 10 includes at least one terminal 101. An initial task queue and an intermediate task queue are arranged in the task queue cluster 10 and are used to save the initial crawling address and the intermediate crawling address, respectively. The data crawling cluster 11 includes at least one terminal 111, which accesses the task queue cluster 10 to obtain an initial crawling address and an intermediate crawling address and crawls the target webpage according to those addresses.
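The two-queue arrangement described in [0040] can be illustrated with a minimal in-memory sketch. It is not the patented implementation: the names `TaskQueueCluster`, `crawl_terminal`, and `fake_fetch` are hypothetical, the order in which the two queues are checked is an assumption, and routing newly discovered addresses into the intermediate queue is likewise an assumption that the text above does not state.

```python
from collections import deque
from typing import Callable, Deque, List, Optional, Tuple

# A fetch function returns (page body, addresses discovered on that page).
Fetch = Callable[[str], Tuple[str, List[str]]]


class TaskQueueCluster:
    """In-memory stand-in for the task queue cluster (hypothetical name)."""

    def __init__(self) -> None:
        self.initial: Deque[str] = deque()       # initial crawling addresses
        self.intermediate: Deque[str] = deque()  # intermediate crawling addresses

    def next_address(self) -> Optional[str]:
        """Return the next crawling address, or None if both queues are empty."""
        # Assumption: the initial queue is checked before the intermediate one.
        if self.initial:
            return self.initial.popleft()
        if self.intermediate:
            return self.intermediate.popleft()
        return None


def crawl_terminal(cluster: TaskQueueCluster, fetch: Fetch) -> List[str]:
    """One crawling terminal: pull addresses until both queues are empty."""
    pages: List[str] = []
    while (address := cluster.next_address()) is not None:
        body, discovered = fetch(address)
        pages.append(body)
        # Assumption: addresses found while crawling become intermediate tasks.
        cluster.intermediate.extend(discovered)
    return pages


# Usage with a fake site, so the sketch runs without network access.
FAKE_SITE = {
    "https://example.com/": ("home", ["https://example.com/a", "https://example.com/b"]),
    "https://example.com/a": ("page a", []),
    "https://example.com/b": ("page b", []),
}


def fake_fetch(address: str) -> Tuple[str, List[str]]:
    return FAKE_SITE[address]


cluster = TaskQueueCluster()
cluster.initial.append("https://example.com/")
pages = crawl_terminal(cluster, fake_fetch)
```

Because the terminal only depends on `next_address`, more terminals could in principle share one cluster object; a real deployment would presumably replace the in-memory deques with a networked queue service.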

[0041] In the embodiment of the present invention...

Embodiment 2

[0046] Figure 2 is a schematic diagram of the architecture of a distributed data crawling system provided by Embodiment 2 of the present invention. This embodiment is a refinement based on the above embodiments. Referring to Figure 2, the distributed data crawling system provided by the embodiment of the present invention includes a task queue cluster 20 and a data crawling cluster 21. The task queue cluster 20 includes at least one terminal and is provided with an initial task queue 202 and an intermediate task queue 201, which are used to save the initial crawling address and the intermediate crawling address, respectively. The data crawling cluster 21 includes at least one terminal, which accesses the task queue cluster to obtain the initial crawling address and the intermediate crawling address, and the target webpage is crawled according to the initi...

Embodiment 3

[0053] Figure 4 is a flow chart of the steps of a distributed data crawling method provided by Embodiment 3 of the present invention. This embodiment is applicable to the data crawling cluster of a distributed crawler system, and the data crawling cluster can be composed of multiple terminals. The method can be executed by the distributed data crawling device in the embodiment of the present invention, and the device can be implemented in software and/or hardware. The method in the embodiment of the present invention specifically includes the following steps:

[0054] Step 301: upon detecting that the initial task queue and/or the intermediate task queue provided in the task queue cluster contains a crawling address, access the task queue cluster to obtain the crawling address.
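As a rough illustration of Step 301, the sketch below polls two queues and returns the first crawling address it finds. The function name `obtain_crawl_address`, the polling interval, the timeout, and the choice to check the initial queue first are all assumptions made for the example; the text only says that an address is obtained once either queue is detected to hold one.

```python
import queue
import time
from typing import Optional


def obtain_crawl_address(
    initial_q: "queue.Queue[str]",
    intermediate_q: "queue.Queue[str]",
    poll_interval: float = 0.01,
    timeout: float = 0.1,
) -> Optional[str]:
    """Poll both queues until one yields an address or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        # Assumption: the initial queue is checked first; the source only
        # says "initial and/or intermediate".
        for q in (initial_q, intermediate_q):
            try:
                return q.get_nowait()
            except queue.Empty:
                continue
        if time.monotonic() >= deadline:
            return None  # neither queue held an address in time
        time.sleep(poll_interval)


# Usage: only the intermediate queue holds an address.
initial_q: "queue.Queue[str]" = queue.Queue()
intermediate_q: "queue.Queue[str]" = queue.Queue()
intermediate_q.put("https://example.com/list?page=2")

address = obtain_crawl_address(initial_q, intermediate_q)
nothing = obtain_crawl_address(initial_q, intermediate_q, timeout=0.05)
```

Polling with a timeout is only one way to "detect" an address; a blocking read on a networked queue would serve the same purpose.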

[0055] The initial task queue may be a queue for storing initial tasks, and terminals in the data crawling cluster may start crawling network data according to crawling addresses in the ini...

Abstract

The embodiment of the invention discloses a distributed data crawling system, method and device, equipment and a storage medium. The system provided by the embodiment of the invention comprises a task queue cluster and a data crawling cluster. The task queue cluster comprises at least one terminal, and an initial task queue and an intermediate task queue are arranged in it, used for storing an initial crawling address and an intermediate crawling address, respectively. The data crawling cluster comprises at least one terminal and is used for accessing the task queue cluster to obtain an initial crawling address and an intermediate crawling address and for crawling a target webpage according to those addresses. Because the initial task queue and the intermediate task queue are arranged separately in the task queue cluster, the task amount in the data crawling process is convenient to change, the resource scheduling difficulty is reduced, and the data crawling efficiency is improved.

Description

Technical field

[0001] The embodiments of the present invention relate to the technical field of computer applications, and in particular to a distributed data crawling system, method and device, equipment and storage medium.

Background

[0002] Society has entered the era of big data: human life is increasingly inseparable from data, and the amount of data is growing explosively. Since data contains a great deal of value, how to obtain data has become an urgent problem to be solved.

[0003] In the prior art, web crawlers are often used to crawl network data. Common web crawler implementation methods include: using Python's Requests library or Trep library to build a crawler system, using the Scrapy framework to build a single-process multi-threaded crawler system, or using the Scrapy framework and a Redis database to build a memory-based distributed crawler system. The above technical solutions have the problems of difficulty in resource scheduling and ...

Application Information

IPC(8): G06F16/951
CPC: G06F16/951
Inventor: 肖淋峰, 吴志坚
Owner: SHENZHEN LEXIN SOFTWARE TECH CO LTD