Distributed data crawling system and method and storage medium
A distributed data crawling and data retrieval technology, applied in network data indexing, network data retrieval, other database retrieval, etc. It addresses problems of distributed crawler systems such as wasted performance and resources, reduced real-time performance, and wake-up delay. The stated effects are reduced energy consumption, avoidance of wasted performance, and timely response.
Examples
Embodiment 1
[0021] Figure 1 is a schematic structural diagram of a distributed data crawling system provided by Embodiment 1 of the present invention. This embodiment is applicable to network data crawling. Referring to Figure 1, the system can include:
[0022] A master node 10 and at least one slave node 11. The master node 10 is used to manage each of the slave nodes 11 and to send the received crawling tasks to each of the slave nodes 11. Each of the slave nodes 11 includes a crawler framework 12, and the slave node 11 is used to crawl data under the crawler framework 12 according to the received crawling task.
[0023] The master node 10 can be the entry point for crawling tasks in the distributed data crawling system: the crawling tasks that the system needs to complete can be obtained by the master node 10. The master node 10 can be a server or a server cluster, and the master node 10 can manage the other slave nodes 11, the master node 1...
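The master/slave arrangement described above can be sketched in a few lines. This is only an in-process illustration, not the patented implementation: the class names `MasterNode` and `SlaveNode`, the `dispatch` method, and the `fake_framework` placeholder are all hypothetical, and the crawler framework 12 is assumed to be any callable that takes a target address and returns crawled data.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-in for the crawler framework 12: any callable that
# takes a target address and returns the crawled data as a string.
CrawlerFramework = Callable[[str], str]


@dataclass
class SlaveNode:
    name: str
    framework: CrawlerFramework

    def crawl(self, task: str) -> str:
        # The slave node crawls data for the received task under its framework.
        return self.framework(task)


@dataclass
class MasterNode:
    slaves: List[SlaveNode] = field(default_factory=list)

    def register(self, slave: SlaveNode) -> None:
        # The master node manages the slave nodes it knows about.
        self.slaves.append(slave)

    def dispatch(self, task: str) -> Dict[str, str]:
        # Send the received crawling task to each managed slave node
        # and collect the result per slave name.
        return {slave.name: slave.crawl(task) for slave in self.slaves}


def fake_framework(address: str) -> str:
    # Placeholder: a real framework would fetch and parse the page.
    return f"crawled:{address}"


master = MasterNode()
master.register(SlaveNode("slave-1", fake_framework))
master.register(SlaveNode("slave-2", fake_framework))
results = master.dispatch("http://example.com/page")
```

In a real deployment the call from master to slave would cross the network; the direct method call here only shows the division of roles.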
Embodiment 2
[0033] Figure 3 is a flow chart of the steps of a distributed data crawling method provided in Embodiment 2 of the present invention. This embodiment is applicable to network data crawling. The method can be performed by the distributed data crawling system of the embodiments of the present invention, which can be implemented in software and/or hardware. Referring to Figure 3, the method of this embodiment specifically comprises:
[0034] Step 301: When the master node detects a crawling task, it acquires the stored routing information table.
[0035] The crawling task can be a request to crawl network data and can include the address information of the target web page. The routing information table can be a data table that stores information about the slave nodes; the master node can search the routing information table to obtain the network address of each slave node.
[0036] Specifically, the master node can conti...
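Step 301 can be illustrated with a minimal sketch, assuming the routing information table is a simple mapping from a slave node identifier to its network address. The table contents, the addresses, and the function names `load_routing_table` and `on_crawl_task` are hypothetical and are not taken from the patent.

```python
from typing import Dict, List, Optional

# Hypothetical routing information table: slave node id -> network address.
ROUTING_TABLE: Dict[str, str] = {
    "slave-1": "10.0.0.11:6800",
    "slave-2": "10.0.0.12:6800",
}


def load_routing_table() -> Dict[str, str]:
    # Assumption: the stored table is kept in memory; a real system
    # might read it from a database or a configuration store.
    return dict(ROUTING_TABLE)


def on_crawl_task(task: Optional[str]) -> List[str]:
    # Only acquire the routing table once a crawling task is detected.
    if not task:
        return []
    table = load_routing_table()
    # Look up the network address of each slave node, in a stable order.
    return [table[node_id] for node_id in sorted(table)]
```

For example, `on_crawl_task("http://example.com")` returns both slave addresses, while `on_crawl_task(None)` returns an empty list because no task was detected.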
Embodiment 3
[0045] Figure 4 is a flow chart of the steps of a distributed data crawling method provided by Embodiment 3 of the present invention. This embodiment builds on the above-mentioned embodiments. Referring to Figure 4, in this embodiment the method includes:
[0046] Step 401: The master node checks the data interface.
[0047] The data interface may be an interface for receiving crawling tasks. When a user requests data crawling, the crawling task may be sent to the data interface of the master node.
[0048] Specifically, the master node can check the data interface in real time, and the check can include determining whether there is data in the data interface.
[0049] Step 402: If it is detected that there is a crawling address in the data interface, then determine that the master node has detected a crawling task, and use the crawling address as a crawling task; otherwise, determine ...
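Steps 401 and 402 can be sketched as a non-blocking check of the data interface. Here the data interface is assumed to be an in-memory `queue.Queue` of crawling addresses, and `detect_task` is a hypothetical name. The source text for the "otherwise" branch is truncated, so the sketch simply returns `None` when no address is present, which is an assumption.

```python
import queue
from typing import Optional


def detect_task(data_interface: "queue.Queue[str]") -> Optional[str]:
    # Step 401: check whether there is data in the data interface.
    try:
        address = data_interface.get_nowait()
    except queue.Empty:
        # Assumption: no crawling address means no task was detected.
        return None
    # Step 402: a crawling address is present, so use it as the crawling task.
    return address
```

A master node could call `detect_task` in a loop to approximate the real-time checking described above; the polling interval would be a deployment choice.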


