Distributed-type nodes and distributed-type system in a crawler cluster

A distributed, node-based technology, applied in transmission systems, digital transmission systems, and special data-processing applications. It addresses the performance bottlenecks, limited large-scale expansion, difficult collaboration, URL deduplication and network load balancing, and lack of management found in existing approaches, with the aim of realizing a large-scale distributed crawler cluster.

Active Publication Date: 2013-04-24
ZHENGZHOU SEANET TECH CO LTD

AI Technical Summary

Problems solved by technology

[0003] Current research on distributed crawler clusters focuses mainly on distributed crawler systems in master-slave mode, in which some core management nodes are responsible for task management, uniform resource locator (URL) deduplication, and load balancing. Such a master-slave mode still cannot solve the problems of performance bottlenecks and large-scale ex

Examples

Embodiment Construction

[0013] Specific embodiments of the present invention will be further described in detail below in conjunction with the accompanying drawings.

[0014] The embodiment of the present invention builds the underlying overlay network using the structured P2P algorithm Kademlia and establishes a communication mechanism between nodes. A complete set of crawling modules runs independently on each node and is responsible for web page crawling, data analysis, link extraction, and so on. At the same time, a control center is configured on each node; it is responsible for receiving and distributing URLs, load balancing, and handling the transfer of URL history records. Because every node has equal status and identical functions, and crawler cooperation relies on mechanisms internal to the node, no additional operations outside the system are required for a single node to join the network, and the entire network can expand the number of crawler nodes at will to realize a large-scale...
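As a rough illustration of why a Kademlia-style overlay can support URL distribution without a central manager, the sketch below hashes URLs and node names into one 160-bit key space and assigns each URL to the node whose ID is closest by XOR distance. The function names, the use of SHA-1, and the global list of node IDs are assumptions made for this example; the source does not specify the mapping, and a real Kademlia node would locate the closest peer through iterative lookups over its routing table rather than a full list.

```python
import hashlib

def key_for(text: str) -> int:
    """Hash a string into a 160-bit key space using SHA-1 (assumed here)."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest(), "big")

def xor_distance(a: int, b: int) -> int:
    """Kademlia's distance metric: bitwise XOR of two identifiers."""
    return a ^ b

def responsible_node(url: str, node_ids: list[int]) -> int:
    """Return the ID of the node closest to the URL's key (hypothetical helper)."""
    key = key_for(url)
    return min(node_ids, key=lambda nid: xor_distance(nid, key))

# Eight hypothetical nodes; the same URL always maps to the same node,
# so that node alone can hold the URL's history record.
nodes = [key_for(f"node-{i}") for i in range(8)]
url = "http://example.com/a"
owner = responsible_node(url, nodes)

# When a node joins, a URL either stays with its owner or moves to the newcomer.
new_id = key_for("node-8")
new_owner = responsible_node(url, nodes + [new_id])
```

Under these assumptions, every node that discovers the same URL forwards it to the same owner, so deduplication is a local lookup there, and the hash spreads URLs across nodes.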

Abstract

An embodiment of the invention discloses a decentralized distributed node and a distributed system in a crawler cluster, based on structured peer-to-peer (P2P) networking. The distributed node comprises an underlying overlay network, a crawl module, and a control center. The overlay network, organized in P2P fashion, executes the protocol for distributing and receiving URLs between nodes; the crawl module fetches the corresponding resources from the Internet based on the distributed URLs; and the control center performs the distribution and reception of URLs. By exploiting the properties of the structured P2P algorithm Kademlia, the distributed node and system solve the URL deduplication and load-balancing problems of distributed crawler systems, achieve good scalability and fault tolerance, and provide a general design method for large-scale distributed crawler systems.
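The division of a node into overlay network, crawl module, and control center might be sketched as follows. This is a minimal, network-free illustration: the class names, the `fetch` and `send` callables, and the in-memory history set are all assumptions for the example, not details taken from the patent.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ControlCenter:
    """Receives URLs and keeps the local URL history used for deduplication."""
    node_id: int
    seen: set[str] = field(default_factory=set)
    queue: list[str] = field(default_factory=list)

    def receive(self, url: str) -> bool:
        """Queue a URL unless it is already in this node's history."""
        if url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

@dataclass
class Node:
    control: ControlCenter
    fetch: Callable[[str], list[str]]  # crawl module: URL -> extracted links
    send: Callable[[str], object]      # overlay network: deliver a URL to a peer

    def step(self) -> None:
        """Crawl one queued URL and hand every extracted link to the overlay."""
        if not self.control.queue:
            return
        url = self.control.queue.pop(0)
        for link in self.fetch(url):
            self.send(link)

def fake_fetch(url: str) -> list[str]:
    """Stand-in for real page fetching and link extraction."""
    pages = {
        "http://a/": ["http://a/1", "http://a/2"],
        "http://a/1": ["http://a/", "http://a/2"],
    }
    return pages.get(url, [])

# Single-node loopback: the overlay simply delivers links back to this node.
control = ControlCenter(node_id=1)
node = Node(control=control, fetch=fake_fetch, send=control.receive)
control.receive("http://a/")
while control.queue:
    node.step()
```

In a multi-node setting, `send` would instead route each link through the overlay to the peer responsible for it, and that peer's control center would apply the same history check.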

Description

Technical field

[0001] The invention relates to the field of computer data mining, in particular to a distributed crawler cluster method.

Background technique

[0002] Crawlers are the most important data collection tool for search engines. In today's era of information explosion, it is difficult for crawlers in the traditional client/server (C/S) mode to capture all the data on the network; moreover, if the number of crawlers increases, the server bears a greater load and the system cannot improve its service performance. In this context, distributed crawler clusters have gradually attracted attention.

[0003] Current research on distributed crawler clusters focuses mainly on distributed crawler systems in master-slave mode, in which some core management nodes are responsible for task management, uniform resource locator (URL) deduplication, load balancing, and so on. Such a master-slave mode still c...

Application Information

IPC(8): H04L29/08, H04L12/803, G06F17/30
Inventor: 陈君, 黄志敏, 吴京洪, 王玲芳
Owner ZHENGZHOU SEANET TECH CO LTD