Distributed network crawler URL (uniform resource locator) duplicate removal system and method

A distributed web crawler technology, applied in the Internet field, that addresses the low deduplication efficiency, heavy memory consumption, and poor suitability for large-scale crawling of existing methods, and achieves high cost performance, a simple communication process, and flexible deployment and operation.

Active Publication Date: 2013-02-13
CHINA ACADEMY OF INFORMATION & COMM
4 Cites, 20 Cited by

AI Technical Summary

Problems solved by technology

The memory-based hash deduplication method uses an in-memory hash table to detect duplicates, but it consumes too much memory.
The disk-path deduplication method encodes each link with MD5 and creates a corresponding path on disk. It is time-efficient, but it is not suitable for a distributed crawler system that crawls the web at large scale.
The last method, Bloom filter deduplication, is a more commonly used scheme. It is very fast and saves space, but it has a certain false-positive rate, and it is also a memory-based method.
[0004] Generally, distributed web crawlers choose the relatively stable database deduplication method, but its deduplication efficiency is too low, and it is not suitable for large amounts of data.
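
To make the Bloom filter trade-off concrete, here is a minimal sketch of the general technique, not code from the patent. The class name `BloomFilter`, its parameters, and the use of MD5 with double hashing are all assumptions chosen for illustration. It shows why the approach is compact (a fixed bit array, no stored URLs) and why false positives are possible (different URLs can set the same bits).

```python
import hashlib
import math


class BloomFilter:
    """Fixed-size Bloom filter for URL strings (illustrative, hypothetical class)."""

    def __init__(self, expected_items: int, false_positive_rate: float) -> None:
        # Standard sizing formulas: m = -n ln p / (ln 2)^2 and k = (m / n) ln 2.
        self.num_bits = max(8, math.ceil(
            -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, url: str):
        # Double hashing: derive k bit positions from the two halves of one MD5 digest.
        digest = hashlib.md5(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1  # force a non-zero step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, url: str) -> None:
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url: str) -> bool:
        # True means "probably seen"; False means "definitely not seen".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))


# Usage: size the filter for 100,000 URLs at a 1% target false-positive rate.
seen = BloomFilter(expected_items=100_000, false_positive_rate=0.01)
seen.add("http://example.com/a")
was_added = "http://example.com/a" in seen  # an added URL is always reported as seen
bytes_used = len(seen.bits)                 # memory cost is fixed, independent of URL length
```

A URL that was never added can still be reported as seen, which is the misrecognition the text refers to; a crawler using this scheme would occasionally skip a page it has not actually visited.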



Examples


Embodiment Construction

[0030] In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be further described in detail below in conjunction with the embodiments and accompanying drawings. It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit the present invention.

[0031] A schematic diagram of the URL deduplication system architecture of a distributed web crawler provided by the embodiment of the present invention is shown in Figure 1. The system includes crawler collection sub-nodes (crawler sub-nodes), a central server, and a database server, wherein:

[0032] The crawler collection sub-nodes are deployed on multiple servers in the network and perform first-level deduplication;

[0033] The central server is used for second-level deduplication; the crawler collection sub-node needs to be registered on the central server to obtain identity authenticatio...


Abstract

The invention provides a URL (uniform resource locator) deduplication system and method for a distributed web crawler. The system comprises crawler collection sub-nodes, a central server and a database server. The method comprises the following steps: each crawler collection sub-node registers with the central server; the crawler collection sub-node obtains a URL from the database waiting queue and obtains new URL information from that URL; the crawler collection sub-node performs first-level deduplication on each newly obtained URL; if the first-level deduplication fails, the URL is discarded; if it succeeds, the newly obtained URL is added to a local URL abstract table and sent to the central server; the central server performs second-level deduplication on the newly obtained URL; if the second-level deduplication succeeds, the URL is added to a global URL abstract table, and the crawler collection sub-node adds the link of the URL to the waiting queue. Through this hierarchical mechanism, first-level deduplication spreads the deduplication work that would otherwise be concentrated on the central node across the crawler collection sub-nodes, while the central server maintains a global deduplication table through second-level deduplication. This makes the system considerably easier to expand and keeps its design, deployment and operation flexible and convenient.
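
The two-level flow in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: the class names `CentralServer` and `CrawlerSubNode`, the MD5 digest used as the table key, the in-memory sets standing in for the abstract tables, and the `deque` standing in for the database waiting queue are all hypothetical. The sketch also reads a "failed" deduplication as "the URL is already in the table".

```python
import hashlib
from collections import deque


def url_digest(url: str) -> str:
    """Compact key for a URL (MD5 is an assumption, not stated in the abstract)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


class CentralServer:
    """Keeps the global URL abstract table and performs second-level deduplication."""

    def __init__(self) -> None:
        self.registered_nodes = set()
        self.global_table = set()

    def register(self, node_id: str) -> None:
        self.registered_nodes.add(node_id)

    def second_level_dedup(self, node_id: str, digest: str) -> bool:
        """Return True if the digest is globally new, recording it as seen."""
        if node_id not in self.registered_nodes:
            raise PermissionError(f"unregistered node: {node_id}")
        if digest in self.global_table:
            return False
        self.global_table.add(digest)
        return True


class CrawlerSubNode:
    """Keeps a local URL abstract table and performs first-level deduplication."""

    def __init__(self, node_id: str, server: CentralServer, waiting_queue: deque) -> None:
        self.node_id = node_id
        self.server = server
        self.waiting_queue = waiting_queue
        self.local_table = set()
        server.register(node_id)  # each sub-node registers with the central server

    def handle_new_url(self, url: str) -> bool:
        """Return True if the URL passed both levels and was queued for crawling."""
        digest = url_digest(url)
        if digest in self.local_table:  # first level fails: discard locally
            return False
        self.local_table.add(digest)
        if not self.server.second_level_dedup(self.node_id, digest):
            return False                # another node already reported this URL
        self.waiting_queue.append(url)
        return True


# Usage: two sub-nodes share one central server and one waiting queue.
queue = deque()
server = CentralServer()
node_a = CrawlerSubNode("node-a", server, queue)
node_b = CrawlerSubNode("node-b", server, queue)
first = node_a.handle_new_url("http://example.com/page")   # True: new everywhere
repeat = node_a.handle_new_url("http://example.com/page")  # False: stopped at the first level
other = node_b.handle_new_url("http://example.com/page")   # False: stopped at the second level
```

Only URLs that survive the local check reach the central server, which is how the first level reduces the load on the central node.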

Description

Technical field

[0001] The invention relates to the field of the Internet, in particular to a system and method for URL deduplication of a distributed web crawler based on a distributed architecture.

Background technique

[0002] At present, distributed web crawlers can be divided into three types according to their communication methods: master-slave mode, autonomous mode and hybrid mode. In master-slave mode, one host acts as the control node and manages all the hosts running web crawlers. A crawler only needs to receive tasks from the control node and submit new tasks to it. Because crawlers never need to communicate with each other, this mode is simple and easy to manage. The control node, however, must communicate with all crawlers, and it needs an address list to save the information of all crawlers in the system. When the number of crawlers in the system changes, the control node needs to update the data in the address list, and this proces...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): H04L29/08
Inventor: 刘述, 徐贵宝, 江文学, 何宝宏, 高强, 赵劲
Owner: CHINA ACADEMY OF INFORMATION & COMM