Webpage URL repetition elimination method based on distributed database

A distributed-database technique for webpage URL deduplication that avoids sacrificing accuracy, achieves a low collision rate, and resolves memory problems.

Status: Inactive
Publication Date: 2016-09-21
Owner: HUNAN ANTVISION SOFTWARE

AI Technical Summary

Problems solved by technology

Although a Bloom Filter saves memory, this space efficiency comes at the cost of accuracy.




Embodiment Construction

[0024] As shown in Figure 1, the webpage URL deduplication method based on a distributed database of the present invention includes the following steps. Step S101: obtain the URLs to be crawled; the distributed crawlers obtain the URLs of the webpages to be crawled.

[0025] Step S102: calculate the hash value of each URL. The MurmurHash method maps a webpage URL to a long-type hash value. MurmurHash offers high computing performance and a low collision rate. Because every URL is reduced to a fixed-size value, the algorithm also compresses the data, which improves communication efficiency and saves storage space.
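As a minimal sketch of step S102, the Python code below maps a URL to a 64-bit hash value. The source does not say which MurmurHash variant is used; this sketch assumes the 64-bit MurmurHash64A variant with a seed of 0, and the function names `murmurhash64a` and `url_hash` are hypothetical.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF  # keep every intermediate value within 64 bits


def murmurhash64a(data: bytes, seed: int = 0) -> int:
    """Pure-Python port of MurmurHash64A (assumed variant, little-endian blocks)."""
    m = 0xC6A4A7935BD1E995
    r = 47
    length = len(data)
    h = (seed ^ (length * m)) & MASK64

    # Mix the input in 8-byte blocks.
    n_blocks = length // 8
    for i in range(n_blocks):
        k = int.from_bytes(data[8 * i:8 * i + 8], "little")
        k = (k * m) & MASK64
        k ^= k >> r
        k = (k * m) & MASK64
        h ^= k
        h = (h * m) & MASK64

    # Mix the remaining 1 to 7 bytes, if any.
    tail = data[8 * n_blocks:]
    if tail:
        h ^= int.from_bytes(tail, "little")
        h = (h * m) & MASK64

    # Final avalanche.
    h ^= h >> r
    h = (h * m) & MASK64
    h ^= h >> r
    return h


def url_hash(url: str) -> int:
    """Map a URL of any length to an unsigned 64-bit integer (8 bytes)."""
    return murmurhash64a(url.encode("utf-8"))


if __name__ == "__main__":
    url = "http://example.com/some/long/path?query=1"
    print(len(url.encode("utf-8")), "bytes ->", url_hash(url).to_bytes(8, "big").hex())
```

Whatever the URL length, the value sent to the database occupies 8 bytes, which is the compression effect the paragraph describes.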

[0026] Step S103: query the database. The distributed crawlers compress the URLs in their collection databases and send them to the distributed database for deduplication. The database system in the present invention adopts a decentralized structure, implemented mainly by means of consistent hashing.
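To illustrate how consistent hashing can route a URL key to one of several database nodes without a central coordinator, here is a small Python sketch. It is not taken from the patent: the class name `ConsistentHashRing`, the node names, the number of virtual points per node, and the use of SHA-1 for ring positions are all assumptions made for illustration.

```python
import bisect
import hashlib


def _ring_hash(key: str) -> int:
    """Ring position for a key (SHA-1 is a stand-in; any uniform hash would do)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ConsistentHashRing:
    """Hypothetical hash ring with virtual points to even out the load."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.replicas):
            point = _ring_hash(f"{node}#{i}")
            if point in self._owner:
                continue  # extremely unlikely position clash: keep the first owner
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove_node(self, node):
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def get_node(self, key):
        """Return the node owning the first ring position clockwise from the key."""
        if not self._points:
            raise ValueError("ring has no nodes")
        idx = bisect.bisect_right(self._points, _ring_hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


if __name__ == "__main__":
    ring = ConsistentHashRing(["db-a", "db-b", "db-c"])
    print(ring.get_node("http://example.com/page/1"))
```

The property that matters for a decentralized store is that removing or adding a node only moves the keys that node owned; every other key keeps its owner.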

[0027] The consistent h...



Abstract

The invention relates to the technical field of distributed databases, and in particular to a webpage URL deduplication method based on a distributed database. The method comprises the following steps. Step S101: acquire the URLs to be crawled, wherein the distributed crawlers acquire the URLs of the webpages to be crawled. Step S102: calculate the hash values of the URLs. Step S103: query the database, wherein the distributed crawlers compress the URLs in their own collection libraries and send them together to the distributed database for deduplication. Step S104: feed back the result, wherein the data query result is returned. Step S105: data acquisition, wherein the crawler nodes determine from the returned result whether each webpage should be crawled. With this method, the invention better solves the memory problem and the single-point problem in deduplicating massive numbers of URLs, while maintaining high query efficiency and a low collision rate.
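The five steps can be sketched end to end as follows. This is an illustrative Python sketch under stated assumptions, not the patented implementation: an in-memory set (`DedupStore`) stands in for the distributed database, `hashlib.blake2b` with an 8-byte digest stands in for MurmurHash, and all names are hypothetical.

```python
import hashlib


def url_key(url: str) -> int:
    """S102: reduce a URL to a 64-bit key (blake2b is a stand-in for MurmurHash)."""
    digest = hashlib.blake2b(url.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class DedupStore:
    """In-memory stand-in for the distributed database (S103 and S104)."""

    def __init__(self):
        self._seen = set()

    def check_and_add(self, keys):
        """Return {key: True if not seen before}, recording every key."""
        result = {}
        for key in keys:
            result[key] = key not in self._seen
            self._seen.add(key)
        return result


def filter_new_urls(urls, store):
    """S101 to S105 for one crawler batch: return only URLs not seen before."""
    keyed = {url_key(u): u for u in urls}               # S102: hash each URL
    verdicts = store.check_and_add(list(keyed.keys()))  # S103/S104: query, get result
    return [u for k, u in keyed.items() if verdicts[k]] # S105: decide what to crawl


if __name__ == "__main__":
    store = DedupStore()
    print(filter_new_urls(["http://example.com/a", "http://example.com/b"], store))
    print(filter_new_urls(["http://example.com/b", "http://example.com/c"], store))
```

In a real deployment the store would be sharded across nodes; the crawler-side logic of hashing, batching, and acting on the returned result would keep the same shape.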

Description

Technical field

[0001] The invention relates to the technical field of distributed databases, in particular to a method for deduplicating webpage URLs based on a distributed database.

Background

[0002] Webpage URL deduplication is of great significance to crawlers. Current deduplication strategies fall into two main categories: memory-based methods and disk-based methods.

[0003] Memory-based deduplication faces the problem of memory overflow, especially when the set of webpage URLs is large and still growing. The common solution is a Bloom Filter. Although this solves the memory overflow problem, it sacrifices accuracy: as the amount of data increases, the probability of collision also increases.

[0004] Disk-based deduplication has no memory overflow problem and generally relies on database deduplication. For traditional relational databases, when d...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/9566, G06F16/951
Inventor: 陈丹, 黄三伟
Owner: HUNAN ANTVISION SOFTWARE