
Web crawler URL (uniform resource locator) deduplication method based on DSBF (dynamically splittable Bloom Filter)

A dynamically splittable Bloom Filter technique for web crawlers, applied in the Internet field. It addresses URL deduplication in settings to which existing Bloom Filter schemes are difficult to adapt, and achieves efficient collection and processing with excellent time and space efficiency.

Active Publication Date: 2015-07-29
SOUTHEAST UNIV
Cites: 1; Cited by: 23

AI Technical Summary

Problems solved by technology

For the URL deduplication problem in this application scenario, the existing Bloom Filter and some of its improved solutions are mostly difficult to adapt to



Examples


Embodiment Construction

[0024] The present invention is further illustrated below with reference to a specific embodiment. It should be understood that the embodiment only illustrates the invention and does not limit its scope; after reading the present disclosure, modifications of various equivalent forms made by those skilled in the art all fall within the scope defined by the claims appended to the present application.

[0025] A web crawler URL deduplication method based on a dynamically splittable Bloom Filter, comprising:

[0026] (1) First, a dynamically splittable Bloom Filter is constructed, and the binary array of each leaf Bloom Filter is stored in a Redis database. Redis is an in-memory database with excellent read and write performance, but its performance drops sharply when the stored content approaches or exceeds the memory size. Therefore, according to the scale and characteristics of the w...
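
To make the idea of a leaf filter's binary array concrete, the following is a minimal sketch, not the patent's implementation. It assumes the textbook Bloom Filter sizing formulas and uses an in-process bytearray as a stand-in for the bit array that the text stores in Redis. The class name LeafBloomFilter and the parameters capacity and error_rate are hypothetical, introduced only for illustration.

```python
import hashlib
import math


class LeafBloomFilter:
    """Fixed-size Bloom filter; the bytearray stands in for a Redis bit array."""

    def __init__(self, capacity: int, error_rate: float = 0.001):
        # Textbook sizing: m = -n ln p / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.capacity = capacity
        self.num_bits = math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2)
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)
        self.count = 0  # number of URLs inserted so far

    def _positions(self, url: str):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, url: str) -> None:
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)
        self.count += 1

    def __contains__(self, url: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))

    def is_full(self) -> bool:
        # A full leaf is the point at which a splitting scheme would act.
        return self.count >= self.capacity


bf = LeafBloomFilter(capacity=1000, error_rate=0.01)
bf.add("http://example.com/a")
```

If the array were kept in Redis as the text describes, the bit writes and reads in add and __contains__ would presumably map to the SETBIT and GETBIT commands on a per-leaf key. Sizing each leaf from capacity and error_rate also bounds the memory each leaf consumes, which matters given the performance drop described above.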



Abstract

The invention discloses a web crawler URL (uniform resource locator) deduplication method based on a DSBF (dynamically splittable Bloom Filter). Unlike the fixed-structure Bloom Filters that uniformly bear the URL storage task in the Internet Archive crawler and the Apoidea crawler, the DSBF has a dynamically extensible structure that can be flexibly split into multiple layers as required. The method has the following advantages: the number of URLs processed can grow continuously, the false positive rate of the Bloom Filter can be kept within a set range, and the storage structure is flexible and easy to distribute. The method is therefore better suited to large-scale, distributed environments in which multiple web crawlers run in parallel, and it can support efficient collection and processing of massive amounts of web page information on the Internet.
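
The abstract does not spell out the splitting rule, so the following is only a hypothetical sketch of one way a Bloom Filter could grow by splitting. It assumes that a full leaf is frozen in place and gains two child filters, and that each URL is routed to a child by one bit of its hash per tree level. The names Node and SplittingBloomFilter, the binary fan-out, and the routing rule are all assumptions, not the patent's design.

```python
import hashlib
import math


def _hashes(url: str):
    """Return two hash values for bit positions and one for tree routing."""
    d = hashlib.sha256(url.encode("utf-8")).digest()
    h1 = int.from_bytes(d[:8], "big")
    h2 = int.from_bytes(d[8:16], "big") | 1
    route = int.from_bytes(d[16:], "big")
    return h1, h2, route


class Node:
    """One tree node: a small Bloom filter plus optional children."""

    def __init__(self, capacity: int, error_rate: float):
        self.num_bits = math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2)
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)
        self.count = 0
        self.children = None  # None while this node is still a leaf

    def _positions(self, url: str):
        h1, h2, _ = _hashes(url)
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, url: str) -> None:
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)
        self.count += 1

    def contains(self, url: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(url))


class SplittingBloomFilter:
    def __init__(self, leaf_capacity: int = 1000, error_rate: float = 0.001):
        self.leaf_capacity = leaf_capacity
        self.error_rate = error_rate
        self.root = Node(leaf_capacity, error_rate)

    def _path(self, url: str):
        """List the nodes from the root down to the leaf this URL routes to."""
        route = _hashes(url)[2]
        node, path = self.root, [self.root]
        while node.children is not None:
            node = node.children[(route >> (len(path) - 1)) & 1]
            path.append(node)
        return path

    def seen(self, url: str) -> bool:
        # A URL may have been recorded in any filter along its path.
        return any(node.contains(url) for node in self._path(url))

    def add(self, url: str) -> bool:
        """Record the URL; return False if it was (probably) already present."""
        path = self._path(url)
        if any(node.contains(url) for node in path):
            return False
        leaf = path[-1]
        if leaf.count >= self.leaf_capacity:
            # Assumed split rule: freeze the full leaf and hang two new leaves under it.
            leaf.children = (Node(self.leaf_capacity, self.error_rate),
                             Node(self.leaf_capacity, self.error_rate))
            leaf = leaf.children[(_hashes(url)[2] >> (len(path) - 1)) & 1]
        leaf.add(url)
        return True


dsbf = SplittingBloomFilter(leaf_capacity=50, error_rate=0.001)
urls = [f"http://example.com/page/{i}" for i in range(500)]
for u in urls:
    dsbf.add(u)
```

In this sketch a lookup probes every filter on the root-to-leaf path, so the combined false positive rate grows roughly with the path depth. Keeping it within a set range, as the abstract claims for the DSBF, would need a per-level error budget that this sketch does not attempt. Because each node's bit array is independent, the nodes could in principle be placed on different storage hosts.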

Description

Technical field

[0001] The invention relates to a web crawler URL deduplication method that can be used to build large-scale, distributed, high-performance web crawler applications, and in particular to a web crawler URL deduplication method based on a dynamically splittable Bloom Filter. It belongs to the field of Internet technology.

Background technique

[0002] A web crawler is an important part of many Internet information collection systems. Starting from the URLs of web pages, it automatically crawls pages on the Internet according to certain rules. Because there are hundreds of millions of URLs on the Internet, and different URLs link to each other, the crawler must determine during crawling whether the URL it is about to fetch has already been crawled, so that the same URL is not crawled repeatedly. This process is called URL deduplication. The key to implementing URL deduplication lies in how to store crawled URL information ...
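
The deduplication check described in the background can be illustrated with a minimal sketch. It uses an exact in-memory set as the store of crawled URLs, which is the simplest baseline and not the method of the invention. The function crawl, the callback get_links, and the toy link graph LINKS are hypothetical stand-ins for real page fetching and link extraction.

```python
from collections import deque


def crawl(seed, get_links):
    """Breadth-first crawl that visits each URL at most once."""
    seen = {seed}               # store of URLs already discovered
    frontier = deque([seed])    # URLs waiting to be crawled
    order = []
    while frontier:
        url = frontier.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:    # the URL deduplication check
                seen.add(link)
                frontier.append(link)
    return order


# Toy link graph standing in for real pages (hypothetical data).
LINKS = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://a.example/", "http://c.example/"],
    "http://c.example/": ["http://a.example/"],
}

visited = crawl("http://a.example/", lambda u: LINKS.get(u, []))
```

An exact set stores every URL string, so its memory grows with the full text of all URLs seen. That cost is what motivates compact probabilistic structures such as the Bloom Filter for the seen store.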


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 杨鹏, 袁志伟, 刘旋
Owner: SOUTHEAST UNIV