URL duplicate removal method and device and storage medium

A URL deduplication technique in the computer field. It addresses Bloom filter misjudgment, which makes the judgment result inaccurate and affects deduplication efficiency, by compensating for that misjudgment and thereby improving deduplication accuracy.

Pending Publication Date: 2020-05-12
SF TECH

AI Technical Summary

Problems solved by technology

[0004] When a Bloom filter is used for URL deduplication, whether a URL has already been crawled is determined by checking whether the hash function values of the URL to be crawled are present in the Bloom filter. However, when the hash function values of a crawled URL are entered into the Bloom filter, the elements they set to 1 may coincide with the positions that other URLs hash to. Other, uncrawled URLs can then be misjudged as crawled, that is, a URL is reported as crawled although it has not actually been crawled. This makes the judgment result inaccurate and affects deduplication efficiency.
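The misjudgment described in [0004] can be reproduced with a deliberately undersized filter. The following is a minimal sketch, not the patent's implementation: the class `TinyBloom`, the helper `count_false_positives`, the SHA-256 based hashing, the filter sizes, and the `example.com` URLs are all assumptions chosen for illustration.

```python
import hashlib


class TinyBloom:
    """A deliberately small Bloom filter so that collisions are easy to observe."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive num_hashes positions by salting SHA-256 with a seed (assumed scheme).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True means "possibly added"; False means "definitely not added".
        return all(self.bits[pos] for pos in self._positions(item))


def count_false_positives(bloom, crawled, uncrawled):
    """Add every crawled URL, then count uncrawled URLs the filter wrongly reports."""
    for url in crawled:
        bloom.add(url)
    return sum(1 for url in uncrawled if bloom.might_contain(url))


if __name__ == "__main__":
    crawled = [f"https://example.com/page/{i}" for i in range(40)]
    uncrawled = [f"https://example.com/new/{i}" for i in range(40)]
    hits = count_false_positives(TinyBloom(64, 2), crawled, uncrawled)
    print(f"{hits} of {len(uncrawled)} uncrawled URLs were reported as already crawled")
```

Every URL counted here was never added to the filter, so each hit is a URL that a crawler relying on the Bloom filter alone would wrongly skip.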



Embodiment Construction

[0024] The application will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain related inventions, not to limit the invention. It should also be noted that, for ease of description, only parts related to the invention are shown in the drawings.

[0025] It should be noted that, in the case of no conflict, the embodiments in the present application and the features in the embodiments can be combined with each other. The present application will be described in detail below with reference to the accompanying drawings and embodiments.

[0026] It can be understood that, when a web crawler crawls programs or scripts on the Internet, some seed URLs may be selected first, such as URLs conforming to a predetermined format. A Uniform Resource Locator (URL) is a representation of the location and access...


Abstract

The invention discloses a URL deduplication method, device and storage medium. The method comprises: obtaining a to-be-crawled URL corresponding to a to-be-crawled webpage; performing hash processing on a feature of the to-be-crawled URL to obtain a processing result; judging whether the processing result is in a Bloom filter and, if it is, judging whether the feature of the URL is in a pre-established data list; and, if the feature is in the data list, abandoning the to-be-crawled URL. In the URL deduplication method provided by the embodiment of the invention, the pre-established data list is used to determine a second time whether the to-be-crawled URL has been crawled. This compensates for misjudgment by the Bloom filter, avoids abandoning URLs that have not been crawled because of such misjudgment, and improves the accuracy of URL deduplication.
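The two-stage check in the abstract can be sketched as follows. This is an illustration under stated assumptions rather than the claimed implementation: the "feature" is taken to be the URL string itself, the "pre-established data list" is modelled as an in-memory `set`, and the names `BloomFilter`, `UrlDeduplicator`, `should_crawl`, and `mark_crawled`, along with the SHA-256 based hashing, are hypothetical.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Assumed hashing scheme: SHA-256 salted with a seed per hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


class UrlDeduplicator:
    def __init__(self, bloom):
        self.bloom = bloom
        self.crawled_features = set()  # stands in for the pre-established data list

    def should_crawl(self, url):
        feature = url  # assumption: the URL string itself is the feature
        if feature not in self.bloom:
            return True  # a Bloom filter never misses an item that was added
        # The Bloom filter reports a hit: confirm against the exact data list.
        # A hit that is absent from the list is a false positive, so crawl it.
        return feature not in self.crawled_features

    def mark_crawled(self, url):
        self.bloom.add(url)
        self.crawled_features.add(url)


if __name__ == "__main__":
    dedup = UrlDeduplicator(BloomFilter(num_bits=1, num_hashes=1))  # forces collisions
    dedup.mark_crawled("https://example.com/a")
    print(dedup.should_crawl("https://example.com/a"))  # crawled URL
    print(dedup.should_crawl("https://example.com/b"))  # Bloom false positive
```

The exact lookup is only consulted when the Bloom filter reports a hit, so URLs the filter rejects outright never touch the data list.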

Description

Technical field

[0001] The present application generally relates to the field of computer technology, and specifically relates to a URL deduplication method, device and storage medium.

Background

[0002] When search engines are used to obtain information, web crawlers, which are programs or scripts that automatically fetch information from the Internet, download web pages to local storage to form a mirror backup of Internet content and provide data sources for users. To obtain as much network information as possible, web crawlers are usually distributed across clusters of multiple machines.

[0003] To avoid crawling webpages that have already been crawled, the Uniform Resource Locators (URLs) corresponding to already crawled webpages must be deduplicated. Commonly used deduplication methods currently include database-based deduplication, memory-based deduplication, disk path-based deduplication, and Bloom f...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/955
Inventor: 曾庆维
Owner: SF TECH