Repeated data detection method and device

A duplicate data detection technology, applied in the computer field, that addresses the problem that data recorded in a standard Bloom filter cannot be deleted, so that web pages cannot be crawled repeatedly and page updates cannot be detected.

Pending Publication Date: 2021-04-27
BEIJING GRIDSUM TECH CO LTD

AI Technical Summary

Problems solved by technology

However, once data is recorded in a standard Bloom filter, it cannot be deleted. If a web crawler uses a standard Bloom filter to judge duplicate URLs, the same web page cannot be crawled repeatedly, so updates to the page content cannot be detected, which in turn affects the accuracy of the detection results.

Examples


Embodiment Construction

[0052] In order to better understand the above-mentioned technical solution, the above-mentioned technical solution will be described in detail below in conjunction with the accompanying drawings and specific implementation methods. In the case of no conflict, the following embodiments and features in the embodiments can be combined with each other.

[0053] It should be noted that, in the description of the present application, terms such as "first" and "second" are only used to distinguish descriptions, and should not be understood as indicating or implying relative importance.

[0054] Referring to Figure 1, which is a schematic structural diagram of an electronic device 20 provided in an embodiment of the present application, the electronic device 20 includes at least one processor 201, at least one memory 202 connected to the processor 201, and a bus 203. The processor 201 and the memory 202 communicate with each other through the bus 203; the processor 201 is used to cal...


Abstract

The invention discloses a duplicate data detection method and device, and relates to the technical field of computers. In the method, when the target data to be detected is judged to be non-duplicate, the target data is recorded in a Bloom filter and the feature information corresponding to the target data is recorded in a persistent storage unit. Data identical to the target data can then be detected as duplicate through the Bloom filter in subsequent processing. When the target data expires, it can be deleted from the Bloom filter based on the feature information recorded in the persistent storage unit. This achieves duplicate detection over massive data with support for expiration times, and improves the accuracy of duplicate detection in application scenarios that require expiration.
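The flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the source does not say which Bloom filter variant supports deletion or what the feature information contains, so the sketch assumes a counting Bloom filter, an in-memory dict standing in for the persistent storage unit, and an expiry timestamp as the stored feature information. The names `CountingBloomFilter`, `ExpiringDeduplicator`, `ttl_seconds`, and `clock` are all hypothetical.

```python
import hashlib
import time


class CountingBloomFilter:
    """Bloom filter variant with per-slot counters, so entries can be removed.

    Assumption: the source does not name this variant; it is used here only
    because a plain bit-array Bloom filter cannot support deletion.
    """

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _positions(self, item):
        # Derive num_hashes slot indices from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.counters[pos] += 1

    def remove(self, item):
        # Only call this for items that were actually added.
        for pos in self._positions(item):
            if self.counters[pos] > 0:
                self.counters[pos] -= 1

    def __contains__(self, item):
        return all(self.counters[pos] > 0 for pos in self._positions(item))


class ExpiringDeduplicator:
    """Duplicate check with expiration (hypothetical helper)."""

    def __init__(self, ttl_seconds, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock
        self.bloom = CountingBloomFilter()
        # Stand-in for the persistent storage unit: item -> expiry timestamp.
        self.store = {}

    def purge_expired(self):
        # Delete expired items from the filter using the stored records.
        now = self.clock()
        expired = [item for item, expiry in self.store.items() if expiry <= now]
        for item in expired:
            self.bloom.remove(item)
            del self.store[item]

    def is_duplicate(self, item):
        self.purge_expired()
        if item in self.bloom:
            return True  # may be a false positive, as with any Bloom filter
        # Non-duplicate: record it in the filter and in the store.
        self.bloom.add(item)
        self.store[item] = self.clock() + self.ttl
        return False
```

With a 10-second `ttl_seconds`, a URL passed to `is_duplicate` is reported as new on the first call, as a duplicate on calls made before it expires, and as new again once the expiry time has passed and the purge has removed it from the filter.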

Description

Technical field

[0001] The present application relates to the field of computer technology, and in particular to a method and device for detecting duplicate data.

Background

[0002] In the prior art, Bloom filters are often used in application scenarios that require quickly determining duplication in massive data, for example URL (Uniform Resource Locator) deduplication by web crawlers and spam email filtering.

[0003] A standard Bloom filter can test whether an element exists in a set with extremely high time efficiency while also being extremely space efficient. However, once data is recorded in a standard Bloom filter, it cannot be deleted. If a web crawler uses a standard Bloom filter to judge duplicate URLs, the same web page cannot be crawled repeatedly, so updates to the page content cannot be detected, which in turn affects t...
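To make the limitation concrete, here is a small sketch of a standard Bloom filter used for crawler URL deduplication. The class `BloomFilter`, the helper `should_crawl`, the filter size, and the salted SHA-256 hashing scheme are illustrative assumptions and do not come from the application.

```python
import hashlib


class BloomFilter:
    """Standard Bloom filter: supports add and membership test only."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive num_hashes bit indices from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

    # There is deliberately no remove(): a set bit may be shared by several
    # items, so clearing it could make other recorded items look absent.


def should_crawl(seen, url):
    """Return True the first time a URL is offered, False afterwards."""
    if url in seen:
        return False
    seen.add(url)
    return True
```

Once a URL has been added, `should_crawl` returns False for it on every later call, however much time has passed, which is the behavior that prevents a crawler from revisiting a page to pick up updates.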


Application Information

IPC(8): G06F16/9035
CPC: G06F16/9035
Inventor: 赵一飞
Owner: BEIJING GRIDSUM TECH CO LTD