Extendible repeated data detection method

A scalable duplicate data detection method in the field of computer storage, which solves the problem that storage capacity cannot be expanded efficiently.

Active Publication Date: 2014-08-06
HUAZHONG UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

[0029] The present invention provides an expandable duplicate data detection method, which solves the problem that the storage capacity cannot be expanded efficiently in the existing...


Examples

Embodiment Construction

[0096] For ease of understanding, first explain the unit conversion in the calculation: 1T = 10^3 G = 10^6 M = 10^9 k = 10^12

[0097] Suppose duplicate data must be detected on a server with a capacity of 32T bytes, and the false positive rate is to be kept below 0.005, that is, ε' = 0.005. The block size is 8K bytes per block, the Bloom filter group base is g = 64 (assuming the server word length is 64), the maximum number of Bloom filters is r = 128, the Bloom filter expansion factor is t = 4, and the fingerprint length is Y = 20 bytes.
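The example parameters in [0097] imply a fixed number of fingerprints to index. The short calculation below is only a sketch using the decimal units from [0096]; the variable names are illustrative, and the even split of fingerprints across the r filters is an assumption, since this excerpt does not state how g and t distribute the load.

```python
# Worked arithmetic for the example in [0097], using decimal units from [0096].
T = 10**12
K = 10**3

capacity_bytes = 32 * T      # server capacity: 32T bytes
block_bytes = 8 * K          # block size: 8K bytes
fingerprint_bytes = 20       # Y = 20
r = 128                      # maximum number of Bloom filters

# One fingerprint per block, so this is the total fingerprint count to index.
n_max = capacity_bytes // block_bytes

# Raw volume of fingerprint data if every block is unique.
table_bytes = n_max * fingerprint_bytes

# ASSUMPTION (not from the source): fingerprints are spread evenly over r filters.
per_filter = n_max // r

print(n_max)        # 4000000000
print(table_bytes)  # 80000000000
print(per_filter)   # 31250000
```

In other words, the 32T-byte server holds at most 4 × 10^9 blocks, whose 20-byte fingerprints occupy about 80G bytes in total, which is why the fingerprints are retrieved through in-memory Bloom filters rather than searched directly.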

[0098] The present invention will be further described below in conjunction with the accompanying drawings and embodiments.

[0099] As shown in Figure 3, the embodiment of the present invention includes a block processing step, a fingerprint extraction step, a Bloom filter retrieval step, a fingerprint subset table retrieval step, a not-full Bloom filter judgment step, a new fingerprint marking step, and a Bloom filter ju...
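The step sequence listed in [0099] can be illustrated with a minimal sketch. This is not the patented procedure: the class names, the fixed-size blocking, the use of SHA-1 (chosen only because it yields 20-byte fingerprints, matching Y = 20), the double-hashing scheme, and the one-filter-at-a-time expansion are all assumptions made for illustration. The expansion factor t and group base g from [0097] are not modeled.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter over fixed-length fingerprints (illustrative only)."""

    def __init__(self, m_bits, k_hashes, capacity):
        self.m, self.k, self.capacity = m_bits, k_hashes, capacity
        self.bits = bytearray((m_bits + 7) // 8)
        self.count = 0

    def _positions(self, fp):
        # Double hashing: derive k bit positions from two slices of the fingerprint.
        h1 = int.from_bytes(fp[:8], "big")
        h2 = int.from_bytes(fp[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, fp):
        for p in self._positions(fp):
            self.bits[p // 8] |= 1 << (p % 8)
        self.count += 1

    def might_contain(self, fp):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(fp))

    def is_full(self):
        return self.count >= self.capacity


class DuplicateDetector:
    """Array of identical Bloom filters, each paired with a fingerprint subset table."""

    def __init__(self, m_bits=1 << 16, k_hashes=4, capacity=1000, max_filters=128):
        self.params = (m_bits, k_hashes, capacity)
        self.max_filters = max_filters
        self.filters = [BloomFilter(*self.params)]
        self.subsets = [set()]  # one fingerprint subset table per filter

    def is_duplicate(self, block):
        fp = hashlib.sha1(block).digest()  # fingerprint extraction (20 bytes)
        # Bloom filter retrieval, then exact check in that filter's subset table.
        for bf, table in zip(self.filters, self.subsets):
            if bf.might_contain(fp) and fp in table:
                return True
        # New fingerprint: look for a filter that is not yet full.
        for bf, table in zip(self.filters, self.subsets):
            if not bf.is_full():
                break
        else:
            # Every filter is full: check the filter count, then extend the array.
            if len(self.filters) >= self.max_filters:
                raise RuntimeError("Bloom filter array is at its maximum size")
            bf, table = BloomFilter(*self.params), set()
            self.filters.append(bf)
            self.subsets.append(table)
        bf.add(fp)      # new fingerprint marking
        table.add(fp)
        return False


def split_blocks(data, block_size):
    """Block processing: fixed-size blocks (an assumption of this sketch)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


detector = DuplicateDetector(capacity=2, max_filters=4)
data = b"A" * 8 + b"B" * 8 + b"A" * 8 + b"C" * 8
flags = [detector.is_duplicate(b) for b in split_blocks(data, 8)]
print(flags)                  # [False, False, True, False]
print(len(detector.filters))  # 2
```

Because a Bloom filter hit is always confirmed against the exact subset table, a false positive in this sketch costs only an extra table lookup and never produces a wrong duplicate verdict.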

Abstract

The invention discloses an extendible repeated data detection method, belonging to the technical field of computer storage. It solves the problem that storage capacity cannot be extended efficiently in existing repeated data detection methods, and so meets the need to upgrade deduplication systems as storage demand grows. The method comprises the following steps: block processing, fingerprint extraction, Bloom filter retrieval, fingerprint subset table retrieval, judgment of not-full Bloom filters, new fingerprint marking, judgment of the Bloom filter quantity, and extension of the Bloom filter array. The Bloom filter array is used to retrieve fingerprint data, which quickly narrows the retrieval range, improves retrieval efficiency, and realizes detection of repeated data. The method offers high extensibility and query performance, supports element location, controls the misjudgment (false positive) rate, and effectively reduces memory overhead. The Bloom filter array is composed of a series of isomorphic Bloom filters, so that once the misjudgment rate ε' and the preset total number of fingerprints to be retrieved, n_max, are given, the number of Bloom filters required and the number of hash functions can be worked out.
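The abstract states that the filter count and the number of hash functions follow from ε' and n_max, but this excerpt does not reproduce the formulas. As a hedged reference point only, the textbook relations for a single Bloom filter, and for an array of r independent filters that are all queried, are given below. The symbols m (bits per filter), n (fingerprints per filter), k (hash functions per filter), and ε (per-filter false positive rate) are introduced here for illustration and may differ from the patent's own derivation.

```latex
\begin{align}
\varepsilon &\approx \left(1 - e^{-kn/m}\right)^{k}, \\
k_{\mathrm{opt}} &= \frac{m}{n}\ln 2 = \log_2\frac{1}{\varepsilon}, \\
\frac{m}{n} &= \frac{\log_2(1/\varepsilon)}{\ln 2}, \\
\varepsilon' &= 1 - (1-\varepsilon)^{r} \approx r\,\varepsilon .
\end{align}
```

Under these assumptions, fixing ε' and r determines the per-filter rate ε, which in turn determines k and the bits per fingerprint m/n.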

Description

Technical field

[0001] The invention belongs to the technical field of computer storage, and in particular relates to an expandable repeated data detection method.

Background

[0002] In 1998, Jim Gray concluded in his speech on receiving the Turing Award that "the information industry has grown exponentially in the past 100 years", and proposed a new empirical law modeled on "Moore's Law": "in the future, every 18 months the storage capacity newly added worldwide will equal the sum of all storage capacity ever created". Since 2007, IDC and EMC have cooperated to release information storage market research reports for five consecutive years. The estimates show that the total amount of digital information created and copied worldwide increased from 161EB (exabytes) in 2006 to 1.8ZB (zettabytes) in 2011, the total amount of global digital information exceeded the available storage capacity for the first time in 2007, and the difference between t...

Application Information

IPC(8): G06F17/30
CPC: G06F3/0641
Inventor: 王桦, 周可, 李春花, 张攀峰, 魏建生
Owner HUAZHONG UNIV OF SCI & TECH