Data de-duplication method based on combination of similarity and locality

A deduplication technology based on similarity, applied in digital data processing, special data processing applications, instruments, etc. It addresses the loss of deduplication throughput caused by disk index lookups, with the effects of avoiding disk index accesses, reducing memory overhead, and removing duplicate data efficiently.

Active Publication Date: 2011-10-19
HUAZHONG UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

Each time a data block arrives, the entire on-disk fingerprint index must be traversed, which severely reduces deduplication throughput.



Embodiment Construction

[0027] The deduplication method of the present invention will be further described below in conjunction with the accompanying drawings and embodiments.

[0028] The deduplication method of the present invention divides the data stream to be backed up into blocks and groups the blocks. The fingerprint set of each group of data blocks forms a similarity unit, and a representative fingerprint is selected for each similarity unit, namely the one with the smallest fingerprint-value prefix in the unit. The representative fingerprint is kept in memory and serves as the key of the index used to judge similarity during deduplication.
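The blocking, grouping, and representative-fingerprint selection in [0028] can be sketched as follows. This is a minimal illustration, not the patented implementation: fixed-size blocks, SHA-1 fingerprints, the group size, and all function names are assumptions made for the example, and "smallest prefix of the fingerprint value" is read here simply as the smallest fingerprint in the unit.

```python
import hashlib


def fingerprint(block: bytes) -> str:
    # Assumption: SHA-1 as the block fingerprint; the source does not name a hash.
    return hashlib.sha1(block).hexdigest()


def split_blocks(data: bytes, block_size: int = 4096) -> list:
    # Assumption: fixed-size blocks; the source only says the stream is divided into blocks.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def group_blocks(blocks: list, group_size: int = 4) -> list:
    # Consecutive blocks are placed in the same group.
    return [blocks[i:i + group_size] for i in range(0, len(blocks), group_size)]


def similarity_unit(group: list) -> set:
    # The similarity unit is the set of fingerprints of the blocks in one group.
    return {fingerprint(b) for b in group}


def representative_fingerprint(unit: set) -> str:
    # Interpretation: the smallest fingerprint value represents the whole unit.
    return min(unit)


def build_memory_index(groups: list) -> dict:
    # In-memory index keyed by representative fingerprint; the value is the group number.
    index = {}
    for gid, group in enumerate(groups):
        index.setdefault(representative_fingerprint(similarity_unit(group)), gid)
    return index
```

Only one entry per group is held in the index, which is where the memory saving relative to a full per-block fingerprint index comes from.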

[0029] If the data block sets represented by two similarity units share many duplicate data blocks, the probability that their representative fingerprints are equal equals the proportion of fingerprints they have in common. The similarity judgment of the present invention is therefore based on this similarity probability. The greater the ...
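The probability statement in [0029] matches the standard min-wise hashing result, which can be written as below. The symbols are introduced here for illustration and do not appear in the source: $S_1$ and $S_2$ are the fingerprint sets of two similarity units, and the fingerprint function is assumed to behave like a uniformly random ordering of blocks.

```latex
\Pr\bigl[\min(S_1) = \min(S_2)\bigr]
  \;=\; \frac{\lvert S_1 \cap S_2 \rvert}{\lvert S_1 \cup S_2 \rvert}
```

For example, under that assumption two units that share 6 fingerprints out of 8 distinct fingerprints in total would have equal representative fingerprints with probability $6/8 = 0.75$.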


Abstract

The invention provides a data de-duplication method that combines the similarity and the locality of data, with low system memory overhead and high de-duplication efficiency. The method first partitions and groups the files in a data stream, determines a similarity unit and a representative fingerprint for every data group, and stores the representative fingerprints in memory. It then traverses all the data groups and performs a similarity determination to find which data groups consist entirely of duplicate data and which contain non-duplicate data. For data groups that contain non-duplicate data, a locality determination follows to further identify which of their data is duplicate. Because only the representative fingerprints are stored in memory, memory overhead is greatly reduced. The locality of the data stream is mined and cached in memory to supplement the similarity determination, so more duplicate data can be found while frequent access to the disk index is avoided and memory utilization is improved.
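The two-stage flow described in the abstract (similarity determination first, then locality determination for whatever remains) might look roughly like the sketch below. Everything in it is an assumption for illustration: the class and method names, the use of SHA-1, the LRU cache of recently seen groups as the "locality" store, and the dictionary standing in for the on-disk index. Each group is assumed to be non-empty.

```python
import hashlib
from collections import OrderedDict


def fp(block: bytes) -> str:
    # Assumption: SHA-1 as the block fingerprint.
    return hashlib.sha1(block).hexdigest()


class TwoStageDedup:
    def __init__(self, cache_groups: int = 2):
        self.rep_index = {}         # in memory: representative fingerprint -> group id
        self.disk_groups = {}       # stand-in for disk: group id -> fingerprint set
        self.cache = OrderedDict()  # in memory: recently used groups (locality cache)
        self.cache_groups = cache_groups
        self.disk_reads = 0         # counts simulated disk index accesses

    def _remember(self, gid: int) -> None:
        # Put a group's fingerprints in the cache and evict the least recently used.
        self.cache[gid] = self.disk_groups[gid]
        self.cache.move_to_end(gid)
        while len(self.cache) > self.cache_groups:
            self.cache.popitem(last=False)

    def _load(self, gid: int) -> set:
        # Fetch a group's fingerprints, touching "disk" only on a cache miss.
        if gid not in self.cache:
            self.disk_reads += 1
        self._remember(gid)
        return self.cache[gid]

    def process_group(self, blocks: list) -> list:
        """Return the fingerprints of blocks judged non-duplicate, in input order."""
        fps = list(dict.fromkeys(fp(b) for b in blocks))  # unique, order preserved
        rep = min(fps)
        remaining = fps
        # Stage 1, similarity: a matching representative fingerprint points to a similar group.
        gid = self.rep_index.get(rep)
        if gid is not None:
            similar = self._load(gid)
            remaining = [f for f in remaining if f not in similar]
        # Stage 2, locality: check what is left against the groups cached in memory.
        if remaining:
            cached = set().union(*self.cache.values())
            remaining = [f for f in remaining if f not in cached]
        # Groups that contributed new data are recorded and cached for later groups.
        if remaining:
            new_gid = len(self.disk_groups)
            self.disk_groups[new_gid] = set(fps)
            self.rep_index.setdefault(rep, new_gid)
            self._remember(new_gid)
        return remaining
```

In this sketch a group whose representative fingerprint misses the in-memory index can still have its duplicates found through the cache, which is the supplementary role the abstract assigns to locality.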

Description

Technical field

[0001] The invention belongs to the field of computer storage, and in particular relates to a method for deleting duplicate data based on the combination of similarity and locality.

Background

[0002] In recent years, with the development and popularization of computer technology and networks, the amount of data stored worldwide has grown explosively. Although the price of storage devices has been declining, the decline lags far behind the rate of data growth. Data deduplication, as a technology for effectively eliminating redundant data on a large scale, has become a hotspot in storage system research in recent years. Data deduplication can not only greatly save storage space and improve storage system performance, but also save network bandwidth by avoiding redundant data transmission. The rise of data deduplication originated from the demand for massive data backup and archiving in the storage market, and the dema...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 冯丹, 夏文, 华宇
Owner: HUAZHONG UNIV OF SCI & TECH