Data de-duplication method based on combination of similarity and locality

A deduplication technique based on similarity, applied in digital data processing and special data processing applications. It addresses the problem of reduced deduplication throughput, and achieves the effects of avoiding disk index accesses, reducing memory overhead, and removing duplicate data efficiently.

Active Publication Date: 2011-10-19

HUAZHONG UNIV OF SCI & TECH

3 Cites, 79 Cited by


Problems solved by technology

Every time a data block is input, the entire on-disk fingerprint index needs to be traversed, which seriously reduces the throughput of deduplication.



Embodiment Construction

[0027] The deduplication method of the present invention will be further described below in conjunction with the accompanying drawings and embodiments.

[0028] The deduplication method of the present invention divides the data stream to be backed up into blocks and groups the blocks. The fingerprint set of each group of data blocks forms a similarity unit, and a representative fingerprint is selected for each similarity unit, namely the fingerprint whose value prefix is the smallest in that unit. The representative fingerprint is placed in memory and used as the key of the deduplication index for similarity determination.
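The grouping and representative-fingerprint selection described in [0028] can be sketched as follows. This is a minimal illustration, not the patented implementation: fixed-size blocks, SHA-1 fingerprints, the tiny `BLOCK_SIZE` and `GROUP_SIZE` values, and all function names are assumptions made for the example, and the whole fingerprint is compared rather than only a prefix.

```python
import hashlib

BLOCK_SIZE = 4   # bytes per data block (hypothetical, tiny for illustration)
GROUP_SIZE = 3   # data blocks per group (hypothetical)

def fingerprints(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Split data into fixed-size blocks and fingerprint each block with SHA-1."""
    return [hashlib.sha1(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]

def similarity_units(fps: list[bytes], group_size: int = GROUP_SIZE) -> list[list[bytes]]:
    """Group consecutive fingerprints; each group is one similarity unit."""
    return [fps[i:i + group_size] for i in range(0, len(fps), group_size)]

def representative(unit: list[bytes]) -> bytes:
    """Pick the smallest fingerprint of the unit as its representative."""
    return min(unit)

def build_index(data: bytes) -> dict[bytes, list[bytes]]:
    """In-memory index keyed by representative fingerprint only."""
    return {representative(unit): unit
            for unit in similarity_units(fingerprints(data))}

index = build_index(b"aaaabbbbccccddddeeeeffff")
print(len(index))  # number of similarity units held in the index
```

Only one key per group is kept in memory here, which is the source of the memory saving the method claims; the full fingerprint lists stored as values would live on disk in a real system.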

[0029] If the data block sets represented by two similarity units share many duplicate data blocks, the probability that their representative fingerprints are equal equals the proportion of fingerprints they have in common. The similarity determination method of the present invention is therefore based on this similarity probability. The greater the ...
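The probability statement in [0029] matches the well-known min-hash property: for a random hash function, the chance that two sets have the same minimum hash equals their Jaccard similarity (shared elements divided by total distinct elements). The sketch below checks this empirically. The salted SHA-1 used to simulate many independent hash functions, the set contents, and the function names are all assumptions for illustration.

```python
import hashlib

def jaccard(a: set, b: set) -> float:
    """Proportion of elements the two sets have in common."""
    return len(a & b) / len(a | b)

def salted_min(items: set, salt: int) -> bytes:
    """Smallest fingerprint of the set under one salted hash function."""
    prefix = salt.to_bytes(4, "big")
    return min(hashlib.sha1(prefix + item).digest() for item in items)

def match_rate(a: set, b: set, trials: int = 1000) -> float:
    """Fraction of salted hash functions under which both minima coincide."""
    hits = sum(salted_min(a, s) == salted_min(b, s) for s in range(trials))
    return hits / trials

# Two hypothetical similarity units sharing 60 of 100 distinct blocks.
a = {f"block-{i}".encode() for i in range(0, 80)}
b = {f"block-{i}".encode() for i in range(20, 100)}

print(jaccard(a, b))     # exact overlap ratio
print(match_rate(a, b))  # empirical estimate, should be close to the ratio
```

A deduplicator uses a single hash function, so one matching representative fingerprint is a probabilistic hint of similarity rather than proof; the more two units overlap, the more likely the hint fires.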


Abstract

The invention provides a data de-duplication method that combines the similarity and the locality of data, with low system memory overhead and high de-duplication efficiency. The method first partitions and groups the files in a data stream, determines a similarity unit and a representative fingerprint for every data group, and stores the representative fingerprints in memory. It then traverses all the data groups and performs a similarity determination to decide which data groups consist entirely of duplicate data and which contain non-duplicate data. If a data group contains non-duplicate data, a locality determination follows to further identify which data in the group is duplicate. Because only the representative fingerprints are stored in memory, the memory overhead is greatly reduced. The locality of the data stream is mined and cached in memory to supplement the similarity determination, so that more duplicate data can be found, frequent access to the disk index is avoided, and memory utilization is improved.
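The two-stage flow in the abstract (similarity determination first, locality determination for whatever remains) might look roughly like the sketch below. The abstract does not specify the data structures, so the `TwoStageDedup` class, the choice to prefetch the immediate neighbours of a matched unit, and the unbounded cache are all assumptions made for illustration.

```python
class TwoStageDedup:
    """Toy sketch: similarity lookup, then a locality cache for leftovers."""

    def __init__(self):
        self.rep_index = {}          # representative fingerprint -> unit id (in memory)
        self.units = []              # stored units, a stand-in for the on-disk store
        self.locality_cache = set()  # fingerprints prefetched from neighbouring units

    def add_unit(self, unit):
        """Return the fingerprints of `unit` that were not detected as duplicates."""
        fps = frozenset(unit)
        rep = min(fps)
        remaining = set(fps)
        match = self.rep_index.get(rep)
        if match is not None:
            # Stage 1 (similarity): compare against the unit with the same representative.
            remaining -= self.units[match]
            # Assumed locality rule: neighbours of a matched unit are likely to recur soon.
            for j in (match - 1, match + 1):
                if 0 <= j < len(self.units):
                    self.locality_cache |= self.units[j]
        if remaining:
            # Stage 2 (locality): check leftovers against the cached neighbours.
            remaining -= self.locality_cache
        self.rep_index[rep] = len(self.units)
        self.units.append(fps)
        return remaining

dedup = TwoStageDedup()
# First backup: two units of hand-made one-byte "fingerprints".
dedup.add_unit([b"\x01", b"\x02", b"\x03"])
dedup.add_unit([b"\x04", b"\x05", b"\x06"])
# Second backup: the first unit repeats exactly; the second has a new smallest block.
print(dedup.add_unit([b"\x01", b"\x02", b"\x03"]))  # set(): fully duplicate
print(dedup.add_unit([b"\x00", b"\x05", b"\x06"]))  # {b'\x00'}: only the new block
```

In the last call the representative fingerprint has changed, so the similarity lookup misses, yet the two unchanged blocks are still found through the cache filled by the previous hit. A real system would bound the cache and evict old entries.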

Description

Technical field

[0001] The invention belongs to the field of computer storage, in particular to a method for deleting duplicated data based on the combination of similarity and locality.

Background technique

[0002] In recent years, with the development and popularization of computer technology and networks, the amount of data information storage in the world has shown an explosive growth trend. Although the price of storage devices has been declining, it is far behind the speed of data expansion. Data deduplication (Data Deduplication), as a technology to effectively eliminate redundant data on a large scale, has become a hotspot in storage system research in recent years. Data deduplication can not only greatly save storage space and improve storage system performance, but also save network bandwidth by avoiding redundant data transmission. The rise of data deduplication originated from the demand for massive data backup and archiving in the storage market, and the dema...


Application Information


Patent Type & Authority: Applications (China)

IPC: IPC(8): G06F17/30

Inventor: 冯丹 (Feng Dan), 夏文 (Xia Wen), 华宇 (Hua Yu)

Owner: HUAZHONG UNIV OF SCI & TECH
