Replicated data deleting method based on file content types

A deduplication technique based on file content type, applicable to redundant-data error detection in computing, digital data processing, and special-purpose data processing applications. It addresses the problems that existing methods use a single blocking strategy and cannot be optimized for the file content type, and thereby improves overall performance.

Status: Inactive
Publication Date: 2010-05-12
HUAZHONG UNIV OF SCI & TECH
Cites: 0. Cited by: 155.

AI Technical Summary

Problems solved by technology

[0006] The present invention provides a deduplication method based on file content type. It solves the problem that existing deduplication methods use a single blocking strategy and cannot be optimized according to the file content type.

Examples

Embodiment Construction

[0045] The present invention will be further described below in conjunction with the accompanying drawings.

[0046] As shown in Figure 1, the present invention performs the block boundary feature calculation step in advance and then carries out, in order, the content type identification step, the file blocking step, the digital fingerprint calculation step, the duplicate data block judgment step, and the end step.
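The sequence of steps can be illustrated with a minimal Python sketch. It is not the patented implementation: the table `BOUNDARY_FEATURES`, the constant `WINDOW`, the printable-byte heuristic in `identify_content_type`, the byte-sum window hash, and the use of SHA-1 for the digital fingerprint are all assumptions chosen to keep the example short.

```python
import hashlib

# Hypothetical per-type boundary features (divisor, target). The method
# computes such values in advance; these numbers are illustrative only.
BOUNDARY_FEATURES = {"text": (64, 7), "binary": (256, 11)}
DEFAULT_FEATURE = (128, 3)
WINDOW = 16  # bytes examined at each position (assumed value)


def identify_content_type(data):
    """Crude stand-in for the content type identification step."""
    sample = data[:512]
    if sample and all(b in (9, 10, 13) or 32 <= b < 127 for b in sample):
        return "text"
    return "binary"


def split_into_blocks(data, feature):
    """File blocking step: cut where the window hash matches the feature."""
    divisor, target = feature
    blocks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        if end - start < WINDOW:
            continue  # keep every block at least WINDOW bytes long
        window_hash = sum(data[end - WINDOW:end])  # stand-in for a rolling hash
        if window_hash % divisor == target:
            blocks.append(data[start:end])
            start = end
    if start < len(data):
        blocks.append(data[start:])  # trailing block without a boundary
    return blocks


def deduplicate(data, index):
    """Return the list of block fingerprints, storing only unseen blocks."""
    feature = BOUNDARY_FEATURES.get(identify_content_type(data), DEFAULT_FEATURE)
    recipe = []
    for block in split_into_blocks(data, feature):
        fingerprint = hashlib.sha1(block).hexdigest()  # digital fingerprint
        if fingerprint not in index:  # duplicate data block judgment
            index[fingerprint] = block
        recipe.append(fingerprint)
    return recipe
```

Calling `deduplicate` twice on the same bytes with a shared `index` dictionary adds no new blocks on the second call, and joining `index[f]` for each fingerprint in the returned list reproduces the original data.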

[0047] An example of a complete flow for a content-type-based deduplication approach is given below:

[0048] The block boundary feature calculation step is performed in advance and includes the following sub-steps:

[0049] A. Generate a sample file collection in the storage pool: extract the backup file collection generated by the backup process performed on September 30, 2009 from the backup system, a total of 14427 files, as a sample file collection, and put them into the storage pool;

[0050] B. Classification of sample files: Extract the metadata of each sample fil...
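The classification of sample files can be sketched as follows. Because the description of this sub-step is truncated above, the sketch assumes that the relevant metadata is the file name extension; the table `EXTENSION_TYPES` and the function `classify_samples` are hypothetical names, not taken from the patent.

```python
import os
from collections import defaultdict

# Hypothetical extension-to-type table; the patent's actual metadata
# fields and content type list are not shown in this excerpt.
EXTENSION_TYPES = {
    ".txt": "text",
    ".doc": "document",
    ".jpg": "image",
    ".mp3": "audio",
}


def classify_samples(paths):
    """Group sample file paths by content type, using the extension only."""
    groups = defaultdict(list)
    for path in paths:
        extension = os.path.splitext(path)[1].lower()
        groups[EXTENSION_TYPES.get(extension, "other")].append(path)
    return dict(groups)
```

Files whose extension is missing or not in the table fall into an `"other"` group, so every sample file is assigned to exactly one class.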

Abstract

The invention discloses a method for deleting duplicate data based on file content type. It belongs to the field of deduplication for computer data backup, is applicable to disk-based backup systems, and solves the problem that existing deduplication methods use a single blocking strategy and cannot be optimized according to file content type. The method performs a block boundary feature calculation step in advance and then carries out the following steps in order: content type identification, file blocking, digital fingerprint calculation, duplicate data block judgment, and end. Backup files are classified by content type, and an optimal block boundary feature value is computed for each content type. When backup files are processed, a content type identification step is added and the block boundary feature is selected according to the identification result, which improves the overall effectiveness of deduplication when complex backup files are processed.
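One way to picture the advance calculation of a boundary feature per content type is the Python sketch below. The excerpt does not say how "optimal" is measured, so the criterion used here (fewest bytes left after deduplicating the sample files) is an assumption, as are the names `chunk`, `stored_bytes`, `best_divisor_per_type`, the constant `WINDOW`, the byte-sum window hash, and SHA-1 as the fingerprint.

```python
import hashlib

WINDOW = 16  # bytes examined at each position (assumed value)


def chunk(data, divisor):
    """Cut a block wherever the window byte sum is divisible by divisor."""
    blocks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        if end - start >= WINDOW and sum(data[end - WINDOW:end]) % divisor == 0:
            blocks.append(data[start:end])
            start = end
    if start < len(data):
        blocks.append(data[start:])
    return blocks


def stored_bytes(samples, divisor):
    """Bytes remaining after deduplicating all samples with this divisor."""
    unique = {}
    for data in samples:
        for block in chunk(data, divisor):
            unique[hashlib.sha1(block).digest()] = len(block)
    return sum(unique.values())


def best_divisor_per_type(samples_by_type, candidates):
    """Pick, for each content type, the candidate that stores the fewest bytes."""
    return {
        content_type: min(candidates, key=lambda d: stored_bytes(samples, d))
        for content_type, samples in samples_by_type.items()
    }
```

Here `samples_by_type` maps a content type name to a list of `bytes` objects, one per sample file. Counting stored bytes alone favors small blocks, so a practical criterion would probably also weigh the cost of the fingerprint index and per-block metadata.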

Description

Technical field

[0001] The invention belongs to the field of deduplication for computer data backup, and in particular relates to a deduplication method based on file content type (Content Type), which is suitable for disk-based backup systems.

Background technique

[0002] Since the start of the 21st century, as the information age has accelerated, data has grown explosively: user storage capacity is increasingly tight, data management is increasingly difficult, and storage expenditure keeps rising. To deal with these problems, data deduplication technology was proposed. It effectively reduces the repeated data in a user's daily backups, so that the volume of backup data is greatly reduced, which saves storage capacity for the user and eases data management. Many storage vendors have launched backup systems or software based on data deduplication, such as EMC's Avamar Data Store bac...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F11/14
Inventor: 周敬利, 秦磊华, 曾东, 聂雪军, 刘科, 朱建峰
Owner: HUAZHONG UNIV OF SCI & TECH