Method of reducing redundancy between two or more datasets

A dataset redundancy-reduction technology, applied in the field of systems and methods for storing data. It addresses the problems that backups consume a great deal of disk space, which limits the number of backups a user maintains, and that modern disk drives have undesirable properties. It aims to achieve high deduplication efficiency and high deduplication performance.

Inactive Publication Date: 2013-01-10
CHRYSALIS STORAGE

AI Technical Summary

Benefits of technology

[0032] As currently implemented prior to this invention, the random access memory (RAM) requirements to remove redundancies in, say, a pair of multi-terabyte files exceed the capacity of most modern home computers and de-duplicating appliances. This Method vastly reduces the RAM required to remove much of the duplicate data between these two files, as well as duplicate data within the files themselves.
[0090] The Method we teach produces high deduplication efficiency (often up to 20 to 1; higher deduplication performance will be achieved with highly similar data in the Current and Reference Files) with oversubscription ratios of 30 to 1 or more, allowing us to use small (e.g., 512-byte) blocks even for 200+ GB files while consuming only 128 MB of RAM.
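
The figures quoted in [0090] can be sanity-checked with a little arithmetic. The sketch below is only a back-of-the-envelope calculation, not part of the patent: it assumes binary units (1 GB = 2**30 bytes) and reads "oversubscription ratio" as the number of data blocks competing for each digest slot. The function name `dds_budget` is hypothetical.

```python
# Back-of-the-envelope check of the figures quoted in [0090].
# Assumptions (not from the source): binary units, and "oversubscription
# ratio" meaning "data blocks per digest slot".
FILE_BYTES = 200 * 2**30   # 200 GB file
BLOCK_BYTES = 512          # block size mentioned in the text
RAM_BYTES = 128 * 2**20    # 128 MB of RAM
OVERSUBSCRIPTION = 30      # 30 blocks compete for each slot


def dds_budget(file_bytes, block_bytes, ram_bytes, oversubscription):
    """Return (blocks, slots, bytes_per_slot) for the given parameters."""
    blocks = file_bytes // block_bytes
    slots = blocks // oversubscription
    bytes_per_slot = ram_bytes / slots
    return blocks, slots, bytes_per_slot


blocks, slots, per_slot = dds_budget(
    FILE_BYTES, BLOCK_BYTES, RAM_BYTES, OVERSUBSCRIPTION
)
print(f"{blocks:,} blocks, {slots:,} slots, {per_slot:.1f} bytes per slot")
```

Under these assumptions a 200 GB file has roughly 419 million 512-byte blocks, and 128 MB of RAM leaves a little under 10 bytes per slot. Giving every block its own slot of the same size would need 30 times as much memory, about 3,840 MB.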

Problems solved by technology

Maintaining multiple versions of such datasets consumes a great deal of disk space, so generally only a few versions, if any, are kept.
This Method is particularly effective when used with Repositories that have the hardware property of “read many, write many”, “read many, write few”, or “read many, write once”.
Modern disk drives also have the undesirable property that random seeks take milliseconds.
Modern solid-state memory has the undesirable properties of being far more expensive per byte than disk-based storage and, with some types of solid-state memory, of tolerating only a limited number of rewrites before the device can no longer be rewritten.
Because users of computers accidentally delete files or their computers become infected by computer viruses, users often wish to retain multiple backups over time.
Thus, with an unsophisticated backup scheme, each backup of 250 gigabytes uncompressed (125 gigabytes compressed) would cost the user about $12, quickly limiting the number of backups the user maintains.


Examples

XXXVI. Example Cases

XXXVI. (1) Worst Case

[0548] As those familiar with the art of compression know, no compression technique is guaranteed to produce compression. Indeed, under worst-case conditions, every compression technique is guaranteed to produce output that is larger than its inputs.

[0549] Our Method is no exception and it is useful to understand this Method's limitations.
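
The general worst-case claim in [0548] is easy to observe with any off-the-shelf compressor. The sketch below uses Python's standard `zlib` module, which is a different technique from the Method described here, and feeds it pseudo-random bytes, which contain essentially no redundancy. The helper name `compressed_growth` is hypothetical.

```python
import random
import zlib


def compressed_growth(n_bytes, seed=0):
    """Return how many bytes zlib ADDS when compressing n_bytes of random data."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    data = bytes(rng.getrandbits(8) for _ in range(n_bytes))
    return len(zlib.compress(data, 9)) - n_bytes


# A positive number means the "compressed" output is larger than the input.
print(compressed_growth(4096))
```

With incompressible input the compressor can do no better than store the data verbatim plus its own framing overhead, so the output grows.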

[0550] The amount of compression that one is likely to get from our Method is, on average, dependent on the size of the DDS; the larger the DDS, the more likely it is that our Method will be able to detect and eliminate common data redundancy.
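
The dependence on DDS size can be illustrated with a toy model. This is a minimal sketch, not the patented implementation: it assumes fixed-size blocks, a plain dictionary standing in for the DDS, and a hypothetical fill policy that simply stops inserting once `capacity` entries are present. The names `build_dds` and `matched_blocks` are invented for this example.

```python
import hashlib

BLOCK = 4  # toy block size in bytes (the source mentions e.g. 512)


def blocks_of(data):
    """Split data into consecutive fixed-size blocks."""
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]


def build_dds(reference, capacity):
    """Map block digest -> reference offset, holding at most `capacity` entries."""
    dds = {}
    for index, block in enumerate(blocks_of(reference)):
        if len(dds) >= capacity:
            break  # hypothetical policy: stop inserting once the table is full
        dds.setdefault(hashlib.sha1(block).digest(), index * BLOCK)
    return dds


def matched_blocks(current, dds):
    """Count blocks of the current file whose digest is present in the DDS."""
    return sum(
        1 for block in blocks_of(current)
        if hashlib.sha1(block).digest() in dds
    )


reference = b"AAAABBBBCCCCDDDD"
current = b"DDDDCCCCBBBBAAAA"
for capacity in (0, 2, 4):
    print(capacity, matched_blocks(current, build_dds(reference, capacity)))
# prints:
# 0 0
# 2 2
# 4 4
```

With a capacity of zero nothing can be matched, which is the degenerate case the following paragraphs walk through.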

[0551] Consider a DDS with zero entries. The Method will proceed, roughly, as follows:

[0552] A pointless “Reference File Analysis Phase” would be done to build a non-existent DDS. As usual, the Reference File would be physically and/or logically copied to the Extended Reference File while digests are inserted into the DDS. Since the DDS has, by our example, no entries then th...


Abstract

A method for reducing redundancy between two or more datasets of potentially very large size. The method improves upon current technology by oversubscribing the data structure that holds digests of data blocks and by using positional information about matching data, so that very large datasets can be analyzed and their redundancies removed. Having found a match on a digest, the method expands the match in both directions in order to detect large runs of duplicate data, and eliminates them by replacing duplicate runs with references to common data. The method is particularly useful for capturing the states of images of a hard disk. The method permits several files to have their redundancy removed and the files to later be reconstituted. The method is appropriate for use on a WORM device. The method can also make use of L2 cache to improve performance.
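
The "expand the match in both directions" step can be sketched as follows. This is an illustrative assumption about how such an expansion might look on in-memory byte strings, not the patent's actual procedure. The function name `expand_match` and its parameters are hypothetical.

```python
def expand_match(cur, ref, cur_pos, ref_pos, length):
    """Grow a known match backward and forward as far as the bytes agree.

    Precondition: cur[cur_pos:cur_pos+length] == ref[ref_pos:ref_pos+length].
    Returns (cur_start, ref_start, total_length) of the expanded run.
    """
    start_c, start_r = cur_pos, ref_pos
    # Walk backward while the bytes just before the match still agree.
    while start_c > 0 and start_r > 0 and cur[start_c - 1] == ref[start_r - 1]:
        start_c -= 1
        start_r -= 1
    end_c, end_r = cur_pos + length, ref_pos + length
    # Walk forward while the bytes just after the match still agree.
    while end_c < len(cur) and end_r < len(ref) and cur[end_c] == ref[end_r]:
        end_c += 1
        end_r += 1
    return start_c, start_r, end_c - start_c


ref = b"xxHELLO_WORLD_yy"
cur = b"zzzHELLO_WORLD_qq"
seed = b"O_WO"  # pretend a digest lookup matched this 4-byte block
result = expand_match(cur, ref, cur.find(seed), ref.find(seed), len(seed))
print(result)  # (3, 2, 12): the full run "HELLO_WORLD_"
```

A single 4-byte digest hit grows into a 12-byte run, which could then be recorded as one reference (offset 2, length 12) into the common data instead of three separate block references.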

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application is a continuation of U.S. patent application Ser. No. 12/455,864, filed Jun. 8, 2009, which claims the benefit of U.S. Provisional Patent Application No. 61/059,276, each of which is hereby incorporated herein by reference in its entirety.

FIELD OF THE INVENTION

[0002] This invention relates generally to a system and method for storing data. More particularly, this invention relates to a form of data size reduction which involves finding redundancies in and between large data sets (files) and eliminating these redundancies in order to conserve repository memory (generally, disk space).

BACKGROUND OF THE INVENTION

[0003] This invention relates generally to a system and method for storing data. More particularly, this invention relates to storing data efficiently in both the time and space domains by removing redundant data between two or more data sets.

[0004] The inventors of this invention noticed that there are many times when ...


Application Information

Patent Type & Authority: Applications (United States)
IPC: IPC(8): G06F17/30
CPC: G06F17/3015, G06F16/174, G06F16/2365, G06F16/217, G06F16/2379
Inventor: HELLER, STEVE; SHNELVAR, RALPH
Owner: CHRYSALIS STORAGE