
Distributed memory calculation based data deduplication method

A distributed in-memory computing technique, applied in computing, database querying, electrical digital data processing, and related fields, that addresses low deduplication efficiency and heavy consumption of time and system resources, and achieves fast deduplication.

Active Publication Date: 2016-02-24
SOUTH CHINA UNIV OF TECH

AI Technical Summary

Problems solved by technology

[0004] Although much research has addressed deduplication for cloud data backup in recent years, deduplication of massive data currently either focuses on optimal file partitioning, which requires data preprocessing and data modeling in advance, or reads the fingerprint information from disk and analyzes and compares it in real time. These approaches deduplicate inefficiently and consume time and system resources.



Examples


Embodiment 1

[0035] A data deduplication method based on distributed memory computing, comprising the following steps:

[0036] (1) Create a file block fingerprint set in the distributed memory, and cache the fingerprint set in the memory. The fingerprint set has two parts: the first holds the path of the corresponding block, the creation time of the block, the HASH value of the block, etc.; the second holds the creation time of the fingerprint set, the number of references to the fingerprint set, the weight of the fingerprint set, etc. The first part maps the fingerprint set to the block, and the second part controls whether the fingerprint set is cached in the distributed memory or on the disk.
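The two-part layout described in step (1) can be pictured as a small record type. The following is a minimal sketch only: the class names, field names, and the example path are hypothetical and are not taken from the patent, and fields covered by "etc." in the text are left out.

```python
from dataclasses import dataclass, field
import time


@dataclass
class BlockInfo:
    """First part: maps the fingerprint set to a stored block."""
    path: str            # path of the corresponding block
    created_at: float    # creation time of the block (seconds since epoch)
    hash_value: str      # HASH value of the block, e.g. a hex digest


@dataclass
class FingerprintSet:
    """One record: block mapping plus cache-control metadata."""
    block: BlockInfo
    created_at: float = field(default_factory=time.time)  # creation time of the set
    ref_count: int = 0   # number of references to this fingerprint set
    weight: float = 0.0  # decides memory versus disk placement


# Hypothetical example record; the path and hash are placeholders.
record = FingerprintSet(BlockInfo("/blocks/0001", 0.0, "ab12"), created_at=0.0)
```

Keeping the mapping fields in their own type mirrors the text's split between data used to locate a block and data used only for cache placement.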

[0037] (2) When creating a fingerprint set, add a uniform initial weight to it to determine the cache location of the fingerprint set. The initial weight of each fingerprint set decays gradually over time until it reaches zero.
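The text does not say what shape the decay takes. As an illustration only, the sketch below assumes linear decay; the constants `INITIAL_WEIGHT` and `DECAY_PER_SECOND` and the function name are hypothetical.

```python
INITIAL_WEIGHT = 100.0   # assumed uniform initial weight (not from the patent)
DECAY_PER_SECOND = 1.0   # assumed linear decay rate (not from the patent)


def decayed_initial_weight(created_at: float, now: float) -> float:
    """Return what remains of the initial weight at time `now`, never below zero."""
    elapsed = max(0.0, now - created_at)
    return max(0.0, INITIAL_WEIGHT - DECAY_PER_SECOND * elapsed)
```

Under these assumed constants, a fingerprint set created at time 0 keeps a weight of 70.0 after 30 seconds and reaches zero after 100 seconds. An exponential or stepwise schedule would satisfy the description equally well.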

[0038] (3) When performing f...

Embodiment 2

[0047] The present invention is applied to data deduplication on a Spark system:

[0048] Figure 1 shows the flow chart of the present invention. First, a file block fingerprint set is constructed in the distributed memory, and an initial weight is added to each created fingerprint set to determine its cache location; the initial weight gradually decays over time until it reaches zero. The file is divided into blocks according to the optimal file block division strategy, and each block fingerprint is calculated and compared with the fingerprint sets cached in the memory. If a matching fingerprint set is found, a corresponding reference is added to it; if not, the block and a new fingerprint set are created on the disk. The activity of a fingerprint set in memory is represented by its weight value, and the ordering of the weight values is used to control whether the fingerprint set is cached in memory...
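The chunk, fingerprint, compare, and reference-or-create loop can be sketched in plain Python. This is a simplified, single-process illustration rather than the Spark implementation: fixed-size chunking stands in for the unspecified optimal block division strategy, SHA-256 is an assumed choice of hash, and the names `split_blocks`, `deduplicate`, `index`, and `store` are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4  # tiny fixed block size, for illustration only (assumption)


def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> list[bytes]:
    """Fixed-size chunking, standing in for the unspecified optimal strategy."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def deduplicate(data: bytes, index: dict[str, int], store: dict[str, bytes]) -> list[str]:
    """Return the file's fingerprint list, storing only blocks not seen before.

    `index` maps fingerprint -> reference count (the cached fingerprint set);
    `store` stands in for block storage on disk.
    """
    recipe = []
    for block in split_blocks(data):
        fp = hashlib.sha256(block).hexdigest()
        if fp in index:
            index[fp] += 1      # match found: add a reference
        else:
            store[fp] = block   # no match: create the block ...
            index[fp] = 1       # ... and a new fingerprint entry
        recipe.append(fp)
    return recipe


index, store = {}, {}
recipe = deduplicate(b"AAAABBBBAAAA", index, store)
```

Here the input splits into three blocks, two of which are identical, so only two blocks are stored and the repeated block ends up with a reference count of 2. The returned fingerprint list is enough to rebuild the original bytes from `store`.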


Abstract

The invention discloses a data deduplication method based on distributed memory computing. The method comprises the following steps: creating a file block fingerprint set and caching it in distributed memory; dividing a file into blocks according to an optimal file block division policy, calculating the block fingerprints, comparing them with the fingerprint set cached in memory to find matching blocks, and adding corresponding references for those blocks; storing the block fingerprint set under a multi-level cache policy, in which block fingerprints with high weights are cached in memory and block fingerprints with low weights are cached on disk; and dividing the memory into several regions that store different types of fingerprint information, so that different fingerprint comparison operations can be performed on the file. The method improves the efficiency of mass data deduplication, which saves host space and network bandwidth and reduces data operation and maintenance costs for service providers.
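The multi-level cache policy can be illustrated by ranking fingerprints by weight and keeping only the highest-ranked ones in memory. This is a sketch under an assumption the abstract does not state: that the memory tier has a fixed number of slots. The function name `assign_tiers`, the `memory_slots` parameter, and the sample fingerprints are hypothetical.

```python
def assign_tiers(weights: dict[str, float], memory_slots: int) -> tuple[set[str], set[str]]:
    """Split fingerprints into (memory, disk) tiers by descending weight."""
    ranked = sorted(weights, key=lambda fp: weights[fp], reverse=True)
    return set(ranked[:memory_slots]), set(ranked[memory_slots:])


# Hypothetical weights; with two slots, the two heaviest entries stay in memory.
memory, disk = assign_tiers({"fp_a": 9.0, "fp_b": 1.5, "fp_c": 4.0}, memory_slots=2)
```

A weight threshold instead of a slot count would fit the description just as well; the slot count is used here only because it makes the cutoff explicit.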

Description

Technical field

[0001] The invention relates to the field of massive data deduplication, in particular to a data deduplication method based on distributed memory computing.

Background technique

[0002] At present, distributed systems are widely used in the information industry to cope with the increasing volume of massive data. Although a distributed system solves the storage problem of massive data, it also brings new challenges: data backup and restoration take longer and longer, data redundancy increases, and data storage and maintenance costs keep rising. Although the unit storage price has decreased significantly, the total cost of storage has continued to rise, so data deduplication technology has received more and more attention. How to efficiently deduplicate the secondary storage of massive data and reduce the time spent in the deduplication process as much as possible is an urgent problem to be solved.

[0003] In recent years, research on data dedupl...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/215, G06F16/24
Inventor: 林伟伟钟坯平利业鞑
Owner: SOUTH CHINA UNIV OF TECH