Distributed duplicated data deleting system and method based on Hadoop platform

A distributed deduplication technology, applied in database management systems, electronic digital data processing, structured data retrieval, etc. It addresses the low efficiency of existing distributed deduplication methods, their lack of overall coordination, and their insufficient deduplication effect. It achieves good deduplication and expansion capabilities, solves scalability problems, and improves efficiency.

Active Publication Date: 2016-02-10
PLA UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

[0006] The technical problem mainly solved by the present invention is to provide a distributed deduplication system and method based on the Hadoop platform. It addresses the problems of distributed deduplication methods in the prior art: low efficiency and, owing to the lack of overall coordination among the servers, an insufficient deduplication effect and low reliability.



Embodiment Construction

[0022] In order to facilitate the understanding of the present invention, the present invention will be described in more detail below in conjunction with the accompanying drawings and specific embodiments. Preferred embodiments of the invention are shown in the accompanying drawings. However, the present invention can be implemented in many different forms and is not limited to the embodiments described in this specification. On the contrary, these embodiments are provided to make the understanding of the disclosure of the present invention more thorough and comprehensive.

[0023] It should be noted that, unless otherwise defined, all technical and scientific terms used in this specification have the same meaning as commonly understood by those skilled in the technical field of the present invention. Terms used in the description of the present invention are only for the purpose of describing specific embodiments, and are not used to limit the present invention. The term "...


Abstract

The present invention discloses a distributed duplicate data deletion (deduplication) system and method based on the Hadoop platform. The system comprises a client, a master node and worker nodes. Distributed parallel deduplication is implemented with the MapReduce parallel programming framework of the Hadoop platform. The method works as follows: the client sends a file to the master node; the master node splits the file into fragments, distributes the data and builds a file metadata table; each worker node splits its data fragments into fine-grained data blocks, calculates the fingerprint value of each block, queries and compares the fingerprints against an index held in the HBase database, stores new data blocks in the distributed file system HDFS, and feeds index information back to the master node. The system and method achieve high throughput and excellent scalability while ensuring a high deduplication rate.
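The flow described in the abstract can be illustrated with a minimal, single-process sketch. It is not the patented implementation: plain dictionaries stand in for the HBase fingerprint index and for HDFS block storage, the workers are called one after another rather than as parallel MapReduce tasks, and the fixed-size blocking, the SHA-1 fingerprint, the sizes, and all class and method names (`Master`, `Worker`, `put`, `get`, `dedup_fragment`) are assumptions made for illustration only.

```python
import hashlib

FRAGMENT_SIZE = 64  # bytes per fragment sent to a worker (hypothetical value)
BLOCK_SIZE = 8      # bytes per fine-grained block (hypothetical value)


class Worker:
    """Deduplicates one fragment against a shared fingerprint index."""

    def __init__(self, index, store):
        self.index = index  # dict standing in for the HBase index
        self.store = store  # dict standing in for HDFS block storage

    def dedup_fragment(self, fragment):
        fingerprints = []
        new_blocks = 0
        for off in range(0, len(fragment), BLOCK_SIZE):
            block = fragment[off:off + BLOCK_SIZE]
            # SHA-1 is an assumed fingerprint function for this sketch.
            fp = hashlib.sha1(block).hexdigest()
            if fp not in self.index:
                # Unseen block: store it and register its fingerprint.
                self.store[fp] = block
                self.index[fp] = len(block)
                new_blocks += 1
            fingerprints.append(fp)
        # Index information fed back to the master.
        return fingerprints, new_blocks


class Master:
    """Splits files into fragments and keeps the file metadata table."""

    def __init__(self, workers, store):
        self.workers = workers
        self.store = store
        self.metadata = {}  # file name -> ordered list of block fingerprints

    def put(self, name, data):
        recipe = []
        stored = 0
        for i in range(0, len(data), FRAGMENT_SIZE):
            fragment = data[i:i + FRAGMENT_SIZE]
            # Round-robin distribution; a real cluster would run these in parallel.
            worker = self.workers[(i // FRAGMENT_SIZE) % len(self.workers)]
            fps, new = worker.dedup_fragment(fragment)
            recipe.extend(fps)
            stored += new
        self.metadata[name] = recipe
        return stored  # number of blocks that were actually new

    def get(self, name):
        # Rebuild the file from its fingerprint list.
        return b"".join(self.store[fp] for fp in self.metadata[name])
```

In this sketch, writing the same content twice adds no new blocks the second time, because every fingerprint is already present in the shared index. The file can still be rebuilt from the metadata table alone.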

Description

Technical field

[0001] The invention relates to the field of computer data storage management, in particular to a distributed duplicate data deletion system and method based on the Hadoop platform.

Background technique

[0002] With the rapid development of information technology, emerging technologies such as cloud computing, the Internet of Things, information grids, and various social platforms continue to appear, data types are becoming more diverse, and data volume is growing rapidly. As massive data keeps expanding, storage system capacity and stored-data management have become challenging issues. On the one hand, data centers need to add a large number of storage devices to meet the demand for massive data storage. On the other hand, the additional storage devices bring procurement, management, and electricity costs for enterprises. However, data storage in data centers generally has high redundancy characteristics, especially backu...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/182, G06F16/215, G06F16/25
Inventor: 付印金, 刘青, 倪桂强, 姜劲松, 胡谷雨
Owner: PLA UNIV OF SCI & TECH