Duplicated data detecting method based on clustering

A duplicate data detection technique, applied in electrical digital data processing, special data processing applications, instruments, etc. It addresses the inability of existing methods to detect duplicate data in large data sets and their low detection performance, and achieves the effect of narrowing the scope and improving performance.

Active Publication Date: 2017-12-26
HUAZHONG UNIV OF SCI & TECH
7 Cites, 10 Cited by

AI Technical Summary

Problems solved by technology

[0005] In view of the above defects or improvement needs of the prior art, the present invention provides a cluster-based duplicate data detection method and system. Its purpose is to solve two technical problems of existing fingerprint-based duplicate data detection methods: relatively low detection performance, and the inability to detect duplicate data effectively in large data sets.



Embodiment Construction

[0042] In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit the present invention. In addition, the technical features involved in the various embodiments of the present invention described below can be combined with each other as long as they do not constitute a conflict with each other.

[0043] The present invention provides an efficient clustering-based duplicate data detection method. The method mainly targets data sets with strong similarity, and stores similar data in the data set together by applying similarity principles and clustering ideas, in order to solve the problem of low detection efficiency in existing duplicate data detection methods and to adapt to the status quo ...


Abstract

The invention discloses a clustering-based duplicate data detection method. The method mainly targets data sets with high data similarity, and uses the similarity principle within a data set to improve both duplicate data detection performance and deduplication performance. Specifically, for data in the data set that is likely to be duplicated, the detected fingerprint list is segmented using a similarity merging strategy, a typical fingerprint is selected in each segment, and the segments are classified and merged into different fingerprint containers according to their typical fingerprints. The fingerprint containers collect the duplicate fingerprints of similar segments in the data set, which improves both deduplication efficiency and deduplication performance. A fingerprint container stored on disk can be written to and read from the disk as a whole, which improves fingerprint lookup efficiency and solves the problem of similar segments being stored in fragmented locations.
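The container-building steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented method: the abstract does not say how the similarity merging strategy forms segments or how the typical fingerprint is chosen, so this sketch assumes fixed-size segments and takes the smallest fingerprint in each segment as its typical fingerprint. All names here (`SEGMENT_SIZE`, `split_into_segments`, `typical_fingerprint`, `build_containers`, `find_duplicates`) are hypothetical.

```python
import hashlib
from collections import defaultdict

# Assumption: fixed-size segments stand in for the "similarity merging
# strategy", which the abstract names but does not specify.
SEGMENT_SIZE = 4


def fingerprint(chunk: bytes) -> str:
    """Fingerprint a data chunk (SHA-1 chosen here only for illustration)."""
    return hashlib.sha1(chunk).hexdigest()


def split_into_segments(fingerprints, size=SEGMENT_SIZE):
    """Cut the fingerprint list into consecutive segments of at most `size`."""
    return [fingerprints[i:i + size] for i in range(0, len(fingerprints), size)]


def typical_fingerprint(segment):
    """Assumption: the smallest fingerprint represents the segment."""
    return min(segment)


def build_containers(fingerprints, size=SEGMENT_SIZE):
    """Merge segments that share a typical fingerprint into one container."""
    containers = defaultdict(set)
    for segment in split_into_segments(fingerprints, size):
        containers[typical_fingerprint(segment)].update(segment)
    return dict(containers)


def find_duplicates(segment, containers):
    """Check an incoming segment against the single container selected by
    its typical fingerprint, instead of against every stored fingerprint."""
    container = containers.get(typical_fingerprint(segment), set())
    return [fp in container for fp in segment]


# Short strings stand in for real fingerprints to keep the example readable.
stored = ["a", "b", "c", "d", "a", "b", "c", "e", "x", "y", "z", "w"]
containers = build_containers(stored)
flags = find_duplicates(["a", "c", "q", "e"], containers)
```

In this sketch the first two stored segments share the typical fingerprint "a" and are merged into one container, so a lookup for a similar incoming segment touches only that container. In a disk-backed design, that would correspond to reading one container as a whole.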

Description

Technical field

[0001] The invention belongs to the technical field of computer storage, and more particularly relates to a method and system for detecting duplicate data based on clustering.

Background

[0002] With the rapid development of information technology, information has become a precious resource for our survival and the biggest driving force for the rapid development of productivity. The extensive application of information technology is also accompanied by the generation of massive data, and more and more valuable data needs to be stored. Then, how to effectively improve the storage efficiency of existing storage media to meet the ever-increasing storage demand has become one of the urgent problems to be solved in the field of storage research. At the same time, IDC's research report shows that about 75% of the existing data is redundant information, that is, only 25% of the data is unique. In this context, data deduplication, as a new technology to ...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06K9/62
CPC: G06F16/215, G06F16/2228, G06F18/23, G06F18/22
Inventor: 周可, 王桦, 张攀峰
Owner: HUAZHONG UNIV OF SCI & TECH