Duplicated data detecting method based on clustering

A duplicate data detection technique, applied in electrical digital data processing, special data processing applications, instruments, etc. It addresses the inability to detect duplicate data in large data sets and the resulting low detection performance, and it achieves the effect of narrowing the scope and improving performance.
CN107515931A · Active · Publication Date: 2017-12-26 · HUAZHONG UNIV OF SCI & TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: HUAZHONG UNIV OF SCI & TECH
Publication Date: 2017-12-26

Abstract

The invention discloses a clustering-based method for detecting duplicate data. The method targets data sets with high data similarity and exploits that similarity to improve both duplicate detection performance and deduplication performance. Specifically, for data in the data set that is likely to be duplicated, the list of detected fingerprints is divided into segments using a similarity merging strategy, a typical fingerprint is selected in each segment, and the segments are classified and merged into different fingerprint containers according to their typical fingerprints. Each fingerprint container thus collects the duplicate fingerprints from similar segments of the data set, which improves both deduplication efficiency and deduplication performance. Because a fingerprint container stored on disk can be written to and read from the disk as a whole, fingerprint lookup becomes more efficient and the problem of similar segments being stored in fragmented form is solved.
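The flow described in the abstract (segment the fingerprint list, pick a typical fingerprint per segment, merge segments into containers keyed by that fingerprint) can be illustrated with a minimal Python sketch. It is not the patented implementation. The abstract does not define the similarity merging strategy or the rule for choosing a typical fingerprint, so the sketch assumes fixed-size segments and takes the smallest fingerprint in each segment as the typical one. All function names and the `segment_size` parameter are hypothetical.

```python
import hashlib
from collections import defaultdict


def fingerprint(chunk: bytes) -> str:
    """Return the SHA-1 hex digest of a chunk (assumed fingerprint function)."""
    return hashlib.sha1(chunk).hexdigest()


def split_into_segments(fingerprints, segment_size=4):
    """Cut the fingerprint list into fixed-size segments.

    Assumption: fixed-size cutting stands in for the similarity merging
    strategy, which the abstract does not specify.
    """
    return [fingerprints[i:i + segment_size]
            for i in range(0, len(fingerprints), segment_size)]


def typical_fingerprint(segment):
    """Pick the smallest fingerprint in the segment (assumed selection rule)."""
    return min(segment)


def build_containers(fingerprints, segment_size=4):
    """Merge segments that share a typical fingerprint into one container."""
    containers = defaultdict(set)
    for segment in split_into_segments(fingerprints, segment_size):
        containers[typical_fingerprint(segment)].update(segment)
    return dict(containers)


def is_duplicate(containers, segment, fp):
    """Check fp only against the container chosen by the segment's typical fingerprint."""
    container = containers.get(typical_fingerprint(segment), set())
    return fp in container
```

Under these assumptions, two segments that share their smallest fingerprint land in the same container, so a lookup for an incoming segment touches a single container instead of the full fingerprint index.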

Description

Technical Field

[0001] The invention belongs to the technical field of computer storage, and more particularly relates to a method and system for detecting duplicate data based on clustering.

Background Art

[0002] With the rapid development of information technology, information has become a precious resource and a major driver of productivity growth. The widespread application of information technology is accompanied by the generation of massive amounts of data, and more and more valuable data needs to be stored. How to improve the storage efficiency of existing storage media to meet ever-increasing storage demand has therefore become one of the urgent problems in storage research. At the same time, an IDC research report shows that about 75% of existing data is redundant, that is, only 25% of the data is unique. In this context, data deduplication, as a new technology to ...

Claims

the structure of the environmentally friendly knitted fabric provided by the present invention; Figure 2 is a flow chart of the yarn wrapping machine for environmentally friendly knitted fabrics and storage devices; Figure 3 is the parameter map of the yarn covering machine