Duplicated data detecting method based on clustering

A clustering-based duplicate data detection technology, applied in electrical digital data processing, special data processing applications, instruments, and similar fields. It addresses the low detection performance of existing methods and their inability to detect duplicate data in large data sets, and achieves the effect of narrowing the search scope and improving performance.

Active Publication Date: 2017-12-26

HUAZHONG UNIV OF SCI & TECH

7 Cites, 10 Cited by

Problems solved by technology

[0005] In view of the above defects or improvement needs of the prior art, the present invention provides a cluster-based duplicate data detection method and system. Its purpose is to solve two technical problems of existing fingerprint-based duplicate data detection methods: their relatively low detection performance, and their inability to detect duplicate data effectively in large data sets.

Embodiment Construction

[0042] In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit the present invention. In addition, the technical features involved in the various embodiments of the present invention described below can be combined with each other as long as they do not constitute a conflict with each other.

[0043] The present invention provides an efficient clustering-based duplicate data detection method. The method mainly targets data sets with strong similarity. Using the similarity principle and clustering, it stores similar data in the data set together, in order to solve the problem of low detection efficiency in existing duplicate data detection methods and to adapt to the status quo ...

Abstract

The invention discloses a clustering-based duplicate data detection method. The method mainly targets data sets with high data similarity, and uses the similarity of data within a data set to improve both duplicate detection performance and deduplication performance. The specific steps are as follows: for data in the data set that is likely to be duplicated, the detected fingerprint list is divided into segments using a similarity merging strategy; a typical fingerprint is selected in each segment; and the segments are classified and merged into different fingerprint containers according to their typical fingerprints. The fingerprint containers collect the duplicate fingerprints found in similar segments of the data set, which improves both the efficiency and the performance of deduplication. A fingerprint container stored on disk can be written to and read from the disk as a whole, which improves fingerprint lookup efficiency and solves the problem of similar segments being stored in separate places.
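The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the excerpt does not specify the similarity merging strategy or how the typical fingerprint is chosen, so the sketch assumes fixed-size segments and takes the smallest fingerprint in each segment as its typical fingerprint. The names `fingerprint`, `split_into_segments`, `typical_fingerprint`, and `ContainerStore`, and the use of SHA-1, are hypothetical choices made for illustration, and an in-memory dictionary stands in for containers stored on disk.

```python
import hashlib
from collections import defaultdict


def fingerprint(chunk: bytes) -> str:
    """Return a hex digest for a chunk (SHA-1 is an assumed choice)."""
    return hashlib.sha1(chunk).hexdigest()


def split_into_segments(fingerprints, segment_size=4):
    """Cut the fingerprint list into fixed-size segments.

    Assumption: a stand-in for the similarity merging strategy,
    which the excerpt does not describe.
    """
    return [fingerprints[i:i + segment_size]
            for i in range(0, len(fingerprints), segment_size)]


def typical_fingerprint(segment):
    """Assumed rule: the smallest fingerprint represents the segment."""
    return min(segment)


class ContainerStore:
    """Maps each typical fingerprint to one fingerprint container."""

    def __init__(self):
        # In-memory stand-in for containers read from and written to disk.
        self.containers = defaultdict(set)

    def deduplicate(self, fingerprints, segment_size=4):
        """Return (duplicates, unique) for a stream of fingerprints."""
        duplicates, unique = [], []
        for segment in split_into_segments(fingerprints, segment_size):
            # One lookup fetches the whole container for this segment.
            container = self.containers[typical_fingerprint(segment)]
            for fp in segment:
                if fp in container:
                    duplicates.append(fp)
                else:
                    container.add(fp)
                    unique.append(fp)
        return duplicates, unique


if __name__ == "__main__":
    store = ContainerStore()
    stream = [fingerprint(bytes([i])) for i in range(8)]
    print(store.deduplicate(stream))  # first pass: nothing is a duplicate
    print(store.deduplicate(stream))  # second pass: everything is a duplicate
```

Because each segment is checked only against the container selected by its typical fingerprint, a lookup touches one container instead of a global fingerprint index. In this sketch, a duplicate whose segment maps to a different container would go undetected; the excerpt does not say whether or how the patent handles that case.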

Description

Technical field

[0001] The invention belongs to the technical field of computer storage, and more particularly relates to a method and system for detecting duplicate data based on clustering.

Background

[0002] With the rapid development of information technology, information has become a precious resource for our survival and the biggest driving force for the rapid development of productivity. The extensive application of information technology is also accompanied by the generation of massive data, and more and more valuable data needs to be stored. Then, how to effectively improve the storage efficiency of existing storage media to meet the ever-increasing storage demand has become one of the urgent problems to be solved in the field of storage research. At the same time, IDC's research report shows that about 75% of the existing data is redundant information, that is, only 25% of the data is unique. In this context, data deduplication, as a new technology to ...

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G06F17/30, G06K9/62

CPC: G06F16/215, G06F16/2228, G06F18/23, G06F18/22

Inventor: 周可, 王桦, 张攀峰

Owner: HUAZHONG UNIV OF SCI & TECH
