Rapid similar data detection method based on unified sampling

A fast similar-data detection technique, applicable to database retrieval, database indexing, and database querying. It addresses the problem of slow calculation speed by reducing the number of fingerprints and simplifying the calculation, so that similar data can be detected quickly and efficiently.

Active Publication Date: 2019-08-02
HARBIN INST OF TECH SHENZHEN GRADUATE SCHOOL

AI Technical Summary

Problems solved by technology

[0013] Existing methods must calculate Rabin fingerprints over the entire scanned content of each data block (Rabin calculation is time-consuming) and must also apply M linear transformations to every Rabin fingerprint (linear transformation is time-consuming as well) to obtain M feature values, which are then assembled into several super feature values. As a result, the overall calculation is very slow.
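The cost described in [0013] can be written down in a compact form. This is only a sketch under stated assumptions: the symbols $N$ (number of sliding-window fingerprints in a block), $fp_j$ (the $j$-th fingerprint), the coefficients $a_i, b_i$, and the modulus $2^{32}$ are illustrative choices that do not come from the source, and taking the maximum is one of the two options (maximum or minimum) the text mentions.

```latex
% Feature i is the extreme value of the i-th linear transform over all fingerprints
F_i = \max_{1 \le j \le N} \bigl( (a_i \cdot fp_j + b_i) \bmod 2^{32} \bigr),
\qquad i = 1, \dots, M

% Work per data block: N fingerprint computations plus N * M linear transforms
\text{cost} \approx N \cdot C_{\text{Rabin}} + N \cdot M \cdot C_{\text{transform}}
```

Under these assumptions, both terms grow with $N$, which is why reducing the number of fingerprints that reach the transform stage shortens the second term directly.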



Examples


Embodiment Construction

[0040] The present invention will be further described below in conjunction with the description of the drawings and specific embodiments.

[0041] As shown in Figures 1 to 3, a fast similar data detection method based on uniform sampling includes the following steps:

[0042] A. Quickly calculate a hash set using a sliding window algorithm, so that as much duplicate or similar content as possible is covered; that is, if two data blocks are similar, their hash sets also share many values;

[0043] B. Quickly and uniformly sample the calculated hash set. If two data sets are very similar, the sets obtained by uniformly sampling them are also very similar;

[0044] C. Perform M linear transformations on the sampled hash set to obtain M new sets and, based on an extreme-value principle, extract one feature value (the maximum or the minimum) from each set, and calculate the feature value. The formu...
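Steps A to C can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the polynomial rolling hash, the window size, the sampling rule (keep a hash when its top bits are zero, so that sampling depends on content rather than position), the transform coefficients, and all function names are assumptions introduced here, since the source does not specify them.

```python
WINDOW = 8           # sliding-window size in bytes (assumed)
SAMPLE_BITS = 2      # keep hashes whose top 2 bits are zero, about 1 in 4 (assumed)
MASK = (1 << 32) - 1
BASE = 257           # base of the stand-in polynomial rolling hash (assumed)

# (a_i, b_i) pairs for the M linear transformations; arbitrary illustrative constants.
COEFFS = [
    (0x9E3779B1, 0x7F4A7C15),
    (0x85EBCA6B, 0xC2B2AE35),
    (0x27D4EB2F, 0x165667B1),
    (0xD6E8FEB9, 0x1B873593),
]
M = len(COEFFS)


def sliding_hashes(data: bytes) -> list[int]:
    """Step A: one 32-bit hash per window position, updated incrementally."""
    if len(data) < WINDOW:
        return []
    top = pow(BASE, WINDOW - 1, 1 << 32)
    h = 0
    for byte in data[:WINDOW]:
        h = (h * BASE + byte) & MASK
    hashes = [h]
    for i in range(WINDOW, len(data)):
        # Drop the byte leaving the window, then shift in the new byte.
        h = ((h - data[i - WINDOW] * top) * BASE + data[i]) & MASK
        hashes.append(h)
    return hashes


def uniform_sample(hashes: list[int]) -> list[int]:
    """Step B: keep a hash only if its value passes a fixed test."""
    return [h for h in hashes if (h >> (32 - SAMPLE_BITS)) == 0]


def extract_features(data: bytes) -> list[int]:
    """Step C: M linear transforms over the sampled set, one maximum per transform."""
    hashes = sliding_hashes(data)
    pool = uniform_sample(hashes) or hashes  # fall back if nothing was sampled
    if not pool:
        return [0] * M  # block shorter than one window
    return [max((a * h + b) & MASK for h in pool) for a, b in COEFFS]
```

Because the sampling rule looks only at hash values, inserting a few bytes at the front of a block leaves every previously sampled hash in the sampled set, which is the property step B relies on. The linear transforms then run over roughly a quarter of the hashes instead of all of them.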

Abstract

The invention provides a rapid similar data detection method based on unified sampling. The method comprises the following steps: A, rapidly calculating a hash set based on a sliding window algorithm; B, quickly and uniformly sampling the calculated hash set; and C, based on the sampled hash set, extracting similarity feature values and super feature values for similarity matching and lookup. The beneficial effects of the invention are as follows: while the original similarity detection effectiveness is maintained, the sliding hash calculation is fast, and the number of fingerprints requiring linear transformation is greatly reduced through unified sampling. This simplifies the subsequent extraction of feature values and super feature values and greatly increases the speed of similar data detection, achieving fast and efficient similar data detection for large-scale storage systems.

Description

Technical field

[0001] The invention relates to a similar data detection method, in particular to a fast similar data detection method based on unified sampling.

Background

[0002] In recent years, with the development and spread of computer technology and networks, the amount of data stored worldwide has grown explosively. Although the price of storage devices has fallen continuously, it has not kept pace with the growth of data. As a technology for eliminating redundant data effectively at large scale, data deduplication (or redundant data elimination) has become a hot topic in storage system research in recent years. Eliminating redundant data not only saves a large amount of storage space and improves storage system performance, but also saves network bandwidth by avoiding the transmission of redundant data. The rise of redundant data elimination technology stems from the demand for massive data backup and archiving in...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/901, G06F16/903
CPC: G06F16/9014, G06F16/90335
Inventors: 夏文, 王轩
Owner: HARBIN INST OF TECH SHENZHEN GRADUATE SCHOOL