Balanced clustering compression method based on data similarity

A balanced clustering compression technology based on data similarity, applied in electrical digital data processing, special data processing applications, instruments, etc. It can solve problems such as insufficient execution efficiency, data dependence, and uneven system load.

Inactive Publication Date: 2009-06-24
ZHEJIANG UNIV
0 Cites, 22 Cited by

AI Technical Summary

Problems solved by technology

[0003] General compression methods compress only a single file at a time and cannot exploit the data redundancy between files, so the compression ratio is very limited.
In addition, although various methods proposed by the academic community can utilize data redundancy between files, the amount



Examples


Embodiment Construction

[0055] As shown in Figure 1, the implementation steps of the present invention are as follows:

[0056] 1. File feature vector extraction:

[0057] A feature vector is extracted from the file data and used to calculate file similarity. The specific implementation steps are as follows:

[0058] 1) Choose a set of independent permutation functions $\{h_1, h_2, \ldots, h_k\}$, where the permutation functions are mutually independent. Independent linear functions are used here, namely $h_i(x) = (a_i x + b_i) \bmod p$, where $a_i$ and $b_i$ are randomly generated integers;

[0059] 2) Scan the input file f byte by byte from front to back, using the efficient Rabin fingerprint function to compute the fingerprint of the data in the current sliding window; denote this fingerprint fp. Apply the k independent permutation functions above to fp to obtain k permuted fingerprints $h_1(fp), h_2(fp), \ldots, h_k(fp)$, and record the feature vector F(f) of file f as {$F_1(f)$, $F_2(f)$, ......
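Steps 1) and 2) can be illustrated with a minimal Python sketch. Several details are assumptions rather than part of the text: the modulus `P`, the window width `WINDOW`, and the hash base `BASE` are not specified in the source; a polynomial rolling hash stands in for a true Rabin fingerprint (which works over GF(2) polynomials); and because the definition of $F_i(f)$ is cut off above, the sketch assumes the common min-wise choice, taking $F_i(f)$ as the minimum of $h_i(fp)$ over all windows. The `similarity` helper is likewise hypothetical.

```python
import random

P = (1 << 61) - 1      # assumed prime modulus p (not specified in the text)
WINDOW = 8             # assumed sliding-window width in bytes
BASE = 257             # assumed base of the stand-in rolling hash


def make_permutations(k, seed=0):
    """Return k pairs (a_i, b_i) defining h_i(x) = (a_i * x + b_i) mod P."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(k)]


def window_fingerprints(data, window=WINDOW):
    """Yield a rolling fingerprint for every window of `data`.

    Stand-in for a Rabin fingerprint: a polynomial hash modulo P,
    updated in O(1) per byte as the window slides forward.
    """
    if len(data) < window:
        return
    top = pow(BASE, window - 1, P)  # weight of the byte leaving the window
    fp = 0
    for byte in data[:window]:
        fp = (fp * BASE + byte) % P
    yield fp
    for i in range(window, len(data)):
        fp = ((fp - data[i - window] * top) * BASE + data[i]) % P
        yield fp


def feature_vector(data, perms):
    """Assumed min-wise features: F_i = min over windows of h_i(fp).

    If `data` is shorter than one window, every feature stays at the
    sentinel value P.
    """
    features = [P] * len(perms)
    for fp in window_fingerprints(data):
        for i, (a, b) in enumerate(perms):
            features[i] = min(features[i], (a * fp + b) % P)
    return features


def similarity(fa, fb):
    """Hypothetical measure: fraction of positions where two vectors agree."""
    return sum(x == y for x, y in zip(fa, fb)) / len(fa)
```

Under these assumptions, two files that share most of their windows tend to agree on many feature positions, so `similarity(feature_vector(x, perms), feature_vector(y, perms))` gives a rough estimate of how much content they have in common. The same `perms` must be used for every file being compared.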



Abstract

The invention discloses a clustering compression method based on data similarity. By analyzing file data, a feature vector composed of characteristic fingerprints is extracted from each file and used to calculate data similarity. The input files are clustered using a constrained graph partitioning method, forming several categories of even size. Finally, each category is compressed separately using a compression method such as BMCOM, removing the redundant data within the category. The invention adopts a clustering method based on data sampling, in which key data with high compressibility serves as the sample data. The sample data is clustered first; the remaining data is then classified using a stable marriage method, which improves clustering efficiency without reducing the compression effect. As a compression and archiving method, the invention can be applied to distributed storage systems, solving the problems of data dependence and uneven load in prior methods.
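The benefit of compressing similar files together, rather than one by one, can be illustrated with a small Python sketch. It is not the patented method: `zlib` stands in for BMCOM, the grouping is given by hand instead of being produced by clustering, and the test data is hypothetical.

```python
import random
import zlib


def compressed_size(chunks):
    """Size in bytes of the zlib stream for the concatenation of `chunks`."""
    return len(zlib.compress(b"".join(chunks), 9))


def separate_vs_grouped(files):
    """Return (sum of per-file sizes, size when compressed as one group)."""
    separate = sum(compressed_size([f]) for f in files)
    grouped = compressed_size(files)
    return separate, grouped


# Hypothetical data: three near-identical files that are individually
# incompressible (random bytes) but highly redundant with each other.
rng = random.Random(42)
base = bytes(rng.randrange(256) for _ in range(4000))
files = [base + bytes([i]) * 16 for i in range(3)]
separate, grouped = separate_vs_grouped(files)
```

In this setup `grouped` comes out smaller than `separate`, because the compressor can encode the second and third files as references to the first. With `zlib` this only works while the shared data fits in its 32 KB window, which is one reason a method aimed at long-range redundancy would be preferred in practice.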

Description

Technical field

[0001] The invention relates to the fields of data compression, distributed storage and archiving, and data mining, and in particular to a balanced clustering compression method based on data similarity.

Background technique

[0002] With the explosive growth of the total amount of information, massive distributed storage systems have become the core facilities of current Internet applications. The performance of a distributed storage system directly determines the performance of the entire information system. In a distributed storage system, apart from a small portion of hot data, most of the data is rarely accessed, yet it takes up a lot of storage space and system resources. Compressing and archiving such data can therefore reduce the occupation of system resources and save costs without degrading the user experience.

[0003] The general compression method only compresses a single file, and cannot make use of the data redundancy betwe...


Application Information

IPC(8): G06F17/30
Inventor: 陈刚, 陈珂, 余利华, 胡天磊, 寿黎但
Owner: ZHEJIANG UNIV