Balanced clustering compression method based on data similarity

A clustering compression technology based on data similarity, applied in electrical digital data processing, special data processing applications, instruments, etc., that addresses insufficient execution efficiency, dependencies between data, and uneven system load.

Inactive Publication Date: 2009-06-24

ZHEJIANG UNIV

0 Cites, 22 Cited by


Problems solved by technology

[0003] General compression methods compress only a single file at a time and cannot exploit the data redundancy between files, so the achievable compression ratio is very limited.

In addition, although various methods proposed by the academic community can exploit data redundancy between files, their computational cost is too high and their execution efficiency is insufficient. Moreover, these methods rarely consider how the compressed data is stored and have not been optimized for massive distributed storage systems, so they easily cause dependencies between data and uneven system load.
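The single-file limitation described above can be illustrated with a small experiment. This is only a sketch: it uses Python's standard `zlib` module as a stand-in for "a general compression method", and the two byte strings `file_a` and `file_b` are hypothetical files invented for the illustration, not data from the patent.

```python
import random
import zlib

random.seed(0)  # fixed seed so the illustration is repeatable

# Two hypothetical files: file_b is file_a plus a small edit, so almost
# all of its content is redundant with file_a.
file_a = bytes(random.randrange(256) for _ in range(4000))
file_b = file_a + b"a small edit"

# Compressing each file on its own cannot see the shared content.
separate = len(zlib.compress(file_a)) + len(zlib.compress(file_b))

# Compressing the two files as one unit lets the compressor reuse file_a
# when it encodes file_b.
together = len(zlib.compress(file_a + file_b))

print("separate:", separate, "bytes; together:", together, "bytes")
```

Grouping similar files before compression is what lets the compressor find this cross-file redundancy, which is the motivation for clustering by similarity first.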



Embodiment Construction

[0055] As shown in Figure 1, the implementation steps of the present invention are as follows:

[0056] 1. File feature vector extraction:

[0057] A feature vector is extracted from the file data and used to calculate file similarity. The specific implementation steps are as follows:

[0058] 1) Choose k independent permutation functions {$h_1, h_2, \ldots, h_k$}. Here independent linear functions are used, namely $h_i(x) = (a_i x + b_i) \bmod p$, where $a_i$ and $b_i$ are randomly generated integers;

[0059] 2) Scan the input file f byte by byte from front to back, use the efficient Rabin fingerprint function to compute the fingerprint of the data in the current sliding window, and record it as fp. Apply the k independent permutation functions above to fp to obtain k permuted fingerprints $h_1(fp), h_2(fp), \ldots, h_k(fp)$, and record the feature vector F(f) of file f as {$F_1(f), F_2(f)$,......
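Steps 1) and 2) can be sketched roughly as follows. This is a hedged illustration, not the patented implementation: a simple polynomial rolling hash stands in for the Rabin fingerprint, and the window width `WINDOW`, the base `BASE`, and the prime `P` are assumed values that the text does not give. Because the definition of F(f) is cut off in the source, the min-wise rule used in `feature_vector` (keep the smallest permuted fingerprint per function) is an assumption, as is the helper `estimated_similarity`.

```python
import random

WINDOW = 8           # sliding-window width in bytes (assumed)
BASE = 257           # base of the stand-in rolling hash (assumed)
P = (1 << 61) - 1    # prime modulus p (assumed; the source does not give p)

def window_fingerprints(data: bytes, window: int = WINDOW):
    """Yield a rolling polynomial hash for every window of `data`.

    Stand-in for the Rabin fingerprint named in the text.
    """
    if len(data) < window:
        return
    top = pow(BASE, window - 1, P)  # weight of the byte leaving the window
    fp = 0
    for byte in data[:window]:
        fp = (fp * BASE + byte) % P
    yield fp
    for i in range(window, len(data)):
        # Drop the oldest byte, shift, and append the newest byte.
        fp = ((fp - data[i - window] * top) * BASE + data[i]) % P
        yield fp

def make_permutations(k: int, seed: int = 0):
    """Draw k random pairs (a_i, b_i) for h_i(x) = (a_i * x + b_i) mod p."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(P)) for _ in range(k)]

def feature_vector(data: bytes, perms):
    """ASSUMED min-wise rule: keep the smallest h_i(fp) over all windows."""
    features = [P] * len(perms)
    for fp in window_fingerprints(data):
        for i, (a, b) in enumerate(perms):
            h = (a * fp + b) % P
            if h < features[i]:
                features[i] = h
    return features

def estimated_similarity(u, v):
    """Fraction of positions where two feature vectors agree (assumed metric)."""
    return sum(x == y for x, y in zip(u, v)) / len(u)
```

Under these assumptions, `estimated_similarity(feature_vector(a, perms), feature_vector(b, perms))` gives a cheap similarity estimate for two files without comparing their full contents, as long as both vectors are built with the same `perms`.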


Abstract

The invention discloses a clustering compression method based on data similarity. By analyzing file data, a feature vector composed of feature fingerprints is extracted from each file and used to calculate data similarity. The input files are then clustered with a constrained graph partitioning method, so that several categories of even size are formed. Finally, each category is compressed separately with a compression method such as BMCOM to remove the redundant data inside the category. The invention adopts a clustering method based on data sampling, in which key data with high compressibility serves as the sample data. The sample data is clustered first, and the remaining data is then classified with a stable marriage method, which improves clustering efficiency without reducing the compression effect. As a compression and archiving method, the invention can be applied to a distributed storage system, where it addresses the data dependency and uneven load problems of prior methods.
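The second phase described in the abstract, classifying the remaining files into evenly sized categories with a stable marriage method, might look roughly like the sketch below. It is an assumption-laden illustration: it uses the capacity-constrained (hospitals/residents) form of Gale-Shapley deferred acceptance, and the function name `assign_balanced`, the `similarity` matrix layout, and the `capacity` parameter are all hypothetical. The patent's actual procedure is not shown in this excerpt.

```python
def assign_balanced(similarity, capacity):
    """Assign files to clusters by deferred acceptance with a size cap.

    similarity[f][c] is the similarity of remaining file f to sample
    cluster c (hypothetical input). Each cluster holds at most `capacity`
    files. Files propose to clusters in decreasing order of similarity;
    an over-full cluster rejects its least similar member.
    """
    n_files = len(similarity)
    n_clusters = len(similarity[0]) if n_files else 0
    if n_files > n_clusters * capacity:
        raise ValueError("not enough total capacity for all files")

    # Each file's preference list: clusters sorted by decreasing similarity.
    prefs = [
        sorted(range(n_clusters), key=lambda c: -similarity[f][c])
        for f in range(n_files)
    ]
    next_choice = [0] * n_files          # next cluster each file will try
    members = [[] for _ in range(n_clusters)]
    free = list(range(n_files))

    while free:
        f = free.pop()
        c = prefs[f][next_choice[f]]
        next_choice[f] += 1
        members[c].append(f)
        if len(members[c]) > capacity:
            # Cluster is over capacity: reject its least similar member.
            worst = min(members[c], key=lambda g: similarity[g][c])
            members[c].remove(worst)
            free.append(worst)
    return members


# Four hypothetical files, two clusters, at most two files per cluster.
similarity = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.7, 0.6],
    [0.3, 0.4],
]
clusters = assign_balanced(similarity, capacity=2)
print([sorted(m) for m in clusters])  # → [[0, 1], [2, 3]]
```

File 2 prefers cluster 0 but is displaced by the two more similar files, so it falls back to cluster 1. The cap is what keeps the categories even in size, at the cost of some files landing in their second choice.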

Description

Technical field

[0001] The invention relates to the fields of data compression, distributed storage and archiving, and data mining, and in particular to a balanced clustering compression method based on data similarity.

Background

[0002] With the explosive growth of the total amount of information, massive distributed storage systems have become the core facilities of current Internet applications. The performance of a distributed storage system directly determines the performance of the entire information system. In a distributed storage system, apart from a small portion of hot data, most of the data is rarely accessed, yet it takes up a large amount of storage space and system resources. Compressing and archiving such data can therefore reduce the occupation of system resources and save costs without degrading the user experience.

[0003] The general compression method only compresses a single file, and cannot make use of the data redundancy betwe...


Application Information


IPC(8): G06F17/30

Inventor: 陈刚, 陈珂, 余利华, 胡天磊, 寿黎但

Owner: ZHEJIANG UNIV
