Data compression engine and method used for big data storage system

A big data storage and data compression technology, applied to file systems, electronic digital data processing, special data processing applications, and so on. It addresses problems such as access tasks occupying resources for longer periods, adverse effects on system performance, and reduced timeliness, and achieves the effects of improving query response speed, improving the access task mechanism, and relieving load.

Active Publication Date: 2017-12-12
ZHEJIANG LISHI TECH

AI Technical Summary

Problems solved by technology

If the HDFS system contains too many small files, system performance suffers severely.
The reasons are twofold. First, the metadata entry that NameNode 2 establishes for every file, regardless of its size, occupies at least 150 bytes of fixed space; with tens of thousands of small files, this metadata consumes a large share of NameNode 2's available storage. Moreover, NameNode 2 retrieves each metadata entry serially when looking up the blocks of a file to be read, so the excessive metadata generated by small files makes retrieval very difficult (a rough comparison of this cost is sketched after these two points).
Second, if a client node needs to access a large number of small files, it must fetch the blocks of each small file from one data node 3 after another, which consumes a great deal of system resources.
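To put the 150-byte figure in perspective, here is a back-of-envelope sketch (the file count and merge ratio are assumed for illustration and are not taken from the patent) comparing the NameNode memory held by per-file metadata before and after merging small files into larger containers:

```java
/** Back-of-envelope illustration with assumed figures (not from the patent):
 *  NameNode memory consumed by per-file metadata before and after merging. */
public class NameNodeMetadataCost {
    public static void main(String[] args) {
        long smallFiles = 1_000_000L;      // assumed: one million small files
        long bytesPerEntry = 150L;         // lower bound cited above for one metadata entry
        long filesPerContainer = 10_000L;  // assumed: small files merged per container file

        long unmerged = smallFiles * bytesPerEntry;                      // ~150 MB of NameNode memory
        long merged = (smallFiles / filesPerContainer) * bytesPerEntry;  // ~15 KB of NameNode memory

        System.out.printf("unmerged: %,d bytes, merged: %,d bytes%n", unmerged, merged);
    }
}
```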
[0009] However, data compression also increases the computation required to read and write files under the HDFS system and slows down the response, and this adverse effect can at times be very pronounced.
First, during file writing, the merged file structure (such as a SequenceFile built from the small files) must be compressed, which inevitably lengthens the operation.
Conversely, during file reading, the data node must first decompress the compressed file in the block and then retrieve the small file requested by the client node from the decompressed SequenceFile or other large file structure, which delays the moment at which the client node actually obtains the small file's content.
In addition, although merging small files significantly reduces the amount of metadata and the query load at the name node, the client node must still go through a complete access task for every small file it needs; that access task still occupies the required resources, and the decompression step further lengthens the period over which the access task holds them.
[0010] It can be seen that, in the prior art, the inefficiency caused by excessive small files in the HDFS system can be addressed by combining file merging with compression, which does reduce the number of small files and relieve the load on the name node, but it also increases the computational burden of writing and reading, consumes more resources over the lifetime of an access task from establishment to recovery, and reduces the timeliness with which client nodes obtain the required small-file content.
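To make the read path described in paragraph [0009] concrete, the following sketch uses the standard Hadoop SequenceFile reader to pull one small file out of a merged container; the (file name, file bytes) record layout and all names here are illustrative assumptions, not a format specified by the patent:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/** Sketch of why reading one small file out of a merged (and possibly compressed)
 *  SequenceFile costs extra work: records are scanned and decoded until the
 *  requested key is found. Assumes a (file name -> file bytes) layout. */
public class SmallFileLookup {
    public static byte[] lookup(Path container, String wantedName) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        try (SequenceFile.Reader reader = new SequenceFile.Reader(fs, container, conf)) {
            Text key = new Text();
            BytesWritable value = new BytesWritable();
            while (reader.next(key, value)) {   // decompression, if any, happens while decoding records
                if (key.toString().equals(wantedName)) {
                    return value.copyBytes();   // bytes of the requested small file
                }
            }
        }
        return null;  // requested small file is not in this container
    }
}
```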

Embodiment Construction

[0042] The technical solutions of the present invention are described in further detail below through specific embodiments.

[0043] Figure 2 is a schematic structural diagram of the data compression engine system under the HDFS system provided by the present invention. As shown in Figure 2, the HDFS system that realizes distributed file big data storage includes: client node 1, name node 2, and data nodes 3-1, 3-2...3-N. Stored data takes the form of files; a file is split into blocks by the client node 1 that uploads it, each block being 64 MB by default, and the blocks are stored on at least one of the data nodes 3-1, 3-2...3-N of the HDFS system. Name node 2 registers a metadata item for each file; the metadata stores the file's identifier, the block identifier of each block belonging to the file, and the network address of the data node where each block is located. In the process of accessing data, the client node 1 that req...
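As a minimal sketch of the metadata item just described, assuming a simple in-memory representation (field names and types are illustrative and are not the patent's or HDFS's actual data structures):

```java
import java.util.List;

/** Illustrative model of the per-file metadata kept by name node 2:
 *  the file identifier, the identifier of each block of the file, and the
 *  network addresses of the data nodes holding each block. */
public class FileMetadata {
    /** One block of the file (64 MB by default) and where its replicas live. */
    public record BlockLocation(String blockId, List<String> dataNodeAddresses) {}

    private final String fileId;
    private final List<BlockLocation> blocks;

    public FileMetadata(String fileId, List<BlockLocation> blocks) {
        this.fileId = fileId;
        this.blocks = blocks;
    }

    public String fileId() { return fileId; }
    public List<BlockLocation> blocks() { return blocks; }
}
```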

Abstract

The invention proposes a data compression engine and method for a big data storage system. For the massive small files in an HDFS (Hadoop Distributed File System), the access popularity level of each small file is judged, and a metadata table copy mechanism and a retrieval process number mechanism corresponding to each access popularity level are set. A small file with a high popularity level is neither merged nor compressed; a small file with a medium popularity level is imported and merged but not compressed; and a small file with a low popularity level is imported, merged and compressed, so that small files from the same source are combined into a larger file structure on which data compression is carried out. In addition, a resident access task is established for small files with a high popularity level, thereby avoiding the inefficiency brought about by frequently setting up and recovering access tasks.
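The three-tier policy in the abstract can be summarized by the sketch below; the enum values and method names are illustrative assumptions rather than the claimed implementation:

```java
/** Sketch of the popularity-based handling described in the abstract. */
public class CompressionPolicy {
    public enum Popularity { HIGH, MEDIUM, LOW }

    /** Decide how a small file is handled according to its access popularity level. */
    public static String handle(Popularity level) {
        switch (level) {
            case HIGH:
                // hot: kept separate, neither merged nor compressed,
                // and served through a resident access task
                return "keep separate, no compression, resident access task";
            case MEDIUM:
                // warm: merged into a larger container but left uncompressed,
                // so reads skip the decompression step
                return "merge into container, no compression";
            case LOW:
            default:
                // cold: merged into a larger container which is then compressed
                return "merge into container, then compress";
        }
    }

    public static void main(String[] args) {
        for (Popularity p : Popularity.values()) {
            System.out.println(p + " -> " + handle(p));
        }
    }
}
```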

Description

Technical field

[0001] The invention relates to big data application technology, and in particular to a data compression engine and method for a big data storage system.

Background technique

[0002] Hadoop is a system architecture based on distributed computer clusters for high-speed computing and data storage. It is currently the mainstream platform chosen by network service providers for collecting and analyzing massive big data.

[0003] HDFS is a distributed file system that can be built on top of a computer cluster to provide reliable, low-cost, high-throughput data storage, access, and management. It can accommodate massive amounts of big data, supports network applications based on such large-scale data volumes, and is an indispensable part of the Hadoop system.

[0004] The architecture and operation process adopted by HDFS are shown in Figures 1A-B. HDFS performs access in the form of data streams, and supports multip...

Application Information

Patent Type & Authority: Application (China)
IPC(8): G06F17/30
CPC: G06F16/113; G06F16/13; G06F16/1727; G06F16/182
Inventor: 陈海江, 周岐武
Owner: ZHEJIANG LISHI TECH