Method for storing and processing small log type files in Hadoop distributed file system

A technology relating to distributed file systems and Hadoop clusters, applied in the field of electrical digital data processing, special data processing applications, instruments, etc. It addresses the problem that existing approaches make the analysis and processing of small files complex and non-transparent, and achieves the effect of relieving the NameNode memory load problem.

Status: Inactive
Publication Date: 2015-06-24
JIANGSU R & D CENTER FOR INTERNET OF THINGS +2

AI Technical Summary

Problems solved by technology

However, these methods are all focused on storage, and the interface provided is not transparent to t...

Method used

figure 1 is a schematic diagram of the structure of the MergeFile provided by the present invention; figure 2 is a schematic diagram of the structure of the MergeIndex

Examples

Embodiment 1

[0025] The computers in the cluster are divided into NameNodes and DataNodes according to their functions. When a client accesses a specific file in HDFS, it first obtains the file's Metadata from the NameNode and then establishes a connection with the DataNode to read or write the file data. The file-access procedure is encapsulated in a client library, so the process of communicating with the NameNode and DataNode is transparent to the client.
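As an illustration of this access path (not code from the patent), the minimal sketch below uses the standard Hadoop FileSystem client library; the file path is hypothetical. From the caller's point of view, the NameNode metadata lookup and the DataNode data transfer are both hidden behind open() and read().

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        // The FileSystem client library hides all NameNode/DataNode communication:
        // open() fetches the file's block metadata from the NameNode, and the
        // subsequent read() calls stream data from the DataNodes.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/logs/app/2015-06-24/part-0001.log"); // hypothetical path
        try (FSDataInputStream in = fs.open(path)) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) > 0) {
                System.out.write(buffer, 0, n);
            }
        }
        fs.close();
    }
}
```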

[0026] Small log files are merged according to the nearest physical path. Specifically, small log files in the same directory (excluding subdirectories) are merged into one file, called the MergeFile. The Metadata of the small log files is stored sequentially in a file called the MergeIndex. The merged file and the merged file index are located in the original HDFS directory and are named with reserved file names. MergeFile supports append, modify, and delete operations...
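The following sketch (not taken from the patent) illustrates one plausible way to build such a MergeFile/MergeIndex pair for a single directory with the Hadoop FileSystem API. The reserved file names and the exact fields of an index record are not given in the excerpt above, so the names `.mergefile`/`.mergeindex` and the `<name> <offset> <length>` record layout are assumptions.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

/**
 * Sketch: merge the small log files of one directory (no subdirectories)
 * into a MergeFile/MergeIndex pair stored in that same directory.
 * Reserved names and index record layout are illustrative assumptions.
 */
public class DirectoryMerger {
    private static final String MERGE_FILE = ".mergefile";   // assumed reserved name
    private static final String MERGE_INDEX = ".mergeindex"; // assumed reserved name

    public static void mergeDirectory(FileSystem fs, Path dir) throws Exception {
        try (FSDataOutputStream data = fs.create(new Path(dir, MERGE_FILE));
             FSDataOutputStream index = fs.create(new Path(dir, MERGE_INDEX))) {
            long offset = 0;
            for (FileStatus status : fs.listStatus(dir)) {
                if (status.isDirectory()) continue;                  // same directory only
                String name = status.getPath().getName();
                if (name.equals(MERGE_FILE) || name.equals(MERGE_INDEX)) continue;

                // Append the small file's bytes to MergeFile, uncompressed.
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    IOUtils.copyBytes(in, data, 4096, false);
                }
                // One Metadata record per line in MergeIndex, terminated by CRLF.
                String record = name + " " + offset + " " + status.getLen() + "\r\n";
                index.write(record.getBytes(StandardCharsets.UTF_8));
                offset += status.getLen();
            }
        }
    }
}
```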

Embodiment 2

[0028] On the basis of Embodiment 1, this embodiment applies special processing to log-type small files. Log-type small files are a derivation of HDFS files at the interface level: when creating a file, the client specifies whether the created file is a log-type small file. The parent directory of each small log file contains a unique pair consisting of a MergeIndex file and a MergeFile file. The file merge operation is triggered when the write operation on a log-type small file ends: the file content is appended to the MergeFile and the file's Metadata is appended to the MergeIndex. The MergeFile structure is shown in figure 1: multiple small files are stored contiguously in the MergeFile, and the data is not compressed. The MergeIndex structure is shown in figure 2: each file's Metadata record occupies one line, terminated by a carriage-return line-feed (CRLF).
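A minimal sketch of that trigger, assuming the same hypothetical reserved names and index record layout as above: the client buffers the log-type small file in memory and, when the write ends (close), appends its bytes to the MergeFile and a CRLF-terminated Metadata line to the MergeIndex. HDFS append support is assumed to be enabled on the cluster.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Sketch of "merge on write completion": the buffered content of a log-type
 * small file is appended to MergeFile on close(), and one CRLF-terminated
 * Metadata record is appended to MergeIndex. Names and fields are assumptions.
 */
public class LogSmallFileWriter extends ByteArrayOutputStream {
    private final FileSystem fs;
    private final Path dir;      // parent directory of the small log file
    private final String name;   // logical name of the small log file

    public LogSmallFileWriter(FileSystem fs, Path dir, String name) {
        this.fs = fs;
        this.dir = dir;
        this.name = name;
    }

    @Override
    public void close() throws IOException {
        Path mergeFile = new Path(dir, ".mergefile");   // assumed reserved name
        Path mergeIndex = new Path(dir, ".mergeindex"); // assumed reserved name

        // Offset of this file inside MergeFile = current MergeFile length.
        long offset = fs.exists(mergeFile) ? fs.getFileStatus(mergeFile).getLen() : 0;

        // Append the buffered file content to MergeFile.
        try (FSDataOutputStream data =
                 fs.exists(mergeFile) ? fs.append(mergeFile) : fs.create(mergeFile)) {
            data.write(toByteArray());
        }
        // Append one Metadata record, one line, CRLF-terminated, to MergeIndex.
        String record = name + " " + offset + " " + size() + "\r\n";
        try (FSDataOutputStream index =
                 fs.exists(mergeIndex) ? fs.append(mergeIndex) : fs.create(mergeIndex)) {
            index.write(record.getBytes(StandardCharsets.UTF_8));
        }
    }
}
```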

[0029] The detailed writing process of log-type small files is as...

Embodiment 3

[0039] On the basis of Embodiment 2, the client in this embodiment reads and writes files as follows (a code sketch follows the list):

[0040] (1) Using the file path specified by the client, the client library communicates with the NameNode to check whether a file exists at that path. If it exists, the file is an ordinary HDFS file and is read or written through the original HDFS process with no special handling; if it does not exist, the file may be a log-type small file, so go to step (2).

[0041] (2) The client library reads the MergeIndex under the parent directory of the specified path and traverses the file entries from back to front to find the specified file. If the search fails, the specified path does not exist and an error is returned; if the search succeeds, the file is a log-type small file, and read requests proceed to step (3) while write requests proceed to step (4).

[0042] (3) According t...
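The sketch below (illustrative only, not the patent's code) follows steps (1) and (2) for the read path: try the path as an ordinary HDFS file first; if it is absent, scan the parent directory's MergeIndex from back to front. Step (3) is truncated in the excerpt, so the positioned read of a byte range out of MergeFile, as well as the `.mergeindex`/`.mergefile` names and the `<name> <offset> <length>` record layout, are assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Sketch of the client-side read path for a (possibly merged) small log file. */
public class LogSmallFileReader {

    public static byte[] read(FileSystem fs, Path path) throws IOException {
        // Step (1): ordinary HDFS file?
        if (fs.exists(path)) {
            try (FSDataInputStream in = fs.open(path)) {
                byte[] buf = new byte[(int) fs.getFileStatus(path).getLen()];
                in.readFully(0, buf);
                return buf;
            }
        }

        // Step (2): look the name up in the parent directory's MergeIndex,
        // traversing records from back to front (latest entry wins).
        Path dir = path.getParent();
        Path mergeIndex = new Path(dir, ".mergeindex"); // assumed reserved name
        if (!fs.exists(mergeIndex)) {
            throw new IOException("File not found: " + path);
        }
        List<String> records = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(mergeIndex), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                records.add(line);
            }
        }
        for (int i = records.size() - 1; i >= 0; i--) {
            String[] fields = records.get(i).split(" ");
            if (fields[0].equals(path.getName())) {
                long offset = Long.parseLong(fields[1]);
                int length = Integer.parseInt(fields[2]);
                // Step (3), assumed: positioned read of the range from MergeFile.
                try (FSDataInputStream in = fs.open(new Path(dir, ".mergefile"))) {
                    byte[] buf = new byte[length];
                    in.readFully(offset, buf);
                    return buf;
                }
            }
        }
        throw new IOException("File not found: " + path);
    }
}
```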

Abstract

The invention relates to the field of the computer HDFS and discloses a method for storing and processing small log type files in a Hadoop distributed file system (HDFS). According to the method, files are merged according to the proximity of their physical locations, and a Copy-On-Write mechanism is used to optimize reading and writing of the small files. Specifically, the small log type files are merged by physical path; when reading or writing a small log type file, the client side obtains from the NameNode the Metadata of the merged file and of the merged file's index, and then reads or writes the small file's data from the merged file according to that index. With this processing method for small log type files, the memory load of the small files' metadata is transferred from the NameNode to the client side, which effectively solves the problem of low efficiency when HDFS processes a large number of small files. The client side caches the metadata of the small files, which speeds up access to them: a user does not need to send a metadata request to the NameNode when sequentially accessing small files that are adjacent in physical location.
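The client-side caching mentioned in the abstract could look roughly like the plain-Java sketch below: once a directory's merged-file index has been read, its records are kept in memory so that sequential access to physically adjacent small files needs no further NameNode metadata requests. The record type, the per-directory cache granularity, and the loader interface are illustrative assumptions, not details from the patent.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of a client-side cache of small-file metadata, keyed by parent directory. */
public class SmallFileMetadataCache {

    /** One parsed index record: file name plus byte range inside the merged file. */
    public static final class IndexRecord {
        public final String name;
        public final long offset;
        public final int length;

        public IndexRecord(String name, long offset, int length) {
            this.name = name;
            this.offset = offset;
            this.length = length;
        }
    }

    /** Loader abstraction: reads and parses one directory's merged-file index. */
    public interface IndexLoader {
        List<IndexRecord> load(String directory) throws Exception;
    }

    private final Map<String, List<IndexRecord>> cache = new HashMap<>();
    private final IndexLoader loader;

    public SmallFileMetadataCache(IndexLoader loader) {
        this.loader = loader;
    }

    /** Returns the cached records for a directory, loading them only on a cache miss. */
    public List<IndexRecord> recordsFor(String directory) throws Exception {
        List<IndexRecord> records = cache.get(directory);
        if (records == null) {
            records = loader.load(directory); // only cache misses touch the cluster
            cache.put(directory, records);
        }
        return records;
    }
}
```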

Description

technical field

[0001] The invention relates to the field of computer HDFS distributed file systems, and in particular to a method for storing and processing log-type small files in HDFS.

Background technique

[0002] HDFS is the abbreviation of Hadoop Distributed File System, a distributed file storage system.

[0003] As the Internet penetrates every aspect of people's lives, more and more devices join the Internet. These devices produce data all the time, and the amount and variety of data that must be processed keep increasing. As an open-source implementation of GFS, HDFS under Hadoop handles large files very well, but it processes small files very inefficiently: a large number of small files occupies NameNode memory resources, and DataNode disk utilization is low.

[0004] The industry has tried some HDFS optimization methods for small files. However, these methods are all focused on storage, and the interface pr...

Application Information

IPC(8): G06F17/30
CPC: G06F16/172, G06F16/1734, G06F16/182
Inventor: 徐锐, 刘斌, 台宪青
Owner: JIANGSU R & D CENTER FOR INTERNET OF THINGS