Mass small file processing method based on HDFS

A technology for processing massive small files, applied in the field of distributed data optimized storage. It addresses the problems of being unable to handle image upload requests, being unable to delete or modify files, and high cost, and achieves efficient real-time file updates, improved write and read efficiency, and better overall performance.

Status: Inactive. Publication Date: 2016-03-16
HOHAI UNIV

AI Technical Summary

Problems solved by technology

Hadoop itself provides a solution: Hadoop Archive (hereinafter HAR), the Hadoop archive format. However, reading a file through HAR is actually less efficient than reading it directly from HDFS. Suzhou Liangjiang Technology Co., Ltd., in its patent, uses SequenceFile (a key-value file format of Hadoop) to package files; since a SequenceFile has no direct index, the whole file must be scanned on every read, which is inefficient. MapFile (another key-value file format of Hadoop) is an indexed SequenceFile, but it requires additional memory to hold the index metadata.
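To illustrate the indexing trade-off described above, the following is a minimal sketch of packing small files into a Hadoop MapFile and retrieving one by key, using the stock org.apache.hadoop.io.MapFile API. It is not the patented method itself, only an illustration of why a MapFile lookup avoids a full scan at the cost of keeping the index in memory; the path names are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class MapFilePackingSketch {

    // Pack a few local small files into one MapFile directory on HDFS.
    // MapFile requires keys to be appended in sorted order; here the file name is the key.
    public static void pack(String[] localFiles, String mapFileDir) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Classic constructor (deprecated in Hadoop 2.x but still available):
        // key type Text, value type raw bytes.
        MapFile.Writer writer = new MapFile.Writer(conf, fs, mapFileDir, Text.class, BytesWritable.class);
        try {
            for (String f : localFiles) {            // assumed to be lexicographically sorted
                byte[] content = Files.readAllBytes(Paths.get(f));
                writer.append(new Text(f), new BytesWritable(content));
            }
        } finally {
            writer.close();
        }
    }

    // Random read by key: the in-memory index narrows the seek to one position,
    // unlike a plain SequenceFile, which has to be scanned from the beginning.
    public static byte[] read(String mapFileDir, String fileName) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        MapFile.Reader reader = new MapFile.Reader(fs, mapFileDir, conf);
        try {
            BytesWritable value = new BytesWritable();
            if (reader.get(new Text(fileName), value) != null) {
                return value.copyBytes();
            }
            return null;                              // key not present
        } finally {
            reader.close();
        }
    }
}
```

The index file inside the MapFile directory is loaded into memory by the reader, which is the extra memory cost mentioned above.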
[0005] In addit

Method used




Example Embodiment

[0027] The technical scheme of the present invention will be further described below in conjunction with the drawings and embodiments:

[0028] The present invention, a method for processing massive small files based on HDFS, is shown in Figure 1; its specific content will not be repeated here.

[0029] File upload with the HDFS-based massive small file processing method of the present invention is shown in Figure 2; its working process is as follows:

[0030] 1) Filter the uploaded files received by the server according to the set first threshold, here 1 MB. If the size of an uploaded file is greater than 1 MB, it is a large file and is written directly through the HDFS file storage interface, that is, the Namenode allocates its storage blocks (BlockID) and the file is stored in HDFS; otherwise it is a small file, and processing continues with step 2) (a hedged sketch of this upload path is given after step 2) below);

[0031] 2) Obtain the small file's name, its length, and the upload timestamp CreateTime, then use th...
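A hedged sketch of the upload path in steps 1) and 2) is given below. It assumes a 1 MB first threshold and uses the standard HDFS FileSystem interface for the large-file branch. Because the original text of step 2) is truncated, the small-file key format (file name plus CreateTime), the target path, and names such as `SmallFileTask` and `uploadQueue` are assumptions for illustration, not the patent's exact design.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class UploadDispatcherSketch {

    private static final long FIRST_THRESHOLD = 1L * 1024 * 1024;   // 1 MB, per step 1)

    // Hypothetical record for a small file waiting in the in-memory upload queue.
    static class SmallFileTask {
        final String fileId;      // assumed format: fileName + "_" + createTime
        final byte[] content;
        SmallFileTask(String fileId, byte[] content) {
            this.fileId = fileId;
            this.content = content;
        }
    }

    private final BlockingQueue<SmallFileTask> uploadQueue = new LinkedBlockingQueue<>();

    public void dispatch(String fileName, byte[] content, long createTime) throws Exception {
        if (content.length > FIRST_THRESHOLD) {
            // Large file: write straight through the normal HDFS interface;
            // the Namenode allocates the storage blocks as usual.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            try (FSDataOutputStream out = fs.create(new Path("/upload/" + fileName))) {  // hypothetical path
                out.write(content);
            }
        } else {
            // Small file: build a file ID from its metadata and enqueue it for
            // delayed merging into a MapFile in the memory cache region (step 2).
            String fileId = fileName + "_" + createTime;              // assumed key format
            uploadQueue.put(new SmallFileTask(fileId, content));
        }
    }
}
```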



Abstract

The invention provides a mass small file processing method based on HDFS (Hadoop Distributed File System). The method comprises the following steps: filtering small files and reading their metadata to generate a file ID (identifier), completing file preprocessing; opening up a memory cache region; building a file upload queue; performing delayed storage of the files; merging the small files in the cache region into a MapFile of <key, value> structure for storage; storing the file metadata in the distributed database HBase, which is itself persisted on HDFS; and expressing the file status with a Status flag bit to support operations such as fast reading of cached small files and MapFile fragment merging, so that real-time insert, update and delete operations on small files in HDFS can be supported. The method improves the read efficiency of HDFS for small files, allows the system to support real-time update operations on small files, and improves the overall performance of the system.
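The abstract's metadata step (storing file metadata in HBase together with a Status flag bit) could look roughly like the sketch below, using the standard HBase client API. The table name `smallfile_meta`, the column family `meta`, the stored columns, and the status values are assumptions for illustration; the patent text available here does not publish its exact schema.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;

public class SmallFileMetaStoreSketch {

    // Hypothetical schema: row key = file ID, column family "meta" holds the owning
    // MapFile path, the file length, and a Status flag bit. HBase itself is persisted
    // on HDFS, consistent with the abstract.
    public void saveMetadata(String fileId, String mapFilePath, long length, String status)
            throws IOException {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("smallfile_meta"))) {
            Put put = new Put(Bytes.toBytes(fileId));
            put.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("mapfile"), Bytes.toBytes(mapFilePath));
            put.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("length"), Bytes.toBytes(length));
            // Assumed status values: CACHED while still in the memory cache region,
            // STORED once merged into a MapFile, DELETED after a logical delete.
            put.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("status"), Bytes.toBytes(status));
            table.put(put);
        }
    }
}
```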

Description

Technical Field
[0001] The invention relates to a method for processing massive small files based on HDFS, belonging to the field of distributed data optimized storage.
Background Technique
[0002] With the rise of Internet Web 2.0, the amount of network data is growing exponentially. In the era of big data, traditional data storage technology can no longer meet the needs of technological development. The Hadoop Distributed File System (HDFS) is a distributed file system; in the field of distributed file storage technology represented by HDFS, it is widely used to process various large files efficiently.
[0003] HDFS is optimized for high data throughput application scenarios, in other words for access to large files. Accessing a large number of small files requires continuously jumping from one Datanode (an HDFS data node that provides storage blocks) to another, which severely impacts performance. Finally, pr...

Claims


Application Information

IPC(8): G06F17/30
CPC: G06F16/162; G06F16/1724
Inventors: 陈洁, 王龙宝, 张雪洁, 孙泽群, 安纪存, 马鹏举
Owner: HOHAI UNIV