
Mass small file processing method based on HDFS

A method for processing massive numbers of small files, applied in the field of optimized distributed data storage. It addresses problems such as the inability to process image upload requests, the inability to delete and modify files, and cost. It achieves efficient real-time file updates, better write and read efficiency, and improved overall performance.

Status: Inactive
Publication Date: 2016-03-16
HOHAI UNIV
Cites: 4, Cited by: 46

AI Technical Summary

Problems solved by technology

Hadoop itself provides a solution, Hadoop Archive (the Hadoop archive format, hereinafter referred to as HAR), but reading a file through HAR is actually less efficient than reading it directly from HDFS. In its patent, Suzhou Liangjiang Technology Co., Ltd. uses SequenceFile (a key-value file format of Hadoop) to package files; since SequenceFile has no direct index, every read has to search through the entire file, which is inefficient. Mapfile (another key-value file format of Hadoop) is an indexed SequenceFile, but it requires additional memory to hold the metadata of the index file.
[0005] In addition, the file-merging solutions provided by Hadoop require the files to be packaged and uploaded in one pass, so a file cannot be deleted, modified, or appended to after it has been uploaded.
The patent of Beijing University of Aeronautics and Astronautics improves the HDFS file-reading interface and uses the MapReduce model (the Hadoop programming model) for processing, but this method is not suited to online, real-time storage and modification and cannot handle highly concurrent image upload requests.
These limitations prevent the performance of HDFS from being fully utilized in many application areas.
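
The trade-off between an unindexed and an indexed key-value container can be illustrated with a small sketch. This is not the on-disk format of SequenceFile or Mapfile; the record layout and the function names `pack`, `read_by_scan`, and `read_by_index` are assumptions made only for illustration. The sketch shows that a lookup without an index walks the records in order, while an index answers the same lookup directly at the price of keeping extra metadata in memory.

```python
import io


def pack(files):
    """Concatenate {name: bytes} into one blob and build an offset index.

    Hypothetical record layout: 4-byte key length, 4-byte value length,
    key bytes, value bytes. Records are written in sorted key order.
    """
    blob = io.BytesIO()
    index = {}  # name -> (offset, length): the extra metadata an index costs
    for name, data in sorted(files.items()):
        key = name.encode("utf-8")
        blob.write(len(key).to_bytes(4, "big"))
        blob.write(len(data).to_bytes(4, "big"))
        blob.write(key)
        index[name] = (blob.tell(), len(data))
        blob.write(data)
    return blob.getvalue(), index


def read_by_scan(blob, name):
    """No index: walk records from the start; return (data, records_scanned)."""
    target = name.encode("utf-8")
    pos, scanned = 0, 0
    while pos < len(blob):
        klen = int.from_bytes(blob[pos:pos + 4], "big")
        vlen = int.from_bytes(blob[pos + 4:pos + 8], "big")
        key = blob[pos + 8:pos + 8 + klen]
        start = pos + 8 + klen
        scanned += 1
        if key == target:
            return blob[start:start + vlen], scanned
        pos = start + vlen
    return None, scanned


def read_by_index(blob, index, name):
    """With an index: jump straight to the value, no scan needed."""
    entry = index.get(name)
    if entry is None:
        return None
    offset, length = entry
    return blob[offset:offset + length]
```

In this sketch the cost of `read_by_scan` grows with the number of packed files, while `read_by_index` stays constant but needs the `index` dictionary to be held somewhere, which mirrors the memory concern raised above for the Mapfile index.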



Examples


Embodiment Construction

[0027] The technical scheme of the present invention is further elaborated below in conjunction with the accompanying drawings and embodiments:

[0028] The HDFS-based method of the present invention for processing massive numbers of small files is shown in Figure 1; the specific content is not repeated here.

[0029] File upload using the HDFS-based massive small file processing method of the present invention is shown in Figure 2. Its working process is as follows:

[0030] 1) Filter the uploaded files received by the server against a preset first threshold, set here to 1M. If the size of an uploaded file is greater than 1M, it is a large file and is stored directly through the HDFS file storage interface, that is, the Namenode allocates a file storage block (BlockID) and the file is stored in HDFS; otherwise it is a small file, go to 2);

[0031] 2) After obtaining the file name of the small file, the length of the file, a...
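
The size filter in step 1) can be sketched as follows. The names `FIRST_THRESHOLD`, `route_upload`, and `partition_uploads` are hypothetical, and the sketch assumes that "1M" means 1 MiB (1024 × 1024 bytes); the source does not state the exact unit. The actual storage calls to HDFS are left out, so only the routing decision is shown.

```python
# Assumption: the first threshold "1M" is taken to mean 1 MiB.
FIRST_THRESHOLD = 1024 * 1024


def route_upload(size, threshold=FIRST_THRESHOLD):
    """Classify one upload by size.

    Files strictly larger than the threshold are 'large' and would be written
    through the normal HDFS storage interface; everything else is 'small' and
    would continue to step 2).
    """
    return "large" if size > threshold else "small"


def partition_uploads(uploads, threshold=FIRST_THRESHOLD):
    """Split an iterable of (name, size) pairs into (large, small) name lists."""
    large, small = [], []
    for name, size in uploads:
        if route_upload(size, threshold) == "large":
            large.append(name)
        else:
            small.append(name)
    return large, small
```

Because the text says "greater than 1M", a file of exactly the threshold size is treated as small in this sketch.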



Abstract

The invention provides a method for processing massive numbers of small files based on HDFS (Hadoop Distributed File System). The method comprises the following steps: preprocessing files by filtering out the small files, reading their metadata, and generating a file ID (identifier); allocating a memory cache region; building a file upload queue; storing the files with a time delay; merging the small files in the cache region into a Mapfile with a <key, value> structure for storage; storing the file metadata in the distributed database HBase, which is itself persisted in HDFS; and representing the file state with a Status flag bit, so that operations such as fast reading of small files from the cache region and Mapfile fragment merging can be carried out. In this way HDFS can support real-time insert, update, and delete operations on small files. The method improves the efficiency with which HDFS reads small files, allows the system to support real-time updates of small files, and improves the overall performance of the system.
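
The flow described in the abstract can be approximated by an in-memory sketch. Everything here is an illustrative assumption rather than the patented implementation: the class `SmallFileStore`, the three status values, the flush-by-count policy, and the `compact` step are hypothetical, plain dictionaries stand in for Mapfile containers and for the HBase metadata table, and no Hadoop or HBase API is used.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical values for the Status flag.
STATUS_CACHED, STATUS_STORED, STATUS_DELETED = "cached", "stored", "deleted"


@dataclass
class Meta:
    file_id: str
    status: str
    container: Optional[int] = None  # index of the merged container, if stored


class SmallFileStore:
    def __init__(self, flush_count=3):
        self.flush_count = flush_count  # assumed policy: flush after N cached files
        self.cache = {}       # file_id -> bytes; stands in for the memory cache region
        self.containers = []  # key-sorted dicts; each stands in for one Mapfile
        self.meta = {}        # file_id -> Meta; stands in for the HBase table

    def put(self, file_id, data):
        """Insert or update: the newest version waits in the cache until flushed."""
        self.cache[file_id] = data
        self.meta[file_id] = Meta(file_id, STATUS_CACHED)
        if len(self.cache) >= self.flush_count:
            self.flush()

    def flush(self):
        """Delayed store: merge all cached files into one key-sorted container."""
        if not self.cache:
            return
        merged = dict(sorted(self.cache.items()))
        self.containers.append(merged)
        for fid in merged:
            self.meta[fid] = Meta(fid, STATUS_STORED, len(self.containers) - 1)
        self.cache.clear()

    def get(self, file_id):
        m = self.meta.get(file_id)
        if m is None or m.status == STATUS_DELETED:
            return None
        if m.status == STATUS_CACHED:
            return self.cache[file_id]  # fast path: served from the cache
        return self.containers[m.container][file_id]

    def delete(self, file_id):
        """Logical delete: only the status flag changes; stored bytes stay put."""
        m = self.meta.get(file_id)
        if m is None:
            return False
        self.cache.pop(file_id, None)
        m.status = STATUS_DELETED
        return True

    def compact(self):
        """Assumed fragment merge: rewrite live stored entries into one container."""
        live = {fid: self.containers[m.container][fid]
                for fid, m in self.meta.items() if m.status == STATUS_STORED}
        self.containers = [dict(sorted(live.items()))] if live else []
        for fid in live:
            self.meta[fid].container = 0
        self.meta = {fid: m for fid, m in self.meta.items()
                     if m.status != STATUS_DELETED}
```

The design point the sketch tries to capture is that updates and deletes never rewrite an already merged container in place. They only touch the cache and the status flag, and stale entries are cleaned up later by a separate merge pass.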

Description

Technical field

[0001] The invention relates to a method for processing massive numbers of small files based on HDFS, and belongs to the field of optimized distributed data storage.

Background technique

[0002] With the rise of Web 2.0, the amount of network data is increasing exponentially. In the era of big data, traditional data storage technology can no longer meet the needs of technological development. The Hadoop Distributed File System, HDFS for short, is a distributed file system. At present, in the field of distributed file storage technology represented by HDFS, HDFS is widely used to process large files of all kinds efficiently.

[0003] HDFS is optimized for application scenarios with high data throughput. In other words, it was developed for accessing large files. Accessing a large number of small files requires jumping continuously from one Datanode (a data node of HDFS, which provides storage blocks for HDFS) to another Datanode, which severely impacts performance. Finally, pr...


Application Information

IPC(8): G06F17/30
CPC: G06F16/162, G06F16/1724
Inventors: 陈洁, 王龙宝, 张雪洁, 孙泽群, 安纪存, 马鹏举
Owner: HOHAI UNIV