Mass small file processing method based on HDFS

A technology for processing massive numbers of small files, applied in the field of optimized distributed data storage. It addresses problems such as the inability to handle image upload requests, the inability to delete or modify files, and high cost, and achieves efficient real-time file updates, improved write and read efficiency, and better overall performance.

Inactive Publication Date: 2016-03-16
HOHAI UNIV
Cites: 4 | Cited by: 46

AI-Extracted Technical Summary

Problems solved by technology

Hadoop itself provides a solution: Hadoop Archive (the Hadoop archive format, hereinafter HAR), but reading a file through HAR is actually less efficient than reading it directly from HDFS. In one of its patents, Suzhou Liangjiang Technology Co., Ltd. uses SequenceFile (a key-value file format of Hadoop) to package files; since SequenceFile has no direct index, the entire file must be scanned on every read, which is inefficient. Mapfile (another key-value file format of Hadoop) is an indexed SequenceFile, but it requires additional memory to hold the metadata of the index file.
[0005] In addit...

Method used

According to the above embodiments, it can be seen that, in view of HDFS's low utilization of storage resources for massive small-file data, its low file access efficiency, and its inability to update files in real time, the method of the present invention improves the efficiency with which HDFS writes and reads small files, supports instant updates of small files, and improves the overall performance of the system.

Abstract

The invention provides a mass small file processing method based on HDFS (Hadoop Distributed File System). The method comprises the following steps: generating a file ID (identifier) through small-file filtering and metadata reading to complete preprocessing of the files; opening a memory cache region; building a file upload queue; performing delayed storage of the files; merging the small files in the cache region into a Mapfile of <key, value> structure for storage; and storing the file metadata in the distributed database HBase, which persists its data in HDFS. A Status flag bit expresses the file state and supports operations such as fast reading of the small files in the cache region and Mapfile fragment merging, so that real-time insert, update, and delete operations on small files in HDFS are supported. The method improves the reading efficiency of HDFS for small files, enables the system to support real-time update operations on small files, and improves the overall performance of the system.


Examples

  • Experimental program(1)

Example Embodiment

[0027] The technical scheme of the present invention will be further described below in conjunction with the drawings and embodiments:
[0028] The present invention is a method for processing massive small files based on HDFS, as shown in Figure 1; the specific content will not be repeated here.
[0029] Uploading a file with the HDFS-based massive small file processing method of the present invention, as shown in Figure 2, works as follows:
[0030] 1) Filter the uploaded files received by the server according to a preset first threshold, here set to 1 MB. If an uploaded file is larger than 1 MB, it is a large file and is stored through the plain HDFS file storage interface, that is, the Namenode allocates a file storage block (BlockID) and the file is stored in HDFS; otherwise, it is a small file and processing continues at step 2);
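As a minimal sketch of this dispatch step (assuming a 1 MB threshold and a hypothetical handleSmallFile() hook standing in for steps 2) to 7)), the large-file branch can simply write through FileSystem.create, which lets the Namenode allocate the storage blocks:

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class UploadDispatcher {
    // First threshold from step 1): files above 1 MB take the plain HDFS path.
    private static final long FIRST_THRESHOLD = 1L << 20;

    public static void dispatch(String fileName, long length, InputStream in,
                                Configuration conf) throws Exception {
        if (length > FIRST_THRESHOLD) {
            // Large file: the Namenode allocates the storage blocks behind this call.
            FileSystem fs = FileSystem.get(conf);
            try (FSDataOutputStream out = fs.create(new Path("/bigfiles/" + fileName))) {
                IOUtils.copyBytes(in, out, conf, false);
            }
        } else {
            handleSmallFile(fileName, length, in);   // hypothetical hook for steps 2)-7)
        }
    }

    private static void handleSmallFile(String name, long length, InputStream in) {
        // ... small-file path of the method: ID generation, buffering, merging ...
    }
}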
[0031] 2) Obtain the small file's name, its length, and its upload timestamp CreateTime; concatenate the file name Name with the upload timestamp CreateTime as a string, then apply SHA-1 (the secure hash algorithm) to the result to generate the small file's storage ID;
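A sketch of this ID generation using the JDK's MessageDigest; hex-encoding the 20-byte digest is one possible representation and is an assumption here:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public final class FileIdGenerator {
    /** Concatenates file name and upload timestamp, then hashes with SHA-1. */
    public static String storageId(String name, long createTime) throws Exception {
        String spliced = name + createTime;                       // Name + CreateTime
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        byte[] digest = sha1.digest(spliced.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));                 // hex-encode the digest
        }
        return hex.toString();                                    // 40-character storage ID
    }
}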
[0032] 3) Use a Status flag to indicate the storage state of the small file; the meanings of the Status flag are listed in the table below. The Status flag of the newly uploaded small file is set to 0;
[0033] Table: Status flag meanings
[0034]
Status | Meaning
0      | Local temp file (still in the buffer)
1      | HDFS Mapfile (file synchronized into HDFS)
2      | Deleted file
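For illustration only, the three flag values could be captured as constants; the enum and its names below are not from the patent:

/** Status flag values from the table above (names are illustrative). */
public enum FileStatus {
    LOCAL_TEMP(0),     // 0: file still sits in the local buffer
    HDFS_MAPFILE(1),   // 1: file has been merged into a Mapfile in HDFS
    DELETED(2);        // 2: file deleted; its Mapfile region is a fragment

    private final int code;
    FileStatus(int code) { this.code = code; }
    public int code() { return code; }

    public static FileStatus fromCode(int code) {
        for (FileStatus s : values()) if (s.code == code) return s;
        throw new IllegalArgumentException("Unknown Status flag: " + code);
    }
}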
[0035] 4) Use HBase to store the metadata of the small files: the metadata is stored in the distributed non-relational database HBase, and HBase persists its data in HDFS. Specifically, the file storage ID is used as the row key and two column families, Attr and Var, are created. The column family Attr contains four columns: file name (Name), file length (Length), Mapfile storage ID (MapfileID), and file storage block (BlockID); the column family Var contains two columns: file status flag (Status) and file update time (UpdateTime);
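A sketch of writing one metadata row with the HBase client API, assuming a table named smallfile_meta (the table name is illustrative) with the two column families Attr and Var described above:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MetadataStore {
    public static void putMetadata(String storageId, String name, long length,
                                   String mapfileId, String blockId,
                                   int status, long updateTime) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("smallfile_meta"))) {
            Put put = new Put(Bytes.toBytes(storageId));              // row key = file storage ID
            put.addColumn(Bytes.toBytes("Attr"), Bytes.toBytes("Name"), Bytes.toBytes(name));
            put.addColumn(Bytes.toBytes("Attr"), Bytes.toBytes("Length"), Bytes.toBytes(length));
            put.addColumn(Bytes.toBytes("Attr"), Bytes.toBytes("MapfileID"), Bytes.toBytes(mapfileId));
            put.addColumn(Bytes.toBytes("Attr"), Bytes.toBytes("BlockID"), Bytes.toBytes(blockId));
            put.addColumn(Bytes.toBytes("Var"), Bytes.toBytes("Status"), Bytes.toBytes(status));
            put.addColumn(Bytes.toBytes("Var"), Bytes.toBytes("UpdateTime"), Bytes.toBytes(updateTime));
            table.put(put);
        }
    }
}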
[0036] 5) According to the size of the server's memory, allocate a memory buffer, establish an upload queue, and store the small files in the buffer's upload queue using local IO;
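One possible shape for this buffer upload queue: small files are written into a local buffer directory with local IO and tracked in a queue whose capacity is derived from the server's memory. Class and method names are illustrative:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Buffer upload queue of step 5); names are illustrative. */
public class UploadBuffer {
    private final Path bufferDir;                     // local buffer directory
    private final long capacityBytes;                 // sized from server memory
    private final Queue<Path> queue = new ConcurrentLinkedQueue<>();
    private final AtomicLong usedBytes = new AtomicLong();

    public UploadBuffer(Path bufferDir, long capacityBytes) {
        this.bufferDir = bufferDir;
        this.capacityBytes = capacityBytes;
    }

    /** Stores one small file locally; returns false when it no longer fits,
     *  signalling that the merge of step 6) must run first. */
    public boolean offer(String storageId, InputStream content, long length) throws IOException {
        if (usedBytes.get() + length > capacityBytes) return false;
        Path local = bufferDir.resolve(storageId);
        Files.copy(content, local);                   // local IO write
        queue.add(local);
        usedBytes.addAndGet(length);
        return true;
    }

    public Queue<Path> pendingFiles() { return queue; }
    public void clear() { queue.clear(); usedBytes.set(0); }
}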
[0037] 6) If the total size of the files in the buffer exceeds a preset second threshold, or the small file to be uploaded is larger than the remaining buffer space, the small files in the buffer are merged and the buffer is cleared. The merge works as follows: the small files in the buffer are merged into a Mapfile, a key-value collection structure, in which the file name of each small file serves as the key and the byte stream of the file content serves as the value, as shown in Figure 3;
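A sketch of this merge using Hadoop's MapFile.Writer. MapFile requires keys to be appended in ascending order, so the buffered files are first collected into a sorted map keyed by their file name; everything outside the Hadoop API is illustrative:

import java.nio.file.Files;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class SmallFileMerger {
    /** Merges the buffered small files into a single MapFile under mapfileDir. */
    public static void merge(List<java.nio.file.Path> bufferedFiles, String mapfileDir,
                             Configuration conf) throws Exception {
        // MapFile demands ascending key order, so collect <name, bytes> pairs
        // into a TreeMap keyed by the Text file name.
        TreeMap<Text, BytesWritable> sorted = new TreeMap<>();
        for (java.nio.file.Path p : bufferedFiles) {
            sorted.put(new Text(p.getFileName().toString()),
                       new BytesWritable(Files.readAllBytes(p)));
        }

        FileSystem fs = FileSystem.get(conf);
        MapFile.Writer writer = new MapFile.Writer(conf, fs, mapfileDir,
                Text.class, BytesWritable.class);
        try {
            for (Map.Entry<Text, BytesWritable> e : sorted.entrySet()) {
                writer.append(e.getKey(), e.getValue());   // file name = key, bytes = value
            }
        } finally {
            writer.close();
        }
    }
}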
[0038] 7) Upload the merged file to HDFS with a delay via an asynchronous thread.
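Step 7) can be realized, for example, with a scheduled executor that copies the merged Mapfile directory into HDFS after a delay; the delay value and paths below are illustrative:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DelayedUploader {
    private final ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();

    /** Copies the locally merged Mapfile directory into HDFS after delaySeconds. */
    public void uploadLater(String localMapfileDir, String hdfsDir,
                            long delaySeconds, Configuration conf) {
        pool.schedule(() -> {
            try {
                FileSystem fs = FileSystem.get(conf);
                // copyFromLocalFile uploads the whole Mapfile directory (data + index).
                fs.copyFromLocalFile(new Path(localMapfileDir), new Path(hdfsDir));
            } catch (Exception e) {
                e.printStackTrace();                  // a real system would retry and log
            }
        }, delaySeconds, TimeUnit.SECONDS);
    }
}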
[0039] Reading a file in this embodiment of the present invention, as shown in Figure 4, proceeds as follows: read the file name; if it is a small file, query HBase to obtain the file's metadata and read the value of its Status flag. If the flag is 0, the file is read directly by local IO; if the flag is 1, the file is read through the Mapfile reading interface; if the flag is 2, the file is not read and a "file does not exist" message is returned. If it is a large file, it is read directly from HDFS via the Namenode.
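A sketch of this read path, combining the HBase metadata lookup with the three Status branches. It assumes the smallfile_meta schema sketched above and treats the MapfileID column as the HDFS path of the containing Mapfile, which is an interpretation rather than something the patent states:

import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class SmallFileReader {
    /** Returns the file content, or null when the Status flag says it was deleted. */
    public static byte[] read(String storageId, String fileName, String bufferDir,
                              Connection hbase, Configuration conf) throws Exception {
        try (Table table = hbase.getTable(TableName.valueOf("smallfile_meta"))) {
            Result row = table.get(new Get(Bytes.toBytes(storageId)));
            int status = Bytes.toInt(row.getValue(Bytes.toBytes("Var"), Bytes.toBytes("Status")));

            if (status == 0) {
                // Flag 0: still a local temp file, read it with plain local IO.
                return Files.readAllBytes(Paths.get(bufferDir, storageId));
            }
            if (status == 1) {
                // Flag 1: already merged, go through the Mapfile reading interface.
                String mapfileDir = Bytes.toString(
                        row.getValue(Bytes.toBytes("Attr"), Bytes.toBytes("MapfileID")));
                MapFile.Reader reader =
                        new MapFile.Reader(new org.apache.hadoop.fs.Path(mapfileDir), conf);
                try {
                    BytesWritable value = new BytesWritable();
                    if (reader.get(new Text(fileName), value) != null) {
                        return value.copyBytes();
                    }
                } finally {
                    reader.close();
                }
            }
            return null;   // Flag 2 (deleted) or not found: "file does not exist"
        }
    }
}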
[0040] In this embodiment, adding a file means adding the file to the buffer upload queue and setting its Status flag to 0.
[0041] Deleting a file in this embodiment of the present invention, as shown in Figure 5, proceeds as follows: read the file name; if it is a small file, query HBase and set the file's Status flag to 2, which marks the file's region in the Mapfile as invalid, that is, as a fragment. A check thread tracks the number of fragments; once it reaches a certain threshold, an asynchronous maintenance thread is started to perform delayed maintenance on the Mapfile, merging the Mapfile to remove the fragments and deleting the corresponding HBase records. If it is a large file, it is deleted directly from HDFS via the Namenode.
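The fragment check described here could be a simple counter that triggers the delayed maintenance once a threshold is crossed; the threshold, delay, and compactMapfile() hook below are illustrative:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class FragmentMonitor {
    private static final int FRAGMENT_THRESHOLD = 100;         // illustrative value
    private final AtomicInteger fragments = new AtomicInteger();
    private final ScheduledExecutorService maintenance =
            Executors.newSingleThreadScheduledExecutor();

    /** Called after a small file's Status flag has been set to 2 in HBase. */
    public void onFileDeleted(String mapfileId) {
        if (fragments.incrementAndGet() >= FRAGMENT_THRESHOLD) {
            fragments.set(0);
            // Delayed, asynchronous Mapfile maintenance (the merge steps below).
            maintenance.schedule(() -> compactMapfile(mapfileId), 60, TimeUnit.SECONDS);
        }
    }

    private void compactMapfile(String mapfileId) {
        // merge the Mapfile, drop Status==2 records, delete their HBase rows
    }
}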
[0042] Merging the Mapfile to remove fragments proceeds as follows (a sketch is given after these steps):
[0043] 6.1) Create a new Mapfile;
[0044] 6.2) Read the old Mapfile sequentially: read the key of the current Record and query HBase for its Status flag;
[0045] 6.6) If the Status is 2, skip the Record and continue with the next one; otherwise, read the value of the current Record, write the key-value pair into the new Mapfile, and update the MapfileID column of the corresponding record in HBase;
[0046] 6.7) Return to 6.2) until the end of the old Mapfile is reached.
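A sketch of this merge loop with MapFile.Reader/Writer and an HBase status lookup. For simplicity it assumes the Mapfile key can be used directly as the HBase row key; if the file name rather than the storage ID is the key, an extra name-to-ID lookup would be needed:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapfileCompactor {
    /** Rewrites oldDir into newDir, dropping every record whose Status flag is 2. */
    public static void compact(String oldDir, String newDir,
                               Connection hbase, Configuration conf) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        MapFile.Reader reader = new MapFile.Reader(fs, oldDir, conf);          // old Mapfile
        MapFile.Writer writer = new MapFile.Writer(conf, fs, newDir,
                Text.class, BytesWritable.class);                             // 6.1) new Mapfile
        try (Table table = hbase.getTable(TableName.valueOf("smallfile_meta"))) {
            Text key = new Text();
            BytesWritable value = new BytesWritable();
            while (reader.next(key, value)) {                                  // 6.2) sequential scan
                byte[] rowKey = Bytes.toBytes(key.toString());
                int status = Bytes.toInt(table.get(new Get(rowKey))
                        .getValue(Bytes.toBytes("Var"), Bytes.toBytes("Status")));
                if (status == 2) {
                    continue;                                                  // fragment: skip it
                }
                writer.append(key, value);                                     // keep live record
                Put repoint = new Put(rowKey);                                 // point row at new Mapfile
                repoint.addColumn(Bytes.toBytes("Attr"), Bytes.toBytes("MapfileID"),
                                  Bytes.toBytes(newDir));
                table.put(repoint);
            }
        } finally {
            reader.close();
            writer.close();
        }
    }
}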
[0047] Updating a file in this embodiment of the present invention, as shown in Figure 6, proceeds as follows: read the file name; if it is a small file, query HBase, set the file's Status to 2, start an asynchronous maintenance thread for delayed maintenance, and upload the new file to the buffer. If it is a large file, upload the new file via the Namenode, directly overwriting the old one.
[0048] According to the above embodiments, it can be seen that, in view of HDFS's low utilization of storage resources for massive small-file data, its low file access efficiency, and its inability to update files in real time, the method of the present invention improves the efficiency with which HDFS writes and reads small files, supports real-time updates of small files, and improves the overall performance of the system. In this invention, filtering small files, pre-reading metadata, merging files, and delaying writes solve the problem of inefficient small-file storage in HDFS; flag-bit processing and fragment merging enable efficient, immediate file updates; and asynchronous thread processing makes the system respond to users faster, improving overall performance.
[0049] The above are only specific implementations of the present invention, but the protection scope of the present invention is not limited thereto. Any changes or substitutions readily conceivable by a person familiar with the technology within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
