Mass small file processing method based on HDFS

A technology for processing massive numbers of small files, applied in the field of optimized distributed data storage. It addresses problems such as the inability to handle image upload requests, the inability to delete or modify files, and high cost, and achieves efficient real-time file updates, improved write and read efficiency, and better overall performance.

Inactive Publication Date: 2016-03-16
HOHAI UNIV
Cites: 4 | Cited by: 46

AI-Extracted Technical Summary

Problems solved by technology

Hadoop itself provides a solution: Hadoop Archive (the Hadoop archive format, hereinafter HAR), but reading a file through HAR is actually less efficient than reading it directly from HDFS. In one of its patents, Suzhou Liangjiang Technology Co., Ltd. uses SequenceFile (a key-value file format of Hadoop) to package files; since SequenceFile has no direct index, the entire file must be scanned on every read, which is inefficient. Mapfile (another key-value file format of Hadoop) is an indexed SequenceFile, but it requires additional memory to hold the metadata of the index file.
[0005] In addit...

Method used

According to the above embodiments, it can be seen that, in view of HDFS's low utilization of storage resources for massive small-file data, its low file access efficiency, and its inability to update files in real time, the method of the present invention improves the efficiency with which HDFS writes and reads small files, supports instant updates of small files, and improves the overall performance of the system.

Abstract

The invention provides a mass small file processing method based on HDFS (Hadoop Distributed File System). The method comprises the following steps: generating a file ID (identifier) through small-file filtering and metadata reading to complete preprocessing of the files; opening a memory cache region; building a file upload queue; performing delayed storage of the files; merging the small files in the cache region into a Mapfile of <key, value> structure for storage; and storing the file metadata in the distributed database HBase, which persists its data in HDFS. A Status flag bit expresses the file state and supports operations such as fast reading of the small files in the cache region and Mapfile fragment merging, so that real-time insert, update, and delete operations on small files in HDFS are supported. The method improves the reading efficiency of HDFS for small files, enables the system to support real-time update operations on small files, and improves the overall performance of the system.


Examples

  • Experimental program(1)

Example Embodiment

[0027] The technical scheme of the present invention will be further described below in conjunction with the drawings and embodiments:
[0028] The present invention is a method for processing massive small files based on HDFS, as shown in Figure 1; the specific content will not be repeated here.
[0029] Uploading a file with the HDFS-based massive small file processing method of the present invention, as shown in Figure 2, works as follows:
[0030] 1) Filter the uploaded files received by the server according to a preset first threshold, here set to 1 MB. If an uploaded file is larger than 1 MB, it is a large file and is stored through the plain HDFS file storage interface, that is, the Namenode allocates a file storage block (BlockID) and the file is stored in HDFS; otherwise, it is a small file and processing continues at step 2);
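As a minimal sketch of this dispatch step (assuming a 1 MB threshold and a hypothetical handleSmallFile() hook standing in for steps 2) to 7)), the large-file branch can simply write through FileSystem.create, which lets the Namenode allocate the storage blocks:

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class UploadDispatcher {
    // First threshold from step 1): files above 1 MB take the plain HDFS path.
    private static final long FIRST_THRESHOLD = 1L << 20;

    public static void dispatch(String fileName, long length, InputStream in,
                                Configuration conf) throws Exception {
        if (length > FIRST_THRESHOLD) {
            // Large file: the Namenode allocates the storage blocks behind this call.
            FileSystem fs = FileSystem.get(conf);
            try (FSDataOutputStream out = fs.create(new Path("/bigfiles/" + fileName))) {
                IOUtils.copyBytes(in, out, conf, false);
            }
        } else {
            handleSmallFile(fileName, length, in);   // hypothetical hook for steps 2)-7)
        }
    }

    private static void handleSmallFile(String name, long length, InputStream in) {
        // ... small-file path of the method: ID generation, buffering, merging ...
    }
}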
[0031] 2) Obtain the small file's name, its length, and its upload timestamp CreateTime; concatenate the file name Name with the upload timestamp CreateTime as a string, then apply SHA-1 (the secure hash algorithm) to the result to generate the small file's storage ID;
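A sketch of this ID generation using the JDK's MessageDigest; hex-encoding the 20-byte digest is one possible representation and is an assumption here:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public final class FileIdGenerator {
    /** Concatenates file name and upload timestamp, then hashes with SHA-1. */
    public static String storageId(String name, long createTime) throws Exception {
        String spliced = name + createTime;                       // Name + CreateTime
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        byte[] digest = sha1.digest(spliced.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));                 // hex-encode the digest
        }
        return hex.toString();                                    // 40-character storage ID
    }
}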
[0032] 3) Use a Status flag to indicate the storage state of the small file; the meanings of the Status flag are listed in the table below. The Status flag of the newly uploaded small file is set to 0;
[0033] Table: Status flag meanings
[0034]
Status | Meaning
0      | Local temp file (still in the buffer)
1      | HDFS Mapfile (file synchronized into HDFS)
2      | Deleted file
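For illustration only, the three flag values could be captured as constants; the enum and its names below are not from the patent:

/** Status flag values from the table above (names are illustrative). */
public enum FileStatus {
    LOCAL_TEMP(0),     // 0: file still sits in the local buffer
    HDFS_MAPFILE(1),   // 1: file has been merged into a Mapfile in HDFS
    DELETED(2);        // 2: file deleted; its Mapfile region is a fragment

    private final int code;
    FileStatus(int code) { this.code = code; }
    public int code() { return code; }

    public static FileStatus fromCode(int code) {
        for (FileStatus s : values()) if (s.code == code) return s;
        throw new IllegalArgumentException("Unknown Status flag: " + code);
    }
}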
[0035] 4) Use HBase to store the metadata of the small files: the metadata is stored in the distributed non-relational database HBase, and HBase persists its data in HDFS. Specifically, the file storage ID is used as the row key and two column families, Attr and Var, are created. The column family Attr contains four columns: file name (Name), file length (Length), Mapfile storage ID (MapfileID), and file storage block (BlockID); the column family Var contains two columns: file status flag (Status) and file update time (UpdateTime);
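A sketch of writing one metadata row with the HBase client API, assuming a table named smallfile_meta (the table name is illustrative) with the two column families Attr and Var described above:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MetadataStore {
    public static void putMetadata(String storageId, String name, long length,
                                   String mapfileId, String blockId,
                                   int status, long updateTime) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("smallfile_meta"))) {
            Put put = new Put(Bytes.toBytes(storageId));              // row key = file storage ID
            put.addColumn(Bytes.toBytes("Attr"), Bytes.toBytes("Name"), Bytes.toBytes(name));
            put.addColumn(Bytes.toBytes("Attr"), Bytes.toBytes("Length"), Bytes.toBytes(length));
            put.addColumn(Bytes.toBytes("Attr"), Bytes.toBytes("MapfileID"), Bytes.toBytes(mapfileId));
            put.addColumn(Bytes.toBytes("Attr"), Bytes.toBytes("BlockID"), Bytes.toBytes(blockId));
            put.addColumn(Bytes.toBytes("Var"), Bytes.toBytes("Status"), Bytes.toBytes(status));
            put.addColumn(Bytes.toBytes("Var"), Bytes.toBytes("UpdateTime"), Bytes.toBytes(updateTime));
            table.put(put);
        }
    }
}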
[0036] 5) According to the size of the server's memory, allocate a memory buffer, establish an upload queue, and store the small files in the buffer's upload queue using local IO;
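One possible shape for this buffer upload queue: small files are written into a local buffer directory with local IO and tracked in a queue whose capacity is derived from the server's memory. Class and method names are illustrative:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Buffer upload queue of step 5); names are illustrative. */
public class UploadBuffer {
    private final Path bufferDir;                     // local buffer directory
    private final long capacityBytes;                 // sized from server memory
    private final Queue<Path> queue = new ConcurrentLinkedQueue<>();
    private final AtomicLong usedBytes = new AtomicLong();

    public UploadBuffer(Path bufferDir, long capacityBytes) {
        this.bufferDir = bufferDir;
        this.capacityBytes = capacityBytes;
    }

    /** Stores one small file locally; returns false when it no longer fits,
     *  signalling that the merge of step 6) must run first. */
    public boolean offer(String storageId, InputStream content, long length) throws IOException {
        if (usedBytes.get() + length > capacityBytes) return false;
        Path local = bufferDir.resolve(storageId);
        Files.copy(content, local);                   // local IO write
        queue.add(local);
        usedBytes.addAndGet(length);
        return true;
    }

    public Queue<Path> pendingFiles() { return queue; }
    public void clear() { queue.clear(); usedBytes.set(0); }
}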
[0037] 6) If the total size of the files in the buffer exceeds a preset second threshold, or the small file to be uploaded is larger than the remaining buffer space, the small files in the buffer are merged and the buffer is cleared. The merge works as follows: the small files in the buffer are merged into a Mapfile, a key-value collection structure, in which the file name of each small file serves as the key and the byte stream of the file content serves as the value, as shown in Figure 3;
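A sketch of this merge using Hadoop's MapFile.Writer. MapFile requires keys to be appended in ascending order, so the buffered files are first collected into a sorted map keyed by their file name; everything outside the Hadoop API is illustrative:

import java.nio.file.Files;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class SmallFileMerger {
    /** Merges the buffered small files into a single MapFile under mapfileDir. */
    public static void merge(List<java.nio.file.Path> bufferedFiles, String mapfileDir,
                             Configuration conf) throws Exception {
        // MapFile demands ascending key order, so collect <name, bytes> pairs
        // into a TreeMap keyed by the Text file name.
        TreeMap<Text, BytesWritable> sorted = new TreeMap<>();
        for (java.nio.file.Path p : bufferedFiles) {
            sorted.put(new Text(p.getFileName().toString()),
                       new BytesWritable(Files.readAllBytes(p)));
        }

        FileSystem fs = FileSystem.get(conf);
        MapFile.Writer writer = new MapFile.Writer(conf, fs, mapfileDir,
                Text.class, BytesWritable.class);
        try {
            for (Map.Entry<Text, BytesWritable> e : sorted.entrySet()) {
                writer.append(e.getKey(), e.getValue());   // file name = key, bytes = value
            }
        } finally {
            writer.close();
        }
    }
}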
[0038] 7) Upload the merged file to HDFS with a delay via an asynchronous thread.
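Step 7) can be realized, for example, with a scheduled executor that copies the merged Mapfile directory into HDFS after a delay; the delay value and paths below are illustrative:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DelayedUploader {
    private final ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();

    /** Copies the locally merged Mapfile directory into HDFS after delaySeconds. */
    public void uploadLater(String localMapfileDir, String hdfsDir,
                            long delaySeconds, Configuration conf) {
        pool.schedule(() -> {
            try {
                FileSystem fs = FileSystem.get(conf);
                // copyFromLocalFile uploads the whole Mapfile directory (data + index).
                fs.copyFromLocalFile(new Path(localMapfileDir), new Path(hdfsDir));
            } catch (Exception e) {
                e.printStackTrace();                  // a real system would retry and log
            }
        }, delaySeconds, TimeUnit.SECONDS);
    }
}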
[0039] Reading a file in this embodiment of the present invention, as shown in Figure 4, proceeds as follows: read the file name; if it is a small file, query HBase to obtain the file's metadata and read the value of its Status flag. If the flag is 0, the file is read directly by local IO; if the flag is 1, the file is read through the Mapfile reading interface; if the flag is 2, the file is not read and a "file does not exist" message is returned. If it is a large file, it is read directly from HDFS via the Namenode.
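A sketch of this read path, combining the HBase metadata lookup with the three Status branches. It assumes the smallfile_meta schema sketched above and treats the MapfileID column as the HDFS path of the containing Mapfile, which is an interpretation rather than something the patent states:

import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class SmallFileReader {
    /** Returns the file content, or null when the Status flag says it was deleted. */
    public static byte[] read(String storageId, String fileName, String bufferDir,
                              Connection hbase, Configuration conf) throws Exception {
        try (Table table = hbase.getTable(TableName.valueOf("smallfile_meta"))) {
            Result row = table.get(new Get(Bytes.toBytes(storageId)));
            int status = Bytes.toInt(row.getValue(Bytes.toBytes("Var"), Bytes.toBytes("Status")));

            if (status == 0) {
                // Flag 0: still a local temp file, read it with plain local IO.
                return Files.readAllBytes(Paths.get(bufferDir, storageId));
            }
            if (status == 1) {
                // Flag 1: already merged, go through the Mapfile reading interface.
                String mapfileDir = Bytes.toString(
                        row.getValue(Bytes.toBytes("Attr"), Bytes.toBytes("MapfileID")));
                MapFile.Reader reader =
                        new MapFile.Reader(new org.apache.hadoop.fs.Path(mapfileDir), conf);
                try {
                    BytesWritable value = new BytesWritable();
                    if (reader.get(new Text(fileName), value) != null) {
                        return value.copyBytes();
                    }
                } finally {
                    reader.close();
                }
            }
            return null;   // Flag 2 (deleted) or not found: "file does not exist"
        }
    }
}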
[0040] In this embodiment, adding a file means adding the file to the buffer upload queue and setting its Status flag to 0.
[0041] Deleting a file in this embodiment of the present invention, as shown in Figure 5, proceeds as follows: read the file name; if it is a small file, query HBase and set the file's Status flag to 2, which marks the file's region in the Mapfile as invalid, that is, as a fragment. A check thread tracks the number of fragments; once it reaches a certain threshold, an asynchronous maintenance thread is started to perform delayed maintenance on the Mapfile, merging the Mapfile to remove the fragments and deleting the corresponding HBase records. If it is a large file, it is deleted directly from HDFS via the Namenode.
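The fragment check described here could be a simple counter that triggers the delayed maintenance once a threshold is crossed; the threshold, delay, and compactMapfile() hook below are illustrative:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class FragmentMonitor {
    private static final int FRAGMENT_THRESHOLD = 100;         // illustrative value
    private final AtomicInteger fragments = new AtomicInteger();
    private final ScheduledExecutorService maintenance =
            Executors.newSingleThreadScheduledExecutor();

    /** Called after a small file's Status flag has been set to 2 in HBase. */
    public void onFileDeleted(String mapfileId) {
        if (fragments.incrementAndGet() >= FRAGMENT_THRESHOLD) {
            fragments.set(0);
            // Delayed, asynchronous Mapfile maintenance (the merge steps below).
            maintenance.schedule(() -> compactMapfile(mapfileId), 60, TimeUnit.SECONDS);
        }
    }

    private void compactMapfile(String mapfileId) {
        // merge the Mapfile, drop Status==2 records, delete their HBase rows
    }
}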
[0042] Merging the Mapfile to remove fragments proceeds as follows (a sketch is given after these steps):
[0043] 6.1) Create a new Mapfile;
[0044] 6.2) Read the old Mapfile sequentially: read the key of the current Record and query HBase for its Status flag;
[0045] 6.6) If the Status is 2, skip the Record and continue with the next one; otherwise, read the value of the current Record, write the key-value pair into the new Mapfile, and update the MapfileID column of the corresponding record in HBase;
[0046] 6.7) Return to 6.2) until the end of the old Mapfile is reached.
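A sketch of this merge loop with MapFile.Reader/Writer and an HBase status lookup. For simplicity it assumes the Mapfile key can be used directly as the HBase row key; if the file name rather than the storage ID is the key, an extra name-to-ID lookup would be needed:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapfileCompactor {
    /** Rewrites oldDir into newDir, dropping every record whose Status flag is 2. */
    public static void compact(String oldDir, String newDir,
                               Connection hbase, Configuration conf) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        MapFile.Reader reader = new MapFile.Reader(fs, oldDir, conf);          // old Mapfile
        MapFile.Writer writer = new MapFile.Writer(conf, fs, newDir,
                Text.class, BytesWritable.class);                             // 6.1) new Mapfile
        try (Table table = hbase.getTable(TableName.valueOf("smallfile_meta"))) {
            Text key = new Text();
            BytesWritable value = new BytesWritable();
            while (reader.next(key, value)) {                                  // 6.2) sequential scan
                byte[] rowKey = Bytes.toBytes(key.toString());
                int status = Bytes.toInt(table.get(new Get(rowKey))
                        .getValue(Bytes.toBytes("Var"), Bytes.toBytes("Status")));
                if (status == 2) {
                    continue;                                                  // fragment: skip it
                }
                writer.append(key, value);                                     // keep live record
                Put repoint = new Put(rowKey);                                 // point row at new Mapfile
                repoint.addColumn(Bytes.toBytes("Attr"), Bytes.toBytes("MapfileID"),
                                  Bytes.toBytes(newDir));
                table.put(repoint);
            }
        } finally {
            reader.close();
            writer.close();
        }
    }
}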
[0047] Updating a file in this embodiment of the present invention, as shown in Figure 6, proceeds as follows: read the file name; if it is a small file, query HBase, set the file's Status to 2, start an asynchronous maintenance thread for delayed maintenance, and upload the new file to the buffer. If it is a large file, upload the new file via the Namenode, directly overwriting the old one.
[0048] According to the above embodiments, it can be seen that, in view of HDFS's low utilization of storage resources for massive small-file data, its low file access efficiency, and its inability to update files in real time, the method of the present invention improves the efficiency with which HDFS writes and reads small files, supports real-time updates of small files, and improves the overall performance of the system. In this invention, filtering small files, pre-reading metadata, merging files, and delaying writes solve the problem of inefficient small-file storage in HDFS; flag-bit processing and fragment merging enable efficient, immediate file updates; and asynchronous thread processing makes the system respond to users faster, improving overall performance.
[0049] The above are only specific implementations of the present invention, but the protection scope of the present invention is not limited thereto. Any changes or substitutions readily conceivable by a person familiar with the technology within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
