Small file storage method based on Hadoop distributed file system

A distributed file system and small-file storage technology, applied in the computer field, which solves the problems of high NameNode memory usage and low storage access efficiency, satisfies low-latency access, reduces the NameNode's storage burden, and achieves high efficiency

Active Publication Date: 2014-06-11
XIDIAN UNIV
Cites: 4 · Cited by: 53


Abstract

The invention discloses a small file storage method based on the Hadoop distributed file system (HDFS). The method comprises the steps of (1) adding two servers; (2) judging whether a file is a small file; (3) judging the request state of a large file; (4) judging the request state of a small file; (5) preprocessing a write request; (6) processing the write request; (7) checking the cache; (8) preprocessing a read request; (9) processing a read request; (10) separating small files; (11) establishing a prefetch record; and (12) updating the prefetch record. Compared with existing methods for storing large numbers of small files, the method preserves the generality of the system while offering high read/write performance and efficiency and easing the NameNode's memory burden, thereby solving the problems of high NameNode memory usage and low storage access efficiency that arise when storing large numbers of small files. The method can be used by a distributed file system to store and manage large numbers of small files.

Application Domain

Transmission

Technology Topic

Distributed File System; File system


Examples

  • Experimental program (1)

Example Embodiment

[0042] The present invention is further described below in conjunction with the drawings.
[0043] Referring to Fig. 1, the specific implementation steps of the present invention are as follows:
[0044] Step 1. Add two servers.
[0045] In addition to the Hadoop distributed file system HDFS, a web server WebServer is added to monitor file read and write requests, and a small file processing server is added to process small files. The system architecture of the present invention thus consists of three parts: the web server WebServer, the small file processing server, and the original HDFS system. The small file processing server mainly performs file merging, file mapping, and file prefetching operations on small files.
[0046] Step 2. Determine whether the file is a small file.
[0047] The web server WebServer judges whether the monitored request file is smaller than 16 MB. If it is smaller than 16 MB, it is regarded as a small file and step 4 is performed; otherwise, it is regarded as a large file and step 3 is performed.
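The size check in Step 2 amounts to a simple dispatch on the request's file size. A minimal sketch (the 16 MB threshold comes from the text; the function name is illustrative):

```python
SMALL_FILE_THRESHOLD = 16 * 1024 * 1024  # 16 MB, per Step 2

def classify_request(file_size: int) -> str:
    """Route a monitored request by file size: files strictly smaller
    than the threshold are treated as small files (Step 4), all others
    as large files (Step 3)."""
    return "small" if file_size < SMALL_FILE_THRESHOLD else "large"
```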
[0048] Step 3. Determine the status of the large file request.
[0049] The web server WebServer judges the status of the monitored large file read/write request. If it is a large file write request, step 6 is performed; if it is a large file read request, step 9 is performed.
[0050] Step 4. Determine the status of the small file request.
[0051] The web server WebServer judges the status of the monitored small file read/write request. If it is a small file write request, step 5 is performed; if it is a small file read request, step 7 is performed.
[0052] Step 5. Preprocess the write request.
[0053] The small file processing server uses the file merging method to merge the small files to be written, establishes a local index for the small files at the head of the merged file to obtain the merged file, and sends the merged file to the client of the Hadoop distributed file system HDFS.
[0054] The file merging method proceeds as follows:
[0055] First, after receiving a small file write request from the web server WebServer, the small file processing server builds a local index for the small file and continues to add index entries for new small files to the local index;
[0056] Second, the small file processing server determines whether the total size of the local index and the small files exceeds the size of a block: if it does not, small files and their local index entries continue to be added to the current block; otherwise, a new block is allocated and small files and their local index entries are added to the new block;
[0057] Third, the local index is used as the header of the merged file, and the (offset, file length) pair in each index entry points to the position of the corresponding small file in the merged file, yielding the merged file of the small files.
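The three merging sub-steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the block size is the HDFS-era 64 MB default (the text does not give a value), offsets are taken relative to the start of each block's payload, and the index's own size is ignored in the block-capacity check for simplicity.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # illustrative; early-HDFS default block size

def merge_small_files(files):
    """Merge {name: bytes} small files into blocks (Step 5).
    Each block is returned as (index, payload), where index maps a
    file name to its (offset, length) pair inside the payload."""
    blocks, index, payload = [], {}, b""
    for name, data in files.items():
        # Sub-step 2: if the current block would overflow, start a new one.
        if payload and len(payload) + len(data) > BLOCK_SIZE:
            blocks.append((index, payload))
            index, payload = {}, b""
        # Sub-steps 1 and 3: record (offset, length), then append the data.
        index[name] = (len(payload), len(data))
        payload += data
    if index:
        blocks.append((index, payload))
    return blocks
```

In the real system the index would be serialized as the header of the merged file before it is handed to the HDFS client; here the (index, payload) pair stands in for that on-disk layout.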
[0058] Step 6. Process the write request.
[0059] The client of the Hadoop distributed file system HDFS writes the large files or merged files to HDFS, completing the write operation.
[0060] Step 7. Check the cache.
[0061] First, the small file processing server checks whether the read-request file monitored by the web server WebServer is present in the cache. If it is, the small file processing server retrieves the file from the cache and returns it to the client, completing the read operation; otherwise, the second check is performed;
[0062] Second, the small file processing server checks whether metadata of the read-request file is present in the cache. If it is, the small file processing server interacts directly with the HDFS client to retrieve the small file from HDFS and return it to the client, completing the read operation; otherwise, step 8 is performed.
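The two-level lookup of Step 7 can be sketched as below. All names are illustrative: `file_cache` holds fully cached small files, `meta_cache` holds cached metadata, and `fetch_from_hdfs` stands in for the direct HDFS retrieval the text describes.

```python
def handle_read(name, file_cache, meta_cache, fetch_from_hdfs):
    """Two-level cache check (Step 7): first the cached file itself,
    then cached metadata that lets us fetch the file directly from HDFS.
    Returns None when neither level hits (fall through to Step 8)."""
    if name in file_cache:                       # check 1: file is cached
        return file_cache[name]
    if name in meta_cache:                       # check 2: metadata is cached
        return fetch_from_hdfs(meta_cache[name])
    return None                                  # cache miss: go to Step 8
```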
[0063] Step 8. Preprocess the read request.
[0064] Using the mapping between small file names and merged file names, the small file processing server maps the received small file read request to the merged file containing that small file and sends the merged file to the client of the Hadoop distributed file system HDFS.
[0065] Step 9. Process the read request.
[0066] The client of the Hadoop distributed file system HDFS reads the requested large file or merged file from HDFS and obtains the metadata information and local index information of the merged file, completing the read operation.
[0067] Step 10. Separate small files.
[0068] The small file processing server applies the small file separation method to the merged file read from the Hadoop distributed file system HDFS, separates the requested small file from the merged file, and returns it to the user, completing the read operation.
[0069] The small file separation method proceeds as follows:
[0070] First, the small file processing server obtains the local index of the requested file from the metadata information of the merged file; the (offset, file length) pair in the local index points to the position of the small file in the merged file;
[0071] Second, the small file processing server separates the small file from the merged file according to its position in the merged file.
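Given the (offset, length) pair from the local index, the separation in Step 10 reduces to slicing the merged payload. A minimal sketch, reusing the (index, payload) layout assumed in the merging example:

```python
def extract_small_file(merged_payload: bytes, index: dict, name: str) -> bytes:
    """Separate one small file from a merged block (Step 10) using the
    (offset, length) entry recorded for it in the local index."""
    offset, length = index[name]
    return merged_payload[offset:offset + length]
```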
[0072] Step 11. Create a prefetch record.
[0073] First, the small file processing server extracts the file name, DataNode location, data block location, offset, and file length of each small file from the metadata information and local index information of the merged file obtained in step 9, and creates metadata prefetch records for the small files.
[0074] Second, the small file processing server reads from the Hadoop distributed file system HDFS the small files belonging to the same block as the requested file and establishes prefetch records for them.
[0075] Step 12. Update the prefetch record.
[0076] The small file processing server updates the metadata prefetch records and the small file prefetch records using the prefetch record updating method.
[0077] The prefetch record updating method proceeds as follows:
[0078] First, a 32-bit file access identifier recording the file access frequency is added to the header of each metadata prefetch record and each small file prefetch record;
[0079] Second, the initial value of the file access identifier is set to 1, with one minute as the unit of time. If a user accesses the prefetched local index record or the prefetched small file record within that minute, the identifier is incremented by 1; otherwise, it is decremented by 1;
[0080] Third, when the value of the file access identifier reaches 0, the corresponding prefetch information is removed from the cache of the small file processing server.
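The per-minute counter discipline of Step 12 can be sketched with a small record class. The class and method names are illustrative; `tick()` stands in for the once-per-minute update the text describes, and eviction at zero is signaled by its return value.

```python
class PrefetchRecord:
    """Prefetch cache entry carrying the access identifier of Step 12:
    it starts at 1, gains 1 for each minute with at least one access,
    loses 1 for each minute without, and the entry is evicted at 0."""

    def __init__(self, data):
        self.data = data
        self.counter = 1       # initial value per the text
        self.accessed = False  # whether an access occurred this minute

    def touch(self):
        """Record a user access to this prefetched entry."""
        self.accessed = True

    def tick(self):
        """Once-per-minute update; returns False when the entry
        should be removed from the cache."""
        self.counter += 1 if self.accessed else -1
        self.accessed = False
        return self.counter > 0
```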
[0081] The effect of the present invention can be verified through the following simulation experiments:
[0082] 1. Simulation conditions:
[0083] The simulation of the present invention was carried out on a 2.5 GHz Intel(R) Core(TM) i5 CPU under MATLAB R2009b and Windows XP Professional.
[0084] 2. Simulation content and result analysis:
[0085] The method of the present invention for storing small files based on the Hadoop distributed file system was compared with the original Hadoop distributed file system HDFS and the HAR archiving method. The comparison of NameNode memory usage trends is shown in Fig. 2, and the comparison of access efficiency is shown in Fig. 3.
[0086] Fig. 2 compares the NameNode memory usage trend of the present invention with that of the two existing methods. The abscissa represents the number of small files, and the ordinate represents the NameNode memory occupied by small file metadata, in MB. Sets of 2000, 4000, 6000, 8000, and 10000 small files were stored using each of the three methods (the original HDFS system, HAR, and the present invention), the NameNode memory occupied by small file metadata was measured under each method, and three curves of the NameNode memory usage trend were obtained. It can be seen from Fig. 2 that for the original HDFS system and the HAR method, NameNode memory usage grows linearly as the number of files increases, and the HAR method can relieve the storage pressure on the NameNode to a certain extent. However, comparing the ordinates of the three curves at the same number of small files shows that the present invention occupies much less NameNode memory than the two existing methods, so its small file storage efficiency is much higher than that of the original HDFS and HAR methods; moreover, as the number of small files increases, the curve of the present invention grows ever more slowly, making its advantage more prominent.
[0087] Fig. 3 compares the access efficiency of the present invention with that of the existing methods. The abscissa represents the three solutions, and the ordinate represents the average access time for 10000 small files under each solution, in ms. 10000 small files were accessed using each of the three methods (the original HDFS system, HAR, and the present invention), the total time spent accessing the 10000 small files was measured under each method, and the average access time per small file was calculated, yielding the access efficiency comparison chart. It can be seen from Fig. 3, by comparing the ordinates for the three methods, that the average access time of the method of the present invention is greatly reduced compared with the original HDFS and HAR, and its access efficiency is higher.
[0088] The simulation results show that the present invention uses a small file processing server, independent of the original HDFS system, to handle small file merging, mapping, and prefetching separately, which reduces the load on the NameNode, improves the storage and access efficiency of HDFS for small files, and at the same time preserves the generality of the system.


