[0042] The present invention will be further described below in conjunction with the drawings.
[0043] Referring to Figure 1, the specific implementation steps of the present invention are as follows:
[0044] Step 1. Add two services.
[0045] In addition to the Hadoop distributed file system HDFS, a new web server Webserver is added to monitor file read and write requests, and a small file processing server is added to process small files. The system architecture of the present invention thus consists of three parts: the web server Webserver, the small file processing server, and the original HDFS system. The small file processing server mainly performs file merging, file mapping, and file prefetching operations on small files.
[0046] Step 2. Determine whether the file is a small file.
[0047] The web server Webserver judges whether the monitored requested file is smaller than 16 MB. If so, it is regarded as a small file and the process proceeds to step 4; otherwise, it is regarded as a large file and the process proceeds to step 3.
[0048] Step 3. Determine the status of the large file request.
[0049] The web server Webserver judges the status of the monitored large-file read/write request. If it is a large-file write request, perform step 6; if it is a large-file read request, perform step 9.
[0050] Step 4. Determine the status of the small file request.
[0051] The web server Webserver judges the status of the monitored small-file read/write request. If it is a small-file write request, perform step 5; if it is a small-file read request, perform step 7.
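The dispatch logic of steps 2 through 4 can be sketched as follows (a minimal illustration; the function and parameter names are hypothetical, only the 16 MB threshold and the step numbering come from the description):

```python
SMALL_FILE_THRESHOLD = 16 * 1024 * 1024  # 16 MB, per step 2

def dispatch(file_size: int, is_write: bool) -> int:
    """Return the step number that handles this request (steps 2-4)."""
    if file_size < SMALL_FILE_THRESHOLD:
        # small file: write request -> step 5 (merge), read request -> step 7 (check cache)
        return 5 if is_write else 7
    # large file: write request -> step 6, read request -> step 9
    return 6 if is_write else 9
```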
[0052] Step 5. Preprocess the write request.
[0053] The small file processing server adopts the file merging method to merge the small files to be written, establishes a local index for the small files at the head of the merged file to obtain the merged file, and sends the merged file to the client of the Hadoop distributed file system HDFS.
[0054] The file merging method is carried out as follows:
[0055] In the first step, after receiving a small-file write request sent by the web server Webserver, the small file processing server builds a local index for the small file and continues to add new small-file index entries to the local index;
[0056] In the second step, the small file processing server determines whether the total size of the local index and the small files exceeds the size of a block: if not, it continues to add small files and their local index entries to the current block; otherwise, it opens a new block and continues to add small files and their local index entries to the new block;
[0057] In the third step, the local index is used as the header of the merged file; the (offset, length) data pair of each entry in the local index points to the position of the corresponding small file within the merged file, yielding the merged file of the small files.
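The three steps above can be sketched as follows (an illustrative model, not the actual implementation: the block size is an assumed HDFS default, and serializing the local index into the block header is omitted):

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed HDFS block size; not stated in the text

def merge_into_blocks(files, block_size=BLOCK_SIZE):
    """Merge small files into blocks; each block's local index maps
    file name -> (offset, length) within that block's merged body."""
    blocks, body, index, offset = [], [], {}, 0
    for name, data in files.items():
        # second step: open a new block when the current one would overflow
        if body and offset + len(data) > block_size:
            blocks.append((b"".join(body), index))
            body, index, offset = [], {}, 0
        index[name] = (offset, len(data))  # third step: (offset, length) pair
        body.append(data)
        offset += len(data)
    if body:
        blocks.append((b"".join(body), index))
    return blocks
```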
[0058] Step 6. Process the write request.
[0059] The client of the Hadoop distributed file system HDFS writes the large files or merged files to be written into HDFS to complete the write operation.
[0060] Step 7. Check the buffer area.
[0061] In the first step, the small file processing server checks whether the read-requested file monitored by the web server Webserver exists in the cache. If it does, the small file processing server retrieves the requested file from the cache and returns it to the client, completing the read operation; otherwise, perform the second step;
[0062] In the second step, the small file processing server checks whether the metadata information of the read-requested file monitored by the web server Webserver exists in the cache. If it does, the small file processing module interacts directly with the HDFS client to retrieve the small file from HDFS and returns it to the client, completing the read operation; otherwise, go to step 8.
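The two-level cache check of step 7 can be sketched as follows (function and parameter names are illustrative; `fetch_from_hdfs` stands in for the HDFS client interaction of the second step):

```python
def check_cache(name, file_cache, metadata_cache, fetch_from_hdfs):
    """Two-level lookup for step 7: cached file content first, then
    cached metadata (merged-file name, offset, length); returns None on
    a full miss, which falls through to step 8."""
    if name in file_cache:
        return file_cache[name]          # first step: hit on file content
    if name in metadata_cache:           # second step: hit on metadata only
        merged_name, offset, length = metadata_cache[name]
        return fetch_from_hdfs(merged_name, offset, length)
    return None
```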
[0063] Step 8, preprocess the read request.
[0064] According to the file names of the small file and the merged file, the small file processing server maps the received small-file read request to the merged file containing the small file, and sends the merged file request to the client of the Hadoop distributed file system HDFS.
[0065] Step 9. Process the read request.
[0066] The client of the Hadoop distributed file system HDFS reads the received large file or merged file from HDFS and obtains the metadata information and local index information of the merged file to complete the read operation.
[0067] Step 10. Separate small files.
[0068] The small file processing server adopts the small file separation method to read the merged file from the Hadoop distributed file system HDFS, separates the small file requested to be read from the merged file, and returns it to the user to complete the read operation.
[0069] The small file separation method is carried out as follows:
[0070] In the first step, the small file processing server obtains the local index of the requested file from the metadata information of the merged file; the (offset, length) data pair in the local index points to the position of the small file within the merged file;
[0071] In the second step, the small file processing server separates the small file from the merged file according to its position within the merged file.
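The separation step reduces to slicing the merged body at the indexed position, as in this minimal sketch (function name is illustrative):

```python
def separate_small_file(merged_body, local_index, name):
    """Slice the requested small file out of the merged body using the
    (offset, length) pair recorded in the local index."""
    offset, length = local_index[name]
    return merged_body[offset:offset + length]
```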
[0072] Step 11. Create a prefetch record.
[0073] In the first step, the small file processing module extracts the file name, data node location, data block location, offset, and file length of each small file from the metadata information and local index information of the merged file obtained in step 9, and creates metadata prefetch records for the small files.
[0074] In the second step, the small file processing module reads the small files belonging to the same block as the requested file from the Hadoop distributed file system HDFS and establishes prefetch records for these small files.
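The record created in the first step can be sketched as a simple structure holding the fields the text names (field and function names here are illustrative, not from the source):

```python
from dataclasses import dataclass

@dataclass
class MetadataPrefetchRecord:
    """Fields extracted in step 11 (names are illustrative)."""
    file_name: str
    datanode: str      # data node location
    block_id: str      # data block location
    offset: int
    length: int

def build_prefetch_records(local_index, datanode, block_id):
    """Create one metadata prefetch record per small file in the block
    that contains the requested file."""
    return [MetadataPrefetchRecord(name, datanode, block_id, off, ln)
            for name, (off, ln) in local_index.items()]
```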
[0075] Step 12. Update the prefetch record.
[0076] The small file processing server uses the prefetch record updating method to update the metadata prefetch records and the prefetch records of the small files.
[0077] The prefetch record updating method is performed as follows:
[0078] In the first step, a 32-bit file access identifier recording the file access frequency is added to the header of each metadata prefetch record and each small-file prefetch record;
[0079] In the second step, the initial value of the file access identifier is set to 1, with one minute as the unit of time. If a user accesses the prefetched local index record or the prefetched small file record within that minute, the identifier value is increased by 1; otherwise, it is decreased by 1;
[0080] In the third step, when the value of the file access identifier reaches 0, the prefetch information is removed from the cache of the small file processing server.
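The counter behavior of the three steps above can be sketched as follows (class and method names are illustrative; the 32-bit cap on the identifier is an assumption drawn from its stated width):

```python
class PrefetchRecord:
    """Prefetch record with the 32-bit access identifier of steps 1-3:
    initial value 1; +1 for a minute with an access, -1 for an idle
    minute; the record is evicted when the identifier reaches 0."""
    def __init__(self):
        self.access_id = 1

    def tick(self, accessed):
        """Advance one one-minute time unit; return True if the record
        should be removed from the cache."""
        if accessed:
            self.access_id = min(self.access_id + 1, 0xFFFFFFFF)  # 32-bit cap
        else:
            self.access_id -= 1
        return self.access_id == 0
```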
[0081] The effect of the present invention can be verified through the following simulation experiments:
[0082] 1. Simulation conditions:
[0083] The simulation of the present invention is carried out under the hardware environment of a 2.5 GHz Intel(R) Core(TM) i5 CPU and the software environment of MATLAB R2009b on Windows XP Professional.
[0084] 2. Simulation content and result analysis:
[0085] The method of storing small files based on the Hadoop distributed file system of the present invention is compared with the original Hadoop distributed file system HDFS and the HAR archiving method. The comparison of the NameNode memory usage trend is shown in Figure 2, and the comparison of access efficiency is shown in Figure 3.
[0086] Figure 2 compares the NameNode memory usage trend of the present invention with that of the two existing methods. The abscissa represents the number of small files, and the ordinate represents the NameNode memory occupied by small file metadata, in MB. Sets of 2000, 4000, 6000, 8000, and 10000 small files are selected and simulated with each of the three methods, namely the original HDFS system, HAR, and the present invention, and the NameNode memory occupied by small file metadata is calculated under each method, yielding three curves of the NameNode memory usage trend. As can be seen from Figure 2, for the original HDFS system and the HAR method, the memory occupation of the NameNode increases linearly as the number of files grows, and the HAR method can relieve the storage pressure of the NameNode to a certain extent. However, comparing the ordinates of the three curves at the same number of small files shows that the present invention occupies much less NameNode memory than the two existing methods, and its small-file storage efficiency is much higher than that of the original HDFS and the HAR method; moreover, as the number of small files increases, the curve of the present invention grows ever more slowly, making its superiority more prominent.
[0087] Figure 3 compares the access efficiency of the present invention with that of the existing methods. The abscissa represents the three schemes, and the ordinate represents the average access time for 10,000 small files under each scheme, in ms. 10,000 small files are selected and simulated with each of the three methods, namely the original HDFS system, HAR, and the present invention; the total time spent accessing the 10,000 small files is measured under each method, and the average access time per small file is calculated, yielding the access efficiency comparison chart. As can be seen from Figure 3, comparing the access times of the 10,000 small files under the three methods, that is, their ordinates, shows that compared with the original HDFS and HAR, the average access time of the method of the present invention is greatly reduced and its access efficiency is higher.
[0088] The simulation results show that the present invention uses a small file processing server, independent of the original HDFS system, to separately handle small-file merging, mapping, and prefetching, which reduces the load of the NameNode, improves the storage and access efficiency of HDFS for small files, and at the same time preserves the versatility of the system.