Storage optimization method for Hadoop distributed file system

A storage optimization technology for distributed file systems, classified under file systems, file system types, and file/folder operations. It addresses the problem that duplicate data makes up only a small proportion of stored data, while a large proportion is low-value data that existing methods cannot screen out and delete.

Active Publication Date: 2021-09-10
XIAN UNIV OF TECH

AI Technical Summary

Problems solved by technology

The existing method mainly identifies and deletes duplicate data in the file system. In a real distributed file system, however, the proportion of duplicate data is not large; most of the stored data is low-value data that has been used only once or a few times. The existing method cannot effectively identify and delete this large volume of low-reuse data in distributed storage systems, especially HDFS.




Embodiment Construction

[0035] The present invention will be described in detail below in conjunction with the accompanying drawings and specific embodiments.

[0036] The present invention is a storage optimization method for the Hadoop distributed file system. As shown in Figure 1, the specific steps are as follows:

[0037] Step 1, extract the file operation records, specifically:

[0038] Step 1.1: Select the INFO-level log file, which contains the specific execution timestamps and file name information;

[0039] HDFS stores a large amount of log content that records the various operations on the distributed file system. The logs are mainly divided into three levels, WARN, INFO, and DEBUG, with the level of detail increasing in that order. DEBUG-level logs sit at the bottom layer; their content is the most direct and detailed, but the data volume is large. WARN-level logs sit at the top layer, and only key information and information that may cause erro...
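
As an illustration of Step 1, the following Python sketch keeps only INFO-level lines, picks out access and deletion records by keyword, and then sorts and numbers them by timestamp. The line layout is an assumption modelled on the HDFS audit log (`cmd=... src=...` fields); the source does not state the exact format or keywords, and the names `LINE_RE`, `KEYWORDS`, and `extract_records` are hypothetical.

```python
import re
from datetime import datetime

# Assumed line layout (modelled on HDFS audit logs); not taken from the patent text.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+) .*?\bcmd=(?P<cmd>\S+)\s+src=(?P<src>\S+)"
)

# Hypothetical keyword mapping: which commands count as access or deletion.
KEYWORDS = {"open": "access", "delete": "delete"}


def extract_records(lines):
    """Return (number, timestamp, kind, path) tuples ordered by timestamp."""
    records = []
    for line in lines:
        m = LINE_RE.match(line)
        if m is None or m["level"] != "INFO":
            continue  # skip malformed lines and other log levels
        kind = KEYWORDS.get(m["cmd"])
        if kind is None:
            continue  # not an access or deletion record
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S,%f")
        records.append((ts, kind, m["src"]))
    records.sort(key=lambda r: r[0])  # order by timestamp
    # Number the records starting from 1, in timestamp order.
    return [(i, ts, kind, src) for i, (ts, kind, src) in enumerate(records, start=1)]
```

Calling `extract_records(open(path))` on a log file would yield one numbered tuple per access or deletion event; a real deployment would need to adapt the regular expression to the actual log configuration.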


Abstract

The invention discloses a storage optimization method for the Hadoop distributed file system. The method comprises the following steps: first, select the INFO-level log file, which contains the specific execution timestamps and file name information, and obtain the file access records and deletion records from it; extract all entries in the INFO-level log that contain the keywords, then sort and number them by timestamp; next, determine the feature tags, select features, and construct feature vectors to form the sample set for training a file elimination model; take three feature values of the feature vector in sequence as the three classification nodes of a decision tree, build the decision tree with the ID3 algorithm, and use it as the file elimination model; finally, predict the reusability of files with the established model. The method reduces the scale of stored data and improves the storage efficiency of HDFS.
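
The decision-tree step can be illustrated with a minimal ID3 sketch in Python. It is a generic textbook ID3 over categorical features that chooses each split by information gain, not the patent's exact procedure. The three feature names and the `keep`/`discard` labels are hypothetical, since the visible text does not name the features.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting the samples on one feature."""
    total = len(rows)
    remainder = 0.0
    for value in set(row[feature] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def build_id3(rows, labels, features):
    """Recursively build a tree; leaves are labels, inner nodes are dicts."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    remaining = [f for f in features if f != best]
    node = {"feature": best, "children": {}, "default": majority}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["children"][value] = build_id3(
            [rows[i] for i in idx], [labels[i] for i in idx], remaining
        )
    return node


def predict(tree, row):
    """Walk the tree; unseen feature values fall back to the node's majority label."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["feature"]], tree["default"])
    return tree


# Toy sample set with three hypothetical features per file.
rows = [
    {"access_count": "low", "size_class": "large", "file_type": "tmp"},
    {"access_count": "low", "size_class": "small", "file_type": "log"},
    {"access_count": "high", "size_class": "large", "file_type": "csv"},
    {"access_count": "high", "size_class": "small", "file_type": "tmp"},
]
labels = ["discard", "discard", "keep", "keep"]
tree = build_id3(rows, labels, ["access_count", "size_class", "file_type"])
```

In this toy data `access_count` separates the two classes perfectly, so ID3 places it at the root. A file predicted as `discard` would be a candidate for elimination in the sense of the abstract.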

Description

Technical field

[0001] The invention belongs to the technical field of data storage, and in particular relates to a storage optimization method for the Hadoop distributed file system.

Background technique

[0002] With the increasing use of big data computing engines (such as Apache Hadoop and Apache Spark), a large amount of new data needs to be stored in the Hadoop distributed file system (HDFS), which puts considerable pressure on HDFS storage. The traditional approach keeps expanding HDFS capacity through additional hardware investment in order to store the massively growing data. This is costly, and most of the data stored in HDFS has low utilization value and a low probability of being used or accessed by other devices, which wastes a lot of hardware resources and software costs.

[0003] In the era of cloud computing, the problem of storage optimization for large-scale distributed file systems is getting more and more attention. For example, Kirsten ...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/18, G06F16/182, G06F16/172, G06F16/16
CPC: G06F16/182, G06F16/1815, G06F16/172, G06F16/16
Inventor: 王周恺, 贾乔, 马维纲, 王怀军, 曹霆, 李宇昕, 王侃
Owner: XIAN UNIV OF TECH