Method based on Hadoop small file optimization and reverse index establishment

An inverted index and small-file technology, applied in the field of system processing, address the problem of low retrieval efficiency, improving retrieval speed and efficiency and optimizing small-file processing performance.

Status: Inactive
Publication Date: 2014-03-26
SOUTHEAST UNIV
Cites: 3; Cited by: 33

AI Technical Summary

Problems solved by technology

At present, there are two methods for retrieving unstructured data. The first is sequential scanning: each document is scanned from beginning to end, and if the required information is found, the document is recorded. The next document is then scanned in the same way until all documents have been scanned, and the recorded documents are returned. This method is inefficient.
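The sequential scanning method described above can be pictured with a minimal Python sketch. The function name `sequential_scan`, the dictionary of documents, and whitespace tokenization are illustrative assumptions, not part of the patent text.

```python
def sequential_scan(documents, term):
    """Return the ids of documents that contain `term`.

    Hypothetical helper: `documents` maps a document id to its text.
    Every query reads every document from beginning to end.
    """
    matches = []
    for doc_id, text in documents.items():
        # Scan the whole document; record it if the term appears.
        if term in text.split():
            matches.append(doc_id)
    return matches


docs = {
    "a.txt": "hadoop stores large files",
    "b.txt": "small files hurt hadoop",
    "c.txt": "inverted index",
}
hits = sequential_scan(docs, "hadoop")
```

Because each query touches the full text of every document, the cost grows with the total corpus size, which is the inefficiency the passage points to.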



Examples


Embodiment Construction

[0037] The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings, so that those skilled in the art can more easily understand its advantages and features, and so that the scope of protection of the invention is defined more clearly.

[0038] Referring to Figures 1-5, the embodiment of the present invention includes:

[0039] A method based on small file optimization and inverted index in Hadoop. The method can upload a large number of small files to the HDFS distributed file system and build an inverted index over the files stored there. It comprises two processes, small file optimization and inverted index establishment, as follows:

[0040] 1) Small file optimization adds a separate small-file processing module on top of HDFS. The specific optimization scheme is as follows:

[0041] 1.1) First, the user uploads a file to the cloud storage pla...
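The remaining steps of the scheme are truncated here, so the following is only a rough sketch of what a separate small-file processing module might look like: uploads are buffered in a queue, the queued size is checked periodically, and the files are packed together once enough data has accumulated. The class `SmallFileQueue`, the block-size threshold, and the list of `(filename, content)` records standing in for a merged container file are all assumptions made for illustration. No Hadoop API is called.

```python
class SmallFileQueue:
    """Hypothetical in-memory stand-in for a small-file processing module."""

    def __init__(self, block_size):
        # Assumed merge trigger: total queued bytes reaching the block size.
        self.block_size = block_size
        self.pending = []  # list of (filename, content) pairs

    def add(self, filename, content):
        """Queue an uploaded small file instead of writing it out directly."""
        self.pending.append((filename, content))

    def queued_bytes(self):
        """Total size in bytes of the files currently waiting in the queue."""
        return sum(len(content) for _, content in self.pending)

    def flush_if_ready(self):
        """Intended to be called periodically.

        Returns the queued files as one list of key/value records (filename
        as key, content as value) once the threshold is reached, else None.
        """
        if self.queued_bytes() < self.block_size:
            return None
        merged, self.pending = self.pending, []
        return merged
```

In a real deployment the merged records would be written to a single container file on HDFS rather than returned as a list; that write step is deliberately left out of this sketch.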


Abstract

The invention discloses a method based on Hadoop small file optimization and inverted index establishment. A large number of small files can be uploaded to HDFS, and an inverted index can be established over the files in HDFS. The method comprises small file optimization and inverted index establishment, and mainly comprises the following steps: (1) a user uploads a large number of files that are small relative to the HDFS block size to small-file queues in Hadoop; (2) the size of the small files in the queues is calculated at regular intervals; (3) the files in the small-file queues that meet the requirements are merged using the SequenceFile method and then uploaded to HDFS; (4) an inverted index is established over the files in HDFS. The method overcomes Hadoop's poor handling of small files: it optimizes small-file processing performance, frees memory, and improves retrieval speed and efficiency.
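Step (4) relies on an inverted index, which maps each term to the documents that contain it. The sketch below shows the general data structure in plain Python; the function names, whitespace tokenization, and lower-casing are illustrative assumptions and are not taken from the patent, which builds its index over files stored in HDFS.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the set of document ids in which it occurs.

    Hypothetical helper: `documents` maps a document id to its text.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def lookup(index, term):
    """Return the sorted ids of documents containing `term`."""
    return sorted(index.get(term.lower(), set()))


index = build_inverted_index({
    "a.txt": "Hadoop stores large files",
    "b.txt": "small files hurt Hadoop",
})
result = lookup(index, "files")
```

Once the index is built, a query is a single dictionary lookup rather than a pass over every document, which is where the retrieval speed-up comes from.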

Description

Technical field

[0001] The invention relates to the field of system processing, in particular to a method based on small file optimization and inverted index in Hadoop.

Background technique

[0002] A file whose size is smaller than the block size on HDFS (the Hadoop Distributed File System) is called a small file in Hadoop. A large number of small files seriously affects the scalability and performance of Hadoop.

[0003] (1) In HDFS, the information for every file and every file block is stored as an object in the memory of the NameNode (master node). Each object occupies about 150 bytes, so with a large number of small files the NameNode's memory usage severely limits cluster scaling.

[0004] (2) HDFS accesses a large number of small files much more slowly than large files of the same total size. HDFS is primarily designed for streaming access to large files. Reading small files usually results in a large numb...
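The figure of about 150 bytes per object in [0003] can be turned into a rough estimate of NameNode memory use. The sketch below assumes one object per file plus one object per block and ignores directories and any other overhead; the function name and the sample file count are hypothetical.

```python
def estimate_namenode_bytes(num_files, blocks_per_file=1, bytes_per_object=150):
    """Rough NameNode memory estimate.

    Assumes one object per file and one per block, at roughly
    150 bytes each as stated in the background section.
    """
    objects = num_files * (1 + blocks_per_file)
    return objects * bytes_per_object


# Ten million small files, each occupying a single block.
small_files_bytes = estimate_namenode_bytes(10_000_000)
```

Under these assumptions, ten million single-block files need about 3 GB of NameNode memory regardless of how little data they hold, which is why merging them into fewer, larger files frees memory.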


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F3/06
CPC: G06F16/134, G06F16/182
Inventors: 吴含前, 姚莉, 马风新, 李露
Owner: SOUTHEAST UNIV