A file retrieval system based on HDFS

A file retrieval technology applied to file systems, file metadata retrieval, file access structures, etc. It addresses the problems that a stand-alone file retrieval system cannot process massive data and that index creation is time-consuming. Its effects are a reduced query load and good horizontal scalability and stability.

Active Publication Date: 2019-04-26
NORTHEASTERN UNIV LIAONING

AI Technical Summary

Problems solved by technology

When the number of index files grows beyond a certain amount, Lucene's internal mechanism loads a large amount of data into memory and discards it after the query completes.
With so much data occupying memory, the Java Virtual Machine (JVM) has to reclaim memory frequently (garbage collection), which creates a serious bottleneck in query performance.
Moreover, traditional file retrieval systems are stand-alone systems.
With the advent of the big data era, a stand-alone file retrieval system cannot handle massive amounts of data: its index creation takes a long time and its query efficiency is low.



Examples


Embodiment Construction

[0020] An embodiment of the present invention will be described in detail below in conjunction with the accompanying drawings.

[0021] As shown in Figure 1, the HDFS-based file retrieval system in this embodiment is deployed on 4 PCs (Intel(R) Core(TM) i7-4790 @ 3.60 GHz, 8G, 1T), named PC1, PC2, PC3 and PC4, which are interconnected by a 100M network.

[0022] The system includes: an administrator-oriented system configuration module, file management module and index management module; a user-oriented retrieval portal module; a MongoDB database; and a background storage and computing cluster. The background storage and computing cluster includes an HDFS cluster, a Spark cluster and an ElasticSearch cluster: the HDFS (Hadoop Distributed File System) cluster is a distributed file storage cluster, the Spark cluster is an index computing cluster, and the ElasticSearch cluster is an index storage cluster. All three clusters adopt a master-slave architecture, that is, one master node and two slave nodes. The node...
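
The cluster layout described in [0022] can be written down as a small data model. The following Python sketch is only an illustration under stated assumptions: the source text breaks off before saying which PC hosts which node, so the host placement below is hypothetical, and the names `Cluster`, `validate` and `roles_on` are introduced here for illustration and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    name: str      # cluster name as used in the description
    role: str      # role the description assigns to the cluster
    master: str    # host running the master node
    slaves: tuple  # hosts running the slave nodes


# The four machines named in the embodiment.
HOSTS = ("PC1", "PC2", "PC3", "PC4")

# ASSUMPTION: this placement is hypothetical. The description only states
# that each cluster has one master node and two slave nodes.
CLUSTERS = (
    Cluster("HDFS", "distributed file storage", "PC1", ("PC2", "PC3")),
    Cluster("Spark", "index computing", "PC1", ("PC2", "PC3")),
    Cluster("ElasticSearch", "index storage", "PC1", ("PC2", "PC3")),
)


def validate(cluster, hosts=HOSTS):
    """Check the one-master, two-slave shape on known hosts."""
    nodes = (cluster.master, *cluster.slaves)
    if len(cluster.slaves) != 2:
        raise ValueError(f"{cluster.name}: expected exactly two slave nodes")
    if len(set(nodes)) != len(nodes):
        raise ValueError(f"{cluster.name}: a host appears more than once")
    unknown = set(nodes) - set(hosts)
    if unknown:
        raise ValueError(f"{cluster.name}: unknown hosts {sorted(unknown)}")


def roles_on(host, clusters=CLUSTERS):
    """List the cluster roles that a given host carries."""
    roles = []
    for cluster in clusters:
        if host == cluster.master:
            roles.append(f"{cluster.name} master")
        elif host in cluster.slaves:
            roles.append(f"{cluster.name} slave")
    return roles
```

Calling `validate` on each entry of `CLUSTERS` checks the master-slave shape, and `roles_on("PC1")` lists what a single machine would run under this assumed placement.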


Abstract

An HDFS-based file retrieval system includes a system configuration module, a file management module, an index management module, a retrieval portal module, a MongoDB database, an HDFS cluster, a Spark cluster, and an ElasticSearch cluster. The file management module stores files in the HDFS cluster; the index management module creates an index through the Spark cluster and stores it in the ElasticSearch cluster; the retrieval portal module sends the retrieval conditions to the ElasticSearch cluster for index matching to obtain the retrieval results; and the MongoDB database stores the records produced during file retrieval. The HDFS, Spark and ElasticSearch clusters of the present invention are all distributed, which reduces query load and improves query efficiency. The master-slave architecture provides horizontal scalability and stability, which makes it convenient to increase the overall processing capacity of the cluster and keeps the system in a stable working state. A replica redundancy strategy is adopted to ensure the reliability and integrity of the index.
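
The data flow in the abstract (store files, build an index, match retrieval conditions against the index, record what happened) can be sketched in plain Python. This is a minimal in-memory model and not the patented implementation: the dictionary, the list of shards and the record list are stand-ins for the HDFS cluster, the ElasticSearch index and the MongoDB database, the loop in `build_index` stands in for a Spark job, and the class name `FileRetrievalSketch`, the round-robin shard routing and the AND-query semantics are all assumptions made for illustration. Replica redundancy is left out.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


class FileRetrievalSketch:
    def __init__(self, num_shards=2):
        self.file_store = {}  # stand-in for HDFS: path -> file text
        # Stand-in for index shards: each maps term -> set of paths.
        self.shards = [defaultdict(set) for _ in range(num_shards)]
        self.records = []     # stand-in for the MongoDB record store

    def upload(self, path, text):
        """File management: store the file and record the event."""
        self.file_store[path] = text
        self.records.append({"event": "upload", "path": path})

    def build_index(self):
        """Index management: rebuild an inverted index across shards."""
        for shard in self.shards:
            shard.clear()
        # Assumed routing: files are assigned to shards round-robin.
        for i, (path, text) in enumerate(sorted(self.file_store.items())):
            shard = self.shards[i % len(self.shards)]
            for term in set(tokenize(text)):
                shard[term].add(path)
        self.records.append({"event": "index", "files": len(self.file_store)})

    def search(self, query):
        """Retrieval portal: return files containing every query term."""
        terms = tokenize(query)
        hits = set()
        if terms:
            for shard in self.shards:
                # Each file lives in one shard, so the AND can be
                # evaluated per shard and the results merged.
                postings = [shard.get(term, set()) for term in terms]
                hits |= set.intersection(*postings)
        result = sorted(hits)
        self.records.append(
            {"event": "search", "query": query, "hits": len(result)}
        )
        return result
```

A typical session would call `upload` for each file, then `build_index`, then `search` with the retrieval conditions; every step appends an entry to `records`, mirroring the role the abstract gives to the MongoDB database.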

Description

Technical field

[0001] The invention belongs to the field of distributed search engines, and in particular relates to an HDFS-based file retrieval system.

Background technique

[0002] Traditional full-text retrieval systems are implemented on top of Lucene, which supports creating, optimizing and querying file indexes. However, when the number of index files grows beyond a certain amount, Lucene's internal mechanism loads a large amount of data into memory and discards it after the query completes. With so much data occupying memory, the Java Virtual Machine (JVM) has to reclaim memory frequently (garbage collection), which creates a serious bottleneck in query performance. In addition, traditional file retrieval systems are stand-alone systems. With the advent of the big data era, a stand-alone file retrieval system cannot handle massive amounts of data: its index creation takes a long time and its query efficiency is low. [00...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/13, G06F16/14
CPC: G06F16/134, G06F16/14
Inventor: 陈东明, 胡阳, 黄新宇
Owner: NORTHEASTERN UNIV LIAONING