An implementation method for full-text retrieval scenarios of massive data

An implementation method for full-text retrieval over massive data. It addresses problems such as Lucene's lack of built-in clustering support, ensures the continuity and correctness of retrieval, improves retrieval efficiency, and balances load across nodes.

Active Publication Date: 2019-11-08
BEIJING SCISTOR TECH +1

AI Technical Summary

Problems solved by technology

First, Lucene has no built-in clustering support: it is designed as an embedded toolkit, and its core code provides no support for clustering.



Embodiment Construction

[0023] In order to make the purpose, technical solution and advantages of the present invention clearer, the technical solution of the present invention will be further described in detail below in conjunction with the accompanying drawings and embodiments.

[0024] When existing massive data is retrieved, RCFile is used as the default data storage file format. RCFile is a column-oriented data format introduced in Hive. It is a hybrid row-column storage format designed for fast data loading and good adaptability to dynamic workloads. It follows a "partition horizontally first, then vertically" design: rows are first split into row groups, and each row group is then stored column by column. When a query does not need certain columns, RCFile skips them at I/O time. Note, however, that RCFile does not jump directly past the unneeded columns to the columns that must be read; it finds them by scanning the header of each row group. The header at the whole-block level does not define which row group each colu...
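The row-group layout described above can be illustrated with a toy model. This is a minimal sketch, not the real RCFile binary format: the header layout (a column count followed by per-column byte lengths) and the names `write_table`, `write_row_group`, and `read_columns` are assumptions made for illustration. It shows why the reader must visit every row group header even when it only wants one column, and how it still avoids reading the bytes of the other columns.

```python
import io
import struct


def write_row_group(out, columns):
    """Write one row group: a header of per-column byte lengths, then the column chunks."""
    chunks = ["\n".join(col).encode("utf-8") for col in columns]
    out.write(struct.pack(">I", len(chunks)))      # header: number of columns
    for chunk in chunks:
        out.write(struct.pack(">I", len(chunk)))   # header: byte length of each column
    for chunk in chunks:
        out.write(chunk)                           # body: one contiguous chunk per column


def write_table(rows, rows_per_group):
    """Partition rows horizontally into row groups, then store each group column by column."""
    out = io.BytesIO()
    for start in range(0, len(rows), rows_per_group):
        group = rows[start:start + rows_per_group]
        write_row_group(out, list(zip(*group)))    # zip(*group) turns rows into columns
    return out.getvalue()


def read_columns(data, wanted):
    """Scan every row group header; read wanted columns and seek past the others."""
    buf = io.BytesIO(data)
    result = {i: [] for i in wanted}
    bytes_read = 0
    while True:
        raw = buf.read(4)
        if not raw:                                # end of file
            break
        (ncols,) = struct.unpack(">I", raw)
        lengths = struct.unpack(f">{ncols}I", buf.read(4 * ncols))
        for i, length in enumerate(lengths):
            if i in result:
                chunk = buf.read(length)
                bytes_read += length
                result[i].extend(chunk.decode("utf-8").split("\n"))
            else:
                buf.seek(length, io.SEEK_CUR)      # skip the column without reading it
    return result, bytes_read
```

In this toy model the cost of a single-column query still grows with the number of row groups, because each header has to be read before the reader knows where the wanted column starts. That per-row-group scan is the overhead the paragraph above points out.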


Abstract

The invention provides an implementation method for full-text retrieval scenarios over massive data and belongs to the field of massive-data full-text retrieval. In the method, Lucene is introduced into the retrieval engine, and Lucene index files are built for the data files already stored in the cluster; a daemon runs on each node that stores Lucene index files and maintains the index files on that node. The method optimizes how the coordinator schedules fragments: it keeps a count for each execution node and adjusts where each fragment executes so that node resources are balanced. It also optimizes the full-text reading engine: retrieval uses Lucene first and falls back to RCFile retrieval if a problem occurs, which guarantees the continuity and correctness of retrieval. As a result, retrieval performance improves and cluster resources are used more reasonably.
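The two mechanisms named in the abstract can be sketched as follows. This is only one plausible reading, not the patented implementation: the abstract does not say how the per-node count is used, so the sketch assumes the coordinator sends each fragment to the least-loaded node that holds its index. All names here (`assign_fragments`, `search_fragment`, `IndexUnavailable`, and the two search callables) are hypothetical.

```python
class IndexUnavailable(Exception):
    """Hypothetical error raised when a fragment's Lucene index cannot be used."""


def assign_fragments(fragments, nodes):
    """Assign each fragment to the candidate node with the fewest fragments so far.

    fragments: list of (fragment_id, nodes_holding_its_index) pairs.
    nodes: all execution nodes known to the coordinator.
    """
    load = {node: 0 for node in nodes}             # per-node fragment count
    plan = {}
    for fragment_id, local_nodes in fragments:
        # Prefer nodes that hold the index; otherwise consider every node.
        candidates = [n for n in local_nodes if n in load] or list(nodes)
        # Ties are broken by node name so the plan is deterministic.
        target = min(candidates, key=lambda n: (load[n], n))
        plan[fragment_id] = target
        load[target] += 1
    return plan, load


def search_fragment(fragment_id, term, lucene_search, rcfile_scan):
    """Try the Lucene index first; fall back to an RCFile scan if the index fails."""
    try:
        return lucene_search(fragment_id, term), "lucene"
    except IndexUnavailable:
        # The scan is slower, but the query still returns a correct result.
        return rcfile_scan(fragment_id, term), "rcfile"
```

Returning the name of the path that served the query alongside the hits makes the fallback visible to the caller, which can help when monitoring how often the index is unavailable.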

Description

Technical field

[0001] The invention relates to the field of full-text retrieval of massive data, and in particular to an implementation method for full-text retrieval scenarios over massive data.

Background art

[0002] In today's era of information explosion, every organization and individual contributes to the rapid growth of information. The types of information keep expanding as well, and more and more unstructured information keeps appearing, including reports, bills, electronic documents, website elements, pictures, faxes, scanned images, and large amounts of multimedia audio and video. 85% of all stored data is in unstructured formats, and unstructured information doubles every three months. Because information formats differ so widely, it is basically impossible to integrate them behind a unified interface for easy use.

[0003] The full-text retrieval technology is a retrieval technology that u...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/31, G06F16/33
CPC: G06F16/313, G06F16/334
Inventor: 王宇, 徐晓燕, 周渊, 吴小伟, 刘庆良, 王振宇, 郑彩娟, 李斌斌, 黄成, 周游
Owner: BEIJING SCISTOR TECH