Spark SQL-based distributed full text retrieval system and method

A distributed retrieval technology, applied in fields such as special data processing applications, instruments, and electrical digital data processing, that solves the problem of existing tools not supporting full-text retrieval over massive data and achieves a small index storage footprint.

Active Publication Date: 2017-09-01
INST OF SOFTWARE - CHINESE ACAD OF SCI
Cites: 3; Cited by: 44

AI Technical Summary

Problems solved by technology

[0006] The present invention addresses the problem that existing data analysis tools do not support full-text retrieval over massive data. It provides a distributed full-text retrieval system and method based on Spark SQL that enhances the data analysis capability of Spark SQL and can effectively meet the needs of both traditional business migration and existing businesses for full-text retrieval of massive data.




Embodiment Construction

[0062] The present invention will be described in more detail below in conjunction with specific embodiments and accompanying drawings.

[0063] As shown in Figure 1, the present invention designs and implements a relational-data-oriented distributed full-text retrieval system based on Spark SQL. The system comprises four parts: an SQL translation layer, a data source management layer, a parallel computing layer, and a distributed storage layer. In the SQL translation layer, an SQL-based full-text retrieval grammar and the translation process of SQL statements are proposed. In the data source management module, a parallelization method for the full-text retrieval process is designed. In the retrieval optimization module, two storage models for the index building phase are designed, together with corresponding original-table data restoration strategies, namely the full storage model and the index-specified column storage model, and a storage model for the original...
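The difference between the two storage models named above can be illustrated with a toy inverted index. This is a minimal sketch, not the patented implementation: the table contents, the function names `build_index` and `search`, and the whitespace tokenizer are all assumptions made for illustration, and everything runs in a single process rather than on Spark.

```python
from collections import defaultdict

# Hypothetical original table: row id -> column values.
TABLE = {
    0: {"title": "spark sql basics", "body": "query data with sql", "year": 2015},
    1: {"title": "full text search", "body": "inverted index on spark", "year": 2016},
    2: {"title": "storage models", "body": "index size versus query time", "year": 2017},
}


def build_index(table, indexed_cols, full_storage):
    """Build a toy inverted index: term -> row ids, plus the fields stored per row."""
    postings = defaultdict(set)
    stored = {}
    for row_id, row in table.items():
        for col in indexed_cols:
            for term in row[col].split():  # naive whitespace tokenizer (assumption)
                postings[term].add(row_id)
        # Full storage keeps every column in the index;
        # the index-specified column model keeps only the indexed columns.
        stored[row_id] = dict(row) if full_storage else {c: row[c] for c in indexed_cols}
    return postings, stored


def search(term, postings, stored, table, full_storage):
    """Return complete rows matching `term`, restoring from the table when needed."""
    hits = sorted(postings.get(term, ()))
    if full_storage:
        # Every column is already in the index; no access to the original table.
        return [stored[i] for i in hits]
    # Columns missing from the index are restored from the original table.
    return [{**table[i], **stored[i]} for i in hits]
```

In this sketch the full storage model answers a query from the index alone at the cost of a larger index, while the index-specified column model stores less but must go back to the original table to restore the remaining columns.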



Abstract

The invention relates to a Spark SQL-based distributed full-text retrieval system and method. The system comprises an SQL translation layer, a data source management layer, a parallel computing layer, and a distributed storage layer. An SQL-based full-text retrieval grammar is proposed, together with the translation process of full-text retrieval SQL statements among the modules of the SQL translation layer. A parallelization method for the full-text retrieval process is designed in the data source management module. In the retrieval optimization module, two index storage models and the corresponding strategies for restoring original table data at query time are designed; for the index-specified column storage model, a partition-aligned join algorithm with O(n) complexity is designed to restore the original table data at query time. Under the two storage models, the index construction time is shortened to 0.6% / 0.5% of that of the traditional database, the query time is shortened to 1% / 10% of that of the traditional database, and the index storage size is reduced to 55.0% of that of the traditional database. The method strengthens the data analysis function of Spark SQL and can satisfy the requirements of traditional business migration and of full-text retrieval over massive data in existing businesses.
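How a join between index hits and the original table can run in linear time may be sketched as a two-pointer merge over aligned partitions. The abstract gives only the algorithm's name and its O(n) complexity, so the details below are assumptions: partition k of the hits and partition k of the table are taken to cover the same row-id range, both are taken to be sorted by row id, and the function name `aligned_join` is hypothetical.

```python
def aligned_join(hit_partitions, table_partitions):
    """Join index hits with original rows, partition by partition, in one pass.

    Assumes (not stated in the source) that partition k of the hits and
    partition k of the table cover the same row-id range, and that both
    are sorted by row id.
    """
    result = []
    for hits, rows in zip(hit_partitions, table_partitions):
        i = j = 0
        while i < len(hits) and j < len(rows):
            row_id, row = rows[j]
            if hits[i] == row_id:
                result.append(row)  # matching row: restore it
                i += 1
                j += 1
            elif hits[i] < row_id:
                i += 1  # hit with no matching row; skip it
            else:
                j += 1  # row that did not match the query; skip it
    return result
```

Each element of either input is visited at most once, so the work is linear in the total number of hits and rows, and because partitions with the same number are paired directly, no data has to be moved between partitions in this sketch.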

Description

Technical field

[0001] The present invention relates to data analysis and information retrieval technology for massive data, and more specifically to a distributed full-text retrieval system and method based on Spark SQL. It belongs to the field of software technology.

Background technique

[0002] With the development of technologies such as cloud computing and the Internet of Things, and with the emergence of blogs, social networks, and application models based on location-based services (LBS) (see literature: Meng Xiaofeng, Cixiang. Big Data Management: Concepts, Technologies and Challenges [J]. Computer Research and Development, 2013, (01): 146-169.), the types and scale of data are growing at an unprecedented rate, and the value contained in big data has become the driving force for people to store and process big data (see literature: Cheng Xueqi, Jin Xiaolong, Wang Yuanzhuo, Guo Jiafeng, Zhang Tieying, Li Guojie. A review of big data syste...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
CPC: G06F16/3332, G06F16/334
Inventor: 许利杰, 崔光范, 刘杰, 马志柔, 吴怀林, 叶丹
Owner INST OF SOFTWARE - CHINESE ACAD OF SCI