Spark-based large-scale distributed DataFrame query method

A query method for large-scale distributed DataFrames, applied in the field of distributed computing. It addresses the lack of flexible, easy-to-use query functions in distributed DataFrames and achieves good scalability, good ease of use, and improved query performance.

Active Publication Date: 2019-07-23
NANJING UNIV

AI Technical Summary

Problems solved by technology

[0007] Purpose of the invention: Pandas DataFrame cannot handle large-scale data, and Spark's existing distributed DataFrame programming model lacks flexible, easy-to-use query functions. To solve these problems, the present invention provides a Spark-based query method for large-scale distributed DataFrames. The method can efficiently query large-scale distributed DataFrames, supporting both position-based (location-based) and label-based (tag-based) queries, and provides a Pandas-like DataFrame interface. This addresses the lack of flexible, easy-to-use query functions for distributed DataFrames on existing big data processing platforms and makes Spark DataFrame richer and more powerful.
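The difference between the two query kinds named in [0007] can be illustrated with a toy, single-machine sketch. It is not the patent's implementation: the class `MiniFrame` and its `iloc`/`loc` accessors are hypothetical names chosen to resemble the Pandas style, and the rows are plain Python dictionaries rather than Spark data.

```python
# Illustrative only: a toy, single-machine stand-in for the two query kinds.
# MiniFrame and its accessors are hypothetical names, not the patent's API.

class _PositionAccessor:
    """Position-based access: rows are addressed by integer position."""

    def __init__(self, frame):
        self._frame = frame

    def __getitem__(self, key):
        # `key` may be an int (one row) or a slice (a list of rows).
        return self._frame._rows[key]


class _LabelAccessor:
    """Label-based access: rows are addressed by their index label."""

    def __init__(self, frame):
        self._frame = frame

    def __getitem__(self, label):
        # Return every row whose index label equals `label`.
        return [
            row
            for lab, row in zip(self._frame._labels, self._frame._rows)
            if lab == label
        ]


class MiniFrame:
    def __init__(self, labels, rows):
        if len(labels) != len(rows):
            raise ValueError("labels and rows must have the same length")
        self._labels = list(labels)
        self._rows = list(rows)
        self.iloc = _PositionAccessor(self)  # position-based queries
        self.loc = _LabelAccessor(self)      # label-based queries


frame = MiniFrame(
    labels=["a", "b", "c", "b"],
    rows=[{"x": 1}, {"x": 2}, {"x": 3}, {"x": 4}],
)
second_row = frame.iloc[1]   # the row at position 1
first_two = frame.iloc[0:2]  # the rows at positions 0 and 1
rows_b = frame.loc["b"]      # all rows labelled "b"
```

In a distributed setting the same two access patterns would have to be resolved across partitions, which is presumably where the indexes described in the abstract come in.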



Examples


Embodiment Construction

[0024] The present invention is further illustrated below with reference to the accompanying drawings and specific embodiments. It should be understood that these embodiments only illustrate the present invention and are not intended to limit its scope. After reading the present invention, modifications in various equivalent forms made by those skilled in the art all fall within the scope defined by the appended claims of this application.

[0025] The technical scheme of the present invention mainly uses the distributed big data processing system Spark for distributed computing, and the distributed in-memory database Redis and the shared-memory object store Plasma Store for storage. Spark is an open-source system of the Apache Foundation (project homepage http://spark.apache.org), and this software does not belong to the content of the present inventi...



Abstract

The invention discloses a Spark-based large-scale distributed DataFrame query method, which comprises the following steps: using a system framework based on the distributed computing execution engine Spark, with DataFrame as the programming model and Python as the programming language; in the distributed system, encapsulating the existing query interface of the Spark native DataFrame to eliminate its incompatibility with the API of Pandas, the mainstream single-machine DataFrame computing library; constructing a lightweight global index and providing several distributed DataFrame query functions for different conditions; and establishing local indexes and auxiliary indexes to improve query performance. The method addresses two problems: the existing single-machine DataFrame scales poorly and cannot process large-scale data, and the distributed DataFrame query interface of existing big data processing platforms is limited in functionality, hard to use, and slow.
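One way to picture how a global index and per-partition local indexes could cooperate is sketched below. The abstract does not describe the index layout, so everything here is an assumption made for illustration: the global index is taken to be the cumulative row counts of the partitions, the local index is a dictionary from label to offsets, and all function names are hypothetical. Partitions are plain Python lists rather than Spark partitions.

```python
from bisect import bisect_right
from itertools import accumulate


def build_global_index(partition_sizes):
    """Return cumulative end offsets, e.g. [3, 5, 9] for sizes [3, 2, 4]."""
    return list(accumulate(partition_sizes))


def locate(global_index, position):
    """Map a global row position to (partition_id, offset_within_partition)."""
    total = global_index[-1] if global_index else 0
    if not 0 <= position < total:
        raise IndexError(f"position {position} out of range for {total} rows")
    # First partition whose cumulative end offset is greater than `position`.
    pid = bisect_right(global_index, position)
    start = global_index[pid - 1] if pid > 0 else 0
    return pid, position - start


def build_local_index(labels):
    """Map each label in one partition to the local offsets where it occurs."""
    index = {}
    for offset, label in enumerate(labels):
        index.setdefault(label, []).append(offset)
    return index


# Three hypothetical partitions, each a list of (label, value) rows.
partitions = [
    [("a", 10), ("b", 11), ("c", 12)],
    [("d", 13), ("b", 14)],
    [("e", 15), ("f", 16), ("g", 17), ("a", 18)],
]
global_index = build_global_index(len(p) for p in partitions)
local_indexes = [build_local_index(label for label, _ in p) for p in partitions]


def query_by_position(position):
    """Position-based query: touch only the partition that holds the row."""
    pid, offset = locate(global_index, position)
    return partitions[pid][offset]


def query_by_label(label):
    """Label-based query: probe each local index instead of scanning rows."""
    hits = []
    for pid, local in enumerate(local_indexes):
        for offset in local.get(label, []):
            hits.append(partitions[pid][offset])
    return hits
```

Under these assumptions the global index stays small (one integer per partition), so a positional lookup costs a binary search plus a single partition access, while label lookups avoid a full scan of each partition.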

Description

Technical field

[0001] The invention relates to the technical field of distributed computing, and in particular to a Spark-based large-scale distributed DataFrame query method.

Background

[0002] In big data analysis applications, structured big data analysis and processing based on table models remains the most basic requirement in many industries. DataFrame is an easy-to-use table data programming model in a programming language environment. It provides a good abstraction for the statistical process of data analysis and has therefore received extensive attention.

[0003] Traditional relational databases provide a table data model oriented to SQL queries, but SQL queries require the support of a heavyweight database system and SQL query engine in the background. Coupled with the complexity of the SQL query language, the SQL-based table data model is still not convenient enough to operate and use in the common data analysis programming language enviro...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/2455, G06F16/27, G06F16/22
CPC: G06F16/2455, G06F16/278, G06F16/22, Y02D10/00
Inventor: 顾荣, 黄宜华, 施军
Owner: NANJING UNIV