
Method for improving query efficiency of Spark SQL

A method for improving the query efficiency of Spark SQL by caching Shuffle intermediate data, addressing the problem of high disk I/O overhead.

Active Publication Date: 2018-10-26
SOUTHEAST UNIV

Problems solved by technology

[0004] The purpose of the present invention is to provide a method for improving the query efficiency of Spark SQL, which solves the problem of high disk I/O overhead in Spark SQL queries through a Shuffle intermediate data cache processing method.

Detailed Description of the Embodiments

[0040] The present invention will be further illustrated below in conjunction with specific embodiments, and it should be understood that the following specific embodiments are only used to illustrate the present invention and are not intended to limit the scope of the present invention.

[0041] A method for improving the query efficiency of Spark SQL, the method comprising the following steps:

[0042] Step S1: Build a query pre-analysis module, use an estimation model to calculate the size of the intermediate data generated by Shuffle, and calculate the total size of the intermediate data cache layer used to cache that intermediate data;
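The patent does not disclose the exact formulas of the estimation model, so the following is only a minimal sketch of what such a step-S1 estimate could look like. It assumes, purely for illustration, that each shuffle stage's intermediate data is roughly its input size scaled by an operator selectivity and a serialization/compression factor, summed over stages with a safety margin; all function names and parameters here are hypothetical.

```python
def estimate_shuffle_bytes(input_bytes, selectivity, compression_ratio):
    """Estimate the Shuffle intermediate data produced by one stage.

    Assumed model: output = input size x fraction of rows surviving the
    operator (selectivity) x on-disk serialization/compression ratio.
    """
    return input_bytes * selectivity * compression_ratio


def estimate_cache_layer_bytes(stages, safety_factor=1.2):
    """Total cache-layer size: sum of per-stage estimates plus headroom."""
    total = sum(estimate_shuffle_bytes(b, s, c) for b, s, c in stages)
    return total * safety_factor


# Example: two shuffle stages, each given as
# (input bytes, selectivity, compression ratio).
stages = [(10 * 2**30, 0.5, 0.4), (4 * 2**30, 0.8, 0.4)]
print(estimate_cache_layer_bytes(stages))
```

The safety factor leaves headroom for estimation error, since undersizing the cache layer would spill intermediate data back to disk and reintroduce the I/O cost the method aims to avoid.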

[0043] Step S2: According to the total size of the intermediate data cache layer calculated in step S1, and combined with the distribution of input data across the nodes of the cluster, set a reasonable memory space size for each node through the cache layer allocation module.
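Step S2 can be sketched as a proportional split: the total cache size from step S1 is divided among nodes according to each node's share of the input data. This is an assumed allocation rule, not one stated in the patent text, and the cap against each node's free memory is likewise an illustrative assumption.

```python
def allocate_cache(total_cache_bytes, node_input_bytes, node_free_bytes):
    """Split the total cache layer across nodes.

    Assumed rule: each node receives a share proportional to its share of
    the cluster's input data, capped by the free memory on that node.
    """
    total_input = sum(node_input_bytes.values())
    alloc = {}
    for node, inp in node_input_bytes.items():
        share = total_cache_bytes * inp / total_input
        alloc[node] = min(share, node_free_bytes[node])  # respect node memory
    return alloc


# Example: node1 holds 6 GiB of input, node2 holds 2 GiB; split a 4 GiB
# cache layer between them (both nodes have 8 GiB free).
inputs = {"node1": 6 * 2**30, "node2": 2 * 2**30}
free = {"node1": 8 * 2**30, "node2": 8 * 2**30}
print(allocate_cache(4 * 2**30, inputs, free))
```

Allocating in proportion to the input-data distribution keeps each node's cached Shuffle output local to where it is produced, which is the intuition behind combining the total size with the per-node data distribution.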

[0044] Further, the specific method for calculating the size of the intermediate...


Abstract

The invention discloses a method for improving the query efficiency of Spark SQL. The method comprises the steps of: S1, establishing a query pre-analysis module, calculating the size of the intermediate data produced by Shuffle using an estimation model, and calculating the total size of the intermediate data cache layer used to cache that intermediate data; and S2, according to the total size of the intermediate data cache layer calculated in S1, and combined with the distribution of input data across the nodes of the cluster, setting a reasonable memory space size for each node through a cache layer allocation module. Through this Shuffle intermediate data cache processing method, the invention effectively addresses the relatively high disk I/O cost of Spark SQL queries.

Description

Technical field:

[0001] The invention relates to a method for improving the query efficiency of Spark SQL, and in particular to a method that does so by caching the intermediate data of Shuffle operations.

Background technique:

[0002] With the continuous development and popularization of the Internet, enterprises, government agencies, and scientific research institutions generate considerable amounts of data every day. For example, Taobao generates 7 TB of data per day, and Baidu needs to process 100 PB of data per day. The practical demand for processing big data has driven extensive academic research in the field of cloud computing. Apache Spark is a memory-based computing framework built around RDDs (Resilient Distributed Datasets), which provide data distribution and fault tolerance. Spark is a high-speed and general-purpose big data processing eng...


Application Information

IPC(8): G06F17/30
Inventors: 宋爱波 (Song Aibo), 万雨桐 (Wan Yutong)
Owner: SOUTHEAST UNIV