
Method for improving query efficiency of Spark SQL

A method for improving the query efficiency of Spark SQL by caching Shuffle intermediate data, addressing the problem of high disk I/O overhead.

Active Publication Date: 2018-10-26
SOUTHEAST UNIV

Problems solved by technology

[0004] The purpose of the present invention is to provide a method for improving the query efficiency of Spark SQL, which solves the problem of high disk I/O overhead in Spark SQL queries through a Shuffle intermediate data cache processing method.

Detailed Description of the Embodiments

[0040] The present invention will be further illustrated below in conjunction with specific embodiments, and it should be understood that the following specific embodiments are only used to illustrate the present invention and are not intended to limit the scope of the present invention.

[0041] A method for improving the query efficiency of Spark SQL, the method comprising the following steps:

[0042] Step S1: Build a query pre-analysis module, use an estimation model to calculate the size of the intermediate data generated by Shuffle, and calculate the total size of the intermediate data cache layer used to cache that intermediate data;
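The patent does not disclose the exact formulas of the estimation model, so the following is only a minimal sketch of what such a step-S1 estimate could look like. It assumes, purely for illustration, that each shuffle stage's intermediate data is roughly its input size scaled by an operator selectivity and a serialization/compression factor, summed over stages with a safety margin; all function names and parameters here are hypothetical.

```python
def estimate_shuffle_bytes(input_bytes, selectivity, compression_ratio):
    """Estimate the Shuffle intermediate data produced by one stage.

    Assumed model: output = input size x fraction of rows surviving the
    operator (selectivity) x on-disk serialization/compression ratio.
    """
    return input_bytes * selectivity * compression_ratio


def estimate_cache_layer_bytes(stages, safety_factor=1.2):
    """Total cache-layer size: sum of per-stage estimates plus headroom."""
    total = sum(estimate_shuffle_bytes(b, s, c) for b, s, c in stages)
    return total * safety_factor


# Example: two shuffle stages, each given as
# (input bytes, selectivity, compression ratio).
stages = [(10 * 2**30, 0.5, 0.4), (4 * 2**30, 0.8, 0.4)]
print(estimate_cache_layer_bytes(stages))
```

The safety factor leaves headroom for estimation error, since undersizing the cache layer would spill intermediate data back to disk and reintroduce the I/O cost the method aims to avoid.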

[0043] Step S2: According to the total size of the intermediate data cache layer calculated in step S1, and combined with the distribution of input data across the nodes of the cluster, set a reasonable memory space size for each node through the cache layer allocation module.
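Step S2 can be sketched as a proportional split: the total cache size from step S1 is divided among nodes according to each node's share of the input data. This is an assumed allocation rule, not one stated in the patent text, and the cap against each node's free memory is likewise an illustrative assumption.

```python
def allocate_cache(total_cache_bytes, node_input_bytes, node_free_bytes):
    """Split the total cache layer across nodes.

    Assumed rule: each node receives a share proportional to its share of
    the cluster's input data, capped by the free memory on that node.
    """
    total_input = sum(node_input_bytes.values())
    alloc = {}
    for node, inp in node_input_bytes.items():
        share = total_cache_bytes * inp / total_input
        alloc[node] = min(share, node_free_bytes[node])  # respect node memory
    return alloc


# Example: node1 holds 6 GiB of input, node2 holds 2 GiB; split a 4 GiB
# cache layer between them (both nodes have 8 GiB free).
inputs = {"node1": 6 * 2**30, "node2": 2 * 2**30}
free = {"node1": 8 * 2**30, "node2": 8 * 2**30}
print(allocate_cache(4 * 2**30, inputs, free))
```

Allocating in proportion to the input-data distribution keeps each node's cached Shuffle output local to where it is produced, which is the intuition behind combining the total size with the per-node data distribution.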

[0044] Further, the specific method for calculating the size of the intermediate...


Abstract

The invention discloses a method for improving the query efficiency of Spark SQL. The method comprises the steps of: S1, establishing a query pre-analysis module, calculating the size of the intermediate data produced by Shuffle using an estimation model, and calculating the total size of the intermediate data cache layer used to cache that intermediate data; and S2, according to the total size of the intermediate data cache layer calculated in S1, and combined with the distribution of input data across the nodes of the cluster, setting a reasonable memory space size for each node through a cache layer allocation module. Through this Shuffle intermediate data cache processing method, the invention effectively addresses the relatively high disk I/O cost of Spark SQL queries.

Description

Technical field:

[0001] The invention relates to a method for improving the query efficiency of Spark SQL, and in particular to a method that does so by caching the intermediate data of Shuffle operations.

Background technique:

[0002] With the continuous development and popularization of the Internet, enterprises, government agencies, and scientific research institutions generate considerable amounts of data every day. For example, Taobao generates 7 TB of data per day, and Baidu needs to process 100 PB of data per day. The practical demand for processing big data has driven extensive academic research in the field of cloud computing. Apache Spark is a memory-based computing framework built around RDDs (Resilient Distributed Datasets), which provide data distribution and fault tolerance. Spark is a high-speed and general-purpose big data processing eng...


Application Information

IPC(8): G06F17/30
Inventors: 宋爱波 (Song Aibo), 万雨桐 (Wan Yutong)
Owner: SOUTHEAST UNIV