Iterative data equilibrium optimization method for Spark parallel computing framework

An iterative data-partitioning technique for parallel computing, applied in the fields of big data processing and high-performance computing. It addresses the insufficient accuracy, lack of generality, and delayed job completion of earlier partitioning methods, improving overall performance and achieving an overall balance of data.

Inactive Publication Date: 2017-12-22

ZHEJIANG UNIV OF TECH


Problems solved by technology

However, the right moment to partition is hard to determine in this method, so it is not generally applicable.

(3) Sampling data partitioning method: This method, proposed by Ramakrishnan et al. (Proceedings of the 3rd ACM Symposium on Cloud Computing, 2012), combines sampling with data splitting. It adds an extra process to the execution of data processing that is responsible for analyzing the data distribution. Once a certain proportion of the data has been processed, the data is split and merged according to the analysis results of the sampling process: a partition holding a large amount of data is split, and the pieces are merged with partitions holding a small amount of data. However, collecting the data distribution incurs extra overhead and increases data access and data transmission costs. Sampling also introduces uncertainty: too small a sample gives insufficient accuracy, while too large a sample adds further overhead.

(4) Delayed data partitioning method: This method, proposed by Kwon et al. (Proceedings of the 1st ACM Symposium on Cloud Computing, 2010), defines a cost model, uses it to estimate the size of the generated data partitions, and starts partitioning once the task has run to a certain point.

However, this method incurs additional data transmission costs, which delays job completion to a certain extent.


Embodiment Construction

[0054] Taking the classic big-data WordCount program as an example, the specific embodiment of the present invention is further described below with reference to Figures 1 to 3:

[0055] Assume that the WordCount program counts the words in 4 blocks, which are distributed to 4 nodes. Each block contains 2 lines of data, and the content of each block is as follows:

[0056] Block1:

[0057] Spark is a fast and Spark is a general-purpose engine for large-scale data processing.

[0058] Spark runs programs faster than Hadoop MapReduce in memory and ondisk.

[0059] Block2:

[0060] Spark performance is impacted by many soft system, hardware and dataset factors.

[0061] Spark can run both by itself, or over several existing cluster managers.

[0062] Block3:

[0063] Big Data can be defined as large data sets are being generated from different sources.

[0064] The use of the MapReduce and Spark are two approaches perform data analytics. ...
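To make the key skew in this example concrete, the following is a minimal Python sketch, not the patent's implementation, that counts words in the three blocks reproduced above. The fourth block is not shown in this excerpt and is left out. The names `BLOCKS` and `word_count`, and the choice to strip trailing commas and periods, are assumptions made only for illustration.

```python
from collections import Counter

# Text of the three blocks shown above, copied verbatim.
# Block4 is elided in the source, so it is not included here.
BLOCKS = {
    "Block1": [
        "Spark is a fast and Spark is a general-purpose engine for large-scale data processing.",
        "Spark runs programs faster than Hadoop MapReduce in memory and ondisk.",
    ],
    "Block2": [
        "Spark performance is impacted by many soft system, hardware and dataset factors.",
        "Spark can run both by itself, or over several existing cluster managers.",
    ],
    "Block3": [
        "Big Data can be defined as large data sets are being generated from different sources.",
        "The use of the MapReduce and Spark are two approaches perform data analytics.",
    ],
}


def word_count(lines):
    """Count whitespace-separated tokens after stripping commas and periods."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            word = token.strip(".,")
            if word:
                counts[word] += 1
    return counts


# Per-block counts (what each Map task would emit) and the merged total.
per_block = {name: word_count(lines) for name, lines in BLOCKS.items()}
total = sum(per_block.values(), Counter())

if __name__ == "__main__":
    print(total.most_common(3))
```

In these three blocks the key `Spark` appears six times, more than any other word, so a Reducer that receives this key gets more records than its peers. This is the kind of imbalance the partitioning method aims to even out.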


Abstract

The present invention provides an iterative data equilibrium partitioning method for a Spark parallel computing framework. The method comprises: firstly, dividing a coarse-grained Block of big data into a fine-grained FG-Block, and creating a micro-partition and a micro-partition index according to the FG-Block; secondly, creating an equal quantity of Buckets according to a quantity of Reducers; thirdly, determining a timing and a quantity of an iterative data partition and a rule of iterative partitioning; fourthly, recording local and global data allocation conditions of each Bucket; fifthly, allocating a selected micro-partition to each Bucket according to a data equilibrium partitioning algorithm and an allocation condition; and finally, transmitting allocated data in the Bucket to a Reducer side. The present invention provides a new data equilibrium partitioning method for a Spark framework, which reduces data skew during big data processing and improves overall performance of big data of the Spark parallel computing framework.
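The steps in the abstract can be sketched roughly as follows. The abstract does not spell out the data equilibrium partitioning algorithm, so this sketch substitutes a simple greedy rule (largest micro-partition first, always into the currently lightest Bucket) purely as an assumption. The function names `allocate_round` and `iterative_partition`, the per-round input format, and the record counts in the usage example are all hypothetical.

```python
import heapq


def allocate_round(micro_partitions, global_load):
    """Assign one round of micro-partitions to Buckets, largest first.

    micro_partitions: dict mapping micro-partition id -> record count.
    global_load: list of record counts already held by each Bucket
                 (updated in place, acting as the global allocation record).
    Returns the local allocation of this round: Bucket index -> list of ids.
    """
    # Min-heap of (load, bucket index) so the lightest Bucket is popped first.
    heap = [(load, b) for b, load in enumerate(global_load)]
    heapq.heapify(heap)
    local = {b: [] for b in range(len(global_load))}
    # Sort by size descending, then by id, to keep the result deterministic.
    ordered = sorted(micro_partitions.items(), key=lambda kv: (-kv[1], kv[0]))
    for mp_id, size in ordered:
        load, b = heapq.heappop(heap)
        local[b].append(mp_id)
        global_load[b] = load + size
        heapq.heappush(heap, (global_load[b], b))
    return local


def iterative_partition(rounds, num_reducers):
    """Create one Bucket per Reducer and fill the Buckets round by round."""
    global_load = [0] * num_reducers
    buckets = {b: [] for b in range(num_reducers)}
    for micro_partitions in rounds:
        local = allocate_round(micro_partitions, global_load)
        for b, ids in local.items():
            buckets[b].extend(ids)
    return buckets, global_load


if __name__ == "__main__":
    # Hypothetical record counts per micro-partition, in two rounds.
    rounds = [
        {"Spark": 6, "and": 4, "data": 3, "is": 3},
        {"MapReduce": 2, "by": 2, "a": 2},
    ]
    buckets, loads = iterative_partition(rounds, num_reducers=2)
    print(buckets, loads)
```

Carrying `global_load` from one round into the next is what lets a later round compensate for imbalance left by an earlier one. A real Spark implementation would plug such logic into the shuffle path rather than run it on plain dictionaries.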

Description

Technical field

[0001] The invention relates to the fields of big data processing and high-performance computing, and in particular proposes an iterative data balance optimization method oriented to the Spark parallel computing framework.

Background technique

[0002] MapReduce is a parallel computing model for big data processing proposed by Google in 2004. It improves data processing performance by running many tasks simultaneously on a large number of inexpensive cluster nodes to process massive data in parallel, and it has developed rapidly and been widely used in recent years. Spark is a parallel computing framework based on MapReduce, developed by the AMPLab at the University of California, Berkeley in 2009. It retains the advantages of MapReduce and keeps the intermediate results of task computation in memory, which reduces disk read and write costs and improves the performance of big data processing. It has become the mainstream framework for building big ...


Application Information


IPC(8): G06F17/30, G06F9/50, G06F12/02

CPC: G06F9/5061, G06F12/023, G06F16/2228

Inventor: 张元鸣, 蒋建波, 黄浪游, 沈志鹏, 项倩红, 肖刚, 陆佳炜, 高飞

Owner: ZHEJIANG UNIV OF TECH
