Iterative data equilibrium optimization method for Spark parallel computing framework

An iterative data-balancing technique for parallel computing, applied in the fields of big data processing and high-performance computing. It addresses the insufficient accuracy, lack of generality, and delayed job completion of existing partitioning methods, improving overall performance and achieving a globally balanced data distribution.

Status: Inactive; Publication Date: 2017-12-22
ZHEJIANG UNIV OF TECH

AI Technical Summary

Problems solved by technology

However, the right moment to partition is hard to determine with this method, so it is not generally applicable.

(3) Sampling data partitioning method: proposed by Ramakrishnan et al. (Proceedings of the 3rd ACM Symposium on Cloud Computing, 2012), this method combines sampling with data splitting and adds an extra process to the data-processing run that analyzes the data distribution. Once a certain proportion of the data has been processed, partitions are split and merged according to the results of the sampling process: partitions holding a large amount of data are split, and the pieces are merged with partitions holding little data. However, collecting the data distribution adds overhead for data access and data transmission. Sampling is also uncertain: too small a sample gives insufficient accuracy, while too large a sample adds further overhead.

(4) Delayed data partitioning method: proposed by Kwon et al. (Proceedings of the 1st ACM Symposium on Cloud Computing, 2010), this method defines a cost model to estimate the size of the generated data partitions and starts partitioning once the task has run to a certain point. However, it incurs additional data transmission costs, which delays job completion to some extent.
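The sampling trade-off noted for method (3) can be illustrated with a small sketch. It is not the procedure of Ramakrishnan et al.; it only assumes uniform random sampling over a hypothetical skewed key list and compares the estimated share of each key with its true share. The function names and the key list are made up for illustration.

```python
import random
from collections import Counter

def estimate_key_shares(keys, sample_size, seed=0):
    """Estimate each key's share of the data from a uniform random sample."""
    rng = random.Random(seed)  # fixed seed so repeated runs are comparable
    sample = rng.sample(keys, min(sample_size, len(keys)))
    counts = Counter(sample)
    return {k: c / len(sample) for k, c in counts.items()}

def max_abs_error(true_shares, est_shares):
    """Largest absolute gap between true and estimated share over all keys."""
    return max(abs(true_shares[k] - est_shares.get(k, 0.0)) for k in true_shares)

# Hypothetical skewed input: two hot keys and 100 keys that occur once.
keys = ["spark"] * 600 + ["data"] * 300 + [f"w{i}" for i in range(100)]
true_shares = {k: c / len(keys) for k, c in Counter(keys).items()}

for n in (10, 100, 1000):
    err = max_abs_error(true_shares, estimate_key_shares(keys, n))
    print(f"sample={n:5d}  max share error={err:.3f}")
```

A sample that covers the whole input reproduces the true distribution exactly, but at the cost of reading all of the data, which is the overhead the text describes.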

Examples

Embodiment Construction

[0054] The specific embodiment of the present invention is described further below, using the classic big-data WordCount program as an example, with reference to Figures 1 to 3:

[0055] Assume that the WordCount program is to count the words in 4 blocks, which are distributed across 4 nodes. Each block contains 2 lines of data, with the following content:

[0056] Block1:

[0057] Spark is a fast and Spark is a general-purpose engine for large-scale data processing.

[0058] Spark runs programs faster than Hadoop MapReduce in memory and ondisk.

[0059] Block2:

[0060] Spark performance is impacted by many soft system, hardware and dataset factors.

[0061] Spark can run both by itself, or over several existing cluster managers.

[0062] Block3:

[0063] Big Data can be defined as large data sets are being generated from different sources.

[0064] The use of the MapReduce and Spark are two approaches perform data analytics. ...
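As a rough illustration of why this input is skewed, the following sketch counts words over the three blocks shown above in plain Python (no Spark dependency). The fourth block is not shown in the source, so it is omitted here. The tokenization rule (lowercase, letters and hyphens only) and the function names are assumptions made for this example, not part of the patent.

```python
import re
from collections import Counter

# Block contents copied verbatim from paragraphs [0057] to [0064].
BLOCKS = {
    "Block1": [
        "Spark is a fast and Spark is a general-purpose engine for large-scale data processing.",
        "Spark runs programs faster than Hadoop MapReduce in memory and ondisk.",
    ],
    "Block2": [
        "Spark performance is impacted by many soft system, hardware and dataset factors.",
        "Spark can run both by itself, or over several existing cluster managers.",
    ],
    "Block3": [
        "Big Data can be defined as large data sets are being generated from different sources.",
        "The use of the MapReduce and Spark are two approaches perform data analytics.",
    ],
}

def map_block(lines):
    """Map side: count the words of one block locally."""
    counts = Counter()
    for line in lines:
        counts.update(re.findall(r"[a-z][a-z-]*", line.lower()))
    return counts

def reduce_counts(partials):
    """Reduce side: merge the per-block counts into global counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

totals = reduce_counts(map_block(lines) for lines in BLOCKS.values())
print(totals.most_common(5))
```

Under this tokenization, "spark" appears 6 times while most words appear once, so whichever Reducer receives the key "spark" gets noticeably more records than the others.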

Abstract

The present invention provides an iterative data equilibrium partitioning method for the Spark parallel computing framework. The method comprises: first, dividing each coarse-grained Block of big data into fine-grained FG-Blocks, and creating micro-partitions and a micro-partition index from the FG-Blocks; second, creating as many Buckets as there are Reducers; third, determining the timing and number of iterative data partitioning rounds and the rule for iterative partitioning; fourth, recording the local and global data allocation of each Bucket; fifth, allocating the selected micro-partitions to the Buckets according to a data equilibrium partitioning algorithm and the recorded allocation; and finally, transmitting the data allocated to each Bucket to the Reducer side. The invention thus provides a new data equilibrium partitioning method for the Spark framework that reduces data skew during big data processing and improves the overall big data processing performance of the Spark parallel computing framework.
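The abstract does not spell out the data equilibrium partitioning algorithm itself. The sketch below is therefore only one plausible reading, under two stated assumptions: each round places new micro-partitions largest-first into the currently lightest Bucket (a greedy heuristic), and a micro-partition that was placed in an earlier round stays in the same Bucket so that all records for a key reach one Reducer. The function name `allocate_round` and the sizes are hypothetical.

```python
import heapq

def allocate_round(micro_sizes, bucket_loads, assignment):
    """Allocate one round of micro-partitions to Buckets.

    micro_sizes:  {micro_partition_id: record_count} produced in this round
    bucket_loads: global record count per Bucket, updated in place
    assignment:   {micro_partition_id: bucket_index}, updated in place
    """
    new_sizes = {}
    for mp_id, size in micro_sizes.items():
        if mp_id in assignment:
            # Assumption: data follows the Bucket chosen in an earlier round.
            bucket_loads[assignment[mp_id]] += size
        else:
            new_sizes[mp_id] = size

    # Min-heap of (load, bucket_index) so the lightest Bucket is popped first.
    heap = [(load, b) for b, load in enumerate(bucket_loads)]
    heapq.heapify(heap)
    for mp_id, size in sorted(new_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        load, b = heapq.heappop(heap)
        assignment[mp_id] = b
        bucket_loads[b] = load + size
        heapq.heappush(heap, (bucket_loads[b], b))
    return assignment

# Two Reducers, hence two Buckets; sizes are made-up record counts.
loads = [0, 0]
placed = {}
allocate_round({0: 8, 1: 7, 2: 3, 3: 2}, loads, placed)
print(loads)   # [10, 10]
allocate_round({0: 1, 4: 5, 5: 4}, loads, placed)
print(loads)   # [15, 15]
```

Using many more micro-partitions than Reducers is what gives the allocator room to even out the loads; with one partition per Reducer there would be nothing to rearrange.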

Description

Technical field

[0001] The invention relates to the fields of big data processing and high-performance computing, and in particular proposes an iterative data balance optimization method for the Spark parallel computing framework.

Background

[0002] MapReduce is a parallel computing model for big data processing proposed by Google in 2004. It improves data processing performance by running many tasks simultaneously on a large number of inexpensive cluster nodes to process massive data in parallel, and it has developed rapidly and been widely adopted in recent years. Spark is a parallel computing framework based on MapReduce, developed by the AMPLab at the University of California, Berkeley in 2009. It retains the advantages of MapReduce and keeps the intermediate results of task computations in memory, which reduces disk read and write costs and improves the performance of big data processing. It has become the mainstream framework for building big ...

Application Information

IPC(8): G06F17/30, G06F9/50, G06F12/02
CPC: G06F9/5061, G06F12/023, G06F16/2228
Inventor: 张元鸣, 蒋建波, 黄浪游, 沈志鹏, 项倩红, 肖刚, 陆佳炜, 高飞
Owner: ZHEJIANG UNIV OF TECH