
A load balancing method and device for solving the Spark data skew problem

A load balancing technology, applied in multi-programming devices, electrical digital data processing, and program control design, that addresses redundant tasks occupying cluster resources and the resulting increase in the total completion time of a job.

Active Publication Date: 2021-08-24
CHONGQING INST OF GREEN & INTELLIGENT TECH CHINESE ACADEMY OF SCI
Cites: 4; Cited by: 0

AI Technical Summary

Problems solved by technology

However, when data skew is caused by an unbalanced distribution of the input data, Spark's speculative execution mechanism does not help.
Re-executing a task with the same input data on a different machine yields the same execution time, and the redundant tasks occupy part of the cluster's resources, which ultimately increases the total completion time of the entire job.
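
The point can be illustrated with a toy model. The sketch below assumes, purely for illustration, that a task's run time is proportional to its number of input records and that all machines are equally fast; `task_time`, `RECORDS_PER_SECOND`, and the partition sizes are hypothetical and do not come from the patent.

```python
# Toy model (assumption): run time is proportional to the number of input
# records, and every machine processes records at the same rate.
RECORDS_PER_SECOND = 1_000


def task_time(num_records: int) -> float:
    """Seconds needed to process one partition on any machine in this model."""
    return num_records / RECORDS_PER_SECOND


# Hypothetical skewed stage: one partition holds most of the records.
partitions = [2_000, 2_000, 2_000, 50_000]

# A stage finishes only when its slowest task finishes.
stage_time = max(task_time(n) for n in partitions)

# Speculative execution relaunches the slowest task on another machine,
# but the copy reads the same 50_000 records, so it needs the same time.
speculative_copy = task_time(max(partitions))
stage_time_with_speculation = min(stage_time, speculative_copy)

print(stage_time, stage_time_with_speculation)
```

In this model the speculative copy cannot finish earlier than the original, yet it holds an execution slot that another task could have used.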



Examples


Embodiment Construction

[0060] The preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

[0061] As shown in Figure 2, the load balancing method for solving the Spark data skew problem described in this embodiment includes the following six steps:

[0062] S101. Monitor the average CPU utilization rate and memory utilization rate of the computing nodes, and initialize the weight information of the Executor after the Spark Executor process starts;

[0063] S102. Each computing node samples the local intermediate data according to the sampling ratio, which is set individually by the user, and then the computing node sends the local sampling information to the Master node through message communication;

[0064] S103. The Master node summarizes the sampling information of all computing nodes, and then establishes a histogram of data distribution according to the sampling ratio, and predicts the overall characteristics of the da...
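
Steps S102 and S103 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes each node samples keys independently with probability equal to the sampling ratio, and that the Master estimates global key frequencies by scaling merged sample counts by the inverse of that ratio. The function names and the example data are hypothetical.

```python
import random
from collections import Counter


def sample_local_keys(keys, ratio, seed=0):
    """Node side (S102): keep each local key with probability `ratio`."""
    rng = random.Random(seed)
    return Counter(k for k in keys if rng.random() < ratio)


def build_histogram(node_samples, ratio):
    """Master side (S103): merge sample counts, scale by 1/ratio (assumption)."""
    merged = Counter()
    for sample in node_samples:
        merged.update(sample)
    return {key: count / ratio for key, count in merged.items()}


# Hypothetical intermediate data on two nodes; key "a" is heavily skewed.
node1 = ["a"] * 8_000 + ["b"] * 1_000
node2 = ["a"] * 7_000 + ["c"] * 2_000
ratio = 0.1  # user-defined sampling ratio

samples = [
    sample_local_keys(node1, ratio, seed=1),
    sample_local_keys(node2, ratio, seed=2),
]
histogram = build_histogram(samples, ratio)

# The key with the largest estimated frequency is the skew candidate.
heaviest_key = max(histogram, key=histogram.get)
print(heaviest_key, histogram)
```

Only the small per-node `Counter` objects need to travel to the Master, which keeps the message size proportional to the number of distinct sampled keys rather than to the data volume.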


Abstract

The present invention relates to a load balancing method for solving the Spark data skew problem, including S1: monitoring the average CPU utilization rate and memory utilization rate of the computing nodes, and initializing the Executor weight information after the Spark Executor process is started; S2: each computing node samples its local intermediate data according to the preset sampling ratio, and then sends the local sampling information to the Master node through message communication; S3: the Master node summarizes the sampling information, establishes a histogram of the data distribution, and predicts the overall characteristics of the data distribution; S4: according to the data distribution, the data is divided into multiple partitions, where the number of partitions is an integer multiple of the total number of cores of all Executors and large keys are split during partitioning; S5: the performance factor of each Executor is calculated, each data partition corresponds to one Executor task, and the tasks are allocated to the Executor with the highest performance factor according to a greedy strategy; S6: the weight of each Executor is dynamically adjusted according to its load and resource utilization, and step S5 is repeated until all tasks are allocated. A corresponding device is also included.
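
Steps S4 to S6 can be sketched as below. The abstract does not give the performance factor formula or the packing rule, so both are assumptions made here for illustration: `performance_factor` multiplies an executor's idle resource share by its core count and discounts it by the load already assigned, and pieces are packed largest-first into the lightest partition. All names and numbers are hypothetical.

```python
import heapq
import math


def split_and_partition(key_sizes, total_cores, multiple=2):
    """S4 sketch: split large keys, pack pieces into multiple * total_cores partitions."""
    num_partitions = multiple * total_cores
    target = math.ceil(sum(key_sizes.values()) / num_partitions)
    pieces = []
    for key, size in key_sizes.items():
        while size > target:  # a large key is cut into target-sized pieces
            pieces.append((target, key))
            size -= target
        if size > 0:
            pieces.append((size, key))
    # Largest pieces first, each into the currently lightest partition.
    heap = [(0, i) for i in range(num_partitions)]
    partitions = [[] for _ in range(num_partitions)]
    for size, key in sorted(pieces, reverse=True):
        load, i = heapq.heappop(heap)
        partitions[i].append((key, size))
        heapq.heappush(heap, (load + size, i))
    return partitions


def performance_factor(executor):
    """Hypothetical formula: idle share times cores, discounted by assigned load."""
    idle = 1.0 - (executor["cpu"] + executor["mem"]) / 2.0
    return idle * executor["cores"] / (1.0 + executor["load"])


def assign_tasks(partitions, executors):
    """S5/S6 sketch: give each task to the executor with the highest factor."""
    assignment = {name: [] for name in executors}
    sizes = [sum(size for _, size in p) for p in partitions]
    for task_id in sorted(range(len(sizes)), key=lambda t: -sizes[t]):
        best = max(executors, key=lambda name: performance_factor(executors[name]))
        assignment[best].append(task_id)
        executors[best]["load"] += sizes[task_id]  # re-weight after each allocation
    return assignment


key_sizes = {"a": 900, "b": 60, "c": 40}  # hypothetical histogram estimate
executors = {
    "exec-1": {"cores": 2, "cpu": 0.2, "mem": 0.3, "load": 0},
    "exec-2": {"cores": 2, "cpu": 0.7, "mem": 0.6, "load": 0},
}
parts = split_and_partition(key_sizes, total_cores=4, multiple=2)
plan = assign_tasks(parts, executors)
print(len(parts), plan)
```

With these inputs the skewed key "a" is spread over several partitions instead of landing in one, and the less loaded executor receives more of the resulting tasks. A different factor formula would change the split between executors but not the overall greedy loop.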

Description

Technical field

[0001] The invention belongs to the technical field of online cluster resource scheduling, and relates to a load balancing method and device for solving the Spark data skew problem.

Background

[0002] Large-scale in-memory computing platforms are widely adopted in academia and industry to process large amounts of data from diverse applications and data sources. These platforms greatly reduce the number of disk I/Os by caching intermediate application data in memory and by using a more powerful and flexible task scheduling mechanism based on directed acyclic graphs (DAGs). At the same time, the DAG-based programming paradigm gives users the flexibility to express application requirements. However, complex task scheduling makes it very challenging for users to identify application bottlenecks and tune performance.

[0003] As a popular in-memory computing platform today, Spark has quickly been embraced by academia and industry for its advanced des...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F9/50
CPC: G06F9/5016, G06F9/505
Inventor: 田文洪, 黄超杰, 王金, 尚明生
Owner: CHONGQING INST OF GREEN & INTELLIGENT TECH CHINESE ACADEMY OF SCI