Method and device for improving Spark operation efficiency

An efficiency and process technology applied in the field of big data analysis and processing. It addresses problems such as insufficient use of data, consumption of network IO, serial submission of tasks, and insufficient parallelism, so as to improve operating efficiency and increase parallelism.

CN107544844A | Inactive | Publication Date: 2018-01-05 | ZTE CORP

Embodiment Construction

[0056] The preferred embodiments of the present invention will be described in detail below in conjunction with the accompanying drawings. It should be understood that the preferred embodiments described below are only used to illustrate and explain the present invention, and are not intended to limit the present invention.

[0057] Figure 1 is a block diagram of a method for improving Spark performance provided by an embodiment of the present invention. As shown in Figure 1, the steps include:

[0058] Step S101: Determine the tables that need to be cached in the system.

[0059] The tables that need to be cached are determined according to the out-degree of each table, the number of records cached for the table, and the difference in ready time between the multiple cache tasks of the table; and/or tables of a custom cache type are determined as tables that need to be cached.
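
The selection rule in [0059] can be illustrated with a minimal sketch. The source names the three criteria and the custom cache type but does not say how they are combined, so the thresholds, their direction (high out-degree, bounded record count, bounded ready-time gap), and every name below (TableInfo, select_tables_to_cache, min_out_degree, max_records, max_ready_gap_s) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class TableInfo:
    """Hypothetical per-table statistics; field names are not from the source."""
    name: str
    out_degree: int            # number of downstream tasks that read the table
    record_count: int          # number of records that would be cached
    ready_time_gap_s: float    # spread between ready times of the tasks using the table
    custom_cache: bool = False # table explicitly marked with a custom cache type


def select_tables_to_cache(
    tables: Iterable[TableInfo],
    min_out_degree: int = 2,         # assumed threshold
    max_records: int = 10_000_000,   # assumed threshold
    max_ready_gap_s: float = 600.0,  # assumed threshold
) -> List[str]:
    """Return the names of tables chosen for caching, in input order."""
    selected = []
    for t in tables:
        # Tables of a custom cache type are always cached.
        if t.custom_cache:
            selected.append(t.name)
            continue
        # Assumed combination: all three criteria must hold at once.
        reused_enough = t.out_degree >= min_out_degree
        small_enough = t.record_count <= max_records
        used_close_in_time = t.ready_time_gap_s <= max_ready_gap_s
        if reused_enough and small_enough and used_close_in_time:
            selected.append(t.name)
    return selected
```

Under these assumptions, a table read by several tasks within a short window is cached, a table read only once is not, and a table marked as custom is cached regardless of its statistics.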

[0060] Step S102: Identify the cache task that takes the determined table that needs to be cached (refe...

Abstract

The invention discloses a method and a device for improving Spark operation efficiency, and relates to the field of big data analysis and processing. The method comprises the following steps: determining the tables that need to be cached in a system; identifying the cache tasks that take a determined table as input or output; grouping the identified cache tasks and creating a processing process for each cache task group; and, according to the current state of each processing process and the real-time usage of Spark cluster resources, combining the cache tasks to be submitted and sending the combined cache tasks to a Spark cluster for processing. With the embodiments of the invention, the system resources of the cluster can be fully utilized, the tables and contents to be cached are reasonably determined, processes are dynamically decided and scheduled, the degree of parallelism is maximized as far as resources permit, and Spark operation efficiency is thereby improved.
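
The identify, group, and combine steps of the abstract can be sketched as follows. This is not the patented scheduler: the grouping key (one group per cached table), the rule of submitting only the head task of each idle group, the use of free cores as the resource measure, and all names (CacheTask, identify_cache_tasks, group_by_table, combine_for_submission) are assumptions chosen to make the flow concrete. No Spark API is called; submission to the cluster is left out.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class CacheTask:
    """Hypothetical task description; field names are not from the source."""
    name: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]
    cores: int  # assumed resource estimate for the task


def identify_cache_tasks(tasks: Iterable[CacheTask], cached_tables: Iterable[str]) -> List[CacheTask]:
    """Keep the tasks that read or write at least one table chosen for caching."""
    cached = set(cached_tables)
    return [t for t in tasks if cached & (set(t.inputs) | set(t.outputs))]


def group_by_table(tasks: Iterable[CacheTask], cached_tables: Iterable[str]) -> Dict[str, List[CacheTask]]:
    """Group cache tasks by the cached table they touch (assumed grouping key).

    A task touching several cached tables goes to the alphabetically first one.
    """
    cached = set(cached_tables)
    groups: Dict[str, List[CacheTask]] = defaultdict(list)
    for t in tasks:
        touched = sorted(cached & (set(t.inputs) | set(t.outputs)))
        if touched:
            groups[touched[0]].append(t)
    return dict(groups)


def combine_for_submission(groups: Dict[str, List[CacheTask]], free_cores: int,
                           busy_groups: Iterable[str] = ()) -> List[str]:
    """Pick the head task of every idle group that still fits in the free cores.

    Taking only the head keeps each group serial while groups run in parallel.
    """
    busy = set(busy_groups)
    remaining = free_cores
    batch = []
    for table in sorted(groups):
        if table in busy or not groups[table]:
            continue
        head = groups[table][0]
        if head.cores <= remaining:
            batch.append(head.name)
            remaining -= head.cores
    return batch
```

In this sketch the batch grows with the cores reported free, so parallelism rises when the cluster has spare capacity and falls back toward serial submission when it does not.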

Description

Technical field

[0001] The invention relates to the field of big data analysis and processing, in particular to a method and a device for improving Spark operation performance.

Background

[0002] With the development of informatization, the data to be processed by enterprises has grown explosively, and data volumes have reached the level of terabytes (TB) and petabytes (PB). To support the analysis and processing of such large-scale data, various big data frameworks, tools and technologies have emerged, and Spark is one of them.

[0003] Spark is a big data processing framework built around speed, ease of use, and complex analysis. By adopting a lower-cost shuffle in the data processing process, it raises the MapReduce model to a higher level, and it utilizes in-memory data storage and near real-time processing capabilities, making its performance many times faster than other big data processin...

Application Information

Patent Timeline
05 Jan 2018: Publication of CN107544844A

IPC: G06F9/50
Inventors: 肖丽华; 王跃