Method and device for improving Spark operation efficiency

A technology for improving operating efficiency, applied in the field of big data analysis and processing. It addresses problems such as insufficient use of cached data, consumption of network IO, serial submission of tasks, and insufficient parallelism, with the effect of improving operating efficiency and increasing parallelism.

Status: Inactive
Publication Date: 2018-01-05
ZTE CORP

AI Technical Summary

Problems solved by technology

[0005] Although Spark SQL has a powerful optimizer and supports columnar storage, in-memory caching, and storage compression through cached tables, these optimizations all focus on the execution process of Spark SQL, that is, on what happens after a task has been submitted to Spark. If the application does not exploit the advantages of Spark SQL according to the actual situation and organize task submission reasonably, problems inevitably arise that offset the performance gains Spark SQL provides:
[0006] 1. The system has sufficient memory and CPU core resources, but they cannot be fully used, for example because tasks are submitted serially or parallelism is insufficient.
[0007] 2. Tables are cached and released from the cache at the wrong times.
[0010] 4. Cached data is not fully used, so the same data is cached repeatedly.
Data is moved from one node to another for calculation, which consumes both network IO and disk IO and reduces the efficiency of the entire calculation.



Examples


Embodiment Construction

[0056] The preferred embodiments of the present invention will be described in detail below in conjunction with the accompanying drawings. It should be understood that the preferred embodiments described below are only used to illustrate and explain the present invention, and are not intended to limit the present invention.

[0057] Figure 1 is a block diagram of a method for improving Spark performance provided by an embodiment of the present invention. As shown in Figure 1, the steps include:

[0058] Step S101: Determine the tables that need to be cached in the system.

[0059] The tables that need to be cached are determined according to the out-degree of each table, the number of records cached for the table, and the difference in ready times among the multiple cache tasks of the table; and / or tables of a custom cache type are determined to be tables that need to be cached.
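The selection in step S101 can be illustrated with a minimal sketch. The source names the three criteria (out-degree, cached record count, ready time difference) and the custom cache type, but does not say how they are combined or what limits apply. The `TableInfo` class, the three threshold constants, and the rule that all three criteria must pass are assumptions made only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TableInfo:
    """Hypothetical description of one table in the task graph."""
    name: str
    out_degree: int                 # number of downstream tasks that read the table
    record_count: int               # number of records caching the table would hold
    reader_ready_times: List[float] = field(default_factory=list)  # seconds at which each reading task becomes ready
    custom_cache: bool = False      # table explicitly marked as a custom cache type


# Hypothetical thresholds; the source gives no concrete values.
MIN_OUT_DEGREE = 2
MAX_RECORDS = 10_000_000
MAX_READY_GAP_SECONDS = 600.0


def ready_time_gap(table: TableInfo) -> float:
    """Spread between the earliest and latest ready time of the table's readers."""
    times = table.reader_ready_times
    return max(times) - min(times) if len(times) > 1 else 0.0


def should_cache(table: TableInfo) -> bool:
    """Assumed rule: custom cache tables always qualify; others must pass all three checks."""
    if table.custom_cache:
        return True
    return (
        table.out_degree >= MIN_OUT_DEGREE
        and table.record_count <= MAX_RECORDS
        and ready_time_gap(table) <= MAX_READY_GAP_SECONDS
    )


def tables_to_cache(tables: List[TableInfo]) -> List[str]:
    """Return the names of the tables selected for caching, in input order."""
    return [t.name for t in tables if should_cache(t)]
```

Under these assumed thresholds, a table read by several tasks that become ready close together is selected, while a table with a single reader, a very large table, or a table whose readers are ready far apart is not.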

[0060] Step S102: Identify the cache task that takes the determined table that needs to be cached (refe...


Abstract

The invention discloses a method and a device for improving Spark operation efficiency, and relates to the field of big data analysis and processing. The method comprises the following steps: determining the tables that need to be cached in a system; identifying the cache tasks that take a determined table as input or output; grouping the identified cache tasks and creating a processing process for each cache task group; and, according to the current state of each processing process and the real-time usage of Spark cluster resources, combining the cache tasks to be submitted and sending the combined cache tasks to a Spark cluster for processing. With the embodiment of the invention, the system resources of the cluster can be fully utilized, the tables and contents to be cached are reasonably determined, the processing processes are dynamically decided and scheduled, parallelism is increased as far as resources permit, and the Spark operation efficiency is thereby improved.
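The remaining steps of the abstract (identify cache tasks, group them, combine what to submit) can be sketched as follows. This is an illustration under stated assumptions, not the patented procedure: the `Task` class, grouping by the first cached table a task touches, the per-task `cores` demand, and the greedy fit against free cores are all hypothetical, and the sketch ignores dependencies between tasks and the actual submission to Spark.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set


@dataclass(frozen=True)
class Task:
    """Hypothetical task description: table names read and written, plus a core demand."""
    name: str
    inputs: FrozenSet[str]
    outputs: FrozenSet[str]
    cores: int  # assumed per-task CPU core requirement


def identify_cache_tasks(tasks: List[Task], cached_tables: Set[str]) -> List[Task]:
    """Keep tasks that take a table selected for caching as input or output."""
    return [t for t in tasks if (t.inputs | t.outputs) & cached_tables]


def group_by_cached_table(cache_tasks: List[Task], cached_tables: Set[str]) -> Dict[str, List[Task]]:
    """Assumed grouping rule: a task joins the group of the alphabetically first cached table it touches."""
    groups: Dict[str, List[Task]] = defaultdict(list)
    for task in cache_tasks:
        touched = sorted((task.inputs | task.outputs) & cached_tables)
        groups[touched[0]].append(task)
    return dict(groups)


def combine_for_submission(groups: Dict[str, List[Task]], busy_groups: Set[str], free_cores: int) -> List[Task]:
    """Greedily pick tasks from idle groups while they still fit in the free cores."""
    batch: List[Task] = []
    for key in sorted(groups):
        if key in busy_groups:       # this group's processing process is still busy
            continue
        for task in groups[key]:
            if task.cores <= free_cores:
                batch.append(task)
                free_cores -= task.cores
    return batch
```

In this sketch, `busy_groups` stands in for "the current state of each processing process" and `free_cores` for "the real-time usage of Spark cluster resources"; a real scheduler would also need to track memory and task readiness.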

Description

Technical field

[0001] The invention relates to the field of big data analysis and processing, in particular to a method and a device for improving Spark operation performance.

Background

[0002] With the development of informatization, the data that enterprises must process has grown explosively, reaching the terabyte (TB) and petabyte (PB) level. To support the analysis and processing of data at this scale, various big data frameworks, tools, and technologies have emerged, and Spark is one of them.

[0003] Spark is a big data processing framework built around speed, ease of use, and complex analysis. It raises the MapReduce model to a higher level by adopting a lower-cost shuffle in the data processing process, and it uses in-memory data storage and near real-time processing, making its performance many times faster than other big data processin...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F9/50
Inventor: 肖丽华王跃刘晏
Owner: ZTE CORP