
Method for memory estimation and configuration optimization in distributed data processing system

A technique for distributed data processing and configuration optimization, applicable to electric digital data processing, resource allocation, and multiprogramming arrangements. It addresses problems such as the limited applicability of existing memory estimation, exploiting the small number of operator types, and achieves strong generality.

Active Publication Date: 2018-08-17
HUAZHONG UNIV OF SCI & TECH
Cites: 4 · Cited by: 17

AI Technical Summary

Problems solved by technology

[0009] The present invention overcomes a defect of existing memory estimation, namely its limitation to specific applications. Because the invention adopts a data feature collection strategy, and because the number of Spark operators is small, the processing performed by each operator on in-memory data forms a data change flow; each change can be regarded as a dynamic feature of the data.
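The paragraph above describes recording each operator's effect on in-memory data as a reusable "dynamic feature". A minimal sketch of such a feature library, assuming it keys observed size-change ratios by dataset and operator (all names here are illustrative, not from the patent):

```python
# Hypothetical data feature library: each historical run records, per
# (dataset, operator) pair, how the operator changed the in-memory data
# size. Later applications on the same data can look these ratios up.

class DataFeatureLibrary:
    def __init__(self):
        # (dataset_id, operator_name) -> list of observed size-change ratios
        self.features = {}

    def record(self, dataset_id, operator_name, size_in, size_out):
        """Store one observed change (a 'dynamic feature') from a run."""
        key = (dataset_id, operator_name)
        self.features.setdefault(key, []).append(size_out / size_in)

    def change_ratio(self, dataset_id, operator_name):
        """Average historical ratio, or None if there is no history yet."""
        ratios = self.features.get((dataset_id, operator_name))
        if not ratios:
            return None
        return sum(ratios) / len(ratios)

lib = DataFeatureLibrary()
# Two historical applications applied a "map" operator to the same dataset:
lib.record("logs-2018", "map", 100.0, 150.0)
lib.record("logs-2018", "map", 200.0, 290.0)
print(lib.change_ratio("logs-2018", "map"))  # average of 1.5 and 1.45
```

As the text notes, the more historical applications run over the same data, the more such ratios accumulate, and the more predictable a new application's data changes become.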



Examples


Embodiment 1

[0044] This embodiment provides a method for memory estimation and configuration optimization in a distributed data processing system. The method at least includes:

[0045] S1: Match the data program flow, obtained by analyzing the conditional branches and/or loop bodies of the program code in the application jar package, against the data feature library, and estimate the memory upper limit of at least one stage based on a successful match;

[0046] S2: Optimize the configuration parameters of the application program based on the estimated memory upper limit;

[0047] S3: Collect the static and/or dynamic features of the program data while the optimized application runs, and record them persistently.

Preferred embodiment

[0048] According to a preferred embodiment, the method further includes:

[0049] S4: Estimate the memory upper limit of at least one stage again based on the fed-back static and/or dynamic features of the program data, and optimize the configuration parameters of the application program accordingly.
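Steps S1 through S4 form a feedback loop: estimate per-stage memory from historical feature ratios, derive configuration from the estimate, then refine with observed features. A minimal sketch under assumed names (the estimation and sizing policies here are illustrative, not the patent's exact formulas):

```python
# Sketch of the S1-S4 loop: walk a stage's operators applying historical
# size-change ratios to get a memory upper limit (S1), then size the
# configuration from that limit (S2). S3/S4 would feed observed ratios
# back into `feature_ratios` for the next estimate.

def estimate_stage_memory(stage_ops, input_size, feature_ratios):
    """S1: apply each operator's historical change ratio; track the peak."""
    size, peak = input_size, input_size
    for op in stage_ops:
        size *= feature_ratios.get(op, 1.0)  # unknown operator: assume no change
        peak = max(peak, size)
    return peak

def optimize_config(stage_peaks, safety_factor=1.25):
    """S2: size executor memory from the largest per-stage upper limit."""
    return {"executor_memory_mb": int(max(stage_peaks) * safety_factor)}

# One stage: "map" grows data 1.5x, "filter" shrinks it to 40%.
ratios = {"map": 1.5, "filter": 0.4}
peak = estimate_stage_memory(["map", "filter"], 1000.0, ratios)
config = optimize_config([peak])
print(peak, config)  # 1500.0 {'executor_memory_mb': 1875}
```

The per-stage peak (rather than only the final size) matters because intermediate data may be larger than both input and output.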

[0050] The present invention overcomes a defect of existing memory estimation, namely its limitation to specific applications. Because the invention adopts a data feature collection strategy, and because the number of Spark operators is small, the processing performed by each operator on in-memory data forms a data change flow; each change can be regarded as a dynamic feature of the data. These dynamic features can be shared with the next newly submitted application, so that the data changes in the new application become predictable. Moreover, the more historical applications are submitted on the same data, the more dynamic features of the da...

Embodiment 2

[0083] This embodiment is a further improvement on Embodiment 1; content already described is not repeated here.

[0084] The present invention also provides a memory estimation and configuration optimization system for a distributed data processing system, as shown in figure 2.

[0085] The memory estimation and configuration optimization system of the present invention at least includes a memory estimation module 10, a configuration optimization module 20, and a data feature collection module 30.

[0086] The memory estimation module 10 matches the data program flow, obtained by analyzing the conditional branches and/or loop bodies of the program code in the application jar package, against the data feature library stored in the data feature recording module, and estimates the memory upper limit of at least one stage based on a successful match. Preferably, the memory estimation module 10 includes one or more of an ASIC, a CPU, a microprocessor, a server, and a ...
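The matching step performed by the memory estimation module can be sketched as a lookup of the extracted operator sequence in the feature library. The exact-sequence matching policy below is an assumption for clarity; the patent does not specify the matching algorithm:

```python
# Illustrative matching: the operator sequence extracted from the
# program's conditional branches / loop bodies is compared against flows
# recorded in the feature library; a match yields the historical peak
# memory factor relative to input size.

def match_flow(program_ops, feature_library):
    """Return the recorded peak factor for an identical operator sequence,
    or None if no historical flow matches (caller falls back to defaults)."""
    return feature_library.get(tuple(program_ops))

library = {
    ("map", "reduceByKey"): 1.6,   # historical peak was 1.6x input size
    ("flatMap", "filter"): 2.1,
}
factor = match_flow(["map", "reduceByKey"], library)
input_mb = 2048
if factor is not None:
    print(int(input_mb * factor))  # estimated stage memory upper limit in MB
```

On a failed match the module cannot estimate from history, which is why S3's feature collection and S4's re-estimation close the loop for future runs.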



Abstract

The invention relates to a method for memory estimation and configuration optimization in a distributed data processing system. The method at least comprises the following steps: a data program flow, obtained by analyzing the conditional branches and/or loop bodies of the program code in an application jar package, is matched against a data feature library; based on a successful match, the memory upper limit of at least one stage is estimated; based on the memory upper limit, the configuration parameters of the application program are optimized; and during the running of the optimized application program, the static and/or dynamic features of the program data are collected and persistently recorded. Unlike black-box models that estimate memory by machine learning, whose predictions are not necessarily accurate and which make fine-grained per-stage prediction difficult, this method uses program analysis and existing data features to accurately estimate the overall memory footprint, estimates the memory usage of a job at each stage from the program analysis, and performs further fine-grained configuration optimization.

Description

Technical Field [0001] The invention relates to the technical field of distributed data processing systems, and in particular to a method and system for memory estimation and configuration optimization in a distributed data processing system. Background Art [0002] With the development of the Internet and the mobile Internet, massive data has driven the increasingly wide and rapid adoption of distributed data processing systems for big data processing. Hadoop, a representative distributed processing system, uses the MapReduce model, which supports the analysis and processing of massive data that a single machine cannot complete. However, Hadoop suffers an I/O performance bottleneck because of frequent disk reads and writes. To address these shortcomings, a new generation of memory-based distributed data processing systems such as Spark and Flink emerged and developed rapidly. Spark takes the RDD as its basic data unit, ...

Claims


Application Information

Patent Type & Authority: Application (China)
IPC (8): G06F9/50; G06F9/445
CPC: G06F9/44505; G06F9/44557; G06F9/5016; G06F2209/508; G06F11/3006; G06F11/3051; G06F11/3433; G06F11/3466; G06F11/3612; G06F8/443; G06F8/452; G06F11/3604; G06F2201/865
Inventors: 石宣化, 金海, 柯志祥, 吴文超
Owner HUAZHONG UNIV OF SCI & TECH