A method for decoupling task data in the Spark job scheduling system

A technology for job scheduling and data decoupling, applied to multi-program devices and related fields. It solves problems such as the lack of a task-scheduling implementation, and achieves the effects of improved synergy and maintainability and simplified dependency configuration.

Active Publication Date: 2017-10-31
北京赛特斯信息科技股份有限公司

AI Technical Summary

Problems solved by technology

Spark provides implementations for both job scheduling and Action scheduling, but lacks an implementation of task scheduling.

Method used


Examples


Specific Embodiment

[0073] In practical application, as shown in Figure 2, a specific embodiment of the present invention has the following concrete flow:

[0074] 1: First create a global context object, which saves the context information of the Spark runtime state and global attribute information. This attribute information can be specified by the developer.

[0075] 2: Read the configuration information of each task.

[0076] 3: A directed acyclic graph is constructed from this task configuration information; the dependencies between tasks can then be analyzed through the directed acyclic graph.

[0077] 4: Create a global state object instance, which saves the RDD information of the global scope and the iteration state object of each iteration cycle. Through this object instance, all state objects can be traversed to obtain the necessary RDD information.

[0078] 5: Start an iteration cycle, and execute the tasks in this cycle sequentially according to the infor...
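The five steps above can be sketched in code. This is a minimal, hypothetical illustration, not the patent's implementation: all names (`GlobalContext`, `build_dag`, the sample task names) are invented for the example, plain dicts stand in for the task configuration files, and the dependency analysis over the directed acyclic graph is shown as a standard Kahn topological sort.

```python
from collections import defaultdict, deque

class GlobalContext:
    """Step 1: holds Spark runtime state and developer-specified attributes."""
    def __init__(self, attributes):
        self.attributes = dict(attributes)

def build_dag(task_configs):
    """Step 3: build a dependency DAG from each task's configuration."""
    edges = defaultdict(list)                      # dependency -> dependents
    indegree = {name: 0 for name in task_configs}  # unmet dependency counts
    for name, cfg in task_configs.items():
        for dep in cfg.get("depends_on", []):
            edges[dep].append(name)
            indegree[name] += 1
    return edges, indegree

def run_iteration(task_configs, edges, indegree):
    """Step 5: execute one iteration cycle's tasks in dependency order
    (Kahn's topological sort over the DAG from step 3)."""
    counts = dict(indegree)
    ready = deque(t for t, d in counts.items() if d == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)            # a real scheduler would run the task here
        for nxt in edges[task]:
            counts[nxt] -= 1
            if counts[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(task_configs):
        raise ValueError("cycle detected: task graph is not a DAG")
    return order

# Step 2: read (here: inline) the configuration of each task.
configs = {
    "load":      {"depends_on": []},
    "transform": {"depends_on": ["load"]},
    "join":      {"depends_on": ["load"]},
    "report":    {"depends_on": ["transform", "join"]},
}
ctx = GlobalContext({"app_name": "demo"})  # step 1
edges, indegree = build_dag(configs)       # step 3
print(run_iteration(configs, edges, indegree))
# ['load', 'transform', 'join', 'report']
```

Step 4 (the global state object holding per-iteration RDD information) is omitted here for brevity; the key point is that execution order is derived entirely from the configuration-driven DAG, so tasks never need to reference one another directly.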



Abstract

The invention relates to a method for realizing task data decoupling in a Spark job scheduling system. The method comprises the following steps: within one iteration cycle, the system reads the iteration RDD (Resilient Distributed Dataset) information of the iteration state object through a task context object instance, and stores that iteration RDD information into the task context object; the system finds the corresponding RDD information in the task context object through a Spark task object instance, and stores it into a task result object; the system then analyzes the RDD information in the task result object through the task state object instance, and stores the corresponding RDD information into the corresponding state objects. With this method, RDDs can be passed between tasks, or between an earlier and a later cycle of the same task, so that each task can be composed in a modular fashion and a wider range of applications can be supported.
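The hand-off described in the abstract can be sketched as follows. This is an illustrative sketch only: the class names mirror the abstract's terminology but are invented for the example, and plain Python lists stand in for Spark RDD handles. The point being illustrated is that a task reads inputs from the shared context object and writes outputs to a result object, so tasks never hold references to each other.

```python
class IterationState:
    """Holds the RDD information produced in one iteration cycle."""
    def __init__(self, rdd_info):
        self.rdd_info = dict(rdd_info)

class TaskContext:
    """Tasks read their inputs from here instead of from other tasks."""
    def __init__(self):
        self.rdds = {}
    def load_from(self, iteration_state):
        # step 1 of the abstract: iteration state -> task context
        self.rdds.update(iteration_state.rdd_info)

class TaskResult:
    """A task writes its output RDD info here, not into another task."""
    def __init__(self):
        self.rdds = {}

def run_task(ctx, result, in_key, out_key):
    # The task sees only the context and result objects, keeping it
    # decoupled from whichever task produced its input.
    rdd = ctx.rdds[in_key]
    result.rdds[out_key] = [x * 2 for x in rdd]  # stand-in transformation

# iteration state -> task context -> task result -> next state object
state = IterationState({"input_rdd": [1, 2, 3]})
ctx = TaskContext()
ctx.load_from(state)
result = TaskResult()
run_task(ctx, result, "input_rdd", "doubled_rdd")
next_state = IterationState(result.rdds)  # available to later tasks/cycles
print(next_state.rdd_info["doubled_rdd"])  # [2, 4, 6]
```

Because the output lands back in a state object, the same mechanism serves both cases the abstract names: passing an RDD from one task to another, and passing it from an earlier iteration cycle of a task to a later one.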

Description

Technical field

[0001] The invention relates to the field of distributed big-data processing, in particular to Spark job scheduling design, and specifically to a method for decoupling task data in a Spark job scheduling system.

Background technique

[0002] Spark is an open-source cluster computing system based on in-memory computing that aims to make data analysis faster. Spark is a general-purpose parallel computing framework like MapReduce (a programming model), but unlike MapReduce, intermediate results can be stored in memory, which brings higher efficiency and better interactivity (lower latency). In addition, Spark provides a wider range of dataset operations, supporting multiple paradigms such as in-memory computing, multi-pass iterative batch processing, ad hoc query, stream processing, and graph computing.

[0003] Spark also introduces an abstraction called Resilient Distributed Datasets (RDDs). An RDD is a read-only collection of objects distribu...
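The RDD abstraction in [0003] can be illustrated with a toy sketch. This is not Spark itself: `MiniRDD` is an invented name, and the sketch only shows two properties the background describes, namely that the collection is read-only (transformations return a new dataset rather than mutating the original) and that evaluation is deferred along a lineage of parent datasets until an action is called.

```python
class MiniRDD:
    """Toy stand-in for an RDD: immutable, lazily evaluated via lineage."""
    def __init__(self, data=None, parent=None, fn=None):
        self._data = data      # only the root dataset holds concrete data
        self._parent = parent  # lineage link, like RDD dependency tracking
        self._fn = fn

    def map(self, fn):
        # Transformation: returns a NEW dataset; the original never changes.
        return MiniRDD(parent=self, fn=fn)

    def collect(self):
        # Action: walk the lineage and apply the deferred transformations.
        if self._parent is None:
            return list(self._data)
        return [self._fn(x) for x in self._parent.collect()]

rdd = MiniRDD(data=[1, 2, 3])
doubled = rdd.map(lambda x: x * 2)
print(rdd.collect())      # [1, 2, 3]  -- original is unchanged
print(doubled.collect())  # [2, 4, 6]
```

Real RDDs add partitioning across the cluster and fault recovery by recomputing lost partitions from this same lineage; the sketch omits both.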

Claims


Application Information

Patent Type & Authority: Patent (China)
IPC(8): G06F9/46
Inventors: 逯利军, 钱培专, 汪金忠, 余聪, 林强, 李克民, 李拯
Owner: 北京赛特斯信息科技股份有限公司