Method for realizing task data decoupling in spark operation scheduling system

A job scheduling and data decoupling technology, applied to multi-programming devices and the like. It addresses the absence of a task-scheduling implementation in Spark, and achieves improved coordination and maintainability as well as enhanced collaborative development capability.

Active Publication Date: 2015-02-18
北京赛特斯信息科技股份有限公司

Problems solved by technology

Spark provides implementations of both job scheduling and Action scheduling, but it lacks an implementation of task scheduling.

Examples


Specific Embodiment

[0073] In practical applications, as shown in Figure 2, a specific embodiment of the present invention proceeds as follows:

[0074] 1: First create a global context object, which saves the context information and global attribute information of the Spark runtime state. This attribute information can be specified by the developer.

[0075] 2: Read the configuration information of each task.

[0076] 3: Based on the configuration information of these tasks, a directed acyclic graph is constructed; the dependencies between tasks can be analyzed through this graph.

[0077] 4: Create a global state object instance, which saves the RDD information of the global scope and the iteration state object of each iteration cycle. In this way, all state objects can be traversed through this instance to obtain the necessary RDD information.

[0078] 5: Start an iteration cycle, and execute the tasks in this cycle in sequence according to...
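The five steps above can be sketched in plain Python. This is a minimal, hypothetical illustration under stated assumptions: class names (`GlobalContext`, `GlobalState`), the task-configuration dictionary shape, and the use of plain Python lists as stand-ins for RDDs are all illustrative inventions, not the patent's actual API. The sketch builds the dependency graph from the task configurations, derives an execution order by topological sort, and runs each iteration cycle in that order.

```python
from collections import deque

class GlobalContext:
    """Step 1: holds Spark-runtime context and developer-specified attributes (hypothetical)."""
    def __init__(self, properties):
        self.properties = dict(properties)

class GlobalState:
    """Step 4: holds global-scope RDD info plus one state record per iteration cycle."""
    def __init__(self):
        self.rdd_info = {}          # task name -> produced "RDD" (a plain list here)
        self.iteration_states = []  # one dict of task results per iteration cycle

def build_dag(task_configs):
    """Step 3: check the dependency graph is acyclic and return a topological order."""
    deps = {t["name"]: set(t.get("depends_on", [])) for t in task_configs}
    order, ready = [], deque(name for name, d in deps.items() if not d)
    while ready:
        node = ready.popleft()
        order.append(node)
        for name, d in deps.items():
            d.discard(node)
            if not d and name not in order and name not in ready:
                ready.append(name)
    if len(order) != len(deps):
        raise ValueError("cycle detected in task dependencies")
    return order

def run(task_configs, properties, cycles=1):
    ctx = GlobalContext(properties)      # step 1: global context object
    order = build_dag(task_configs)      # steps 2-3: read configs, build the DAG
    state = GlobalState()                # step 4: global state object instance
    tasks = {t["name"]: t for t in task_configs}
    for _ in range(cycles):              # step 5: iteration cycle
        iteration_state = {}
        for name in order:               # execute tasks in dependency order
            inputs = [state.rdd_info[d] for d in tasks[name].get("depends_on", [])]
            result = tasks[name]["fn"](ctx, inputs)
            state.rdd_info[name] = result
            iteration_state[name] = result
        state.iteration_states.append(iteration_state)
    return state
```

A task here is just a name, an optional `depends_on` list, and a callable taking the context and its upstream results; the topological sort guarantees each task sees its dependencies' outputs.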


Abstract

The invention relates to a method for realizing task data decoupling in a Spark job scheduling system. The method comprises the following steps: within one iteration cycle, the system reads the iteration RDD (Resilient Distributed Dataset) information of the iteration state object through a task context object instance and stores the iteration RDD information into a task context object; the system finds the corresponding RDD information in the task context object through a Spark task object instance and stores it into a task result object; the system analyzes the RDD information in the task result object through a task state object instance and stores the corresponding RDD information into the corresponding state objects. With this method, RDDs can be transmitted among tasks, or between an earlier and a later cycle of the same task, so that each task can be compiled in a modular manner and the method achieves a wider application range.
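The three-stage handoff the abstract describes (iteration state → task context → task result → state objects) can be sketched as a minimal Python illustration. All names here are hypothetical, and ordinary dictionaries and lists stand in for the patent's RDD information; the point is only to show how the intermediate objects keep tasks decoupled from one another's state.

```python
class TaskContext:
    """Per-cycle container for RDD info read from the iteration state object."""
    def __init__(self):
        self.rdds = {}

class TaskResult:
    """Holds only the RDD info a task actually looked up from the context."""
    def __init__(self):
        self.rdds = {}

class StateObject:
    """Receives its share of the result RDD info at the end of the cycle."""
    def __init__(self):
        self.rdd = None

def iteration_cycle(iteration_state, task_keys, state_objects):
    # Stage 1: read iteration RDD info into a task context object.
    context = TaskContext()
    context.rdds.update(iteration_state)
    # Stage 2: the task instance finds the RDD info it needs in the context
    # and stores it in a task result object.
    result = TaskResult()
    for key in task_keys:
        result.rdds[key] = context.rdds[key]
    # Stage 3: the task state object distributes the result RDD info
    # into the corresponding state objects.
    for key, rdd in result.rdds.items():
        state_objects[key].rdd = rdd
    return result
```

Because tasks exchange data only through these intermediate objects rather than referencing each other directly, each task can be developed and compiled as an independent module, which is the decoupling the abstract claims.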

Description

Technical field

[0001] The invention relates to the field of distributed big-data processing, in particular to Spark job scheduling design, and specifically to a method for realizing task data decoupling in a Spark job scheduling system.

Background technique

[0002] Spark is an open-source cluster computing system based on in-memory computing, whose purpose is to make data analysis faster. Spark is a general parallel computing framework similar to MapReduce (a programming model), but unlike MapReduce, its intermediate results can be stored in memory, bringing higher efficiency and better interactivity (lower latency). In addition, Spark provides a wider range of dataset operation types, supporting multiple paradigms such as in-memory computing, multi-iteration batch processing, ad hoc query, stream processing, and graph computing.

[0003] Spark also introduces an abstraction called Resilient Distributed Datasets (RDD). An RDD is a collection of read-only objects distri...

Claims


Application Information

Patent Type & Authority: Application (China)
IPC(8): G06F9/46
Inventors: 逯利军钱培专汪金忠余聪林强李克民李拯
Owner: 北京赛特斯信息科技股份有限公司