Stage division method based on resilient distributed dataset (RDD), and terminal

A Stage division technique for resilient distributed datasets, applied in the field of big data, that addresses problems such as long job execution time

Active Publication Date: 2017-01-18
深圳华为云计算技术有限公司

AI Technical Summary

Problems solved by technology

The RDD-based Stage division method in the prior art makes the execution time of the entire job long.


Embodiment Construction

[0030] The following will clearly and completely describe the technical solutions in the embodiments of the present invention with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, not all, embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without creative efforts fall within the protection scope of the present invention.

[0031] The embodiments of the present invention can be applied to in-memory computing, for example to the scenario of dividing and executing RDD-based Stages in Spark. An RDD is a distributed dataset. The network architecture of the embodiments may include one computer device, or multiple computer devices connected

[0032] In the latter case, the computers run the same in-memory computing framework, and RDD data can be distributed and stored on multiple ...


Abstract

The embodiments of the invention provide a Stage division method based on a resilient distributed dataset (RDD), and a terminal. They relate to the field of big data and can shorten the execution time of a Spark job. In the method, the terminal first creates a directed acyclic graph (DAG) of the RDDs of an application. The terminal then divides the RDDs in the DAG into Stages: when any child RDD has both a wide dependency and a narrow dependency, the child RDD, the parent RDD on which it has the wide dependency, and the parent RDD on which it has the narrow dependency are placed in different Stages. The terminal then executes the tasks of the two parent RDDs in parallel, and afterwards executes the task of the child RDD. The method and the terminal are applied to in-memory computing to divide and execute RDD-based Stages.
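The division rule summarized in the abstract can be sketched in plain Python. This is only an illustration under stated assumptions, not the patent's implementation and not Spark's API: the lineage format, the function names `divide_stages` and `execution_waves`, and the RDD names are all hypothetical. The sketch also assumes that a child (sub-)RDD with only narrow dependencies is merged into the same Stage as its parents, as in Spark's usual pipelining, which the abstract does not spell out.

```python
from typing import Dict, List, Tuple

# Hypothetical lineage format: child RDD name -> [(parent RDD name, "narrow" | "wide")].
Deps = Dict[str, List[Tuple[str, str]]]


def divide_stages(rdds: List[str], deps: Deps) -> Dict[str, int]:
    """Return a stage id for every RDD under the assumed division rule."""
    stage_of = {name: i for i, name in enumerate(rdds)}  # one stage per RDD to start
    for child, parents in deps.items():
        kinds = {kind for _, kind in parents}
        if "wide" in kinds:
            # Wide-only or mixed dependencies: every edge into this child is a
            # stage boundary, so nothing is merged across these edges.
            continue
        # Narrow-only: merge each parent's stage into the child's stage.
        for parent, _ in parents:
            old, new = stage_of[parent], stage_of[child]
            for name in rdds:
                if stage_of[name] == old:
                    stage_of[name] = new
    return stage_of


def execution_waves(deps: Deps, stage_of: Dict[str, int]) -> List[List[int]]:
    """Group stages into waves; stages in the same wave can run in parallel."""
    stage_parents = {s: set() for s in set(stage_of.values())}
    for child, parents in deps.items():
        for parent, _ in parents:
            if stage_of[parent] != stage_of[child]:
                stage_parents[stage_of[child]].add(stage_of[parent])
    waves, done = [], set()
    while len(done) < len(stage_parents):
        ready = sorted(s for s, ps in stage_parents.items()
                       if s not in done and ps <= done)
        if not ready:
            raise ValueError("lineage is not a DAG")
        waves.append(ready)
        done.update(ready)
    return waves


# C has a wide dependency on A and a narrow dependency on B; D depends narrowly on C.
rdds = ["A", "B", "C", "D"]
deps = {"C": [("A", "wide"), ("B", "narrow")], "D": [("C", "narrow")]}

stages = divide_stages(rdds, deps)
waves = execution_waves(deps, stages)
print(stages)  # {'A': 0, 'B': 1, 'C': 3, 'D': 3}
print(waves)   # [[0, 1], [3]]
```

In this toy lineage, A and B land in separate Stages that form the first wave and can run in parallel, while C (pipelined with D) runs only after both have finished.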

Description

Technical field

[0001] The present invention relates to the field of big data, and in particular to a Stage division method based on a resilient distributed dataset, and a terminal.

Background

[0002] Spark is an in-memory computing framework. Its core data structure is the Resilient Distributed Dataset (RDD), a fault-tolerant, parallel data structure that is essentially a read-only collection of partitioned records. An RDD can contain multiple partitions, and each partition is a fragment of the dataset. RDDs are related by narrow dependencies (Narrow Dependency) and wide dependencies (Wide Dependency). If each partition of a parent RDD is used by at most one partition of a child RDD, the dependency is narrow. If a partition of a parent RDD can be relied upon by multiple partitions of a child RDD, the dependency is wide.

[0003] The Spark job execution model can be divided into three steps: the first step is to create a ...
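The narrow and wide dependency definitions in [0002] can be illustrated with a small sketch. The representation is an assumption made for illustration only: `used_by` is a hypothetical mapping from each parent partition index to the set of child partition indices that read it, and `classify_dependency` is not a Spark function.

```python
from typing import Dict, Set


def classify_dependency(used_by: Dict[int, Set[int]]) -> str:
    """Classify a parent-to-child RDD dependency as "narrow" or "wide".

    used_by maps each parent partition index to the set of child
    partition indices that read from it (hypothetical representation).
    """
    # Narrow: every parent partition is used by at most one child partition.
    if all(len(children) <= 1 for children in used_by.values()):
        return "narrow"
    # Wide: at least one parent partition is used by several child partitions.
    return "wide"


# Map-like transformation: parent partition i feeds only child partition i.
map_like = {0: {0}, 1: {1}, 2: {2}}
# Shuffle-like transformation: every parent partition feeds every child partition.
shuffle_like = {0: {0, 1}, 1: {0, 1}, 2: {0, 1}}

print(classify_dependency(map_like))      # narrow
print(classify_dependency(shuffle_like))  # wide
```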


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/182
Inventor: 彭磊党李飞崔鑫梁殿鹏
Owner: 深圳华为云计算技术有限公司