Process decomposition method and device for running ETL on a Hadoop cluster

A Hadoop cluster and process technology, applied in the field of process decomposition for running ETL on a Hadoop cluster, which addresses problems such as the inability to execute complex DAG graph operations and achieves the effect of convenient and flexible programming

Pending Publication Date: 2021-04-16
WUHAN DAMENG DATABASE

AI Technical Summary

Problems solved by technology

[0004] The process configured by ETL is generally a directed acyclic graph (DAG), but the two simple map and reduce functions provided by Hadoop can only execute simple DAG graphs and cannot handle even slightly more complex ones.
Moreover, although Hadoop also provides chained map reduce through ChainMapper and ChainReducer, these have certain shortcomings in the partitioning of data sources, the reading of partitioned data sources, the integrity of ETL process execution, and in convenience, flexibility, and general applicability.
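For context, Hadoop's chained map reduce only strings several mappers and a single reducer into a straight-line pipeline, which is one reason a branching ETL DAG is hard to express with it. Below is a minimal, hedged sketch of such a chained job; the job name and the CleanMapper / TransformMapper / CountReducer classes are illustrative assumptions, not part of this disclosure, and only the ChainMapper / ChainReducer calls reflect the standard Hadoop API.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;

public class ChainedEtlJobSketch {

    // Stand-in "cleaning" step: trims each input line.
    public static class CleanMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(new Text("row"), new Text(value.toString().trim()));
        }
    }

    // Stand-in "transform" step: upper-cases the cleaned value.
    public static class TransformMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text key, Text value, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(key, new Text(value.toString().toUpperCase()));
        }
    }

    // Stand-in "aggregate" step: counts rows per key.
    public static class CountReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            long n = 0;
            for (Text ignored : values) n++;
            ctx.write(key, new Text(Long.toString(n)));
        }
    }

    public static Job buildJob(Configuration conf) throws IOException {
        Job job = Job.getInstance(conf, "chained-etl-sketch");
        job.setJarByClass(ChainedEtlJobSketch.class);

        // Mappers run strictly one after another inside the map task, so the
        // pipeline is a straight line rather than a general DAG.
        ChainMapper.addMapper(job, CleanMapper.class,
                LongWritable.class, Text.class, Text.class, Text.class,
                new Configuration(false));
        ChainMapper.addMapper(job, TransformMapper.class,
                Text.class, Text.class, Text.class, Text.class,
                new Configuration(false));

        // A chained job ends in a single reducer; several reduce-style ETL
        // components therefore cannot be expressed within one such job.
        ChainReducer.setReducer(job, CountReducer.class,
                Text.class, Text.class, Text.class, Text.class,
                new Configuration(false));
        return job;
    }
}
```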



Examples


Embodiment 1

[0053] In view of the performance constraints of traditional ETL data processing and the limitations of simple Hadoop usage, an embodiment of the present invention provides a process decomposition method for running ETL on a Hadoop cluster, which decomposes the ETL process so that the decomposed process can be submitted to the map reduce framework of the Hadoop cluster environment for execution.

[0054] Here, process decomposition refers to decomposing the ETL process into one or more MRWorks (i.e., Map Reduce Works), where the components of each MRWork run within one map reduce job. The components of an MRWork usually run partly in the mapper and partly in the reducer; all of the mapper task sub-processes together with all of the reducer task sub-processes constitute an MRWork.
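As a visual aid only (the class and field names below are assumptions, not the patent's actual identifiers), an MRWork can be pictured as a container recording which component nodes run on the mapper side and which run on the reducer side of one map reduce job:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: one MRWork groups the ETL component nodes that will
// execute inside a single map reduce job.
class MRWork {
    // Names of component nodes executed in the mapper task sub-processes.
    final List<String> mapperSide = new ArrayList<>();
    // Names of component nodes executed in the reducer task sub-processes.
    final List<String> reducerSide = new ArrayList<>();
}
```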

[0055] As shown in Figure 1, the process decomposition method for running ETL on a Hadoop cluster provided by the embodiment of the present invention mainly includes the following steps:

[0056] Step 10, constructing a ...
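The individual steps are truncated in this excerpt. As a rough, non-authoritative sketch of the flow summarized in the Abstract (reusing the illustrative MRWork class above; Node and every method name here are assumptions, and only the simplest case of a linear chain with at most one reduce node per data source is handled):

```java
import java.util.ArrayList;
import java.util.List;

// Rough sketch of the decomposition flow summarized in the Abstract.
class EtlDecomposerSketch {

    // Minimal DAG node: an ETL component with its downstream edges.
    static class Node {
        final String name;
        final boolean isReduce; // true for sort/join/aggregate style components
        final List<Node> downstream = new ArrayList<>();
        Node(String name, boolean isReduce) { this.name = name; this.isReduce = isReduce; }
    }

    // One MRWork per data-source chain: if a reduce node is reachable, the nodes
    // before it form the mapper side and the reduce node plus its successors form
    // the reducer side; otherwise the whole chain becomes a map-only MRWork.
    static List<MRWork> decompose(List<Node> dataSources) {
        List<MRWork> works = new ArrayList<>();
        for (Node source : dataSources) {
            MRWork work = new MRWork();
            Node reduceNode = firstReduceDownstream(source);
            if (reduceNode == null) {
                collect(source, null, work.mapperSide);       // map-only MRWork
            } else {
                collect(source, reduceNode, work.mapperSide); // source up to the reduce node
                collect(reduceNode, null, work.reducerSide);  // reduce node and its successors
            }
            works.add(work);
        }
        return works;
    }

    // Depth-first search for the first reduce node downstream of n, if any.
    private static Node firstReduceDownstream(Node n) {
        for (Node d : n.downstream) {
            if (d.isReduce) return d;
            Node r = firstReduceDownstream(d);
            if (r != null) return r;
        }
        return null;
    }

    // Collect node names from n downward, stopping before stopAt (if given).
    private static void collect(Node n, Node stopAt, List<String> out) {
        if (n == stopAt) return;
        out.add(n.name);
        for (Node d : n.downstream) collect(d, stopAt, out);
    }
}
```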

Embodiment 2

[0086] Building on the process decomposition method for running ETL on a Hadoop cluster provided in Embodiment 1 above, this embodiment of the present invention gives several specific examples to illustrate how the method of Embodiment 1 is used to decompose the process in different application scenarios.

[0087] With reference to Figure 9, in the first specific example the ETL process contains no reduce node and has one data source, and the corresponding directed acyclic graph (DAG) is shown on the left of Figure 9 (corresponding to Figure 2). Since no reduce node can be found in the DAG, the data source node and all of its downstream component nodes are directly formed into a mapper sub-process, and the ETL process is decomposed into a single MRWork. In a preferred embodiment, the data source can be fragmented to obtain n fragmented table data sources, namely split1, split2, ..., splitn, corresponding to n mapper sub-processes...
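In this map-only case, parallelism comes from fragmenting the table data source. A minimal illustration of the split-to-mapper fan-out is sketched below; the fragmentation by row ranges, the Split record, and the example figures are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration of the preferred embodiment above: the single data source is
// fragmented into n splits (split1 ... splitn), and each split drives one mapper
// sub-process running the same downstream component chain.
class SplitFanOutSketch {

    record Split(String name, long startRow, long rowCount) {}

    // Fragment a table of totalRows rows into n roughly equal row-range splits.
    static List<Split> fragment(long totalRows, int n) {
        List<Split> splits = new ArrayList<>();
        long per = (totalRows + n - 1) / n; // ceiling division
        for (int i = 0; i < n; i++) {
            long start = (long) i * per;
            long count = Math.min(per, Math.max(0, totalRows - start));
            splits.add(new Split("split" + (i + 1), start, count));
        }
        return splits;
    }

    public static void main(String[] args) {
        // e.g. 1,000,000 source rows spread over 4 mapper sub-processes
        for (Split s : fragment(1_000_000, 4)) {
            System.out.println(s.name() + " -> one mapper sub-process, "
                    + s.rowCount() + " rows starting at row " + s.startRow());
        }
    }
}
```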

Embodiment 3

[0096] It can be seen from the process decomposition method for running ETL on a Hadoop cluster provided in Embodiment 1 above that if the ETL process contains X reduce nodes, at least X map reduce jobs are required; that is, the ETL process is decomposed into at least X MRWorks.

[0097] Taking Figure 6 and Figure 12 as an example, the ETL process includes three reduce nodes: sorting component 1, a connection component, and sorting component 2. This complex ETL process can therefore be divided into three MRWorks, denoted MRWork1, MRWork2, and MRWork3, and the component nodes owned by each MRWork run in one map reduce job. The MRWork data structure is roughly as follows:

[0098]

[0099]

[0100] In the above data structure, ActivityBean is the holder of each component's properties.

[0101] multiMapSourceActivityBeansMap is a defined map variable, where the key is the data source ActivityBean, and the value is the data source and all component nodes downstr...
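Paragraph [0101] is truncated here, so the exact value type is not visible; a hedged sketch of the shape it appears to describe (ActivityBean reduced to a simple property holder, and the reducer-side field added as an assumption, since it is not named in this excerpt) could be:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the data structure described in [0097]-[0101]; field and value types
// are inferred from the (truncated) description and are assumptions.
class MRWorkStructureSketch {

    // Minimal stand-in for the component property holder mentioned in [0100].
    static class ActivityBean {
        final String componentName;
        final Map<String, String> properties = new LinkedHashMap<>();
        ActivityBean(String componentName) { this.componentName = componentName; }
    }

    // Key: the data-source ActivityBean of one mapper sub-process.
    // Value: that data source together with the component nodes downstream of it
    // that run in the same mapper (inferred from the truncated text of [0101]).
    final Map<ActivityBean, List<ActivityBean>> multiMapSourceActivityBeansMap =
            new LinkedHashMap<>();

    // Reduce-side components of this MRWork, e.g. sorting component 1, the
    // connection component, or sorting component 2 from the example above
    // (assumed field, not named in this excerpt).
    final List<ActivityBean> reducerActivityBeans = new ArrayList<>();
}
```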



Abstract

The invention provides a process decomposition method and device for running ETL on a Hadoop cluster. The method comprises the steps of: constructing a directed acyclic graph based on all component nodes of the ETL process, and searching for reduce nodes in the directed acyclic graph; if a reduce node is found, forming a key-value pair of the reduce node and its data source according to whether the reduce node has a direct data source; constructing a mapper and a reducer corresponding to each reduce node according to the key-value pair, and thereby constructing an MRWork; and if no reduce node can be found, or a data source node cannot be associated with any reduce node, constructing a mapper based on the data source node and its downstream nodes, and thereby constructing the MRWork. With this method, the ETL process is submitted to the map reduce framework of the Hadoop cluster environment for concurrent execution, the integrity of the process is preserved, and the programming implementation is more convenient and flexible.

Description

【Technical field】

[0001] The invention relates to the technical field of data processing, and provides a process decomposition method and device for running ETL on a Hadoop cluster.

【Background art】

[0002] ETL is an important tool software for data processing and for building a data warehouse, and carries out the process of extracting, cleaning, transforming, and loading heterogeneous data sources. Traditional ETL generally publishes the process to run on a centralized ETL server node, where all processes, and the components within each process, run using a multi-threaded mechanism. No matter how many processes there are, they can only run on a single node, so the performance of a big data processing flow cannot be improved.

[0003] The Hadoop big data platform has been widely used in big data processing. MapReduce is a computing model, framework, and platform for parallel processing of big data. It provides a simple method of parallel programming. It implements basic parallel compu...

Claims


Application Information

Patent Type & Authority: Application (China)
IPC(8): G06F16/25, G06F16/27, G06F16/2458, G06F16/182
Inventors: 高东升, 梅纲
Owner: WUHAN DAMENG DATABASE