Method and device for running an ETL (Extract-Transform-Load) process union component on the Flink framework

A technique for running the union component of an ETL process on the Flink framework. It addresses the loss of data-union efficiency caused by the Flink union operator, and improves efficiency by avoiding unnecessary data serialization and deserialization.

Active Publication Date: 2022-04-22
WUHAN DAMENG DATABASE

AI Technical Summary

Problems solved by technology

A union performed with the Flink union operator can therefore reduce the efficiency of data union in certain scenarios.

Examples

Embodiment 1

[0046] Embodiment 1 of the present invention provides a method for running an ETL process union component on the Flink framework.

[0047] The method for running an ETL process union component on the Flink framework includes:

[0048] Traversing the directed acyclic graph (DAG) of the ETL process and identifying one or more nodes with the Splitting attribute, where the Splitting-attribute nodes include the data source node, nodes with the FLINK_MESSAGE_SHARED_NODE attribute, and one or more nodes that need to be converted into a Flink operator;

[0049] Following the node order of the ETL process DAG and starting from the data source node, generating, for each pair of adjacent Splitting-attribute nodes, an ETL process subset made up of the one or more ETL nodes between the two adjacent Splitting-attribute nodes and the connecting lines between those nodes, the subset being used inside a Flink operator; and constructing the corresponding chain of Flink API operators between the two adjacent Splitting-attribute nodes;

[0050] E...
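Steps [0048] and [0049] can be sketched as a small graph walk. This is a minimal illustration, not the patented implementation: the dict-based graph layout, the `SOURCE` marker, and the function names are assumptions, and the choice to start each subset at its opening Splitting node is one reading of the text. Only the attribute names `FLINK_MESSAGE_SHARED_NODE` and `FLINK_REDUCE_NODE` come from the source.

```python
from collections import deque

# FLINK_MESSAGE_SHARED_NODE and FLINK_REDUCE_NODE are attribute names from the
# text; "SOURCE" and the dict-based graph layout are assumptions of this sketch.
SPLITTING_ATTRS = {"SOURCE", "FLINK_MESSAGE_SHARED_NODE", "FLINK_REDUCE_NODE"}


def is_splitting(node, attrs):
    """A node is a Splitting node if it carries one of the splitting attributes."""
    return attrs.get(node) in SPLITTING_ATTRS


def etl_subsets(edges, attrs):
    """Group the DAG into ETL subsets, one per Splitting node.

    Each subset holds the Splitting node that opens it plus every downstream
    node reached before the next Splitting node (an assumed grouping).
    """
    subsets = {}
    for start in attrs:
        if not is_splitting(start, attrs):
            continue
        members, seen, queue = [start], {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, []):
                # Stop at the next Splitting node: it opens its own subset.
                if nxt in seen or is_splitting(nxt, attrs):
                    continue
                seen.add(nxt)
                members.append(nxt)
                queue.append(nxt)
        subsets[start] = members
    return subsets


# Hypothetical four-node flow: source -> clean -> sort -> load.
edges = {"source": ["clean"], "clean": ["sort"], "sort": ["load"]}
attrs = {"source": "SOURCE", "clean": None, "sort": "FLINK_REDUCE_NODE", "load": None}
subsets = etl_subsets(edges, attrs)
```

In this hypothetical flow the walk yields two subsets, one opened by the data source and one opened by the sort node, and each subset would then run inside a single Flink operator.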

Embodiment 2

[0115] Embodiment 2 of the present invention provides a method for running an ETL process union component on the Flink framework. Compared with Embodiment 1, this embodiment uses a more practical scenario to show the implementation process when the Flink union operator is used.

[0116] In the ETL flowchart, the subset downstream of the union component contains a sort component. Because the sort component is a node with the FLINK_REDUCE_NODE attribute, the sort node must be a Flink operator node, and the union component therefore needs to be converted to the union operator of Flink.

[0117] The column information of the multiple data sources feeding the union component is not necessarily identical: the number of columns may differ, and the column types may differ. The ETL configuration of the union component defines the output reference columns, taking the column information of one data source as the benchmark, and the...
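The two kinds of inconsistency named in [0117] (column count and column type) can be detected with a short check against the reference columns. This is a hedged sketch only: the `(name, type)` tuple layout, the function name, and the comparison by position are assumptions, and it reports differences without attempting the matching step, which the text above does not show in full.

```python
def column_mismatches(reference, other):
    """Compare another source's columns with the reference (benchmark) columns.

    Columns are (name, type) pairs compared by position, which is an
    assumption of this sketch rather than something the text specifies.
    """
    issues = []
    if len(other) != len(reference):
        issues.append(f"column count {len(other)} != {len(reference)}")
    # zip stops at the shorter list, so only shared positions are type-checked.
    for i, ((ref_name, ref_type), (_, typ)) in enumerate(zip(reference, other)):
        if typ != ref_type:
            issues.append(f"column {i} ({ref_name}): type {typ} != {ref_type}")
    return issues


# Hypothetical column metadata for two data sources of one union component.
reference = [("id", "INT"), ("name", "VARCHAR"), ("amount", "DECIMAL")]
other = [("id", "BIGINT"), ("name", "VARCHAR")]
issues = column_mismatches(reference, other)
```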

Embodiment 3

[0139] Embodiment 3 of the present invention provides a method for running an ETL process union component on the Flink framework. Compared with Embodiment 1, this embodiment uses a more practical scenario to show the implementation process when the Flink union operator is not used.

[0140] In the ETL flowchart, none of the components downstream of the union component needs to be translated into a Flink operator, so the union component does not need to be converted to the union operator provided by the Flink framework.

[0141] The column information of the multiple data sources feeding the ETL union component is not necessarily identical: the number of columns may differ, and the column types may differ. The ETL configuration of the union component defines the output reference columns, using the column information of one data source as the benchmark, and the other data source data are matched an...
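The rule that separates Embodiment 2 from Embodiment 3 can be expressed as a downstream check on the DAG. The sketch below is illustrative and rests on assumptions: the graph layout and function names are hypothetical, every downstream node is inspected, and only `FLINK_REDUCE_NODE` (the attribute the text ties to the sort component) is treated as requiring a Flink operator, although other attributes may qualify in the full method.

```python
# Only FLINK_REDUCE_NODE is confirmed by the text as forcing a Flink operator;
# treating it as the sole trigger is an assumption of this sketch.
REQUIRES_FLINK_OPERATOR = {"FLINK_REDUCE_NODE"}


def downstream_nodes(edges, start):
    """Return every node reachable from start, excluding start itself."""
    seen, stack = set(), list(edges.get(start, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, []))
    return seen


def union_needs_flink_operator(edges, attrs, union_node):
    """True if a downstream node must itself be a Flink operator (Embodiment 2)."""
    return any(
        attrs.get(node) in REQUIRES_FLINK_OPERATOR
        for node in downstream_nodes(edges, union_node)
    )


# Embodiment 2 style flow: a sort component follows the union component.
edges_sort = {"union": ["sort"], "sort": ["load"]}
attrs_sort = {"union": None, "sort": "FLINK_REDUCE_NODE", "load": None}
with_sort = union_needs_flink_operator(edges_sort, attrs_sort, "union")

# Embodiment 3 style flow: no downstream component needs a Flink operator.
edges_plain = {"union": ["filter"], "filter": ["load"]}
attrs_plain = {"union": None, "filter": None, "load": None}
without_sort = union_needs_flink_operator(edges_plain, attrs_plain, "union")
```

When the check is false, the union component and its downstream components can stay together in one ETL subset inside a single Flink operator, which is the case the abstract credits with avoiding extra serialization, deserialization, and network transmission.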

Abstract

The invention relates to the technical field of data processing and provides a method and a device for running an ETL (Extract-Transform-Load) process union component on the Flink framework. ETL process components are split, recombined, and then translated into Flink operators, and each recombined ETL sub-process runs inside a Flink operator method. This avoids rewriting the data processing logic inside the Flink operators and avoids reimplementing the data-union logic. The multiple data sources of the union component can be read concurrently in different TaskManager node partitions or shards of the Flink framework, which greatly improves data reading efficiency. Because the union component and the set of components that follow it run inside a Flink operator, the Flink union operator is not needed, unnecessary data serialization, deserialization, and network transmission are avoided, and the efficiency of data union is greatly improved.

Description

Technical field

[0001] The present invention relates to the technical field of data processing, and in particular to a method and apparatus for running an ETL process union component on the Flink framework.

Background

[0002] ETL is an important software tool for data processing and for building data warehouses. It extracts, cleans, and transforms data from heterogeneous data sources and then loads the result. A traditional ETL tool generally publishes a process to a centralized ETL server node to run. The processes, and the components within each process, all run with a multi-threaded mechanism, but however many processes there are, they can only run on a single node, so the performance of a large data processing process cannot be improved.

[0003] The Flink big data platform is widely used in big data processing. It is a distributed processing engine framework for stateful computation over unbounded and bounded data streams, with strong fault recovery and fault tolerance.

[0004] If the ...

Application Information

IPC(8): G06F8/30, G06F16/25, G06F16/901
CPC: G06F8/315, G06F16/252, G06F16/254, G06F16/9024
Inventor 高东升, 梅纲, 吴鑫, 胡高坤, 付晨玺
Owner WUHAN DAMENG DATABASE