A method for transmitting mass data of a data warehouse based on plug-in heterogeneous data sources

By using the Flink framework based on pluggable heterogeneous data sources, we have achieved efficient transmission of massive amounts of data in the data warehouse, solved the problem of low data aggregation efficiency in existing technologies, and improved the efficiency of source layer construction and the stability of data aggregation.

CN115729924B · Active · Publication Date: 2026-06-26 · SU YIN KAIJI CONSUMER FINANCE CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SU YIN KAIJI CONSUMER FINANCE CO LTD
Filing Date
2022-12-13
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing data warehouse data aggregation solutions are inefficient, especially lacking specific optimizations in the source layer construction, which fails to effectively improve data transmission efficiency.

Method used

It adopts a pluggable heterogeneous data source approach, uses the Flink framework for data transmission, generates unified task execution parameters, dynamically loads plugins to achieve data cleaning and writing, and supports multi-node clustered operation and in-memory incremental stripping.

Benefits of technology

It enables flexible control over the parallelism of tasks running on the cluster, improves the efficiency of data warehouse source layer construction, and supports efficient, stable, and easy-to-use data aggregation of massive amounts of data.

✦ Generated by Eureka AI based on patent content.

[Figure: CN115729924B_ABST]

Abstract

The application discloses a method for transmitting massive data in a data warehouse based on plug-in heterogeneous data sources. The method generates unified task execution parameters, submits the task, executes the task and registers the database table data as a memory-mapped table, cleans and transforms the data to generate a new memory-mapped table, and writes the data into a Hive table. The method can run on multiple nodes and supports in-memory incremental stripping, effectively improving the efficiency of data warehouse source layer construction.

Description

Technical Field

[0001] This invention relates to a data transmission method, and more particularly to a method for transmitting massive amounts of data in a data warehouse based on plug-in heterogeneous data sources, belonging to the field of data warehouse technology.

Background Technology

[0002] Batch data transmission is one of the key scenarios in the construction of data warehouses and data platforms. It carries the tasks of source table aggregation and data push from the data warehouse to the downstream, and is the core tool for data inflow and outflow in the data warehouse.

[0003] In existing technologies, data aggregation solutions for the data warehouse source layer fall into three main groups: solutions represented by tools such as DataStage and Kettle, solutions represented by DataLoader + FTP / SCP, and solutions represented by DataX, Sqoop, and Flume. The solutions represented by DataStage and Kettle run on a single machine through a direct database connection, a client / server model suited to traditional database migration and integration. The solutions represented by DataLoader + FTP / SCP combine the database's built-in data loading / unloading tool (DataLoader) with remote file transfer protocols (FTP / SCP) to aggregate data into the data warehouse; this is one of the more commonly used solutions in data warehouse operations. The solutions represented by DataX, Sqoop, and Flume use multi-threaded JDBC for data transmission and are generally applicable, but lack specific optimizations for data aggregation in the data warehouse source layer and for incremental stripping.

[0004] Existing data aggregation solutions are inefficient in building the data source layer. It is necessary to provide a new data transmission method to address the shortcomings of existing technologies and improve the efficiency of building the data warehouse source layer.

Summary of the Invention

[0005] The technical problem to be solved by the present invention is to provide a method for transmitting massive data in a data warehouse based on pluggable heterogeneous data sources, so as to realize flexible control of the parallelism of tasks running on the cluster.

[0006] To solve the above-mentioned technical problems, the technical solution adopted by the present invention is as follows:

[0007] A method for transmitting massive amounts of data in a data warehouse based on plug-in heterogeneous data sources, characterized by the following steps:

[0008] S1. Generate unified task execution parameters;

[0009] S2. Submit the task;

[0010] S3. Execute the task and register the database table data as a memory-mapped table;

[0011] S4. Clean and transform the data to generate a brand new memory mapping table;

[0012] S5. Write the data to the Hive table.

[0013] Furthermore, step S1 specifically involves: defining a unified JSON message format and creating task execution parameters.

[0014] Further, step S2 specifically involves: receiving task parameters through the RunJob main function in flink-framework.jar, parsing the JSON data, determining the detailed information of the task to be transmitted, assembling the flink-submit parameters, and then calling the flink submission command to submit the task.

[0015] Further, step S3 specifically involves: after receiving the task submitted by flink-submit, the Flink cluster runs it according to the logic of the JobStart main function in the core project. Before the reader function is executed, the mysql-reader.jar plugin is loaded through the classloader dynamic loading mechanism and the Java reflection mechanism. The reader function, implemented by mysql-reader.jar, then reads the database table data from MySQL and registers it as a memory-mapped table in the Flink cluster so that flink-sql functions can be applied to it.

[0016] Further, step S4 specifically involves: after the core main project establishes the reader pipeline, it executes the cleansing and transformation function transform to parse the transform parameter in the JSON and concatenate it into data cleansing SQL. The memory-mapped table performs data transformation and cleansing operations through the SQL of this transformation function and generates a brand new memory-mapped table.

[0017] Further, step S5 specifically involves: after the data has been cleaned and transformed into the new memory-mapped table, execution continues according to the logic of the JobStart main function in the core main project, and the writer function is executed. The hive-writer.jar plugin is loaded through the classloader dynamic loading mechanism and the Java reflection mechanism, and the write function implemented by hive-writer.jar then writes the data into the Hive table.

[0018] Furthermore, when the source table has no date field, the incremental rows cannot be identified through a SQL WHERE condition. In that case, when flink-framework.jar confirms that the incremental stripping flag is true, it executes the in-memory incremental stripping logic and modifies the flink-submit parameters.

[0019] Furthermore, the flink-submit parameters are modified as follows: In the core's IncrementStart main method, the reader function is executed first to load the latest source table data T1, and the snapShoot function is executed simultaneously to read the latest full data T2 with the same table name in the source layer. The difference calculation function (T1) EXCEPT (T2) of Flink-SQL is used to calculate the increment of T1 relative to T2, and the calculation result data is loaded into a memory-mapped table. Then, the transform function is executed to clean the data, and finally, the writer function is executed to write the data.

[0020] Furthermore, relying on the computing nodes of the Flink cluster and the distributed computing capabilities of the Flink framework, multi-node clustered computing is performed during task execution.

[0021] Furthermore, the multi-node clustered operation specifically includes:

[0022] Submit the flink-framework.jar core project, and then execute the flink-submit command through the Flink client to submit the task to the cluster for execution;

[0023] After receiving the core project and task parameters, the jobManager cluster manager allocates tasks based on the degree of parallelism.

[0024] The taskManager is the task manager, responsible for resource management of specific tasks and communication between tasks distributed across different nodes.

[0025] The task is responsible for running the code logic in the core project and returning the results to the jobManager.

[0026] Compared with the prior art, the present invention has the following advantages and effects: The present invention provides a method for transmitting massive data in a data warehouse based on plug-in heterogeneous data sources, which can run on multiple nodes and supports in-memory incremental stripping, effectively improving the efficiency of data warehouse source layer construction.

Attached Figure Description

[0027] Figure 1 is a logical architecture diagram of the method for transmitting massive data in a data warehouse based on plug-in heterogeneous data sources according to the present invention.

[0028] Figure 2 is a schematic diagram of the multi-node clustered operation of the present invention.

Detailed Implementation

[0029] To illustrate in detail the technical solutions adopted by the present invention to achieve the intended technical objectives, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Furthermore, the technical means or technical features in the embodiments of the present invention can be replaced without creative effort. The present invention will be described in detail below with reference to the accompanying drawings and embodiments.

[0030] Figure 1 shows the logical architecture of the method for transmitting massive data in a data warehouse based on plug-in heterogeneous data sources according to the present invention.

[0031] In Figure 1, the front-end business source databases include business-oriented databases such as the core system and the CRM (customer relationship management) system. Because many front-end systems are connected to the data warehouse and their database types are diverse, this invention supports data ingestion from multiple kinds of sources, including but not limited to MySQL, Oracle, DB2, SQL Server, and data files.

[0032] Data warehouse source layer: The characteristic of data warehouse construction is layering. The source layer requires that the data be kept highly consistent with the source table, so that business personnel can easily track the original data situation.

[0033] Flink Cluster: The code in this invention is written using the Flink framework. Flink is a popular distributed in-memory data processing streaming engine with high throughput, high availability, and high performance. The Flink cluster is the environment in which the Flink code runs.

[0034] Data transmission channel: Data ingestion tasks submitted via flink-submit will be instantiated in the Flink cluster and a data transmission channel will be created to perform the logic of data extraction, transformation, and writing.

[0035] The present invention provides a method for transmitting massive amounts of data in a data warehouse based on plug-in heterogeneous data sources, comprising the following steps:

[0036] S1. Generate unified task execution parameters.

[0037] It defines a unified JSON message format and creates task execution parameters. The created task execution parameters have a clear data structure and can be customized in terms of execution content and method, effectively improving the flexibility of the task.
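
As an illustration only, a task message in such a unified JSON format might look like the sketch below. The patent does not publish its schema, so every field name here (`job_name`, `parallelism`, `incremental_strip`, `reader`, `transform`, `writer`) and the helper `parse_task` are hypothetical choices made for this example.

```python
import json

# Hypothetical task message; all field names are illustrative assumptions,
# not the schema used by the patented tool.
TASK_JSON = """
{
  "job_name": "ods_customer_daily",
  "parallelism": 4,
  "incremental_strip": false,
  "reader": {"plugin": "mysql-reader", "table": "customer"},
  "transform": {"select": ["id", "TRIM(name) AS name"], "where": "id IS NOT NULL"},
  "writer": {"plugin": "hive-writer", "table": "ods.customer"}
}
"""

REQUIRED_KEYS = ("job_name", "parallelism", "reader", "transform", "writer")


def parse_task(message: str) -> dict:
    """Parse a task message and check that the top-level sections exist."""
    task = json.loads(message)
    missing = [key for key in REQUIRED_KEYS if key not in task]
    if missing:
        raise ValueError(f"task message is missing keys: {missing}")
    return task


task = parse_task(TASK_JSON)
```

Keeping the reader, transform, and writer sections separate mirrors the reader / transform / writer stages described in steps S3 to S5, so each stage only needs to look at its own section.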

[0038] S2. Submit the task.

[0039] The RunJob main function in flink-framework.jar receives task parameters, parses the JSON data, determines the detailed information of the task to be transmitted, assembles the flink-submit parameters, and then calls the Flink submission command to submit the task.
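
The submission step can be sketched roughly as follows. This is a Python stand-in for the RunJob logic, not the patented code. It assumes that "flink-submit" corresponds to the standard Flink CLI command `flink run`, whose `-p` (parallelism) and `-c` (entry class) options are real; the JSON field names, the rule for choosing between the JobStart and IncrementStart entry points, and passing the JSON as a program argument are all assumptions for illustration.

```python
import json
import shlex


def build_submit_command(task_json: str, jar_path: str = "flink-framework.jar") -> list:
    """Assemble a `flink run` command line from a JSON task message."""
    task = json.loads(task_json)
    # Assumption: tasks flagged for incremental stripping use the
    # IncrementStart entry point; all other tasks use JobStart.
    entry_class = "IncrementStart" if task.get("incremental_strip") else "JobStart"
    return [
        "flink", "run",
        "-p", str(task.get("parallelism", 1)),      # task parallelism on the cluster
        "-c", entry_class,                          # main class inside the jar
        jar_path,
        json.dumps(task, separators=(",", ":")),    # task parameters as one argument
    ]


cmd = build_submit_command('{"parallelism": 3, "incremental_strip": true}')
printable = shlex.join(cmd)  # shell-quoted form, e.g. for logging
```

Building the command as a list rather than a single string avoids quoting problems when the JSON argument contains spaces or quotes; the list could be handed to `subprocess.run` on a machine where the Flink client is installed.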

[0040] S3. Execute the task and register the database table data as a memory-mapped table.

[0041] After receiving the execution task from flink-submit, the Flink cluster runs it according to the main function logic of JobStart in the core project. Before executing the reader function, it loads the mysql-reader.jar plugin through the classloader dynamic loading mechanism and Java reflection mechanism. Then, it uses mysql-reader.jar to implement the execution of the reader function, reads the database table data in MySQL, and registers it as a memory-mapped table in the Flink cluster for easy use of flink-sql functions.
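
The load-at-run-time idea can be illustrated with a small Python analogue. The patent describes a Java class loader plus Java reflection; the sketch below swaps in Python's `importlib` and `getattr`, which play comparable roles. The plugin file name `mysql_reader.py`, the class name `Reader`, and its `read` method are hypothetical, and the reader returns fixed rows instead of querying MySQL.

```python
import importlib.util
import tempfile
from pathlib import Path

# Source of a hypothetical reader plugin, written to disk so that it can be
# discovered and loaded only at run time.
PLUGIN_SOURCE = '''
class Reader:
    """Stand-in reader plugin; returns fixed rows instead of querying MySQL."""

    def read(self, table):
        return [{"id": 1, "table": table}, {"id": 2, "table": table}]
'''


def load_plugin(path: Path, class_name: str):
    """Load a class from a file whose location is only known at run time."""
    spec = importlib.util.spec_from_file_location(path.stem, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)      # comparable to loading a jar via a class loader
    return getattr(module, class_name)   # comparable to a reflective class lookup


with tempfile.TemporaryDirectory() as plugin_dir:
    plugin_file = Path(plugin_dir) / "mysql_reader.py"
    plugin_file.write_text(PLUGIN_SOURCE)
    reader_cls = load_plugin(plugin_file, "Reader")
    rows = reader_cls().read("customer")
```

Because the core code only knows the plugin by path and class name, a reader for another database could be added by dropping in a new file, without rebuilding the core.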

[0042] S4. Clean and transform the data to generate a brand new memory-mapped table.

[0043] After the core main project establishes the reader pipeline, it executes the cleansing and transformation function transform to parse the transform parameter in the JSON and concatenate it into data cleansing SQL. The memory-mapped table performs data transformation and cleansing operations through the SQL of this transformation function and generates a brand new memory-mapped table.
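
A minimal sketch of the "parse the transform parameter and concatenate it into SQL" step is given below. It uses an in-memory SQLite database as a stand-in for a Flink memory-mapped table, since the statement shape is the same; the `select` / `where` layout of the transform parameter and the table names are assumptions for illustration.

```python
import sqlite3


def build_transform_sql(source_table: str, transform: dict) -> str:
    """Concatenate a cleaning query from the transform section of a task message."""
    # The fragments are spliced in as-is, so the task message must come from
    # a trusted scheduler rather than from end users.
    columns = ", ".join(transform.get("select", ["*"]))
    sql = f"SELECT {columns} FROM {source_table}"
    if transform.get("where"):
        sql += f" WHERE {transform['where']}"
    return sql


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src_customer (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO src_customer VALUES (?, ?)",
    [(1, "  Ann "), (None, "ghost"), (2, "Bob")],
)

transform = {"select": ["id", "TRIM(name) AS name"], "where": "id IS NOT NULL"}
sql = build_transform_sql("src_customer", transform)

# The query result becomes a new table, leaving the registered source table untouched.
conn.execute(f"CREATE TEMP TABLE clean_customer AS {sql}")
cleaned = conn.execute("SELECT id, name FROM clean_customer ORDER BY id").fetchall()
```

Here the row with a missing `id` is dropped and the padded name is trimmed, which is the kind of cleaning rule the transform parameter is meant to carry.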

[0044] When the source table has no date field, the incremental rows cannot be identified through a SQL WHERE condition. In that case, when the incremental stripping flag is confirmed to be true, flink-framework.jar executes the in-memory incremental stripping logic and modifies the flink-submit parameters.

[0045] The specific modification to the flink-submit parameters is as follows: In the core's IncrementStart main method, the reader function is executed first to load the latest source table data T1, and the snapShoot function is executed simultaneously to read the latest full data T2 with the same table name in the source layer. The difference calculation function (T1) EXCEPT (T2) of Flink-SQL is used to calculate the increment of T1 relative to T2, and the calculation result data is loaded into a memory-mapped table. Then, the transform function is executed to clean the data, and finally, the writer function is executed to write the data.
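
The difference calculation (T1) EXCEPT (T2) can be tried out on a toy data set. The sketch below runs the same set operation in an in-memory SQLite database, used here only as a convenient stand-in for Flink SQL; the table layout and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id INTEGER, city TEXT)")  # T1: latest source table data
conn.execute("CREATE TABLE t2 (id INTEGER, city TEXT)")  # T2: latest full data in the source layer
conn.executemany(
    "INSERT INTO t1 VALUES (?, ?)",
    [(1, "Nanjing"), (2, "Suzhou"), (3, "Wuxi")],
)
conn.executemany(
    "INSERT INTO t2 VALUES (?, ?)",
    [(1, "Nanjing"), (2, "Shanghai")],
)

# Rows of T1 that have no identical counterpart in T2: new rows and changed rows.
increment = conn.execute(
    "SELECT id, city FROM t1 EXCEPT SELECT id, city FROM t2 ORDER BY id"
).fetchall()
```

In this example the increment holds row 2 (changed city) and row 3 (newly added), while the unchanged row 1 is stripped out. Because EXCEPT compares whole rows, no date column is needed; note that a row deleted from the source would not show up in T1 EXCEPT T2.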

[0046] S5. Write the data to the Hive table.

[0047] After the data has been cleaned and transformed into the new memory-mapped table, execution continues according to the logic of the JobStart main function in the core project, and the writer function is executed. The hive-writer.jar plugin is loaded through the classloader dynamic loading mechanism and the Java reflection mechanism, and the write function implemented by hive-writer.jar then writes the data into the Hive table.

[0048] This invention relies on the computing nodes of the Flink cluster and the distributed computing capabilities of the Flink framework to perform multi-node clustered computing during task execution.

[0049] As shown in Figure 2, multi-node clustered computing specifically involves:

[0050] Submit the flink-framework.jar core project, and then execute the flink-submit command through the Flink client to submit the task to the cluster for execution;

[0051] After receiving the core project and task parameters, the jobManager cluster manager allocates tasks based on the degree of parallelism.

[0052] The taskManager is the task manager, responsible for resource management of specific tasks and communication between tasks distributed across different nodes.

[0053] The task is responsible for running the code logic in the core project and returning the results to the jobManager.
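
The division of labour described above can be pictured with a local analogy. The sketch below is not Flink's scheduler: it uses a thread pool on one machine, and the round-robin split and the function names `split_rows` and `run_subtask` are assumptions chosen only to show how a degree of parallelism turns one job into several subtasks whose results are collected again.

```python
from concurrent.futures import ThreadPoolExecutor


def split_rows(rows, parallelism):
    """Deal rows round-robin into `parallelism` slices (the allocation step)."""
    return [rows[i::parallelism] for i in range(parallelism)]


def run_subtask(slice_rows):
    """Stand-in for the core project logic executed by one task."""
    return [row.upper() for row in slice_rows]


rows = ["a", "b", "c", "d", "e"]
parallelism = 2

slices = split_rows(rows, parallelism)
with ThreadPoolExecutor(max_workers=parallelism) as pool:
    # Each slice is processed by its own worker; results come back in slice order.
    results = list(pool.map(run_subtask, slices))
```

Raising `parallelism` produces more, smaller slices, which is the lever the method exposes for controlling how widely a task is spread across the cluster.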

[0054] This invention uses the Flink streaming data engine as the underlying data processing framework. Unified task execution parameters are generated through data warehouse scheduling and submitted to the Flink cluster with flink-submit, where the data aggregation task is executed. The task execution logic is implemented in a self-developed JAR package tool. This tool is built on the Flink programming framework and developed in a plug-in manner; its main components are a core package, reader plugins, and writer plugins, with a corresponding plugin developed for reading from or writing to each heterogeneous database. This approach significantly reduces coupling between code modules, increases cohesion, and largely avoids JAR dependency conflicts.

[0055] This invention allows for flexible control over the parallelism of tasks running on the cluster and the number of connections to the data source. The cleaning rules are written in Flink-SQL format, maximizing the flexibility of task execution. In the business domain, this invention belongs to the source layer application in financial industry data warehouse construction, providing the underlying technical support for data aggregation in data warehouses. Massive amounts of data are a characteristic of data warehouses, and the continuous and stable aggregation of such data is crucial for data warehouse construction. This method performs well in terms of efficiency, stability, ease of use, and scalability, making it an effective method for data aggregation in the financial industry.

[0056] The above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention in any way. Although the present invention has been disclosed above with reference to preferred embodiments, it is not intended to limit the present invention. Any person skilled in the art can make some modifications or alterations to the above-disclosed technical content to create equivalent embodiments without departing from the scope of the present invention. Any simple modifications, equivalent substitutions, and improvements made to the above embodiments without departing from the scope of the present invention, based on the technical essence of the present invention and within the spirit and principles of the present invention, shall still fall within the protection scope of the present invention.

Claims

1. A method for transmitting massive amounts of data in a data warehouse based on plug-in heterogeneous data sources, characterized in that it includes the following steps: S1. Generate unified task execution parameters; S2. Submit the task; S3. Execute the task and register the database table data as a memory-mapped table; S4. Clean and transform the data to generate a brand new memory-mapped table; S5. Write the data to the Hive table. The specific steps of step S5 are as follows: after the data has been cleaned and transformed into the new memory-mapped table, execution continues according to the main function logic of JobStart in the core main project, and the writer function is executed. The hive-writer.jar plugin is loaded through the classloader dynamic loading mechanism and Java reflection mechanism, and then the hive-writer.jar plugin is used to implement the execution of the write function to write the data into the Hive table. When the source table of the database table does not have a date field, it is impossible to determine the business situation of each incremental data entry through the SQL WHERE condition. When the incremental stripping flag is confirmed to be true, flink-framework.jar executes the in-memory incremental stripping logic and modifies the flink-submit parameters. The specific modifications to the flink-submit parameters are as follows: In the core's IncrementStart main method, the reader function is executed to load the latest source table data T1, and simultaneously the snapShot function is executed to read the latest full data T2 with the same table name in the source layer. The difference calculation function (T1) EXCEPT (T2) of Flink-SQL is used to calculate the increment of T1 relative to T2, and the calculation result data is loaded into a memory-mapped table. Then, the transform function is executed to clean the data, and finally, the writer function is executed to write the data.

2. The method for transmitting massive amounts of data in a data warehouse based on a plug-in heterogeneous data source according to claim 1, characterized in that: Step S1 specifically involves defining a unified JSON message format and creating task execution parameters.

3. The method for transmitting massive amounts of data in a data warehouse based on a plug-in heterogeneous data source according to claim 1, characterized in that: Step S2 specifically involves receiving task parameters through the RunJob main function in flink-framework.jar, parsing the JSON data, determining the detailed information of the task to be transmitted, assembling the flink-submit parameters, and then calling the flink submission command to submit the task.

4. The method for transmitting massive amounts of data in a data warehouse based on a plug-in heterogeneous data source according to claim 3, characterized in that: Step S3 specifically involves the following steps: After receiving the execution task from flink-submit, the Flink cluster runs the JobStart main function according to the logic of the core project. Before executing the reader function, the mysql-reader.jar plugin is loaded through the classloader dynamic loading mechanism and Java reflection mechanism. Then, the reader function is executed using mysql-reader.jar to read data from the database tables in MySQL and register it as a memory-mapped table in the Flink cluster for easy use of the flink-sql function.

5. The method for transmitting massive amounts of data in a data warehouse based on a pluggable heterogeneous data source according to claim 4, characterized in that: Step S4 specifically involves: after the core main project establishes the reader pipeline, it executes the cleansing and transformation function transform to parse the transform parameter in the JSON and concatenate it into data cleansing SQL. The memory-mapped table performs data transformation and cleansing operations through the SQL of this transformation function and generates a brand new memory-mapped table.

6. The method for transmitting massive amounts of data in a data warehouse based on a plug-in heterogeneous data source according to claim 1, characterized in that: Relying on the computing nodes of the Flink cluster and the distributed computing capabilities of the Flink framework, multi-node clustered computing is performed during task execution.

7. The method for transmitting massive amounts of data in a data warehouse based on a plug-in heterogeneous data source according to claim 6, characterized in that: The multi-node clustered operation specifically refers to: Submit the flink-framework.jar core project, and then execute the flink-submit command through the Flink client to submit the task to the cluster for execution; After receiving the core project and task parameters, the jobManager cluster manager allocates tasks based on the degree of parallelism. The taskManager is the task manager, responsible for resource management of specific tasks and communication between tasks distributed across different nodes. The task is responsible for running the code logic in the core project and returning the results to the jobManager.