[0040] Example:
[0041] Referring to Figures 1-3, a datax-based data governance method according to an embodiment of the present invention includes:
[0042] S101. In datax, a single data synchronization job is called a Job. After datax receives a Job, it starts a process to carry out the entire synchronization. The dataxJob module is the central management node of a single Job and is responsible for functions such as data cleaning, sub-task division, and TaskGroup management;
[0043] S103. After dataxJob starts, the Job is divided into multiple small Tasks according to the segmentation strategy of the source end, so as to facilitate concurrent execution; each Task is responsible for synchronizing a portion of the data;
[0044] S105. After the Tasks are divided, dataxJob calls the Scheduler module, which recombines the divided Tasks according to the configured concurrency and assembles them into TaskGroups;
[0045] S107. Each Task is started by its TaskGroup. After a Task starts, it runs a fixed Reader->Channel->Writer thread pipeline to complete the task synchronization;
[0046] S109. After a datax Job starts running, the Job monitors and waits for the tasks of the multiple TaskGroup modules to complete; once all TaskGroup tasks are finished, the Job exits successfully, otherwise the process exits abnormally with a non-zero exit value.
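The fixed Reader->Channel->Writer pipeline of step S107 can be sketched as a minimal producer/consumer example (a hypothetical Python illustration; the real datax Channel, implemented in Java, additionally performs rate limiting and statistics collection):

```python
import queue
import threading

def reader(source, channel):
    # Reader thread: pulls records from the source and pushes them into the Channel.
    for record in source:
        channel.put(record)
    channel.put(None)  # end-of-data marker

def writer(channel, sink):
    # Writer thread: drains the Channel and lands records at the destination.
    while (record := channel.get()) is not None:
        sink.append(record)

source = list(range(10))
sink = []
channel = queue.Queue(maxsize=4)  # the Channel buffers records in transit
t_r = threading.Thread(target=reader, args=(source, channel))
t_w = threading.Thread(target=writer, args=(channel, sink))
t_r.start(); t_w.start()
t_r.join(); t_w.join()
```

After both threads join, `sink` holds every record read from `source`, mirroring how a Task completes its portion of the synchronization.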
[0047] In a further embodiment, the above sub-task division converts the computation of a single Job into multiple sub-Tasks.
[0048] In a further embodiment, the above Task is called a sub-task; a Task is the smallest unit of a datax Job.
[0049] In a further embodiment, the above TaskGroup is called a task group; each TaskGroup is responsible for running all of its assigned Tasks with a certain concurrency, and the default concurrency of a single task group is 5 Tasks.
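The sub-task division of S103 and the TaskGroup assembly of S105, with the default concurrency of 5 Tasks per task group, can be sketched as follows (a hypothetical illustration; `split_job` and `group_tasks` are illustrative names, and real datax plug-ins choose their own source-end splitting strategies):

```python
import math

def split_job(record_count, task_count):
    """Split a Job's record range into roughly equal sub-Tasks
    (hypothetical splitting strategy for illustration only)."""
    size = math.ceil(record_count / task_count)
    return [(i * size, min((i + 1) * size, record_count))
            for i in range(task_count)
            if i * size < record_count]

def group_tasks(tasks, channels_per_group=5):
    """Reassemble Tasks into TaskGroups, each running up to
    channels_per_group Tasks concurrently (default 5, as in the text)."""
    return [tasks[i:i + channels_per_group]
            for i in range(0, len(tasks), channels_per_group)]

tasks = split_job(record_count=100, task_count=12)
groups = group_tasks(tasks)
```

Here 100 records split into 12 range Tasks, which the default concurrency of 5 packs into three TaskGroups of sizes 5, 5, and 2.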
[0050] Through the above-mentioned scheme of the present invention, the present invention has the following beneficial effects:
[0051] 1. Completely solves the problem of transmission distortion for individual data types: through function optimization, datax supports all strong data types, and each plug-in has its own data type conversion strategy, so that data is transmitted to the destination completely and losslessly;
[0052] 2. Provides runtime monitoring of the traffic and data volume of the entire job link: while datax is running, the monitoring module comprehensively displays the job status, data traffic, data speed, execution progress, and other information, so that users can understand the job status in real time; it can also intelligently compare the speeds of the source end and the destination end during job execution, providing users with more information for performance troubleshooting;
[0053] 3. Provides dirty data detection: when a large amount of data is transmitted, transmission errors (such as type conversion errors) are inevitable for various reasons, and datax regards such data as dirty data. By configuring the data cleaning process, datax accurately filters, identifies, collects, and displays dirty data, and provides users with multiple dirty data handling modes, allowing users to precisely control data quality;
[0054] 4. Rich data conversion functions: as an ETL tool serving big data, datax provides not only a data snapshot migration function but also rich data conversion functions, so that data desensitization, completion, filtering, and other conversions can be performed easily during transmission; it also supports groovy functions, allowing users to customize conversion logic;
[0055] 5. Precise speed control: after optimization, datax provides three flow control modes, including channel (concurrency), record stream, and byte stream, allowing users to control the job speed so that the job achieves the best synchronization speed within the range that the database can bear;
[0056] 6. Robust fault-tolerance mechanism: datax jobs are easily disturbed by external factors; network interruptions, unstable data sources, and similar factors can easily cause a half-synchronized job to stop with an error. Stability is therefore a basic requirement of datax. The design of datax 3.0 focuses on improving the stability of the framework and the plug-ins; at present, datax 3.0 provides multi-level local/global retries at the thread level and the job level to ensure stable operation of users' jobs.
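The three flow control modes of effect 5 (channel, record stream, byte stream) can be sketched as a job-level speed setting apportioned across concurrent channels (`per_channel_limits` is a hypothetical helper for illustration, not a datax API; the field names mirror the three modes named in the text):

```python
# Job-level speed setting: 4 concurrent channels, capped at
# 10,000 records/s and 1 MiB/s for the job as a whole.
speed = {"channel": 4, "record": 10000, "byte": 1048576}

def per_channel_limits(speed):
    """Derive per-channel record and byte limits from the job-level
    setting by dividing evenly among the configured channels."""
    channels = speed.get("channel", 1)
    return {
        "record_per_channel": speed["record"] // channels,
        "byte_per_channel": speed["byte"] // channels,
    }

limits = per_channel_limits(speed)
```

With 4 channels, the job-level caps translate into 2,500 records/s and 256 KiB/s per channel, which is how a job-wide limit can stay within what the database can bear.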
[0057] In order to facilitate understanding of the above technical solutions of the present invention, the working principle or mode of operation of the present invention in practice is described in detail below.
[0058] In practical application:
[0059] 1. datax is an open-source offline synchronization tool for heterogeneous data sources from Alibaba, dedicated to realizing stable and efficient data synchronization between various heterogeneous data sources, including relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, and FTP. The data synchronization function is based on the open-source datax framework, with optimization and encapsulation in our internal framework.
[0060] 2. The datax offline data synchronization framework is encapsulated and built on a Framework + plugin architecture: reading from and writing to data sources are abstracted into Reader/Writer plug-ins and incorporated into the overall synchronization framework.
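The Framework + plugin architecture can be sketched as follows: the framework depends only on abstract Reader/Writer interfaces, and each concrete data source supplies its own plug-in (`ListReader`/`ListWriter` are hypothetical in-memory plug-ins for illustration; datax itself is implemented in Java):

```python
from abc import ABC, abstractmethod

class Reader(ABC):
    """Abstract source plug-in: a new data source only implements read()."""
    @abstractmethod
    def read(self):
        ...

class Writer(ABC):
    """Abstract destination plug-in: a new sink only implements write()."""
    @abstractmethod
    def write(self, records):
        ...

class ListReader(Reader):
    def __init__(self, data):
        self.data = data
    def read(self):
        return iter(self.data)

class ListWriter(Writer):
    def __init__(self):
        self.out = []
    def write(self, records):
        self.out.extend(records)

def framework_sync(reader, writer):
    # The framework sees only the Reader/Writer interfaces,
    # never the concrete data sources behind them.
    writer.write(reader.read())

w = ListWriter()
framework_sync(ListReader([1, 2, 3]), w)
```

Adding a new data source then means writing one Reader and one Writer plug-in; the synchronization framework itself is unchanged.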
[0061] datax 3.0 supports completing synchronization jobs in single-machine multi-threaded mode. The life cycle of a datax job is shown in the sequence diagram below, and the overall architecture design briefly explains the implementation logic of each datax module.
[0062] In order to solve the synchronization problem of heterogeneous data sources, datax turns complex mesh synchronization links into a star-shaped data link. As shown in Figure 3, datax, as an intermediate transmission carrier, is responsible for connecting the various data sources. When a new data source needs to be accessed, it only needs to be connected to datax to achieve seamless data synchronization with the existing data sources.
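The benefit of the star topology can be made concrete with a simple count: n mutually synchronized heterogeneous sources require on the order of n x (n - 1) directed point-to-point links, whereas connecting each source to datax requires only one Reader and one Writer per source (illustrative arithmetic, not part of the original specification):

```python
def mesh_links(n):
    # Directed point-to-point synchronization links among n sources.
    return n * (n - 1)

def star_plugins(n):
    # One Reader plus one Writer per data source connected to datax.
    return 2 * n

# For example, with 10 heterogeneous sources a full mesh needs 90
# directed links, while the star topology needs only 20 plug-ins.
mesh = mesh_links(10)
star = star_plugins(10)
```

The gap widens quadratically as more data sources are added, which is why accessing a new source only requires connecting it to datax.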
[0063] 1. After encapsulation and tuning, datax 3.0 already supports all strong data types, and each plug-in has its own data type conversion strategy, so that data is transmitted to the destination completely and losslessly;
[0064] 2. While datax 3.0 is running, the job status, data traffic, data speed, execution progress, and other information are comprehensively displayed, so that users can understand the job status in real time; it can also intelligently compare the speeds of the source end and the destination end during job execution, providing users with more information for performance troubleshooting.
[0065] When a large amount of data is transmitted, transmission errors (such as type conversion errors) are inevitable for various reasons, and datax regards such data as dirty data. At present, datax can accurately filter, identify, collect, and display dirty data, and provides users with multiple dirty data handling modes, allowing users to precisely control data quality.
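The dirty data handling described above can be sketched as collecting, rather than aborting on, records that fail conversion (a hypothetical Python illustration; datax implements this in Java through its configurable data cleaning process):

```python
def convert_int(value):
    """Hypothetical type-conversion step of a synchronization pipeline."""
    return int(value)

def sync_with_dirty_collection(records, convert):
    """Convert each record; records that fail are treated as dirty data
    and collected with the failure reason instead of failing the job."""
    clean, dirty = [], []
    for r in records:
        try:
            clean.append(convert(r))
        except (ValueError, TypeError) as err:
            dirty.append((r, str(err)))
    return clean, dirty

clean, dirty = sync_with_dirty_collection(["1", "2", "x", None], convert_int)
```

The clean records continue to the destination while the dirty ones are available for display and for whichever handling mode the user configures.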