Kettle-based method for data extraction and statistics on a big data platform

A big data platform data extraction technique, applied in database models, relational databases, electrical digital data processing, etc., that addresses problems such as heavy consumption of big data cluster network resources.

Status: Inactive
Publication Date: 2017-02-22
ZHENGZHOU YUNHAI INFORMATION TECH CO LTD
6 Cites, 9 Cited by

AI Technical Summary

Problems solved by technology

[0008] Currently, a MapReduce task scans all table data every day to count the data volume. With tens of billions of records, the count takes 4-5 hours a day, and during this period the big data cluster's computing and network resources are heavily consumed.

Examples

Embodiment 1

[0030] A Kettle-based data extraction and statistics method for a big data platform. The method modifies the Kettle source code so that the outcome of each data extraction task is captured and recorded in an HBase table. This HBase table is called the historical situation table, and it records the data volume of all data tables.

[0031] Every day, a scheduled Sqoop task incrementally extracts data from the relational database. The data extracted in one run is one day's data increment. The daily increment of each data table is recorded and written into the HBase table, so that the data volume can be queried by any combination of table and time.
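
As a minimal sketch of the bookkeeping described in [0031], the following Python uses a plain dictionary in place of the HBase table. The names `record_increment` and `query`, the table `order_info`, and all row counts are hypothetical and are not taken from the patent.

```python
from collections import defaultdict

# Stand-in for the HBase table: (table name, day) -> rows extracted that day.
# A real deployment would write to HBase; a dict keeps the sketch runnable.
daily_increment = defaultdict(int)


def record_increment(table, day, rows):
    """Record the number of rows one extraction run pulled for `table` on `day`."""
    daily_increment[(table, day)] += rows


def query(table=None, day=None):
    """Sum recorded increments, optionally filtered by table and/or day."""
    return sum(
        rows
        for (t, d), rows in daily_increment.items()
        if (table is None or t == table) and (day is None or d == day)
    )


# Hypothetical sample data: two tables extracted on two consecutive days.
record_increment("person_info", "20150604", 1200)
record_increment("person_info", "20150605", 800)
record_increment("order_info", "20150604", 300)

print(query(table="person_info"))                  # all days for one table
print(query(day="20150604"))                       # all tables for one day
print(query(table="person_info", day="20150605"))  # one table on one day
```

Because the counts are captured as a by-product of extraction, answering these queries only reads the recorded increments and never rescans the data tables themselves.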

[0032] Traditional relational databases that support online systems and big data technologies that handle offline statistical analysis will coexist for a long time. Between these two systems, Kettle acts as a bridge and is responsible for data transfer. Through the transformation of the source co...

Embodiment 2

[0034] On the basis of Embodiment 1, the method in this embodiment records the daily data increment of each data table in an HBase data history table and designs the rowkey (row primary key) of this history table as follows:

[0035]-[0038]

Serial number | rowkey                     | rowkey example       | qualifier
1             | {table name}               | person_info          | data volume
2             | {table name} spacer {time} | person_info@20150604 | data volume
3             | {time} spacer {table name} | 20150604@person_info | data volume

[0039] Here, the table name in the rowkey is the name of the data table, not the name of the historical situation table.

[0040] For rowkey 1, the data volume qualifier holds the total amount of data in the data table named by the rowkey, so the data volume of a given data table can be queried quickly.

[0041] Rowkey 2 is composed of the data table name and the time, with the spacer separating the table name from the time. The amount of data in t...
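
The three rowkey forms can be sketched in Python as follows. This is an illustration under stated assumptions, not the patent's implementation: the helper names, the sample volumes, and the table `order_info` are hypothetical, `@` is taken as the spacer only because the examples in [0037] and [0038] use it, and `prefix_scan` merely imitates over a dictionary how HBase returns rows in lexicographic rowkey order.

```python
SPACER = "@"  # spacer character as it appears in the examples in [0037] and [0038]


def total_key(table):
    """Rowkey 1: {table name}."""
    return table


def table_time_key(table, day):
    """Rowkey 2: {table name} spacer {time}."""
    return f"{table}{SPACER}{day}"


def time_table_key(table, day):
    """Rowkey 3: {time} spacer {table name}."""
    return f"{day}{SPACER}{table}"


def prefix_scan(rows, prefix):
    """Imitate an HBase prefix scan: sorted rowkeys that start with `prefix`."""
    return [(k, v) for k, v in sorted(rows.items()) if k.startswith(prefix)]


# Hypothetical history-table contents (rowkey -> data volume).
rows = {
    total_key("person_info"): 2000,
    table_time_key("person_info", "20150604"): 1200,
    table_time_key("person_info", "20150605"): 800,
    time_table_key("person_info", "20150604"): 1200,
    time_table_key("person_info", "20150605"): 800,
    time_table_key("order_info", "20150604"): 300,
}

# Every day recorded for one table (rowkey form 2).
print(prefix_scan(rows, "person_info" + SPACER))
# Every table recorded on one day (rowkey form 3).
print(prefix_scan(rows, "20150604" + SPACER))
```

One plausible reading of the design is that storing both orderings lets a query by table and a query by day each be served by a scan over adjacent rowkeys, at the cost of writing each increment twice.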

Embodiment 3

[0045] On the basis of Embodiment 2, after obtaining the Sqoop task information, this embodiment records the data volume of the task in the historical situation table; all three rowkeys in the historical situation table must be written or updated.
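
The write-or-update step in [0045] might look like the sketch below. It assumes that "update" means adding the task's row count to the value already stored, and it uses a dictionary as a stand-in for the historical situation table; `record_task` and the sample counts are hypothetical.

```python
def record_task(history, table, day, rows, spacer="@"):
    """Apply one extraction task's row count to all three rowkeys.

    `history` is a dict standing in for the historical situation table
    (rowkey -> data volume); this is an assumption made for the sketch.
    """
    for key in (table, f"{table}{spacer}{day}", f"{day}{spacer}{table}"):
        # Write the rowkey if it is new, otherwise add to the stored volume.
        history[key] = history.get(key, 0) + rows
    return history


history = {}
record_task(history, "person_info", "20150604", 1200)  # hypothetical counts
record_task(history, "person_info", "20150605", 800)

for key in sorted(history):
    print(key, history[key])
```

After the two calls, the `person_info` row holds the running total while the dated rows hold each day's increment.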

Abstract

The invention discloses a Kettle-based method for data extraction and statistics on a big data platform. The method comprises the following steps: the Kettle source code is modified so that the outcome of each data extraction task can be obtained and recorded in an HBase table, which is called the historical situation table and records the data volume of all data tables; a scheduled Sqoop task incrementally extracts data from the relational databases every day, and the data extracted in one run is one day's data increment; and the daily increment of each data table is recorded and written into the HBase table, so that the data volume can be queried by any combination of table and time. Because the increments are recorded during data extraction, no additional time needs to be spent and almost no computing or network resources are consumed.

Description

Technical field

[0001] The invention relates to the technical field of computer software applications, in particular to a Kettle-based data extraction and statistics method for a big data platform.

Background

[0002] With its continuous development, cloud computing has become an important pillar supporting the development of information technology across industries. Distributed clusters based on Hadoop and HBase have become popular research subjects in cloud computing at home and abroad. Hadoop's HDFS provides distributed file storage for the cloud platform, and HBase offers good read and write performance and supports tables with large amounts of data, so it is suitable for online databases and data warehouses with simple business logic and huge data volumes.

[0003] Since HBase itself is not suitable as a business database, the business database is often served by mature relational databases such a...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/284, G06F16/2282, G06F16/254
Inventor: 臧勇真, 魏金雷
Owner: ZHENGZHOU YUNHAI INFORMATION TECH CO LTD