Mass data quality verification method based on Hadoop

A data quality verification method, applied in the field of big data, that addresses problems such as increased learning costs, slow investigation cycles, and growing labor and financial costs, with the effect of reducing labor costs, consuming fewer resources, and facilitating compatible development.

Pending Publication Date: 2020-07-10
SHANGHAI DATATOM INFORMATION TECH CO LTD

AI Technical Summary

Problems solved by technology

Data quality is a particularly important part of data governance when building a data center on top of a data warehouse. If non-compliant data is not standardized and screened, it creates storage problems, makes much valuable information very difficult to obtain, pollutes the data set with large amounts of invalid data, and increases unnecessary investment of human and financial resources.
[0004] In the traditional data quality process, many people overlook the importance of data quality, so a lot of normal data is polluted by abnormal data. The problem is usually discovered passively by downstream users or application teams, who then ask the big data analysis team to find the cause of the abnormal data, and that team in turn traces upstream to find the root cause.
This leads to slow investigation cycles, complex processes that are time-consuming and laborious and that only specialized personnel understand, increased learning costs, data accumulation, and many other problems.



Examples


Embodiment Construction

[0035] The present invention will be further described below in conjunction with the drawings.

[0036] The Hadoop-based mass data quality verification method of the present invention includes the following steps:

[0037] Step 1. Define data quality standards, including:

[0038] Regular-expression rules, namely: rules formulated as custom regular expressions.

[0039] Verification rules, namely: email address verification, mobile phone number verification, license plate number verification, etc.

[0040] Judgment rules, namely: checking the content length, whether the value is empty, and whether the data falls within a given range.

[0041] Content format rules, such as whether to include certain specific content.
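The rule categories in [0038] to [0041] can be pictured as small predicates applied field by field. The following is only a minimal sketch under stated assumptions: the names `regex_rule`, `not_empty`, `length_between`, `in_range`, `contains`, and `validate` are hypothetical, and the email and mobile patterns are deliberately simplified illustrations, not the patterns used by the invention.

```python
import re
from typing import Callable, Dict, List, Optional

# Hypothetical rule type: takes a field value and returns True when it passes.
Rule = Callable[[Optional[str]], bool]


def regex_rule(pattern: str) -> Rule:
    """Regular-expression rule: the whole value must match a custom pattern."""
    compiled = re.compile(pattern)
    return lambda v: v is not None and compiled.fullmatch(v) is not None


def not_empty() -> Rule:
    """Judgment rule: the value must be present and not blank."""
    return lambda v: v is not None and v.strip() != ""


def length_between(lo: int, hi: int) -> Rule:
    """Judgment rule: the content length must lie in [lo, hi]."""
    return lambda v: v is not None and lo <= len(v) <= hi


def in_range(lo: float, hi: float) -> Rule:
    """Judgment rule: the value must parse as a number inside [lo, hi]."""
    def check(v: Optional[str]) -> bool:
        try:
            return v is not None and lo <= float(v) <= hi
        except ValueError:
            return False
    return check


def contains(text: str) -> Rule:
    """Content format rule: the value must include some specific content."""
    return lambda v: v is not None and text in v


# Verification rules built from simplified, assumed patterns.
EMAIL = regex_rule(r"[^@\s]+@[^@\s]+\.[^@\s]+")
MOBILE_CN = regex_rule(r"1[3-9]\d{9}")


def validate(record: Dict[str, str], rules: Dict[str, Rule]) -> List[str]:
    """Return the names of the fields whose value fails its rule."""
    return [field for field, rule in rules.items() if not rule(record.get(field))]
```

A caller would map each column to one rule and pass each record through `validate`; the returned field names identify where the abnormal data sits.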

[0042] Algorithm rules for specific scenarios, such as credit card number generation rules; ID card numbers must have the first six digits representing the administrative division, digits seven to fourteen representing the date of birth, the seventeenth digit representing gender, and the last digit meeting the verific...
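An algorithm rule of the kind described in [0042] might look like the sketch below for 18-character ID card numbers. The source text is truncated before it states the check rule, so the weights and check characters here are an assumption taken from the public GB 11643-1999 scheme (ISO 7064 MOD 11-2), and the function names are hypothetical. The sketch does not verify that the first six digits are a real administrative division code, since that needs a lookup table.

```python
from datetime import datetime

# Assumption: weights and check characters follow GB 11643-1999 (ISO 7064 MOD 11-2).
WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CHARS = "10X98765432"


def check_id_number(id_number: str) -> bool:
    """Return True when an 18-character ID number is structurally valid."""
    body = id_number[:17]
    if len(id_number) != 18 or not (body.isascii() and body.isdigit()):
        return False
    # Digits 7 to 14 must form a real calendar date (date of birth).
    try:
        datetime.strptime(id_number[6:14], "%Y%m%d")
    except ValueError:
        return False
    # The last character must equal the check character computed from the first 17 digits.
    total = sum(int(digit) * weight for digit, weight in zip(body, WEIGHTS))
    return CHECK_CHARS[total % 11] == id_number[17].upper()


def gender_of(id_number: str) -> str:
    """Read gender from the seventeenth digit: odd is male, even is female."""
    return "male" if int(id_number[16]) % 2 == 1 else "female"
```

Checking the date before the check character rejects impossible birthdays early, which keeps the more specific failure reason easy to report if the function is later extended to return one.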


Abstract

The invention discloses a Hadoop-based mass data quality verification method. The method comprises the following steps: formulating a data quality standard; for a DDL instruction, writing the metadata of the created table into Hive; for a DQL statement, converting the SQL string into an abstract syntax tree, performing syntax analysis on the abstract syntax tree, checking against the data quality standard whether the semantics of the newly generated SQL are wrong, and adding extension information; compiling the abstract syntax tree into a corresponding logical execution plan, optimizing the logical execution plan, converting the optimized logical execution plan into a physical plan, generating a MapReduce job, submitting the MapReduce job to YARN for execution, and finally returning the execution result; and storing the returned execution result in HDFS and carrying out data visualization and abnormal data export, tracking, and tracing. The method thereby achieves data quality verification with abnormal data display, traceability, easy configuration, and easy classification.
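The split between DDL and DQL handling described in the abstract can be sketched as a small router. This is an illustration only: the keyword sets, the stage labels, and the functions `classify` and `stages_for` are assumptions made for the sketch, and the stages are plain strings standing in for the real parsing, planning, and job submission steps.

```python
import re
from typing import List

# Assumed keyword sets; a real SQL parser decides this from the grammar.
DDL_KEYWORDS = {"CREATE", "ALTER", "DROP"}
DQL_KEYWORDS = {"SELECT", "WITH"}

DDL_STAGES = ["write table metadata to Hive"]
DQL_STAGES = [
    "convert SQL string to abstract syntax tree",
    "analyze syntax and check semantics against the data quality standard",
    "compile abstract syntax tree to logical execution plan",
    "optimize logical execution plan",
    "convert optimized plan to physical plan",
    "generate MapReduce job",
    "submit MapReduce job to YARN",
    "store execution result in HDFS",
]


def classify(sql: str) -> str:
    """Classify a statement as DDL, DQL, or OTHER by its leading keyword."""
    match = re.match(r"\s*([A-Za-z]+)", sql)
    keyword = match.group(1).upper() if match else ""
    if keyword in DDL_KEYWORDS:
        return "DDL"
    if keyword in DQL_KEYWORDS:
        return "DQL"
    return "OTHER"


def stages_for(sql: str) -> List[str]:
    """Return the ordered stage labels a statement would pass through."""
    kind = classify(sql)
    if kind == "DDL":
        return list(DDL_STAGES)
    if kind == "DQL":
        return list(DQL_STAGES)
    raise ValueError(f"unsupported statement type: {kind}")
```

For example, `stages_for("CREATE TABLE t (id INT)")` yields the single metadata step, while a `SELECT` statement yields the full chain from syntax tree to HDFS storage.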

Description

Technical field

[0001] The invention relates to the technical field of big data, and in particular to a method for checking data quality.

Background technique

[0002] With the rapid development of information technology and Internet technology, the amount of data is growing explosively, the types of data are gradually increasing, and their complexity keeps rising. Modern society has entered the era of big data. In this context, to give full play to the application value of big data, data quality management must be strengthened, and the security, accuracy, and stability of data transmission and use must be improved.

[0003] Over the past decades, large-scale relational databases such as Oracle have been the mainstay. In recent years, open source databases have emerged one after another, including relational databases such as MySQL and PostgreSQL, and many semi-structured databases, such as Elasticsearch, MongoDB, etc....


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/215, G06F16/2458, G06F16/2453, G06F16/182, G06F16/242
CPC: G06F16/215, G06F16/2471, G06F16/2453, G06F16/182, G06F16/2433
Inventor: 李青枝, 谢赟, 吴新野, 黄海清, 陈大伟
Owner: SHANGHAI DATATOM INFORMATION TECH CO LTD