An automatic data cleaning method based on DeepDive

An automatic data cleaning technology, applied in the field of data processing

Inactive Publication Date: 2019-06-28
SOUTHWEST UNIVERSITY FOR NATIONALITIES

AI Technical Summary

Problems solved by technology

[0004] To solve the problem of cleaning data automatically when no existing patterns/rules are available and without manual participation, the present invention proposes an automatic data cleaning method based on DeepDive, thereby solving the aforementioned problems in the prior art.



Examples


Detailed Description of Embodiments

[0018] The present invention is further described below with reference to the embodiments and the accompanying drawings:

[0019] The present invention proposes an automatic data cleaning method based on DeepDive. The automatic data cleaning flow chart is shown in Figure 1. The technical solution adopted to solve the technical problem comprises the following steps:

[0020] 1. Data preprocessing

[0021] Set a threshold for the size of the data to be cleaned. Calculate the size of the data to be cleaned, that is, the number of tuples it contains. If the size exceeds the threshold, randomly sample the original data to obtain the sampled data; otherwise, keep the original data.
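
The preprocessing step can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `preprocess`, the `sample_size` and `seed` parameters, and the choice to draw as many tuples as the threshold by default are all assumptions, since the source does not state how many tuples are sampled.

```python
import random


def preprocess(tuples, threshold, sample_size=None, seed=None):
    """Return the data to learn from: a random sample if the input is too large."""
    data = list(tuples)
    size = len(data)  # size of the data = number of tuples
    if size <= threshold:
        return data  # small enough: keep the original data
    # Assumption: the sample size is not given in the source; default to the threshold.
    k = threshold if sample_size is None else min(sample_size, size)
    rng = random.Random(seed)
    return rng.sample(data, k)  # sampling without replacement
```

Sampling without replacement keeps each original tuple at most once, so the sample is always a subset of the original tuples.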

[0022] 2. Data model learning

[0023] In the absence of ready-made data cleaning patterns/rules, a data model is learned from the data obtained after preprocessing, in order to discover the implicit, non-absolute or relatively weak dependencies in the data and to represent them in the form of a Bayesian network. In the learning phase of the ...
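
To make the idea of a non-absolute dependency concrete, the sketch below estimates a conditional probability table P(child | parent) for one candidate edge of a Bayesian network. It is only an illustration: the learning procedure itself is truncated in the source, and the helper `conditional_table`, the attribute names `zip` and `city`, and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def conditional_table(rows, parent, child):
    """Estimate P(child | parent) from a list of dict rows, skipping missing values."""
    counts = defaultdict(Counter)
    for row in rows:
        p, c = row.get(parent), row.get(child)
        if p is None or c is None:
            continue  # a missing cell carries no evidence for this edge
        counts[p][c] += 1
    return {
        p: {c: n / sum(cs.values()) for c, n in cs.items()}
        for p, cs in counts.items()
    }


# Hypothetical dirty data: one row contains a misspelled city.
rows = [
    {"zip": "610041", "city": "Chengdu"},
    {"zip": "610041", "city": "Chengdu"},
    {"zip": "610041", "city": "Chengdu"},
    {"zip": "610041", "city": "Chengdo"},
    {"zip": "100000", "city": "Beijing"},
]
cpt = conditional_table(rows, "zip", "city")
```

In this made-up data the dependency from `zip` to `city` holds in three of the four rows for `610041`, so it is strong but not absolute, which is the kind of relationship a strict rule would miss and a probabilistic model can still use.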


Abstract

The invention discloses an automatic data cleaning method based on DeepDive. The method comprises the following steps: (1) comparing the scale of the original data with a set threshold, and performing random sampling when the scale exceeds the threshold to obtain sampled data; (2) learning a Bayesian network between attributes from the original data or the sampled data; (3) converting the learned Bayesian network into first-order predicate logic rules; (4) calculating the weight of each first-order predicate logic rule using mutual information theory, and converting the weighted first-order predicate logic rules into a Markov logic network; (5) generating DeepDive rules based on the Markov logic network; (6) performing probabilistic inference on erroneous/missing data based on DeepDive to obtain the probability that each attribute of each tuple takes different values; and (7) using the inference results to clean the original dirty data. The method can clean data automatically without existing data quality patterns/rules and without manual intervention, and can effectively improve the efficiency and quality of data cleaning.
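
Steps (4) and (7) can be illustrated with a short sketch. It assumes that the weight of a rule linking two attributes is the empirical mutual information between them, and that the inference output for one tuple is available as a plain dictionary of per-attribute value probabilities. Both are assumptions for illustration: the source does not give the exact weight formula, and no DeepDive API is used here. The names `mutual_information`, `repair`, and `min_prob` are hypothetical.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Empirical mutual information I(X; Y), in bits, from a list of (x, y) pairs."""
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), nxy in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (nxy / n) * math.log2(nxy * n / (px[x] * py[y]))
    return mi


def repair(row, marginals, min_prob=0.5):
    """Set each inferred attribute to its most probable value if it is likely enough.

    `marginals` maps attribute -> {value: probability} for this one tuple and
    stands in for the inference output; `min_prob` is an assumed cutoff.
    """
    fixed = dict(row)
    for attr, dist in marginals.items():
        if not dist:
            continue
        value, prob = max(dist.items(), key=lambda kv: kv[1])
        if prob >= min_prob:
            fixed[attr] = value
    return fixed
```

With this weighting, two attributes that determine each other get a large weight and two independent attributes get a weight of zero, so weak dependencies contribute weakly to the Markov logic network instead of acting as hard constraints.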

Description

Technical field

[0001] The invention relates to the technical field of data processing, in particular to an automatic data cleaning method based on DeepDive.

Background art

[0002] Dirty data is very common in the real world, and cleaning dirty data is a long-standing problem. In the era of big data, the importance of data cleaning is even more prominent. Detecting errors in dirty data and repairing them is one of the main challenges in the field of data analysis, because poor data quality can greatly reduce the accuracy of analysis results. Generally speaking, data cleaning consists of two stages: the first stage is error detection, which finds the errors or abnormal values contained in the data; the second stage is error repair, which repairs those errors or abnormal values. Most existing data cleaning methods rely on existing constraint rules or patterns for error detection and repair, which mainly have the following limitations: (1) Existing methods...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F16/215
Inventor: 李卫榜, 李玲, 谈文蓉, 崔梦天
Owner: SOUTHWEST UNIVERSITY FOR NATIONALITIES