
A Hybrid Data Cleaning Method Based on Multiple Data Versions

A hybrid data cleaning technology based on multiple data versions, applied in the fields of electrical digital data processing, digital data information retrieval, and special data processing applications. It addresses the problems of long running time, inapplicability, and dependence on training data, and it reduces the detection range, shortens the running time, and achieves high cleaning efficiency.

Active Publication Date: 2022-04-12
ZHEJIANG UNIV

AI Technical Summary

Problems solved by technology

Current mainstream methods fall roughly into two categories, qualitative and quantitative. (1) Qualitative methods mainly clean erroneous data that violates integrity constraint rules, following the principle of minimal change. Their shortcoming is that they cannot clean erroneous data whose repair does not satisfy the minimum-cost principle, even though that data still violates the integrity constraints. (2) Quantitative methods build a suitable model from the probability distribution of the data to determine the cleaning strategy. Their shortcoming is a strong dependence on the training set: sufficient clean, known data must be provided to build a reliable model, which is no longer practical in the current big data environment. Most current quantitative methods also achieve worse cleaning performance than qualitative methods, and existing methods take a long time to run.



Examples


Detailed Description of Embodiments

[0039] The technical solution of the present invention is further described below with reference to the accompanying drawings and a specific implementation:

[0040] As shown in Figure 1, the specific implementation process and working principle of the present invention are as follows:

[0041] Step (1): Input the integrity constraints (ICs) and the data set containing dirty data into the framework. The dirty data set and the integrity constraints are described in Table 1 below:

[0042] Table 1 shows records from a hospital information data set with four attributes: hospital name (HN), city (CT), state (ST), and contact information (PN). Erroneous values are marked with gray shading in Table 1. Three integrity constraints are given:

[0043]

[0044]

[0045]

[0046] where D denotes the data set and t1, t2 denote two different tuples. The functional dependency (FD) rule r1 states that a city can belong to only one state, the deni...
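As an illustration of what it means for records to violate a functional dependency such as "a city can belong to only one state" (CT determines ST), the following minimal Python sketch may help. It is not taken from the patent: the function name `fd_violations` and all record values are hypothetical, and Table 1 itself is not reproduced here.

```python
from collections import defaultdict

# Hypothetical records using the attribute names HN, CT, ST, PN.
# The values are made up for illustration; they are NOT Table 1 of the patent.
records = [
    {"HN": "ALPHA", "CT": "DOTHAN", "ST": "AL", "PN": "100"},
    {"HN": "ALPHA", "CT": "DOTHAN", "ST": "AL", "PN": "100"},
    {"HN": "BETA", "CT": "DOTHAN", "ST": "AK", "PN": "200"},
    {"HN": "GAMMA", "CT": "BOAZ", "ST": "AL", "PN": "300"},
]


def fd_violations(rows, lhs, rhs):
    """Check the functional dependency lhs -> rhs.

    Returns a dict mapping each lhs value that is associated with more
    than one distinct rhs value to the sorted list of those rhs values.
    An empty dict means the dependency holds on the given rows.
    """
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return {key: sorted(vals) for key, vals in seen.items() if len(vals) > 1}


# The city DOTHAN appears with two different states, so CT -> ST is violated.
violations = fd_violations(records, "CT", "ST")
print(violations)  # {'DOTHAN': ['AK', 'AL']}
```

Detection of this kind only shows that at least one of the conflicting records is wrong; deciding which value to repair is the job of the cleaning stages.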


Abstract

The invention discloses a hybrid data cleaning method based on multiple data versions. Using the Markov logic network probabilistic graphical model and the principle of minimal repair, the invention combines qualitative and quantitative techniques into an efficient data cleaning method that detects and corrects erroneous structured data. The cleaning result not only repairs dirty data that violates the rule constraints but also keeps the cost of changing the data set to a minimum and conforms to the statistical characteristics of the data. The invention first partitions the entire data set into blocks and groups using a Markov logic index technique, and then performs two-stage data cleaning. In the first stage, a reliability score is introduced as the evaluation criterion, and the data in each group is cleaned to obtain multi-version cleaning results. In the second stage, these results are fused to generate a final, unified cleaning result.
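The two-stage flow described in the abstract can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the patented method: this excerpt does not define the Markov logic index or the reliability score, so the sketch assumes one block per rule, groups rows by the value on the left-hand side of that rule, and uses a plain frequency count as a stand-in for the reliability score. The function names, rules, and data are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical rows and rules (lhs -> rhs); not data from the patent.
rows = [
    {"CT": "DOTHAN", "ST": "AL", "PN": "100"},
    {"CT": "DOTHAN", "ST": "AL", "PN": "100"},
    {"CT": "DOTHAN", "ST": "AK", "PN": "100"},
    {"CT": "BOAZ", "ST": "AL", "PN": "300"},
]
rules = [("CT", "ST"), ("PN", "ST")]


def clean_by_rule(data, lhs, rhs):
    """Stage 1 (assumed form): produce one cleaned version for one rule.

    Rows are grouped by their lhs value; inside each group the most
    frequent rhs value replaces the others. Frequency is only a
    stand-in for the reliability score mentioned in the abstract.
    """
    groups = defaultdict(list)
    for i, row in enumerate(data):
        groups[row[lhs]].append(i)
    version = [dict(row) for row in data]
    for indices in groups.values():
        counts = Counter(data[i][rhs] for i in indices)
        best = counts.most_common(1)[0][0]
        for i in indices:
            version[i][rhs] = best
    return version


def fuse(data, versions):
    """Stage 2 (assumed form): merge the per-rule versions cell by cell.

    A cell keeps its original value unless some version changed it; if
    several versions changed it, the most common proposal is used.
    """
    fused = [dict(row) for row in data]
    for i, row in enumerate(data):
        for attr, old in row.items():
            proposals = Counter(
                v[i][attr] for v in versions if v[i][attr] != old
            )
            if proposals:
                fused[i][attr] = proposals.most_common(1)[0][0]
    return fused


versions = [clean_by_rule(rows, lhs, rhs) for lhs, rhs in rules]
fused = fuse(rows, versions)
print([row["ST"] for row in fused])  # ['AL', 'AL', 'AL', 'AL']
```

In this toy run both rule versions propose the same repair for the third row, so fusion is trivial; the interesting cases are those where versions disagree, which is where a real reliability score would matter.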

Description

Technical field

[0001] The invention relates to a technology for cleaning erroneous data in the field of computer databases, and in particular to a hybrid data cleaning method based on multiple data versions.

Background

[0002] The purpose of data cleaning is to find the content of a data set that is most likely to be erroneous and to provide a reliable method for correcting it. Dirty data is data in the data set that contains errors.

[0003] Today, with the continuous emergence of new ways of publishing information, represented by social networks and e-commerce, and with the rise of cloud computing and the Internet of Things, data is growing and accumulating at an unprecedented rate. In data analysis, the presence of dirty data not only leads to bad decisions and unreliable analysis but also causes financial losses for companies. There has therefore been strong interest in data cleaning in both industry and academia. Data cleaning is the pr...


Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F16/215
Inventor: 高云君, 陈刚, 陈纯, 葛丛丛
Owner: ZHEJIANG UNIV