Data quality detection method and device of duplicated data

A data quality detection technology and method, applied to electrical digital data processing, special data processing applications, instruments, etc. It addresses problems such as the lack of rapid detection, and achieves the effects of saving time, using simple formulas, and improving detection efficiency.

Active Publication Date: 2016-04-13
GUANGDONG KINGPOINT DATA SCI & TECH CO LTD
Cites: 5; Cited by: 8

AI Technical Summary

Problems solved by technology

[0006] The purpose of the present invention is to provide a data quality detection method and device for duplicated data that overcome the above-mentioned technical defects and solve the problem of how to detect partially and completely duplicated data accurately and quickly.

Examples

Embodiment 1

[0153] This embodiment differs from the data quality detection method for duplicated data described above in that, as shown in Figure 9 (the flow chart of Embodiment 1 of the method), the data quality detection method further includes:

[0154] Step g: outputting the retained record combinations and the probability that each record combination is a duplicate. Step g follows step f.

[0155] The output of this step can take different forms: it can be displayed visually, or emitted as detection results to facilitate merging records. It can cover all of the retained record combinations and their duplication probabilities, or only part of them.
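As a rough illustration of step g, the sketch below emits either all retained record combinations or only part of them, each with its duplication probability. The function name `format_results`, the `top_n` parameter, and the choice to order by descending probability are assumptions made for this example; the text does not specify an output format or ordering.

```python
def format_results(results, top_n=None):
    """Render retained record combinations with their duplication probability.

    results: list of (record_numbers, probability) tuples (hypothetical shape).
    top_n:   if given, output only the top_n most probable combinations;
             otherwise output all of them.
    """
    # Assumption: the most probable duplicates are listed first.
    ordered = sorted(results, key=lambda item: item[1], reverse=True)
    if top_n is not None:
        ordered = ordered[:top_n]
    return [
        f"records {numbers}: duplicate probability {prob:.2f}"
        for numbers, prob in ordered
    ]


retained = [((3, 7), 0.92), ((1, 4), 0.65), ((2, 9), 0.99)]
for line in format_results(retained, top_n=2):
    print(line)
```

A plain-text listing like this could feed a later record-merging step; a visual display would consume the same (record numbers, probability) pairs.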

Embodiment 2

[0157] This embodiment differs from the data quality detection method for duplicated data described above in that, as shown in Figure 10 (the flow chart of Embodiment 2 of the method), step b further includes:

[0158] Step b1: calculating the similarity between values in the same field of the training samples, and treating values whose similarity reaches or exceeds the threshold as the same value. Step b1 precedes step b2.

[0159] Here, an algorithm computes the similarity between closely similar values within each field. The data quality analyst defines a threshold for the required level of similarity, and values that meet it are treated as the same value.

[0160] The similarity can be calculated with the Levenshtein algorithm, the longest common subsequence algorithm, or other algorithms, and the specific algorithm can be selected according to actual ne...
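A minimal sketch of step b1 using the Levenshtein distance follows. The normalization of the distance into a similarity score between 0 and 1, the helper names, and the rule of mapping each value to the first earlier value that meets the threshold are all assumptions made for illustration; the text only states that values at or above an analyst-defined threshold are treated as the same value.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance by dynamic programming, keeping only two rows."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def unify_similar_values(values, threshold):
    """Replace each value with the first earlier value that is similar enough."""
    representatives = []
    unified = []
    for value in values:
        for rep in representatives:
            if similarity(value, rep) >= threshold:
                unified.append(rep)
                break
        else:
            representatives.append(value)
            unified.append(value)
    return unified


city_field = ["Guangzhou", "Guangzhuo", "Shenzhen"]
print(unify_similar_values(city_field, threshold=0.7))
```

With a threshold of 0.7, the misspelling "Guangzhuo" (edit distance 2 from "Guangzhou") is folded into "Guangzhou", while "Shenzhen" stays distinct. A longest-common-subsequence score could be swapped in for `similarity` without changing `unify_similar_values`.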

Embodiment 3

[0163] This embodiment differs from the data quality detection method for duplicated data described above in that, as shown in Figure 11 (the flow chart of Embodiment 3 of the method), the data quality detection method further includes:

[0164] Step a: extracting training samples from the data source to be detected. Step a precedes step b.

[0165] The data source to be detected contains multiple records, and each record has a corresponding number, the record number; record numbers are arranged in sequence and increase in turn. Each record is divided into multiple fields: field 1, field 2, field 3, field 4, ..., so that every field has one value in each record. A field therefore has as many values as there are records (these values may be the same or different), and the position of each value within a field corresponds to the record number; here, the first valu...

Abstract

The invention discloses a data quality detection method and device for duplicated data. The method comprises: step b, generating a model training set; step c, analyzing every combination pair in the model training set and marking it as record duplication or record non-duplication; step d, calculating the probability of record duplication and screening out field combinations with relatively high probability as sample field combinations; step e, analyzing the values of the data to be detected; and step f, carrying out duplicate detection on the data and screening out the record combinations whose duplicated fields satisfy the sample field combinations. The device comprises a training set generating unit, a sample record duplication marking unit, a sample combination screening unit, a detection data analyzing unit, and a detection data screening unit, one for each step. Because the duplication probability is calculated per field combination, there is no need to compare the duplication probability of every pair of records, which shortens detection time and improves detection efficiency; the method can also detect records that are only partly identical.
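The steps in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented procedure: the probability in step d is taken here to be the share of training pairs agreeing on a field combination that were marked as duplicates, the duplicate marks of step c are supplied by hand, and all function names, the `max_size` limit, and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def agreeing_fields(rec_a, rec_b):
    """Fields on which two records hold the same value."""
    return frozenset(f for f in rec_a if rec_a[f] == rec_b.get(f))


def field_combination_probabilities(samples, duplicate_pairs, max_size=2):
    """Steps b to d: estimate P(duplicate | pair agrees on a field combination).

    samples:         {record_number: {field: value}}
    duplicate_pairs: set of frozenset({i, j}) marked as duplicates (step c)
    """
    agree_count = defaultdict(int)
    dup_count = defaultdict(int)
    for i, j in combinations(sorted(samples), 2):
        same = sorted(agreeing_fields(samples[i], samples[j]))
        is_dup = frozenset((i, j)) in duplicate_pairs
        for size in range(1, max_size + 1):
            for combo in combinations(same, size):
                agree_count[combo] += 1
                if is_dup:
                    dup_count[combo] += 1
    return {c: dup_count[c] / agree_count[c] for c in agree_count}


def select_combinations(probabilities, threshold):
    """Step d: keep field combinations with a high duplication probability."""
    return {c: p for c, p in probabilities.items() if p >= threshold}


def detect(records, selected):
    """Steps e and f: bucket records by their values on each selected
    combination, so no pairwise comparison of all records is needed."""
    results = []
    for combo, prob in selected.items():
        buckets = defaultdict(list)
        for number, rec in records.items():
            buckets[tuple(rec[f] for f in combo)].append(number)
        for numbers in buckets.values():
            if len(numbers) > 1:
                results.append((tuple(numbers), combo, prob))
    return results


samples = {
    1: {"name": "Li Wei", "phone": "123", "city": "Guangzhou"},
    2: {"name": "Li Wei", "phone": "123", "city": "Shenzhen"},
    3: {"name": "Li Wei", "phone": "456", "city": "Guangzhou"},
    4: {"name": "Zhang", "phone": "789", "city": "Guangzhou"},
}
probs = field_combination_probabilities(samples, {frozenset((1, 2))})
selected = select_combinations(probs, threshold=0.8)
records = {
    10: {"name": "A", "phone": "555", "city": "X"},
    11: {"name": "B", "phone": "555", "city": "Y"},
    12: {"name": "C", "phone": "777", "city": "X"},
}
print(detect(records, selected))
```

Bucketing by field values costs one pass over the records per selected combination, which reflects the abstract's point that comparing every pair of records becomes unnecessary. Records 10 and 11 are flagged even though only their phone values match, showing how partial duplication can be caught.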

Description

Technical field

[0001] The invention relates to the technical field of data quality monitoring, in particular to a data quality detection method and device for duplicated data.

Background

[0002] With the rapid development of information technology, data has gradually become one of the most important resources for realizing the business value of enterprises. However, as the amount of data continues to increase, data quality issues follow. Missing data, errors, inconsistencies, and other problems hinder enterprise applications, and can even cause enterprises to make wrong decisions, lose important value, and suffer a crisis of trust.

[0003] To deal with such dirty data, many data quality detection and cleaning schemes have emerged. Duplicate data is one of the more difficult data quality problems to detect, because the duplication problem that enterprises face today includes not only complete duplication of data but also partial duplication. For e...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/215, G06F16/2462
Inventor: 许飞月, 李青海, 简宋全, 侯大勇, 邹立斌
Owner: GUANGDONG KINGPOINT DATA SCI & TECH CO LTD