Data cleaning method based on data warehouse

A data cleaning method based on data warehouse technology, applied in the fields of electrical digital data processing, special data processing applications, instruments, etc. It addresses problems such as the complexity of the cleaning process, with the effect of improving cleaning efficiency, ensuring correctness, and reducing complexity.

Status: Inactive. Publication date: 2015-06-10
INSPUR GROUP CO LTD
3 cites, 12 cited by

AI Technical Summary

Problems solved by technology

This can only be done for small-batch data sources.
[0007] 2) Through specially written programs. However, data cleaning is usually an iterative process, which makes cleaning with such programs complicated.

Examples

Embodiment 1

[0021] The data cleaning method is implemented in five steps: preprocessing, assigning weights to attributes, duplicate record detection, database-level duplicate record clustering, and conflict handling.

[0022] Preprocessing: because the amount of data is large, select for record matching only the attributes that represent the characteristics of a record.

[0023] Assigning weights to attributes: assign a different weight to each attribute according to its importance in determining the similarity of two records. When computing record similarity, attributes of greater importance receive higher weights. For example, the name attribute is weighted higher than the gender attribute, because the name better reflects the characteristics of a record. During duplicate record cleaning, the weights can be adjusted to find more duplicate records.
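The weighting idea in [0023] can be sketched in Python as follows. This is only an illustration under stated assumptions: the attribute names, the weight values, and the use of `difflib.SequenceMatcher` as the per-attribute similarity measure are not given in the source, which says only that more important attributes (such as name) should outweigh less important ones (such as gender).

```python
from difflib import SequenceMatcher

# Hypothetical weights; the source only states that name outweighs gender.
WEIGHTS = {"name": 0.6, "address": 0.3, "gender": 0.1}


def attribute_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two attribute values (assumed measure)."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def record_similarity(r1: dict, r2: dict, weights: dict = WEIGHTS) -> float:
    """Weighted average of the per-attribute similarities."""
    total = sum(weights.values())
    score = sum(
        w * attribute_similarity(str(r1.get(attr, "")), str(r2.get(attr, "")))
        for attr, w in weights.items()
    )
    return score / total


# Illustrative records (made up for this sketch).
a = {"name": "Zhang Wei", "address": "12 Main St", "gender": "M"}
b = {"name": "Zhang Wei", "address": "12 Main Street", "gender": "M"}
c = {"name": "Li Na", "address": "12 Main St", "gender": "M"}

sim_ab = record_similarity(a, b)  # same name, slightly different address
sim_ac = record_similarity(a, c)  # different name, same address and gender
```

Because the name carries the largest weight, `sim_ab` comes out higher than `sim_ac` even though `a` and `c` share an identical address. Raising or lowering the weights, or the threshold applied to the resulting score, changes how many pairs are flagged as duplicates.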

[0024] Database-level duplicate r...

Embodiment 2

[0027] Take cleaning the associations between business tables as an example:

[0028] 1) Select the main table and establish the association between the main table and the auxiliary table.

[0029] 2) Use an SQL statement to left outer join the main table to the auxiliary table, and find the records in the main table that cannot be associated with the auxiliary table.

[0030] 3) Analyze the records that cannot be associated case by case. For genuinely dirty data, add a "default record" to the auxiliary table, and associate all the unassociated records in the main table with that "default record".
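The three steps above can be sketched with Python's built-in `sqlite3` module. The table names (`orders` as main table, `customer` as auxiliary table), the sample rows, and the default record's id and name are all hypothetical; the case-by-case analysis in step 3 is a manual judgment and is not modeled here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Step 1: main table (orders) references auxiliary table (customer).
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customer VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1), (11, 2), (12, 99), (13, NULL);
""")

# Step 2: left outer join; main-table rows with no match have c.id IS NULL.
ORPHANS_SQL = """
    SELECT o.id
    FROM orders AS o
    LEFT OUTER JOIN customer AS c ON o.customer_id = c.id
    WHERE c.id IS NULL
    ORDER BY o.id
"""
orphans = [row[0] for row in conn.execute(ORPHANS_SQL)]

# Step 3: assuming the orphans were judged to be dirty data, add a
# "default record" and point every orphan at it.
DEFAULT_ID = 0  # hypothetical id reserved for the default record
conn.execute("INSERT INTO customer VALUES (?, ?)", (DEFAULT_ID, "UNKNOWN"))
conn.executemany(
    "UPDATE orders SET customer_id = ? WHERE id = ?",
    [(DEFAULT_ID, order_id) for order_id in orphans],
)
conn.commit()

# Re-running the check should now find no unassociated records.
remaining = conn.execute(ORPHANS_SQL).fetchall()
```

Pointing orphans at a default record keeps every main-table row joinable without deleting it, at the cost of hiding the original (invalid) key unless it is logged elsewhere first.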

Abstract

The invention discloses a data cleaning method based on a data warehouse. The method comprises preprocessing, assigning weights to attributes, duplicate record detection, database-level duplicate record clustering, and conflict handling. Preprocessing: select the attributes used for record matching, which must represent the characteristics of a record. Assigning weights to attributes: assign a different weight to each attribute according to its importance in determining the similarity of two records. Database-level duplicate record clustering: apply a duplicate record clustering algorithm within the database to reduce the scope of record comparison and to cluster the duplicate records of the whole data set. Conflict handling: merge or delete the detected duplicate records within the same duplicate record cluster, retaining the correct record. The method can detect and correct errors in massive data sources, effectively reduces the complexity of the cleaning process, improves cleaning efficiency, ensures the quality of the data set, and improves the operation of the data warehouse.
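The clustering and conflict handling steps might look roughly like the Python sketch below. The source does not name a clustering algorithm or a merge rule, so everything specific here is an assumption: records are first grouped into blocks by a key so that only records in the same block are compared, matching pairs are joined with a union-find structure, and conflicts are resolved by keeping the first non-empty value of each attribute. The threshold, attribute names, and sample records are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

THRESHOLD = 0.85  # hypothetical duplicate threshold


def similar(r1: dict, r2: dict) -> float:
    """Assumed similarity: string ratio over the name attribute only."""
    return SequenceMatcher(None, r1["name"].lower(), r2["name"].lower()).ratio()


def cluster_duplicates(records: list, block_key) -> list:
    """Group record indices into duplicate clusters, comparing only within blocks."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Blocking narrows the comparison scope to records sharing a key.
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[block_key(rec)].append(idx)

    for members in blocks.values():
        for i, j in combinations(members, 2):
            if similar(records[i], records[j]) >= THRESHOLD:
                parent[find(i)] = find(j)

    clusters = defaultdict(list)
    for idx in range(len(records)):
        clusters[find(idx)].append(idx)
    return list(clusters.values())


def resolve(records: list, cluster: list) -> dict:
    """Conflict handling (assumed rule): first non-empty value per attribute wins."""
    merged = {}
    for idx in cluster:
        for attr, value in records[idx].items():
            if not merged.get(attr):
                merged[attr] = value
    return merged


# Illustrative records (made up for this sketch).
records = [
    {"name": "Zhang Wei", "city": "Jinan", "phone": ""},
    {"name": "zhang wei", "city": "Jinan", "phone": "555-0100"},
    {"name": "Li Na", "city": "Jinan", "phone": ""},
    {"name": "Zhang Wei", "city": "Beijing", "phone": ""},
]
clusters = cluster_duplicates(records, block_key=lambda r: r["city"])
cleaned = [resolve(records, c) for c in clusters]
```

In this example the two Jinan records for Zhang Wei are merged into one, while the Beijing record is never compared with them because it falls in a different block. That is the trade-off of narrowing the comparison scope: fewer comparisons, but duplicates that disagree on the blocking key are missed.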

Description

Technical field

[0001] The invention relates to the technical field of computer data processing, and in particular to a data cleaning method based on a data warehouse.

Background art

[0002] The rapid development of information technology has made organizational leaders increasingly dependent on data. As a result, the data warehouse emerged on top of the database as a data environment that can meet the needs of decision-making analysis. However, data imported into the data warehouse from heterogeneous data sources has various problems, so data cleaning must be carried out to improve its quality. A data warehouse is a subject-oriented, integrated, relatively stable data collection that reflects historical changes. It gathers multiple heterogeneous data sources, which are integrated and then reorganized by subject.

[0003] When extracting data from multiple data sources in the database, because the design of the data table stru...

Application Information

IPC(8): G06F17/30
CPC: G06F16/215
Inventors: 焦毓葳, 孙海峰, 王传超
Owner: INSPUR GROUP CO LTD