Hadoop data cleaning method and system based on outlier mining

A data cleaning technology based on outlier (isolated point) mining, applied in the fields of electrical digital data processing, special data processing applications, instruments, etc. It addresses the problem that differing definitions of erroneous data lead to differing cleaning solutions, with few general solutions available. Its effects are accurate data cleaning, support for cleaning massive data, and improved data cleaning efficiency.

Status: Inactive | Publication Date: 2015-12-09
UESTC COMSYS INFORMATION
Cites: 2 | Cited by: 7

AI Technical Summary

Problems solved by technology

Because erroneous data is defined differently from case to case, the solutions for cleaning it also differ, which ha...


Examples


Detailed Description of the Embodiments

[0024] The technical solution of the present invention will be further described below in conjunction with the accompanying drawings.

[0025] As shown in Figure 1, a Hadoop data cleaning method based on outlier mining includes the following steps:

[0026] S1. Load data from various heterogeneous data sources into the Hadoop distributed file system;

[0027] S2. Preprocess the data in the Hadoop distributed file system: pull the data to be cleaned from the Hadoop distributed file system, mine the outliers (isolated points) with abnormal attributes from the data to be cleaned, and record the number of outliers as N;

[0028] S3. Judge whether the outliers obtained in S2 satisfy the cleaning rules, and clean the outliers that do. This covers the following three cases:

[0029] S31. If all N outliers satisfy the cleaning rules, clean all N outliers according to the cleaning rules...
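Steps S2 and S3 can be illustrated with a small in-memory sketch. The text shown here does not say which outlier mining algorithm or which cleaning rules the invention uses, so everything below is an assumption made for illustration: the modified z-score (median absolute deviation) detector, the function names `mine_outliers` and `clean_outliers`, the `age` attribute, and the "extra zero" rule are all hypothetical. The sketch only shows the shape of the flow: mine outliers, count them as N, then clean those that satisfy a rule and leave the rest untouched.

```python
from statistics import median

def mine_outliers(records, attr, threshold=3.5):
    """Return indices of records whose `attr` value is an outlier.

    Assumption: uses the modified z-score based on the median absolute
    deviation (MAD); the source does not name a specific algorithm.
    """
    values = [r[attr] for r in records]
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread, so nothing can be flagged by this detector
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]

def clean_outliers(records, attr, rules):
    """Apply the first matching (applies, fix) rule to each outlier.

    Returns the cleaned copy, the indices that were cleaned, and the
    indices of outliers that satisfied no rule.
    """
    cleaned = [dict(r) for r in records]      # do not mutate the input
    matched, unmatched = [], []
    for i in mine_outliers(records, attr):
        for applies, fix in rules:
            if applies(cleaned[i]):
                cleaned[i] = fix(cleaned[i])
                matched.append(i)
                break
        else:
            unmatched.append(i)
    return cleaned, matched, unmatched

# Hypothetical data and rule: an age above 150 that is a multiple of 10
# is treated as an entry error with one extra trailing zero.
records = [{"age": a} for a in (31, 29, 35, 33, 30, 32, 340, -5)]
rules = [(lambda r: r["age"] > 150 and r["age"] % 10 == 0,
          lambda r: {**r, "age": r["age"] // 10})]

cleaned, matched, unmatched = clean_outliers(records, "age", rules)
n_outliers = len(matched) + len(unmatched)    # the count N from step S2
```

In this example two outliers are mined (340 and -5). The first satisfies the hypothetical rule and is corrected to 34; the second satisfies no rule and is returned unchanged in `unmatched`, so a caller can decide separately how to handle it.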

Abstract

The invention discloses a Hadoop data cleaning method and system based on outlier mining. The method comprises the following steps: (1) data from various heterogeneous data sources are loaded into the Hadoop distributed file system; (2) the data in the Hadoop distributed file system are preprocessed: the data to be cleaned are pulled, outliers with abnormal attributes are mined from them, and the number of outliers is recorded as N; (3) each outlier obtained in step 2 is checked against the cleaning rules, and the outliers that satisfy the rules are cleaned; (4) the data cleaned in step 3 are output. By finding unreasonable outliers through outlier mining and applying the corresponding cleaning actions, the method cleans outlier data accurately, reduces repeated cleaning, and improves data cleaning efficiency, so that massive data can be cleaned.
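Because the method runs on Hadoop, the statistics needed to spot abnormal attribute values would normally be computed in a distributed way. The following is a minimal single-process sketch of how per-attribute mean and standard deviation could be expressed in a map/reduce shape. It is an assumption for illustration only: the source does not state that MapReduce jobs or these statistics are used, and the names `map_record`, `reduce_stats`, and `attribute_stats` are hypothetical. The dictionary grouping stands in for Hadoop's shuffle phase.

```python
from collections import defaultdict
from math import sqrt

def map_record(record):
    """Map phase: emit (attribute, (count, sum, sum of squares)) per numeric field."""
    for attr, value in record.items():
        if isinstance(value, (int, float)):
            yield attr, (1, value, value * value)

def reduce_stats(partials):
    """Reduce phase: combine partial triples into (mean, population std dev)."""
    n = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    total_sq = sum(p[2] for p in partials)
    mean = total / n
    variance = max(total_sq / n - mean * mean, 0.0)  # guard against rounding
    return mean, sqrt(variance)

def attribute_stats(records):
    """Group mapper output by attribute (the shuffle), then reduce each group."""
    grouped = defaultdict(list)
    for record in records:
        for attr, partial in map_record(record):
            grouped[attr].append(partial)
    return {attr: reduce_stats(parts) for attr, parts in grouped.items()}

# Hypothetical records; the non-numeric "name" field is skipped by the mapper.
stats = attribute_stats([{"name": "a", "x": 1}, {"name": "b", "x": 3}])
```

The triples are additive, so partial results from different data blocks can be combined in any order, which is what makes this form suitable for a combiner or reducer. A value can then be compared against its attribute's mean and standard deviation to decide whether it is a candidate outlier.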

Description

Technical Field

[0001] The invention belongs to the field of computer information analysis and data processing, and in particular relates to a Hadoop data cleaning method and system based on outlier mining.

Background

[0002] With the wide application and development of database technology, the data warehouse, a data environment built on top of databases, has emerged to meet managers' needs for decision-making analysis. Building a data warehouse requires importing large amounts of data from various heterogeneous data sources. These data suffer from quality problems such as omissions, entry errors, and incompleteness. Erroneous data makes operations more costly and more time-consuming. It also greatly affects the correctness of models extracted from the data set and the accuracy of derived rules, which will mislea...


Application Information

IPC(8): G06F17/30
CPC: G06F16/182, G06F16/215
Inventor: 唐雪飞陈科吴亚骏陈安龙江莹刘明鸣胡略杨桥
Owner: UESTC COMSYS INFORMATION