Data cleaning method and device

A data cleaning technology for target data, applied in the information field. It addresses prior-art problems such as the inability to clean out dirty data with incorrect values and a poor cleaning effect. It improves the cleaning effect and reduces the probability of misidentifying clean data as dirty data.

Active Publication Date: 2017-10-03
ALIBABA GRP HLDG LTD

AI Technical Summary

Problems solved by technology

[0003] In the prior art, data cleaning traverses all the cleaning rules for all the data after the data is output. The cleaning rules are common to all businesses and mainly check whether the data is incomplete, whether the data format is wrong, and so on. This approach can only clean out the more obvious dirty data. When the dirty data has, for example, incorrect values, it cannot be cleaned out, so the data obtained after cleaning still contains dirty data and the cleaning effect is poor.

Examples

Embodiment 1

[0023] Figure 1 is a schematic flow diagram of a data cleaning method provided in Embodiment 1 of the present invention. As shown in Figure 1, the method includes:

[0024] Step 101. Match cleaning rules according to the data characteristics of the target data.

[0025] Here, the data features are used to describe the target data.

[0026] Specifically, the data-related information may be obtained from the requesting end that requests cleaning of the target data. Examples of data-related information include the original business that generates the target data, the target business that will use the target data, the original computing task that generates the target data in the original business, and/or the target computing task that will use the target data in the target business.
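
Step 101 can be illustrated with a minimal Python sketch. It assumes the data features are the four pieces of data-related information listed in [0026] and that each cleaning rule declares which feature values it applies to. The names DataFeatures, CleaningRule, and match_rules are hypothetical and do not come from the patent text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DataFeatures:
    """Hypothetical container for the data-related information in [0026]."""
    source_business: Optional[str] = None
    target_business: Optional[str] = None
    source_task: Optional[str] = None
    target_task: Optional[str] = None


@dataclass(frozen=True)
class CleaningRule:
    """Hypothetical rule record: a name plus the features it applies to."""
    name: str
    # A field left as None means "applies to any value of that feature".
    applies_to: DataFeatures


def match_rules(features: DataFeatures, rules: List[CleaningRule]) -> List[CleaningRule]:
    """Return the rules whose declared features all agree with the target data."""
    matched = []
    for rule in rules:
        wanted = rule.applies_to
        pairs = [
            (wanted.source_business, features.source_business),
            (wanted.target_business, features.target_business),
            (wanted.source_task, features.source_task),
            (wanted.target_task, features.target_task),
        ]
        if all(w is None or w == got for w, got in pairs):
            matched.append(rule)
    return matched
```

In this sketch a rule with no declared features matches every data set, while a rule tied to a specific target business is selected only for data destined for that business. The patent excerpt does not specify how matching is actually implemented.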

[0027] The original business that generates the target data, the target business that the target data needs to be used for, the original computing task that generates the target data in the origi...

Embodiment 2

[0036] Figure 2 is a schematic flow diagram of a data cleaning method provided in Embodiment 2 of the present invention. As shown in Figure 2, the method includes:

[0037] Step 201. Configure cleaning rules.

[0038] Specifically, the cleaning rules can be configured in advance, either manually by the user or automatically by the data cleaning platform based on the existing cleaning rules.

[0039] As a possible implementation form, the cleaning rule includes three levels: first-level cleaning sub-rules, second-level cleaning sub-rules, and third-level cleaning sub-rules. The three levels are described below:

[0040] A. The first-level cleaning sub-rules consist of rules common to every business and are mainly used to identify incomplete, duplicate, and obviously wrong dirty data.

[0041] For example, the first-level cleaning sub-rules may include: a field in the data cannot be empty, the data has been complete...
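
The first-level cleaning sub-rules described in [0040] can be sketched as generic checks that do not depend on any particular business. The example below assumes records are dictionaries with hashable values and covers only two checks: a required field must not be empty, and exact duplicates are flagged. The function name first_level_clean and the record layout are hypothetical. The second-level and third-level sub-rules are not shown in this excerpt and are not sketched here.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


def first_level_clean(
    records: List[Record], required_fields: List[str]
) -> Tuple[List[Record], List[Record]]:
    """Split records into (clean, dirty) using business-independent checks."""
    clean: List[Record] = []
    dirty: List[Record] = []
    seen = set()
    for record in records:
        # Incomplete data: a required field is missing or empty.
        missing = any(record.get(f) in (None, "") for f in required_fields)
        # Duplicate data: an identical record was already accepted.
        key = tuple(sorted(record.items()))
        duplicate = key in seen
        if missing or duplicate:
            dirty.append(record)
        else:
            seen.add(key)
            clean.append(record)
    return clean, dirty
```

Here the first occurrence of a record is kept and later identical copies are treated as dirty, which is one possible policy among several.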

Embodiment 3

[0063] Figure 3 is a schematic structural diagram of a data cleaning device provided in Embodiment 3 of the present invention. As shown in Figure 3, the device includes a matching module 31 and a cleaning module 32.

[0064] The matching module 31 is configured to match the cleaning rules according to the data characteristics of the target data.

[0065] The cleaning module 32 is configured to clean the target data by using the matched cleaning rules.

[0066] In this embodiment, the cleaning rules are first matched according to the data characteristics of the target data, and the target data is then cleaned using the matched cleaning rules. This ensures that the cleaning rules fit the data characteristics, so the target data can be cleaned in a more targeted manner, more dirty data can be cleaned out effectively, and the cleaning effect is improved.
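
The two-module structure of Embodiment 3 can be sketched as two small Python classes. This is an illustrative assumption, not the device described in the patent: a rule is modeled as a function that returns True when a record passes, and data features are modeled as plain strings. The class names MatchingModule and CleaningModule mirror modules 31 and 32, but their methods are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
Rule = Callable[[Record], bool]  # returns True when the record passes


class MatchingModule:
    """Sketch of module 31: selects cleaning rules by data feature."""

    def __init__(self, rules_by_feature: Dict[str, List[Rule]]):
        self._rules_by_feature = rules_by_feature

    def match(self, features: List[str]) -> List[Rule]:
        matched: List[Rule] = []
        for feature in features:
            matched.extend(self._rules_by_feature.get(feature, []))
        return matched


class CleaningModule:
    """Sketch of module 32: applies the matched rules to the target data."""

    def clean(
        self, records: List[Record], rules: List[Rule]
    ) -> Tuple[List[Record], List[Record]]:
        clean: List[Record] = []
        dirty: List[Record] = []
        for record in records:
            # A record is dirty as soon as any matched rule rejects it.
            if all(rule(record) for rule in rules):
                clean.append(record)
            else:
                dirty.append(record)
        return clean, dirty
```

Keeping matching and cleaning separate means only the rules selected for the given features are evaluated, rather than traversing every rule for every record as in the prior art described in [0003].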

Abstract

The invention provides a data cleaning method and device. In the method, multiple cleaning rules are set in advance according to different data features. When target data needs to be cleaned, cleaning rules are matched according to the data features of the target data, and the matched cleaning rules are then used to clean the target data. This ensures that the cleaning rules fit the data features, so the target data can be cleaned in a more targeted manner and more dirty data can be cleaned out effectively. At the same time, the probability of mistakenly recognizing clean data as dirty data is lowered, and the cleaning effect is improved.

Description

Technical field

[0001] The invention relates to information technology, in particular to a data cleaning method and device.

Background

[0002] Data cleaning is the process of re-examining and verifying data after it is output, with the purpose of identifying dirty data. Because the data in a data warehouse is extracted from multiple business systems and contains various types of historical data and forecast data, it is unavoidable that some of the data is wrong and some of it conflicts with other data. Wrong or conflicting data is undesirable in downstream processing and can be called dirty data. Data cleaning identifies this dirty data according to certain cleaning rules.

[0003] The data cleaning in the prior art is to traverse all the cleaning rules for all the data after the data is output. The cleaning rules are common among all businesses, and mainly focus on whether the data is incomplete, whether the data format is wrong, etc. Cleaning, obvious...

Application Information

IPC(8): G06F17/30
CPC: G06F16/215, G06F16/00
Inventor 马艳娟
Owner ALIBABA GRP HLDG LTD