Data cleaning method and device

A data cleaning technique for handling dirty data, applied in the information field. It addresses problems such as poor cleaning effect and the inability to clean out dirty data, improving the cleaning effect and reducing the probability of misidentifying clean data as dirty data.

Active Publication Date: 2022-02-25
ALIBABA GRP HLDG LTD

AI Technical Summary

Problems solved by technology

[0003] In the prior art, data cleaning traverses all the cleaning rules for all the data after the data is output. The cleaning rules are common to all businesses and mainly check whether the data is incomplete, whether the data format is wrong, and so on. This approach can only clean out the more obvious dirty data. When the dirty data carries incorrect values, for example, it cannot be cleaned out, so the data obtained after cleaning still contains dirty data and the cleaning effect is poor.



Examples


Embodiment 1

[0023] Figure 1 is a schematic flow diagram of a data cleaning method provided in Embodiment 1 of the present invention. As shown in Figure 1, the method includes:

[0024] Step 101. Match cleaning rules according to the data characteristics of the target data.

[0025] The data characteristics are used to describe the target data.

[0026] Specifically, data-related information may be obtained from the requesting end that requests cleaning of the target data. Examples of such information include the original business that generates the target data, the target business that will use the target data, the original computing task that generates the target data within the original business, and/or the target computing task that will use the target data within the target business.
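The matching in Step 101 could be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the `DataFeatures` fields mirror the four kinds of data-related information listed above, while the registry layout, the business names, and the rule-set names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DataFeatures:
    """Data-related information supplied by the requesting end (field names are assumptions)."""
    source_business: Optional[str] = None
    target_business: Optional[str] = None
    source_task: Optional[str] = None
    target_task: Optional[str] = None


# Hypothetical registry: each entry pairs the feature values it requires
# with the name of a pre-configured cleaning rule set.
RULE_REGISTRY = [
    ({"source_business": "payments"}, "payments_rules"),
    ({"source_business": "payments", "target_task": "daily_report"}, "payments_report_rules"),
    ({"target_business": "search"}, "search_rules"),
]


def match_rule_sets(features: DataFeatures) -> List[str]:
    """Return the names of rule sets whose required feature values all match."""
    matched = []
    for required, rule_set in RULE_REGISTRY:
        if all(getattr(features, key) == value for key, value in required.items()):
            matched.append(rule_set)
    return matched
```

Under these assumptions, a request carrying only a target business of `"search"` would select `search_rules`, and a request with no recognizable features would select nothing.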

[0027] The original business that generates the target data, the target business that the target data needs to be used for, the original computing task that generates the target data in the origi...

Embodiment 2

[0036] Figure 2 is a schematic flow diagram of a data cleaning method provided in Embodiment 2 of the present invention. As shown in Figure 2, the method includes:

[0037] Step 201. Configure cleaning rules.

[0038] Specifically, the cleaning rules can be configured in advance. They can be configured manually by the user or generated automatically by the data cleaning platform from existing cleaning rules.

[0039] In one possible implementation, the cleaning rules include three levels: first-level, second-level, and third-level cleaning sub-rules. The three levels are described below:

[0040] A. The first-level cleaning sub-rules consist of rules common to all businesses and are mainly used to identify dirty data that is incomplete, duplicated, or obviously wrong.

[0041] For example, the first-level cleaning sub-rules may include: a field in the data cannot be empty, the data has been complete...
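Two of the first-level checks named above, a field that cannot be empty and duplicated data, might look like the sketch below. The record shape (a dictionary), the function name, and the choice to treat an exact repeat of an earlier record as the duplicate are assumptions made for illustration; the source does not specify them.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


def split_by_first_level_rules(
    records: List[Record], required_field: str
) -> Tuple[List[Record], List[Record]]:
    """Split records into (clean, dirty) using two illustrative common checks."""
    clean: List[Record] = []
    dirty: List[Record] = []
    seen = set()
    for record in records:
        value = record.get(required_field)
        # Check 1: the required field must not be missing or empty.
        is_empty = value is None or value == ""
        # Check 2: an exact repeat of an earlier record counts as a duplicate.
        fingerprint = tuple(sorted((key, repr(val)) for key, val in record.items()))
        is_duplicate = fingerprint in seen
        seen.add(fingerprint)
        if is_empty or is_duplicate:
            dirty.append(record)
        else:
            clean.append(record)
    return clean, dirty
```

Because these checks do not depend on any particular business, they can run on every data set before more specific rules are applied.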

Embodiment 3

[0063] Figure 3 is a schematic structural diagram of a data cleaning device provided in Embodiment 3 of the present invention. As shown in Figure 3, the device includes a matching module 31 and a cleaning module 32.

[0064] The matching module 31 is configured to match the cleaning rules according to the data characteristics of the target data.

[0065] The cleaning module 32 is configured to clean the target data using the matched cleaning rules.

[0066] In this embodiment, cleaning rules are matched according to the data characteristics of the target data, and the target data is then cleaned using the matched cleaning rules. This ensures that the cleaning rules match the data characteristics, so the target data can be cleaned in a more targeted way, which effectively cleans out more dirty data and improves the cleaning effect.
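The two-module structure of Embodiment 3 could be composed along the following lines. This is only a sketch: the class names echo modules 31 and 32, but the interfaces, the use of a single string as the data characteristic, and the representation of a rule as a predicate function are assumptions, not details taken from the patent.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Rule = Callable[[Record], bool]  # assumed convention: True means the record passes


class MatchingModule:
    """Rough counterpart of matching module 31: selects rules by data characteristic."""

    def __init__(self, rules_by_feature: Dict[str, List[Rule]]) -> None:
        self._rules_by_feature = rules_by_feature

    def match(self, feature: str) -> List[Rule]:
        # An unknown characteristic matches no rules.
        return self._rules_by_feature.get(feature, [])


class CleaningModule:
    """Rough counterpart of cleaning module 32: applies the matched rules."""

    def clean(self, records: List[Record], rules: List[Rule]) -> List[Record]:
        # Keep only the records that pass every matched rule.
        return [record for record in records if all(rule(record) for rule in rules)]


class DataCleaningDevice:
    """Wires the two modules together: match first, then clean."""

    def __init__(self, matching: MatchingModule, cleaning: CleaningModule) -> None:
        self.matching = matching
        self.cleaning = cleaning

    def run(self, records: List[Record], feature: str) -> List[Record]:
        rules = self.matching.match(feature)
        return self.cleaning.clean(records, rules)
```

Keeping matching and cleaning in separate classes means the rule-selection policy can change without touching the code that applies the rules. In this sketch, data with no matched rules passes through unchanged.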



Abstract

The invention provides a data cleaning method and device. Multiple cleaning rules are set in advance according to different data characteristics. When target data needs to be cleaned, cleaning rules are matched according to the data characteristics of the target data, and the matched cleaning rules are then used to clean the target data. This ensures that the cleaning rules suit the data characteristics, so the target data can be cleaned in a more targeted way, which effectively cleans out more dirty data, reduces the probability of misidentifying clean data as dirty data, and improves the cleaning effect.

Description

Technical field

[0001] The invention relates to information technology, in particular to a data cleaning method and device.

Background

[0002] Data cleaning is the process of re-examining and verifying data after it is output, with the purpose of identifying dirty data. Because the data in a data warehouse is extracted from multiple business systems and contains various types of historical and forecast data, it is unavoidable that some of the data is wrong and some of it conflicts with other data. Wrong or conflicting data is undesirable in downstream processing and can be called dirty data. Data cleaning identifies this dirty data according to certain cleaning rules.

[0003] In the prior art, data cleaning traverses all the cleaning rules for all the data after the data is output. The cleaning rules are common to all businesses, and mainly focus on whether the data is incomplete, whether the data format is wrong, etc. Cleaning, obvious...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/215
CPC: G06F16/215, G06F16/00
Inventor: 马艳娟
Owner: ALIBABA GRP HLDG LTD