Python script-based distributed big data cleaning method

A distributed data cleaning technology, applied in the field of data cleaning. It addresses problems such as too few cleaning rules, insufficient cleaning computing power, and mediocre cleaning results, improving accuracy and overcoming insufficient cleaning capability.

Active Publication Date: 2020-12-22
云基华海信息技术股份有限公司

AI Technical Summary

Problems solved by technology

[0004] The technical problem to be solved by the present invention is as follows: existing distributed big data cleaning methods have limited cleaning ability and cannot clean large volumes of data; existing cleaning methods mostly rely on SQL cleaning rules, which are few in number, so the cleaning results are mediocre. In addition, traditional data cleaning methods lack sufficient computing power.




Embodiment Construction

[0028] The embodiments of the present invention are described in detail below. This embodiment is implemented on the premise of the technical solution of the present invention, and detailed implementation methods and specific operating procedures are provided, but the protection scope of the present invention is not limited to the following embodiment.

[0029] As shown in Figure 1, this embodiment provides a technical solution: a Python script-based distributed big data cleaning method, the method comprising the following steps:

[0030] Step 1: Load the data to be cleaned, then shard the loaded data;

[0031] Step 2: Schedule and execute the cleaning of the data in a distributed manner;

[0032] Step 3: Request the data to be cleaned and backfill the cleaning results;
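The three steps above can be sketched in plain Python. This is only an illustration under stated assumptions: a thread pool stands in for the distributed scheduler, an in-memory list stands in for the data store, and the names `shard`, `clean_row`, `clean_shard`, and `run_pipeline` are hypothetical rather than taken from the patent.

```python
from concurrent.futures import ThreadPoolExecutor


def shard(rows, shard_size):
    """Step 1: split the loaded rows into fixed-size shards."""
    return [rows[i:i + shard_size] for i in range(0, len(rows), shard_size)]


def clean_row(row):
    """Hypothetical rule: trim strings and turn empty strings into None."""
    cleaned = {}
    for key, value in row.items():
        if isinstance(value, str):
            value = value.strip() or None
        cleaned[key] = value
    return cleaned


def clean_shard(rows):
    """Apply the cleaning rule to every row of one shard."""
    return [clean_row(row) for row in rows]


def run_pipeline(rows, shard_size=2, workers=4):
    shards = shard(rows, shard_size)
    # Step 2: schedule the shards across workers (a local stand-in for
    # a distributed engine); map() returns results in shard order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        cleaned_shards = list(pool.map(clean_shard, shards))
    # Step 3: collect the cleaned rows so they can be written back.
    return [row for part in cleaned_shards for row in part]


raw_rows = [
    {"id": 1, "name": "  Alice "},
    {"id": 2, "name": ""},
    {"id": 3, "name": "Bob"},
]
cleaned_rows = run_pipeline(raw_rows)
```

In a real deployment the final list would be written back to the source table rather than returned, and the shard size would be tuned to the cluster.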

[0033] Wherein, Step 1 is specifically divided into the following steps:

[0034] S1: Data loading, first load the data that needs to be cl...


Abstract

The invention discloses a Python script-based distributed big data cleaning method, which comprises the following steps: first, loading the data to be cleaned and sharding the loaded data; then performing distributed scheduling and execution on the data to be cleaned; and finally requesting the data to be cleaned and backfilling the cleaning results. Step 1 specifically comprises the following sub-steps: data loading, in which the data to be cleaned is first loaded from an HBase column-store database; and formulating and setting a data cleaning strategy. Because the method is built on big data technology and cleans data held in an HBase column-store database, it solves the problem of cleaning massive data. Because a Python engine and scripts are used for cleaning, it overcomes both the small number of rules available with traditional SQL cleaning and the static coding required by jar-package-based cleaning. Because a Spark-based distributed computing engine executes the scripts in parallel, it solves the problem of insufficient computing power for big data cleaning.
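To illustrate how script-based rules differ from statically compiled ones, the following minimal sketch loads a cleaning rule from Python source text at run time. The convention that a rule script defines a function named `clean(record)`, the helper `load_rule`, and the sample phone-number rule are all assumptions made for this example; the patent does not specify the script interface.

```python
# A rule kept as text, so it can be changed without rebuilding the engine.
RULE_SCRIPT = '''
def clean(record):
    phone = record.get("phone") or ""
    digits = "".join(ch for ch in phone if ch.isdigit())
    record = dict(record)          # leave the caller's record untouched
    record["phone"] = digits or None
    return record
'''


def load_rule(source, name="<rule>"):
    """Compile a rule script and return the clean() function it defines."""
    namespace = {}
    exec(compile(source, name, "exec"), namespace)
    rule = namespace.get("clean")
    if not callable(rule):
        raise ValueError("rule script must define clean(record)")
    return rule


rule = load_rule(RULE_SCRIPT)
result = rule({"id": 7, "phone": "(010) 555-0199"})
```

Executing script text this way runs arbitrary code, so rule scripts should come only from trusted sources.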

Description

Technical Field

[0001] The invention relates to the field of data cleaning, in particular to a Python script-based distributed big data cleaning method.

Background Art

[0002] Data cleaning is the final procedure for discovering and correcting identifiable errors in data files, including checking data consistency and handling invalid and missing values. Unlike questionnaire review, data cleaning after entry is generally done by computer rather than by hand.

[0003] Existing distributed big data cleaning methods have limited cleaning capabilities and cannot clean large volumes of data. Existing cleaning methods mostly rely on SQL cleaning rules, which are few in number, so the cleaning results are mediocre. Traditional data cleaning methods also lack sufficient computing power. Therefore, how to create a distributed big data cleaning method based on Python scripts has become an urgent problem to be solved. Co...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/215, G06F16/27, G06F9/48
CPC: G06F16/215, G06F16/27, G06F9/4881, Y02D10/00
Inventor: 鲁红军
Owner: 云基华海信息技术股份有限公司