Data cleaning method and device based on distributed platform

A data cleaning method built on a distributed platform, applied in the field of big data processing. It addresses slow performance, low processing efficiency, and the inability to achieve real-time display, and aims to meet business needs, improve data quality, and improve data cleaning performance and processing efficiency.

Active Publication Date: 2018-05-18
SHENZHEN ZHONGYI TECH CO LTD
Cites: 5; Cited by: 18

AI Technical Summary

Problems solved by technology

[0003] However, data cleaning in existing technologies often cannot adapt to large-scale data operations: performance is slow, processing efficiency is low, and real-time display cannot be achieved.

Embodiment Construction

[0057] Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided for more thorough understanding of the present disclosure and to fully convey the scope of the present disclosure to those skilled in the art.

[0058] The entire data processing flow of the present invention is composed of STORM applications: each step is an application, the steps are connected through KAFKA, and the data is stored in MONGODB, as shown in Figure 1. STORM is a distributed, fault-tolerant real-time computing system; KAFKA is a high-throughput distributed publish-subscribe messaging system that can handle all action flow data in consumer-scale w...
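The flow in [0058] (independent steps connected through message topics, with results written to a document store) can be sketched in a single process. This is only an illustration under stated assumptions: `queue.Queue` objects stand in for KAFKA topics, a plain dict stands in for a MONGODB collection, and the names `run_stage`, `normalize`, and `store_all` as well as the record fields are hypothetical and do not come from the patent. No STORM, KAFKA, or MONGODB APIs are used.

```python
from queue import Queue


def run_stage(inbox, outbox, transform):
    """Drain one topic, apply a step, and publish non-None results downstream."""
    while not inbox.empty():
        result = transform(inbox.get())
        if result is not None:
            outbox.put(result)


def normalize(record):
    """Hypothetical cleaning step: drop records without an id, tidy the name."""
    if record.get("id") is None:
        return None
    return {"id": record["id"], "name": record.get("name", "").strip().lower()}


def store_all(inbox, store):
    """Upsert by id so a repeated message overwrites instead of duplicating."""
    while not inbox.empty():
        record = inbox.get()
        store[record["id"]] = record


# Queues stand in for message topics; the dict stands in for the document store.
raw_topic, clean_topic = Queue(), Queue()
store = {}

for rec in [
    {"id": 1, "name": "  Alice "},
    {"id": None, "name": "x"},
    {"id": 1, "name": "ALICE"},
    {"id": 2, "name": "Bob"},
]:
    raw_topic.put(rec)

run_stage(raw_topic, clean_topic, normalize)
store_all(clean_topic, store)
```

Because each stage only reads from one queue and writes to another, a stage could in principle be replaced or scaled out without touching its neighbours, which is the property the topic-based connection is meant to provide.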

Abstract

The invention discloses a data cleaning method and device based on a distributed platform. The method is applied to a distributed, high-efficiency real-time processing system and is used for data cleaning at big data scale. Distributed processing provides the required performance and scale, and meets the demands of rapid processing and real-time response. Business requirements and the cleaning goal are met through continuous iterative optimization, a process that alternates between data exploration and rule optimization, so that data quality improves continuously. This solves the problem that existing centralized processing cannot adapt to large-scale data operations, makes full use of the characteristics of big data to complete big data cleaning, and prepares data for massive data analysis. Business requirements are met as fully as possible, and data cleaning performance and data processing efficiency can be improved.

Description

Technical field

[0001] The invention relates to the field of big data processing, in particular to a data cleaning method and device based on a distributed platform.

Background

[0002] Data cleaning is the process of re-examining and validating data to remove duplicate information, correct existing errors, and ensure data consistency. Data cleaning on a distributed platform uses a set of distributed components, such as STORM, ZOOKEEPER, KAFKA and MONGODB, to form a data cleaning system. STORM performs distributed real-time computing and processing. KAFKA is a distributed messaging system that guarantees access performance and high throughput, supports message partitioning and distributed consumption, and supports both offline and real-time data processing. MONGODB is an open source database system based on distributed file storage that provides scalable, high-performance data storage.

[0003] However, data cleaning...
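The definition of data cleaning in [0002] (removing duplicates, correcting errors, ensuring consistency) can be illustrated with a minimal single-process sketch. The record fields `email` and `age`, the validity rules, and the function name `clean` are assumptions made for illustration; the source does not specify any concrete cleaning rules.

```python
def clean(records):
    """Return records with consistent formatting, invalid values blanked,
    and duplicates (by normalized email) removed."""
    seen = set()
    cleaned = []
    for rec in records:
        # Consistency: one canonical form for the key field.
        email = rec.get("email", "").strip().lower()

        # Error correction: an implausible age is replaced by None
        # (assumed rule: integer between 0 and 150).
        age = rec.get("age")
        if not isinstance(age, int) or not 0 <= age <= 150:
            age = None

        # Validation and de-duplication: skip malformed or repeated emails.
        if "@" not in email or email in seen:
            continue
        seen.add(email)
        cleaned.append({"email": email, "age": age})
    return cleaned


sample = [
    {"email": " A@Example.com ", "age": 30},
    {"email": "a@example.com", "age": 31},   # duplicate after normalization
    {"email": "b@example.com", "age": -5},   # implausible age
    {"email": "broken", "age": 20},          # malformed email
]
result = clean(sample)
```

In this sketch the first occurrence of a key wins; a real system might instead keep the most recent or most complete record, which is a policy choice the source leaves open.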

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/215, G06F16/27
Inventor: 陈建江
Owner: SHENZHEN ZHONGYI TECH CO LTD