
A data cleaning system and method for big data

A data cleaning system and method, applied in electrical digital data processing, special data processing applications, instruments, etc., that addresses problems such as inaccurate original data, missing data fields, and the difficulty of handling big data with centralized processing.

Active Publication Date: 2017-07-18
BEIJING INSTITUTE OF TECHNOLOGY

AI Technical Summary

Problems solved by technology

Traditional centralized processing of structured data struggles with the problems posed by big data. Because big data is massive, distributed, and heterogeneous, integrating and cleaning it becomes particularly important.
Big data also includes uncertain data. The causes of uncertain data are diverse; the main ones are inaccurate original data, the use of coarse-grained data sets, missing data fields, and data integration.



Examples


Embodiment Construction

[0045] The specific implementation manners of the present invention will be described in detail below in conjunction with the accompanying drawings.

[0046] Research on big data in recent years shows that integration and analysis are the key to realizing the value of big data, and that integration is the basis of analysis. The value of big data can be brought into play only when different data and information are integrated correctly. Data integration technology for big data is therefore a challenge that must be solved to realize that value. Within data integration, the similarity join is a key technique.

[0047] In a data set, some records may appear repeatedly in different forms, and finding these similar records requires a similarity join. During data integration, the similarity join can detect duplicate records in the data set and merge records that may be the sa...
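As a rough illustration of how a similarity join finds records that appear in different forms, the sketch below compares every pair of records by the Jaccard similarity of their word sets. The excerpt does not state which similarity measure or threshold the invention uses, so the Jaccard measure, the threshold of 0.5, the sample records, and the function names here are all assumptions. It is also a centralized pairwise comparison, not the distributed join the invention targets.

```python
from itertools import combinations


def tokens(record: str) -> frozenset:
    """Split a record into a set of lowercase words (assumed tokenization)."""
    return frozenset(record.lower().split())


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_join(records, threshold=0.5):
    """Return index pairs (i, j), i < j, whose similarity meets the threshold."""
    token_sets = [tokens(r) for r in records]
    pairs = []
    for i, j in combinations(range(len(records)), 2):
        if jaccard(token_sets[i], token_sets[j]) >= threshold:
            pairs.append((i, j))
    return pairs


# Hypothetical sample records: the first two describe the same entity.
records = [
    "Beijing Institute of Technology",
    "beijing institute technology",
    "Northeastern University",
]
similar_pairs = similarity_join(records, threshold=0.5)
```

Comparing all pairs costs quadratic time in the number of records, which is one reason a centralized join of this kind does not scale to large data sets.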


Abstract

A data cleaning system and method for big data. The application layer of the system includes a data analysis and extraction module, a similarity join module, a similar subgraph aggregation module, an entity sampling module, a probability calculation module, and an entity query module. The storage layer uses HDFS, the distributed storage tool provided by Hadoop, to store the structured data records, similar data record pairs, and similar connected subgraphs generated during the data cleaning process. HBase, the distributed storage tool provided by Hadoop, stores the cleaned structured data records. The method includes obtaining the data to be cleaned, similarity join, similar subgraph aggregation, entity sampling, probability calculation, and entity query. The invention is a data cleaning system and uncertain-data determination method for big data. It solves the problem that earlier centralized similarity joins cannot adapt to large-scale data operations, makes full use of graphs and related knowledge to complete big data cleaning, and provides data preparation for massive data analysis.
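The similar subgraph aggregation step can be pictured as grouping records into connected components, where each similar record pair is an edge. The sketch below is a minimal single-machine illustration using a union-find structure. The abstract does not say how the invention performs the aggregation, so the union-find approach, the function name `aggregate_subgraphs`, and the sample pairs are assumptions.

```python
def aggregate_subgraphs(num_records, similar_pairs):
    """Group record indices into connected components of the similarity graph."""
    parent = list(range(num_records))

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the larger root index under the smaller one.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = {}
    for record in range(num_records):
        groups.setdefault(find(record), []).append(record)
    return sorted(groups.values())


# Hypothetical similar pairs over five records.
subgraphs = aggregate_subgraphs(5, [(0, 1), (1, 2), (3, 4)])
```

Each resulting group is a candidate set of records that may refer to the same entity, which later steps such as entity sampling and probability calculation would then work on.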

Description

Technical field

[0001] The invention belongs to the technical field of data mining, and in particular relates to a data cleaning system and method for big data.

Background technique

[0002] In recent years, with the rapid development of information technology, the amount of data collected, stored, processed and analyzed has kept increasing, and big data processing has become more and more common. Unlike traditional data, big data has three characteristics: massiveness, distribution, and heterogeneity. Massiveness refers to the huge scale of the data and its ever-increasing growth rate. Distribution means that the data is too large to be stored, computed and analyzed on a single machine. Heterogeneity refers to the differing data types and the diversity of data sources. Using the traditional centralized processing method for structured data, it is dif...


Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F17/30
CPC: G06F16/215
Inventor: 王国仁, 信俊昌, 聂铁铮, 赵相国, 邓诗卓, 季航旭, 侯喆, 梁帅
Owner: BEIJING INSTITUTE OF TECHNOLOGY