
Data cleaning system and method for aiming at big data

A data cleaning and big data technology, classified under electrical digital data processing, special data processing applications, instruments, etc., that addresses problems such as inaccurate original data, the difficulty of handling big data, and missing data fields.

Active Publication Date: 2015-01-28
BEIJING INSTITUTE OF TECHNOLOGY
2 Cites, 34 Cited by

AI Technical Summary

Problems solved by technology

Traditional centralized processing methods for structured data struggle to handle the problems that big data raises. Given big data's three characteristics (massiveness, distribution, and heterogeneity), integrating and cleaning big data becomes particularly important.
Big data also includes uncertain data. The causes of uncertain data are diverse and stem mainly from inaccurate original data, the use of coarse-grained data sets, missing data fields, and data integration.



Embodiment Construction

[0045] The specific implementation manners of the present invention will be described in detail below in conjunction with the accompanying drawings.

[0046] Research on big data in recent years shows that integration and analysis are the key to realizing the value of big data, and that integration is the basis of analysis. The value of big data can be brought into play only when different data and information are integrated correctly. Data integration technology for big data is therefore a challenge that must be solved in order to realize that value, and the similarity join is a key technology within data integration.

[0047] In a data set, some records may appear repeatedly in different forms, and finding these similar records requires similarity join technology. During data integration, a similarity join can detect duplicate records in the data set and merge records that may be the sa...
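The duplicate detection described in [0047] can be illustrated with a minimal, single-machine sketch. It is not the patent's distributed method: the word-level tokenization, the Jaccard similarity measure, the 0.5 threshold, and the function names `tokens`, `jaccard`, and `similarity_join` are all assumptions chosen for illustration.

```python
from itertools import combinations


def tokens(record):
    """Split a record into a set of lowercase word tokens (assumed tokenization)."""
    return frozenset(record.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_join(records, threshold=0.5):
    """Return index pairs (i, j), i < j, whose similarity meets the threshold.

    This naive version compares every pair, so it is quadratic in the number
    of records and only suitable for small inputs.
    """
    token_sets = [tokens(r) for r in records]
    pairs = []
    for i, j in combinations(range(len(records)), 2):
        if jaccard(token_sets[i], token_sets[j]) >= threshold:
            pairs.append((i, j))
    return pairs


records = [
    "Beijing Institute of Technology",
    "beijing institute of technology china",
    "Northeastern University",
]
print(similarity_join(records))  # [(0, 1)]
```

The all-pairs comparison is exactly the kind of centralized computation that stops scaling as the data grows, which is the limitation the patent sets out to address.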


Abstract

The invention discloses a data cleaning system and method for big data. The system's application layer comprises a data analysis and extraction module, a similarity join module, a similar subgraph aggregation module, an entity sampling module, and a probability calculation and entity query module. The storage layer uses HDFS (Hadoop Distributed File System), the distributed storage tool provided by Hadoop, to store the structured data records, similar record pairs, and similar connected subgraphs generated during data cleaning, and uses HBase, the distributed store from the Hadoop ecosystem, to store the cleaned structured data records. The method comprises the following steps: obtaining the data to be cleaned, performing similarity joins, aggregating similar subgraphs, sampling entities, and performing probability calculation and entity query. The invention provides a big-data-oriented data cleaning system and a method for resolving uncertain data. It addresses the inability of traditional centralized similarity joins to scale to large data sets, uses graph-related techniques to carry out big data cleaning, and prepares data for the analysis of massive data sets.
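One plausible reading of the subgraph aggregation step is that similar record pairs are treated as edges of a graph and each connected subgraph becomes a candidate entity. The sketch below shows that reading on a single machine using a union-find structure; the function name `connected_groups` and the in-memory representation are assumptions, and the patent's actual distributed procedure is not shown in this summary.

```python
def connected_groups(n, pairs):
    """Group record indices 0..n-1 into connected components.

    `pairs` holds (i, j) edges between similar records, for example the
    output of a similarity join. Returns a sorted list of sorted index lists.
    """
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the larger root under the smaller one for determinism.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Records 0, 1, 2 are linked transitively; 3 and 4 form a second group.
print(connected_groups(5, [(0, 1), (1, 2), (3, 4)]))  # [[0, 1, 2], [3, 4]]
```

Records that match no other record remain as single-element groups, so every input record ends up in exactly one group.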

Description

Technical field

[0001] The invention belongs to the technical field of data mining, and in particular relates to a big-data-oriented data cleaning system and method.

Background

[0002] In recent years, with the rapid development of information technology, the amount of data collected, stored, processed, and analyzed has kept increasing, and big data processing has become more and more widespread. Unlike traditional data, big data has three characteristics: massiveness, distribution, and heterogeneity. Massiveness refers to the huge scale of the data and its ever-increasing growth rate. Distribution reflects the fact that such a large amount of data cannot be stored, computed, and analyzed on a single machine. Heterogeneity reflects the variety of data types and the diversity of data sources. Using the traditional centralized processing method for structured data, it is dif...


Application Information

IPC: IPC(8): G06F17/30
CPC: G06F16/215
Inventor: 王国仁, 信俊昌, 聂铁铮, 赵相国, 邓诗卓, 季航旭, 侯喆, 梁帅
Owner: BEIJING INSTITUTE OF TECHNOLOGY