A preprocessing method of multi-source heterogeneous big data

A preprocessing technology for multi-source heterogeneous data, applied in the field of big data processing. It addresses the insufficient research on preprocessing semi-structured and unstructured data and the failure of existing methods to meet user needs well. Its effects are reduced storage resource and network bandwidth consumption, strong practicability, and improved data storage efficiency.

Status: Inactive
Publication Date: 2019-01-08
SOUTH CHINA UNIV OF TECH

AI Technical Summary

Problems solved by technology

[0004] Existing big data preprocessing methods have the following problems: they mainly target structured data, with insufficient research on preprocessing semi-structured and unstructured data; they usually include only two modules, data acquisition and data cleaning; and their data cleaning methods are relatively simple. As a result, they cannot meet user needs well.

Examples

Embodiment Construction

[0035] The present invention will be further described below in conjunction with specific examples.

[0036] As shown in Figure 1, the preprocessing method for multi-source heterogeneous big data provided in this embodiment includes the following steps:

[0037] Step 1: Storage of heterogeneous data. Data is extracted from multiple heterogeneous data sources and uploaded to the distributed file system HDFS for storage. The invention supports multiple data source formats, including Txt, Csv, Xls, database data, jpg, and mp4, and provides an interface standard for adding new data sources.

[0038] For text files, such as Txt and Csv, a text storage function reads the text data from the file and stores it in the distributed file system HDFS.

[0039] For Xls files, an Xls storage function reads the data from the Excel file and stores it in the distributed file system HDFS.
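
The per-format storage functions and the extension interface described in Step 1 could be organized as a small registry keyed by file extension. The following is a minimal sketch under stated assumptions, not the patent's implementation: the names `register_reader`, `read_text`, `read_csv`, and `store` are hypothetical, and a plain dict stands in for the HDFS upload so that the example runs without a cluster.

```python
import csv
import io
from typing import Callable, Dict, List

# Hypothetical registry: file extension -> function that parses raw bytes into records.
READERS: Dict[str, Callable[[bytes], List[str]]] = {}


def register_reader(*extensions: str):
    """Decorator that registers a reader for one or more extensions.

    This plays the role of the 'interface standard' for adding new sources.
    """
    def wrap(func):
        for ext in extensions:
            READERS[ext.lower()] = func
        return func
    return wrap


@register_reader("txt")
def read_text(raw: bytes) -> List[str]:
    # One record per non-empty line.
    return [line for line in raw.decode("utf-8").splitlines() if line.strip()]


@register_reader("csv")
def read_csv(raw: bytes) -> List[str]:
    # Parse with the csv module, then join fields with tabs as a uniform record format.
    rows = csv.reader(io.StringIO(raw.decode("utf-8")))
    return ["\t".join(row) for row in rows if row]


def store(filename: str, raw: bytes, sink: Dict[str, List[str]]) -> int:
    """Dispatch on extension and write records to `sink` (a dict standing in for HDFS)."""
    ext = filename.rsplit(".", 1)[-1].lower()
    if ext not in READERS:
        raise ValueError(f"no reader registered for .{ext}")
    records = READERS[ext](raw)
    sink[filename] = records
    return len(records)
```

Under this design, a new source format would be supported by decorating one more function with `register_reader`; in a real deployment the write to `sink` would be replaced by a call to an HDFS client.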

[0040] For database data, such as MySQL, Oracl...

Abstract

The invention discloses a preprocessing method for multi-source heterogeneous big data, comprising the following steps: (1) heterogeneous data storage: extracting data from heterogeneous data sources according to preset conditions and uploading it to the distributed file system HDFS for storage; (2) data cleaning: loading the data from HDFS into memory through the Spark framework, removing duplicate data and noise data, and converting formats; (3) entity identification: identifying different representations of the same entity in the cleaned data, correctly distinguishing all the different entities, and merging the data belonging to the same entity; (4) redundancy removal: using a hash-based duplicate data removal technique to remove redundant data. The method can reduce storage resource and network bandwidth consumption, improve data storage efficiency, and improve the quality of subsequent data analysis.
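
Steps 2 to 4 of the abstract can be illustrated with a small in-memory pipeline. This is a sketch under several assumptions, since the excerpt does not give the details: it uses plain Python rather than Spark, the helpers `clean`, `entity_key`, `merge_entities`, and `dedupe_by_hash` are hypothetical, every record is assumed to have a `name` field, "noise" is taken to mean records with missing values, entities are matched on a case-insensitive, whitespace-normalized name, and SHA-256 is chosen as the hash.

```python
import hashlib
from typing import Dict, Iterable, List, Optional

Record = Dict[str, Optional[str]]


def clean(records: Iterable[Record]) -> List[Record]:
    """Step 2 (sketch): drop records with missing values and strip whitespace."""
    cleaned = []
    for rec in records:
        if any(v is None or not v.strip() for v in rec.values()):
            continue  # treated as noise here; the patent's noise criteria are not given
        cleaned.append({k: v.strip() for k, v in rec.items()})
    return cleaned


def entity_key(rec: Record) -> str:
    # Assumed matching rule: names are equal after lowercasing and collapsing whitespace.
    return " ".join(rec["name"].lower().split())


def merge_entities(records: Iterable[Record]) -> List[Record]:
    """Step 3 (sketch): merge records judged to describe the same entity."""
    merged: Dict[str, Record] = {}
    for rec in records:
        key = entity_key(rec)
        if key in merged:
            # Keep first-seen values; fill in fields the earlier record lacked.
            for field, value in rec.items():
                merged[key].setdefault(field, value)
        else:
            merged[key] = dict(rec)
    return list(merged.values())


def dedupe_by_hash(blobs: Iterable[bytes]) -> List[bytes]:
    """Step 4 (sketch): keep one copy of each distinct payload, keyed by SHA-256 digest."""
    seen = set()
    unique = []
    for blob in blobs:
        digest = hashlib.sha256(blob).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(blob)
    return unique
```

Comparing fixed-size digests instead of full payloads is what lets a hash-based approach avoid storing or transmitting the same content twice, which is consistent with the storage and bandwidth savings the abstract claims.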

Description

Technical field

[0001] The invention relates to the technical field of big data processing, in particular to a preprocessing method for multi-source heterogeneous big data.

Background technique

[0002] Big data is often generated from a large number of sources, often including images, videos, audios, data streams, texts, web pages, and other different data formats. These data are high-dimensional, massive, and complex, which exacerbates the difficulty and complexity of data analysis, information extraction, and knowledge representation. In addition, in the process of data collection and uploading, it is easy to generate problematic data, that is, data that does not meet the data quality requirements, such as missing data, inconsistent data, duplicate data, abnormal data, etc. These problematic data not only waste a lot of storage space and increase storage costs, but also have a serious impact on the results of subsequent big data analysis. Therefore, it is of great signi...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/182, G06F16/174, G06F16/21, G06F16/25, G06F16/28
Inventors: 赵跃龙, 张豫
Owner: SOUTH CHINA UNIV OF TECH