A preprocessing method of multi-source heterogeneous big data

A multi-source heterogeneous data preprocessing technology, applied in the field of big data processing, that addresses two problems: insufficient research on preprocessing semi-structured and unstructured data, and existing methods that do not meet user needs well. Its effects are reduced storage resource and network bandwidth consumption, strong practicability, and improved data storage efficiency.

Status: Inactive; Publication Date: 2019-01-08

SOUTH CHINA UNIV OF TECH

12 Cites, 28 Cited by

Problems solved by technology

[0004] Existing big data preprocessing methods have the following problems: they mainly target structured data, while preprocessing of semi-structured and unstructured data is under-researched; they usually comprise only two modules, data acquisition and data cleaning; and their data cleaning methods are relatively simple and cannot meet users' needs well.

Embodiment Construction

[0035] The present invention will be further described below in conjunction with specific examples.

[0036] As shown in Figure 1, the preprocessing method for multi-source heterogeneous big data provided in this embodiment includes the following steps:

[0037] Step 1: Storage of heterogeneous data. Data is extracted from multiple heterogeneous data sources and uploaded to the distributed file system HDFS for storage. The invention supports multiple data source formats, including Txt, Csv, Xls, database data, jpg, mp4, etc., and provides interface standards so that new data sources can be added.

[0038] For text files, such as Txt and Csv, a text storage function reads the text data from the file and stores it in the distributed file system HDFS.

[0039] For Xls files, an Xls storage function reads the data from the Excel file and stores it in the distributed file system HDFS.

[0040] For database data, such as MySQL, Oracl...
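The per-format storage functions described in [0037] to [0039] can be sketched as a small registry keyed by file extension. This is only an illustration under stated assumptions, not the patent's implementation: the names `STORAGE_FUNCTIONS`, `register`, and `store` are hypothetical, and the `upload` callback stands in for whatever client actually writes to HDFS.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical registry: file extension -> function that reads the file into bytes.
STORAGE_FUNCTIONS: Dict[str, Callable[[Path], bytes]] = {}


def register(*extensions: str):
    """Decorator that registers a reader for one or more file extensions."""
    def decorator(func: Callable[[Path], bytes]) -> Callable[[Path], bytes]:
        for ext in extensions:
            STORAGE_FUNCTIONS[ext.lower()] = func
        return func
    return decorator


@register(".txt", ".csv")
def read_text(path: Path) -> bytes:
    # Text sources: decode as UTF-8 (an assumption) and re-encode for storage.
    return path.read_text(encoding="utf-8").encode("utf-8")


@register(".jpg", ".mp4")
def read_binary(path: Path) -> bytes:
    # Unstructured sources: pass the raw bytes through unchanged.
    return path.read_bytes()


def store(path: Path, upload: Callable[[str, bytes], None]) -> None:
    """Pick the reader for this file type and hand the bytes to `upload`.

    `upload` is a placeholder for the real HDFS write (assumption).
    """
    reader = STORAGE_FUNCTIONS.get(path.suffix.lower())
    if reader is None:
        raise ValueError(f"no storage function registered for {path.suffix!r}")
    upload(path.name, reader(path))
```

Under this design, supporting a new source format means registering one more function, which is one possible reading of "interface standards" for adding new data sources.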

Abstract

The invention discloses a preprocessing method for multi-source heterogeneous big data, comprising the following steps: (1) heterogeneous data storage: extracting data from heterogeneous data sources according to preset conditions and uploading it to the distributed file system HDFS for storage; (2) data cleaning: loading the data in HDFS into memory through the Spark framework, removing duplicate and noisy data, and converting formats; (3) entity identification: identifying different representations of the same entity in the cleaned data, correctly distinguishing all distinct entities, and merging the data that belongs to the same entity; (4) redundancy removal: using hash-based deduplication to remove redundant data. The method reduces storage resource and network bandwidth consumption, improves data storage efficiency, and improves the quality of subsequent data analysis.
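Steps 2 to 4 of the abstract can be illustrated with a minimal sketch in plain Python. It does not use Spark or HDFS, and the specific rules are assumptions chosen for illustration only: a record with an empty field is treated as noise, entities are matched on a case-insensitive, whitespace-normalised `name` field, and SHA-256 serves as the hash. All function names are hypothetical.

```python
import hashlib


def clean(records):
    """Step 2 (sketch): trim fields, drop noisy rows and exact duplicates.

    Assumes every field value is a string.
    """
    seen = set()
    cleaned = []
    for rec in records:
        # Format transformation: strip surrounding whitespace from every field.
        row = {field: value.strip() for field, value in rec.items()}
        # Assumed noise rule: any empty field makes the record unusable.
        if any(value == "" for value in row.values()):
            continue
        key = tuple(sorted(row.items()))
        if key in seen:  # exact duplicate of an earlier record
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


def entity_key(rec):
    """Step 3 (sketch): assumed matching rule on the `name` field."""
    return " ".join(rec["name"].lower().split())


def merge_entities(records):
    """Merge records that share an entity key; the first-seen value wins."""
    merged = {}
    for rec in records:
        target = merged.setdefault(entity_key(rec), {})
        for field, value in rec.items():
            target.setdefault(field, value)
    return list(merged.values())


def dedup_by_hash(blobs):
    """Step 4 (sketch): store each distinct byte string once, keyed by digest.

    Returns the digest -> content store and one digest reference per input.
    """
    store = {}
    refs = []
    for blob in blobs:
        digest = hashlib.sha256(blob).hexdigest()
        store.setdefault(digest, blob)  # content is kept only the first time
        refs.append(digest)
    return store, refs
```

In this sketch, redundant content costs only one digest reference rather than a second stored copy, which is the mechanism behind the storage and bandwidth savings the abstract claims for hash-based deduplication.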

Description

Technical field

[0001] The invention relates to the technical field of big data processing, in particular to a preprocessing method for multi-source heterogeneous big data.

Background

[0002] Big data is generated by a large number of sources and often includes images, video, audio, data streams, text, web pages, and other data formats. Such data is high-dimensional, massive, and complex, which increases the difficulty of data analysis, information extraction, and knowledge representation. In addition, data collection and uploading easily produce problematic data, that is, data that does not meet data quality requirements, such as missing, inconsistent, duplicate, or abnormal data. Problematic data not only wastes storage space and increases storage costs, but also seriously affects the results of subsequent big data analysis. Therefore, it is of great signi...

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G06F16/182, G06F16/174, G06F16/21, G06F16/25, G06F16/28

Inventors: 赵跃龙, 张豫

Owner: SOUTH CHINA UNIV OF TECH
