Method convenient for cleaning, integrating and storing massive multi-source heterogeneous data

A multi-source heterogeneous data processing and data caching technique, applicable to structured data retrieval, electronic digital data processing and special-purpose data processing applications. It addresses problems such as the inability to take global factors into account, and it reduces the pressure of massive data, improves robustness and lowers the degree of coupling between processing stages.

Active Publication Date: 2021-02-09
哈尔滨航天恒星数据系统科技有限公司

AI Technical Summary

Problems solved by technology

[0005] To solve the problem that global factors cannot be taken into account when processing massive multi-source heterogeneous data, the invention provides a method for cleaning, integrating and storing such data that considers global factors. The specific steps of the scheme are as follows:



Examples


Specific Embodiment 1

[0027] Specific Embodiment 1: a method for cleaning, integrating and storing massive multi-source heterogeneous data, comprising the following steps: construct a data source collection; traverse the collection, recording each source's type and data protocol; perform data access; pass the data through the protocol adaptation stage to form first-order data and push it to the cache queue; pull the first-order data, pass it through the cleaning stage to form second-order data, and push it to the cache queue; pull the second-order data, pass it through the active-passive hybrid conversion and integration stage to form third-order data, and push it to the cache queue; finally, pull the third-order data and pass it through the distributed storage stage to complete final storage.
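The staged flow above can be sketched as a chain of steps connected by queues. This is a minimal single-process illustration, assuming in-memory `queue.Queue` objects stand in for the cache queue and a plain list stands in for distributed storage. The functions `adapt`, `clean` and `integrate`, and the record fields `src` and `value`, are hypothetical placeholders and are not taken from the patent; in particular, `integrate` does not attempt to model the active-passive hybrid mode.

```python
import json
import queue


def adapt(raw):
    """Protocol adaptation: parse a raw payload into a dict (JSON is assumed here)."""
    return json.loads(raw)


def clean(record):
    """Cleaning: return None for abnormal records (assumed rule: missing value)."""
    return record if record.get("value") is not None else None


def integrate(record):
    """Integration: map the record onto a common schema (assumed field names)."""
    return {"source": record["src"], "reading": float(record["value"])}


def run_stage(inbox, outbox, step):
    """Pull everything from one queue, apply a step, push survivors to the next."""
    while not inbox.empty():
        result = step(inbox.get())
        if result is not None:
            outbox.put(result)


def run_pipeline(raw_payloads):
    raw_q, first_q, second_q, third_q = (queue.Queue() for _ in range(4))
    for payload in raw_payloads:
        raw_q.put(payload)
    run_stage(raw_q, first_q, adapt)          # accessed data -> first-order data
    run_stage(first_q, second_q, clean)       # first-order -> second-order data
    run_stage(second_q, third_q, integrate)   # second-order -> third-order data
    stored = []                               # stand-in for distributed storage
    while not third_q.empty():
        stored.append(third_q.get())
    return stored
```

Because each stage only reads from one queue and writes to the next, a stage can be replaced or scaled without touching its neighbours, which is the decoupling the method describes. A real deployment would presumably run the stages concurrently against a shared message broker rather than draining queues in sequence.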

Specific Embodiment 2

[0028] Specific Embodiment 2: building on the method of Embodiment 1, each step can be further refined as follows:

[0029] The data source collection construction stage builds a collection of the original data sources to be collected from the massive multi-source heterogeneous data sources;

[0030] The collection traversal stage records the source type, source number, source dimension and source data protocol of each data source in the collection, forming an array list;
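One possible shape for the list produced by traversal is sketched below. The text names the four recorded attributes but does not define them precisely, so the interpretations here are assumptions: `source_number` is taken to be a sequence number within the collection, and `source_dimension` is taken to be the number of fields per record. The input dictionary keys (`type`, `fields`, `protocol`) are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRecord:
    source_type: str       # e.g. "DB", "JSON"
    source_number: int     # assumed: sequence number within the collection
    source_dimension: int  # assumed: number of fields per record
    protocol: str          # e.g. "JDBC", "HTTP"


def traverse(sources):
    """Walk the collection once and record one entry per data source."""
    return [
        SourceRecord(s["type"], number, len(s["fields"]), s["protocol"])
        for number, s in enumerate(sources, start=1)
    ]
```

Recording these attributes up front lets later stages choose an access method and parser from the list alone, without inspecting the data sources again.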

[0031] The data access stage connects to the data sources before any data processing takes place;

[0032] The protocol adaptation stage performs an initial parse matched to the source type and data protocol of each data source; the parsed data forms first-order data, which is pushed to the first-order topic in the data cache queue;
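Selecting a parser by source type and data protocol can be illustrated with a small registry. This is a sketch under stated assumptions, not the patent's implementation: the parsers, the `PARSERS` table, the topic name and the in-memory `topics` structure are all hypothetical, and spreadsheet content is assumed to arrive as CSV text.

```python
import csv
import io
import json
from collections import defaultdict, deque


def parse_json(payload):
    """Parse a JSON document into a list of records."""
    data = json.loads(payload)
    return data if isinstance(data, list) else [data]


def parse_csv(payload):
    """Parse CSV text with a header row into a list of dict records."""
    return list(csv.DictReader(io.StringIO(payload)))


# Hypothetical adapters keyed by (source type, data protocol).
PARSERS = {
    ("JSON", "HTTP"): parse_json,
    ("EL", "FIO"): parse_csv,  # assumption: spreadsheet exported as CSV text
}

topics = defaultdict(deque)  # in-memory stand-in for topics in the cache queue


def adapt_and_push(source_type, protocol, payload, topic="first-order"):
    """Pick a parser by type and protocol, parse, and push each record to the topic."""
    try:
        parser = PARSERS[(source_type, protocol)]
    except KeyError:
        raise ValueError(f"no adapter for {source_type}/{protocol}") from None
    records = parser(payload)
    topics[topic].extend(records)
    return len(records)
```

Supporting a new kind of source then means adding one entry to the table, while the downstream stages keep consuming the same topic.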

[0033] The cleaning stage first pulls the first-order data; abnormal and problematic data are then cleaned and elimina...
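The pull, filter and push pattern of the cleaning stage might look like the sketch below. The source text is cut off before it states what counts as abnormal data, so the rules in `is_valid` (required keys `id` and `value`, and a numeric value within an arbitrary range) are purely illustrative assumptions, as are the deque-based queues.

```python
from collections import deque


def is_valid(record):
    """Assumed rules: required keys present and a numeric value in a plausible range."""
    if not {"id", "value"} <= record.keys():
        return False
    try:
        value = float(record["value"])
    except (TypeError, ValueError):
        return False
    return -1000.0 <= value <= 1000.0


def clean_topic(first_order, second_order):
    """Pull every first-order record, keep valid ones, push them as second-order data."""
    dropped = 0
    while first_order:
        record = first_order.popleft()
        if is_valid(record):
            second_order.append(record)
        else:
            dropped += 1
    return dropped
```

Returning the number of dropped records gives a simple way to monitor data quality per batch.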

Specific Embodiment 3

[0036] Specific Embodiment 3: this embodiment provides a simulated application scenario. There are three database-type data sources (DB1, DB2, DB3), two Excel-type data sources (EL1, EL2), one JSON-type data source (JN1), two Protocol Buffers data sources (PB1, PB2) and two sensor data sources (SN1, SN2): five types of data source and ten sources in total. The complete access-cleaning-integration-storage process must be carried out on the data from these ten sources, following the method described above:

[0037] First, construct a data source collection containing 10 elements; then traverse the collection, recording a (data source identifier, data source type, access method, adapter) quadruple for each source, to form a list:

[0038] {(DB1,DB,JDBC,JavaApi),(DB2,DB,JDBC,JavaApi),(DB3,DB,JDBC,JavaApi),(EL1,EL,FIO,Buffer),(EL2,EL,FIO,Buffer),(JN1,JSON,HTTP,Json),(PB1,PB,HTTP,Pb),(PB2,PB...
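The quadruple list maps naturally onto named tuples. The sketch below holds only the seven entries that are fully visible in the text; the remaining entries are truncated in the source and are deliberately not reconstructed. The field names and the helper `group_by_adapter` are hypothetical.

```python
from collections import Counter, namedtuple

# (data source identifier, data source type, access method, adapter)
Source = namedtuple("Source", "ident kind access adapter")

# Only the quadruples fully visible in the text; the rest are truncated there.
SOURCES = [
    Source("DB1", "DB", "JDBC", "JavaApi"),
    Source("DB2", "DB", "JDBC", "JavaApi"),
    Source("DB3", "DB", "JDBC", "JavaApi"),
    Source("EL1", "EL", "FIO", "Buffer"),
    Source("EL2", "EL", "FIO", "Buffer"),
    Source("JN1", "JSON", "HTTP", "Json"),
    Source("PB1", "PB", "HTTP", "Pb"),
]


def group_by_adapter(sources):
    """Group source identifiers by the adapter that will parse them."""
    groups = {}
    for source in sources:
        groups.setdefault(source.adapter, []).append(source.ident)
    return groups


counts_by_type = Counter(source.kind for source in SOURCES)
```

Grouping by adapter shows how many distinct parsers the protocol adaptation stage needs for a given collection.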


Abstract

The invention relates to a method for conveniently cleaning, integrating and storing massive multi-source heterogeneous data. It belongs to the field of massive multi-source heterogeneous data processing and solves problems such as the inability to take global factors into account in such processing. The method comprises: constructing a data source collection; traversing the collection and recording types and data protocols; a data access stage; a protocol adaptation stage that forms first-order data and pushes it to a cache queue; pulling the first-order data into a cleaning stage that forms second-order data and pushes it to the cache queue; pulling the second-order data into an active-passive hybrid conversion and integration stage that forms third-order data and pushes it to the cache queue; and finally pulling the third-order data into a distributed storage stage to complete final storage. The method is convenient and has a clear process. It reduces the coupling between the cleaning, integration and storage stages, relieves the pressure of massive data through layer-by-layer migration and order-reduction processing with peak clipping and valley filling, and improves the robustness of the overall process.

Description

Technical field

[0001] The invention relates to the field of massive multi-source heterogeneous data processing, in particular to a method for cleaning, integrating and storing massive multi-source heterogeneous data.

Background

[0002] In the smart city field, smart terminals and sensor networks are increasingly numerous and varied, and city application topics are ever broader. As a result, data sources in the smart city field have become richer, data volumes have reached massive, big-data scale, and the data sources themselves are massive, multi-source and heterogeneous. Against this background, processing massive multi-source heterogeneous data has become a technical focus of the field.

[0003] Facing the field of massive multi-source heterogeneous data processing, for its access-cleaning-integration-storage process, it is necessary to find a process method that is conv...


Application Information

IPC(8): G06F16/215, G06F16/27
CPC: G06F16/215, G06F16/27
Inventor: 刘源周含笑姜宇于雷赵辉谢雨王兆祥董丽娜李墨野刘京京王建勋
Owner: 哈尔滨航天恒星数据系统科技有限公司