Multi-source data aggregation sampling method and system based on big data environment

A multi-source data and big data technology, applied in the field of big data. It addresses the lack of research on preprocessing semi-structured and unstructured data, the inability to integrate large-scale heterogeneous data sources, and the failure to meet user needs well. Its stated effects are avoiding the loss of effective information, reducing storage resources and network bandwidth, and reducing or eliminating noise data.

Pending Publication Date: 2019-08-20
ZHEJIANG UNIVERSITY OF SCIENCE AND TECHNOLOGY
Cites: 6 | Cited by: 18

AI Technical Summary

Problems solved by technology

[0004] In existing multi-source data aggregation sampling under a big data environment, the preprocessing of structured, semi-structured and unstructured data is insufficiently studied. The process usually includes only two modules, data acquisition and data cleaning, and the data cleaning method is relatively simple, so it cannot meet user needs well. In addition, when the data is fused, no open linked data set is used as prior knowledge, so large-scale heterogeneous data sources cannot be fused efficiently and accurately while the complexity is reduced.



Examples


Detailed description of the embodiments

[0072] To aid understanding of the content, features and effects of the present invention, the following embodiments are given as examples and described in detail with reference to the accompanying drawings.

[0073] The structure of the present invention will be described in detail below in conjunction with the accompanying drawings.

[0074] As shown in Figure 1, the multi-source data aggregation sampling method based on the big data environment provided by the present invention comprises the following steps:

[0075] S101: collect multiple original data sources through the data source collection module; each original data source includes a data source name and at least one associated field;

[0076] S102: through the preprocessing module, the central control module uses the data processing program to clean the collected data sources, identify them, and remove redundant data;

[0077] S103, using the construction program to obtain the original policy l...
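Steps S101 and S102 can be illustrated with a small Python sketch. It is only an assumption about how such a step might look, not the patented implementation: the names DataSource and clean_and_dedupe are hypothetical, field values are assumed to be hashable scalars, and "cleaning" is reduced to trimming strings, dropping rows with empty fields, and removing exact duplicates.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DataSource:
    """Hypothetical record for one original data source (S101)."""
    name: str                      # data source name
    fields: list[str]              # at least one associated field
    rows: list[dict] = field(default_factory=list)

    def __post_init__(self):
        if not self.fields:
            raise ValueError("a data source needs at least one associated field")


def clean_and_dedupe(source: DataSource) -> DataSource:
    """Sketch of S102: clean rows, skip incomplete ones, remove duplicates."""
    seen = set()
    cleaned = []
    for row in source.rows:
        # Cleaning: keep only the declared fields and trim string values.
        norm = {}
        for name in source.fields:
            value = row.get(name)
            if isinstance(value, str):
                value = value.strip()
            norm[name] = value
        # Identification: skip rows with a missing or empty field.
        if any(v is None or v == "" for v in norm.values()):
            continue
        # Redundancy removal: skip rows whose values were already seen.
        key = tuple(norm[name] for name in source.fields)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(norm)
    return DataSource(source.name, source.fields, cleaned)
```

In a Spark deployment such as the one the abstract mentions, the same per-row logic would run on the computing nodes; the sketch keeps it in plain Python so that it stays self-contained.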



Abstract

The invention belongs to the technical field of big data and discloses a multi-source data aggregation sampling method and system based on a big data environment. The method comprises the following steps: collecting a plurality of original data sources, wherein each original data source comprises a data source name and at least one associated field; cleaning and identifying the acquired data sources and removing redundant data from them; obtaining an original strategy list from the original data sources by means of a construction program, and sorting the original strategies in that list to form a strategy list between the data sources; fusing the data sets from different sources by means of a fusion program; segmenting the fused file into words to form a two-dimensional word-frequency matrix of files and words; setting a balance verification value, cyclically matching each word, and carrying out snowball sampling; and displaying the acquired multi-source data on a display. In the preprocessing module, Spark schedules the computing nodes to carry out distributed computation, which makes data preprocessing more efficient; the method is highly practical and widely applicable.
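The two-dimensional word-frequency matrix named in the abstract can be pictured with the minimal Python sketch below. It makes several assumptions that are not in the source: the function name word_frequency_matrix is hypothetical, lower-casing plus whitespace splitting stands in for the real word-segmentation step, and rows are taken to be files and columns to be words.

```python
from collections import Counter


def word_frequency_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts vocabulary[j] in documents[i]."""
    # Assumption: whitespace splitting replaces a real word segmenter.
    tokenised = [doc.lower().split() for doc in documents]
    # Sorted vocabulary gives a stable column order.
    vocabulary = sorted({word for tokens in tokenised for word in tokens})
    matrix = []
    for tokens in tokenised:
        counts = Counter(tokens)  # a missing word counts as 0
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix
```

For Chinese text, a dedicated segmenter would be needed in place of the whitespace split; the matrix construction itself would stay the same.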

Description

Technical field

[0001] The invention belongs to the technical field of big data, and in particular relates to a multi-source data aggregation sampling method and system based on a big data environment.

Background

[0002] Multi-source data fusion technology integrates all the information obtained through investigation and analysis, evaluates it in a unified way, and finally produces unified information. Its purpose is to integrate various kinds of data, absorb the characteristics of different data sources, and extract information that is more unified, better and richer than any single data source can provide. However, in existing multi-source data aggregation sampling under a big data environment, the preprocessing of structured, semi-structured and unstructured data is insufficiently studied, and the process usually includes only two modules: data acquisition and data cleaning. Moreover, the method of da...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/182, G06F16/174, G06F16/11
CPC: G06F16/11, G06F16/174, G06F16/182
Inventors: 云本胜, 钱亚冠, 胡月
Owner: ZHEJIANG UNIVERSITY OF SCIENCE AND TECHNOLOGY