Multi-source data aggregation sampling strategy based on big data environment

A multi-source data and big data technology, applied in the field of big data, that addresses increased sample noise, a higher proportion of redundant or missing samples, and the lack of fusion and cross-validation across multi-source and multi-form data, thereby reducing interference.

Active Publication Date: 2017-12-08
NANJING AUDIT UNIV

AI Technical Summary

Problems solved by technology

[0005] (1) Most sampling draws on single-source or single-form data and lacks fusion and cross-validation of multi-source and multi-form data;
[0006] (2) Most sampling is random. In a big data environment, random sampling has limitations, because in a multi-field, multi-source, multi-carrier, and multi-form big data environment, cross-domain and cross-platform sampling is

Examples

Embodiment Construction

[0037] In order to make the objects and advantages of the present invention clearer, the present invention will be further described in detail below in conjunction with the examples. It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit the present invention.

[0038] The embodiment of the present invention provides a multi-source data aggregation sampling strategy based on a big data environment, including the following steps:

[0039] Preparation stage: Input the initial data sets from multiple sources and uniformly set their encoding to GBK. Use the ID attribute in the first column of each file to identify and distinguish individual rows, which avoids the problem of reading the same row repeatedly during the experiment. The initial data set includes at least social media, news platforms, special websites, patent websites, and talent recruitment data resources about business objecti...
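The preparation stage can be illustrated with a minimal Python sketch. It assumes each source arrives as GBK-encoded CSV bytes with the ID in the first column; the function name `load_sources`, the source labels, and the sample rows are hypothetical and are not taken from the patent.

```python
import csv
import io


def load_sources(sources):
    """Decode each source as GBK and keep one row per (source, ID) pair.

    `sources` maps a hypothetical source label to raw GBK-encoded CSV bytes.
    The ID is assumed to sit in the first column, as in paragraph [0039].
    """
    seen = set()
    rows = []
    for name, raw in sources.items():
        text = raw.decode("gbk")  # uniform GBK decoding for every source
        for record in csv.reader(io.StringIO(text)):
            if not record:
                continue  # skip blank lines
            key = (name, record[0])
            if key in seen:
                continue  # this ID was already read for this source
            seen.add(key)
            rows.append({"source": name, "id": record[0], "fields": record[1:]})
    return rows


# Illustrative sample data only.
sample = {
    "news": "1,港口吞吐量上升\n2,新专利发布\n1,港口吞吐量上升\n".encode("gbk"),
    "social": "1,招聘需求增加\n".encode("gbk"),
}
rows = load_sources(sample)
```

Keying the duplicate check on the pair of source and ID is a design choice of this sketch: it lets two sources reuse the same ID value without one row hiding the other.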

Abstract

The invention discloses a multi-source data aggregation sampling strategy based on a big data environment. The method comprises the following steps: starting from a multi-source data set encoded in GBK, multiple attributes within one data set are fused and data sets from different sources are fused, completing the multi-source data fusion operation; the fused files are segmented into words to form a two-dimensional file-word frequency matrix; high-frequency words are displayed to the user for reference, and random migratory selection of the seed root node words required by snowball sampling is conducted; business-goal-oriented seed root node keywords are selected, and the depth of snowball sampling is input; on the basis of the seed root node data, a balance check value is set, and the words are matched in a loop to perform snowball sampling; a directed acyclic graph and an adjacency matrix are constructed; a root node clustering network diagram and a logical reasoning diagram related to the business goals can then be obtained. The strategy reduces the interference of sample noise on subsequent reasoning.
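The word-frequency, snowball-sampling, and graph-construction steps in the abstract can be sketched in Python under several loudly stated assumptions: documents are already segmented into word lists, two words "match" when they co-occur in the same document, and the parameter `min_cooccurrence` is only a stand-in threshold, not the patent's balance check value (whose definition is not given in this excerpt). All function names and sample words are hypothetical.

```python
from collections import Counter


def term_frequency_matrix(docs):
    """Return (vocab, matrix) where matrix[d][w] counts word w in document d."""
    vocab = sorted({word for doc in docs for word in doc})
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        matrix.append([counts[word] for word in vocab])
    return vocab, matrix


def snowball_sample(docs, seeds, depth, min_cooccurrence=1):
    """Expand from seed words through co-occurring words, level by level.

    Assumption: a word pair matches if it co-occurs in at least
    `min_cooccurrence` documents. Edges only run from level d-1 to level d,
    so the resulting graph is acyclic by construction.
    """
    pair_counts = Counter()
    for doc in docs:
        words = sorted(set(doc))
        for i, a in enumerate(words):
            for b in words[i + 1:]:
                pair_counts[(a, b)] += 1
    neighbours = {}
    for (a, b), n in pair_counts.items():
        if n >= min_cooccurrence:
            neighbours.setdefault(a, set()).add(b)
            neighbours.setdefault(b, set()).add(a)

    level = {seed: 0 for seed in seeds}
    edges = []
    frontier = list(seeds)
    for d in range(1, depth + 1):
        next_frontier = []
        for parent in frontier:
            for child in sorted(neighbours.get(parent, ())):
                if child not in level:
                    level[child] = d
                    next_frontier.append(child)
                if level[child] == d:
                    edges.append((parent, child))
        frontier = next_frontier
    return level, edges


def adjacency_matrix(level, edges):
    """Order nodes by (level, word) and fill a 0/1 adjacency matrix."""
    nodes = sorted(level, key=lambda word: (level[word], word))
    index = {word: i for i, word in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for a, b in edges:
        matrix[index[a]][index[b]] = 1
    return nodes, matrix


# Illustrative, pre-segmented sample documents.
docs = [
    ["audit", "risk", "warning"],
    ["risk", "fraud"],
    ["fraud", "penalty"],
    ["patent", "hiring"],
]
vocab, tf = term_frequency_matrix(docs)
level, edges = snowball_sample(docs, seeds=["audit"], depth=2)
nodes, adj = adjacency_matrix(level, edges)
```

Because nodes are ordered by sampling level, every edge points from an earlier row to a later column, so the adjacency matrix is strictly upper triangular. That gives a quick check that the sampled graph contains no cycle.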

Description

Technical field

[0001] The invention relates to the field of big data, in particular to a multi-source data aggregation sampling strategy based on a big data environment.

Background

[0002] In a big data environment, data related to decision-making goals is characterized by multi-source heterogeneity, heterogeneous association, hierarchical nesting, and dynamic evolution. Decision-goal-oriented aggregation and reasoning-based sampling of multi-source heterogeneous data has great practical value in risk warning, business opportunity prediction, and anomaly detection. The open questions include how to select the range of samples and their attribute features, how to determine the correlation between sample attribute features, and how to construct the logical reasoning structure between samples and their attribute features.

[0003] At present, the sampling technology of big data is mainly embodied in single-source or single-form data sampling and random samp...

Application Information

IPC(8): G06F17/30, G06N5/04, G06K9/62, G06F17/27
CPC: G06F16/9024, G06F16/9027, G06N5/04, G06F40/242, G06F40/289, G06F18/2323, G06F18/24155
Inventor: 李保珍, 朱庆康, 杨刚, 余臻, 周可
Owner NANJING AUDIT UNIV