Non-uniform big data classifying method

A classification method for big data, applied in text database clustering/classification, unstructured text data retrieval, electronic digital data processing, etc. It addresses problems such as the bias and high algorithmic complexity of big data classification, reducing complexity and improving classification quality.

Active Publication Date: 2014-01-08
GUANGXI NORMAL UNIV

AI Technical Summary

Problems solved by technology

[0006] The invention studies the classification problem of non-uniform big data. The method addresses the bias that easily arises in big data classification and the high complexity of big data algorithms.

Examples

Embodiment 1

[0035] The simulated big data set contains two million instances, each with 1,000 dimensions. The entire data set is divided into two categories: the first contains 1.99 million instances and the second contains only 10,000 instances. The data set is randomly generated and represents an imbalanced big data binary classification problem.

[0036] (1) Set the confidence level to 99% and the maximum allowable error to 1%, so the sample size per dataset per class is 16641. According to the proportion, 10,000 instances are drawn from class A (the class containing 1.99 million instances) and combined with 10,000 non-class-A instances, so each dataset contains 20,000 instances. An ordinary PC can usually apply common meta-classifiers to datasets of 20,000 instances without difficulty.
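The sample sizes quoted in the embodiments (16641 at 99% confidence, 9604 at 95% confidence, both with a 1% error) are consistent with the standard sample-size formula for estimating a proportion, n = z^2 * p * (1 - p) / E^2, taken at the worst case p = 0.5. The source does not state which formula it uses, so the sketch below is an assumption; the function name `sample_size` and the z-values 2.58 and 1.96 are illustrative choices, not taken from the patent.

```python
from fractions import Fraction
import math


def sample_size(z: str, margin: str, p: str = "0.5") -> int:
    """Minimum sample size for estimating a proportion.

    z      -- z-score for the confidence level (assumed: 2.58 for 99%, 1.96 for 95%)
    margin -- maximum allowable error, e.g. "0.01" for 1%
    p      -- assumed proportion; 0.5 is the worst case (largest n)

    Arguments are decimal strings so that Fraction keeps the arithmetic exact
    and the final ceiling is not disturbed by floating-point rounding.
    """
    z_f, e_f, p_f = Fraction(z), Fraction(margin), Fraction(p)
    n = z_f ** 2 * p_f * (1 - p_f) / e_f ** 2
    return math.ceil(n)


print(sample_size("2.58", "0.01"))  # 99% confidence, 1% error -> 16641
print(sample_size("1.96", "0.01"))  # 95% confidence, 1% error -> 9604
```

Under this reading, the required sample size depends only on the confidence level and the allowable error, not on the size of the full data set, which is what keeps each sub-dataset small enough for an ordinary PC.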

[0037] (2) According to the above method, a total of 10 sub-datasets are generated in this example. Establish 10 classifiers using the near...

Embodiment 2

[0041] The simulated big data set contains 20 million instances, each with 1,000 dimensions. The entire data set is divided into three categories: category A contains 12 million instances, category B contains 7.9 million instances, and category C contains 100,000 instances. The data set is randomly generated and represents an imbalanced big data multiclass classification problem.

[0042] (1) Set the confidence level to 95% and the maximum allowable error to 1%, giving a sample size of 9604 per class for each dataset. It is somewhat difficult for an ordinary computer to process 300,000 instances, so all three classes need to be sampled.

[0043] (2) Draw 10 sub-datasets for class A, each containing 20,000 instances (note: the number of instances only needs to exceed 9604). More specifically, 10,000 samples are randomly drawn from class A, then 5,000 samples are randomly drawn from class B, and 5,000 samples are randomly drawn from class...
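The sub-dataset construction in step (2) can be sketched at toy scale as follows. The source text is truncated, so several details here are assumptions: half of each sub-dataset comes from the target class, the other half is split evenly among the remaining classes, and instances are represented by integer ids. The function name `make_subdatasets` and the scaled-down class sizes are hypothetical.

```python
import random


def make_subdatasets(class_ids, target, n_subsets, n_target, seed=0):
    """Build one-vs-rest sub-datasets for `target` by random downsampling.

    class_ids -- dict mapping class name to a sequence of instance ids
    n_target  -- instances drawn from the target class per sub-dataset;
                 the same total is split evenly over the other classes
                 (assumed rule; requires at least two classes)
    Returns a list of sub-datasets, each a list of (instance_id, label)
    pairs with label 1 for the target class and 0 otherwise.
    """
    rng = random.Random(seed)
    others = [c for c in class_ids if c != target]
    n_other = n_target // len(others)
    subsets = []
    for _ in range(n_subsets):
        rows = [(i, 1) for i in rng.sample(class_ids[target], n_target)]
        for c in others:
            rows += [(i, 0) for i in rng.sample(class_ids[c], n_other)]
        rng.shuffle(rows)
        subsets.append(rows)
    return subsets


# Toy data: the 12M / 7.9M / 100k class sizes scaled down by 10,000.
class_ids = {
    "A": range(0, 1200),
    "B": range(1200, 1990),
    "C": range(1990, 2000),
}
subsets_for_a = make_subdatasets(class_ids, "A", n_subsets=10, n_target=10)
print(len(subsets_for_a), len(subsets_for_a[0]))
```

Each sub-dataset is balanced between the target class and the rest, so one classifier can be trained per sub-dataset and the classifiers combined into an ensemble for that class.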

Abstract

The invention relates to a non-uniform big data classifying method for data sets that are too large to be classified in computer memory and whose categories are non-uniformly distributed. First, the sample size is determined from sampling theory for use in downsampling, and the number of classifiers is determined from the number of samples. An integrated (ensemble) classifier is established for each category of the big data. When a test case is tested, it is classified by the integrated classifiers of all categories, and the category of the integrated classifier with the highest classification rate is taken as the category of the test case. The time complexity of the method for big data classification is linear, and the method can reduce the bias of non-uniform big data classification results. Furthermore, the integrated classifiers improve accuracy. The method is easy to implement and involves only some simple mathematical models when writing code.
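One possible reading of the decision rule in the abstract, as a minimal sketch: each category has an ensemble of binary classifiers, the ensemble's score for a test case is the fraction of its members that accept the case, and the category with the highest score wins. Interpreting "classification rate" as a vote fraction is an assumption, as are the names `ensemble_score` and `classify` and the threshold-based toy classifiers.

```python
from typing import Callable, Dict, List, Sequence

# A member classifier answers: "does this instance belong to my category?"
Classifier = Callable[[Sequence[float]], bool]


def ensemble_score(members: List[Classifier], x: Sequence[float]) -> float:
    """Fraction of ensemble members that accept x (assumed 'classification rate')."""
    return sum(1 for clf in members if clf(x)) / len(members)


def classify(ensembles: Dict[str, List[Classifier]], x: Sequence[float]) -> str:
    """Return the category whose ensemble gives x the highest score."""
    return max(ensembles, key=lambda c: ensemble_score(ensembles[c], x))


# Toy ensembles: three threshold rules per category on the first feature.
ensembles = {
    "A": [lambda x, t=t: x[0] > t for t in (0.4, 0.5, 0.6)],
    "B": [lambda x, t=t: x[0] <= t for t in (0.4, 0.5, 0.6)],
}

print(classify(ensembles, [0.9]))   # all "A" members accept -> A
print(classify(ensembles, [0.45]))  # A scores 1/3, B scores 2/3 -> B
```

Prediction cost in this sketch is one pass over every member classifier, so it grows with the number of categories and ensemble members rather than with the size of the original data set.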

Description

Technical field

[0001] The present invention relates to the fields of computer science and technology and information technology, in particular to big data, and more specifically to a processing method for the classification of non-uniform big data.

Background technique

[0002] Big data refers to collections of data that cannot be captured, managed and processed with conventional software tools within the existing physical conditions and allowed time. Big data has the following characteristics: Volume (large amount of data), Variety (variety of data types), Value (low value density), and Velocity (fast processing speed), referred to as the 4Vs.

[0003] At present, big data research usually falls into two categories. First, big data challenges architecture. The raw data capacity in the Hadoop clusters of many famous websites currently reaches dozens of PB, with redundancy, and needs to be scanned and updated every day. Then, in order to ensure that the failure of a single node o...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/35, G06F16/90
Inventor: 朱晓峰, 张师超
Owner: GUANGXI NORMAL UNIV