
Semi-supervised random forests classification method based on Spark

A semi-supervised random forest classification technique, applied in the field of machine learning classification algorithms, that addresses problems such as a shortage of training samples, unusable classification models, and sample data too large to process.

Inactive Publication Date: 2016-10-26
CHONGQING UNIV

AI Technical Summary

Problems solved by technology

The invention addresses the following problems: manually labeling samples is costly; very large sample sets cannot be processed quickly on a single machine; too few training samples cannot yield a classification model with high classification accuracy; and large amounts of cheap unlabeled data go unused.



Examples


Embodiment 1

[0037] The specific principles of the Spark-based semi-supervised random forest classification method provided by the present invention are as follows:

[0038] According to Angluin and Laird's work on learning from noisy examples, if the training set size is m, the noise rate is η, the error rate is ε, and c is a constant, then the following relationship holds:

[0039] $$m = \frac{c}{\epsilon^2 (1 - 2\eta)^2} \tag{1}$$

[0040] To simplify the calculation, let μ denote the quotient of the constant c and the square of the error rate ε:

[0041] $$\mu = \frac{c}{\epsilon^2}$$ ...
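The relationship in equation (1) and the quotient μ can be checked numerically. The following is a minimal Python sketch, not code from the patent: the function names `required_samples` and `mu`, and the values chosen for c, ε, and η, are illustrative assumptions. It also uses the identity μ = m(1 − 2η)², which follows algebraically from equation (1) and the definition of μ.

```python
def required_samples(c: float, eps: float, eta: float) -> float:
    """Equation (1): m = c / (eps^2 * (1 - 2*eta)^2)."""
    if eps <= 0:
        raise ValueError("error rate eps must be positive")
    if not 0 <= eta < 0.5:
        raise ValueError("noise rate eta must lie in [0, 0.5)")
    return c / (eps ** 2 * (1 - 2 * eta) ** 2)


def mu(c: float, eps: float) -> float:
    """Quotient of the constant c and the squared error rate."""
    return c / eps ** 2


# Illustrative values only (assumptions, not taken from the patent).
c, eps = 1.0, 0.1
m_clean = required_samples(c, eps, eta=0.0)    # no label noise
m_noisy = required_samples(c, eps, eta=0.25)   # 25% label noise

# With eta = 0.25 the factor (1 - 2*eta)^2 is 0.25, so about four times
# as many samples are needed for the same error rate.
ratio = m_noisy / m_clean

# Rearranging equation (1): mu = m * (1 - 2*eta)^2.
mu_from_m = m_noisy * (1 - 2 * 0.25) ** 2
```

The sketch shows why noise matters in a semi-supervised setting: as η approaches 0.5, the required sample size m grows without bound for a fixed target error rate.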


Abstract

The invention discloses a semi-supervised random forest classification method based on Spark. The random forest algorithm applies sampling with replacement to both the training data set and the column attributes, so that randomness is introduced in both the row and column directions; this ensures diversity among the decision trees and avoids the need for tree pruning. In addition, the category is determined by voting, which greatly improves accuracy. As a result, the random forest algorithm does not need dimensionality reduction when processing high-dimensional data samples, and it works well for both sparse and dense vectors. As verified by several sets of experiments, the semi-supervised learning algorithm reduces the mean error rate of the classification model and improves computing performance.
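The row and column randomness and the voting step described in the abstract can be illustrated on a single machine. The sketch below is a simplified assumption-laden example, not the patent's Spark implementation: it uses one-level decision stumps in place of full decision trees, plain Python lists in place of distributed data, and hypothetical helper names (`bootstrap_rows`, `sample_columns`, `fit_stump`, `fit_forest`, `predict_forest`). Columns are drawn with replacement and then de-duplicated, which is one possible reading of "replacement sampling on column attributes".

```python
import random
from collections import Counter


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def bootstrap_rows(n_rows, rng):
    """Row randomness: draw n_rows row indices with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def sample_columns(n_cols, k, rng):
    """Column randomness: draw k column indices with replacement, de-duplicated."""
    return sorted({rng.randrange(n_cols) for _ in range(k)})


def fit_stump(X, y, rows, cols):
    """Fit a one-split tree on the sampled rows, using only the sampled columns."""
    best = None  # (errors, column, threshold, left_label, right_label)
    for col in cols:
        for t in sorted({X[r][col] for r in rows}):
            left = [y[r] for r in rows if X[r][col] <= t]
            right = [y[r] for r in rows if X[r][col] > t]
            if not left or not right:
                continue
            ll, rl = majority(left), majority(right)
            errors = sum(v != ll for v in left) + sum(v != rl for v in right)
            if best is None or errors < best[0]:
                best = (errors, col, t, ll, rl)
    if best is None:  # no usable split: fall back to a constant predictor
        label = majority([y[r] for r in rows])
        return (0, float("inf"), label, label)
    return best[1:]


def predict_stump(stump, x):
    col, t, left_label, right_label = stump
    return left_label if x[col] <= t else right_label


def fit_forest(X, y, n_trees=25, n_cols=1, seed=0):
    """Train n_trees stumps, each on its own row and column sample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        rows = bootstrap_rows(len(X), rng)
        cols = sample_columns(len(X[0]), n_cols, rng)
        forest.append(fit_stump(X, y, rows, cols))
    return forest


def predict_forest(forest, x):
    """The category is decided by majority vote over all trees."""
    return majority([predict_stump(s, x) for s in forest])


# Toy data (made up for illustration): two well-separated clusters.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2],
     [1.0, 0.9], [0.9, 1.1], [1.2, 1.0], [1.1, 0.8]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

forest = fit_forest(X, y)
low_vote = predict_forest(forest, [0.0, 0.0])
high_vote = predict_forest(forest, [1.2, 1.1])
```

Because each stump sees a different bootstrap sample and a different column subset, individual stumps disagree on borderline points, and the vote averages those disagreements out. In a distributed setting, each tree could be trained independently, which is what makes the algorithm a natural fit for a parallel framework.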

Description

Technical field

[0001] The invention relates to the technical field of machine learning classification algorithms, in particular to the Spark general-purpose parallel framework, Tri-Training semi-supervised classification, and distributed random forest classification algorithms.

Background

[0002] Classification is a category of problems in the field of machine learning and a commonly used data analysis technique, mainly used to determine which predefined target class a given object belongs to. It has a wide range of applications in many fields, including data mining, statistical learning, and neural networks. In machine learning there are two commonly used approaches: supervised learning and unsupervised learning. Traditional classification models must be trained on labeled sample sets, which is supervised learning. In practical applications, the labeling of data is limited by various factors, and s...


Application Information

IPC(8): G06K9/62
CPC: G06F18/24323
Inventor: 李学明, 马超
Owner: CHONGQING UNIV