Random forest parallelized machine learning method for big data in a Spark cloud service environment

A random forest machine learning technology, applied in the computer field, that addresses the performance degradation and long running time of classification methods on big data. It reduces the amount and complexity of computation, reduces the impact of noisy data, and improves classification accuracy.

Pending Publication Date: 2016-05-04
HUNAN UNIV

AI Technical Summary

Problems solved by technology

[0008] Traditional classification methods can achieve ideal results on low-dimensional small data sets, but when the structure of the data becomes complex, the dimension of the data becomes higher, and the size of the dat



Examples


Embodiment 1

[0038] (1) To address the high dimensionality of big data features, feature importance analysis is applied during both training and prediction to reduce the dimensionality of the data, which effectively reduces the method's computational cost and complexity. To address the large amount of noisy data in big data, weighted voting is used when predicting on a data set, which reduces the influence of noisy data on the classification vote and improves the classification accuracy of the random forest machine learning method on complex big data.
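The weighted voting described in [0038] can be illustrated with a minimal sketch. The source does not say how the per-tree weights are computed, so the sketch assumes each weight is that tree's accuracy on held-out data; the function name `weighted_vote` and the sample values are hypothetical.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight.

    predictions: one predicted label per tree.
    weights: one non-negative weight per tree. Assumed here to be each
             tree's accuracy on held-out data; the source does not specify.
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per tree")
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    # Highest total wins; ties go to the label that sorts first,
    # so the result is deterministic.
    return min(totals, key=lambda label: (-totals[label], label))


# Three weak trees say "spam", two strong trees say "ham".
tree_predictions = ["spam", "ham", "ham", "spam", "spam"]
tree_weights = [0.55, 0.90, 0.85, 0.52, 0.50]
result = weighted_vote(tree_predictions, tree_weights)
```

With these illustrative values an unweighted majority vote would pick "spam", while the weighted vote picks "ham" because the two more reliable trees outweigh the three weaker ones. That is the sense in which weighting limits the influence of trees misled by noisy data.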

[0039] Step 1: Feature selection on the training data during random forest model training. The process is shown in Figure 1. The specific implementation steps are as follows:

[0040] Step 1.1: Sample the high-dimensional big data training set with replacement to form n training data subsets;

[0041] Step 1.2: Ca...
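Step 1.1 above (sampling with replacement into n training subsets) can be sketched as follows. The source does not state the size of each subset, so the sketch assumes each subset has the same number of records as the original training set, as in standard bagging. The function name `bootstrap_subsets` and the `seed` parameter are hypothetical, and the sketch runs on a local list rather than on a Spark cluster.

```python
import random


def bootstrap_subsets(training_set, n_subsets, seed=None):
    """Draw n_subsets bootstrap samples from training_set.

    Each sample is drawn with replacement and has the same length as
    training_set (an assumption; the source gives no subset size).
    """
    size = len(training_set)
    if size == 0:
        raise ValueError("training_set must not be empty")
    rng = random.Random(seed)
    return [
        [training_set[rng.randrange(size)] for _ in range(size)]
        for _ in range(n_subsets)
    ]


# Ten toy records, three bootstrap subsets.
records = list(range(10))
subsets = bootstrap_subsets(records, n_subsets=3, seed=42)
```

Because sampling is with replacement, a record may appear several times in one subset and not at all in another, which is what gives each tree a different view of the training data.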



Abstract

The invention discloses a random forest parallelized machine learning method for big data in a Spark cloud service environment. The method reduces the dimensionality of high-dimensional big data through feature vector importance analysis and makes predictions by weighted voting. Using a distributed memory management mechanism and a cloud computing platform, it improves the parallelization of model building during random forest training, of the splitting process of individual decision trees, and of prediction voting. Dimensionality reduction through feature vector importance analysis, combined with weighted voting at prediction time, optimizes the random forest method and improves its mining performance on complex big data. On this basis, the random forest is parallelized on the Spark cloud platform, which improves the operating efficiency of the random forest machine learning method.
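As an illustration of dimensionality reduction by feature importance, the sketch below ranks categorical features and keeps the top k. The visible text does not name the importance measure, so information gain is used here purely as an assumed stand-in. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(column, labels):
    """Entropy reduction from splitting labels by the values in column."""
    total = len(labels)
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def select_top_features(rows, labels, k):
    """Return the indices of the k features with the highest assumed importance."""
    n_features = len(rows[0])
    scores = [
        information_gain([row[j] for row in rows], labels)
        for j in range(n_features)
    ]
    # Rank by descending score; ties go to the lower feature index.
    ranked = sorted(range(n_features), key=lambda j: (-scores[j], j))
    return sorted(ranked[:k])


# Feature 0 determines the label; features 1 and 2 carry no information.
rows = [("a", "x", "p"), ("a", "x", "q"), ("b", "x", "p"), ("b", "x", "q")]
labels = ["yes", "yes", "no", "no"]
kept = select_top_features(rows, labels, k=1)
```

On this toy data only feature 0 is kept, so later tree building would work on one column instead of three.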

Description

Technical field

[0001] The invention belongs to the field of computers, and in particular relates to a big-data-oriented random forest parallel machine learning method in a Spark cloud service environment.

Background art

[0002] Explanation of terms:

[0003] Feature dimensionality reduction: In image or data feature extraction, extracting too many feature dimensions often makes feature matching overly complex and consumes system resources. In that case, a low-dimensional feature is used to represent the high-dimensional feature; this is dimensionality reduction.

[0004] With the continuous emergence of new information publishing methods, the rise of technologies such as cloud computing and the Internet of Things, and sensors deployed all over the world, data is growing and accumulating at an unprecedented rate. The data age has arrived. As network applications deepen, the value of big data applications is becoming more and more obvious. Massi...


Application Information

IPC(8): G06F17/30, G06N99/00
CPC: G06F2216/03, G06F16/90, G06N20/00
Inventor: 唐卓, 陈建国, 李肯立, 鲁彬, 陈俊杰, 肖锦波
Owner: HUNAN UNIV