Parallel random forest algorithm for processing unbalanced big data

A random forest method based on big data technology, applied in the field of unbalanced big data classification. It addresses three problems: the low density of positive samples, the deepening influence of data bias, and reduced classification efficiency.

Inactive Publication Date: 2018-04-06
JIANGSU LANSWON TECH CO LTD

AI Technical Summary

Problems solved by technology

[0005] When MR_RF is applied to unbalanced data classification, two problems arise. First, because the positive sample density is too low and each base classifier is built on data sampled from a single block, the influence of data bias is deepened. Second, MR_RF uses the global minimum Gini value as the criterion for selecting split points when constructing a tree, so it easily ignores the positive class and tends to assign positive samples to the negative class, which reduces classification efficiency.
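
To illustrate why a plain Gini criterion tends to overlook a rare positive class, the following minimal sketch compares the unweighted Gini impurity of an unbalanced node with a class-weighted variant. The function name `gini`, the `weights` parameter, and the choice of weight (the negative-to-positive ratio) are assumptions made for illustration; the source does not give the exact weighting used by SBWMR_RF.

```python
from collections import Counter


def gini(labels, weights=None):
    """Gini impurity of a node, with optional per-class weights.

    `weights` maps a class label to a multiplier (hypothetical cost weight);
    classes without an entry get weight 1.0, which gives the plain Gini value.
    """
    weights = weights or {}
    counts = Counter(labels)
    # Weighted mass of each class in the node.
    mass = {c: n * weights.get(c, 1.0) for c, n in counts.items()}
    total = sum(mass.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((m / total) ** 2 for m in mass.values())


# A node with 990 negative (0) and 10 positive (1) samples.
node = [0] * 990 + [1] * 10

plain = gini(node)                    # the node already looks almost pure
weighted = gini(node, {1: 990 / 10})  # positives weighted by the imbalance ratio

print(round(plain, 4), round(weighted, 4))
```

With unit weights the impurity is already close to zero, so no split that isolates the ten positives can reduce it by much. Weighting the positive class restores the impurity to that of a balanced node, which gives such splits a measurable gain.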


Examples


Embodiment Construction

[0024] Specific implementation

[0025] The invention provides a parallel random forest algorithm for handling unbalanced data classification effectively in a big data environment. The process is divided into two parts, model building and classification prediction, which are described below with reference to the flowcharts in Figure 1 and Figure 2:

[0026] When the SBWMR_RF algorithm builds a model, multiple blocks are processed in parallel. First, the Hadoop platform divides the data into blocks and sends them to different data nodes. The cost matrix of each block is then calculated, see formula (1); C(-,+) uses a dynamic local imbalance index so that splitting the data into blocks does not deepen the imbalance. Data key-value pairs are used as input: the key is the binary stream encoding an instance, and the value is the specific data of that instance. Stratified bootstrap sampling is performed on each block's data, and then the final sample data set {sd ...
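
The per-block steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented procedure: formula (1) is not reproduced in this text, so the sketch assumes that C(-,+) is the block's local negative-to-positive ratio and that the other off-diagonal cost is 1. The names `stratified_bootstrap` and `local_cost_matrix`, and the choice to draw as many samples from each class as that class contains, are likewise hypothetical.

```python
import random


def stratified_bootstrap(block, rng):
    """Bootstrap-sample one block separately within each class.

    `block` is a list of (features, label) pairs with label 1 for the
    positive class and 0 for the negative class. Each class is resampled
    with replacement to its own size (an assumption), so the rare positive
    class is never dropped from the sample by chance.
    """
    sample = []
    for cls in (1, 0):
        stratum = [row for row in block if row[1] == cls]
        if stratum:
            sample.extend(rng.choices(stratum, k=len(stratum)))
    return sample


def local_cost_matrix(block):
    """Cost matrix computed from the block's own class counts.

    Assumed form: C(-,+) equals the local imbalance ratio n_neg / n_pos,
    C(+,-) equals 1, and correct decisions cost 0.
    """
    n_pos = sum(1 for _, label in block if label == 1)
    n_neg = len(block) - n_pos
    ratio = n_neg / n_pos if n_pos else 1.0
    return {("-", "+"): ratio, ("+", "-"): 1.0, ("+", "+"): 0.0, ("-", "-"): 0.0}


# One toy block: 3 positive and 7 negative instances.
block = [((i,), 1) for i in range(3)] + [((i,), 0) for i in range(3, 10)]
rng = random.Random(0)

sample = stratified_bootstrap(block, rng)
costs = local_cost_matrix(block)
print(len(sample), costs[("-", "+")])
```

Because the ratio is recomputed from each block's own counts, a block that happens to receive very few positives gets a correspondingly larger cost, which is one plausible reading of the "dynamic local imbalance index".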



Abstract

The invention discloses a parallel random forest algorithm for processing unbalanced big data. It belongs to the field of unbalanced big data classification and aims to solve the problem that a parallel random forest algorithm based on the MapReduce mechanism classifies unbalanced big data inefficiently. A parallel random forest algorithm (SBWMR_RF) that combines stratified bootstrap sampling with cost-sensitive learning is provided. Bootstrap sampling is carried out separately on the positive and negative classes of each partitioned data set, which increases the information content of the positive class while preserving randomness. A cost-sensitive matrix is calculated dynamically for each partition, and the resulting costs are used to build the base classifiers. The invention handles unbalanced data through both data distribution adjustment and algorithm adjustment, weakens the influence of data bias, and improves the classification accuracy on unbalanced big data; the algorithm has obvious advantages in extremely unbalanced environments.

Description

Technical field

[0001] The invention relates to a parallel random forest algorithm for processing unbalanced big data. It belongs to classification technology in machine learning and concerns a classification method for unbalanced big data.

Background

[0002] The technical focus of big data processing is fast and scalable parallel analysis. As the mainstream framework, Hadoop uses the simple idea of "divide and conquer" to integrate the distributed file system HDFS, the distributed computing framework MapReduce, the non-relational database HBase, and other components for big data processing, and can carry out "big data" storage, management, and mining tasks. Big data and big data analysis and processing technologies promote each other. As an ensemble classification method, the MapReduce-based random forest algorithm, hereinafter referred to as MR_RF, is simple and efficient in processing classification and regression problems, and has good parallelis...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06K9/62
CPC: G06F18/231, G06F18/2415
Inventor: 秦静
Owner: JIANGSU LANSWON TECH CO LTD