Parallel random forest algorithm for processing unbalanced big data

Random forest and big data technology, applied to unbalanced big data classification, addresses three problems: low positive-sample density, a deepened influence of data bias, and reduced classification efficiency.

Inactive Publication Date: 2018-04-06

JIANGSU LANSWON TECH CO LTD

0 Cites, 9 Cited by

AI Technical Summary

Problems solved by technology

[0005] When MR_RF handles unbalanced data classification, the positive sample density is too low and each base classifier is built on the sampled data of a single block, so the influence of data bias is deepened. In addition, MR_RF uses the global minimum Gini value as the criterion for selecting split points when constructing a tree, so it tends to ignore the positive class and assign positive samples to the negative class, which reduces classification efficiency.
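To see why a plain Gini criterion can overlook a rare positive class, the following minimal sketch computes Gini impurity for a heavily skewed node. The sample counts are hypothetical and chosen only for illustration; the helper names `gini` and `split_gini` are not from the patent.

```python
def gini(pos, neg):
    """Gini impurity of a node holding `pos` positive and `neg` negative samples."""
    total = pos + neg
    if total == 0:
        return 0.0
    p = pos / total
    return 1.0 - p * p - (1.0 - p) ** 2


def split_gini(left, right):
    """Size-weighted Gini impurity of a binary split; each side is a (pos, neg) tuple."""
    n_left, n_right = sum(left), sum(right)
    return (n_left * gini(*left) + n_right * gini(*right)) / (n_left + n_right)


# Hypothetical node: 10 positives among 1000 samples (assumed counts).
parent = gini(10, 990)      # about 0.0198, already close to "pure"
balanced = gini(500, 500)   # 0.5, the maximum for two classes

# A split that halves the node without separating the classes
# leaves the impurity unchanged.
uninformative = split_gini((5, 495), (5, 495))

# A split that isolates most positives lowers the impurity, but the
# largest possible decrease is bounded by the small parent impurity.
isolating = split_gini((8, 2), (2, 988))

print(parent, balanced, uninformative, isolating)
```

Because the skewed node already has an impurity near zero, a tree that labels every sample negative looks almost optimal under this criterion, which is consistent with the tendency described in [0005].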

Embodiment Construction

[0024] Specific implementation plan

[0025] The invention designs a parallel random forest algorithm that effectively handles unbalanced data classification in a big data environment. The process is divided into two parts, model building and classification prediction, which are described below with reference to the flowcharts in Figure 1 and Figure 2:

[0026] When the SBWMR_RF algorithm builds a model, multiple blocks are processed in parallel. First, the Hadoop platform divides the data into blocks and sends them to different data nodes. The cost matrix of each block is then calculated, see formula (1); C(-,+) uses a dynamic local imbalance index so that splitting the data into blocks does not deepen the imbalance. Data key-value pairs are used as input: the key is the binary stream encoding an instance, and the value is the specific data of that instance. Stratified bootstrap sampling is performed on each block's data, and then the final sample data set {sd ...
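The per-block sampling step in [0026] can be sketched roughly as follows. This is a single-machine illustration under stated assumptions, not the patent's MapReduce implementation: the function names, the `n_per_class` parameter, and the `"+"`/`"-"` label encoding are hypothetical, and `local_imbalance_ratio` is only a simple stand-in for the "dynamic local imbalance index", since formula (1) is not reproduced in this text.

```python
import random
from collections import Counter


def stratified_bootstrap(block, rng, n_per_class=None):
    """Bootstrap each class of one block separately (sampling with replacement).

    `block` is a list of (features, label) pairs with labels "+" or "-".
    `n_per_class` is a hypothetical knob: the number of draws per class.
    By default each class is resampled to its own original size.
    """
    by_class = {}
    for instance in block:
        by_class.setdefault(instance[1], []).append(instance)
    sample = []
    for label in sorted(by_class):
        members = by_class[label]
        k = len(members) if n_per_class is None else n_per_class
        sample.extend(rng.choices(members, k=k))
    return sample


def local_imbalance_ratio(block):
    """Negatives per positive within one block (assumed stand-in, not formula (1))."""
    counts = Counter(label for _, label in block)
    if counts["+"] == 0:
        return float("inf")
    return counts["-"] / counts["+"]


# Hypothetical block: 95 negatives and 5 positives.
block = [((i,), "-") for i in range(95)] + [((i,), "+") for i in range(5)]
rng = random.Random(0)

same_size = stratified_bootstrap(block, rng)                 # keeps 95 / 5
rebalanced = stratified_bootstrap(block, rng, n_per_class=20)  # 20 / 20

print(Counter(label for _, label in same_size))
print(Counter(label for _, label in rebalanced))
print(local_imbalance_ratio(block))
```

Sampling each class separately guarantees that every bootstrap sample contains positive instances, which a plain bootstrap over a very skewed block does not. Drawing the same number per class is one possible way to raise the share of positives; the source does not state which sample sizes SBWMR_RF actually uses.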

Abstract

The invention discloses a parallel random forest algorithm for processing unbalanced big data. It belongs to the field of unbalanced big data classification and aims to solve the low efficiency of the MapReduce-based parallel random forest algorithm when classifying unbalanced big data. The proposed parallel random forest algorithm (SBWMR_RF) combines stratified bootstrap sampling with cost-sensitive learning: bootstrap sampling is carried out separately on the positive and negative classes of each data partition, which increases the information content of the positive class while preserving randomness; a cost-sensitive matrix is calculated dynamically for each partition; and the resulting costs are used to build the base classifiers. The invention thus handles unbalanced data through both data distribution adjustment and algorithm adjustment, weakens the influence of data bias, and improves classification accuracy on unbalanced big data, with obvious advantages in extremely unbalanced environments.

Description

Technical field

[0001] The invention relates to a parallel random forest algorithm for processing unbalanced big data. It belongs to classification technology in machine learning and concerns a classification method for unbalanced big data.

Background technique

[0002] The technical focus of big data processing is fast, scalable parallel analysis. As the mainstream framework, Hadoop applies the simple idea of "divide and conquer" and integrates the distributed file system HDFS, the distributed computing framework MapReduce, the non-relational database HBase, and other components to store, manage, and mine big data. Big data and big data analysis and processing technologies promote each other. The random forest algorithm based on MapReduce, hereinafter referred to as MR_RF, is an ensemble classifier; because it is simple and efficient in processing classification and regression problems, and has good parallelis...

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G06K9/62

CPC: G06F18/231, G06F18/2415

Inventor: 秦静

Owner: JIANGSU LANSWON TECH CO LTD
