Optimized random forest unbalanced data set processing method

A random forest technology applied in the fields of computer components, instruments, and character and pattern recognition. It addresses the problems that reducing the imbalance rate of a data set causes information loss and lowers the accuracy on the majority class. As a result, minority-class prediction performance and classification accuracy improve, while the correct prediction rate for the majority class does not drop seriously.

Active Publication Date: 2021-05-25

SUN YAT SEN UNIV

7 Cites, 2 Cited by

Problems solved by technology

[0010] The disadvantages at the data-processing level are as follows. Oversampling generates similar minority-class samples directly, without analyzing the existing minority samples, which easily produces redundant samples and leads to model overfitting.

Undersampling removes majority-class samples to reduce the imbalance rate of the data set, which loses majority-class information and reduces the classification accuracy on the majority class.
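The trade-off described for undersampling can be illustrated with a small sketch. It assumes a binary problem and defines the imbalance rate as the majority count divided by the minority count; the patent text does not give a formula, so that definition and the function names `imbalance_ratio` and `random_undersample` are assumptions made for illustration.

```python
import random
from collections import Counter

def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count (assumed definition)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

def random_undersample(labels, target_ratio, seed=0):
    """Return the indices kept after randomly dropping majority samples
    until majority/minority is at most target_ratio (binary labels assumed)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    majority = max(counts, key=counts.get)
    maj_idx = [i for i, y in enumerate(labels) if y == majority]
    min_idx = [i for i, y in enumerate(labels) if y != majority]
    keep_n = min(len(maj_idx), int(target_ratio * len(min_idx)))
    return sorted(rng.sample(maj_idx, keep_n) + min_idx)

# Toy data: 950 majority samples (label 0) and 50 minority samples (label 1).
labels = [0] * 950 + [1] * 50
kept = random_undersample(labels, target_ratio=2.0)
kept_labels = [labels[i] for i in kept]
discarded = len(labels) - len(kept)

print(imbalance_ratio(labels))       # ratio before undersampling
print(imbalance_ratio(kept_labels))  # ratio after undersampling
print(discarded)                     # majority samples thrown away
```

On this toy data the ratio falls from 19 to 2, but only by discarding 850 of the 950 majority samples, which is the information loss the paragraph refers to.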

[0011] The disadvantage of the ENN algorithm is that, even after it removes some majority-class samples, the data set may still have a large imbalance rate. In addition, because some majority-class samples are deleted, the classification accuracy on the majority class declines.
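A minimal sketch of the idea follows. It assumes the common formulation of Edited Nearest Neighbours, in which a majority-class sample is dropped when most of its k nearest neighbours belong to another class. The patent text does not spell out the rule, so the function name `enn_remove_majority`, the choice k=3, and the toy data are all illustrative.

```python
import math
from collections import Counter

def enn_remove_majority(X, y, majority_label=0, k=3):
    """Return indices kept after dropping majority samples whose k nearest
    neighbours are mostly from another class (assumed ENN rule)."""
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != majority_label:
            keep.append(i)  # minority samples are never removed
            continue
        # k nearest neighbours of sample i, excluding the sample itself
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(xi, X[j]),
        )[:k]
        votes = Counter(y[j] for j in neighbours)
        if votes.most_common(1)[0][0] == majority_label:
            keep.append(i)
    return keep

# Seven majority points near the origin, one majority point (index 7)
# sitting inside the minority cluster, and three minority points.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.2, 0.8), (0.8, 0.2),
     (5.2, 5.2), (5, 5), (5, 6), (6, 5)]
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

kept = enn_remove_majority(X, y)
kept_labels = [y[i] for i in kept]
print(kept)
print(Counter(kept_labels))
```

Only the majority point inside the minority cluster is removed, so the toy set still has seven majority samples against three minority samples. This matches the complaint that ENN can leave a large imbalance rate.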

[0012] The biased random forest algorithm, currently the best-performing approach, improves classification performance by finding error-prone areas and training random forests on two data sets. However, it makes little use of minority-class information, and the resulting data sets may still be unevenly distributed. In addition, because random forest uses bootstrap random resampling, minority-class samples are less likely to be sampled, which affects the classification accuracy on minority-class samples.
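The bootstrap concern can be made concrete with a short calculation. The sketch below assumes plain bootstrap sampling (n draws with replacement from n samples), which is the usual random forest setup; the function names and the example sizes are illustrative and do not come from the patent.

```python
import random

def expected_distinct_minority(n, m):
    """Expected number of distinct minority samples in one bootstrap sample
    of size n drawn with replacement from n samples, m of them minority."""
    return m * (1 - (1 - 1 / n) ** n)

def prob_no_minority(n, m):
    """Probability that one bootstrap sample contains no minority sample."""
    return (1 - m / n) ** n

def simulate_distinct_minority(n, m, trials=200, seed=0):
    """Monte Carlo estimate of the same expectation, as a sanity check."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        # Indices 0..m-1 stand for the minority samples.
        drawn = {rng.randrange(n) for _ in range(n)}
        total += sum(1 for i in drawn if i < m)
    return total / trials

n, m = 1000, 10  # 1000 samples, 10 of them minority
print(expected_distinct_minority(n, m))
print(simulate_distinct_minority(n, m))
print(prob_no_minority(n, m))
```

Under these assumptions each tree sees, on average, only about 63% of the distinct minority samples. With so few minority samples to begin with, each tree has very little minority information to learn from.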

Embodiment

[0059] The present invention is an optimized random forest method for processing unbalanced data sets. The method comprises data preprocessing, random forest model construction, and classification prediction. The data preprocessing finds the K majority-class samples nearest to the minority-class samples, which together form a hard-to-distinguish region. The samples in this region are relabeled in the original data set, and minority-class samples are generated in the hard-to-distinguish region. The relabeled original data and the hard-to-distinguish region with its newly added samples are output as two different training sets. The random forest model construction uses the two data sets produced by the data preprocessing as training sets to obtain two random forest models. In classification prediction, a sample enters the two random forest models in two stages for verification, which finally yields the classification prediction result of the sample...
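The preprocessing step can be sketched roughly as follows. The excerpt is truncated and leaves several details open, so this sketch makes two loudly stated assumptions: majority samples in the hard-to-distinguish region are relabeled as minority in the first training set, and new minority samples are generated by interpolating between two existing minority samples (a SMOTE-style choice). The function name `build_training_sets` and the constants `MAJ` and `MIN` are hypothetical.

```python
import math
import random

MAJ, MIN = 0, 1  # hypothetical label encoding

def build_training_sets(X, y, k=3, seed=0):
    """Return (relabeled original set, hard region plus synthetic minority)."""
    rng = random.Random(seed)
    maj_idx = [i for i, label in enumerate(y) if label == MAJ]
    min_idx = [i for i, label in enumerate(y) if label == MIN]

    # Step 1: the hard region's majority part is the union of the k majority
    # samples nearest to each minority sample.
    hard_maj = set()
    for i in min_idx:
        nearest = sorted(maj_idx, key=lambda j: math.dist(X[i], X[j]))[:k]
        hard_maj.update(nearest)

    # Step 2 (ASSUMPTION): relabel hard-region majority samples as MIN
    # in a copy of the original labels.
    y_relabeled = [MIN if i in hard_maj else label for i, label in enumerate(y)]

    # Step 3: the hard region keeps the true labels of its samples.
    region_idx = sorted(hard_maj) + min_idx
    X_region = [X[i] for i in region_idx]
    y_region = [y[i] for i in region_idx]

    # Step 4 (ASSUMPTION): add interpolated minority samples until the
    # region is balanced.
    while y_region.count(MIN) < y_region.count(MAJ) and len(min_idx) >= 2:
        a, b = rng.sample(min_idx, 2)
        t = rng.random()
        X_region.append(tuple(pa + t * (pb - pa) for pa, pb in zip(X[a], X[b])))
        y_region.append(MIN)

    return (list(X), y_relabeled), (X_region, y_region)

X = [(0, 0), (0, 1), (1, 0), (4, 4), (4, 5), (5, 4), (4.5, 4.5), (5, 5), (5.5, 5.5)]
y = [MAJ] * 7 + [MIN] * 2
(set1_X, set1_y), (set2_X, set2_y) = build_training_sets(X, y)
print(set1_y)
print(list(zip(set2_X, set2_y)))
```

Each of the two returned sets would then train its own random forest. Any generator that places new minority samples inside the region could replace the interpolation step.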

Abstract

The invention discloses an optimized random forest method for processing unbalanced data sets, which comprises data preprocessing, random forest model construction, and classification prediction. In the data preprocessing part, the k majority-class samples nearest to the minority-class samples are found and form a hard-to-distinguish region. The samples of this region are relabeled in the original data set, minority-class samples are generated in the hard-to-distinguish region, and the relabeled original data and the hard-to-distinguish region with its newly added samples are output as two different training sets. In the random forest model construction, the two data sets produced by the data preprocessing part are used as training sets to obtain two random forest models. In classification prediction, a sample enters the two random forest models in two stages for verification, and a classification prediction result for the sample is finally obtained. The method thereby improves minority-class prediction performance without seriously reducing majority-class prediction accuracy.
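The two-stage prediction can be pictured as a simple cascade. The abstract does not say how a sample is routed between the two models, so the sketch below assumes that the first model screens out clearly majority samples and that only the remaining samples are passed to the second model. The stub predictors and their thresholds are hypothetical stand-ins for two fitted random forests.

```python
from typing import Callable, Sequence

MAJ, MIN = 0, 1  # hypothetical label encoding
Predictor = Callable[[Sequence[float]], int]

def two_stage_predict(stage1: Predictor, stage2: Predictor, x: Sequence[float]) -> int:
    """ASSUMED routing: stage 1 decides 'clearly majority' vs 'needs a second
    look'; stage 2 makes the final call for samples in the second group."""
    if stage1(x) == MAJ:
        return MAJ
    return stage2(x)

# Stand-ins for the two trained models (illustrative thresholds only).
def stage1_stub(x: Sequence[float]) -> int:
    # Flags anything near the minority cluster as "not clearly majority".
    return MIN if x[0] + x[1] >= 8 else MAJ

def stage2_stub(x: Sequence[float]) -> int:
    # Separates hard majority samples from true minority samples.
    return MIN if x[0] + x[1] >= 9.5 else MAJ

for sample in [(0, 0), (4.5, 4.5), (5, 5)]:
    print(sample, two_stage_predict(stage1_stub, stage2_stub, sample))
```

With this routing, a majority sample that lies far from the minority class never reaches the second model. That is one way the design could avoid a serious drop in majority-class accuracy.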

Description

Technical field

[0001] The invention belongs to the technical field of data analysis, data mining, and machine learning, and in particular relates to an optimized random forest method for processing unbalanced data sets.

[0002] Technical background

[0003] With the advent of the era of big data, data mining has become an increasingly important technology, and classification is the most common task in data mining. Using classification algorithms to mine the latent information in data helps provide effective predictions for real problems. In real-world classification scenarios, many data sets are unevenly distributed, and different problems place different emphasis on different classes. General classification algorithms seek to improve the overall classification accuracy on the data set, resulting in the prediction classification accuracy rate of the minority class samples being much lower than the prediction class...

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G06K9/62

CPC: G06F18/214, G06F18/2415

Inventors: 卢宇彤, 邓雷

Owner: SUN YAT SEN UNIV
