Self-adaptive oversampling method based on HDBSCAN clustering

An adaptive oversampling technique, applied in the fields of instruments, character and pattern recognition, and computer components. It addresses problems of existing methods such as overfitting, the inability to sample according to the characteristics of the data distribution, and the inability to strengthen classifier learning, and it achieves strong generalization, robustness, and improved accuracy.

Pending Publication Date: 2019-11-12
重庆信科设计有限公司 +1

AI Technical Summary

Problems solved by technology

Studies have shown that classifiers perform much better overall on balanced data than on the original unbalanced data, so much recent research has aimed at the unbalanced learning problem. Current data-level methods, however, have many limitations. For example, random oversampling copies randomly chosen minority class samples to balance the class distribution. This can effectively improve classification performance, but it easily leads to overfitting.
This
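The overfitting risk of random oversampling can be seen in a small sketch. The function name `random_oversample` and the toy data are illustrative assumptions, not part of the patent; the point is only that balancing by duplication adds no new minority information.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of each smaller class until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [x for x, lab in zip(samples, labels) if lab == cls]
        for _ in range(target - n):
            out_x.append(rng.choice(pool))  # an exact copy, not a new point
            out_y.append(cls)
    return out_x, out_y

# Toy data (hypothetical): 4 majority points (label 0), 2 minority points (label 1).
X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.4, 0.2), (5.0, 5.1), (5.2, 4.9)]
y = [0, 0, 0, 0, 1, 1]
X_bal, y_bal = random_oversample(X, y)

# The classes are balanced, but the minority class still has only 2 distinct points.
distinct_minority = {x for x, lab in zip(X_bal, y_bal) if lab == 1}
print(Counter(y_bal), len(distinct_minority))
```

Because every added sample is a copy, a classifier can fit the repeated points exactly, which is one way to read the overfitting concern stated above.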



Embodiment Construction

[0043] The technical solutions in the embodiments of the present invention will be described clearly and in detail below with reference to the drawings in the embodiments of the present invention. The described embodiments are only some of the embodiments of the invention.

[0044] The technical solution by which the present invention solves the above technical problems is as follows:

[0045] The basic idea of the present invention for achieving the above goals is as follows. First, the unbalanced data set is divided into a training set and a test set; taking the features and labels of the data into account, 70% of the data set is selected as the training set. Second, HDBSCAN clustering is used to cluster the minority class samples in the training set, yielding disjoint clusters of different sizes. Next, the sparsity of each cluster and the corresponding number of samples to be synthesized are calculated. Then, new samples are adapt...
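The step of computing a sparsity per cluster and a corresponding sample count might look like the sketch below. The text available here does not give the patent's sparsity formula, so everything specific is an assumption: sparsity is taken as the mean pairwise Euclidean distance within a cluster, the counts are split in proportion to sparsity, and the names `cluster_sparsity` and `allocate_samples` are hypothetical. The cluster assignments are supplied by hand; in practice they would come from an HDBSCAN implementation, with noise points excluded.

```python
import math
from itertools import combinations

def cluster_sparsity(points):
    """ASSUMED sparsity measure: mean pairwise Euclidean distance inside one cluster."""
    pairs = list(combinations(points, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def allocate_samples(clusters, n_needed):
    """Split n_needed synthetic samples across clusters in proportion to their sparsity."""
    sparsity = {cid: cluster_sparsity(pts) for cid, pts in clusters.items()}
    total = sum(sparsity.values())
    if total == 0:
        # Degenerate case: fall back to equal weights.
        weights = {cid: 1.0 / len(clusters) for cid in clusters}
    else:
        weights = {cid: s / total for cid, s in sparsity.items()}
    raw = {cid: n_needed * w for cid, w in weights.items()}
    alloc = {cid: int(r) for cid, r in raw.items()}  # floor of each share
    leftover = n_needed - sum(alloc.values())
    # Hand out the remaining samples to the largest fractional parts.
    by_fraction = sorted(raw, key=lambda c: raw[c] - alloc[c], reverse=True)
    for cid in by_fraction[:leftover]:
        alloc[cid] += 1
    return alloc

# Hypothetical minority clusters: cluster 0 is dense, cluster 1 is sparse.
clusters = {
    0: [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    1: [(10.0, 10.0), (10.0, 13.0), (14.0, 10.0)],
}
alloc = allocate_samples(clusters, n_needed=10)
print(alloc)  # {0: 2, 1: 8}
```

Under this assumed weighting the sparser cluster receives more synthetic samples. Other weightings are equally compatible with the paragraph above.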


Abstract

The invention discloses a self-adaptive oversampling method based on HDBSCAN clustering, which mainly addresses the failure of existing methods to make full use of the data information when classifying unbalanced data. The method comprises the following steps: (1) inputting a training data set; (2) clustering the minority class samples in the training set to obtain mutually disjoint clusters of different sizes; (3) calculating the number of samples to be synthesized in each minority class cluster; (4) adaptively synthesizing new samples according to the number required for each cluster to obtain a new minority class data set; (5) combining the majority class data set and the new minority class data set into a new balanced data set; and (6) training and testing the classifier with the new balanced data set. The method effectively avoids generating noise in the unbalanced data set while addressing both inter-class and intra-class imbalance, and it provides a new oversampling strategy for unbalanced learning.
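Steps (4) and (5) can be illustrated with a minimal sketch. The abstract does not say how new samples are synthesized, so linear interpolation between two random members of the same cluster is used here as an assumed stand-in; the function names, the toy data, and the per-cluster counts in `alloc` are all hypothetical.

```python
import random

def synthesize_in_cluster(points, n_new, rng):
    """ASSUMED synthesis rule: interpolate between two random members of one cluster."""
    new_points = []
    for _ in range(n_new):
        if len(points) >= 2:
            a, b = rng.sample(points, 2)
        else:
            a = b = points[0]  # a single-point cluster can only be copied
        t = rng.random()
        new_points.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return new_points

def build_balanced_set(majority, minority_clusters, alloc, seed=0):
    """Step (4): synthesize per cluster. Step (5): merge with the majority class."""
    rng = random.Random(seed)
    minority = [p for pts in minority_clusters.values() for p in pts]
    for cid, pts in minority_clusters.items():
        minority.extend(synthesize_in_cluster(pts, alloc.get(cid, 0), rng))
    X = list(majority) + minority
    y = [0] * len(majority) + [1] * len(minority)
    return X, y

# Hypothetical inputs: 8 majority points and two minority clusters of 2 points each.
majority = [(float(i), -3.0) for i in range(8)]
minority_clusters = {0: [(0.0, 0.0), (1.0, 1.0)], 1: [(5.0, 5.0), (6.0, 7.0)]}
alloc = {0: 1, 1: 3}  # per-cluster counts from step (3), assumed given

X_new, y_new = build_balanced_set(majority, minority_clusters, alloc)
print(y_new.count(0), y_new.count(1))
```

Because each synthetic point lies on a segment between two members of the same cluster, no sample is generated in the space between clusters, which is consistent with the stated aim of not introducing noise.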

Description

Technical field

[0001] The invention belongs to the field of computer artificial intelligence, and in particular relates to an integrated classification method combining unbalanced data resampling and clustering.

Background technique

[0002] Most standard machine learning algorithms proposed in recent years assume that the data set distribution is balanced or that error costs are equal. In real life, however, we often encounter data distributions that are extremely unbalanced, or scenarios where misclassification costs vary significantly. For example, many current classification learning algorithms struggle to achieve accurate predictions in credit card fraud detection, cancer risk prediction, text classification, software defect prediction, and bioinformatics, because of the unbalanced distribution of training data sets, noise, data, etc. ...


Application Information

IPC(8): G06K9/62
CPC: G06F18/23, G06F18/241, G06F18/214
Inventor: 董宏成, 赵学华, 刘颖, 解如风, 范荣妹
Owner: 重庆信科设计有限公司