Unbalanced data classification method based on clustering downsampling

A data classification technique that uses a clustering algorithm, applied in the field of pattern recognition. It addresses the shift of the classification surface and the loss of majority-class sample information, with the effect of enhancing overall performance, shortening training time, and improving the contribution rate.

Inactive Publication Date: 2018-02-13
WUYI UNIV

AI Technical Summary

Problems solved by technology

However, downsampling may discard representative majority-class sample information when it deletes majority-class samples, which shifts the classification surface.

Embodiment Construction

[0031] The present invention is described in further detail below with reference to the embodiments and the accompanying drawings, but the embodiments of the invention are not limited thereto. To avoid obscuring the invention, commonly used techniques such as support vector machine theory are not described in detail.

[0032] The invention provides an unbalanced data classification method based on clustering downsampling. The specific implementation steps are as follows:

[0033] (1) The unbalanced data set is divided into two parts, a training set and a cross-validation set, which can be expressed as D = Tr ∪ Te, where D is the unbalanced data set, Tr is the training set, and Te is the cross-validation set. The ratio of the training set to the cross-validation set can be chosen as needed. Generally, 10-fold cross-validation can be used, that is, the data set is divided into 10 parts, and 9 of ...
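The split in step (1) can be sketched as follows. This is a minimal illustration, not the patent's procedure: the source only says that 10-fold cross-validation can be used, so stratifying the folds by class (so that every fold keeps some minority samples) is an added assumption, and the function name `stratified_k_fold` and the toy label vector are hypothetical.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Assumption (not from the source): folds are stratified, i.e. each
    fold keeps roughly the class ratio of the full data set D.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the indices of each class round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    for i in range(k):
        test_idx = sorted(folds[i])  # Te: one held-out part
        train_idx = sorted(          # Tr: the remaining k - 1 parts
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, test_idx


# Hypothetical unbalanced labels: 90 majority (0) and 10 minority (1) samples.
labels = [0] * 90 + [1] * 10
splits = list(stratified_k_fold(labels, k=10))
train_idx, test_idx = splits[0]
print(len(train_idx), len(test_idx))  # prints "90 10"
```

With plain (unstratified) random folds on data this skewed, a fold can end up with no minority samples at all, which is why the sketch deals each class out separately.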

Abstract

The invention discloses an unbalanced data classification method based on clustering downsampling. The method comprises the following steps: clustering the majority-class samples of a training set with the fast search and find of density peaks clustering algorithm to obtain a clustering result; dividing the majority-class samples in the training set into N clusters; forming a new sample set from each cluster of majority-class samples together with the minority-class samples in the training set, and classifying it with a support vector machine to obtain the support vectors of the majority-class samples; forming a new training set from the support vectors extracted from each cluster and the minority-class samples in the training set; and training a support vector machine on the new training set and evaluating its performance on a cross-validation set. The training time of the classifier can be shortened, the recognition rate of minority-class samples can be improved without harming the recognition rate of majority-class samples, and the performance of the classifier is improved.
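The clustering step of the abstract can be sketched as follows, assuming the algorithm meant is the one commonly known as clustering by fast search and find of density peaks. This is an illustration under stated assumptions, not the patent's implementation: the cutoff distance `d_c`, the choice of cluster centres by the largest density-times-distance product, the function name `density_peaks`, and the toy points are all hypothetical. The support vector machine step is only indicated in a comment.

```python
import math


def density_peaks(points, n_clusters, d_c):
    """Return one cluster id per point using a density-peaks scheme."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Local density rho: number of other points closer than the cutoff d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]

    # Visit points from densest to sparsest (ties broken by index).
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n   # distance to the nearest point visited earlier
    parent = [-1] * n   # index of that nearest denser point
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(dist[i])  # convention for the densest point
            continue
        j = min(order[:rank], key=lambda h: dist[i][h])
        delta[i], parent[i] = dist[i][j], j

    # Centres: the n_clusters points with the largest rho * delta.
    # The sort is stable, so the densest point always comes first.
    centres = sorted(order, key=lambda i: rho[i] * delta[i],
                     reverse=True)[:n_clusters]
    label = [-1] * n
    for cluster_id, i in enumerate(centres):
        label[i] = cluster_id
    # Every other point inherits the label of its nearest denser point.
    for i in order:
        if label[i] == -1:
            label[i] = label[parent[i]]
    return label


# Hypothetical majority-class samples forming two groups, plus two minority samples.
majority = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
            (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
minority = [(5, 5), (5, 6)]

labels = density_peaks(majority, n_clusters=2, d_c=1.0)
print(labels)  # prints [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# One sub-problem per cluster: that cluster's majority samples plus all
# minority samples. Each would then be passed to an SVM (not shown here)
# and only the majority-class support vectors kept for the new training set.
subproblems = [
    [p for p, c in zip(majority, labels) if c == k] + minority
    for k in range(2)
]
```

Keeping only the support vectors from each cluster is what makes this a downsampling step: majority samples far from the decision boundary are dropped, while those that define the boundary in each cluster are retained.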

Description

Technical Field

[0001] The invention relates to the field of pattern recognition, and in particular to a classification method for unbalanced data based on clustering downsampling.

Background

[0002] Classification is a very important research topic in pattern recognition and machine learning. It has a wide range of real-world applications, such as handwritten digit recognition in banking systems, face recognition in security monitoring systems, and intrusion detection in network security. Several relatively mature classification methods are available, such as decision trees, K-nearest neighbor, neural networks, and support vector machines, and they have received extensive attention. These traditional classification methods all assume a balanced class distribution, and their main purpose is to improve the overall classification performance, and show go...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06K9/62
CPC: G06F18/23, G06F18/2411, G06F18/214
Inventor: 曹路
Owner: WUYI UNIV