Unbalanced data classification method

A classification technique for unbalanced data, applicable to database models, relational databases, and electrical digital data processing. It addresses problems such as poor classification results, high randomness, and limited hyperplane adjustment.

Status: Inactive
Publication Date: 2014-12-24
NANJING UNIV
5 Cites, 27 Cited by

AI Technical Summary

Problems solved by technology

However, at the start of training, an active-learning support vector machine builds its initial training set by randomly sampling from all data points, which introduces considerable randomness. A commonly used strategy is to select the data point closest to the current classification hyper...



Examples


Embodiment

[0115] The present invention is described in further detail below with reference to the accompanying drawing and a specific embodiment:

[0116] First, some symbols are defined. Given the set L of all labeled data, the n data points closest to points of a different class (non-similar points) form the initial training set T.

[0117] As shown in Figure 1, this embodiment discloses a method for classifying unbalanced data, comprising the following steps:

[0118] Step 1: For a given class-labeled data set L, calculate the distance between each data point and every point of a different class (non-similar point). For each data point, record the minimum of these distances as the feature of that point;

[0119] Step 2: Sort all data points by feature in ascending order. The first t data points, those with the smallest features, form the initial training set T, and the remaining data points form the non-training set N. In theory, t is a natural number from 2 to m, where m is the total number of data points...
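Steps 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the function name `initial_split`, the use of Euclidean distance, and the tie-breaking by index are all assumptions, since the text above does not specify a distance metric or how equal features are ordered.

```python
import math


def initial_split(points, labels, t):
    """Return (T, N) as lists of indices into `points`.

    Hypothetical helper: Euclidean distance and index tie-breaking
    are assumptions, not taken from the patent text.
    """
    m = len(points)
    if not 2 <= t <= m:
        raise ValueError("t must be a natural number between 2 and m")

    features = []
    for i, (p, y) in enumerate(zip(points, labels)):
        # Step 1: the feature of a point is its minimum distance to any
        # point carrying a different class label. If the data set holds
        # only one class, min() receives no values and raises ValueError.
        d = min(math.dist(p, q) for q, z in zip(points, labels) if z != y)
        features.append((d, i))

    # Step 2: sort by feature in ascending order (ties fall back to index),
    # then take the first t points as T and the rest as N.
    features.sort()
    order = [i for _, i in features]
    return order[:t], order[t:]
```

Calling `initial_split(points, labels, t)` with `points` as a list of equal-length coordinate tuples returns two index lists. The points in T are those lying nearest the boundary between classes, which is where the first hyperplane is most constrained.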


Abstract

The invention relates to a method for classifying unbalanced data. The method comprises the following steps. In a labeled data set L, each data point is first processed: its distances to all data points of a different class (non-similar points) are calculated, and the shortest distance is kept as the feature of that point. All data points are sorted by feature in ascending order; the first t points with the smallest features form the initial training set T, and the remaining points form the initial non-training set N. A support vector machine is then trained iteratively on T using an active learning strategy. Once training begins, each iteration generates a temporary classification hyperplane P, which is used to trial-classify all data points in N. If any points are mispredicted, one of them is drawn at random and added to T, and the point in N closest to P is also added to T. If no points in N are mispredicted, only the point in N closest to P is added to T. Subsequent training then continues.
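The per-iteration selection rule described above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the hyperplane P is assumed to be linear and given as a weight vector `w` and bias `b`, labels are assumed to be +1 or -1, and the names `decision_value` and `select_for_training` are hypothetical. Training the support vector machine that produces `w` and `b` is outside the sketch.

```python
import random


def decision_value(w, b, x):
    """Signed score of x under the assumed linear hyperplane w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def select_for_training(w, b, points, labels, N, rng=random):
    """Return the indices from N to move into T in one iteration.

    Hypothetical helper: assumes a linear hyperplane and labels in {+1, -1}.
    """
    if not N:
        return []

    scores = {i: decision_value(w, b, points[i]) for i in N}

    # Point in N closest to P. Dividing by the norm of w would give the
    # geometric distance but would not change which point is closest.
    closest = min(N, key=lambda i: abs(scores[i]))

    # Trial classification of every point in N with the temporary hyperplane.
    mispredicted = [i for i in N
                    if (1 if scores[i] >= 0 else -1) != labels[i]]

    chosen = [closest]
    if mispredicted:
        # Draw one mispredicted point at random; skip it if it is the
        # same point already chosen as the closest one.
        pick = rng.choice(mispredicted)
        if pick != closest:
            chosen.append(pick)
    return chosen
```

Passing a seeded `random.Random` instance as `rng` makes the random draw reproducible. The returned indices would be removed from N and appended to T before the next round of training.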

Description

Technical field

[0001] The invention relates to a method for classifying unbalanced data. It belongs to the field of computer data analysis and mining, and in particular concerns a data classification algorithm.

Background

[0002] An unbalanced dataset is one in which the number of samples differs greatly between classes. In binary classification on such datasets, the class with fewer samples is usually called the positive class, and the class with more samples the negative class. Data imbalance is common in current applications such as medical diagnosis, intrusion detection, fraud prevention, and object recognition in satellite images. In these tasks, classification accuracy on the positive class is the main concern. For example, in disease diagnosis, misdiagnosing a healthy person will be corrected on re-examination, but misjudging a cancer patient as normal ...


Application Information

IPC(8): G06F17/30
CPC: G06F16/2365, G06F16/2386, G06F16/285
Inventor: 柏文阳, 姚玉姝, 周嵩
Owner: NANJING UNIV