Oversampling method for unbalanced data set

An oversampling technique for unbalanced data sets, applied in electrical digital data processing, special data processing applications, digital data information retrieval, and related fields. It addresses the poor validity of oversampled data and the resulting distortion of analysis results, improving the effectiveness and learnability of the samples.

Status: Inactive; Publication Date: 2019-09-24
NORTHEASTERN UNIV
4 Cites, 7 Cited by

AI Technical Summary

Problems solved by technology

Existing oversampling methods for unbalanced data sets mostly rely on the ADASYN and B-SMOTE oversampling algorithms. The samples these methods produce contain a lot of noise, which reduces the validity of the oversampled data and distorts the subsequent analysis results.

Embodiment Construction

[0018] The present invention will be further described below in conjunction with the accompanying drawings and specific embodiments.

[0019] Figure 1 shows a flow chart of the oversampling method for an unbalanced data set according to the present invention. The method comprises the following steps:

[0020] Step 1: Collect the unbalanced data set $U_0$ and cluster it with the K-means method to obtain $K$ cluster data sets $\{U_{01}, U_{02}, \ldots, U_{0q}, \ldots, U_{0K}\}$, $q \in \{1, 2, \ldots, K\}$. Let $s(U_{0q})$ denote the number of elements in the data set $U_{0q}$. If $s(U_{0q}) < \varepsilon$, the data set $U_{0q}$ is put into the minority class data set $U_m$; if $s(U_{0q}) \geq \varepsilon$, the data set $U_{0q}$ is put into the majority class data set $U_l$.
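The clustering and size-based split described in Step 1 can be sketched as follows. This is a minimal illustration in plain Python, not the patent's implementation: the function names `kmeans` and `split_by_cluster_size`, the toy points, and the values chosen for K and ε are all assumptions made for the example.

```python
import random
from collections import defaultdict


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's K-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers drawn from the data
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Update step: move each non-empty cluster's center to its mean.
        members = defaultdict(list)
        for p, label in zip(points, new_labels):
            members[label].append(p)
        for c, group in members.items():
            centers[c] = tuple(sum(col) / len(group) for col in zip(*group))
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels


def split_by_cluster_size(points, labels, epsilon):
    """Clusters with fewer than epsilon points go to the minority set (U_m),
    all other clusters go to the majority set (U_l)."""
    clusters = defaultdict(list)
    for p, label in zip(points, labels):
        clusters[label].append(p)
    minority, majority = [], []
    for group in clusters.values():
        (minority if len(group) < epsilon else majority).extend(group)
    return minority, majority


# Toy data (hypothetical): eight points near the origin, two far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0, 2), (2, 0), (1, 2), (2, 1),
          (100, 100), (100, 101)]
labels = kmeans(points, k=2)
minority, majority = split_by_cluster_size(points, labels, epsilon=3)
```

Note that the split is driven purely by cluster size, so the choice of K and ε determines which samples are treated as minority data.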

[0021] In this example, the unbalanced data set $U_0$ is a telecom user data set, which includes SERV_LEV (service level), CONSUME_GRADE (consumption level), CREDIT_DEG (credit de...

Abstract

The invention relates to the technical field of data mining and provides an oversampling method for an unbalanced data set. The method comprises the following steps: first, an unbalanced data set is collected and clustered with the K-means method, and the resulting clusters are divided into a minority class and a majority class according to the number of elements in each cluster; next, the minority class data set is oversampled with the SMOTE method to obtain a synthesized minority class data set; then, the synthesized minority class data set is oversampled with replacement to obtain a new minority class data set, and a new data set is formed; finally, the new data set is cleaned with a CCA method: the new data set is clustered, the Euclidean distances between each sample in a cluster and the other samples in that cluster are calculated and sorted, and the sample corresponding to the farthest Euclidean distance is deleted to obtain the cleaned data set. The method can effectively synthesize more minority class samples and improves both the learnability and the effectiveness of the samples.
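The three later stages of the abstract (SMOTE-style synthesis, oversampling with replacement, and distance-based cleaning) can be illustrated with a rough sketch. It is written in plain Python under several assumptions that the text does not confirm: the helper names `smote`, `resample_with_replacement`, and `drop_farthest` are hypothetical, the neighbour count `k` is arbitrary, the cleaning step takes an already-formed cluster as input, and "farthest" is interpreted here as the sample with the largest total distance to the rest of its cluster.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = minority[:i] + minority[i + 1:]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        partner = rng.choice(neighbours)
        gap = rng.random()  # position on the segment from base to partner
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, partner))
        )
    return synthetic


def resample_with_replacement(samples, n, seed=0):
    """Draw n samples uniformly at random, allowing repeats."""
    return random.Random(seed).choices(samples, k=n)


def drop_farthest(cluster):
    """Remove the one sample whose summed Euclidean distance to the other
    samples in the cluster is largest (assumed reading of the cleaning step)."""
    if len(cluster) < 3:
        return list(cluster)
    totals = [sum(math.dist(p, q) for q in cluster) for p in cluster]
    worst = totals.index(max(totals))
    return cluster[:worst] + cluster[worst + 1:]


# Toy minority set (hypothetical values).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
synthetic = smote(minority, n_new=6, k=2)
new_minority = resample_with_replacement(minority + synthetic, n=20)
cleaned = drop_farthest(minority + [(50.0, 50.0)])
```

Because each synthetic point lies on a segment between two existing minority samples, the synthesis stays inside the region the minority class already occupies; the cleaning pass then trims the sample that sits farthest from its cluster.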

Description

Technical field

[0001] The invention relates to the technical field of data mining, in particular to an oversampling method for an unbalanced data set.

Background technique

[0002] Oversampling is a very effective way to deal with the problem of class imbalance by duplicating or synthesizing samples to balance the distribution between majority and minority class samples. The existing oversampling methods for unbalanced datasets mostly use the ADASYN oversampling algorithm and the B-SMOTE oversampling algorithm. The samples processed by these methods have a lot of noise, which makes the data obtained after oversampling less effective, thus affecting the analysis results.

Summary of the invention

[0003] Aiming at the problems existing in the prior art, the present invention provides an oversampling method of an unbalanced data set, which can effectively synthesize more minority samples, increase the learnability of the samples, and improve the effectiveness of the samp...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/2458, G06F16/215, G06K9/62
CPC: G06F16/2462, G06F16/215, G06F18/23213, G06F18/214
Inventors: 侯雁博, 朱志良
Owner: NORTHEASTERN UNIV