Oversampling method for unbalanced data set

An oversampling technique for unbalanced data sets, applied in fields such as electrical digital data processing, special data processing applications, and digital data information retrieval. It addresses the poor validity of oversampled data and its effect on analysis results, improving the effectiveness and learnability of the samples.

Inactive Publication Date: 2019-09-24

NORTHEASTERN UNIV

4 Cites, 7 Cited by

AI Technical Summary
This helps you quickly interpret patents by identifying the three key elements:
Problems solved by technology
Method used
Benefits of technology

Problems solved by technology

Existing oversampling methods for unbalanced data sets mostly use the ADASYN and B-SMOTE oversampling algorithms. The samples these methods produce contain considerable noise, which makes the oversampled data less effective and in turn affects the analysis results.

Embodiment Construction

[0018] The present invention will be further described below in conjunction with the accompanying drawings and specific embodiments.

[0019] Figure 1 shows a flow chart of the oversampling method for unbalanced data sets of the present invention. The method comprises the following steps:

[0020] Step 1: Collect the unbalanced data set $U_0$ and cluster it with the K-means method to obtain $K$ classes $\{U_{01}, U_{02}, \ldots, U_{0q}, \ldots, U_{0K}\}$, $q \in \{1, 2, \ldots, K\}$. Denote the number of elements in $U_{0q}$ by $s(U_{0q})$. If $s(U_{0q}) < \varepsilon$, put $U_{0q}$ into the minority class data set $U_m$; if $s(U_{0q}) \geq \varepsilon$, put $U_{0q}$ into the majority class data set $U_l$.
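Step 1 can be illustrated with a minimal sketch. This is not the patent's implementation: the function names `kmeans` and `split_by_cluster_size`, the parameters `iters`, `seed`, and `eps` (standing in for ε), and the random choice of initial centers are all assumptions made for illustration.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd-style K-means; returns one cluster label per point.

    Assumption: initial centers are k randomly sampled points.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def split_by_cluster_size(points, labels, k, eps):
    """Clusters with fewer than eps elements go to U_m, all others to U_l."""
    minority, majority = [], []
    for q in range(k):
        cluster = [p for p, lab in zip(points, labels) if lab == q]
        (minority if len(cluster) < eps else majority).extend(cluster)
    return minority, majority
```

A call such as `split_by_cluster_size(points, kmeans(points, k), k, eps)` returns the pair `(U_m, U_l)` as two lists of points. In practice a library K-means would likely replace the hand-written one.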

[0021] In this example, the unbalanced data set $U_0$ is a telecom user data set, which includes SERV_LEV (service level), CONSUME_GRADE (consumption level), CREDIT_DEG (credit de...

Abstract

The invention relates to the technical field of data mining and provides an oversampling method for an unbalanced data set. The method comprises the following steps: first, collecting an unbalanced data set, clustering it with the K-means method, and dividing it into a minority class and a majority class according to the number of elements in each cluster; then, oversampling the minority class data set with the SMOTE method to obtain a synthesized minority class data set; then, oversampling the synthesized minority class data set with replacement to obtain a new minority class data set, and forming a new data set; and finally, cleaning the new data set with the CCA method: clustering the new data set, calculating and sorting the Euclidean distances between each sample in each cluster and the other samples in that cluster, and deleting the sample corresponding to the farthest Euclidean distance to obtain the cleaned data set. The method can effectively synthesize more minority class samples and improves both the learnability and the effectiveness of the samples.
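The three later stages named in the abstract (SMOTE-style synthesis, oversampling with replacement, and distance-based cleaning) can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented procedure: the function names, the parameters `n_new`, `k`, and `seed`, and in particular the reading of "farthest Euclidean distance" as "largest summed distance to the other samples in the cluster" are all hypothetical choices.

```python
import math
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Synthesize n_new points by interpolating each chosen sample toward
    one of its k nearest minority neighbours (needs >= 2 minority samples)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

def resample_with_replacement(samples, n, seed=0):
    """Draw n samples uniformly at random, allowing repeats."""
    rng = random.Random(seed)
    return [rng.choice(samples) for _ in range(n)]

def drop_farthest(cluster):
    """Cleaning step (assumed reading): remove the sample whose summed
    Euclidean distance to the rest of its cluster is largest."""
    if len(cluster) < 2:
        return list(cluster)
    totals = [sum(math.dist(p, q) for q in cluster) for p in cluster]
    worst = totals.index(max(totals))
    return cluster[:worst] + cluster[worst + 1:]
```

In a full pipeline, `drop_farthest` would be applied to each cluster obtained by re-clustering the combined data set. Whether one or several samples are removed per cluster is not stated in the abstract, so the sketch removes exactly one.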

Description

Technical field

[0001] The invention relates to the technical field of data mining, in particular to an oversampling method for an unbalanced data set.

Background art

[0002] Oversampling is a very effective way to deal with class imbalance: it duplicates or synthesizes samples to balance the distribution between majority and minority class samples. Existing oversampling methods for unbalanced data sets mostly use the ADASYN and B-SMOTE oversampling algorithms. The samples these methods produce contain considerable noise, which makes the oversampled data less effective and in turn affects the analysis results.

Summary of the invention

[0003] To address the problems in the prior art, the present invention provides an oversampling method for an unbalanced data set, which can effectively synthesize more minority samples, increase the learnability of the samples, and improve the effectiveness of the samp...

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G06F16/2458, G06F16/215, G06K9/62

CPC: G06F16/2462, G06F16/215, G06F18/23213, G06F18/214

Inventor: 侯雁博, 朱志良

Owner: NORTHEASTERN UNIV
