Data resampling method based on repeated editing nearest neighbor and clustering oversampling

A nearest-neighbor and resampling technique, applied in computing models and machine learning, that addresses problems such as within-cluster imbalance, small separation problems, and amplified data noise. It compensates for insufficient data volume, improves classification performance, and solves the class imbalance problem.

Inactive Publication Date: 2020-03-31
NORTHWESTERN POLYTECHNICAL UNIV

AI Technical Summary

Problems solved by technology

However, the SMOTE method amplifies the noise in the data and does not account for within-cluster imbalance or small separation problems.
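
To see why noise is amplified, recall that SMOTE creates each synthetic sample by interpolating between a minority sample and one of its minority-class neighbors. The sketch below is only an illustration of that interpolation step; the function name `smote_point` and the two example points are hypothetical and do not come from the patent text.

```python
import random

def smote_point(x, neighbor, rng):
    """Return a synthetic point on the segment between x and neighbor."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(0)
clean = [1.0, 1.0]   # hypothetical genuine minority sample
noisy = [5.0, 5.0]   # hypothetical mislabeled sample treated as minority
synthetic = smote_point(clean, noisy, rng)
print(synthetic)     # lies somewhere between the clean and the noisy point
```

If the neighbor is a noisy point, every synthetic sample drawn toward it carries that noise further into the feature space, which is the motivation for cleaning the data before oversampling.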



Examples


Embodiment Construction

[0020] The present invention is now further described with reference to an embodiment:

[0021] The experimental data set is the JM1 data set from the NASA (National Aeronautics and Space Administration) MDP (Metrics Data Program). It is a public dataset provided by NASA and contains software metrics such as the McCabe and Halstead metrics. Each software module in the dataset is labeled as defective or non-defective. The present invention uses datasets downloaded from related dataset websites and the open-source PROMISE data repository.

[0022] Table 1 JM1 dataset

[0023]

[0024] The data set is divided into four parts: training attribute set, test attribute set, training label set, and test label set. The number of initial samples in the training attribute set is 5213, and the number of samples in the test attribute set is 2569. The method proposed by the present invention is applied to the training attribute set and the training label set. The number ...


Abstract

The invention relates to a data resampling method based on repeated edited nearest neighbor and clustering oversampling. The method comprises the following steps: calculating the Euclidean distance between each sample to be processed and the other samples, and taking the sample at the smallest distance as its nearest neighbor; comparing the labels of the sample and its nearest neighbor, and deleting the sample if the labels differ; dividing the remaining samples into k clusters using K-means, and filtering out the clusters in which the ratio of the number of majority class samples to the number of minority class samples is less than an imbalance rate threshold c; calculating the Euclidean distances between the minority class samples in each cluster, constructing a distance matrix for the cluster, summing all off-diagonal elements of the matrix, and dividing the sum by the number of off-diagonal elements to obtain the average distance of the cluster; calculating a sparse factor for each cluster; and calculating a resampling weight for each cluster and, using the SMOTE method, determining the number of new samples to generate according to the weights. The method addresses the class imbalance problem in the data, so that the classifier can achieve better classification performance.
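
The cleaning and weighting steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: all function names are hypothetical, each cleaning round removes mismatched samples in one batch, and the sparse factor (which the text above does not define) is assumed to follow the K-means SMOTE convention of average distance raised to the number of features, divided by the minority count. K-means clustering, cluster filtering, and the final SMOTE generation are omitted.

```python
import math

def nearest_label(i, X, y):
    """Label of the sample closest to X[i] (Euclidean), excluding X[i] itself."""
    best = min((j for j in range(len(X)) if j != i),
               key=lambda j: math.dist(X[i], X[j]))
    return y[best]

def repeated_enn(X, y, max_rounds=100):
    """Repeatedly delete samples whose nearest neighbor has a different label."""
    X, y = list(X), list(y)
    for _ in range(max_rounds):
        if len(X) < 2:
            break
        keep = [i for i in range(len(X)) if nearest_label(i, X, y) == y[i]]
        if len(keep) == len(X):   # nothing was deleted: the data is stable
            break
        X = [X[i] for i in keep]
        y = [y[i] for i in keep]
    return X, y

def average_distance(points):
    """Mean of the off-diagonal entries of the pairwise distance matrix."""
    n = len(points)
    total = sum(math.dist(points[a], points[b])
                for a in range(n) for b in range(n) if a != b)
    return total / (n * (n - 1))

def cluster_weights(minority_by_cluster):
    """Assumed sparse factor: avg_dist ** n_features / n_minority, normalized."""
    sparsity = []
    for pts in minority_by_cluster:
        n_features = len(pts[0])
        sparsity.append(average_distance(pts) ** n_features / len(pts))
    total = sum(sparsity)
    return [s / total for s in sparsity]

# Hypothetical toy data: one label-1 point sits next to the label-0 group.
X = [[0.0, 0.0], [0.0, 0.2], [0.2, 0.0], [0.2, 0.2],
     [1.0, 1.0],
     [5.0, 5.0], [5.0, 5.2], [5.2, 5.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
X_clean, y_clean = repeated_enn(X, y)

# Two hypothetical clusters of minority samples; the sparser one gets more weight.
weights = cluster_weights([[[0.0, 0.0], [0.0, 1.0]],
                           [[0.0, 0.0], [0.0, 2.0]]])
```

A cluster with fewer than two minority samples has no off-diagonal distances, so `average_distance` would divide by zero; such clusters need separate handling. Given a total number of samples to generate, each cluster's share would be that total multiplied by its weight.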

Description

Technical field

[0001] The invention belongs to the technical field of data sampling, in particular to a data resampling method based on repeated editing nearest neighbor and clustering oversampling.

Background

[0002] In machine learning, an unbalanced training set may cause the trained model to be more inclined to identify samples as the majority class. Although the number of samples in the minority class is small, the importance is higher, and the cost of misclassifying the minority class is much higher than the cost of misclassifying the majority class. Therefore, addressing the class imbalance problem is very important for software defect prediction. Current methods for dealing with imbalanced datasets include algorithm-level methods, cost-sensitive methods, and data-level methods. Compared with algorithm-level methods that are bound to specific classifiers and cost-sensitive methods for specific problems, data-level methods are more general. Data preproce...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06N20/00
CPC: G06N20/00
Inventors: 殷茗, 马怀宇, 蒋丹, 姜继娇, 马子琛, 芦菲娅, 孟丹荔, 张煊宇, 杨益, 仵芳, 吴瑜
Owner: NORTHWESTERN POLYTECHNICAL UNIV