Data resampling method based on clustering oversampling and instance hardness threshold

A resampling technique that combines clustering and oversampling, classified under instruments, character and pattern recognition, and computer components, and aimed at problems such as prediction deviation on minority-class samples.

Status: Inactive
Publication Date: 2020-12-22
NORTHWESTERN POLYTECHNICAL UNIV

AI Technical Summary

Problems solved by technology

[0003] To address the prediction deviation that classification algorithms show on minority-class samples in unbalanced text data, the present invention proposes a data resampling method based on clustering oversampling and an instance hardness threshold.



Embodiment Construction

[0021] The present invention is further described below in conjunction with examples; the present invention includes, but is not limited to, the following examples.

[0022] The present invention provides a data resampling method based on clustering oversampling and instance hardness threshold, and its basic implementation process is as follows:

[0023] 1. Clustering processing

[0024] K-means is a commonly used clustering algorithm that partitions data by iterative refinement. The present invention first clusters the text data set with the K-means algorithm. First, k texts are randomly selected from the text data set as the initial cluster centers, where k is the number of clusters to be obtained; different values of k affect the results of the subsequent cluster filtering and sampling weight distribution. In the present invention, k is set to 2, 5, 10 or 15. The following process is then repeated: assign each text to the cluster center with the clos...
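The clustering step described in [0024] can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: it assumes each text has already been converted to a numeric vector (the excerpt does not say how), it uses squared Euclidean distance, and the center-update step is the standard K-means rule, which the truncated paragraph does not spell out. The names kmeans and squared_distance are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means on numeric vectors (assumed: texts are pre-vectorized)."""
    rng = random.Random(seed)
    # Randomly pick k data points as the initial cluster centers.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its closest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Update step (standard K-means, assumed here): move each center
        # to the mean of the points currently assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Toy stand-in for vectorized texts: two well-separated groups.
points = [[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]]
labels, centers = kmeans(points, k=2)
```

In practice k would be one of the values named above (2, 5, 10 or 15), and the resulting cluster labels would feed the cluster filtering and sampling weight distribution steps.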


Abstract

The invention provides a data resampling method based on clustering oversampling and an instance hardness threshold. The method comprises the following steps: first, the data set is clustered with the K-means method, and the clusters are filtered and assigned sampling weights; then, the SMOTE algorithm is used to oversample the data set and generate new data, so that the number of minority-class samples equals the number of majority-class samples and the data set becomes class-balanced; finally, the data are cleaned with an instance hardness threshold algorithm to obtain a final balanced data set with fewer noisy points. The method can turn a class-imbalanced data set into a balanced one and improves the prediction performance of the classifier on minority-class samples.
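The oversampling and cleaning steps in the abstract can be illustrated with a small sketch. It rests on several assumptions: samples are numeric vectors, new minority points are created by SMOTE-style interpolation between a minority sample and one of its nearest minority neighbours, and the per-cluster sampling weights are ignored. The hardness score is a simple stand-in (the share of a sample's nearest neighbours that carry a different label); instance hardness threshold implementations usually estimate hardness from a classifier's predicted probabilities instead. All function names and the toy data are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Needs at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Up to k nearest minority neighbours of the base point.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sq_dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position on the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def knn_hardness(X, y, k=3):
    """Stand-in hardness: share of the k nearest neighbours with another label."""
    scores = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        nearest = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sq_dist(xi, X[j]),
        )[:k]
        scores.append(sum(y[j] != yi for j in nearest) / len(nearest))
    return scores


def drop_hard_instances(X, y, threshold=0.5, k=3):
    """Keep only the samples whose hardness does not exceed the threshold."""
    scores = knn_hardness(X, y, k)
    kept = [(x, lab) for x, lab, s in zip(X, y, scores) if s <= threshold]
    return [x for x, _ in kept], [lab for _, lab in kept]


# Toy data: 6 majority samples (label 0) and 2 minority samples (label 1).
majority = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2], [0.2, 0.4], [0.4, 0.1]]
minority = [[2.0, 2.0], [2.2, 2.1]]

# Oversample until both classes have the same number of samples.
new_points = smote(minority, len(majority) - len(minority), k=1)
X = majority + minority + new_points
y = [0] * len(majority) + [1] * (len(minority) + len(new_points))

# Clean the balanced set by dropping samples that look like noise.
X_clean, y_clean = drop_hard_instances(X, y)
```

On this toy data the two classes are well separated, so the cleaning step removes nothing; a minority-labelled point placed inside the majority group would receive a high hardness score and be dropped.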

Description

Technical field

[0001] The invention belongs to the technical field of unbalanced data processing, and in particular relates to a data resampling method based on clustering oversampling and an instance hardness threshold.

Background art

[0002] In machine learning, the goal of a classification algorithm is to improve classification accuracy, so when a data set has a class imbalance problem, the algorithm tends to predict samples as the majority class. This is because even if the algorithm cannot correctly classify the minority-class samples, it can still achieve a high accuracy rate. Although minority-class samples are few, they are more important, and the cost of misclassifying a minority-class sample is much higher than the cost of misclassifying a majority-class sample. Therefore, addressing the class imbalance problem is very important for model predictions on imbalanc...
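The accuracy argument in [0002] can be made concrete with a small worked example. The class counts below (95 majority samples and 5 minority samples) are hypothetical and chosen only for illustration; the classifier is a degenerate one that always predicts the majority class.

```python
# Hypothetical data set: 95 majority-class (0) and 5 minority-class (1) samples.
y_true = [0] * 95 + [1] * 5
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * len(y_true)

# Overall accuracy: share of samples whose prediction matches the label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Minority recall: share of minority samples that were correctly identified.
minority_hits = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = minority_hits / y_true.count(1)

print(f"accuracy={accuracy:.2f}, minority recall={minority_recall:.2f}")
# prints: accuracy=0.95, minority recall=0.00
```

The classifier scores 95% accuracy while finding none of the minority samples, which is the failure mode that resampling is meant to counter.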


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06K9/62, G06K9/40
CPC: G06V10/30, G06F18/23213, G06F18/2415, G06F18/214
Inventor: 殷茗, 马怀宇, 朱奎宇, 张小港, 高存志
Owner: NORTHWESTERN POLYTECHNICAL UNIV