Data resampling method based on clustering oversampling and instance hardness threshold

A clustering and oversampling technique, applicable to fields such as instruments, character and pattern recognition, and computer components, that addresses problems such as prediction deviation.

Status: Inactive; Publication Date: 2020-12-22

NORTHWESTERN POLYTECHNICAL UNIV

Cites: 6; Cited by: 6

Problems solved by technology

[0003] To address the prediction deviation of classification algorithms on minority-class samples in unbalanced text data, the present invention proposes a data resampling method based on clustering oversampling and an instance hardness threshold.

Embodiment Construction

[0021] The present invention is further described below in conjunction with examples; the present invention includes but is not limited to the following examples.

[0022] The present invention provides a data resampling method based on clustering oversampling and instance hardness threshold, and its basic implementation process is as follows:

[0023] 1. Clustering processing

[0024] K-means is a commonly used clustering algorithm that partitions data by iterative refinement. The present invention first uses the K-means algorithm to cluster the text data set. First, k texts are selected at random from the text data set as the initial cluster centers, where k is the number of clusters to be obtained; different values of k affect the results of the subsequent cluster filtering and sampling weight distribution. In the present invention, k is set to 2, 5, 10, or 15. The following process is then repeated: assign each text to the cluster center with the clos...
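The clustering step described above can be sketched in plain Python. This is a minimal illustration, not the patent's implementation: it assumes the texts have already been converted to numeric vectors (the vectorization method is not stated here), it assumes Euclidean distance for the "closest center" test, and the names `kmeans`, `squared_distance`, and `points` are hypothetical.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance (assumed metric) between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster points into k groups; returns (labels, centers)."""
    rng = random.Random(seed)
    # Randomly pick k samples as the initial cluster centers.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [-1] * len(points)
    for _ in range(max_iter):
        # Assignment step: each sample goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments are stable, so stop
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Toy 2-D vectors standing in for vectorized texts (hypothetical data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, k=2)
print(labels)
```

On this toy input the two well-separated groups end up in different clusters; which group receives label 0 depends on the random initialization.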

Abstract

The invention provides a data resampling method based on clustering oversampling and an instance hardness threshold. The method comprises the following steps: first, the data set is clustered with the K-means method, and the clusters are filtered and assigned sampling weights; then, the SMOTE algorithm is used to oversample the data set and generate new data, so that the number of minority-class samples equals the number of majority-class samples and the data set becomes class-balanced; finally, the data are cleaned with an instance hardness threshold algorithm to obtain a final balanced data set with fewer noisy points. The method can turn a class-imbalanced data set into a balanced data set and improves the prediction performance of a classifier on minority-class samples.
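The oversampling and cleaning steps named in the abstract can be illustrated with a simplified sketch. Several things here are assumptions rather than the patent's method: the cluster filtering and sampling-weight step is omitted because its details are not given above, instance hardness is estimated with a leave-one-out k-nearest-neighbour vote (one possible estimator), and the function names, the 0.5 threshold, and the toy data are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def smote(minority, n_new, k=5, seed=0):
    """SMOTE-style interpolation: each synthetic point lies on the segment
    between a minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic

def instance_hardness(X, y, k=3):
    """Hardness = 1 - fraction of the k nearest other samples sharing the label
    (a k-NN estimate, chosen here only for illustration)."""
    hardness = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        idx = sorted((j for j in range(len(X)) if j != i),
                     key=lambda j: sq_dist(xi, X[j]))[:k]
        agree = sum(1 for j in idx if y[j] == yi)
        hardness.append(1 - agree / len(idx))
    return hardness

def hardness_filter(X, y, threshold=0.5, k=3):
    """Drop samples whose hardness exceeds the threshold (likely noise)."""
    h = instance_hardness(X, y, k)
    kept = [(x, lab) for x, lab, hi in zip(X, y, h) if hi <= threshold]
    return [x for x, _ in kept], [lab for _, lab in kept]

def resample(majority, minority, seed=0):
    """Oversample the minority class to parity, then clean by hardness."""
    new_points = smote(minority, len(majority) - len(minority), seed=seed)
    X = majority + minority + new_points
    y = [0] * len(majority) + [1] * (len(minority) + len(new_points))
    return hardness_filter(X, y)

# Hypothetical toy data: 6 majority samples, 3 minority samples.
majority = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.2), (0.2, 0.8)]
minority = [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
X_bal, y_bal = resample(majority, minority)
print(y_bal.count(0), y_bal.count(1))  # prints "6 6"
```

Because the two toy classes are well separated, no sample is removed by the hardness filter. A minority-labelled point placed inside the majority group would receive a hardness of 1.0 and be dropped.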

Description

Technical field

[0001] The invention belongs to the technical field of unbalanced data processing, and in particular relates to a data resampling method based on clustering oversampling and an instance hardness threshold.

Background

[0002] In machine learning, the goal of a classification algorithm is to improve classification accuracy, so when the data set has a class imbalance problem, the algorithm tends to predict samples as the majority class. This is because even if the algorithm cannot correctly classify the minority-class samples, it can still achieve a high accuracy rate. Although minority-class samples are few, they are more important, and the cost of misclassifying a minority-class sample is much higher than the cost of misclassifying a majority-class sample. Therefore, addressing the class imbalance problem is very important for model predictions on imbalanc...
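A short numeric example may make the accuracy argument concrete. The 95/5 class split and the helper names `accuracy` and `recall` are hypothetical, chosen only to illustrate how a classifier that always predicts the majority class can still score a high accuracy.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive):
    """Fraction of samples of class `positive` that were predicted correctly."""
    preds_for_positive = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in preds_for_positive) / len(preds_for_positive)

# Hypothetical imbalanced data set: 95 majority (0) and 5 minority (1) samples.
y_true = [0] * 95 + [1] * 5
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 100

print(accuracy(y_true, y_pred))            # 0.95
print(recall(y_true, y_pred, positive=1))  # 0.0
```

The accuracy is 95% even though every minority-class sample is misclassified, which is why accuracy alone is a misleading target on imbalanced data.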

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G06K9/62, G06K9/40

CPC: G06V10/30, G06F18/23213, G06F18/2415, G06F18/214

Inventors: 殷茗, 马怀宇, 朱奎宇, 张小港, 高存志

Owner: NORTHWESTERN POLYTECHNICAL UNIV
