Unbalanced data classification method based on active learning

A data classification method based on active learning, applied in the field of machine learning. It addresses the time and labor cost of labeling by reducing the number of samples that must be labeled.

Inactive
Publication Date: 2020-07-03
NANJING UNIV OF SCI & TECH
0 Cites, 10 Cited by

AI Technical Summary

Problems solved by technology

Traditional supervised learning requires labeling all samples, which incurs large time and labor costs.

Embodiment Construction

[0013] The invention can be applied to credit card fraud detection, information security detection, and similar fields.

[0014] The unbalanced data classification method based on active learning of the present invention comprises the following steps:

[0015] Randomly sample the original unlabeled data and label the sampled data as the initial training data; the original unlabeled data includes credit card transaction data;

[0016] Train a general-purpose machine learning model on the initial training data using cost-sensitive learning;

[0017] Use the trained binary supervised classification model to predict all unlabeled samples in the original data, and select the N most uncertain samples according to their uncertainty; for each of the N samples, calculate the sum of the Euclidean distances to the center point of the training data set, and select M of the N samples in descending order of distance, where M i...
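The selection in step [0017] can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the uncertainty score `1 - 2|p - 0.5|`, the function names, and the reading of "distance to the center point" as the Euclidean distance from each candidate to the centroid of the already-labeled samples are all assumptions, since the source text is truncated and does not define them.

```python
import math


def uncertainty(p_positive):
    """Assumed binary uncertainty score: 1.0 at p = 0.5, 0.0 at p = 0 or 1."""
    return 1.0 - 2.0 * abs(p_positive - 0.5)


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length points."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def select_for_labeling(probs, unlabeled, labeled, n, m):
    """Return indices into `unlabeled` of the M samples to label next.

    probs[i] is the model's predicted positive-class probability for
    unlabeled[i]; `labeled` holds the feature vectors already labeled.
    """
    # Stage 1: keep the N most uncertain unlabeled samples.
    by_uncertainty = sorted(range(len(unlabeled)),
                            key=lambda i: uncertainty(probs[i]),
                            reverse=True)
    candidates = by_uncertainty[:n]

    # Stage 2: of those, keep the M farthest from the labeled centroid
    # (descending order of distance).
    center = centroid(labeled)
    by_distance = sorted(candidates,
                         key=lambda i: math.dist(unlabeled[i], center),
                         reverse=True)
    return by_distance[:m]
```

Ranking the uncertain candidates by distance favors samples unlike those already labeled, which is one plausible motivation for the second stage.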

Abstract

The invention discloses an unbalanced data classification method based on active learning. The method comprises: randomly sampling the original unlabeled data and labeling the sampled data as the initial training data; training a general-purpose machine learning model on the initial training data with cost-sensitive learning; using the trained binary supervised classification model to predict every sample in the original data that is still unlabeled, and selecting the N most uncertain samples according to their uncertainty; calculating, for the N samples, the sum of the Euclidean distances to the center point of the training data set, and selecting M of the N samples in descending order of distance; labeling the selected M samples and adding them to the training data set; retraining the general-purpose machine learning model on the training data set with cost-sensitive learning; and repeating this process iteratively, stopping training once the average uncertainty of the selected M samples falls below a set uncertainty threshold. While preserving the performance of the unbalanced data classifier, the method effectively reduces the number of samples that must be labeled, saving labeling time and labor cost.
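The iterative procedure in the abstract can be sketched as a loop with pluggable parts. This is a hedged outline, not the patented implementation: `oracle` (the human labeler), `train`, and `predict_proba` are hypothetical callables, the uncertainty score `1 - 2|p - 0.5|` is an assumption, and checking the stopping threshold before labeling the final batch is one possible reading of the text.

```python
import math
import random


def uncertainty(p_positive):
    """Assumed binary uncertainty score: 1.0 at p = 0.5, 0.0 at p = 0 or 1."""
    return 1.0 - 2.0 * abs(p_positive - 0.5)


def centroid(points):
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def active_learning_loop(pool, oracle, train, predict_proba,
                         n_init, n, m, threshold, seed=0):
    """Label samples from `pool` until the selected batch is confident.

    oracle(x) -> label; train(labeled) -> model, where a cost-sensitive
    learner would be plugged in; predict_proba(model, x) -> P(positive).
    """
    rng = random.Random(seed)
    unlabeled = list(pool)
    rng.shuffle(unlabeled)

    # Random initial sample, labeled by the oracle.
    labeled = [(x, oracle(x)) for x in unlabeled[:n_init]]
    unlabeled = unlabeled[n_init:]
    model = train(labeled)

    while unlabeled:
        scores = [uncertainty(predict_proba(model, x)) for x in unlabeled]
        # N most uncertain, then the M farthest from the labeled centroid.
        candidates = sorted(range(len(unlabeled)),
                            key=scores.__getitem__, reverse=True)[:n]
        center = centroid([x for x, _ in labeled])
        chosen = sorted(candidates,
                        key=lambda i: math.dist(unlabeled[i], center),
                        reverse=True)[:m]

        # Stop once the selected batch is, on average, confident enough.
        if sum(scores[i] for i in chosen) / len(chosen) < threshold:
            break

        labeled.extend((unlabeled[i], oracle(unlabeled[i])) for i in chosen)
        picked = set(chosen)
        unlabeled = [x for i, x in enumerate(unlabeled) if i not in picked]
        model = train(labeled)  # retrain on the enlarged training set

    return model, labeled
```

Only the samples passed to `oracle` incur labeling cost, so the length of the returned `labeled` list is the quantity the method aims to keep small.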

Description

Technical field

[0001] The invention relates to the field of machine learning, in particular to an unbalanced data classification method based on active learning.

Background technique

[0002] Unbalanced data is widespread in practical applications, for example fraudulent versus non-fraudulent transactions in credit card data. Because the class distribution of such data is skewed, traditional machine learning models cannot be applied directly to classify it. The main approaches to unbalanced data classification are resampling and cost-sensitive learning. Resampling is further divided into methods such as undersampling, oversampling, and SMOTE.

[0003] Traditional machine learning training methods are mostly supervised learning methods: given all samples of data in the relevant field together with their labels, a suitable machine learning model is trained on them, and ...
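To make the cost-sensitive approach mentioned in the background concrete, the sketch below weights each class inversely to its frequency and applies those weights in a log loss. The source does not specify a cost scheme; the weighting formula `n_samples / (n_classes * class_count)` is a common heuristic used here as an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


def weighted_log_loss(labels, probs, weights, eps=1e-12):
    """Binary log loss in which each sample's error is scaled by its class weight.

    probs[i] is the predicted probability that labels[i] == 1.
    """
    total, weight_sum = 0.0, 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)    # avoid log(0)
        p_true = p if y == 1 else 1.0 - p  # probability given to the true class
        total += weights[y] * -math.log(p_true)
        weight_sum += weights[y]
    return total / weight_sum
```

With 98 negative and 2 positive samples, the positive class receives weight 25.0 and the negative class roughly 0.51, so misclassifying a rare fraudulent transaction costs far more than misclassifying a common legitimate one.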

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06K9/62, G06N20/00
CPC: G06N20/00, G06F18/23213, G06F18/214
Inventors: 张静, 董怀龙
Owner: NANJING UNIV OF SCI & TECH