Unbalanced data sampling method in improved C4.5 decision tree algorithm

A data sampling and decision tree technique, applied in the field of data processing, that addresses problems such as the difficulty of accurately estimating the cost of misclassification.

Status: Inactive
Publication Date: 2016-03-02
CHONGQING UNIV OF POSTS & TELECOMM

AI Technical Summary

Problems solved by technology

Commonly used strategies include the cost-sensitive method, which introduces cost-sensitive factors into traditional classification algorithms to design cost-sensitive classification algorithms, such as co


Examples

Embodiment

[0054] The research object is a two-month user replacement data set from a telecom operator, in which the number of replacement users each month is far smaller than the number of non-replacement users. Effectively predicting replacement users and taking corresponding marketing measures can be very profitable for the company. The learning set consists of 200,000 data records from April, distributed according to the natural ratio (non-replacement : replacement = 27:1), and the test set consists of 400,000 data records from May, distributed 1:1. Through a combination of feature selection and expert experience, 19 attributes are selected as the input features of the prediction model. In addition, the attributes are treated as independent of one another during learning, but in practice the user's contribution income, call time, and traffic are closely related, so 9 attributes are manually added to measure the changes of...
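To make the class sizes implied by the stated ratios concrete, the following is a small arithmetic sketch. The helper name `split_by_ratio` is hypothetical, and because the 27:1 ratio in the text is presumably rounded, the resulting counts are approximations rather than figures reported in the source.

```python
def split_by_ratio(total, majority_parts, minority_parts):
    """Split `total` records into (majority, minority) counts for a given ratio.

    Hypothetical helper: the minority count is rounded to the nearest integer
    and the majority class receives the remainder.
    """
    minority = round(total * minority_parts / (majority_parts + minority_parts))
    return total - minority, minority


# April learning set: 200,000 records at non-replacement : replacement = 27:1
print(split_by_ratio(200_000, 27, 1))   # (192857, 7143)

# May test set: 400,000 records at 1:1
print(split_by_ratio(400_000, 1, 1))    # (200000, 200000)
```

Under these assumptions the learning set holds only about 7,143 replacement users, roughly 3.6% of the records.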

Abstract

The invention relates to an unbalanced data sampling method under an improved C4.5 decision tree algorithm. First, the initial weight of each sample is determined from the number of samples in each class. In each round, the sample weights are modified according to the training result of the improved C4.5 decision tree algorithm, whose splitting criterion takes into account both the information gain ratio and the weights of misclassified samples. After T iterations the final sample weights are obtained, and the samples in the minority-class boundary region and the majority-class center region are identified from these weights. The minority-class boundary samples are over-sampled with the SMOTE algorithm, and the majority-class samples are under-sampled by a weight sampling method so that samples in the center region are relatively easily selected. This improves the balance between classes and the recognition rates of both the minority class and the overall data set. Because the weights are modified through the improved C4.5 decision tree algorithm and over-sampling and under-sampling are targeted according to the sample weights, classifier over-fitting and the loss of useful majority-class information are effectively avoided.
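The sampling steps in the abstract can be illustrated with a minimal sketch. It is not the patented procedure: the weight-update rounds of the improved C4.5 algorithm are not reproduced, and several details are assumptions made for illustration. The function names are hypothetical, the initial weights are assumed to give every class the same total weight, and a low final weight is assumed to mark a majority-class center sample, so the selection probability is taken as proportional to 1/weight. The interpolation step follows the standard SMOTE rule.

```python
import math
import random


def initial_weights(labels):
    # Assumption: each class gets the same total weight, so a sample's
    # weight is 1 / (number_of_classes * size_of_its_class).
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return [1.0 / (len(counts) * counts[y]) for y in labels]


def smote(seeds, minority, n_new, k=3, rng=random):
    # Standard SMOTE interpolation: new = x + u * (neighbour - x), u in [0, 1).
    # `seeds` are the minority samples to over-sample (e.g. boundary samples);
    # neighbours are drawn from the whole minority class, which must contain
    # at least two points.
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(seeds)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: math.dist(x, p))[:k]
        nn = rng.choice(neighbours)
        u = rng.random()
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, nn)))
    return synthetic


def weighted_undersample(majority, weights, n_keep, rng=random):
    # Assumption: low final weight = center region, so each sample is drawn
    # (without replacement) with probability proportional to 1 / weight.
    pool = list(zip(majority, weights))
    kept = []
    for _ in range(min(n_keep, len(pool))):
        scores = [1.0 / w for _, w in pool]
        i = rng.choices(range(len(pool)), weights=scores, k=1)[0]
        kept.append(pool.pop(i)[0])
    return kept


if __name__ == "__main__":
    rng = random.Random(0)
    minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    majority = [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0), (9.0, 9.0)]
    final_w = [0.1, 0.1, 0.1, 0.7]  # made-up final weights for the majority class
    print(initial_weights([0, 0, 0, 0, 1, 1, 1]))
    print(smote(minority[:1], minority, n_new=3, k=2, rng=rng))
    print(weighted_undersample(majority, final_w, n_keep=3, rng=rng))
```

Sampling without replacement keeps the retained majority samples distinct; the 1/weight rule is only one possible way to favour center-region samples and would need to be replaced by the weighting the patent actually specifies.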

Description

Technical field

[0001] The invention belongs to the technical field of data processing and relates to an unbalanced data sampling method under an improved C4.5 decision tree algorithm.

Background

[0002] An unbalanced data set is one in which the number of samples of a certain class is far smaller than the number of samples of the other classes. The class with more samples is called the majority class, and the class with fewer samples is called the minority class. Classification problems on unbalanced data sets arise widely in everyday life and industrial production, for example in customer churn prediction, DNA microarray data analysis, software defect prediction, spam filtering, text classification, and medical diagnosis. In these applications, the classification accuracy of the minority class is often the more important one, so improving it has become a research focus for imbalanced data sets.

[000...
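The point that minority-class accuracy matters more than overall accuracy can be illustrated with a short sketch. The function name and the toy labels are hypothetical; the 27:1 class ratio is borrowed from the embodiment only to make the numbers concrete.

```python
def accuracy_and_minority_recall(y_true, y_pred, minority_label=1):
    """Return (overall accuracy, recall on the minority class)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    minority_total = sum(t == minority_label for t in y_true)
    minority_hit = sum(t == p == minority_label for t, p in zip(y_true, y_pred))
    recall = minority_hit / minority_total if minority_total else 0.0
    return correct / len(y_true), recall


# 27 majority samples (label 0) and 1 minority sample (label 1).
y_true = [0] * 27 + [1]
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 28

acc, recall = accuracy_and_minority_recall(y_true, y_pred)
print(f"accuracy={acc:.3f}, minority recall={recall:.3f}")
# accuracy=0.964, minority recall=0.000
```

A classifier that never identifies a minority sample still scores about 96% overall accuracy here, which is why minority-class recognition rate is tracked separately.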


Application Information

IPC(8): G06F17/30
CPC: G06F16/35, G06F16/285
Inventor: 邓维斌, 刘进, 熊冰妍, 何菲菲
Owner: CHONGQING UNIV OF POSTS & TELECOMM