Unbalanced data classification undersampling method, device and equipment and medium

An undersampling technique for data classification, applied in fields such as instrumentation, computing, and character and pattern recognition, that addresses class imbalance and the resulting low accuracy of classification learning algorithms.

Status: Inactive
Publication Date: 2018-10-12
GUANGZHOU UNIVERSITY

AI Technical Summary

Problems solved by technology

[0009] In view of the above problems, the purpose of the present invention is to provide an undersampling method for unbalanced data classification. The method addresses the low accuracy of classification learning algorithms that arises when the majority class has too many samples and the minority class has too few in unbalanced big data classification, and thereby improves the classification accuracy for unbalanced big data.



Examples


Embodiment 1

[0057] Refer to Figure 3, which is a schematic flow chart of the undersampling method for unbalanced data classification provided by the first embodiment of the present invention.

[0058] It should be noted that, when deleting majority samples, existing methods take one of two approaches. Some treat all majority-class samples identically and delete randomly selected ones, and thus delete majority samples that should not be deleted. Others delete only those majority samples that have minority samples among their neighbors. In large data sets, however, majority samples far outnumber minority samples, so the number of majority samples that can be deleted in this way is relatively limited. Neither approach solves the low accuracy of classification learning algorithms caused by an excess of majority-class samples in unbalanced big data classification.
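
For illustration only, the first existing approach described in [0058] (random deletion of majority samples) can be sketched as follows. The function name `random_undersample` and its parameters `majority_label`, `n_keep` and `seed` are hypothetical and do not come from the patent text; this is a baseline sketch, not the method of the invention.

```python
import numpy as np

def random_undersample(X, y, majority_label, n_keep, seed=0):
    """Keep n_keep randomly chosen majority rows plus every minority row.

    Hypothetical baseline: names and signature are assumptions for illustration.
    """
    rng = np.random.default_rng(seed)
    maj_idx = np.flatnonzero(y == majority_label)
    min_idx = np.flatnonzero(y != majority_label)
    # Every majority sample has the same chance of removal, whether it sits
    # near the class boundary or deep inside the majority region.
    kept_maj = rng.choice(maj_idx, size=n_keep, replace=False)
    keep = np.sort(np.concatenate([kept_maj, min_idx]))
    return X[keep], y[keep]

# Toy data: 8 majority samples (label 0) and 2 minority samples (label 1).
X_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.array([0] * 8 + [1] * 2)
X_res, y_res = random_undersample(X_demo, y_demo, majority_label=0, n_keep=3)
```

Because the selection ignores where each majority sample lies relative to the minority class, informative samples can be discarded, which is the drawback that [0058] points out.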

[0059] The unbalanced data classification and und...

Embodiment 2

[0077] On the basis of embodiment one,

[0078] Determining the category of each majority sample according to the number of minority samples includes:

[0079] comparing the number of minority samples with a preset threshold to determine the category of the corresponding majority sample, wherein the categories are noise samples, boundary samples and stable samples.

[0080] In the embodiment of the present invention, the preset threshold is set according to actual conditions.

[0081] Preferably, the preset threshold includes a preset first threshold n;

[0082] Comparing the number of minority samples with a preset threshold to determine the category of the corresponding majority sample includes:

[0083] When the number of minority samples is greater than or equal to the preset first threshold n, the corresponding majority sample is classified as a noise sample; wherein, the preset first threshold n has a range of va...

Embodiment 3

[0098] On the basis of Embodiment 2,

[0099] When the number of minority samples is less than the preset second threshold p, the corresponding majority sample is classified as a stable sample; wherein, the value range of the preset second threshold p is k/3 ≤ p ≤ n;

[0100] Then performing the operation corresponding to the category of each majority sample includes:

[0101] When the majority sample is classified as a stable sample, selectively deleting the majority sample.

[0102] In this embodiment of the present invention, if the number of minority samples among the k nearest neighbors of a majority sample is less than the second threshold p, the majority sample is not located at the boundary between the majority sample set and the minority sample set but lies inside the majority sample set, so the majority sample is a stable sample.
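
The threshold comparison in [0083] and [0099] can be sketched as below. The function names are hypothetical. The "boundary" branch is an assumption: the visible text states only the noise and stable conditions, and the remaining case (p ≤ count < n) is inferred from the three categories listed in [0079]. The value range of n is truncated in the text, so only the stated range of p is checked.

```python
def classify_majority_sample(minority_count, n, p):
    """Label one majority sample from the minority count among its k neighbors."""
    if minority_count >= n:
        return "noise"      # [0083]: count >= first threshold n
    if minority_count < p:
        return "stable"     # [0099]: count < second threshold p
    return "boundary"       # assumption: the remaining case, p <= count < n


def second_threshold_in_range(k, n, p):
    """Check the range stated in [0099]: k/3 <= p <= n."""
    return k / 3 <= p <= n


# Illustrative values only (not taken from the patent): k = 9, n = 6, p = 3.
labels = [classify_majority_sample(c, n=6, p=3) for c in (7, 4, 1)]
```

With these illustrative values, a majority sample with 7 minority neighbors is labelled noise, one with 4 is labelled boundary, and one with 1 is labelled stable.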

[0103] In this embodiment of the present invention, the selective deletion of the m...


Abstract

The invention discloses an undersampling method for unbalanced data classification, comprising the steps of: obtaining all majority samples in the unbalanced data to be processed; obtaining, according to the k-nearest-neighbor algorithm, the number of minority samples among the k nearest neighbor samples of each majority sample; determining the category of each majority sample according to that number of minority samples; and performing, for each majority sample, the operation corresponding to its category. The method addresses the low accuracy of classification learning algorithms caused by too many majority samples and too few minority samples in unbalanced big data classification, and improves the accuracy of unbalanced big data classification.
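
As a minimal sketch of the second step in the abstract, the following counts the minority samples among the k nearest neighbors of each majority sample. It assumes Euclidean distance and a brute-force search, neither of which is specified in the visible text; the function name `minority_neighbor_counts` and the toy data are hypothetical.

```python
import numpy as np

def minority_neighbor_counts(X, y, majority_label, k):
    """For each majority sample, count minority samples among its k nearest neighbors.

    Assumes Euclidean distance and brute-force search (not stated in the source).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    maj_idx = np.flatnonzero(y == majority_label)
    # Squared Euclidean distance from every majority sample to every sample.
    diffs = X[maj_idx, None, :] - X[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diffs, diffs)
    # A sample is not its own neighbor.
    dist[np.arange(len(maj_idx)), maj_idx] = np.inf
    # Indices of the k nearest neighbors of each majority sample.
    nearest = np.argsort(dist, axis=1)[:, :k]
    counts = (y[nearest] != majority_label).sum(axis=1)
    return maj_idx, counts

# Toy data: three majority points near the origin, one majority point
# sitting next to two minority points.
X_toy = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [10.5, 10]]
y_toy = [0, 0, 0, 0, 1, 1]
maj_idx, counts = minority_neighbor_counts(X_toy, y_toy, majority_label=0, k=2)
```

The brute-force distance matrix uses memory proportional to the number of majority samples times the total number of samples, so for large data sets an index-based neighbor search would be a more practical choice.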

Description

Technical field

[0001] The invention relates to the field of unbalanced big data processing, in particular to a method, device, equipment and medium for undersampling in unbalanced data classification.

Background

[0002] With the continuous advancement of technology, including faster Internet speeds, the upgrading of the mobile Internet, the continuous development of hardware, and the rapid development of data acquisition, storage and processing technology, data is growing at an unprecedented rate, and we have entered the big data era. The characteristics of big data, such as huge data volume (volume), high speed (velocity), variety (variety) and data uncertainty (veracity), mean that traditional data analysis and mining technologies encounter unprecedented challenges when applied to the field of big data.

[0003] Data classification is a basic algorithm in data analysis and mining, which has a wide range of applications and ...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06K9/62
CPC: G06F18/24147, G06F18/214
Inventors: 韩伟红, 李树栋, 王乐, 方滨兴, 贾焰, 黄子中, 周斌, 殷丽华, 田志宏
Owner: GUANGZHOU UNIVERSITY