A data balancing method based on pseudo-negative samples and a method to improve data classification performance

A pseudo-negative sample technique, applied in the field of data balancing, that improves data classification performance. It addresses low data classification accuracy, a tendency to overfit, and the difficulty of learning from unbalanced data, with the effects of improving classifier performance, avoiding heavy computation, and improving accuracy.

Active Publication Date: 2021-09-21
CHENGDU UNIV OF INFORMATION TECH

AI Technical Summary

Problems solved by technology

[0006] To solve the above problems, the present invention provides a data balancing method based on pseudo-negative samples and a method for improving data classification performance. Pseudo-negative samples, that is, samples in the negative set that are in effect positive, are identified among the negative samples and added to the positive samples to balance the proportion of positive and negative samples, which makes learning from unbalanced data possible. This addresses the problems of existing methods, which lose some important classification information, take a long time, easily lead to over-fitting, and yield low data classification accuracy.



Examples


Embodiment 1

[0107] In this embodiment, the data balancing method of the present invention is used to select pseudo-negative samples on the CMC and Haberman data sets at different pseudo-negative sample rates (that is, the ratio of the number of pseudo-negative samples to the number of positive samples). Four classifiers are then used for data classification and classification performance evaluation.
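As a minimal sketch of what a given rate means in practice, the snippet below converts a pseudo-negative sample rate into a sample count and reports the class sizes after the move. The function names and the rounding rule are assumptions made for illustration; the source defines only the rate itself.

```python
def pseudo_negative_count(n_positive: int, rate: float) -> int:
    """Number of pseudo-negative samples implied by a rate.

    The rate is the ratio of pseudo-negative samples to positive samples,
    as defined in the text. Rounding to the nearest integer is an assumption.
    """
    if n_positive < 0 or rate < 0:
        raise ValueError("n_positive and rate must be non-negative")
    return round(n_positive * rate)


def rebalanced_sizes(n_positive: int, n_negative: int, rate: float) -> tuple:
    """Class sizes after moving the pseudo-negatives into the positive set."""
    # Never move more samples than the negative set actually contains.
    k = min(pseudo_negative_count(n_positive, rate), n_negative)
    return n_positive + k, n_negative - k


if __name__ == "__main__":
    # Hypothetical class sizes, chosen only to show the effect of the rate.
    for rate in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5):
        print(rate, rebalanced_sizes(100, 900, rate))
```

With these hypothetical sizes, a rate of 50% moves 50 samples, so the positive-to-negative ratio shifts from 100:900 to 150:850.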

[0108] The pseudo-negative sample rate is varied from 0% to 50%, where 0% means that no pseudo-negative samples are selected. The selection results on CMC are shown in Table 2. Performance generally improves as the percentage of pseudo-negative samples grows. At pseudo-negative sample rates of 0%, 10%, 20%, 30%, 40% and 50%, the Sen values of random forest are 28.19%, 39.22%, 43.94%, 50.87%, 56.45% and 62%, the Acc values are 78.2%, 78.75%, 78.41%, 78.48%, 79.57% and 79.63%, and the MCC values are 0.27, 0.369, 0.404, 0.448, 0.505 and 0.532. The perform...
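For readers unfamiliar with the reported indicators, the following sketch computes Sen (sensitivity), Acc (accuracy) and MCC (Matthews correlation coefficient) from confusion-matrix counts using their standard definitions. The function name and the example counts are hypothetical and are not taken from Table 2.

```python
import math


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Sen, Acc and MCC from confusion-matrix counts (standard definitions)."""
    total = tp + fp + tn + fn
    # Sensitivity: fraction of true positives that the classifier recovers.
    sen = tp / (tp + fn) if (tp + fn) else 0.0
    # Accuracy: fraction of all samples classified correctly.
    acc = (tp + tn) / total if total else 0.0
    # MCC: correlation between predicted and true labels, in [-1, 1].
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sen": sen, "Acc": acc, "MCC": mcc}


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    print(classification_metrics(tp=50, fp=10, tn=90, fn=50))
```

On unbalanced data Acc can stay nearly flat while Sen and MCC move substantially, which is consistent with the pattern in the figures quoted above.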

Embodiment 2

[0116] This example verifies the effectiveness of the method of the present invention on real biological data. The dataset includes PDNA-316, PDNA-543, SNP.

[0117] Figure 2 shows the classification performance on the PDNA-543 dataset under different pseudo-negative sample rates, where RF-Sen and NN-Sen represent the Sen (sensitivity) values of the RF (random forest) and NN (neural network) classifiers, and RF-MCC and NN-MCC represent the MCC values of the RF and NN classifiers, respectively. The Sen and MCC of the neural network increase as the percentage of pseudo-negative samples increases from 0% to 50%, while the Sen and MCC of the random forest remain the same as the percentage varies from 0% to 30%. Once the percentage of pseudo-negative samples exceeds 30%, RF performance improves as the percentage increases.

[0118] Figure 3 shows the classification performance on the PDNA-316 dataset at different...

Embodiment 3

[0121] The MMPCC algorithm is compared with the MAXR and MINR algorithms on the PDNA-316 data, where MMPCC is the abbreviation of the algorithm of the present invention.

[0122] In Embodiment 3, five-fold cross-validation is again used to evaluate the prediction performance of the proposed algorithm on the four indicators. The classification performance of the MMPCC algorithm, the MAXR (max-relevance) algorithm and the MINR (min-redundancy) algorithm is compared on the PDNA-316 data set; the comparison results are shown in Figures 5-8.
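The five-fold protocol mentioned here can be sketched as an index split. Stratifying the folds by class, the fixed seed, and the function name are assumptions for illustration; the source states only that five-fold cross-validation is used.

```python
import random
from typing import List, Sequence, Tuple


def stratified_five_fold(labels: Sequence[int], n_folds: int = 5,
                         seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return (train_indices, test_indices) pairs, one pair per fold.

    Each class is shuffled and dealt round-robin across the folds so that
    every fold keeps roughly the original class proportions (an assumption).
    """
    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)
    splits = []
    for f in range(n_folds):
        test = sorted(folds[f])
        train = sorted(i for g in range(n_folds) if g != f for i in folds[g])
        splits.append((train, test))
    return splits


if __name__ == "__main__":
    # Hypothetical unbalanced labels: 10 positives, 40 negatives.
    labels = [1] * 10 + [0] * 40
    for train, test in stratified_five_fold(labels):
        print(len(train), len(test), sum(labels[i] for i in test))
```

Each indicator would then be averaged over the five held-out folds.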

[0123] Figures 5-8 show that MMPCC is superior to the MAXR and MINR methods with both the RF and NN classifiers. Figure 5 shows that the pseudo-negative samples have a greater impact on the Sen value. When NN is used as the classifier, the Sen value of MMPCC is significantly better than that of MAXR and MINR. For RF classifiers, MAXR is the best when m...



Abstract

The invention discloses a data balancing method based on pseudo-negative samples and a method for improving data classification performance, comprising the following steps. Step 1: separate the positive and negative samples to obtain a positive sample set and a negative sample set. Step 2: calculate the set of Pearson correlation coefficients of the negative samples. Step 3: initialize the pseudo-negative sample set and the selected sample set. Step 4: use the max-relevance min-redundancy method to calculate weights and obtain the weight set. Step 5: select the maximum weight and update the pseudo-negative sample set and the selected sample set. Step 6: repeat Step 4 and Step 5 until the pseudo-negative sample set has been fully selected. Step 7: merge the selected pseudo-negative sample set into the positive sample set and remove it from the negative sample set. The invention proposes and defines the concept of pseudo-negative samples for the first time, and the proposed algorithm can improve the accuracy of data classification and thereby the performance of the classifier; the effect is especially pronounced when processing unbalanced biological information data.
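The seven steps can be illustrated with the rough sketch below. It is not the patented MMPCC implementation: the abstract does not give the weight formula, so the sketch assumes that relevance is the mean Pearson correlation between a negative sample and the positive samples, that redundancy is the mean absolute Pearson correlation with the pseudo-negatives already selected, and that the weight is relevance minus redundancy. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def pearson(x: Vector, y: Vector) -> float:
    """Pearson correlation coefficient; returns 0.0 for a constant vector."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def select_pseudo_negatives(positives: Sequence[Vector],
                            negatives: Sequence[Vector], k: int) -> List[int]:
    """Greedily pick k negative-sample indices by an assumed mRMR-style weight."""
    # Step 2 (assumed form): relevance = mean correlation with the positives.
    relevance = [sum(pearson(neg, pos) for pos in positives) / len(positives)
                 for neg in negatives]
    # Step 3: start with an empty pseudo-negative set.
    selected: List[int] = []
    remaining = list(range(len(negatives)))

    def weight(i: int) -> float:
        # Step 4 (assumed form): relevance minus redundancy with chosen samples.
        if not selected:
            return relevance[i]
        redundancy = sum(abs(pearson(negatives[i], negatives[j]))
                         for j in selected) / len(selected)
        return relevance[i] - redundancy

    # Steps 5-6: take the maximum-weight sample until k have been chosen.
    while remaining and len(selected) < k:
        best = max(remaining, key=weight)
        selected.append(best)
        remaining.remove(best)
    return selected


def rebalance(positives: Sequence[Vector], negatives: Sequence[Vector],
              k: int) -> Tuple[List[Vector], List[Vector]]:
    """Step 7: move the selected pseudo-negatives into the positive set."""
    picked = select_pseudo_negatives(positives, negatives, k)
    new_pos = list(positives) + [negatives[i] for i in picked]
    new_neg = [v for i, v in enumerate(negatives) if i not in set(picked)]
    return new_pos, new_neg


if __name__ == "__main__":
    # Toy feature vectors, invented for illustration.
    pos = [[1, 2, 3], [2, 4, 6.5]]
    neg = [[1, 2, 3.1], [3, 2, 1], [1, 1, 5]]
    print(select_pseudo_negatives(pos, neg, 2))
```

Step 1, the split into positive and negative sets, is taken as given by the two input lists. The greedy loop recomputes correlations on every pass for clarity; a practical version would cache the pairwise correlation matrix.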

Description

Technical field

[0001] The invention relates to the technical field of information processing, in particular to a data balancing method based on pseudo-negative samples and a method for improving data classification performance.

Background technique

[0002] With the rapid growth of data volume, for example in biological information, machine learning technology is widely used in the field of biological information, because machine learning can find important information in large-scale biological data and help people understand complex biological processes. However, the ubiquitous category imbalance problem greatly reduces the performance of machine learning. In theory, data mining cannot be carried out effectively with only a limited number of positive samples. Therefore, learning from many kinds of biological data requires solving the category imbalance problem, for example gene expression data, protein-DNA binding data, predicted small molecule RNA data, etc.

[0003] The methods that have been proposed to reduce the i...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06K9/62
CPC: G06F18/214, G06F18/24323
Inventors: 乔少杰, 张永清, 韩楠, 周激流, 卢荣钊, 刘定祥, 温敏, 魏军林, 袁犁
Owner: CHENGDU UNIV OF INFORMATION TECH