Data balancing method based on pseudo-negative samples and method for improving data classification performance

A data balance and negative sample technology, applied in the field of information processing, can solve problems such as imbalanced data learning, loss of classification information, and low data classification accuracy.

Active Publication Date: 2019-01-25

CHENGDU UNIV OF INFORMATION TECH

11 cites, 4 cited by


Problems solved by technology

[0006] To solve the above problems, the present invention provides a data balancing method based on pseudo-negative samples and a method for improving data classification performance. Positive samples hidden among the negative samples (pseudo-negative samples) are identified and added to the positive samples to balance the proportion of positive and negative samples, so that unbalanced data can be learned. This addresses the problems of existing methods, which lose some important classification information, take a long time, easily lead to over-fitting, and yield low data classification accuracy.


Examples


Embodiment 1

[0107] In this embodiment, the data balancing method of the present invention is used to select pseudo-negative samples on the CMC and Haberman data sets at different pseudo-negative sample rates (that is, the ratio of the number of pseudo-negative samples to the number of positive samples), and four classifiers are used to perform data classification and classification performance evaluation.

[0108] The pseudo-negative sample rate is set from 0% to 50%, where 0% means that no pseudo-negative samples are selected. The selection results on CMC are shown in Table 2. It can be seen that the larger the percentage of pseudo-negative samples, the better the performance. When the proportion of pseudo-negative samples is 0%, 10%, 20%, 30%, 40% and 50%, the Sen values of the random forest are 28.19%, 39.22%, 43.94%, 50.87%, 56.45% and 62%, the Acc values are 78.2%, 78.75%, 78.41%, 78.48%, 79.57% and 79.63%, and the MCC values are 0.27, 0.369, 0.404, 0.448, 0.505 and 0.532. The perform...
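As a minimal sketch of how the Sen, Acc and MCC indicators reported above can be computed, assuming the patent uses the standard confusion-matrix definitions (the excerpt does not spell them out). The function name and the example counts are hypothetical and are not taken from Table 2.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Return (Sen, Acc, MCC) from confusion-matrix counts.

    Assumes the standard definitions:
      Sen = TP / (TP + FN)
      Acc = (TP + TN) / (TP + FN + TN + FP)
      MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    """
    sen = tp / (tp + fn) if (tp + fn) else 0.0
    acc = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is reported as 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sen, acc, mcc


# Hypothetical counts for an imbalanced test set (100 positives, 400 negatives).
sen, acc, mcc = classification_metrics(tp=30, fn=70, tn=380, fp=20)
print(f"Sen={sen:.2%}  Acc={acc:.2%}  MCC={mcc:.3f}")
```

On imbalanced data, Acc can stay high even when few positives are recovered, which is consistent with the reported results: Sen and MCC change far more than Acc as the pseudo-negative sample rate grows.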

Embodiment 2

[0116] This embodiment verifies the effectiveness of the method of the present invention on real biological data. The data sets are PDNA-316, PDNA-543 and SNP.

[0117] Figure 2 shows the classification performance on the PDNA-543 data set at different pseudo-negative sample rates, where RF-Sen and NN-Sen represent the Sen (sensitivity) values of the RF (random forest) and NN (neural network) classifiers, and RF-MCC and NN-MCC represent the MCC values of the RF and NN classifiers, respectively. It can be seen that the Sen and MCC of the neural network increase as the percentage of pseudo-negative samples increases from 0% to 50%, while the Sen and MCC of the random forest remain the same when the percentage of pseudo-negative samples varies from 0% to 30%. When the percentage of pseudo-negative samples exceeds 30%, the performance of RF improves as the percentage increases.

[0118] Figure 3 shows the classification performance on the PDNA-316 data set at different...

Embodiment 3

[0121] The MMPCC algorithm was compared with the MAXR and MINR algorithms using the PDNA-316 data, where MMPCC is the abbreviation of the algorithm of the present invention.

[0122] In Embodiment 3, five-fold cross-validation is again used to evaluate the prediction performance of the proposed algorithm on the four indicators. The PDNA-316 data set is used to compare the classification performance of the MMPCC algorithm, the MAXR (max-relevance) algorithm and the MINR (min-redundancy) algorithm. The comparison results are shown in Figures 5-8.
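The five-fold cross-validation mentioned above can be sketched as follows. This is an illustrative split only: the excerpt does not say how the folds were drawn, so the stratification by class, the fixed seed and the function names are all assumptions.

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds with roughly equal class ratios.

    Stratification is an assumption made here because the data are
    imbalanced; the source only states that five folds were used.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin into the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def cross_validation_splits(labels, k=5, seed=0):
    """Yield (train_indices, test_indices), one pair per fold."""
    folds = stratified_folds(labels, k, seed)
    for t in range(k):
        test = sorted(folds[t])
        train = sorted(i for f, fold in enumerate(folds) if f != t for i in fold)
        yield train, test


# Hypothetical imbalanced labels: 10 positives (1) and 40 negatives (0).
labels = [1] * 10 + [0] * 40
splits = list(cross_validation_splits(labels))
for train, test in splits:
    print(len(train), len(test), sum(labels[i] for i in test))
```

One plausible design, not confirmed by the source, is to apply pseudo-negative selection to the training indices of each fold only, so that every test fold keeps its original labels.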

[0123] Figures 5-8 show that MMPCC is superior to the MAXR and MINR methods with both the RF and NN classifiers. Figure 5 shows that the pseudo-negative samples have a greater impact on the Sen value. When NN is used as the classifier, the Sen value of MMPCC is significantly better than that of MAXR and MINR. For RF classifiers, MAXR is the best when m...


Abstract

The invention discloses a data balancing method based on pseudo-negative samples and a method for improving data classification performance. The method comprises the following steps:
Step 1: separate the positive and negative samples to obtain a positive sample set and a negative sample set.
Step 2: calculate the set of Pearson correlation coefficients of the negative samples.
Step 3: initialize a pseudo-negative sample set and a selected sample set.
Step 4: use a maximum-relevance, minimum-redundancy method to calculate weights and obtain a weight set.
Step 5: select the maximum weight, and update the pseudo-negative sample set and the selected sample set.
Step 6: repeat steps 4 and 5 until the pseudo-negative sample set has been selected.
Step 7: merge the selected pseudo-negative sample set into the positive sample set, and remove it from the negative sample set.
The invention is the first to propose and define the concept of a pseudo-negative sample. The proposed algorithm can improve the accuracy of data classification and further improve the performance of the classifier, and it has obvious advantages in particular when dealing with unbalanced biological information data.
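The seven steps above can be sketched as a greedy selection loop. This is only an illustration under stated assumptions, not the MMPCC algorithm itself: the abstract does not give the weight formula, so the relevance term (mean Pearson correlation between a negative sample and the positive samples), the redundancy term (mean absolute correlation with the pseudo-negatives already chosen) and their difference as the weight are all hypothetical choices. The function names and the `rate_percent` parameter are also illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length feature vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant vector has no defined correlation
    return cov / math.sqrt(vx * vy)


def select_pseudo_negatives(positives, negatives, rate_percent):
    """Greedy sketch of steps 2-7; step 1 (splitting by label) is done by the caller.

    rate_percent is the pseudo-negative sample rate: the number of
    pseudo-negatives as a percentage of the number of positives.
    Returns (new_positives, new_negatives, picked_indices).
    """
    target = min(rate_percent * len(positives) // 100, len(negatives))
    # Step 2 (assumed form): relevance of each negative sample to the positive set.
    relevance = [
        sum(pearson(neg, pos) for pos in positives) / len(positives)
        for neg in negatives
    ]
    picked = []                              # Step 3: pseudo-negative set
    remaining = list(range(len(negatives)))  # Step 3: candidates not yet chosen
    while len(picked) < target:              # Step 6: repeat steps 4 and 5
        weights = {}
        for i in remaining:                  # Step 4 (assumed weight formula)
            redundancy = 0.0
            if picked:
                redundancy = sum(
                    abs(pearson(negatives[i], negatives[j])) for j in picked
                ) / len(picked)
            weights[i] = relevance[i] - redundancy
        best = max(weights, key=weights.get)  # Step 5: take the maximum weight
        picked.append(best)
        remaining.remove(best)
    # Step 7: move the pseudo-negatives from the negative set to the positive set.
    new_positives = positives + [negatives[i] for i in picked]
    new_negatives = [negatives[i] for i in remaining]
    return new_positives, new_negatives, picked


# Hypothetical toy data: 2 positives, 4 negatives, 4 features each.
positives = [[1, 2, 3, 4], [2, 4, 6, 9]]
negatives = [[1, 2, 3, 5], [4, 3, 2, 1], [2, 4, 6, 8], [1, -1, 1, -1]]
new_pos, new_neg, picked = select_pseudo_negatives(positives, negatives, 50)
print(picked, len(new_pos), len(new_neg))
```

Passing the rate as an integer percentage avoids floating-point rounding when computing the number of samples to select. A rate of 0 leaves both sets unchanged, matching the 0% baseline used in the embodiments.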

Description

Technical field

[0001] The invention relates to the technical field of information processing, in particular to a data balancing method based on pseudo-negative samples and a method for improving data classification performance.

Background

[0002] With the rapid growth of data volume, for example in biological information, machine learning technology is widely used in the field of biological information, because machine learning can extract important information from large-scale biological data to help people understand complex biological processes. However, the ubiquitous class imbalance problem greatly reduces the performance of machine learning. In theory, data mining cannot be realized with only a limited number of positive samples. Therefore, learning from many kinds of biological data requires solving the class imbalance problem, for example gene expression data, protein-DNA binding data, and predicted small-molecule RNA data.

[0003] The methods that have been proposed to reduce the i...


Application Information


Patent Type & Authority: Applications (China)

IPC(8): G06K9/62

CPC: G06F18/214, G06F18/24323

Inventor: 乔少杰, 张永清, 韩楠, 周激流, 卢荣钊, 刘定祥, 温敏, 魏军林, 袁犁

Owner: CHENGDU UNIV OF INFORMATION TECH
