
Random forest model training method optimized for class-imbalanced data

A random forest training method, applied in the field of random forest model training, that addresses models being unsuitable for human understanding and overfitting problems being difficult to troubleshoot.

Pending Publication Date: 2021-11-09
上海基绪康生物科技有限公司

AI Technical Summary

Problems solved by technology

[0009] To address the deficiencies of the prior art, the present invention provides a random forest model training method optimized for class-imbalanced data. The method is easy for humans to understand and is less affected both by class imbalance and by the number of features exceeding the number of samples in a particular class. It addresses three problems. First, current prediction of small-molecule drug activity relies on deep learning models, which are not suited to human understanding. Second, class sizes in drug activity data sets differ by orders of magnitude, so any classification model tends to assign samples to the large classes by default; the overall accuracy is high, but the errors are concentrated in the small classes. Third, the number of features exceeds the number of samples in a particular class by at least an order of magnitude, or even by dozens of times, and this ratio of features to samples readily causes overfitting problems that are difficult to troubleshoot.
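The point that high accuracy can hide errors concentrated in the small class can be illustrated with a short calculation. The sketch below is not from the patent: the class sizes (990 inactive, 10 active) and the always-predict-majority "model" are assumptions chosen only to make the arithmetic visible.

```python
# Hypothetical screening set: 990 inactive (label 0) and 10 active (label 1) molecules.
y_true = [0] * 990 + [1] * 10

# A degenerate "model" that always predicts the majority class.
y_pred = [0] * len(y_true)


def recall(y_true, y_pred, label):
    """Fraction of samples whose true class is `label` that were predicted as `label`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    total = sum(1 for t in y_true if t == label)
    return hits / total


# Overall accuracy counts every sample equally, so the majority class dominates it.
accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Per-class recall exposes where the errors actually fall.
recall_inactive = recall(y_true, y_pred, 0)
recall_active = recall(y_true, y_pred, 1)

# Balanced accuracy averages the per-class recalls, weighting both classes equally.
balanced_accuracy = (recall_inactive + recall_active) / 2

print(f"accuracy={accuracy:.2f}")
print(f"recall_active={recall_active:.2f}")
print(f"balanced_accuracy={balanced_accuracy:.2f}")
```

Under these assumed numbers, every active molecule is misclassified even though 99% of predictions are correct, which is why a class-aware metric is a more informative target when tuning a model on such data.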



Examples


Detailed Description of Embodiments

[0026] The technical solutions in the embodiments of the present invention will be clearly and completely described below in conjunction with the embodiments of the present invention. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without making creative efforts belong to the protection scope of the present invention.

[0027] Please refer to Figures 1-3, in which Figure 2 is taken from the paper "A Deep Learning Approach to Antibiotic Discovery", where B is the ROC-AUC of the graph neural network.

[0028] To address the problem of comprehensibility in the candidate molecule proposal process, we designed a random forest model. The model uses small-molecule descriptors or fingerprint features as the independent variables and antibiotic activity as the dependent variable. The random forest model has feature...


Abstract

The invention relates to a random forest model training method optimized for class-imbalanced data. The method comprises the following steps: S1, calculating descriptors and fingerprints of small-molecule compounds; S2, constructing a sampling and machine learning pipeline and determining the candidate parameters; S3, constructing a Bayesian optimization procedure; and S4, validating the Bayesian optimization result and outputting a model. Training set generation targets the class imbalance problem and includes oversampling methods, undersampling methods, and the like. Random forest training requires determining the maximum number of features, the maximum number of decision trees, the number of samples per leaf node, the decision tree split criterion (Gini index, information entropy, and the like), whether samples are assigned weights, and so on. According to the invention, training set generation and the subsequent random forest training are interdependent, that is, a specific training set generation procedure corresponds to a specific random forest training configuration, so the required parameters of both are determined jointly by Bayesian optimization.
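The joint treatment of sampling and model parameters described in the abstract can be sketched roughly as follows. This is a minimal plain-Python illustration, not the patent's implementation: the function names (`random_oversample`, `random_undersample`, `search`), the candidate values in `SEARCH_SPACE`, and the user-supplied `objective` are all assumptions. The proposal step uses uniform random sampling as a stand-in; a Bayesian optimizer would instead choose each new configuration from the scores of earlier ones.

```python
import random
from collections import defaultdict


def _indices_by_class(y):
    """Group row indices by class label."""
    groups = defaultdict(list)
    for i, label in enumerate(y):
        groups[label].append(i)
    return groups


def random_oversample(X, y, rng):
    """Duplicate randomly chosen rows of smaller classes up to the largest class size."""
    groups = _indices_by_class(y)
    target = max(len(idx) for idx in groups.values())
    keep = []
    for idx in groups.values():
        keep.extend(idx)
        keep.extend(rng.choices(idx, k=target - len(idx)))
    return [X[i] for i in keep], [y[i] for i in keep]


def random_undersample(X, y, rng):
    """Randomly drop rows of larger classes down to the smallest class size."""
    groups = _indices_by_class(y)
    target = min(len(idx) for idx in groups.values())
    keep = []
    for idx in groups.values():
        keep.extend(rng.sample(idx, target))
    return [X[i] for i in keep], [y[i] for i in keep]


SAMPLERS = {
    "none": lambda X, y, rng: (X, y),
    "oversample": random_oversample,
    "undersample": random_undersample,
}

# One search space covering BOTH training-set generation and forest parameters.
# All candidate values are illustrative placeholders, not values from the patent.
SEARCH_SPACE = {
    "sampler": list(SAMPLERS),
    "max_features": ["sqrt", "log2", 0.1, 0.3],
    "n_trees": [100, 300, 1000],
    "min_samples_leaf": [1, 3, 5, 10],
    "criterion": ["gini", "entropy"],
    "class_weight": [None, "balanced"],
}


def sample_config(rng):
    """Draw one joint configuration (sampler plus forest parameters)."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def search(X, y, objective, n_trials, seed=0):
    """Random-search stand-in for the Bayesian optimization step.

    `objective(X_resampled, y_resampled, config)` is a hypothetical callback that
    should train a forest on the resampled data and return a score measured on
    untouched validation data (higher is better).
    """
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        X_res, y_res = SAMPLERS[config["sampler"]](X, y, rng)
        score = objective(X_res, y_res, config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

Keeping the sampler choice inside the same configuration as the forest parameters is what lets one optimizer account for their interaction, rather than tuning the resampling step and the model in separate passes.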

Description

Technical field

[0001] The invention relates to the technical field of machine learning in small-molecule drug screening, in particular to a random forest model training method optimized for class-imbalanced data.

Background

[0002] Drug development is a process with huge economic costs. According to 2014 statistics, each new drug costs more than 2 billion US dollars on average. Development has a failure rate of up to 90%, with 50% of failures attributed to efficacy and 30% to safety.

[0003] In the field of small-molecule drugs, confirming drug activity is also costly. In typical cases, after thousands of small molecules have undergone complex validation, only a few become drug candidates. In addition, the failure rate of designing new compounds is often very high. Therefore, it is of great significance for drug development to simply and quickly select promising small molecules from existing drug-like compounds and reduce unnecessary ...


Application Information

IPC(8): G16C20/50, G16C20/70, G06N20/00
CPC: G16C20/50, G16C20/70, G06N20/00
Inventors: 关申民, 韦嘉
Owner: 上海基绪康生物科技有限公司