Leave-multiple-out cross validation (LMOCV) method of quantitative structure and activity relationship (QSAR) model of organic pollutant

A technology for quantitative structure-activity models of organic pollutants, applied in special data processing applications, instruments, electrical digital data processing, etc. It addresses the lack of representativeness and uniform distribution of validation samples, subjects the model to large sample perturbation, and improves variable screening.

Inactive Publication Date: 2011-09-14
NANJING UNIV

AI Technical Summary

Problems solved by technology

Although this solves the problem of grouping samples, the Monte Carlo method usually groups samples according to a certain probability distribution, so the samples obtained cannot be uniformly distributed in the sample space. In other words, the validation samples obtained by the Monte Carlo method lack comprehensive representativeness (Picard R.R., Cook R.D. Cross-Validation of Regression Models. J. Am. Stat. Assoc. 1984, 79(387), 575-583).
[0006] The literature sea...

Examples

Embodiment 1

[0026] When the number of samples is 31, a 32-level uniform design table is constructed using the good lattice point (grid point) method, as shown in Table 1.

[0027] Table 1. The 32-level uniform design table constructed by the good lattice point (grid point) method

[0028]

[0029] As Table 1 shows, the 32-level uniform design table has 16 columns and 32 rows, and every element of the last row is 32. After the last row is deleted, the remaining 31 rows correspond to the sample numbers of the 31 samples, and each column represents one sample distribution. Each column is divided into 5 equal parts; the easiest way is to divide by row order, using the same division for all columns. The samples obtained by uniform design are distributed very evenly throughout the sample space, whereas the sample distribution obtained by the Monte Carlo method is not uniform. This is the advantage of using uniform design to obtain the LMOCV grouping.
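The construction in this embodiment can be sketched in a few lines. This is a minimal illustration, not the patent's own code: it assumes the table is built by the good lattice point rule (entry = row number times generator, modulo the number of levels, with 0 written as the number of levels), which reproduces the 16 columns and the all-32 last row described above. The function names `glp_uniform_table` and `split_column` are hypothetical.

```python
from math import gcd

def glp_uniform_table(n_levels):
    """Assumed good-lattice-point rule: entry (i, h) = i*h mod n, with 0 written as n.

    One column is produced for every generator h that is coprime to n_levels.
    """
    generators = [h for h in range(1, n_levels + 1) if gcd(h, n_levels) == 1]
    table = []
    for i in range(1, n_levels + 1):
        row = [(i * h) % n_levels or n_levels for h in generators]
        table.append(row)
    return table

def split_column(column, n_groups):
    """Split a column into consecutive blocks by row order.

    Leftover entries (when the length is not divisible) go to the last block.
    """
    size = len(column) // n_groups
    groups = [column[g * size:(g + 1) * size] for g in range(n_groups)]
    groups[-1].extend(column[n_groups * size:])
    return groups

table = glp_uniform_table(32)
trimmed = table[:-1]                      # delete the last row, leaving 31 rows
second_column = [row[1] for row in trimmed]
groups = split_column(second_column, 5)   # 5 validation groups of sample numbers

for number, group in enumerate(groups, start=1):
    print(f"group {number}: {group}")
```

Each block of sample numbers is one validation group; the remaining samples form the training set for that fold. Repeating the split for every column gives as many rounds of 5-fold cross-validation as the table has columns.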

Embodiment 2

[0031] The best model for the 91 samples in the literature (Cronin M.T.D., Netzeva T.I., Dearden J.C., Edwards R., Worgan A.D.P. Assessment and Modeling of the Toxicity of Organic Chemicals to Chlorella vulgaris: Development of A Novel Database. Chem. Res. Toxicol. 2004, 17(4), 545-554) uses three structural descriptors, Kow, LUMO and Δ¹χᵛ, as variables. The correlation coefficient of the model is r² = 0.890, and the LOOCV q² = 0.875.
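The excerpt does not spell out how these statistics are computed. Under the definitions commonly used in QSAR work (an assumption here, not a statement from the patent), with y_i the observed activity, \hat{y}_i the fitted value, \hat{y}_{(i)} the prediction for sample i from a model trained without it, and \bar{y} the mean over all n samples:

```latex
r^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
q^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_{(i)})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
\mathrm{RMSE}_{\mathrm{CV}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_{(i)})^2}
```

The same q² expression applies to leave-one-out and leave-multiple-out validation; only the way samples are held out changes.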

[0032] The method of the present invention is used to apply UDOLMOCV to this model. First a 92-level uniform design table is constructed and its last row is deleted, leaving 44 columns in total. Each column is then divided into 2, 5, or 10 equal parts (if the number of samples is not divisible, the leftover samples are assigned to the last group), which yields 44 rounds each of 2-, 5-, and 10-fold cross-validation (denoted UD-2, UD-5, and UD-10, respectively). The calculation results are shown in Table 2. As can be seen from Table 2, 2-, 5-, 10-fold UDOLMOCV The root mean ...
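The procedure in this paragraph can be sketched as follows. This is an illustration under stated assumptions: the table columns follow the good lattice point rule, the model is an ordinary least-squares fit with an intercept, and q² and RMSE use the usual cross-validation definitions. The data are synthetic random numbers, not the Chlorella vulgaris data set, and the names `ud_columns`, `folds_from_column` and `ud_lmo_cv` are hypothetical.

```python
from math import gcd
import numpy as np

def ud_columns(n_samples):
    """Columns of the (n_samples + 1)-level table with its last row removed.

    Each column is a permutation of the sample numbers 1..n_samples.
    """
    n = n_samples + 1
    generators = [h for h in range(1, n + 1) if gcd(h, n) == 1]
    rows = np.arange(1, n_samples + 1)
    return [(rows * h) % n for h in generators]

def folds_from_column(column, k):
    """Consecutive blocks by row order; leftover samples join the last block."""
    size = len(column) // k
    folds = [column[g * size:(g + 1) * size] for g in range(k - 1)]
    folds.append(column[(k - 1) * size:])
    return folds

def ud_lmo_cv(X, y, k):
    """Return one q2 and one RMSE per table column for k-fold validation."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])          # intercept + descriptors
    ss_tot = np.sum((y - y.mean()) ** 2)
    q2_values, rmse_values = [], []
    for column in ud_columns(n):
        predicted = np.empty(n)
        for fold in folds_from_column(column, k):
            test = fold - 1                        # sample numbers to 0-based indices
            train = np.setdiff1d(np.arange(n), test)
            coef, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)
            predicted[test] = A[test] @ coef
        press = np.sum((y - predicted) ** 2)
        q2_values.append(1.0 - press / ss_tot)
        rmse_values.append(np.sqrt(press / n))
    return np.array(q2_values), np.array(rmse_values)

# Synthetic stand-in for a 91-sample, 3-descriptor data set (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(91, 3))
y = X @ np.array([0.7, -0.5, 0.3]) + rng.normal(scale=0.3, size=91)

for k in (2, 5, 10):
    q2, rmse = ud_lmo_cv(X, y, k)
    print(f"UD-{k}: {len(q2)} rounds, mean q2 = {q2.mean():.3f}, mean RMSE = {rmse.mean():.3f}")
```

Every column yields one complete k-fold pass in which each sample is predicted exactly once, so the 44 columns give 44 q² and RMSE values per fold setting.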

Embodiment 3

[0036] Literature (Liu H., Papa E., Gramatica P. QSAR Prediction of Estrogen Activity for A Large Set of Diverse Chemicals under the Guidance of OEC

[0037] The method of the present invention is used to apply UDOLMOCV to this model. First a 133-level uniform design table is constructed and its last row is deleted, leaving 108 columns in total. Each column is then divided into 2, 5, or 10 equal parts (if the number of samples is not divisible, the leftover samples are assigned to the last group), which yields 108 rounds each of 2-, 5-, and 10-fold cross-validation (denoted UD-2, UD-5, and UD-10, respectively). The calculation results in Table 3 show that the root mean square error obtained by UDOLMOCV is always larger than the Monte Carlo cross-validation result, which indicates that the sample grouping adopted by the present invention is more representative. When the sample disturbance is relatively large (such as 2-fold), the stability of the model is significantly reduced, and the...
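The difference between the two grouping schemes can be illustrated without fitting any model. The sketch below counts how often each of the 132 samples lands in a validation group for one 5-fold pass. It assumes the good lattice point rule for the table column, and it assumes a Monte Carlo baseline that draws each validation group independently at random; the patent excerpt does not describe the baseline in this detail, and the helper names are hypothetical.

```python
import random
from collections import Counter
from math import gcd

N_SAMPLES = 132   # 133-level table with the last row removed
N_GROUPS = 5

def ud_column(n_samples, h):
    """One table column under the assumed good-lattice-point rule."""
    n = n_samples + 1
    return [(i * h) % n for i in range(1, n_samples + 1)]

def consecutive_groups(sequence, k):
    """Blocks by row order; leftover samples join the last block."""
    size = len(sequence) // k
    groups = [sequence[g * size:(g + 1) * size] for g in range(k - 1)]
    groups.append(sequence[(k - 1) * size:])
    return groups

# Generators coprime to 133 give the table's columns.
generators = [h for h in range(1, N_SAMPLES + 2) if gcd(h, N_SAMPLES + 1) == 1]

# Uniform design grouping: split one column (the second generator) by row order.
ud_groups = consecutive_groups(ud_column(N_SAMPLES, generators[1]), N_GROUPS)
ud_counts = Counter(sample for group in ud_groups for sample in group)

# Assumed Monte Carlo baseline: each group is an independent random draw.
rng = random.Random(0)
mc_groups = [rng.sample(range(1, N_SAMPLES + 1), len(group)) for group in ud_groups]
mc_counts = Counter(sample for group in mc_groups for sample in group)

print("columns in the table:", len(generators))
print("UD: samples validated exactly once:", sum(1 for c in ud_counts.values() if c == 1))
print("MC: samples never validated:", N_SAMPLES - len(mc_counts))
print("MC: samples validated more than once:", sum(1 for c in mc_counts.values() if c > 1))
```

Because each table column is a permutation of the sample numbers, the uniform design groups cover every sample exactly once per pass, while independent random draws can repeat some samples and skip others.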

Abstract

The invention discloses a leave-multiple-out cross-validation (LMOCV) method for quantitative structure-activity relationship (QSAR) models of organic pollutants. The method integrates uniform design with LMOCV. The correlation coefficient of uniform-design-optimized LMOCV (UDOLMOCV) serves as the termination criterion for variable screening, and UDOLMOCV is also used for internal cross-validation of the model. As an indicator of predictive ability, it subjects the model to larger sample perturbation; with relatively few validation rounds, the validation samples are uniformly distributed in the sample space, and the samples selected each time are highly representative. Because the validation samples obtained by the method represent the sample distribution well, the one-sidedness of sample selection in the Monte Carlo method is overcome. The invention can be used to validate the stability and predictive ability of a QSAR model and to discover and identify sources of instability in the model.
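One way to read "termination criterion for variable screening" is sketched below. The abstract does not say which screening procedure is used, so forward selection is an assumption made for illustration, as are the least-squares model, the good lattice point table, the synthetic data, and the names `ud_fold_indices`, `mean_q2` and `forward_select`.

```python
from math import gcd
import numpy as np

def ud_fold_indices(n_samples, k):
    """For each table column, yield k arrays of 0-based validation indices."""
    n = n_samples + 1
    rows = np.arange(1, n_samples + 1)
    size = n_samples // k
    cuts = [g * size for g in range(1, k)]        # leftovers fall in the last fold
    for h in range(1, n):
        if gcd(h, n) == 1:
            yield np.split((rows * h) % n - 1, cuts)

def mean_q2(X, y, k):
    """Average cross-validated q2 over all table columns."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    ss_tot = np.sum((y - y.mean()) ** 2)
    scores = []
    for folds in ud_fold_indices(n, k):
        predicted = np.empty(n)
        for test in folds:
            train = np.ones(n, dtype=bool)
            train[test] = False
            coef, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)
            predicted[test] = A[test] @ coef
        scores.append(1.0 - np.sum((y - predicted) ** 2) / ss_tot)
    return float(np.mean(scores))

def forward_select(X, y, k=5, min_gain=0.0):
    """Add descriptors one at a time; stop when mean q2 no longer improves."""
    selected, best = [], -np.inf
    remaining = list(range(X.shape[1]))
    while remaining:
        score, candidate = max((mean_q2(X[:, selected + [j]], y, k), j) for j in remaining)
        if score - best <= min_gain:
            break                                  # termination criterion reached
        selected.append(candidate)
        remaining.remove(candidate)
        best = score
    return selected, best

# Synthetic data: 31 samples, 6 candidate descriptors, only columns 0 and 2 matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(31, 6))
y = 3.0 * X[:, 0] + 1.0 * X[:, 2] + rng.normal(scale=0.2, size=31)

selected, best_q2 = forward_select(X, y, k=5)
print("selected descriptor columns:", selected)
print("mean UD-5 q2 of the final model:", round(best_q2, 3))
```

Using a cross-validated statistic rather than the fitted r² as the stopping rule penalizes descriptors that only help on the training samples, which is the motivation the abstract gives for the criterion.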

Description

Technical field

[0001] The invention relates to a leave-multiple-out cross-validation method for quantitative structure-activity relationship models of organic pollutants. Specifically, it is a novel cross-validation method in which the model is internally validated by uniform-design-optimized leave-multiple-out cross-validation, and the resulting cross-validated correlation coefficient serves both as the termination criterion for model variable screening and as the indicator of predictive ability.

Background technique

[0002] As a computer modeling technique, the Quantitative Structure and Activity Relationship (QSAR) research method for organic pollutants can explore in depth the quantitative relationships and causal links between the chemical structure of organic pollutants and their harm to the human body and the ecological environment. It ...

Application Information

IPC(8): G06F17/50
Inventor 张爱茜, 易忠胜, 李富华, 蔺远, 高常安, 穆云松
Owner NANJING UNIV