Date post: 25-Dec-2015
Combining Statistical and Physical Considerations in Deriving Targeted QSPRs Using Very Large Molecular Descriptor Databases

Inga Paster and Mordechai Shacham
Dept. Chem. Eng., Ben-Gurion University of the Negev, Beer-Sheva, Israel

Greta Tovarovski and Neima Brauner
School of Engineering, Tel-Aviv University, Tel-Aviv, Israel


The Targeted QSPR Method

OBJECTIVE: Predict physical properties of a target compound using structural information on this compound together with structural and property-related information on similar, predictive compounds (the training set). The structural information is presented in the form of molecular descriptors (calculated properties of the molecule).

ALGORITHM – STEP 1:
• Similarity group: Select a group of compounds similar to the target based on similarity measures (e.g., correlation between the vectors of molecular descriptors of the target and of potential predictive compounds).
• Training set: Select the most similar predictive compounds for which data for the target property are available.
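
Step 1 can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name, the group sizes and the use of the Pearson correlation over raw descriptor vectors are hypothetical choices, since the slides give correlation only as one example of a similarity measure.

```python
import numpy as np

def select_training_set(target, candidates, has_property, n_similar=14, n_train=10):
    """Rank candidate compounds by correlation with the target descriptor vector.

    target:       1-D array of descriptor values for the target compound
    candidates:   2-D array, one row of descriptor values per candidate compound
    has_property: boolean array, True where measured property data exist
    n_similar, n_train: illustrative group sizes (assumptions, not from the slides)
    """
    target = np.asarray(target, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    has_property = np.asarray(has_property, dtype=bool)

    # Pearson correlation between the target vector and every candidate row
    # (assumes no descriptor vector is constant, otherwise the norm is zero)
    t = target - target.mean()
    c = candidates - candidates.mean(axis=1, keepdims=True)
    r = (c @ t) / (np.linalg.norm(c, axis=1) * np.linalg.norm(t))

    # Similarity group: the n_similar candidates with the highest correlation
    group = np.argsort(-r)[:n_similar]

    # Training set: the most similar group members that have property data
    training = [int(i) for i in group if has_property[i]][:n_train]
    return group, training, r
```

Group members without property data, or beyond the first n_train, remain available for validation or prediction.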


The Targeted QSPR Method

ALGORITHM – STEP 2:
• QSPR model: Use a stepwise regression program to identify a (linear) QSPR (Quantitative Structure Property Relationship) model that best represents the property data of the training set in terms of molecular descriptors.
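
The slides do not name the stepwise regression program or its entry criteria. As a rough illustration only, a forward-selection variant might look like the sketch below; the function name, `max_terms` and the `min_gain` stopping rule are assumptions made for this example.

```python
import numpy as np

def forward_stepwise(X, y, max_terms=2, min_gain=1e-3):
    """Greedy forward selection of descriptors for a linear model with intercept.

    X: 2-D array (training compounds x descriptors), y: property values.
    A descriptor is added only if it lowers the residual sum of squares by
    more than min_gain times the total sum of squares (an assumed criterion).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = X.shape
    selected = []
    total_ss = float(np.sum((y - y.mean()) ** 2))
    best_rss = total_ss               # intercept-only model
    coef = np.array([y.mean()])
    while len(selected) < max_terms:
        trial = None
        for j in range(m):
            if j in selected:
                continue
            A = np.column_stack([np.ones(n), X[:, selected + [j]]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = float(np.sum((y - A @ beta) ** 2))
            if trial is None or rss < trial[0]:
                trial = (rss, j, beta)
        # Stop when no descriptor is left or the improvement is negligible
        if trial is None or (best_rss - trial[0]) <= min_gain * total_ss:
            break
        best_rss, j, coef = trial
        selected.append(j)
    return selected, coef  # coef[0] is the intercept, then one entry per selected column
```

With thousands of descriptors and only about ten training compounds, keeping `max_terms` small helps limit the risk of chance correlations.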

ALGORITHM – STEP 3:
• Prediction: Use the QSPR model and the descriptor values of the target compound (and of the other compounds in the similarity group, i.e., the validation set) to predict the property values.


Similarity group of 1-tridecanol – an example

No.  Compound                Type
1    1-tridecanol            target
2    1-tetradecanol          training set
3    1-pentadecanol          training set
4    1-undecanol             training set
5    1-hexadecanol           training set
6    1-heptadecanol          training set
7    2-methyl-dodecan-1-ol   training set
8    tridecanal              training set
9    2-butyl-1-decanol       training set
10   2-methyl-1-tridecanol   training set
11   n-dodecanoic acid       training set
12   dodecanal               validation set
13   1-octadecanol           validation set
14   1-nonadecanol           validation set
15   1-eicosanol             validation set


Predicting NBP for 1-tridecanol

No.  Compound                NBP data (K)  Desc. EEig10r  Desc. E3m  NBP predicted* (K)  Rel. err. (%)
1    1-tridecanol            553.4         -0.266         0.028      553.8               -0.08
2    1-tetradecanol          568.8         -0.022         0.028      569.9               -0.19
3    1-pentadecanol          583.9         0.212          0.028      585.3               -0.23
4    1-undecanol             520.3         -0.742         0.027      523.6               -0.63
5    1-hexadecanol           598           0.43           0.028      599.6               -0.27
6    1-heptadecanol          611.3         0.631          0.028      612.8               -0.25
7    2-methyl-dodecan-1-ol   538.8         -0.364         0.034      541.0               -0.41
8    tridecanal              540.15        -0.231         0.042      541.3               -0.20
9    2-butyl-1-decanol       563           -0.052         0.034      561.5               0.26
10   2-methyl-1-tridecanol   563.1         -0.134         0.034      556.1               1.24
11   n-dodecanoic acid       571.85        -0.16          0.021      568.2               0.63
12   dodecanal               523.15        -0.485         0.042      524.5               -0.27
13   1-octadecanol           623.6         0.814          0.029      623.8               -0.03
14   1-nonadecanol           635.1         0.98           0.029      634.7               0.06
15   1-eicosanol             645.5         1.131          0.029      644.6               0.14

The QSPR: NBP = 601.0456+65.76087*EEig10r-1061.998*E3m

Descriptor EEig10r: eigenvalue 10 from the edge adjacency matrix weighted by resonance integrals (a 2-D descriptor)
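
The QSPR above can be checked directly against the tabulated values. The short sketch below evaluates it for the target compound; the function names are illustrative, and the sign convention of the relative error, (data - predicted) / data, is inferred from the tabulated numbers rather than stated on the slide.

```python
def predict_nbp(eeig10r, e3m):
    """Normal boiling point (K) from the two-descriptor QSPR given on the slide."""
    return 601.0456 + 65.76087 * eeig10r - 1061.998 * e3m

def relative_error_percent(data, predicted):
    """Signed relative error in percent (sign convention inferred from the table)."""
    return (data - predicted) / data * 100.0

# 1-tridecanol (target): EEig10r = -0.266, E3m = 0.028, NBP data = 553.4 K
nbp = predict_nbp(-0.266, 0.028)
err = relative_error_percent(553.4, nbp)
print(f"predicted NBP = {nbp:.1f} K, relative error = {err:.2f} %")
# prints: predicted NBP = 553.8 K, relative error = -0.08 %
```

The same two functions reproduce the validation-set rows when given their descriptor values.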


Predicting NBP for 1-tridecanol

The QSPR: NBP = 601.0456+65.76087*EEig10r-1061.998*E3m

Descriptor E3m: 3rd component accessibility directional WHIM (Weighted Holistic Invariant Molecular descriptor) index, weighted by atomic masses (a 3-D descriptor)

[Figure: predicted NBP (K) versus NBP data (K), both axes from 500 to 650; series: training set, target, validation set, and a linear fit to the training set.]


Descriptor Types

Computer programs that can calculate several thousand molecular descriptors are available. Examples of descriptors:

Molecular Weight

Number of aromatic C

Wiener Index

3D Wiener Index

MlogP
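
To make one of the listed descriptors concrete, the sketch below computes the Wiener index in its standard graph-theoretical form: the sum of shortest-path bond counts over all pairs of non-hydrogen atoms. The helper names are illustrative, and the exact conventions of any particular descriptor program are not taken from the slides.

```python
from collections import deque

def wiener_index(adjacency):
    """Sum of shortest-path bond counts over all pairs of heavy atoms."""
    total = 0
    for start in adjacency:
        # Breadth-first search gives topological distances from `start`
        dist = {start: 0}
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for neighbor in adjacency[atom]:
                if neighbor not in dist:
                    dist[neighbor] = dist[atom] + 1
                    queue.append(neighbor)
        total += sum(dist.values())
    return total // 2  # every pair was counted from both ends

def n_alkane(n_carbons):
    """Carbon skeleton of an n-alkane as an adjacency dict (hydrogens omitted)."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n_carbons]
            for i in range(n_carbons)}

print(wiener_index(n_alkane(6)))  # n-hexane skeleton, prints 35
```

Because it depends only on connectivity, a 2-D descriptor of this kind is unaffected by how the 3-D structure was minimized.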


Algorithms Used by NIST* for Minimization of the Molecular Structure

*National Institute of Standards and Technology (NIST). In: Linstrom PJ, Mallard WG, eds. Chemistry WebBook, NIST Standard Reference Database Number 69. Gaithersburg, MD: NIST; June 2005 (http://webbook.nist.gov).

• Initial structures are the 2-D MOL files.
• 3-D structures are generated using the Alchemy 2000 desktop software package and its native molecular-mechanics force field.
• The structures are re-optimized using the MM3 force field and the simulated annealing algorithm included in the Tinker software package.
• Final optimization is at the PM3 level (using the version of MOPAC6 bundled with the Alchemy 2000 package or, in some cases, the Gaussian 94 software package).


The Importance of the Reliability of the Descriptors

• It is practically impossible to check the accuracy and consistency of the individual descriptor values for the large number of descriptors and compounds involved.
• The 3-D descriptors can be particularly unreliable because of the uncertainty associated with the minimization of the 3-D structure.
• Reliable and consistent descriptor values are particularly important in the selection of the training set.
• If different software packages are used to calculate the descriptors of the predictive compounds (in the database) and of the target compound (not in the database), inconsistency in the descriptors included in the QSPR may cause poor property prediction.
• Descriptor “noise level”: The effect of the 3-D minimization technique on the descriptor value should be considered in establishing a reliable estimate of the “noise level”.


Generation of Molecular Structure Files and Molecular Descriptors

For the first part of this study a database containing 326 compounds (hydrocarbons, 1-alcohols and n-aliphatic acids) was used.

The molecular geometries were optimized using the CNDO (Complete Neglect of Differential Overlap) semi-empirical method implemented in the HyperChem package*.

The Dragon+ program was used to calculate 1664 descriptors for the compounds in the database from the energy-minimized molecular models.

*HyperChem program, version 7.01, Hyperchem is copyrighted by Hypercube Inc. (http://www.hyper.com/ ).

+Todeschini, R.; Consonni, V.; Mauri, A.; Pavan, M. DRAGON user manual, Talete srl, Milano, Italy, 2006. ©TALETE srl, http://www.talete.mi.it.


Test 1 – Plotting Descriptors of Same Family Neighboring Compounds

[Figure: descriptor values of n-heptane (vertical axis) plotted against those of n-hexane (horizontal axis), both from -0.8 to 1.2; linear fit y = 0.9815x, R² = 0.917; points far from the line are marked "Outlying, unreliable descriptors".]
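
Test 1 can be approximated numerically. The sketch below fits a line through the origin to the descriptor values of two neighboring compounds and flags descriptors far from it. The function name and the residual tolerance `tol` are assumptions for illustration, and the R² definition used here (one minus residual over total sum of squares) may differ from the one used by the plotting program.

```python
import numpy as np

def compare_neighbors(x, y, tol=0.2):
    """Fit y = slope * x and flag descriptors with large residuals.

    x, y: values of the same descriptors for two neighboring compounds
          (e.g. n-hexane and n-heptane), on a comparable, normalized scale
    tol:  absolute residual above which a descriptor is flagged (assumed value)
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope = float(x @ y / (x @ x))        # least-squares line through the origin
    residuals = y - slope * x
    ss_res = float(residuals @ residuals)
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    outliers = np.flatnonzero(np.abs(residuals) > tol)
    return slope, r2, outliers
```

The returned indices point to the descriptors that would appear as outlying points on the plot.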


Test 2 – Plotting Descriptor Values Versus the No. of C Atoms in (n-alkene) Homologous Series

Monotonic Change – Reliable Descriptors

[Figure: normalized descriptor value (0 to 1.2) versus No. of carbon atoms (0 to 35) for descriptors AGDD, ASP and H4m.]


Test 2 – Plotting Descriptor Values Versus the No. of C Atoms in (n-alkene) Homologous Series

Separate curves for odd and even nc – Consistent with some solid properties

[Figure: descriptor value (0 to 4.5) versus No. of carbon atoms (0 to 35) for descriptors PJI2 and ICR.]


Test 2 – Plotting Descriptor Values Versus the No. of C Atoms in (n-alkene) Homologous Series

Inconsistent (random) variation of the descriptor value with nc – Unreliable descriptor (Gm, a 3-D WHIM descriptor)

[Figure: descriptor Gm (0.10 to 0.40) versus No. of carbon atoms (2 to 30).]
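
The three outcomes of Test 2 (monotonic change, separate odd and even curves, random variation) can be told apart with a simple screen. The sketch below is a deliberately strict version with no noise tolerance; the function names and the returned labels are illustrative, not part of the original study.

```python
import numpy as np

def is_monotonic(values):
    """True if the sequence never increases or never decreases."""
    d = np.diff(np.asarray(values, dtype=float))
    return bool(np.all(d >= 0) or np.all(d <= 0))

def classify_series(n_carbons, values):
    """Classify one descriptor along a homologous series.

    Returns "monotonic", "odd-even" (odd and even carbon numbers each follow
    their own monotonic curve) or "inconsistent".
    """
    n_carbons = np.asarray(n_carbons)
    values = np.asarray(values, dtype=float)
    order = np.argsort(n_carbons)
    n_carbons, values = n_carbons[order], values[order]
    if is_monotonic(values):
        return "monotonic"
    odd = values[n_carbons % 2 == 1]
    even = values[n_carbons % 2 == 0]
    if is_monotonic(odd) and is_monotonic(even):
        return "odd-even"
    return "inconsistent"
```

In practice a tolerance tied to the descriptor noise level would be needed before labeling a descriptor unreliable.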


Test 3 – Comparing 3-D descriptors obtained from 3-D structure files minimized by different algorithms

Compounds for which 3-D MOL files are available from NIST and Dragon

No.  Compound        No.  Compound       No.  Compound
1    methane         12   pyrrole        22   trans-2-butene
2    ethanol         13   furan          23   cyclopentane
3    ethane          14   isobutane      24   n-pentane
4    2-propanone     15   cyclohexane    25   cis-2-butene
5    benzene         16   cyclopropane   26   cyclohexanone
6    2-propylamine   17   n-butane       27   anthracene
7    2-propanol      18   toluene        28   1-propanol
8    propane         19   thiophene      29   dibenzofuran
9    naphthalene     20   2-butyne       30   2-methylpentane
10   phenol          21   neopentane     31   n-hexane
11   cyclobutane


Visual comparison of Dragon and NIST 3-D structure files using Gaussian 3


Groups of 3 – D Descriptors Calculated by Dragon

Counts are given as: total number of descriptors → average No. of nonzero descriptors.

1. 3D-MoRSE (160 → 158): Descriptors calculated by summing atom weights viewed by different angular scattering functions.

2. Geometrical (74 → 33): Different kinds of conformationally dependent descriptors based on the molecular geometry.

3. GETAWAY (197 → 124): Descriptors calculated from the leverage matrix obtained from the centred atomic coordinates (molecular influence matrix, MIM).

4. Randic molecular profiles (41 → 25): Descriptors derived from the distance distribution moments of the geometry matrix, defined as the average row sum of its entries raised to the k-th power, normalized by the factor k!.

5. RDF (150 → 45): Descriptors obtained by radial basis functions centred on different interatomic distances (from 0.5 Å to 15.5 Å).

6. WHIM (99 → 92): Descriptors obtained as statistical indices of the atoms projected onto the 3 principal components obtained from weighted covariance matrices of atomic coordinates.


Percent differences between 3-D descriptors based on NIST and Dragon Library MOL files (28 compounds)

Entries are the cumulative percentage of descriptors whose difference falls within each limit.

No.  Descriptor group            ≤ 0.2%  ≤ 3%   ≤ 10%  ≤ 50%  > 50%
1    3D-MoRSE                    4.51    15.09  30.00  67.03  32.97
2    Geometrical                 12.97   81.38  92.89  97.18  2.82
3    GETAWAY                     33.17   85.06  96.94  99.43  0.57
4    Randic molecular profiles   16.95   58.47  81.92  99.15  0.85
5    RDF                         1.03    4.61   22.67  70.01  29.99
6    WHIM                        30.13   72.51  92.40  97.75  2.25

n-hexane, 2-methylpentane and 1-propanol

No.  Descriptor group            ≤ 0.2%  ≤ 3%   ≤ 10%  ≤ 50%  > 50%
1    3D-MoRSE                    3.96    10.00  20.83  58.75  41.25
2    Geometrical                 0.98    15.69  39.22  92.16  7.84
3    GETAWAY                     6.25    15.73  43.10  93.10  6.90
4    Randic molecular profiles   0.00    3.81   12.38  36.19  63.81
5    RDF                         0.00    6.01   20.22  55.19  44.81
6    WHIM                        9.09    15.15  31.99  87.54  12.46
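
A comparison of this kind can be sketched as follows. The slides do not state how the percent difference was normalized or how zero-valued descriptors were handled, so the choices below (normalizing by the larger absolute value, skipping descriptors that are zero in both files) and the function names are assumptions.

```python
import numpy as np

def percent_difference(a, b):
    """Percent difference between two sets of values for the same descriptors.

    a, b: descriptor values computed from two differently minimized structures.
    Normalization by the larger absolute value is an assumed convention.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    keep = (a != 0) | (b != 0)            # skip descriptors that are zero in both
    a, b = a[keep], b[keep]
    scale = np.maximum(np.abs(a), np.abs(b))
    return 100.0 * np.abs(a - b) / scale

def cumulative_bins(pct, limits=(0.2, 3.0, 10.0, 50.0)):
    """Share (%) of descriptors within each limit, plus the share above the last."""
    pct = np.asarray(pct, dtype=float)
    shares = {f"<= {lim}%": 100.0 * float(np.mean(pct <= lim)) for lim in limits}
    shares[f"> {limits[-1]}%"] = 100.0 * float(np.mean(pct > limits[-1]))
    return shares
```

Applied per descriptor group, `cumulative_bins` yields rows with the same layout as the tables above, with the last two entries summing to 100.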


Data for an Extensive 709 Compounds Study

For this study 709 compounds from the DIPPR database were used.

For these compounds 3-D MOL files are available from the NIST database and from molecular structures minimized by the DIPPR staff.

For the latter, most of the compounds were minimized in Gaussian 03 using B3LYP/6-311+G(3df,2p), a density functional method. Most of the other compounds were optimized using HF/6-31G*, a Hartree-Fock ab initio method with a medium-sized basis set.

The Dragon 5.5 program was used to calculate 3224 descriptors.


Results for the 709 Compounds Study


Conclusions and Future Work

1. It has been shown that the 3-D descriptors may have various levels of inconsistency, depending on the algorithms used for minimization of the 3-D structure.

2. To determine the effects of descriptor inconsistency on training set selection for various families of compounds, comparative studies involving descriptors of various levels of consistency must be carried out.

3. A comparative study must also be carried out to determine whether inconsistent descriptors can be excluded from TQSPRs prepared for particular properties and particular families of target compounds.

4. It is always preferable to use the same molecular structure minimization algorithm for the members of the training set and the target compound.


Selection of the Database and the Target Property Using the Property Prediction GUI


Similarity Group Identification for 1-methyl-3-isopropylbenzene


Derivation of the “Target QSPR” Model

BP = 285.8355 + 46.66 ALOGP

