Date posted: 26-Mar-2015
Training pKa and logP prediction
Jozsef Szegezdi
Solutions for Cheminformatics
logP calculation models in Marvin
Models         Training set size   Number of parameters
VG             1000                120
KLOP           1700                100
PHYS           10000               110
Weighted       >10000              120
User defined   Variable            <=100
Three models are provided in Marvin. They share the same atom type definitions, taken from Viswanadhan, V. N., et al. J. Chem. Inf. Comput. Sci., 1989, 29, 3, 163-172.
Unfortunately, we cannot tell in advance which model will perform better for a molecule that is not included in the training set.
Problems with logP models
Frequently occurring problems when constructing logP models:
- The logP training set is too small
- The logP training set is unrepresentative
- The specification of atom types and interactions is subjective
- The number of logP parameters is restricted in order to ensure ‘predictive power’
As a result, the models will have missing interactions and atom types.
[Figure: structures of the 25 ‘OH’-containing training molecules, labeled with the values -0.77, -0.31, 0.25, 0.88, 1.51, 2.03, 2.62, 3.00, 3.77, 4.57, 1.29, 1.28, 1.48, 1.23, 1.19, 1.79, -3.24, -0.92, 0.15, 1.46, 0.16, 2.85, -1.76, -1.04, 0.88]
Example for creating a local logP model
The figure below shows the logP of the molecules calculated with the standard weighted method.
[Figure: calculated vs. experimental logP by the weighted method; n=25, R2=0.96, s=0.35. x-axis: logP exp.; y-axis: logP calc.]
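The fit statistics quoted for the plots (n, R2, s) can be reproduced from paired experimental and calculated values. The slides do not say which definitions were used, so the sketch below assumes R2 is the coefficient of determination and s is the root of the residual sum of squares divided by n minus an optional number of fitted parameters. The function name `fit_statistics` and the demo values are hypothetical.

```python
import math


def fit_statistics(experimental, calculated, n_params=0):
    """Return (n, R2, s) for paired experimental and calculated values.

    ASSUMPTION: R2 = 1 - SS_res / SS_tot and
    s = sqrt(SS_res / (n - n_params)); the slides do not state
    which definitions were actually used.
    """
    if len(experimental) != len(calculated):
        raise ValueError("both lists must have the same length")
    n = len(experimental)
    mean_exp = sum(experimental) / n
    # Residual sum of squares: squared prediction errors.
    ss_res = sum((e - c) ** 2 for e, c in zip(experimental, calculated))
    # Total sum of squares: spread of the experimental values.
    ss_tot = sum((e - mean_exp) ** 2 for e in experimental)
    r2 = 1.0 - ss_res / ss_tot
    s = math.sqrt(ss_res / (n - n_params))
    return n, r2, s


if __name__ == "__main__":
    # Hypothetical values for illustration only, not data from the slides.
    exp_values = [-0.5, 0.2, 1.1, 2.4, 3.0]
    calc_values = [-0.3, 0.4, 0.9, 2.6, 2.7]
    n, r2, s = fit_statistics(exp_values, calc_values)
    print(f"n={n}, R2={r2:.2f}, s={s:.2f}")
```

If R2 in the slides is instead the squared Pearson correlation between calculated and experimental values, the numbers can differ slightly from this definition.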
The ‘principle of uniformity of nature’ would suggest that other ‘OH’-containing molecules can be predicted reasonably well by the standard ‘weighted’ method. Is this true? We test it with the hydroquinone molecule.
Test of standard models
The logP value of hydroquinone is 0.59. The following table summarizes the logP errors of the standard models.
Models         logP calc. - logP exp.
VG             0.88
KLOP           0.75
PHYS           0.68
Weighted       0.77
User defined   ?
The error of the standard models is relatively large.
How can one improve the accuracy of the prediction?
Prediction error can be reduced by creating a local model using linear regression for the 25 molecules mentioned above. Command line call for creating the local model:
cxcalc -T logP -t LOGP -o logPparameters.txt training25.sdf
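To illustrate what fitting a local model by linear regression means, here is a minimal sketch that is not Marvin's training procedure. It fits a straight line with a single descriptor, whereas the real models use many atom-type parameters. It takes the ten values -0.77 to 4.57 that appear with the training-set structures and assumes, without confirmation from the slides, that they belong to a homologous series of straight-chain alcohols with 1 to 10 carbon atoms. The function name `fit_line` is hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Values copied from the slide with the training-set structures.
logp_values = [-0.77, -0.31, 0.25, 0.88, 1.51, 2.03, 2.62, 3.00, 3.77, 4.57]

# ASSUMPTION: one carbon is added per molecule (1 to 10 carbons).
# The structures are not recoverable from the text, so this
# assignment is illustrative only.
carbon_counts = list(range(1, 11))

slope, intercept = fit_line(carbon_counts, logp_values)

if __name__ == "__main__":
    print(f"contribution per carbon: {slope:.3f}")
    print(f"constant term:           {intercept:.3f}")
    # Prediction for a hypothetical 11-carbon member of the series.
    print(f"predicted value, 11 C:   {slope * 11 + intercept:.3f}")
```

The slope plays the role of a single fitted parameter. Refitting on a different training set would give a different value for the same descriptor, which is the behaviour the slides describe for user models.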
The figure below shows the logP values of the 25 ‘OH’-containing molecules calculated with the ‘user defined’ method after logP training.
[Figure: calculated vs. experimental logP by the user method; n=25, R2=0.99, s=0.10. x-axis: logP exp.; y-axis: logP calc.]
Comparison of the standard and the user model
Model          n    R2     s      logP error of hydroquinone (test molecule)
Weighted       25   0.96   0.36   0.77
User defined   25   0.99   0.10   0.24
The user-trained local model based on 25 molecules outperforms all of the standard models.
User’s model
Conclusions
The local model based on 25 molecules is more accurate than any of the standard global models.
Depending on the training set, different parameter values will be assigned to the same atom type. This is one of the main characteristics of the user model. A ‘carefully’ created set of local models is expected to be superior to any ‘large’ model. We plan to develop a model that combines many local models.
Apparent pKa and ionization%-pH curve
The ionization % vs. pH curve is drawn in blue for basic centers and in red for acidic centers.
[Figure: calculated ionization % vs. pH. x-axis: pH (0 to 12); y-axis: degree of ionization % (0 to 100). Values labeled in the figure: 10.28, 4.30, 5.10, 2.49]
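For a single, isolated ionizable center, the shape of an ionization % vs. pH curve follows from the Henderson-Hasselbalch equation. The sketch below shows only that textbook case. A calculation for molecules with several interacting centers is more involved and is not reproduced here. The function names are hypothetical, and treating 4.30 as an acidic pKa is an assumption made for the demo.

```python
def percent_ionized_acid(ph, pka):
    """Percent of a monoprotic acid in its deprotonated (ionized) form."""
    return 100.0 / (1.0 + 10.0 ** (pka - ph))


def percent_ionized_base(ph, pka):
    """Percent of a monoprotic base in its protonated (ionized) form.

    Here pka is the pKa of the conjugate acid of the base.
    """
    return 100.0 / (1.0 + 10.0 ** (ph - pka))


if __name__ == "__main__":
    # ASSUMPTION: 4.30 is used as an acidic pKa for illustration only;
    # the slides do not say which center the value belongs to.
    pka = 4.30
    for ph in range(0, 13, 2):
        acid = percent_ionized_acid(ph, pka)
        print(f"pH {ph:2d}: {acid:6.2f} % ionized")
```

At pH equal to the pKa, both functions return 50 %. The acidic curve rises with pH and the basic curve falls.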
Method for predicting pKa and training
Marvin’s prediction model considers:
• partial charges
• polarizability
• effect of ionizable centers on each other
Training refines the existing parameters for ionizable centers and at the same time creates new modifier parameters based on structures and experimental values specified by the user.
Example for training pKa prediction
[Figure: structures of the 15 training molecules (nitrogen-containing compounds), numbered 1 to 15]
Molecule   pKa1
1          6.0
2          6.70
3          0.50
4          1.20
5          1.50
6          3.76
7          5.60
8          5.0
9          4.50
10         6.71
11         0.63
12         4.05
13         4.10
14         2.84
15         4.91
pKa2 (six of the molecules): -3.0, 1.0, -7.5, 8.35, 0.71, 9.01
Experimental vs. calculated pKa values
[Figure: calculated vs. experimental pKa before ‘training’; n=20, R2=0.94, s=0.68. x-axis: exp. pKa; y-axis: calc. pKa]
[Figure: calculated vs. experimental pKa after ‘training’; n=21, R2=0.99, s=0.26. x-axis: exp. pKa; y-axis: calc. pKa]
Curating experimental pKa data
The input ‘sdf’ file may be created in IJC.
The training can be run using this command line:
cxcalc -T pka -o c:/output InputpKadata.sdf
Conclusions
• The user-defined pKa model is more accurate than the built-in default model.
• IJC can be used for curating input data for the training.
• The new model is only a refinement of the default model, so the training assumes a robust base model, which is provided in Marvin.