Licensee OA Publishing London 2014. Creative Commons Attribution License (CC-BY)
Mishra M, Sachan S, Gupta S, Nigam RS, Gupta SP. QSTR with topological indices: Modeling of the acute toxicity of phenylsulfonyl carboxylates to vibrio fischeri using multiple regression analysis. OA Drug Design & Delivery 2014 Feb 25;2(1):3.
Competing interests: none declared. Conflict of interests: none declared. All authors contributed to conception and design, manuscript preparation, read and approved the final manuscript. All authors abide by the Association for Medical Ethics (AME) ethical rules of disclosure.
Section: Drug Structure-Activity Relationships
QSTR with topological indices: Modeling of the acute toxicity of phenylsulfonyl
carboxylates to Vibrio fischeri using multiple regression analysis
M Mishra1, S Sachan2, S Gupta3, RS Nigam4, SP Gupta5*
1 Department of Chemistry, Govt. Auto. P. G. College, Satna-485001, India
2 Department of Chemistry, Govt. New Science College, Rewa-486001, India.
3Department of Chemistry, A.P.S. University, Rewa -486001, India
4Rajiv Gandhi College, Sherganj, Panna Road, Satna (MP)-485001
5Rajiv Gandhi Institute of Pharmacy, Sherganj, Panna Road, Satna (MP)-485001
*E-Mail: [email protected]
Abstract
The present paper deals with modeling of the acute toxicity of 56 phenylsulfonyl
carboxylates to Vibrio fischeri. Multiple linear regression (MLR) analysis has been used as the
data-processing step for the selection of independent variables. The statistical quality of the
best model (without deleting outliers) using topological and indicator parameters is as follows:
N=56, R=0.8802, AR2=0.7570, MSE=0.0516, F=43.834 and Q=17.0576. The statistical quality
of the best model (after deleting outliers) is as follows: N=53, R=0.9397, AR2=0.8733,
MSE=0.0254, F=90.577 and Q=36.9953. The topological and indicator parameters suggest
negative contributions of steric bulk, branching, functionality of C10, functionality of the
chloro substituent at the X1 position and presence of unsaturation at the substituent(s) on C10,
and positive contributions of functionality of O13 and presence of substituents with
electronegative atoms at the R2 and R3 positions. The predictive ability of the obtained models
has been checked by cross-validation using the leave-one-out (LOO) method.
Keywords: QSAR, QSTR, Topological indices, MLR, LOO.
Introduction
Quantitative structure-activity relationship (QSAR) analysis has become an
indispensable tool in ecotoxicological risk assessments, which are used in formulating
regulatory decisions of environmental protection agencies.1-3 Due to shortage of experimental
data, QSAR estimates for the selection of persistent, bio-accumulative and toxic (PBT)
substances appear as an attractive alternative.4 It has been argued that all new chemicals
should be assessed using a consistent and transparent methodology that uses chemical
property data derived from QSARs, or experimental determination when possible, and applies
evaluative or region-specific environmental models.5 QSAR methods routinely result in
ecotoxicity estimations of acute and chronic toxicity to various organisms, and in fate
estimations of physical/chemical properties, degradation, and bio-concentration.6 It is now
possible to predict accurately the potential of organic chemicals to cause diverse effects to a
range of organisms and to degrade or partition within the environment.7 QSARs have also been
used in exploring the mechanisms of toxic action of chemicals.8
Many QSAR approaches and statistical methods have been adopted to explore
ecotoxicological modeling of diverse categories of organic compounds. Cui et al. have
reported holographic QSAR for toxicity data of 83 benzene derivatives to the autotrophic
Chlorella vulgaris.9 Comparative molecular field analysis (CoMFA) was used to model the acute
toxicity of 56 phenylsulfonyl carboxylates to Vibrio fischeri.10 The acute toxicity data of 20
alpha-substituted phenylsulfonyl acetates against Daphnia magna were modeled using
theoretical linear solvation energy relationships and charge model descriptors.11 The joint
toxicity of 2,4-dinitrotoluene with aromatic compounds to Vibrio fischeri was subjected to a
QSAR study using the energy of the lowest unoccupied molecular orbital.12 Partial least squares
and multiple regression analyses were used for modeling the toxicity of aromatic compounds to
Chlorella vulgaris.13 Different classification techniques were applied to 235 pesticides using
153 descriptors by Mazzatorta et al. for toxicity prediction.14
Recently the present group of authors has introduced topological and indicator
parameters to explore quantitative structure-toxicity relationships of compounds of different
chemical groups. In continuation of this effort, the present paper deals with modeling of the
acute toxicity of phenylsulfonyl carboxylates to Vibrio fischeri. Aromatic sulfones are
extensively used as intermediates in the manufacture of pesticides, herbicides and
anthelmintics, and also as flotation agents and extractants in the petrochemical and
metallurgical industries. QSTR modeling of these compounds is therefore timely, in order to
predict their ecological effects in case of accidental discharge.
Methodology
Calculation of Molecular Descriptors
Experimentally observed toxicity values (pC) of 56 substituted phenylsulfonyl
carboxylates (aromatic sulfones) to Vibrio fischeri were collected from the literature.15 We
calculated manually the various molecular descriptors, namely Wiener index (W), Path number
(P2, P3 & P3-P2), Equalized electronegativity (χeq), Molecular Redundancy Index (MRI),
Negentropy (N), Szeged index (Sz), and Molecular Id number (Id), on the basis of the fully
optimized molecular geometry. During the course of regression analysis we
observed the need for indicator parameters to obtain better results. Therefore we have used
five indicator parameters:
* IP1=1 if no aromatic ring is present, otherwise 0.
* IP2=1 if (CH2)2 is present at the R2 position, otherwise 0.
* IP3=1 if R1=CH3 and R2=(CH2)2, otherwise 0.
* IP4=1 if NO2 is present, otherwise 0.
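As an illustration of how one of the topological descriptors named above can be obtained, the following minimal sketch computes the Wiener index as the sum of shortest-path distances over all pairs of vertices of a hydrogen-suppressed molecular graph. It is not the authors' manual procedure; the function name `wiener_index` and the two small test skeletons (n-butane and isobutane) are assumptions introduced for illustration only and are not taken from Table 1.

```python
from collections import deque


def wiener_index(adjacency):
    """Sum of shortest-path distances over all unordered vertex pairs."""
    total = 0
    for source in adjacency:
        # Breadth-first search gives topological distances from `source`.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            vertex = queue.popleft()
            for neighbour in adjacency[vertex]:
                if neighbour not in distance:
                    distance[neighbour] = distance[vertex] + 1
                    queue.append(neighbour)
        total += sum(distance.values())
    # Every pair was counted twice (once from each end).
    return total // 2


# Hydrogen-suppressed carbon skeletons (illustrative, not from Table 1).
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}

print(wiener_index(n_butane))   # 10
print(wiener_index(isobutane))  # 9
```

The branched skeleton gives the smaller value, which is why W is commonly read as a measure of molecular size and compactness.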
Structural features and toxicity to Vibrio fischeri (pC)
Sl. R1 R2 R3 X1 X2 pC
1 CH3 -(CH2)2- H H 2.28
2 CH3 -(CH2)3- H H 2.12
3 CH3 -(CH2)4- H H 1.91
4 CH3 -(CH2)5- H H 1.81
5 CH3 -(CH2)2- H NO2 2.12
6 CH(CH3)2 -(CH2)2- H NO2 1.78
7 CH(CH3)2 -(CH2)3- H NO2 1.81
8 CH(CH3)2 -(CH2)5- H NO2 1.45
9 CH(CH3)2 -(CH2)6- H NO2 1.05
10 CH3 -(CH2)2- H Br 1.89
11 CH3 -(CH2)3- H Br 1.76
12 CH3 -(CH2)4- H Br 1.60
13 CH3 -(CH2)5- H Br 1.31
14 CH3 -(CH2)2- H Cl 1.96
15 CH3 -(CH2)3- H Cl 1.92
16 CH(CH3)2 -(CH2)2- H Cl 1.86
17 CH2(CH2)2CH3 -(CH2)2- H Cl 1.70
18 CH(CH3)2 -(CH2)4- H Cl 1.51
19 CH(CH3)2 -(CH2)5- H Cl 1.32
20 CH(CH3)2 -(CH2)6- H Cl 0.90
21 CH(CH3)2 -(CH2)2- H CH3 1.96
22 CH(CH3)2 -(CH2)3- H CH3 1.46
23 CH3 -(CH2)2- H CH3 2.22
24 CH2CH3 -(CH2)2- H CH3 1.92
25 CH2CH3 -(CH2)3- H CH3 1.68
26 CH(CH3)2 -(CH2)4- H CH3 1.22
27 CH(CH3)2 -(CH2)5- H CH3 1.09
28 CH3 -(CH2)5- H CH3 1.40
29 CH3 H H H NO2 1.29
30 CH(CH3)2 H H H NO2 1.29
31 CH3 H H Cl NO2 0.44
32 CH(CH3)2 H H Cl NO2 1.13
33 CH3 H H NO2 H 1.49
34 CH(CH3)2 H H NO2 H 1.34
35 CH3 H H NO2 Cl 1.33
36 CH(CH3)2 H H NO2 Cl 1.45
37 CH3 H CH3 H NO2 1.48
38 CH3 CH3 CH3 H NO2 1.42
39 CH3 CH2CH3 CH2CH3 H NO2 1.36
40 CH3 CH2(CH2)2CH3 CH2(CH2)2CH3 H NO2 1.10
41 CH3 CH2Ph CH2Ph H NO2 0.60
42 CH2CH3 CH2(CH2)2CH3 CH2(CH2)2CH3 H NO2 1.08
43 CH2CH3 CH3 CH2Ph H NO2 0.98
44 CH2CH3 CH3 CH2CH=CH2 H NO2 1.12
45 CH2CH3 CH3 CH2-1-Naph H NO2 0.83
46 CH(CH3)2 CH2(CH2)2CH3 CH2(CH2)2CH3 H NO2 1.05
47 Cyclohexyl H CH3 H NO2 1.19
48 CH3 H CH2CO2CH2CH3 H NO2 1.00
49 CH(CH3)2 H CH2CO2CH(CH3)2 H NO2 0.92
50 CH(CH3)2 CH2CO2CH2CH3 CH2CO2CH2CH3 H NO2 0.66
51 CH3 =CHPh H NO2 0.82
52 CH2CH3 =CHPh H NO2 0.75
53 CH(CH3)2 =CHPh H NO2 0.64
54 CH2CH(CH3)2 =CHPh H NO2 0.66
55 CH(CH3)2 =CHPh H CH3 0.89
56 CH(CH3)2 =CHPh H H 0.80
Results and Discussion
In order to understand the experimental toxicity data of the 56 compounds on a theoretical
basis, we established a quantitative structure-toxicity relationship (QSTR) between their in
vitro toxicity and topological descriptors of the molecules under consideration using multiple
regression analysis (MLR). Developing a QSTR model requires a diverse set of data, and
thereby a large number of descriptors have to be considered.
Descriptors have numerical values that encode different structural features of the
molecules. Selection of a set of appropriate descriptors from a large number of them requires
a method that is able to discriminate between the parameters. The different topological
molecular descriptors (independent variables), namely Wiener index (W), Path number (P2, P3 &
P3-P2), Negentropy (N), Molecular redundancy index (MRI), Molecular equalized
electronegativity (χeq), Szeged index (Sz) and Intimation theoretical index (Id), along with
the indicator parameters, are presented in Table 2.
Table 2: Calculated Topological Descriptors & Indicator Parameters
Comp. No.  W     P2     P3      P3-P2   χeq    N       MRI     Sz       Id        IP1  IP2  IP3  IP4  IP5
1          405   56.00  86.00   78.00   2.402  28.504  -0.012  553.00   -46.639   1    1    1    0    1
2          470   48.00  90.00   84.00   2.456  21.149  -0.148  656.00   -50.422   1    0    0    0    0
3          537   56.00  96.00   80.00   2.367  33.456  -0.010  771.00   -54.263   1    0    0    0    0
4          622   58.00  105.00  94.00   2.355  35.483  -0.026  910.00   -58.161   1    0    0    0    0
5          693   56.00  102.00  92.00   2.482  34.500  -0.102  964.00   -58.735   1    1    1    1    0
6          931   62.00  105.50  87.00   2.434  42.336  -0.113  1242.00  -66.978   1    1    0    1    0
7          1034  66.00  114.00  96.00   2.416  45.318  -0.102  1378.00  -71.029   1    0    0    1    0
8          1256  74.00  129.00  120.00  2.388  50.265  -0.071  1780.00  -79.269   1    0    0    1    0
9          1398  76.00  138.50  131.00  2.376  52.464  -0.055  1986.00  -83.454   1    0    0    1    0
10         489   50.00  90.00   80.00   2.408  29.672  -0.156  678.00   -50.642   1    1    1    0    0
11         562   54.00  96.00   84.00   2.405  33.263  -0.049  806.00   -54.551   1    0    0    0    0
12         637   60.00  102.00  86.00   2.387  35.666  -0.032  922.00   -58.448   1    0    0    0    0
13         714   61.00  111.00  98.00   2.373  37.629  -0.011  1076.00  -62.399   1    0    0    0    0
14         489   50.00  90.00   78.00   2.432  30.660  -0.066  738.00   -50.711   1    1    1    0    0
15         562   53.00  96.00   86.00   2.409  33.294  -0.049  794.00   -54.551   1    0    0    0    0
16         683   55.00  96.00   80.00   2.391  28.290  -0.108  968.00   -58.736   1    1    0    0    0
17         832   56.00  99.00   86.00   2.376  40.367  -0.058  1135.00  -62.399   1    1    0    0    0
18         855   65.00  108.00  77.00   2.363  42.440  -0.038  1176.00  -66.690   1    0    0    0    0
19         963   68.00  127.00  98.00   2.353  46.311  -0.047  1350.00  -70.742   1    0    0    0    0
20         1074  70.00  127.00  115.00  2.343  46.966  -0.013  1524.00  -74.839   1    0    0    0    0
21         683   56.00  94.50   77.00   2.355  41.588  -0.124  920.00   -48.251   1    1    0    0    0
22         768   60.00  100.50  81.00   2.344  103.28  -0.988  1032.00  -62.688   1    0    0    0    0
23         486   50.00  90.00   71.00   2.383  34.968  -0.086  738.00   -50.711   1    1    1    0    1
24         585   52.00  91.50   82.00   2.367  38.692  -0.090  852.00   -54.551   1    1    0    0    0
25         664   56.00  99.00   82.00   2.355  41.551  -0.078  912.00   -58.449   1    0    0    0    0
26         855   66.00  108.00  82.00   2.235  47.214  -0.060  1176.00  -66.690   1    0    0    0    0
27         963   68.00  117.00  98.00   2.327  49.588  -0.047  1350.00  -70.742   1    0    0    0    0
28         730   62.00  111.00  98.00   2.344  42.320  -0.036  1076.00  -62.399   1    0    0    0    0
29         550   48.00  72.00   48.00   2.516  29.328  -0.090  748.00   -50.745   1    0    0    1    0
30         768   54.00  78.00   50.00   2.454  35.680  -0.076  1002.00  -58.764   1    0    0    1    0
31         617   49.00  81.00   60.00   2.551  31.330  -0.145  859.00   -54.869   1    0    0    1    0
32         852   58.00  87.00   58.00   2.481  37.600  -0.116  1111.00  -63.000   1    0    0    1    0
33         502   51.00  75.00   54.00   2.516  29.302  -0.089  652.00   -50.745   1    0    0    1    0
34         708   54.00  79.50   51.00   2.454  35.616  -0.075  882.00   -58.764   1    0    0    1    0
35         1172  52.00  81.00   56.00   2.551  27.222  -0.033  775.00   -54.869   1    0    0    1    0
36         812   58.00  87.00   52.00   2.481  37.000  -0.116  1031.00  -63.000   1    0    0    1    0
37         621   55.00  87.00   70.00   2.482  33.118  -0.097  837.00   -54.869   1    0    0    1    0
38         691   58.00  100.50  85.00   2.454  36.128  -0.086  928.00   -59.169   1    0    0    1    0
39         882   62.00  120.00  116.00  2.414  44.878  -0.115  1152.00  -67.118   1    0    0    1    0
40         1417  70.00  132.00  131.00  2.364  55.250  -0.062  1764.00  -83.587   1    0    0    1    0
41         2446  94.00  162.00  136.00  2.402  58.864  -0.077  3400.00  -109.211  0    0    0    1    0
42         1566  72.00  135.00  128.00  2.355  54.537  -0.017  1941.00  -87.811   1    0    0    1    0
43         1607  78.00  135.00  126.00  2.407  52.560  -0.102  2174.00  -87.679   0    0    0    1    0
44         1029  63.00  117.00  112.00  2.416  48.165  -0.148  1317.00  -71.168   1    0    0    1    0
45         2320  94.00  163.50  139.00  2.400  62.985  -0.138  3407.00  -105.089  0    0    0    1    0
46         1747  76.00  136.50  121.00  2.347  58.072  -0.021  2122.00  -92.362   1    0    0    1    0
47         1276  68.00  108.00  80.00   2.401  46.242  -0.062  1762.00  -74.722   1    0    0    1    0
48         1212  65.00  109.50  89.00   2.473  48.754  -0.179  1518.00  -75.145   1    0    0    1    0
49         1647  74.00  115.50  83.00   2.420  51.606  -0.059  2047.00  -88.268   1    0    0    1    0
50         2476  88.00  154.50  133.00  2.425  65.240  -0.094  2926.00  -110.354  1    0    0    1    0
51         1344  70.00  114.00  88.00   2.460  38.739  -0.029  1817.00  -78.864   0    0    0    1    0
52         1486  73.00  114.00  82.00   2.440  42.840  -0.044  2042.00  -83.048   0    0    0    1    0
53         1662  75.00  118.50  87.00   2.423  45.976  -0.042  2229.00  -87.562   0    0    0    1    0
54         1870  78.00  123.00  58.00   2.399  49.496  -0.046  2464.00  -91.931   0    0    0    1    0
55         1306  70.00  106.50  73.00   2.357  43.032  -0.013  1733.00  -78.864   0    0    0    0    0
56         1304  70.00  108.00  76.00   2.388  41.902  -0.014  1733.00  -78.864   0    0    0    0    0
Table 3: Correlation Matrix
       pC       W        P2       P3       P3-P2    χeq      Sz       Id      IP1     IP2     IP3     IP5
pC     1
W      -0.7465  1
P2     -0.6892  0.9392   1
P3     -0.5342  0.8447   0.9302   1
P3-P2  -0.2591  0.5904   0.6986   0.8913   1
χeq    -0.0758  -0.0975  -0.3134  -0.4235  -0.4368  1
Sz     -0.7284  0.9772   0.9676   0.8828   0.6317   -0.1722  1
Id     0.7555   -0.9747  -0.9679  -0.8984  -0.6608  0.2064   -0.9782  1
IP1    0.5601   -0.6005  -0.5919  -0.4161  -0.1463  -0.0021  -0.6497  0.5767  1
IP2    0.6222   -0.3589  -0.3814  -0.2941  -0.1424  -0.0416  -0.3361  0.4027  0.2041  1
IP3    0.5035   -0.3128  -0.3106  -0.2406  -0.1139  0.0717   -0.2944  0.3373  0.1372  0.6715  1
IP5    0.3751   -0.2178  -0.1803  -0.1814  -0.1129  -0.0521  -0.2068  0.2419  0.0842  0.4127  0.6146  1
In the present paper, a data set of 56 compounds has been subjected to multiple regression
analysis for model generation. Preliminary analysis has been carried out in terms of correlation
analysis (Table 3). The maximum correlation has been obtained between pC and Id (r=0.7555). A
high interrelationship has been observed between W and Sz (r=0.9772), whereas a low
interrelationship has been observed between χeq and IP3 (r=0.0717). The correlation matrix
indicates the predominance of topological parameters in describing the acute toxicity of
phenylsulfonyl carboxylates to Vibrio fischeri.
It is well known that there are three important components in any QSAR study:
1. Development of models,
2. Validation of models and
3. Utility of developed models.
Validation is a crucial aspect of any QSAR analysis. The statistical quality of the resulting
models, as depicted in Table 4, has been determined by R2 (coefficient of determination),
M.S.E. (Mean Square Error), F-ratio and Q=R/MSE (Quality Factor). It is noteworthy that all these
equations have been derived using the entire data set of compounds (N=56). We first performed
single linear regression analysis and then multiple regression analysis, adopting the
maximum-R2 method and following stepwise regression analysis. The results have shown that, for
the set of 56 compounds, statistically significant models start with mono-parametric
regressions.
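The kind of correlation analysis reported in Table 3 can be sketched as follows. This is only an illustration on a small subset: it uses the W, Sz and Id values of compounds 1 to 6 from Table 2, so the coefficients it prints are not expected to match the full 56-compound matrix. The use of NumPy is an assumption; the paper does not state which software was used.

```python
import numpy as np

# W, Sz and Id for compounds 1-6, copied from Table 2.
W = np.array([405, 470, 537, 622, 693, 931], dtype=float)
Sz = np.array([553, 656, 771, 910, 964, 1242], dtype=float)
Id = np.array([-46.639, -50.422, -54.263, -58.161, -58.735, -66.978])

# np.corrcoef treats each row as one variable and each column as one compound.
corr = np.corrcoef(np.vstack([W, Sz, Id]))

labels = ["W", "Sz", "Id"]
for i, row_label in enumerate(labels):
    # Print only the lower triangle, in the layout used by Table 3.
    print(row_label, " ".join(f"{corr[i, j]: .4f}" for j in range(i + 1)))
```

Even on this subset the signs agree with Table 3: W and Sz are positively correlated with each other, and both are negatively correlated with Id.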
Single Linear Regression Analysis
Through single regression analysis many regression equations have been obtained, but only
three of them are satisfactory, with R2 larger than 0.5. These regression equations are listed
in Table 4.
Among these three equations (Eqns. 1, 2 and 3), the highest value of R2 is obtained with the
Intimation theoretical index, Id (Eqn. 1). This means that Id is more strongly correlated with
pC than any other descriptor, as is also shown in Table 3. Eqns. 2 and 3 are less
significant because of their lower values of R2, AR2, F-ratio and Q.
ONE-PARAMETRIC MODEL
pC = 2.8659 + 0.0220(±0.0026) Id    (1)
N=56, R2=0.5708, AR2=0.5629, MSE=0.0929, F=71.829, Q=8.1325
pC = 2.0476 - 0.0007(±0.0001) W    (2)
N=56, R2=0.5574, AR2=0.5492, MSE=0.0958, F=68.002, Q=7.7932
pC = 2.0428 - 0.0005(±0.0001) Sz    (3)
N=56, R2=0.5306, AR2=0.5219, MSE=0.1016, F=61.049, Q=7.1695
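The statistics quoted with each equation are related to one another, and a short check can make this explicit. The sketch below recomputes AR2, the F-ratio and Q for Eqn. 1 from its R2, MSE and N. Q=R/MSE is the definition given in the text; the adjusted-R2 and F-ratio formulas are the standard textbook ones and are assumed here, since the paper does not write them out. The function names are illustrative.

```python
import math


def adjusted_r2(r2, n, k):
    """Adjusted R2 for n compounds and k descriptors (standard formula)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def f_ratio(r2, n, k):
    """F-ratio of the regression (standard formula)."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


def quality_factor(r2, mse):
    """Quality factor Q = R / MSE, as defined in the text."""
    return math.sqrt(r2) / mse


# Eqn. 1: N=56, one descriptor (Id), R2=0.5708, MSE=0.0929.
n, k, r2, mse = 56, 1, 0.5708, 0.0929
print(f"AR2 = {adjusted_r2(r2, n, k):.4f}")
print(f"F   = {f_ratio(r2, n, k):.3f}")
print(f"Q   = {quality_factor(r2, mse):.4f}")
```

Small differences from the tabulated values in the last digits are expected, because the inputs are themselves rounded to four decimals.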
Multilinear Regression Analysis
In order to improve the quality of the QSTR model, multiple linear regression analysis has been
performed. As we know, models with variables correlated with each other are of no
significance. Successive regression analysis resulted in several binary combinations of
Id with the Path numbers P3 and P3-P2 and with the indicator parameters IP3 and IP2. The best
bi-parametric model contains Id and IP2 (Eqn. 7).
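The remark about inter-correlated variables can be turned into a simple screening step. The sketch below flags descriptor pairs whose absolute correlation coefficient reaches a cut-off, using values copied from Table 3. The cut-off of 0.9 and the function name `collinear_pairs` are assumptions made for illustration; the paper does not state the threshold it applied.

```python
# Pairwise correlation coefficients copied from Table 3.
pair_r = {
    ("W", "Sz"): 0.9772,
    ("W", "Id"): -0.9747,
    ("P3", "Id"): -0.8984,
    ("P3-P2", "Id"): -0.6608,
    ("Id", "IP2"): 0.4027,
    ("Id", "IP3"): 0.3373,
}

# Hypothetical cut-off; not stated in the paper.
THRESHOLD = 0.9


def collinear_pairs(pairs, threshold):
    """Return the descriptor pairs whose |r| is at or above the threshold."""
    return sorted(pair for pair, r in pairs.items() if abs(r) >= threshold)


flagged = collinear_pairs(pair_r, THRESHOLD)
print(flagged)
```

Under this assumed cut-off W, Sz and Id are flagged as mutually redundant, while the combinations of Id with IP2 or IP3 are not.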
TWO-PARAMETRIC MODEL
pC = 2.4289 + 0.0166(±0.0039) P3 + 0.0417(±0.0052) Id    (4)
N=56, R2=0.6792, AR2=0.6671, MSE=0.0707, F=56.107, Q=11.6568
pC = 2.6947 + 0.0083(±0.0020) P3-P2 + 0.0303(±0.0031) Id    (5)
N=56, R2=0.6732, AR2=0.6608, MSE=0.0720, F=54.585, Q=11.3957
pC = 2.6370 + 0.0193(±0.0026) Id + 0.4495(±0.1402) IP3    (6)
N=56, R2=0.6406, AR2=0.6270, MSE=0.0793, F=47.230, Q=10.0931
pC = 2.4803 + 0.0176(±0.0024) Id + 0.4527(±0.0994) IP2    (7)
N=56, R2=0.6915, AR2=0.6798, MSE=0.0681, F=59.391, Q=12.2109
In all four of these models, the Intimation theoretical index has a positive coefficient;
therefore, toxicity increases with increasing value of Id. The regression parameters
and the quality of the model expressed by Eqn. 7 indicate that addition of IP2 increases
the explained variance (R2) from 0.5708 (Eqn. 1) to 0.6915.
The best tri-parametric equation contains the following independent variables: P3, Id
and IP2.
THREE-PARAMETRIC MODEL
pC = 2.0435 + 0.0259(±0.0123) P2 + 0.0351(±0.0087) Id + 0.4453(±0.0965) IP2    (8)
N=56, R2=0.7156, AR2=0.6992, MSE=0.0639, F=43.606, Q=13.2384
pC = 2.3912 + 0.0069(±0.0018) P3-P2 + 0.0250(±0.0029) Id + 0.3902(±0.0900) IP2    (9)
N=56, R2=0.7599, AR2=0.7461, MSE=0.0540, F=54.866, Q=16.1431
pC = 6.3321 - 1.5535(±0.5904) χeq + 0.0191(±0.0024) Id + 0.4177(±0.0952) IP2    (10)
N=56, R2=0.7277, AR2=0.7120, MSE=0.0612, F=46.327, Q=13.9388
pC = 2.0285 + 0.0141(±0.0952) Id + 0.2522(±0.1120) IP1 + 0.4608(±0.0959) IP2    (11)
N=56, R2=0.7189, AR2=0.7029, MSE=0.0632, F=44.323, Q=13.4158
pC = 2.1626 + 0.0140(±0.0034) P3 + 0.0348(±0.0047) Id + 0.3915(±0.0885) IP2    (12)
N=56, R2=0.7669, AR2=0.7534, MSE=0.0524, F=57.015, Q=16.7124
All regression coefficients have a positive sign, which indicates that toxicity increases with
increasing values of P3, Id and IP2. The regression parameters and the quality of our best
tri-parametric model, expressed by Eqn. 12, indicate that addition of the path number P3
significantly improves the correlation: R2 increases from 0.6915 to 0.7669, and the quality
factor Q increases from 12.2109 to 16.7124.
When the Intimation theoretical index, path number and indicator parameters were tried together,
four-parametric models were obtained. The best four-parametric model (Eqn. 15) contains P3, Id,
IP1 & IP2. The adjusted R2 and the value of the quality factor are in favour of this
combination, and a slight improvement has been observed in the explained variance.
FOUR-PARAMETRIC MODEL
pC = 1.8628 + 0.0140(±0.0034) P3 - 0.0002(±0.0002) Sz + 0.0266(±0.0111) Id + 0.4144(±0.0111) IP2    (13)
N=56, R2=0.7699, AR2=0.7518, MSE=0.0528, F=42.652, Q=16.7124
pC = 2.1498 + 0.0139(±0.0034) P3 + 0.0345(±0.0047) Id + 0.3558(±0.0047) IP2 + 0.1980(±0.1815) IP5    (14)
N=56, R2=0.7722, AR2=0.7543, MSE=0.0522, F=43.215, Q=16.8343
pC = 1.9418 + 0.0126(±0.0036) P3 + 0.0311(±0.0054) Id + 0.1409(±0.0054) IP1 + 0.4021(±0.0883) IP2    (15)
N=56, R2=0.7747, AR2=0.7570, MSE=0.0516, F=43.834, Q=17.0576
From the residual report of model no. 15, we find high residual values for three compounds
(nos. 7, 20 & 31). When these compounds were deleted as outliers, we obtained our best
tetra-parametric model (Eqn. 16), which contains P3, Id, IP1 & IP2.
For Eqn. 16, the high correlation coefficient and Q-value support the validity of the
developed QSAR model. The model described by Eqn. 16 demonstrates the importance of the
different topological and indicator parameters.
FOUR-PARAMETRIC MODEL AFTER DELETION OF COMPOUND NOS. 7, 20 & 31
pC = 2.0689 + 0.0113(±0.0025) P3 + 0.0307(±0.0038) Id + 0.1500(±0.0747) IP1 + 0.3655(±0.0624) IP2    (16)
N=53, R2=0.8830, AR2=0.8733, MSE=0.0254, F=90.577, Q=36.9953
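As a worked illustration of how Eqn. 16 is applied, the sketch below evaluates it for compound 1 using the descriptor values from Table 2 (P3=86.00, Id=-46.639, IP1=1, IP2=1) and the observed pC from the structure table (2.28). The function and variable names are assumptions for illustration. Because the published coefficients are rounded to four decimals, the result is only expected to be close to the predicted value reported for this compound, not identical to it.

```python
# Coefficients of Eqn. 16 as printed in the text.
EQ16 = {"intercept": 2.0689, "P3": 0.0113, "Id": 0.0307, "IP1": 0.1500, "IP2": 0.3655}


def predict_pc(p3, id_, ip1, ip2, coef=EQ16):
    """Evaluate the four-parametric model of Eqn. 16 for one compound."""
    return (
        coef["intercept"]
        + coef["P3"] * p3
        + coef["Id"] * id_
        + coef["IP1"] * ip1
        + coef["IP2"] * ip2
    )


# Compound 1: descriptors from Table 2, observed pC from the structure table.
observed = 2.28
predicted = predict_pc(p3=86.00, id_=-46.639, ip1=1, ip2=1)
residual = observed - predicted

print(f"predicted pC = {predicted:.3f}, residual = {residual:.3f}")
```

Note that Id is negative for every compound in Table 2, so the positive coefficient of Id lowers the predicted pC as the magnitude of Id grows.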
Table 4: Regression Equations
Model No.  Parameter used  Ai (i=1,2,3,4)  Intercept  M.S.E.  R2      AR2     R       F-ratio  Q=R/MSE
1          Id              A1=0.0220       2.8659     0.0929  0.5708  0.5629  0.7555  71.829   8.1325
2          W               A1=-0.0007      2.0476     0.0958  0.5574  0.5492  0.7466  68.002   7.7932
3          Sz              A1=-0.0005      2.0428     0.1016  0.5306  0.5219  0.7284  61.049   7.1695
4          P3              A1=0.0166       2.4289     0.0707  0.6792  0.6671  0.8241  56.107   11.6568
           Id              A2=0.0417
5          P3-P2           A1=0.0083       2.6947     0.0720  0.6732  0.6608  0.8205  54.585   11.3957
           Id              A2=0.0303
6          Id              A1=0.0193       2.6370     0.0793  0.6406  0.6270  0.8004  47.230   10.0931
           IP3             A2=0.4495
7          Id              A1=0.0176       2.4803     0.0681  0.6915  0.6798  0.8316  59.391   12.2109
           IP2             A2=0.4527
8          P2              A1=0.0259       2.0435     0.0639  0.7156  0.6992  0.8459  43.606   13.2384
           Id              A2=0.0351
           IP2             A3=0.4453
9          P3-P2           A1=0.0069       2.3912     0.0540  0.7599  0.7461  0.8717  54.866   16.1431
           Id              A2=0.0250
           IP2             A3=0.3902
10         χeq             A1=-1.5535      6.3321     0.0612  0.7277  0.7120  0.8531  46.327   13.9388
           Id              A2=0.0191
           IP2             A3=0.4177
11         Id              A1=0.0141       2.0285     0.0632  0.7189  0.7029  0.8479  44.323   13.4158
           IP1             A2=0.2522
           IP2             A3=0.4608
12         P3              A1=0.0140       2.1626     0.0524  0.7669  0.7534  0.8757  57.015   16.7124
           Id              A2=0.0348
           IP2             A3=0.3915
13         P3              A1=0.0140       1.8628     0.0528  0.7699  0.7518  0.8757  42.652   16.7124
           Sz              A2=-0.0002
           Id              A3=0.0266
           IP2             A4=0.4144
14         P3              A1=0.0139       2.1498     0.0522  0.7722  0.7543  0.8787  43.215   16.8343
           Id              A2=0.0345
           IP2             A3=0.3558
           IP5             A4=0.1980
15         P3              A1=0.0126       1.9418     0.0516  0.7747  0.7570  0.8802  43.834   17.0576
           Id              A2=0.0311
           IP1             A3=0.1409
           IP2             A4=0.4021
Four-Parametric Model after Deletion of Compound Nos. 7, 20 & 31
Model No.  Parameter used  Ai (i=1,2,3,4)  Intercept  M.S.E.  R2      AR2     R       F-ratio  Q=R/MSE
16         P3              A1=0.0126       2.2908     0.0263  0.8787  0.8686  0.9374  86.908   35.6423
           Id              A2=0.0342
           IP2             A3=0.3216
           IP5             A4=0.1899
17         P3              A1=0.0113       2.0689     0.0254  0.8830  0.8733  0.9397  90.577   36.9953
           Id              A2=0.0307
           IP1             A3=0.1500
           IP2             A4=0.3655
To test the validity of the predictive power of the selected MLR models, the LOO
technique has been used. The developed models have been validated by calculating the
following statistical parameters: PRESS/SSY, SPRESS, R2CV, AR2 and PSE (Table 5).
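The leave-one-out idea can be sketched in a few lines: each compound is removed in turn, the model is refitted on the remaining compounds, and the squared error of predicting the removed compound is accumulated into PRESS. The example below is a minimal sketch under stated assumptions: it uses only compounds 1 to 8 (Id from Table 2, pC from the structure table) and a one-descriptor model, so its numbers are not comparable with Table 5, and the helper name `loo_press` is illustrative.

```python
import numpy as np

# Id (Table 2) and observed pC for compounds 1-8.
Id = np.array([-46.639, -50.422, -54.263, -58.161, -58.735, -66.978, -71.029, -79.269])
pC = np.array([2.28, 2.12, 1.91, 1.81, 2.12, 1.78, 1.81, 1.45])


def loo_press(X, y):
    """Leave-one-out predictive residual sum of squares for a linear model."""
    n = len(y)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        # Refit by ordinary least squares without compound i.
        coef, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        # Squared error of predicting the left-out compound.
        press += (y[i] - X[i] @ coef) ** 2
    return press


# Design matrix with an intercept column.
X = np.column_stack([np.ones_like(Id), Id])
press = loo_press(X, pC)

# Ordinary residual sum of squares of the fit on all eight compounds.
coef_all, *_ = np.linalg.lstsq(X, pC, rcond=None)
sse = np.sum((pC - X @ coef_all) ** 2)

print(f"PRESS = {press:.4f}, SSE = {sse:.4f}")
```

For ordinary least squares the LOO PRESS is never smaller than the residual sum of squares of the full fit, which is why it is the more conservative measure of predictive ability.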
PRESS is used to validate a regression model with regard to predictability. The smaller
PRESS is, the better the predictability of the model. A PRESS value smaller than SSY points out
that the model predicts better than chance and can be considered statistically significant. SSY
is the sum of squares associated with the corresponding sources of variation, expressed in
terms of the dependent variable.
The PRESS value can be used to compute the R2CV statistic, called the cross-validated R2,
which reflects the prediction ability of the model. This is a good way to validate
the prediction of a regression model without selecting another sample or splitting the data. It
is quite possible to have a high R2 and a very low R2CV. When this occurs, it implies that the
fitted model is data dependent. R2CV can fall below zero; when it lies outside the range of
zero to one, it is truncated to stay within this range.
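The cross-validation parameters are simple functions of PRESS, SSY, the number of compounds N and the number of descriptors k. The paper does not write the formulas out, so the ones in the sketch below are assumptions: R2CV = 1 - PRESS/SSY, SPRESS = sqrt(PRESS/(N-k-1)) and PSE = sqrt(PRESS/N). They are used here because they reproduce the values tabulated for the four-descriptor model on 53 compounds (PRESS=1.2183, SSY=9.1962). The function name `cv_statistics` is illustrative.

```python
import math


def cv_statistics(press, ssy, n, k):
    """Cross-validation parameters from PRESS and SSY (assumed formulas)."""
    ratio = press / ssy
    return {
        "PRESS/SSY": ratio,
        "R2CV": 1.0 - ratio,
        "SPRESS": math.sqrt(press / (n - k - 1)),
        "PSE": math.sqrt(press / n),
    }


# Model with P3, Id, IP1 and IP2 after deletion of three compounds (N=53, k=4).
stats = cv_statistics(press=1.2183, ssy=9.1962, n=53, k=4)
for name, value in stats.items():
    print(f"{name:10s} {value:.4f}")
```

The same function applied to the one-descriptor model (PRESS=5.0168, SSY=6.6731, N=56, k=1) gives a PRESS/SSY ratio well above 0.4, which illustrates how the ratio separates weak from strong models.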
Table 5: Cross-Validation Parameters
Model No.  N   Parameter          PRESS   SSY     PRESS/SSY  R2CV    AR2     SPRESS  PSE
1          56  Id                 5.0168  6.6731  0.7518     0.2482  0.5629  0.3048  0.2993
2          56  Id, IP2            3.6067  8.0832  0.4462     0.5538  0.6798  0.2609  0.2538
3          56  P3, Id, IP2        2.7253  8.9646  0.3040     0.6960  0.7534  0.2289  0.2206
4          56  P3, Id, IP1, IP2   2.6341  9.0559  0.2909     0.7091  0.7570  0.2273  0.2169
5          53  P3, Id, IP2, IP5   1.2636  9.1510  0.1381     0.8619  0.8686  0.1622  0.1544
6          53  P3, Id, IP1, IP2   1.2183  9.1962  0.1325     0.8675  0.8733  0.1593  0.1516
For a good model the value of PRESS/SSY should be smaller than 0.4, and a value smaller
than 0.1 indicates an excellent model. Models 3 to 6 in Table 5 satisfy the first criterion,
and among them model no. 6 is the best. In many cases R2CV and the quality factor Q are
taken as a proof of the high predictive ability of QSAR models. A high value of these
statistical characteristics (>0.7) is considered as a proof of the high predictive ability of the
model. Besides a high R2CV, a reliable model should also be characterized by a high correlation
coefficient between the predicted and observed toxicities of pollutants from a set of molecules
that was not used to develop the models.
Perusal of Table 5 shows that PSE, i.e. the predictive squared error, can be used successfully
for deciding the uncertainty of the prediction. The PSE is found to be the lowest for model no.
6, showing that this model has excellent correlation as well as predictive potential.
From the data presented in Table 6, it is seen that high agreement between experimental and
predicted toxicity values has been obtained (the residual values are small), indicating the
good predictability of the established model. According to the literature, however, without
validation of the QSAR models using an external set, a firm conclusion about the high
predictive ability of the derived models cannot be drawn.
Table 6: Observed (Obs.), Predicted (Pre.) and Residual Values Obtained using Equation 16
Row Actual Predicted Residual
1 2.28 2.144 0.136
2 2.12 1.716 0.404
3 1.91 1.648 0.262
4 1.81 1.629 0.181
5 2.12 1.944 0.176
6 1.78 1.695 0.085
7 1.45 1.212 0.238
8 1.05 1.189 -0.139
9 1.89 2.058 -0.168
10 1.76 1.643 0.117
11 1.60 1.584 0.016
12 1.31 1.562 -0.252
13 1.96 2.059 -0.099
14 1.92 1.644 0.276
15 1.86 1.851 0.009
16 1.70 1.761 -0.061
17 1.51 1.371 0.139
18 1.32 1.479 -0.159
19 1.96 2.193 -0.233
20 1.46 1.409 0.051
21 2.22 2.052 0.168
22 1.92 1.935 -0.015
23 1.68 1.539 0.141
24 1.22 1.353 -0.133
25 1.09 1.343 -0.253
26 1.40 1.558 -0.158
27 1.29 1.476 -0.186
28 1.29 1.266 0.024
29 1.13 1.240 -0.110
30 1.49 1.516 -0.026
31 1.34 1.286 0.054
32 1.33 1.456 -0.126
33 1.45 1.240 0.210
34 1.48 1.525 -0.045
35 1.42 1.548 -0.128
36 1.36 1.522 -0.162
37 1.10 1.097 0.003
38 0.60 0.603 -0.003
39 1.08 0.988 0.092
40 0.98 1.000 -0.020
41 1.12 1.341 -0.221
42 0.83 0.766 0.064
43 1.05 0.847 0.203
44 1.19 1.096 0.094
45 1.00 1.111 -0.111
46 0.92 0.724 0.196
47 0.66 0.467 0.193
48 0.82 1.039 -0.219
very good fit with R2=0.8830. This indicates that model 16 can be successfully applied to
predict the toxicity of this class of molecules.
The applicability domain of the derived QSAR models is limited to differently substituted
compounds of this class. This caution is needed because similar molecules can show
significantly different toxicities. For such molecules, toxicities are often mispredicted, even
when the overall predictivity of the models is high.
Acknowledgement
The authors are grateful to Prof. Vijay K Agrawal, Director, NITTTR, Bhopal for his
unforgettable support.
References
1. Russom CL, Anderson EB, Greenwood BE, Pilli A. ASTER: an integration of the
AQUIRE data base and the QSAR system for use in ecological risk assessments. Sci.
Total Environ, 1991, 109-110, 667-670.
2. Hulzebos EM, Posthumus R, (Q)SARs: gatekeepers against risk on chemicals?. SAR
QSAR Environ Res, 2003, 14: 285-316.
3. Cronin MT, Jaworska JD, Walker JD, Comber MH, Watts CD, Worth AP, Use of
QSARs in international decision-making frameworks to predict health effects of
chemical substances. Environ. Health Perspect, 2003, 111 (10), 1391-1401.
4. Carlsen L, Walker JD, QSARs for prioritizing PBT substances to promote pollution
prevention. QSAR Comb. Sci, 2003, 22: 49-57.
5. Mackay D, Webster E, A perspective on environmental models and QSARs. SAR
QSAR Environ Res, 2003, 14 (1): 7-16.
6. Zeeman M, Auer CM, Clements RG, Nabholz JV, Boethling RS, U.S. EPA regulatory
perspectives on the use of QSAR for new and existing chemical evaluations. SAR
QSAR Environ Res, 1995, 3: 179-201.
7. Comber MHI, Walker JD, Watts C, Hermens J, Quantitative structure-activity
relationships for predicting potential ecological hazard of organic chemicals for use in
regulatory risk assessments. Environ. Toxicol. Chem., 2003, 22 (8): 1822-1828.
8. Ren S, Determining the mechanisms of toxic action of phenols to Tetrahymena
pyriformis. Environ. Toxicol. Chem, 2002, 17: 119-127.
9. Cui S, Wang X, Liu S, Wang L, Predicting toxicity of benzene derivatives by
molecular hologram derived quantitative structure-activity relationships (QSARS).
SAR QSAR Environ Res, 2003, 14 (3): 223-231.
10. Liu X, Yang Z, Wang L, CoMFA of the acute toxicity of phenylsulfonyl carboxylates
to Vibrio fischeri. SAR QSAR Environ Res. 2003, 14 (3): 183-190.
11. Liu X, Wang B, Huang Z, Han S, Wang L, Acute toxicity and quantitative structure-
activity relationships of alpha-branched phenylsulfonyl acetates to Daphnia magna.
Chemosphere, 2003, 50 (3): 403-408.
12. Yuan X, Lu G, Zhao J, QSAR study on the joint toxicity of 2,4-dinitrotoluene with
aromatic compounds to Vibrio fischeri. J Environ. Sci. Health Part A. Tox. Hazard
Subst. Environ. Eng, 2002, 37 (4): 573-578.
13. Netzeva TI, Dearden JC, Edwards R, Worgan AD, Cronin MT, QSAR analysis of the
toxicity of aromatic compounds to Chlorella vulgaris in a novel short-term assay. J.
Chem Inf. Comput. Sci, 2004, 44 (1): 258-265.
14. Mazzatorta P, Benfenati E, Lorenzini P, Vighi M, QSAR in ecotoxicity: an overview
of modern classification techniques, J. Chem Inf. Comput. Sci, 2004, 44 (1): 105-
112.
15. Roy K, Ghosh G, QSTR with Extended Topochemical Atom Indices. Modeling of the
Acute Toxicity of Phenylsulfonyl Carboxylates to Vibrio fischeri Using Principal
Component Factor Analysis and Principal Component Regression Analysis, QSAR
Comb. Sci, 2004, 23: 526-535.