GLOBAL JOURNAL OF PHYSICAL CHEMISTRY
1
Global J. Phys. Chem. 2012, 3: 13 www.simplex-academic-publishers.com
© 2012 Simplex Academic Publishers. All rights reserved.
Quantitative structure–property relationship studies for predicting gas to carbon
tetrachloride solvation enthalpy based on partial least squares,
artificial neural network and support vector machine
Zahra Dashtbozorgi a,*, Hassan Golmohammadi b, William E. Acree, Jr. c
a Young Researchers Club, Science and Research Branch, Islamic Azad University, Tehran, Iran
b Department of Chemistry, Mazandaran University, P. O. Box 453, Babolsar, Iran
c Department of Chemistry, P. O. Box 305070, University of North Texas, Denton, TX 76203-5070, USA
*Author for correspondence: Zahra Dashtbozorgi, email: [email protected]
Received 17 Jan 2012; Accepted 20 Feb 2012; Available Online 20 Feb 2012
Abstract
In the present work, partial least squares (PLS), artificial neural network (ANN) and support vector machine (SVM) techniques were
used for quantitative structure–property relationship (QSPR) studies of gas to carbon tetrachloride solvation enthalpy (ΔHsolv) of various organic
compounds based on molecular descriptors calculated from the optimized structures. Different kinds of molecular descriptors were calculated to
characterize the molecular structures of compounds, such as constitutional, topological, charge, and geometric descriptors. The variable selection
method of genetic algorithm-partial least squares (GA-PLS) was employed to select the most favorable subset of descriptors. The five descriptors
selected using GA-PLS were used as inputs of ANN and SVM to predict the gas to carbon tetrachloride solvation enthalpy. The correlation
coefficients, R, between experimental and predicted solvation enthalpy for the test set by PLS, ANN and SVM are 0.922, 0.985 and 0.990,
respectively. These satisfactory results indicate that the GA-PLS approach is a very effective method for variable selection and that the predictive
ability of the SVM model is superior to those obtained by PLS and ANN. The results demonstrate that SVM can be used as a powerful
alternative modeling tool for QSPR studies.
Keywords: Gas to carbon tetrachloride solvation enthalpy; Quantitative structure–property relationship; Partial least squares; Artificial neural
network; Support vector machine
1. Introduction
The enthalpy of solvation of any species is defined as
the heat gained or lost when the species is transferred from the
gas phase into solution. The enthalpy of solvation is an
important complement to the free energy of solvation
because it provides additional information to help understand
and interpret the physics of the solvation process. It is
important to note that the thermodynamic properties related
to a solvation process do not depend entirely, even for very
dilute solutions, on solute/solvent interactions because the
addition of a solute entails the formation of a cavity of
sufficient size to hold the solute molecule.
The solvation process can conveniently be
decomposed into three steps: (1) formation of a cavity in the
solvent; (2) van der Waals interactions; and (3) electrostatic
contributions [1-3]. The first step is the creation of a
cavity in the solvent that is adequate in size to accommodate
the solute. Because this involves breaking the cohesive forces
within the solvent, the free energy
contribution of cavitation is unfavorable. In contrast, the
van der Waals contribution is favorable, since the solute cavity
is created in regions of the solvent where the dispersion term
is larger than the repulsion term. The third step involves two
contributions: the work necessary to create the gas-phase
charge distribution of the solute in solution, and the work
required to polarize this charge distribution by the solvent.
From a thermodynamic standpoint, the gas-to-condensed phase
partition coefficient, K, can be estimated by [4]:
$$\log K(T) - \log K(298.15\ \mathrm{K}) = -\frac{\Delta H}{2.303\,R}\left(\frac{1}{T} - \frac{1}{298.15}\right) \qquad (1)$$
at other temperatures from measured partition coefficient data
at 298.15 K and the solute's enthalpy of solvation, ΔHSolv, or
enthalpy of transfer, ΔHtrans, between the two condensed
phases. The enthalpy of transfer needed in Eq. 1 (for K = P,
where P is the water to organic solvent partition coefficient) is
defined as
$$\Delta H_{\mathrm{trans}} = \Delta H_{\mathrm{Solv,Org}} - \Delta H_{\mathrm{Solv,W}} \qquad (2)$$
that is, the enthalpy of solvation of the
solute in the specified organic solvent minus its enthalpy of
solvation in water. The above equations assume zero heat
capacity changes. The gas-to-organic solvent enthalpy of solvation is
obtained from the enthalpy of solution, ΔHSoln, by
Liquid solutes: $$\Delta H_{\mathrm{Solv}} = \Delta H_{\mathrm{Soln}} - \Delta H_{\mathrm{Vap,298K}} \qquad (3)$$
Crystalline solutes: $$\Delta H_{\mathrm{Solv}} = \Delta H_{\mathrm{Soln}} - \Delta H_{\mathrm{Sub,298K}} \qquad (4)$$
that is, by subtracting the solute's standard molar enthalpy of
vaporization [5], ΔHVap,298K, or standard molar enthalpy of
sublimation [6], ΔHSub,298K, at 298.15 K.
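As a rough illustration of how Eqs. 1 to 4 fit together, the following Python sketch extrapolates a partition coefficient to another temperature. It assumes that Eq. 1 has the usual integrated van't Hoff form with a temperature-independent enthalpy (zero heat capacity change). The function names and all numerical inputs are hypothetical and are not taken from the data set.

```python
import math

R_GAS = 8.314462618e-3  # gas constant, kJ mol^-1 K^-1
T_REF = 298.15          # reference temperature, K


def solvation_enthalpy(dh_solution, dh_phase_change):
    """Eqs. 3 and 4: subtract dH_vap (liquid) or dH_sub (crystal) from dH_soln."""
    return dh_solution - dh_phase_change


def transfer_enthalpy(dh_solv_organic, dh_solv_water):
    """Eq. 2: enthalpy of transfer from water to the organic solvent."""
    return dh_solv_organic - dh_solv_water


def log_k_at(temperature, log_k_ref, delta_h):
    """Assumed form of Eq. 1: van't Hoff extrapolation with constant delta_h (kJ/mol)."""
    slope = -delta_h / (math.log(10) * R_GAS)
    return log_k_ref + slope * (1.0 / temperature - 1.0 / T_REF)


# Hypothetical numbers, for illustration only.
dh_solv = solvation_enthalpy(dh_solution=1.5, dh_phase_change=31.5)
print(dh_solv, round(log_k_at(310.15, log_k_ref=2.00, delta_h=dh_solv), 3))
```

For an exothermic solvation (negative ΔH), the extrapolated log K decreases as the temperature rises, which is the expected qualitative behavior.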
Physical and thermodynamic property data of organic
compounds, such as solvation enthalpy, are important in the
engineering design and operation of industrial chemical
processes. Since the experimental determination of solvation
enthalpy is time-consuming and expensive, and there is an
increasing need for reliable physical and thermodynamic
data for the optimization of chemical processes, it would be
very useful to develop models that can
predict these properties for organic compounds that have not yet
been synthesized or whose properties are unknown. The
quantitative structure-property relationship (QSPR) approach provides a
promising method for estimating the solvation enthalpy of
organic compounds from descriptors derived exclusively
from the molecular structure and fitted to experimental data [7]. The
QSPR approach has become very practical in the prediction of
physical and chemical properties [8]. The support vector
machine (SVM) was developed within the machine
learning community by Vapnik and co-workers in 1995 [9,10].
It is an algorithm for regression and
classification, and has shown good performance in
classification problems in several successful applications [11–17].
In recent years, SVM has also shown great performance
in QSPR studies due to its ability to model the nonlinear
relationships between molecular structure and properties [18–27].
In the present work, SVM was used for the first time
to predict the gas to carbon tetrachloride solvation
enthalpy of various organic compounds. The aim was to
establish a QSPR model that could be used to predict
the solvation enthalpy from molecular structure alone,
to show the flexible modeling ability of SVM and, at the
same time, to identify the important structural features related to
the solvation enthalpy of organic compounds. PLS and ANN
methods were also used to establish quantitative linear and
nonlinear relationships for comparison with the results obtained by
SVM.
2. Methodology
2.1. Data set
The data set of gas to carbon tetrachloride solvation
enthalpies was taken from the values reported by Mintz et
al. [28]. The molecules in the data set, which include alkanes, alkenes,
alkyl halides, alcohols, phenols, ethers, esters, ketones,
aldehydes, amines, anilines, nitriles, nitro compounds,
polycyclic hydrocarbons, heterocyclic compounds and
benzene derivatives, are summarized in Table 1 (see
Appendix). The solvation enthalpies of all molecules included
in the data set were obtained under the same conditions and refer
to a temperature of 298 K. The solvation enthalpies range
from -3.01 kJ/mol for methane to -100.80 kJ/mol for 18-crown-6
ether. The entire data set was randomly divided into
two subsets: a training set of 119 compounds and a test set of
50 compounds. The training set was used to build the
models and the test set was used to evaluate the
predictive power of the resulting models. Leave-one-out (LOO)
cross-validation was performed to evaluate the modeling
ability of the model. In leave-one-out, each of the samples in
the data set is in turn singled out as a test sample and the
remaining samples are used to train the model.
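The splitting and leave-one-out procedure described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' program: the seed, the function names, and the mean-value "model" used in the demonstration are all assumptions made for the example.

```python
import random


def random_split(n_total, n_train, seed=0):
    """Randomly partition indices 0..n_total-1 into training and test sets."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


def leave_one_out(indices):
    """Yield (training_indices, held_out_index) pairs, one per sample."""
    for held_out in indices:
        yield [i for i in indices if i != held_out], held_out


def loo_predictions(indices, fit, predict):
    """Refit the model once per held-out sample and collect its prediction."""
    predictions = {}
    for train_idx, held_out in leave_one_out(indices):
        model = fit(train_idx)
        predictions[held_out] = predict(model, held_out)
    return predictions


# 169 compounds split into 119 training and 50 test compounds.
train, test = random_split(169, 119)

# Hypothetical demonstration: a "model" that always predicts the training mean.
y = [1.0, 2.0, 3.0]
demo = loo_predictions(
    [0, 1, 2],
    fit=lambda idx: sum(y[i] for i in idx) / len(idx),
    predict=lambda model, i: model,
)
print(len(train), len(test), demo)
```

In practice `fit` and `predict` would wrap the PLS, ANN or SVM model; the held-out predictions can then be compared with the observed values to obtain a cross-validated statistic.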
2.2. Molecular descriptors generation
Because of the diversity of the molecules studied, different
types of descriptors were calculated. The molecular descriptors
were calculated as follows: molecules
were drawn with the Hyperchem package (Version 7) [29] and
then pre-optimized using the MM+ molecular mechanics force
field. A more precise optimization was then carried out with the
semiempirical PM6 method in Mopac [30]. To ensure that the
optimum geometry was obtained, the optimization was repeated
many times with different starting geometries. The
optimization was performed with the Polak–Ribière algorithm to
reach a root mean square gradient of 0.01. As a next step, the
Hyperchem and Mopac output files were used by the
CODESSA program (Version 2.7.2) [31]. This software can
calculate more than 500 different descriptors on the basis of
molecular structural information [32,33]. Since CODESSA is
not able to calculate some more recently introduced 3D descriptors such
as 3D-MoRSE, GETAWAY and WHIM, these were
computed using the DRAGON software [34].
2.3. GA-PLS based variable selection
The strategy implemented for genetic algorithm-
based variable selection in the frame of PLS regression can be
described through the different steps detailed in [35]. GA-PLS
is a refined hybrid approach that combines GA [36] as a
heuristic global optimization method with PLS [37] as a robust
statistical method for variable selection. In GA-PLS, the
chromosome and its fitness in the species correspond to a set
of variables and internal prediction of the derived PLS model,
respectively [38].
In QSPR studies, it is essential to obtain a model
containing as few variables as possible because this leads
to a simple and interpretable model. Therefore, the quality of a
chromosome is determined by both the internal predictivity it
gives and the number of variables it uses. In order to enhance
the quality of chromosomes in the population, an extra rule is
added to GA-PLS following the idea of Leardi et al. [39]: the
best chromosome using a given number of variables is
protected unless a chromosome with a lower number of
variables gives better internal predictivity.
In this paper, GA-PLS followed Leardi's method.
The values of the empirical parameters affecting the performance
of GA-PLS are given in Table 2. Because each GA run
gives a slightly different model, each run was repeated at least
five times to verify the robustness of the predictive ability and
the importance of the selected model.
2.4. Partial Least Squares (PLS)
PLS regression is a modern technique that generalizes
and combines features from principal component analysis and
multiple regression. The PLS method takes into account
information from the dependent variable during the decomposition
of the independent variable data matrix. Suppose that X
represents the independent variables (X is a matrix) and Y
represents the dependent variable (Y is a vector). Then a brief
description of the computations is as follows:
$$Y = XB + E \qquad (5)$$
where B is the matrix of PLS regression coefficients and E is the matrix of
the residuals. In this regression application, the Y
vector contains the solvation enthalpies of
the training compounds. The estimate of B can be obtained
through the generalized inverse of X (X+) provided by the PLS
algorithm:
$$B = X^{+}Y = W(P'W)^{-1}Q \qquad (6)$$
where W is the matrix of weights of the X-space, Q is the
loading matrix for the Y-space, and P is the X-space loading
matrix.
The PLS algorithm used in this investigation was the
singular value decomposition (SVD)-based PLS. This
algorithm was proposed by Lorber et al. [40]. A concise
discussion of the SVD-based PLS algorithm can be found in
the literature [41–43]. The program for PLS modeling based on
SVD was written in MATLAB 7 in our laboratory [44].
2.5. Artificial Neural Network (ANN)
An ANN is a biologically inspired computer program
designed to learn from data in a manner that emulates the
learning pattern of the brain. Most ANN systems are very
complex, high-dimensional processing systems. A
detailed description of the theory behind neural networks has
been given in our previous works [45–49].
In the present work, an ANN program was written
in MATLAB 7. The network was a fully connected feed-forward
network with three layers and a sigmoidal transfer
function. The descriptors selected by the GA-PLS method were
used as inputs of the network and its output signal represents the
solvation enthalpy of the molecules of interest. Thus this network
has five nodes in the input layer and one node in the output layer. The
value of each input was divided by its mean value to bring
it into the dynamic range of the sigmoidal transfer function of
the network. The initial values of the weights were randomly
selected from a uniform distribution ranging from -0.3
to +0.3 and the initial values of the biases were set to one.
Before training, the network parameters were optimized.
These parameters are: the number of nodes in the hidden layer,
the weights and biases learning rates, and the momentum.
Procedures for the optimization of these parameters were
reported elsewhere [50, 51].
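The network set-up described above can be sketched as a forward pass in plain Python. This is only an illustration under stated assumptions: the hidden layer size of six is taken from Table 5, the function names are hypothetical, training by back-propagation is omitted, and because the output node is also sigmoidal the target values are assumed to be rescaled to the interval (0, 1) before training.

```python
import math
import random


def sigmoid(z):
    """Sigmoidal transfer function."""
    return 1.0 / (1.0 + math.exp(-z))


def scale_by_mean(column):
    """Divide each value of one descriptor by the descriptor's mean value."""
    mean = sum(column) / len(column)
    return [value / mean for value in column]


def init_layer(n_in, n_out, rng):
    """Weights drawn uniformly from [-0.3, +0.3]; biases initialised to one."""
    weights = [[rng.uniform(-0.3, 0.3) for _ in range(n_in)] for _ in range(n_out)]
    biases = [1.0] * n_out
    return weights, biases


def layer_forward(inputs, weights, biases):
    """Output of one fully connected layer with sigmoidal nodes."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(descriptors, hidden, output):
    """Forward pass through the 5-6-1 network; returns the single output signal."""
    hidden_signals = layer_forward(descriptors, *hidden)
    return layer_forward(hidden_signals, *output)[0]


rng = random.Random(1)
hidden = init_layer(5, 6, rng)   # five inputs, six hidden nodes
output = init_layer(6, 1, rng)   # one output node
print(forward([1.0, 1.0, 1.0, 1.0, 1.0], hidden, output))
```

Scaling by the mean keeps typical inputs near one, which places the weighted sums in the responsive part of the sigmoid for weights of this size.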
2.6. Support Vector Machine (SVM)
SVM is gaining popularity due to many attractive
features and promising empirical performance. It was developed
from early concepts introduced by Vapnik and Chervonenkis
[52–54]. This technique has proven to be very effective for
addressing general-purpose classification and regression
problems [55–59]. For nonlinear regression, the basic idea in
support vector regression (SVR) is to map the input data X
into a higher dimensional feature space F via a nonlinear
mapping ϕ, so that a linear regression problem is obtained
and solved in the feature space. The regression
approximation therefore addresses the problem of estimating a function
based on a given data set [60]
$$G = \{(X_i, d_i)\}_{i=1}^{l} \qquad (7)$$
where Xi is the input vector, di is the desired value and l
corresponds to the size of the training set.
The generic SVR estimating function takes the form of Eq. 8:
$$f(X) = w \cdot \phi(X) + b \qquad (8)$$
where ϕ(X) indicates the features of the inputs, and w and
b are coefficients. The coefficients are estimated by
minimizing the regularized risk function
$$R(C) = C\,\frac{1}{l}\sum_{i=1}^{l} L_{\epsilon}(d_i, y_i) + \frac{1}{2}\lVert w \rVert^{2} \qquad (9)$$
where
$$L_{\epsilon}(d, y) = \begin{cases} |d - y| - \epsilon, & |d - y| \ge \epsilon \\ 0, & \text{otherwise} \end{cases} \qquad (10)$$
In Eq. 9 the first term, $C\,\frac{1}{l}\sum_{i=1}^{l} L_{\epsilon}(d_i, y_i)$, is the empirical
error (risk), measured by the ϵ-insensitive loss function given by Eq. 10,
where yi = f(Xi) is the model output. This loss function has the
advantage of enabling one to use sparse data points to represent
the decision function of Eq. 8. The second term, $\frac{1}{2}\lVert w \rVert^{2}$, is
the regularization term that controls the model complexity and
is used as a measure of function flatness. C is the
regularization constant, which determines the tradeoff between the
empirical risk and the regularization term. Increasing the value
of C increases the relative importance of the empirical risk
with respect to the regularization term. ϵ is called the tube size and
it corresponds to the approximation accuracy placed on the
training data points. Both C and ϵ are user-prescribed
parameters.
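The two terms of the regularized risk can be made concrete with a short Python sketch. It assumes the standard SVR forms, namely an ϵ-insensitive loss that is zero inside the tube and linear outside it, and a flatness term of one half the squared norm of w. The function names and numbers are hypothetical.

```python
def epsilon_insensitive_loss(desired, predicted, epsilon):
    """Zero inside the epsilon tube, linear in the deviation outside it."""
    deviation = abs(desired - predicted)
    return deviation - epsilon if deviation >= epsilon else 0.0


def regularized_risk(desired, predicted, weights, c, epsilon):
    """C times the mean epsilon-insensitive loss plus one half of ||w||^2."""
    n = len(desired)
    empirical = sum(
        epsilon_insensitive_loss(d, y, epsilon) for d, y in zip(desired, predicted)
    ) / n
    flatness = 0.5 * sum(w * w for w in weights)
    return c * empirical + flatness


# Hypothetical values: one point outside the tube, one inside it.
print(regularized_risk([0.0, 0.0], [1.0, 0.0], weights=[2.0], c=10.0, epsilon=0.5))
```

Raising `c` makes deviations outside the tube more expensive relative to the flatness term, which is the tradeoff described in the text.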
Then, by introducing Lagrange multipliers (αi, αi*)
that satisfy αi·αi* = 0, αi ≥ 0, αi* ≥ 0, i = 1,…,l, the
decision function (Eq. 8) takes the following form:
$$f(X, \alpha_i, \alpha_i^{*}) = \sum_{i=1}^{l} (\alpha_i - \alpha_i^{*})\,K(X, X_i) + b \qquad (11)$$
In Eq. 11, the kernel function K is equivalent to
K(X, Xi) = ϕ(X)·ϕ(Xi). All kernel functions must satisfy
Mercer's condition (the kernel function must be symmetric and
positive definite), which ensures that they correspond to the inner product
of some feature space. One has several possibilities for the
choice of this kernel function, including linear, polynomial,
spline, and radial basis function. The elegance of using the
kernel function lies in the fact that one can deal with feature
spaces of arbitrary dimensionality without having to compute
the map ϕ(X) explicitly. In SVR, a commonly used kernel
function is the Gaussian radial basis function.
Table 2. Parameters of the genetic algorithm.
Population size                                                30 chromosomes
Regression method                                              PLS
Maximum number of variables selected in the same chromosome    30
Maximum number of components                                   The optimal number
Response                                                       Cross-validated % explained variance
Probability of mutation                                        0.01
Probability of cross over                                      0.5
Number of evaluations                                          200
Number of runs                                                 100
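The kernel expansion of the decision function can be illustrated with a few lines of Python. The sketch assumes the usual Gaussian RBF form exp(-γ‖u - v‖²) and a prediction of the form Σ(αi - αi*)K(X, Xi) + b; the support vectors, dual coefficients and bias below are made-up values, not fitted ones.

```python
import math


def rbf_kernel(u, v, gamma):
    """Gaussian radial basis function kernel, exp(-gamma * ||u - v||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def svr_predict(x, support_vectors, dual_coefs, bias, gamma):
    """Kernel expansion; dual_coefs[i] stands for (alpha_i - alpha_i*)."""
    return bias + sum(
        coef * rbf_kernel(x, sv, gamma)
        for coef, sv in zip(dual_coefs, support_vectors)
    )


# Hypothetical two-support-vector model in a one-dimensional descriptor space.
support_vectors = [[0.0], [10.0]]
dual_coefs = [2.0, -1.0]
print(svr_predict([0.0], support_vectors, dual_coefs, bias=0.5, gamma=1.0))
```

Only the kernel values between the query point and the support vectors are needed, so the mapping ϕ never has to be evaluated.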
2.7. Estimation of the predictive ability of a QSPR model
For the optimized QSPR model, several parameters
were chosen to test the prediction potential of the model. A real
QSPR model may have a high predictive ability if it is close to
the ideal one. This implies that the correlation coefficient R
between the experimental (actual) $y$ and predicted $\tilde{y}$
properties must be close to 1, and that the regressions of $y$ against $\tilde{y}$ or
$\tilde{y}$ against $y$ through the origin, i.e. $y^{r0} = k\tilde{y}$ and $\tilde{y}^{r0} = k'y$,
respectively, should be characterized by at least one of k or k'
being close to 1 [61]. Slopes k and k' are calculated as follows:
$$k = \frac{\sum y_i \tilde{y}_i}{\sum \tilde{y}_i^{2}} \qquad (12)$$
$$k' = \frac{\sum y_i \tilde{y}_i}{\sum y_i^{2}} \qquad (13)$$
The criteria formulated above may not be sufficient
for a QSPR model to be truly predictive. Regression lines
through the origin defined by $y^{r0} = k\tilde{y}$ and $\tilde{y}^{r0} = k'y$
(with the intercept set to zero) should be close to the optimum regression
lines $y^{r} = a\tilde{y} + b$ and $\tilde{y}^{r} = a'y + b'$ (b and b' are
intercepts). Correlation coefficients for these lines, $R_0^2$ and
$R'^{2}_{0}$, are calculated as follows:
$$R_0^2 = 1 - \frac{\sum (\tilde{y}_i - y_i^{r0})^2}{\sum (\tilde{y}_i - \bar{\tilde{y}})^2} \qquad (14)$$
$$R'^{2}_{0} = 1 - \frac{\sum (y_i - \tilde{y}_i^{r0})^2}{\sum (y_i - \bar{y})^2} \qquad (15)$$
where $\bar{y}$ and $\bar{\tilde{y}}$ are the average values of the observed and
predicted properties, respectively, and the summations are over
all n compounds in the validation set.
The difference between the $R^2$ and $R_0^2$ values, expressed through $R_m^2$,
needs to be examined to check the prediction potential of a
model [62]. This term is defined in the following manner:
$$R_m^2 = R^2\left(1 - \sqrt{\left|R^2 - R_0^2\right|}\right) \qquad (16)$$
Finally, the following criteria for assessment of the
predictive ability of QSPR models should be considered:
1. High value of cross-validated $R^2$ ($q^2 > 0.5$).
2. Correlation coefficient R between the predicted and actual
properties from an external test set close to 1. $R_0^2$ or $R'^{2}_{0}$
should be close to $R^2$.
3. At least one slope of the regression lines (k or k') through the
origin should be close to 1.
4. $R_m^2$ should be greater than 0.5.
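The external validation statistics listed above can be computed with a short Python function. The sketch assumes the following readings of the equations: the slopes k and k' are ratios of sums of products, the through-origin fits are k·ỹ and k'·y, and the modified coefficient is R²(1 - √|R² - R0²|). The function name and the test values are hypothetical.

```python
import math


def external_validation(y_obs, y_pred):
    """Return k, k', R^2, R0^2, R0'^2 and Rm^2 for observed and predicted values."""
    n = len(y_obs)
    mean_obs = sum(y_obs) / n
    mean_pred = sum(y_pred) / n
    sum_products = sum(o * p for o, p in zip(y_obs, y_pred))

    k = sum_products / sum(p * p for p in y_pred)        # slope of y against y~
    k_prime = sum_products / sum(o * o for o in y_obs)   # slope of y~ against y

    ss_obs = sum((o - mean_obs) ** 2 for o in y_obs)
    ss_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    covariance = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(y_obs, y_pred))
    r2 = covariance * covariance / (ss_obs * ss_pred)

    # Through-origin fits: y^{r0} = k * y~ and y~^{r0} = k' * y.
    r0_sq = 1.0 - sum((p - k * p) ** 2 for p in y_pred) / ss_pred
    r0_sq_prime = 1.0 - sum((o - k_prime * o) ** 2 for o in y_obs) / ss_obs

    rm_sq = r2 * (1.0 - math.sqrt(abs(r2 - r0_sq)))
    return {"k": k, "k_prime": k_prime, "R2": r2,
            "R0_2": r0_sq, "R0_2_prime": r0_sq_prime, "Rm_2": rm_sq}


# Hypothetical observed and predicted enthalpies (kJ/mol).
stats = external_validation([-10.0, -20.0, -30.0, -40.0], [-11.0, -19.0, -31.0, -39.0])
print({name: round(value, 3) for name, value in stats.items()})
```

The returned values can then be compared against the four criteria, for example by checking that `Rm_2` exceeds 0.5 and that at least one slope is close to 1.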
3. Results and Discussion
3.1. Diversity validation
A basic topic in chemical database
analysis is the diversity of sampling [63]. In this study, diversity
analysis was performed on the data set to make certain that the
structures of the test set are representative of those of the whole data set.
We consider a database of n compounds generated from m
highly correlated chemical descriptors $\{x_j\}_{j=1}^{m}$. Each
compound, Xi, is represented by the following vector (Eq. 17):
$$X_i = (x_{i1}, x_{i2}, x_{i3}, \ldots, x_{im}) \quad \text{for } i = 1, 2, \ldots, n \qquad (17)$$
where xij signifies the value of descriptor j of compound Xi.
The combined database is represented by an n×m
matrix X as follows (Eq. 18):
$$X = (X_1, X_2, \ldots, X_n)^{T} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nm} \end{pmatrix} \qquad (18)$$
where the superscript T represents the vector/matrix transpose.
A distance score, dij, for two different compounds Xi and Xj
can be measured by the Euclidean distance norm (Eq. 19):
$$d_{ij} = \lVert X_i - X_j \rVert = \sqrt{\sum_{k=1}^{m} (x_{ik} - x_{jk})^2} \qquad (19)$$
The mean distance of one sample to the remaining
ones was calculated as follows (Eq. 20):
$$\bar{d}_i = \frac{\sum_{j=1}^{n} d_{ij}}{n - 1}, \quad i = 1, 2, \ldots, n \qquad (20)$$
Then the mean distances were normalized to the
interval from zero to one. A MATLAB program was written in our
laboratory to calculate the mean distances according to
Eqs. (19) and (20). The closer
the normalized mean distance is to one, the more the
compound differs from the others. The mean distances of the samples were plotted
against the experimental solvation enthalpy (EXP) (Figure 1),
which shows the diversity of the molecules in the training and
test sets. As can be seen from this figure, the molecules are
diverse in both sets, and the training set, with its wide
representation of the chemistry space, was sufficient to ensure
the model's stability. The diversity of the test set supports the
assessment of the predictive ability of the model.
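The diversity calculation can be sketched in Python as follows. The text only states that the mean distances were normalized to the interval from zero to one, so the min-max scaling used here is an assumption, as are the function names and the toy descriptor rows.

```python
import math


def mean_distances(descriptor_rows):
    """Mean Euclidean distance of each compound to the n - 1 other compounds."""
    n = len(descriptor_rows)
    means = []
    for i in range(n):
        total = sum(
            math.dist(descriptor_rows[i], descriptor_rows[j])
            for j in range(n) if j != i
        )
        means.append(total / (n - 1))
    return means


def normalize_unit_interval(values):
    """Min-max scaling to [0, 1] (assumed normalization scheme)."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Hypothetical descriptor matrix with three compounds and two descriptors.
rows = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
print(normalize_unit_interval(mean_distances(rows)))
```

In this toy case the middle compound is closest on average to the others and receives the smallest normalized distance, while the two outer compounds receive the largest.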
3.2. PLS modeling
Table 1 (see Appendix) shows the data set and the
corresponding observed and PLS-, ANN- and SVM-predicted values
of the solvation enthalpy of all molecules studied in this work.
The parameters of the genetic algorithm used for GA-PLS are
shown in Table 2. Table 3 shows the specifications of the best
PLS model. The optimum number of latent variables to be
included in the model was three. It can be seen from this table
that five descriptors appear in this model. These descriptors
are: mean information index on atomic composition (AAC),
maximal electrotopological negative variation (MAXDN),
solvation connectivity index chi-0 (X0sol), total information
content index (neighborhood symmetry of 0 order) (TIC0) and
3D-Balaban index (J3D). After the formation of the PLS
model, the variance inflation factor (VIF = 1/(1 - R2)) was
calculated to see whether multicollinearities existed among the
descriptors in the model; here R is the multiple correlation
coefficient of one descriptor regressed against the others. If VIF ranges from 1.0 to 5.0, the
related equation is acceptable; when VIF is larger than 10.0,
the regression equation is unstable and the correlations among the
variables must be re-checked. As can be seen in the last
column of Table 3, the VIFs of all descriptors are smaller than
5, indicating that the generated model is statistically
significant and stable. Table 4 shows the
correlation matrix for these descriptors. By interpreting the
descriptors in the model, it is possible to gain some insight
into factors that are likely related to the solvation enthalpy of the
organic compounds.
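The VIF screening rule can be written down in a few lines of Python. The thresholds of 5 and 10 come from the text; the label used for the range between them is an assumption, since the text does not name it, and the function names are hypothetical.

```python
def vif_from_r2(r_squared):
    """VIF = 1 / (1 - R^2), with R^2 from regressing one descriptor on the others."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("R^2 must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared)


def r2_from_vif(vif):
    """Invert the VIF definition to recover the underlying R^2."""
    return 1.0 - 1.0 / vif


def classify_vif(vif):
    """Apply the thresholds from the text; 'intermediate' is an assumed label."""
    if vif <= 5.0:
        return "acceptable"
    if vif > 10.0:
        return "unstable"
    return "intermediate"


# Largest VIF reported in Table 3 (descriptor TIC0).
print(classify_vif(3.114), round(r2_from_vif(3.114), 3))
```

A VIF of 3.114 corresponds to roughly two thirds of the variance of that descriptor being explained by the other four, which is still within the acceptable range.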
For assessment of the relative importance and
contribution of each descriptor in the model, the value of the mean
effect (ME) was calculated for each descriptor by the
following equation:
$$ME_j = \frac{\beta_j \sum_{i=1}^{n} d_{ij}}{\sum_{j=1}^{m} \beta_j \sum_{i=1}^{n} d_{ij}} \qquad (21)$$
where MEj is the mean effect for the considered descriptor j, βj is
the coefficient of descriptor j, dij is the value of descriptor j for
molecule i, n is the number of molecules, and m is the number of
descriptors in the model. The calculated values of the MEs are given in Table
3 and are also plotted in Figure 2. The value and sign of the mean
effect show the relative contribution and direction of
influence of each variable on the solvation enthalpy. The first
descriptor according to its mean effect is the solvation
connectivity index chi-0 (X0sol), which represents the linear
fragment of one carbon atom and is defined in order to model
solvation entropy and to describe dispersion interactions in
solution. When the characteristic dimensions of the molecules are
taken into account through atomic parameters, it is defined as:
$$^{m}\chi^{s} = \frac{1}{2^{m+1}} \sum_{k=1}^{K} \frac{\prod_{a=1}^{n} L_a}{\left(\prod_{a=1}^{n} \delta_a\right)^{1/2}} \qquad (22)$$
where La is the principal quantum number (2 for C, N, O
atoms, 3 for Si, S, Cl, …) of the ath atom in the kth subgraph; δa is
the corresponding vertex degree; K is the total number of mth
order subgraphs and n is the number of vertices in the
subgraph [64]. This molecular descriptor has negative sign for
its mean effect, which reveals that increasing the value of
this descriptor decreases the value of ΔHSolv.
Table 3. The partial least squares regression coefficients.
Descriptor                                                            Notation  Coefficient  Mean effect  VIF
Mean information index on atomic composition                          AAC       -5.506       -6.733       1.478
Maximal electrotopological negative variation                         MAXDN     1.425        1.354        1.206
Solvation connectivity index chi-0                                    X0sol     -4.264       -24.581      2.537
Total information content index (neighborhood symmetry of 0 order)    TIC0      -0.805       -16.132      3.114
3D-Balaban index                                                      J3D       2.661        10.650       1.798
Constant                                                                        -7.087
Table 4. Correlation matrix for descriptors applied in this work.
         AAC    MAXDN   X0sol    TIC0     J3D
AAC      1      0.382   -0.014   -0.012   -0.411
MAXDN           1       -0.164   -0.128   -0.168
X0sol                   1        0.723    0.066
TIC0                             1        0.409
J3D                                       1
Figure 1. Scatter plot of samples for training and test sets.
Figure 2. Plot of the descriptors' mean effects.
The second descriptor is the total information content
index (neighborhood symmetry of 0 order) (TIC0). This
topological descriptor represents a measure of the graph
complexity and is calculated as follows [65]:
$$TIC_r = A \cdot IC_r \qquad (23)$$
where A is the number of atoms and ICr is the neighborhood
information content, defined as follows:
$$IC_r = -\sum_{g=1}^{G} \frac{A_g}{A}\log_2\frac{A_g}{A} = -\sum_{g=1}^{G} P_g \log_2 P_g \qquad (24)$$
where g runs over the G equivalence classes, Ag is the
cardinality of the gth equivalence class, A is the total number
of atoms and Pg is the probability of randomly selecting a
vertex of the gth class. ICr represents a measure of structural
complexity per vertex. The negative value of the mean effect for
this descriptor in the PLS model indicates that this descriptor
contributes negatively to the value of ΔHSolv.
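The information content indices can be illustrated with a short Python sketch. It assumes that the neighborhood information content is the Shannon-type sum -Σ Pg log2 Pg over equivalence classes and that, at order zero, atoms are simply grouped by chemical element. The function names and the methane example are illustrative only and are not taken from Table 1.

```python
import math
from collections import Counter


def information_content(class_labels):
    """Shannon information content per vertex over the equivalence classes."""
    total = len(class_labels)
    counts = Counter(class_labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def total_information_content(class_labels):
    """Number of atoms times the information content per vertex."""
    return len(class_labels) * information_content(class_labels)


# Methane, hydrogens included: one carbon class and one hydrogen class.
methane = ["C", "H", "H", "H", "H"]
print(round(information_content(methane), 4), round(total_information_content(methane), 4))
```

A molecule whose atoms all fall into one class has zero information content, and the value grows as the atoms are spread more evenly over more classes.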
The next descriptor according to mean effect is the 3D-Balaban
index (J3D). The 3D-Balaban indices are calculated
based on the geometry distance matrix [66]. This descriptor
describes the mobility of the backbone chain and is defined as
follows:
$$J3D = \frac{B}{C + 1} \sum_{b} \left(\zeta_i \zeta_j\right)_b^{-1/2} \qquad (25)$$
where ζi and ζj are the vertex distance degrees of two adjacent
atoms i and j that are connected by the bond b, the sum
runs over all the bonds b in the molecule, B is the total number
of bonds in the molecule, and C is the cyclomatic number (the
minimum number of edges that must be removed from the
molecular graph to make it acyclic). This descriptor has a
positive mean effect, which reveals that increasing the value of
this descriptor increases the value of ΔHSolv.
The fourth descriptor is the mean information index
on atomic composition (AAC). This descriptor is the mean
value of the total information content and was calculated as
$$AAC = -\sum_{g} \frac{A_g}{A_h}\log_2\frac{A_g}{A_h} = -\sum_{g} P_g \log_2 P_g \qquad (26)$$
where Ah is the total number of atoms (hydrogen included), Ag
is the number of atoms of the same type in the gth equivalence class,
and Pg is the probability of randomly selecting a gth type atom
[67]. The negative value of the mean effect for AAC (-6.733) in
the PLS model indicates that this descriptor contributes
negatively to the value of ΔHSolv.
The last descriptor described here is the maximal
electrotopological negative variation (MAXDN). This
descriptor represents the maximum negative intrinsic state
difference in the molecule and can be related to the
nucleophilicity of the molecule. It was calculated as follows
[68]:
$$MAXDN = \max_i \left|\Delta I_i\right| \quad \text{if } \Delta I_i < 0 \qquad (27)$$
where ΔIi is the field effect on the ith atom due to the
perturbation of all other atoms as defined by Kier and Hall:
$$\Delta I_i = \sum_{j \ne i} \frac{I_i - I_j}{(d_{ij} + 1)^2} \qquad (28)$$
where the sum runs over all the other atoms in the molecular
graph, I is the atomic intrinsic state and dij is the topological
distance between the two considered atoms. The positive value
of the mean effect for MAXDN (1.354) in the PLS model reveals
that this descriptor contributes positively to the value of ΔHSolv.
From the above discussion, it can be seen that all
descriptors involved in the QSPR model have physical
meaning, and that these descriptors can account for structural
features that affect the solvation enthalpy of the molecules
of interest.
3.3. ANN modeling
The next step was the construction of an ANN.
Before training the ANN, the parameters of the network,
including the number of nodes in the hidden layer, the weights
and biases learning rates and the momentum, were
optimized. Table 5 shows the architecture and specifications of
the optimized network. The predictive power of the ANN
model developed on the selected training set was estimated
from the predictions for the test set chemicals, by calculating $q^2$,
which is defined as follows:
$$q^2 = 1 - \frac{\sum (y_i - \tilde{y}_i)^2}{\sum (y_i - \bar{y})^2} \qquad (29)$$
where $y_i$ and $\tilde{y}_i$ are the measured and predicted
values of the dependent variable (solvation enthalpy), respectively, $\bar{y}$ is the
average value of the dependent variable for the training set and
the summations cover all the compounds. The calculated value
of $q^2$ was 0.970.
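The q2 statistic can be computed as in the following Python sketch, which assumes the form q2 = 1 - Σ(yi - ỹi)² / Σ(yi - ȳ)² with ȳ taken from the training set, as stated in the text. The function name and the numbers are hypothetical.

```python
def q_squared(y_obs, y_pred, y_train_mean):
    """One minus the ratio of the prediction error sum of squares to the
    sum of squared deviations from the training-set mean."""
    press = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    total = sum((o - y_train_mean) ** 2 for o in y_obs)
    return 1.0 - press / total


# Hypothetical test-set values (kJ/mol) and training-set mean.
print(round(q_squared([-20.0, -35.0, -50.0], [-21.0, -34.0, -52.0], y_train_mean=-40.0), 3))
```

A model that predicts no better than the training-set mean gives a value near zero, and perfect predictions give a value of one.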
The statistical values of the test set for the ANN model
were $q^2 = 0.970$, $R^2 = 0.971$ (R = 0.985), $R_0^2 = 0.971$,
$R_m^2 = 0.939$ and k = 0.996. These values and the other
statistical parameters shown in Table 6 reveal the
high predictive ability of the model. Figure 3a shows the plot
of the ANN predicted versus experimental values for the solvation
enthalpy of all of the molecules in the data set. The residuals of
the ANN calculated values of the solvation enthalpy are
plotted against the experimental values in Figure 4a. The
distribution of the residuals on both sides of the zero line indicates
that no systematic error exists in the constructed QSPR model.
Table 5. Architecture and specifications of optimized ANN model.
Number of nodes in the input layer     5
Number of nodes in the hidden layer    6
Number of nodes in the output layer    1
Weights learning rate                  0.2
Biases learning rate                   0.5
Momentum                               0.3
Transfer function                      Sigmoid
3.4. SVM modeling
The powerful modeling method of SVM was then
used to investigate the possible nonlinear relation between the
selected descriptors and the ΔHSolv values. The performance
of SVM for regression depends on the combination of several
parameters: the capacity parameter C, ε of the ε-insensitive loss
function, and the kernel parameter γ. C is a regularization parameter that controls
the tradeoff between maximizing the margin and minimizing
the training error. If C is too small, then insufficient weight will
be placed on fitting the training data. If C is too large, then the
algorithm will overfit the training data. To make the learning
process stable, a large value should be set for C. The
kernel type is another important parameter. For regression
tasks, the Gaussian RBF kernel is commonly used. The form
of the Gaussian RBF function is as follows:
$$K(u, v) = \exp\left(-\gamma \lVert u - v \rVert^{2}\right) \qquad (30)$$
where γ is a constant, parameter of the kernel, u and V are two
independent variables. γ controls the amplitude of the
Gaussian RBF function and therefore, controls the
generalization ability of SVM. The optimal value for ε
depends on the type of noise present in the data, which is
usually unknown. Even if adequate knowledge of the noise is
accessible to select an optimal value for ε, there is the useful
contemplation of the number of resulting support vectors. ε
insensitivity prevents the whole training set meeting border
conditions and so authorizes for the possibility of scattering in
the dual formulation’s solution. So, selecting the suitable
value of ε is mandatory. These parameters should be
optimized to obtain better results. To select the accurate values
for these parameters, different values of them were tried; the
set of values with the best leave-five-out cross-validation
performance will be selected as the optimal ones. The overall
performances of SVM were evaluated in terms of root-mean-
square (RMS), which was defined as below:
Table 6. Statistical parameters obtained using the PLS, ANN and SVM models a.
Model SEtr SEt Rtr Rt Ftr Ft
PLS 5.444 5.425 0.927 0.922
712 273
ANN 2.126 2.410 0.989 0.985 5318 1580
SVM 1.708 2.016 0.993 0.990 8308 2279 a tr, training set; t, test set; SE, standard error; R, the correlation coefficient; and F, the statistical F-value.
.
Figure 3a. Plot of ANN calculated versus experimental gas
to carbon tetrachloride solvation enthalpy.
Figure 3b. Plot of SVM calculated versus experimental
gas to carbon tetrachloride solvation enthalpy.
Figure 4a. Plot of ANN residual versus experimental
values of gas to carbon tetrachloride solvation enthalpy.
Figure 4b. Plot of SVM residual versus experimental
values of gas to carbon tetrachloride solvation enthalpy.
GLOBAL JOURNAL OF PHYSICAL CHEMISTRY
8
Global J. Phys. Chem. 2012, 3: 13 www.simplex-academic-publishers.com
© 2012 Simplex Academic Publishers. All rights reserved.
(31)
where d_i is the desired output in the test set, o_i is the SVM
output, and n is the number of samples in the test set. The
influence of each parameter on the performance of the SVM is
shown in Figures 5–7. Through the above process, γ, ε and
C were fixed at 25, 0.03 and 200, respectively, at which point the
number of support vectors was 119. The statistical results of the
optimal SVM are shown in Table 6. The model gave an RMS
error of 1.708 for the training set and 2.016 for the test set, and
the corresponding correlation coefficients (R) were 0.993 and
0.990, respectively. Figure 3b shows the plot of the SVM
predicted versus experimental values of the solvation enthalpy for
all of the molecules in the data set. The residuals of the SVM
calculated values of the solvation enthalpy are plotted against
the experimental values in Figure 4b. The distribution of the
residuals on both sides of the zero line indicates that no systematic
error exists in the constructed QSPR model.
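The parameter selection described above can be illustrated with a minimal sketch, given here only under stated assumptions. It shows an RMS error of the usual root-mean-square form and a grid search scored by leave-k-out cross-validation, where "leave-five-out" is assumed to mean holding out consecutive blocks of five samples. The names rms_error, leave_k_out_rms, grid_search and make_rbf_smoother are hypothetical and do not come from the original work. The model used in the demonstration is a simple Gaussian-RBF-weighted average (not an SVM), chosen so that the example has no external dependencies; it uses only γ, whereas an SVM regressor would also take ε and C as grid entries.

```python
import math
from itertools import product


def rms_error(desired, outputs):
    """Root-mean-square error between desired values d_i and model outputs o_i."""
    n = len(desired)
    return math.sqrt(sum((d - o) ** 2 for d, o in zip(desired, outputs)) / n)


def leave_k_out_rms(xs, ys, fit_predict, k=5):
    """Hold out consecutive blocks of k samples, predict them, pool the errors."""
    held_true, held_pred = [], []
    for start in range(0, len(xs), k):
        test_idx = range(start, min(start + k, len(xs)))
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        test_x = [xs[i] for i in test_idx]
        held_pred.extend(fit_predict(train_x, train_y, test_x))
        held_true.extend(ys[i] for i in test_idx)
    return rms_error(held_true, held_pred)


def grid_search(xs, ys, make_fit_predict, grid, k=5):
    """Return (best_params, best_rms) over the Cartesian product of the grid."""
    best = None
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = leave_k_out_rms(xs, ys, make_fit_predict(**params), k)
        if best is None or score < best[1]:
            best = (params, score)
    return best


def make_rbf_smoother(gamma):
    """Stand-in model (NOT an SVM): RBF-weighted average of training targets."""
    def fit_predict(train_x, train_y, test_x):
        preds = []
        for x in test_x:
            weights = [math.exp(-gamma * (x - xi) ** 2) for xi in train_x]
            total = sum(weights)
            if total == 0.0:  # every weight underflowed: fall back to the mean
                preds.append(sum(train_y) / len(train_y))
            else:
                preds.append(sum(w * y for w, y in zip(weights, train_y)) / total)
        return preds
    return fit_predict


# Synthetic one-dimensional data, used purely for illustration.
xs = [i * 0.5 for i in range(20)]
ys = [math.sin(x) for x in xs]

best_params, best_rms = grid_search(xs, ys, make_rbf_smoother,
                                    {"gamma": [0.01, 1.0, 25.0]})
print(best_params, round(best_rms, 4))
```

With a library SVM regressor in place of the stand-in, the same loop would scan γ, ε and C together; the factory passed to grid_search would then accept all three as keyword arguments.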
3.5. Comparison of the results obtained by different QSPR
approaches
The results of the different QSPR models are collected in
Table 6. The correlation coefficients (R) between experimental
and predicted solvation enthalpy for PLS, ANN and SVM are
0.927, 0.989 and 0.993, respectively, for the training set and 0.922,
0.985 and 0.990, respectively, for the test set. As can be seen
from Table 6, the results of the SVM model are better than those
obtained by the PLS method.
Furthermore, the results of SVM are comparable to
those of ANN. SVM exhibits the better overall performance
because it embodies the structural risk minimization
principle and has the advantage over other techniques of
converging to the global optimum rather than to a local optimum.
It is important to note that, as a general machine learning
method, SVM is based on the structural risk minimization
principle, which minimizes an upper bound of the
generalization error rather than the training error.
SVM therefore has better generalization performance than PLS and
ANN, and is thus especially suitable for QSPR modeling on
small data sets. Moreover, once the
corresponding parameters are specified, the solution of SVM
is definite and reproducible, which is a clear advantage over ANN.
It is also worth noting that the standard error
values for the SVM model were not only low but also
similar for the training and external test sets (1.708 and 2.016,
respectively), which
suggests that the proposed model has both predictive ability
(low values) and sufficient generalization performance
(similar values).
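As a worked illustration of how such a comparison can be computed, the following sketch evaluates the RMS error and the correlation coefficient R for the first five test-set alkanes listed in Table 1 (Ethane to Octane). The numerical values are copied from Table 1; the function names are hypothetical, and the error is assumed to have the plain root-mean-square form. Because only five of the test compounds are used, the numbers are not expected to reproduce the entries of Table 6; the example only shows the calculation.

```python
import math

# (name, experimental, PLS, ANN, SVM) in kJ/mol, copied from Table 1 (test set).
rows = [
    ("Ethane",   -9.20, -15.11, -10.69,  -9.34),
    ("Propane", -14.40, -22.21, -15.69, -15.10),
    ("Hexane",  -29.75, -34.89, -29.27, -30.19),
    ("Heptane", -34.48, -39.57, -34.82, -34.54),
    ("Octane",  -39.13, -30.27, -40.16, -39.21),
]


def rms_error(desired, outputs):
    """Root-mean-square error between experimental and predicted values."""
    n = len(desired)
    return math.sqrt(sum((d - o) ** 2 for d, o in zip(desired, outputs)) / n)


def pearson_r(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


experimental = [row[1] for row in rows]
stats = {}
for column, model in ((2, "PLS"), (3, "ANN"), (4, "SVM")):
    predicted = [row[column] for row in rows]
    stats[model] = (rms_error(experimental, predicted),
                    pearson_r(experimental, predicted))
    print(f"{model}: RMS = {stats[model][0]:.3f} kJ/mol, R = {stats[model][1]:.4f}")
```

For this small subset the RMS error decreases in the order PLS, ANN, SVM, which is the same ordering as the standard errors reported in Table 6.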
4. Conclusions
In this paper, QSPR models based on PLS, ANN and
SVM have been developed for the first time for predicting the
ΔHSolv of a diverse set of organic compounds from
molecular structure. The results show that the nonlinear
SVM model, based on the same set of descriptors,
has better predictive ability
than the PLS and ANN models. Model validation
indicates that the presented model
is suitable and can be used to predict the
ΔHSolv of organic compounds with accuracy similar to
that of experimental ΔHSolv determination. The
proposed model can therefore be
expected to predict ΔHSolv for new organic compounds or for
other organic compounds for which experimental values are
unknown.
Figure 5. The gamma versus rms error on LOO cross-validation
(C = 100, ε = 0.1) [69].
Figure 6. The epsilon versus rms error on LOO cross-validation
(C = 100, γ = 25) [69].
Figure 7. The cost versus rms error on LOO cross-validation
(γ = 25, ε = 0.03) [69].
References
1. H.M. Neumann, J. Solution Chem. 6 (1977) 33.
2. N. Morel-Desrosiers, J. P. Morel, Can. J. Chem. 59 (1981) 1.
3. M. Irisa, K. Nagayama, F. Hirata, Chem. Phys. Lett. 207 (1993)
430.
4. C. Mintz, K. Burton, W. E. Acree Jr., Fluid Phase Equilibr. 258
(2007) 191.
5. J.S. Chickos, W.E. Acree Jr., J. Phys. Chem. Ref. Data 32
(2003) 519.
6. J.S. Chickos, W.E. Acree Jr., J. Phys. Chem. Ref. Data 31
(2002) 537.
7. X.J. Yao, Y.W. Wang, X.Y. Zhang, R.S. Zhang, M.C. Liu, Z.D.
Hu, B.T. Fan, Chemom. Intell. Lab. Syst. 62 (2002) 217.
8. X.J. Yao, M.C. Liu, X.Y. Zhang, Z.D. Hu, B.T. Fan, Anal.
Chim. Acta 462 (2002) 101.
9. V.N. Vapnik, The Nature of Statistical Learning Theory,
Springer, New York (1995).
10. V.N. Vapnik, Statistical Learning Theory, Wiley, New York
(1998).
11. J. Wang, H.Y. Du, X.J. Yao, Z.D. Hu, Anal. Chim. Acta 601
(2007) 156.
12. X.J. Yao, A. Panaye, J.P. Doucet, H.F. Chen, R.S. Zhang, B.T.
Fan, M.C. Liu, Z.D. Hu, Anal. Chim. Acta 535 (2005) 259.
13. U. Thissen, B. Ustun, W.J. Melssen, L.M.C. Buydens, Anal.
Chem. 76 (2004) 3099.
14. E. Byvatov, U. Fechner, J. Sadowski, G. Schneider, J. Chem.
Inf. Comput. Sci. 43 (2003) 1882.
15. S.R. Amendolia, G. Cossu, M.L. Ganadu, B. Golosio, G.L.
Masala, G.M. Mura, Chemometr. Intell. Lab. Syst. 69 (2003) 13.
16. Y. Lee, C.K. Lee, Bioinformatics 19 (2003) 1132.
17. A.I. Belousov, S.A. Verzakov, J.V. Frese, Chemometr. Intell.
Lab. Syst. 64 (2002) 15.
18. H.Z. Si, S.P. Yuan, K.J. Zhang, A.P. Fu, Y.B. Duan, Z.D. Hu,
Chemometr. Intell. Lab. Syst. 90 (2008) 15.
19. J. Wang, H.Y. Du, H.X. Liu, X.J. Yao, Z.D. Hu, B.T. Fan,
Talanta 73 (2007) 147.
20. C.X. Xue, R.S. Zhang, H.X. Liu, X.J. Yao, M.C. Liu, Z.D. Hu,
B.T. Fan, J. Chem. Inf. Comput. Sci. 44 (2004) 1693.
21. W.P. Ma, X.Y. Zhang, F. Luan, H.X. Zhang, R.S. Zhang, M.C.
Liu, Z.D. Hu, B.T. Fan, J. Phys. Chem. A 109 (2005) 3485.
22. H.X. Liu, X.J. Yao, R.S. Zhang, M.C. Liu, Z.D. Hu, B.T. Fan, J.
Phys. Chem. B 109 (2005) 20565.
23. C.Y. Zhao, H.X. Zhang, X.Y. Zhang, M.C. Liu, Z.D. Hu, B.T.
Fan, Toxicology 217 (2006) 105.
24. J.Z. Li, H.X. Liu, X.J. Yao, M.C. Liu, Z.D. Hu, B.T. Fan,
Chemometr. Intell. Lab. Syst. 87 (2007) 139.
25. H.X. Liu, C.X. Xue, R.S. Zhang, X.J. Yao, M.C. Liu, Z.D. Hu,
B.T. Fan, J. Chem. Inf. Comput. Sci. 44 (2004) 1979.
26. M.K. Leong, Chem. Res. Toxicol. 20 (2007) 217.
27. L. Peter, M. Tatiana, J. Chem. Inf. Comput. Sci. 43 (2003) 1855.
28. C. Mintz, M. Clark, K. Burton, W. E. Acree, Jr., M. H.
Abraham, J. Solution Chem. 36 (2007) 947.
29. HyperChem, re. 4. for Windows, Autodesk, Sausalito, CA
(1995).
30. Mopac for Windows, Stewart Computational Chemistry (2009).
31. A. R. Katritzky, M. Karelson, R. Petrukhin, Comprehensive
Descriptors for Structural and Statistical Analysis (CODESSA)
Version 2.7.2, University of Florida (1994).
32. A.R. Katritzky, V.S. Lobanov, M. Karelson, CODESSA
Training Manual, University of Florida, Gainesville (1995).
33. A.R. Katritzky, V.S. Lobanov, M. Karelson, CODESSA Version
1 Reference Manual, University of Florida, Gainesville, Florida
(1994).
34. I.V. Tetko, J. Gasteiger, R. Todeschini, A. Mauri, D.
Livingstone, P. Ertl, V.A. Palyulin, E.V. Radchenko, N.S.
Zefirov, A.S. Makarenko, V. Tanchuk, V.V. Prokopenko, J.
Comput. Aid. Mol. Des. 19 (2005) 453.
35. R. Leardi, Chemom. Intell. Lab. Syst. 41 (1998) 195.
36. D.E. Goldberg, Genetic Algorithms in Search, Optimization and
Machine Learning, Addison–Wesley, New York (1989).
37. A. Hoskuldsson, Prediction Methods in Science and
Technology, Basic Theory, Thur Publishing, Denmark Vol 1
(1996).
38. K. Hasegawa, T. Kimura, K. Funatsu, Quant. Struct.-Act. Relat.
18 (1999) 262.
39. R. Leardi, R. Boggia, M. Terrile, J. Chemom. 6 (1992) 267.
40. A. Lorber, L. Wangen, B.R. Kowalski, J. Chemometrics 1
(1987) 19.
41. T. Khayamian, A.A. Ensafi, B. Hemmateenejad, Talanta 49
(1999) 587.
42. M. Shamsipur, B. Hemmateenejad, M. Akhond, H. Sharghi,
Talanta 54 (2001) 1113.
43. A. Hoskuldsson, Chemom. Intell. Lab. Syst. 55 (2001) 23.
44. MATLAB 7.0. The Mathworks Inc., Natick.
http://www.mathworks.com.
45. H. Golmohammadi, J. Comput. Chem. 30 (2009) 2455.
46. Z. Dashtbozorgi, H. Golmohammadi, Eur. J. Med. Chem. 45
(2010) 2182.
47. Z. Dashtbozorgi, H. Golmohammadi, J. Sep. Sci. 33 (2010)
3800.
48. H. Golmohammadi, Z. Dashtbozorgi, Struct. Chem. 21 (2010)
1241.
49. H. Golmohammadi, M. Safdari, Microchem. J. 95 (2010) 140.
50. T.B. Blank, S.T. Brown, Anal. Chem. 65 (1993) 3081.
51. M. Jalali-Heravi, M. H. Fatemi, J. Chromatogr. A 915 (2001)
177.
52. V. Vapnik, Estimation of Dependencies Based on Empirical
Data, Nauka, Moscow (1979).
53. C. Cortes, V. Vapnik, Mach. Learn. 20 (1995) 273.
54. V. Vapnik, S. Golowich, A. Smola, Adv. Neural Inform.
Process. Syst. 9 (1997) 281.
55. E. Byvatov, U. Fechner, J. Sadowski, G. Schneider, J. Chem.
Inf. Comput. Sci. 43 (2003) 1882.
56. C.Y. Zhao, R.S. Zhang, H.X. Liu, C.X. Xue, S.G. Zhao, X.F.
Zhou, M.C. Liu, B.T. Fan, J. Chem. Inf. Comput. Sci. 44 (2004)
2040.
57. F. Luan, C.X. Xue, R.S. Zhang, C.Y. Zhao, M.C. Liu, Z.D. Hu,
B.T. Fan, Anal. Chim. Acta 537 (2005) 101.
58. F. Luan, W.P. Ma, X.Y. Zhang, H.X. Zhang, M.C. Liu, Z.D. Hu,
B.T. Fan, Chemosphere 63 (2006) 1142.
59. V.Z. Vladimir, V.B. Konstantin, A.I. Andrey, P.S. Nikolay, V.P.
Igor, J. Chem. Inf. Comput. Sci. 43 (2003) 2048.
60. H.X. Liu, R.S. Zhang, X.J. Yao, M.C. Liu, Z.D. Hu, B.T. Fan, J.
Chem. Inf. Comput. Sci. 43 (2004) 161.
61. A. Golbraikh, A. Tropsha, J. Mol. Graphics Model. 20 (2002)
269.
62. P.P. Roy, K. Roy, QSAR Comb. Sci. 27 (2008) 302.
63. A.G. Maldonado, J.P. Doucet, M. Petitjean, Mol. Divers. 10
(2006) 39.
64. V.K. Gombar, A. Kumar, M.S. Murthy, Indian J. Chem. 268
(1987) 1168.
65. R. Sarkar, A.B. Roy, P.K. Sarkar, Math. Biosci. 39 (1978) 299.
66. A.T. Balaban, S.C. Basak, T. Colburn, G.D. Grunwald, J. Chem.
Inf. Comput. Sci. 34 (1994) 1118.
67. S.M. Dancoff, H. Quastler, Essays on the Use of Information
Theory in Biology. University of Illinois, Urbana, IL (1953).
68. L.B. Kier, L.H. Hall, Molecular Structure Description. The
Electrotopological State. Academic Press, London, UK (1999).
69. V. Cherkassky, F. Mulier, Learning from data: Concepts, theory,
and methods; Wiley, New York (1998).
Appendix
Table 1. Comparison of experimental and predicted values of gas to carbon tetrachloride solvation enthalpy (ΔHSolv in units of kJ/mol)
for training and test sets.
ΔHSolv(SVM) ΔHSolv(ANN) ΔHSolv(PLS) ΔHSolv(EXP) Name Number
Training set
-2.91 -3.39 -3.27 -3.01 Methane 1
-19.49 -20.66 -22.43 -18.49 Butane 2
-25.07 -23.87 -28.36 -25.19 Pentane 3
-49.37 -49.55 -52.12 -48.37 Decane 4
-58.70 -59.51 -61.09 -57.70 Dodecane 5
-75.30 -72.08 -82.40 -76.30 Hexadecane 6
-26.90 -27.53 -29.94 -25.90 2,2-Dimethylbutane 7
-31.76 -31.47 -32.29 -33.89 Ethyl pentane 8
-37.23 -39.74 -43.71 -36.23 2,2,4,4-Tetramethylpentane 9
-29.47 -27.59 -29.58 -28.47 Cyclopentane 10
-34.26 -32.78 -35.85 -32.34 Cyclohexane 11
-38.33 -36.88 -38.32 -37.91 Cycloheptane 12
-41.88 -40.94 -41.91 -42.95 Cyclooctane 13
-51.00 -50.43 -50.66 -52.00 Cyclododecane 14
-36.90 -35.47 -39.52 -34.65 Methylcyclohexane 15
-50.71 -51.65 -55.83 -49.22 trans Decalin 16
-50.12 -49.89 -53.41 -48.40 Adamantane 17
-9.80 -9.75 -17.59 -9.58 Ethene 18
-34.27 -32.91 -32.50 -35.21 1-Heptene 19
-38.58 -37.78 -34.04 -39.58 1-Octene 20
-36.58 -38.13 -40.04 -35.52 Norbornadiene 21
-27.85 -29.12 -24.08 -28.14 Acetone 22
-32.13 -32.87 -28.80 -33.13 2-Butanone 23
-34.09 -35.74 -31.76 -37.07 2-Pentanone 24
-56.19 -57.78 -48.41 -55.19 2-Nonanone 25
-39.79 -39.79 -32.18 -41.38 Cyclopentanone 26
-42.27 -43.78 -31.47 -44.68 Cyclohexanone 27
-42.23 -43.24 -37.71 -44.89 2,2,4,4-Tetramethyl-3-pentanone 28
-32.92 -30.37 -36.27 -28.96 Diethyl ether 29
-39.74 -39.31 -40.30 -36.69 Dipropyl ether 30
-49.79 -49.48 -53.79 -46.46 Dibutyl ether 31
-30.07 -27.14 -35.67 -28.79 Methyl tert-butyl ether 32
-38.24 -38.03 -31.83 -37.50 1,2-Dimethoxyethane 33
-38.50 -36.93 -37.55 -37.20 Tetrahydropyran 34
-35.50 -33.97 -39.35 -34.50 Butyl methyl ether 35
-32.78 -32.33 -22.58 -30.40 Dimethoxymethane 36
-75.90 -79.92 -85.13 -76.90 15-Crown-5 37
-99.80 -97.43 -121.09 -100.80 18-Crown-6 38
-29.29 -30.34 -33.34 -30.13 Chloroform 39
-33.43 -30.46 -22.62 -32.43 Carbon tetrachloride 40
-32.89 -32.88 -32.51 -33.18 1-Chlorobutane 41
-29.67 -28.11 -38.73 -27.61 cis-1,2-Dichloroethylene 42
-30.67 -29.88 -34.73 -28.03 trans-1,2-Dichloroethylene 43
-26.12 -25.92 -28.51 -27.12 Iodomethane 44
-34.57 -33.30 -37.67 -35.57 2-Iodo-2-methylpropane 45
-40.88 -38.44 -32.93 -41.88 Diiodomethane 46
-11.13 -10.41 -17.54 -10.63 Chlorotrifluoromethane 47
-27.79 -29.28 -22.54 -28.79 Propanal 48
-31.86 -32.57 -30.10 -32.90 Butanal 49
-36.84 -37.52 -39.15 -35.84 2-Nitropropane 50
-24.36 -23.96 -21.16 -25.36 Acetonitrile 51
-19.40 -18.97 -13.90 -19.80 Methanol 52
-24.02 -24.07 -21.73 -24.28 Ethanol 53
-28.45 -28.93 -21.37 -27.90 1-Propanol 54
-39.46 -39.10 -38.33 -42.20 1-Hexanol 55
-50.91 -49.23 -47.87 -52.20 1-Octanol 56
-29.71 -30.04 -32.61 -30.71 2-Methyl-2-butanol 57
-36.17 -37.40 -42.12 -35.14 Propyl formate 58
-38.97 -40.73 -40.32 -39.97 Butyl formate 59
-33.39 -30.87 -33.04 -30.94 Methyl acetate 60
-36.40 -34.63 -37.14 -34.97 Ethyl acetate 61
-43.23 -42.27 -34.26 -44.23 Butyl acetate 62
-35.22 -34.55 -36.24 -35.77 Methyl propionate 63
-39.04 -39.03 -46.10 -39.78 Ethyl propionate 64
-45.15 -42.24 -45.70 -44.45 Propyl propionate 65
-32.31 -30.61 -33.57 -33.31 Benzene 66
-38.61 -39.74 -38.55 -38.12 Toluene 67
-52.76 -51.09 -49.65 -54.14 Naphthalene 68
-76.18 -74.41 -65.98 -77.18 Anthracene 69
-50.82 -47.24 -50.11 -50.08 Acetophenone 70
-48.28 -47.84 -46.90 -45.27 Anisole 71
-45.09 -43.61 -43.00 -46.86 Benzaldehyde 72
-69.20 -65.37 -74.31 -68.20 1,3,4,5-Tetrabromobenzene 73
-42.11 -43.12 -43.29 -40.33 Chlorobenzene 74
-49.36 -48.20 -51.03 -46.15 1,2-Dichlorobenzene 75
-50.00 -48.43 -51.19 -46.70 1,4-Dichlorobenzene 76
-57.91 -55.28 -66.74 -58.09 1,2,4,5-Tetrachlorobenzene 77
-70.48 -71.36 -62.28 -71.48 Hexachlorobenzene 78
-35.43 -33.00 -39.42 -34.43 Fluorobenzene 79
-48.31 -49.78 -42.33 -48.01 Iodobenzene 80
-35.35 -34.72 -38.28 -34.35 Trifluoromethylbenzene 81
-36.87 -41.15 -34.88 -44.52 N,N-Dimethylformamide 82
-42.11 -42.97 -32.19 -44.98 Dimethyl sulfoxide 83
-43.43 -43.97 -42.32 -46.47 Aniline 84
-48.49 -47.09 -44.58 -49.59 N-Methylaniline 85
-52.99 -49.27 -58.88 -51.99 N,N-Dimethylaniline 86
-52.22 -51.61 -47.57 -54.37 N-Ethylaniline 87
-37.31 -37.63 -37.74 -38.31 Pyridine 88
-44.01 -44.92 -42.81 -43.01 2-Methylpyridine 89
-49.39 -50.40 -48.21 -48.88 2,4-Dimethylpyridine 90
-49.51 -49.43 -48.32 -46.68 2,6-Dimethylpyridine 91
-50.10 -50.82 -49.92 -50.38 2-Bromopyridine 92
-50.58 -53.40 -47.12 -51.14 3-Bromopyridine 93
-47.17 -48.37 -47.25 -45.20 2-Chloropyridine 94
-47.60 -51.74 -47.45 -48.60 3-Chloropyridine 95
-47.65 -45.06 -46.17 -49.40 3-Cyanopyridine 96
-47.82 -45.88 -46.21 -49.10 4-Cyanopyridine 97
-33.59 -32.48 -29.93 -35.53 Butylamine 98
-32.22 -32.72 -30.96 -33.22 Diethylamine 99
-38.53 -38.82 -38.41 -38.04 Triethylamine 100
-67.72 -65.40 -61.93 -68.72 Tributylamine 101
-53.27 -53.26 -52.65 -54.72 Dibutyl sulfide 102
-45.24 -45.63 -38.95 -47.56 N,N-Dimethylacetamide 103
-41.80 -42.20 -41.03 -43.40 Phenol 104
-48.11 -47.05 -54.57 -46.10 2-Chlorophenol 105
-57.78 -56.53 -53.95 -59.70 4-Bromophenol 106
-41.00 -40.64 -35.68 -42.00 Pentafluorophenol 107
-47.37 -48.65 -41.59 -46.06 3-Methylphenol 108
-54.04 -54.51 -52.78 -54.40 2-Methoxyphenol 109
-57.39 -55.24 -53.06 -59.50 3-Methoxyphenol 110
-55.45 -55.93 -50.11 -57.60 4-Methoxyphenol 111
-61.70 -63.90 -57.84 -60.70 1-Naphthol 112
-43.59 -46.40 -46.97 -42.81 Diethyl carbonate 113
-51.16 -51.02 -49.29 -52.79 Phenyl methyl sulfide 114
-27.20 -32.20 -29.10 -27.20 Acrylonitrile 115
-35.46 -35.67 -36.63 -33.89 1,4-Difluorobenzene 116
-73.48 -74.51 -67.31 -74.48 Benzophenone 117
-56.59 -57.72 -53.73 -55.40 Quinoline 118
-65.39 -64.51 -71.36 -64.39 1-Nitronaphthalene 119
Test set
-9.34 -10.69 -15.11 -9.20 Ethane 120
-15.10 -15.69 -22.21 -14.40 Propane 121
-30.19 -29.27 -34.89 -29.75 Hexane 122
-34.54 -34.82 -39.57 -34.48 Heptane 123
-39.21 -40.16 -30.27 -39.13 Octane 124
-44.11 -45.34 -35.13 -43.18 Nonane 125
-16.26 -17.07 -21.62 -16.15 2-Methylpropane 126
-24.11 -24.18 -25.43 -23.93 2-Methylbutane 127
-31.21 -32.19 -24.35 -30.95 Methylcyclopentane 128
-50.66 -52.70 -58.59 -50.78 cis Decalin 129
-59.09 -60.33 -61.66 -57.16 Bicyclohexyl 130
-53.54 -52.69 -52.08 -55.50 Tetralin 131
-35.37 -36.37 -30.15 -37.73 3-Pentanone 132
-39.99 -41.57 -32.06 -41.84 2-Hexanone 133
-44.77 -45.64 -38.97 -46.40 2-Heptanone 134
-43.14 -46.62 -40.01 -45.64 4-Heptanone 135
-48.96 -50.10 -42.62 -50.65 2-Octanone 136
-56.66 -54.58 -51.91 -54.77 5-Nonanone 137
-37.37 -36.37 -32.15 -39.88 2,4-Pentanedione 138
-46.80 -50.17 -51.32 -45.50 1,2-Diethoxyethane 139
-38.53 -39.13 -43.19 -37.90 1,4-Dioxane 140
-69.77 -69.35 -68.57 -71.30 2,5,8,11-Tetraoxadodecane 141
-34.80 -34.83 -30.31 -34.60 Tetrahydrofuran 142
-61.78 -65.70 -65.27 -63.90 12-Crown-4 143
-51.62 -52.51 -47.52 -53.59 1-Chlorooctane 144
-41.94 -36.78 -38.90 -39.33 Tetrachloroethylene 145
-38.35 -39.54 -38.61 -40.39 1-Iodobutane 146
-36.17 -36.42 -34.11 -38.15 Pentanal 147
-28.22 -29.25 -34.68 -28.37 Nitromethane 148
-32.67 -34.44 -34.82 -32.90 Nitroethane 149
-37.65 -39.65 -38.98 -38.23 1-Nitropropane 150
-27.31 -27.23 -25.40 -26.40 2-Propanol 151
-33.52 -31.28 -29.38 -34.53 1-Butanol 152
-35.59 -36.01 -33.76 -37.82 1-Pentanol 153
-42.81 -43.66 -43.19 -41.83 Ethylbenzene 154
-47.95 -48.24 -49.47 -46.74 Mesitylene 155
-65.44 -61.68 -58.80 -63.06 Biphenyl 156
-47.10 -46.20 -44.57 -48.10 Benzonitrile 157
-43.46 -47.41 -45.96 -42.97 Bromobenzene 158
-61.81 -60.75 -64.82 -60.67 1,3,5-Tribromobenzene 159
-51.43 -52.67 -54.18 -50.21 Nitrobenzene 160
-58.84 -58.52 -64.44 -56.08 4-Chloro-1-nitrobenzene 161
-33.47 -36.04 -34.05 -33.21 Pyrrole 162
-40.03 -41.67 -39.59 -38.99 N-Methylpyrrole 163
-39.08 -36.98 -35.61 -41.31 Tetrahydrothiophene 164
-24.46 -27.30 -26.54 -23.38 Dimethyl sulfide 165
-35.02 -35.83 -34.38 -38.73 Diethyl sulfide 166
-39.55 -40.79 -38.98 -48.63 γ-Butyrolactone 167
-76.87 -79.59 -68.23 -78.24 trans-Stilbene 168
-64.31 -60.20 -59.71 -62.37 1-Chloronaphthalene 169
Cite this article as:
Zahra Dashtbozorgi et al.: Quantitative structure–property relationship studies for predicting gas to carbon tetrachloride
solvation enthalpy based on partial least squares, artificial neural network and support vector machine.
Global J. Phys. Chem. 2012, 3: 13