EXCLI Journal 2009;8:74-88 – ISSN 1611-2156 Received: April 28, 2009, accepted: May 3, 2009, published: May 5, 2009

Review article:

A PRACTICAL OVERVIEW OF QUANTITATIVE STRUCTURE-ACTIVITY RELATIONSHIP

Chanin Nantasenamat1, Chartchalerm Isarankura-Na-Ayudhya1, Thanakorn Naenna2, Virapong Prachayasittikul1,*

1 Department of Clinical Microbiology, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand

2 Department of Industrial Engineering, Faculty of Engineering, Mahidol University, Nakhon Pathom 73170, Thailand

* Corresponding author: Telephone: 662-441-4376, Fax: 662-441-4380, E-mail: [email protected]

ABSTRACT

Quantitative structure-activity relationship (QSAR) modeling pertains to the construction of predictive models of biological activities as a function of structural and molecular information of a compound library. The concept of QSAR has typically been used for drug discovery and development and has gained wide applicability for correlating molecular information not only with biological activities but also with other physicochemical properties, in which case it is termed quantitative structure-property relationship (QSPR). Typical molecular parameters that are used to account for electronic properties, hydrophobicity, steric effects, and topology can be determined empirically through experimentation or theoretically via computational chemistry. A given compilation of data sets is then subjected to data pre-processing and data modeling through the use of statistical and/or machine learning techniques. This review aims to cover the essential concepts and techniques that are relevant for performing QSAR/QSPR studies through the use of selected examples from our previous work.

Keywords: quantitative structure-activity relationship, QSAR, quantitative structure-property relationship, multivariate analysis

INTRODUCTION

Drug discovery has often evolved from serendipitous and fortuitous findings. For example, the discovery of penicillin by Alexander Fleming in 1928 triggered the Antibiotic Revolution, which contributed tremendously to humankind's quest for longevity. If not by chance, such discoveries may be achieved through random or systematic experimentation or chemical intuition, where combinatorial libraries are synthesized and screened for potent activities. Such an approach is extremely time consuming, labor intensive, and impractical in terms of costs. A more attractive solution to this problem is to rationally design drugs using computer-aided tools via molecular modeling, simulation, and virtual screening for the purpose of identifying promising candidates prior to synthesis.

Quantitative structure-activity relationship (QSAR) and quantitative structure-property relationship (QSPR) make it possible to predict the activities/properties of a given compound as a function of its molecular substituents. Essentially, new and untested compounds possessing molecular features similar to those of the compounds used in the development of QSAR/QSPR models are assumed to possess similar activities/properties. Several successful QSAR/QSPR models have been published over the years which encompass a wide span of biological and physicochemical properties. QSAR/QSPR has great potential for modeling and designing novel compounds with robust properties by being able to forecast physicochemical properties as a function of structural features. The popularity of QSAR/QSPR has seen exponential growth, as illustrated by a literature search in Scopus for research articles with QSAR, QSPR, structure-activity relationship, and structure-property relationship as keywords (Figure 1).

This review covers the essential concepts and history of QSAR/QSPR as well as the components involved in the development of QSAR/QSPR models. Several examples from our previous research and relevant equations are presented.

BRIEF HISTORY OF QSAR

QSAR has its origins in the field of toxicology, where Cros in 1863 proposed a relationship between the toxicity of primary aliphatic alcohols and their water solubility (Cros, 1863). Likewise, Crum-Brown and Fraser (Crum-Brown and Fraser, 1868-1869) postulated the linkage between chemical constitution and physiological action in their pioneering investigation in 1868 as follows:

“performing upon a substance a chemical operation which shall introduce a known change into its constitution, and then examining and comparing the physiological action of the substance before and after the change”

Later, Richet (1893), Meyer (1899), and Overton (1901) separately discovered a linear correlation between lipophilicity (e.g. oil-water partition coefficients) and biological effects (e.g. narcotic effects and toxicity). In 1935, Hammett (1935, 1937) introduced a method to account for substituent effects on reaction mechanisms through the use of an equation which took two parameters into consideration, namely (i) the substituent constant and (ii) the reaction constant.

[Plot: number of research articles (y-axis, 0 to 6000) versus year (x-axis, 1950 to 2010).]

Figure 1: Number of research articles in the field of QSAR/QSPR.






Complementing Hammett's model, Taft proposed in 1956 an approach for separating polar, steric, and resonance effects of substituents in aliphatic compounds (Taft, 1956). The contributions from Hammett and Taft set forth the mechanistic basis for QSAR/QSPR development by Hansch and Fujita (1964) in their seminal development of the linear Hansch equation, which integrated hydrophobic parameters with Hammett's electronic constants. An insightful account of the development of QSAR/QSPR can be found in the excellent book by Hansch and Leo (1995).

DEVELOPMENT OF QSAR MODEL

The construction of a QSAR/QSPR model typically comprises two main steps: (i) description of molecular structure and (ii) multivariate analysis for correlating molecular descriptors with observed activities/properties. An essential preliminary step in model development is data understanding. Intermediate steps that are also crucial for successful development of such QSAR/QSPR models include data pre-processing and statistical evaluation. A schematic representation of the QSAR process is illustrated in Figure 2.

Data understanding

Data understanding is a crucial step that one should not overlook, as it helps the researcher to become familiar with the nature of the data prior to actual QSAR/QSPR model construction, thereby reducing unnecessary errors or labor that would otherwise occur. An added benefit is that such preliminary observations can often lead to the identification of interesting associations or relationships to study. However, before exploring the data it is essential that a thorough literature search on relevant background information pertaining to the biological or chemical system of interest is performed.

Data understanding can be achieved through what is known as exploratory data analysis, which often starts with simple observation of the data matrix, particularly the variables (also known as attributes or fields), their corresponding data types, and the data samples (also called records).

[Schematic with the following panels. Molecular Structures: 1D (SMILES string, e.g. OC1=CC=CC=C1), 2D, and 3D representations. Molecular Descriptors: constitutional, electronic, geometrical, hydrophobic, lipophilicity, solubility, steric, quantum chemical, topological. Data Pre-Processing: normalization, standardization, feature selection, outlier detection. Multivariate Analysis: multiple linear regression, self-organizing map, principal component analysis, partial least squares, neural network, support vector machine. Statistical Evaluation: R, R2, Q2, MSE, RMSE, with a plot of experimental activity versus predicted activity.]

Figure 2: Schematic overview of the QSAR process.






As applied to the QSAR discipline, variables represent molecular descriptors; data samples represent each unique compound; and data types refer to the kind of data in which a particular value is represented, which is essentially either qualitative or quantitative in nature. Qualitative data types are interpreted as categorical labels while quantitative data types are amenable to arithmetic operations. A more in-depth look into the nature of the data can be performed via a simple scatter plot of the variables.

Molecular descriptors

Molecular descriptors can be defined as the essential information of a molecule in terms of its physicochemical properties, such as constitutional, electronic, geometrical, hydrophobic, lipophilicity, solubility, steric, quantum chemical, and topological descriptors. A more in-depth explanation of molecular descriptors can be found in the literature (Helguera et al., 2008; Karelson et al., 1996; Katritzky and Gordeeva, 1993; Labute, 2000; Randić, 1990; Randić and Razinger, 1997; Xue and Bajorath, 2000) and a more extensive treatment in the encyclopedic Handbook of Molecular Descriptors (Todeschini and Consonni, 2000). From a practical viewpoint, molecular descriptors are chemical information encoded within the molecular structures, and numerous algorithms are available for extracting them.

Such descriptors can be calculated using general quantum chemical software such as Gaussian (Frisch et al., 2004), Spartan (Wavefunction, 2004), GAMESS (Gordon and Schmidt, 2005; Schmidt et al., 1993), NWChem (Kendall et al., 2000), Jaguar (Schrödinger, 2008), MOLCAS (Karlström et al., 2003), Q-Chem (Shao et al., 2006), Dalton (Angeli et al., 2005), and MOPAC (Stewart, 2009) or specialized software such as DRAGON (Talete srl, 2007; Tetko et al., 2005), CODESSA (Katritzky et al., 2005), ADRIANA.Code (Molecular Networks GmbH Computerchemie, 2008), and RECON (Sukumar and Breneman, 2002). Once the molecular descriptors have been calculated, they serve as independent variables for further construction of the QSAR model.

Modeled activities/properties

The activities and properties that can be modeled by QSAR/QSPR are the dependent variables of the QSAR model. These dependent variables are assumed to be influenced by the independent variables, which are the molecular descriptors. A variety of biological and chemical properties have successfully been modeled using the QSAR approach; these are summarized in Table 1.

Table 1: Summary of biological and chemical properties explored in QSAR studies.

Biological properties              Chemical properties

Bioconcentration                   Boiling point
Biodegradation                     Chromatographic retention time
Carcinogenicity                    Dielectric constant
Drug metabolism and clearance      Diffusion coefficient
Inhibitor constant                 Dissociation constant
Mutagenicity                       Melting point
Permeability                       Reactivity
  Blood brain barrier              Solubility
  Skin                             Stability
Pharmacokinetics                   Thermodynamic properties
Receptor binding                   Viscosity






Data pre-processing

Data pre-processing can be considered one of the most important phases of data mining, as it helps to ensure the integrity of the data set before proceeding further with data mining analysis. Essentially, the quality of a data mining analysis is a function of the quality of the data to be analyzed. This is often summarized by the “garbage in, garbage out” rule. Therefore, to obtain reliable QSAR models it is important to handle the data with great care.

Data cleaning

The preliminary steps in data pre-processing typically require data cleaning, as raw data often contain anomalies, errors, or inconsistencies such as missing data, incomplete data, and invalid character values, which may cause trouble for data mining software if left untreated. This matter is complicated further when information is consolidated from various sources, as such data need to be prepared to conform to designated criteria and redundant information needs to be eliminated.

Data transformation

There is often a great deal of variability in the range and distribution of each variable in the data set. This may pose a problem for data mining algorithms, such as neural networks, that involve distance measurements in the learning step. Such situations are handled by applying statistical techniques such as min-max normalization or z-score standardization. In min-max normalization, each variable is rescaled so that its minimum and maximum values map to a uniform range between 0 and 1 according to the following equation:

$$x_{normalized} = \frac{x_i - x_{min}}{x_{max} - x_{min}}$$

where $x_{normalized}$ represents the min-max normalized value, $x_i$ represents the value of interest, $x_{min}$ represents the minimum value, and $x_{max}$ represents the maximum value.
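The min-max rescaling described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation; the function name, the handling of a constant variable, and the example values are assumptions made for the sketch.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the range [0, 1]."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # Assumption: a constant variable carries no information, so map it to 0.0.
        return [0.0 for _ in values]
    return [(x - x_min) / (x_max - x_min) for x in values]


# Hypothetical descriptor column; the numbers are illustrative only.
descriptor = [94.1, 108.1, 122.2, 150.2]
print(min_max_normalize(descriptor))
```

Note that the minimum and maximum are global parameters of the variable, so they must be computed once over the whole column before any individual value is rescaled.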

In z-score standardization, the variable of interest is transformed to have zero mean (mean centering) and unit variance according to the following formula:

$$x_{ij}^{stnd} = \frac{x_{ij} - \bar{x}_j}{\sqrt{\sum_{i=1}^{N} \left( x_{ij} - \bar{x}_j \right)^2 / N}}$$

where $x_{ij}^{stnd}$ represents the standardized value, $x_{ij}$ represents the value of interest, $\bar{x}_j$ represents the mean of variable $j$, and $N$ represents the sample size of the data set. The equation is essentially the difference between the value of interest and its mean, divided by the denominator, which is the standard deviation (the square root of the variance). Practically, both normalization and standardization require a statistical operation to be applied to each individual value using the global parameters of each variable, such as its minimum value, maximum value, mean, or variance.
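A minimal Python sketch of z-score standardization follows. It assumes the population form of the standard deviation (division by N), as described in the text; the function name and the treatment of a constant variable are illustrative choices.

```python
import math


def z_score_standardize(values):
    """Center a variable on its mean and scale it to unit variance."""
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation (divide by N), as in the text.
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    if std == 0.0:
        # Assumption: a constant variable is returned as all zeros.
        return [0.0 for _ in values]
    return [(x - mean) / std for x in values]


# Hypothetical descriptor column; the numbers are illustrative only.
z = z_score_standardize([1.0, 2.0, 3.0, 4.0])
print(z)
```

After the transformation the values have a mean of zero and a variance of one, so descriptors measured on very different scales become comparable.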

In situations where the data do not have a Gaussian (normal) distribution, simple mathematical functions can be applied to achieve normality or symmetry in the data distribution. A commonly used approach is to apply a logarithmic transformation to the variable of interest in order to achieve a distribution approaching normality. This is typically performed on dependent variables such as the modeled biological/chemical properties of interest, whereby IC50 may be transformed to logIC50 or -logIC50. Practically, such a mathematical operation is applied to each individual value of a given variable of interest.
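As a small illustration of the logarithmic transformation, the sketch below converts IC50 values to -logIC50. It assumes base-10 logarithms and concentrations expressed in mol/L; the function name and the example values are hypothetical.

```python
import math


def to_neg_log_ic50(ic50_molar):
    """Return -log10(IC50) for an IC50 given in mol/L (units are an assumption)."""
    if ic50_molar <= 0:
        raise ValueError("IC50 must be a positive concentration")
    return -math.log10(ic50_molar)


# Hypothetical IC50 values spanning five orders of magnitude.
ic50_values = [1e-9, 5e-8, 2e-6, 1e-4]
transformed = [to_neg_log_ic50(v) for v in ic50_values]
print(transformed)
```

Values that span several orders of magnitude on the raw scale are compressed into a range of a few log units, which is usually closer to a symmetric distribution.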

Feature or variable selection

Typical data sets often contain redundant or noisy variables, which make it more difficult for learning algorithms to discern meaningful patterns from the input data set of interest. For example, a data set may contain 1,500 variables, but only 15 of those may contain unique and useful information while the rest are redundant with respect to that subset. Such multicollinearity among the variables in the data set needs to be treated before proceeding with data mining analysis in order to avoid consuming unnecessary computational resources in model construction.
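The text does not prescribe a particular selection method. One simple possibility, shown here purely as an assumed illustration, is a greedy filter that discards any descriptor that is highly correlated with a descriptor already kept. The threshold of 0.95, the function names, and the example data are all hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # Assumption: a constant column is treated as uncorrelated.
    return sxy / math.sqrt(sxx * syy)


def drop_collinear(columns, threshold=0.95):
    """Keep a descriptor only if |r| with every kept descriptor is below threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson_r(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical descriptor table: "logP_scaled" is nearly a multiple of "logP".
descriptors = {
    "logP": [1.0, 2.0, 3.0, 4.0, 5.0],
    "logP_scaled": [2.1, 4.0, 6.2, 8.0, 10.1],
    "dipole": [0.5, 2.5, 1.0, 3.0, 0.2],
}
print(drop_collinear(descriptors))
```

The result of a greedy filter depends on the order in which descriptors are visited, so in practice the ordering (for example, by relevance to the modeled property) is itself a design choice.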

Similarly, feature fusion is another interesting approach for reducing the dimensionality of the variable matrix while keeping the core information intact. This is performed by merging the information of two or more variables through mathematical operations. A more in-depth treatment of this issue can be found in the literature (Bosse et al., 2007; Goodman et al., 1997; Hall and McMullen, 2004; Torra, 2003).

Multivariate analysis

Multivariate analysis is essentially an approach to quantitatively discern relationships between the independent variables (e.g. molecular descriptors) and the dependent variables (e.g. biological/chemical properties of interest). The classical approach is linear regression, which typically involves the establishment of a linear mathematical equation:

$$y = a_0 + a_1 x_1 + \cdots + a_n x_n$$

where $y$ is the dependent variable (e.g. biological/chemical property of interest), $a_0$ is the y-intercept or baseline value for the compound data set, $x_1 \ldots x_n$ are the independent variables (molecular descriptors), and $a_1 \ldots a_n$ are the regression coefficients calculated from a set of training data in a supervised manner, where both the independent and dependent variables are known. The equation essentially relates the variation of biological/chemical properties to the variation of the molecular substituents present in the molecular data set. Such a linear approach works well for biological/chemical systems in which the phenomenon of interest is linear in nature. However, not all properties are straightforward and some may be non-linear in nature, which calls for the use of non-linear approaches in order to properly model them. Artificial neural networks are a popular non-linear technique with a strong capability to model such properties. This review article briefly covers the fundamentals of artificial neural networks as an example of a non-linear learning algorithm. Other popular learning methods frequently used in the field of QSAR, such as partial least squares regression (Geladi and Kowalski, 1986; Höskuldsson, 1988; Wold et al., 2001) or support vector machine (Chen et al., 2004; Cristianini and Shawe-Taylor, 2000; Wang, 2005), are covered in excellent resources elsewhere.
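To make the linear equation concrete, the following sketch fits the coefficients $a_0 \ldots a_n$ by ordinary least squares, solving the normal equations with Gauss-Jordan elimination. It is a didactic, dependency-free illustration under the assumption that the descriptors are not linearly dependent; the function names and the toy data are hypothetical, and a numerical library would normally be used in practice.

```python
def fit_mlr(X, y):
    """Return [a0, a1, ..., an] for y = a0 + a1*x1 + ... + an*xn (least squares)."""
    rows = [[1.0] + list(r) for r in X]  # leading 1.0 models the intercept a0
    p = len(rows[0])
    # Augmented normal equations: (A^T A | A^T y).
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda k: abs(M[k][c]))
        if abs(M[pivot][c]) < 1e-12:
            raise ValueError("descriptors are linearly dependent")
        M[c], M[pivot] = M[pivot], M[c]
        M[c] = [v / M[c][c] for v in M[c]]
        for k in range(p):
            if k != c:
                factor = M[k][c]
                M[k] = [vk - factor * vc for vk, vc in zip(M[k], M[c])]
    return [M[i][p] for i in range(p)]


def predict_mlr(coeffs, x):
    """Evaluate the fitted linear equation for one compound."""
    return coeffs[0] + sum(a * v for a, v in zip(coeffs[1:], x))


# Toy training data generated from y = 1 + 2*x1 - 3*x2 (illustrative only).
X_train = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y_train = [1, 3, -2, 0, 2]
coeffs = fit_mlr(X_train, y_train)
print(coeffs)
```

Because the toy data are exactly linear, the fitted coefficients recover the generating equation; with real descriptors the fit would only minimize the squared error.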

Artificial neural network

Artificial neural network (ANN) is a pattern recognition technique loosely modeled on the inner workings of the brain, which is essentially composed of interconnected neurons. Neurons receive their signals via synapses at the axon-dendron junction, in which the axon of one neuron relays neurotransmitters to the dendron of another neuron. This phenomenon is emulated by the architectural design of the ANN, where neuronal units are interconnected with one another. A commonly used architecture, shown in Figure 3, is a three-layer feed-forward network comprised of (i) an input layer, (ii) a hidden layer, and (iii) an output layer. The input layer essentially passes information of the independent variables into the ANN system; therefore the number of neuronal units present in the input layer is equal to the number of independent variables in the data set. The connections among neurons are assigned numerical values known as weights. The information from the input layer is relayed to the hidden layer for pattern recognition processing, and predictions are then passed from the hidden layer to the output layer. In the back-propagation algorithm, the error is calculated from the difference between the predicted value and the actual value. If the error is acceptable, the learning process stops; otherwise signals are sent backwards to the hidden layer for further processing and weight readjustments. This is performed iteratively until a solution is reached and learning is terminated.

[Schematic: a feed-forward network with inputs x1 to x8 and a single output y.]

Figure 3: Schematic representation of artificial neural network.
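The back-propagation loop described in this section can be sketched as a small three-layer network written in plain Python. This is an assumed minimal design (sigmoid hidden units, a single linear output, per-sample weight updates, squared-error loss); the hyperparameter values, function name, and toy data are hypothetical and are not taken from the original studies.

```python
import math
import random


def train_ann(X, y, n_hidden=3, lr=0.1, epochs=2000, tol=1e-4, seed=0):
    """Train a three-layer feed-forward network; return (predict, error history)."""
    rng = random.Random(seed)
    n_in = len(X[0])
    # Each weight vector stores its bias as the last element.
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_o = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(x):
        h = []
        for w in w_h:
            s = w[-1] + sum(wi * xi for wi, xi in zip(w, x))
            h.append(1.0 / (1.0 + math.exp(-s)))  # sigmoid hidden unit
        out = w_o[-1] + sum(wi * hi for wi, hi in zip(w_o, h))  # linear output
        return h, out

    history = []
    for _ in range(epochs):
        sq_err = 0.0
        for x, target in zip(X, y):
            h, out = forward(x)
            err = out - target  # difference between predicted and actual value
            sq_err += err * err
            # Send the error backwards and readjust the weights.
            for j in range(n_hidden):
                grad_h = err * w_o[j] * h[j] * (1.0 - h[j])
                for i in range(n_in):
                    w_h[j][i] -= lr * grad_h * x[i]
                w_h[j][-1] -= lr * grad_h
                w_o[j] -= lr * err * h[j]
            w_o[-1] -= lr * err
        history.append(sq_err / len(X))
        if history[-1] < tol:  # stop once the error is acceptable
            break
    return (lambda x: forward(x)[1]), history


# Toy data: the target is the sum of two inputs (illustrative only).
X_toy = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
y_toy = [0.0, 1.0, 1.0, 2.0]
predict, history = train_ann(X_toy, y_toy)
print(history[0], history[-1])
```

The stopping rule mirrors the description in the text: training ends either when the mean squared error falls below a tolerance or when the maximum number of iterations is reached.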

Parameter optimization

In deriving a robust QSAR model, it is essential to optimize the parameters of the learning technique of interest. This can be performed via a systematic and empirical grid search or via stochastic approaches using techniques such as Monte Carlo or genetic algorithm. A typical systematic grid search is performed from a predetermined minimum to a predetermined maximum value, both of which depend on the parameter to be optimized. The step size within this parameter interval can initially be large in order to minimize computational resources. From this preliminary calculation, the optimal regions are identified and a more refined parameter search can then be performed by narrowing the step size.
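The coarse-to-fine grid search described above might be sketched as follows for a single parameter. The error function here is a stand-in for a real validation error, and all names, ranges, and step sizes are assumptions made for illustration.

```python
def grid_search(error_fn, low, high, step):
    """Evaluate error_fn on a regular grid; return the best (parameter, error)."""
    best_p, best_e = None, float("inf")
    n_steps = int(round((high - low) / step))
    for k in range(n_steps + 1):
        p = low + k * step
        e = error_fn(p)
        if e < best_e:
            best_p, best_e = p, e
    return best_p, best_e


def coarse_to_fine(error_fn, low, high, coarse_step, fine_step):
    """Locate the promising region with a large step, then refine around it."""
    p0, _ = grid_search(error_fn, low, high, coarse_step)
    lo = max(low, p0 - coarse_step)
    hi = min(high, p0 + coarse_step)
    return grid_search(error_fn, lo, hi, fine_step)


# Hypothetical validation-error curve with its minimum at p = 3.7.
def toy_error(p):
    return (p - 3.7) ** 2


best_p, best_err = coarse_to_fine(toy_error, 0.0, 10.0, 1.0, 0.1)
print(best_p, best_err)
```

In this toy case the two-stage search evaluates 11 coarse points and 21 fine points instead of the 101 points a single fine grid over the full range would need.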

Statistical evaluation

In the construction of a QSAR model, it is essential to validate the model as well as apply statistical parameters to evaluate its predictive performance.

Model validation

The predictive performance of a model can be assessed by dividing the data set into a training set and a testing set. The training set is used for constructing a predictive model whose predictive performance is evaluated on the testing set. Internal performance is typically assessed from the predictive performance on the training set, while external performance is assessed from the predictive performance on an independent testing set that is unknown to the trained model. A commonly used approach for internal validation is N-fold cross-validation, where a data set is partitioned into N folds. For example, in a 10-fold cross-validation 1 fold is left out as the testing set while the remaining 9 folds are used as the training set for model construction, and the model is then validated with the fold that was left out. In situations where the number of samples in the data set is limited, leave-one-out cross-validation is the preferred approach. In this case, the number of folds is equal to the number of samples present in the data set; therefore one sample is left out as the testing set while the rest are used as the training set for model construction, and validation is performed on the sample that was left out. This is performed iteratively until every data sample has been left out as the testing set once.
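The partitioning logic of N-fold and leave-one-out cross-validation can be sketched generically. The sketch assumes an interleaved fold assignment without shuffling, and it uses a deliberately trivial placeholder "model" (the training-set mean) so that the example stays self-contained; any learning method with a fit and a predict step could be substituted.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx); every sample lands in a test fold exactly once."""
    folds = [list(range(i, n_samples, n_folds)) for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, test_idx


def cross_validate(X, y, fit, predict, n_folds):
    """Return out-of-fold predictions in the original sample order."""
    preds = [None] * len(y)
    for train_idx, test_idx in k_fold_indices(len(y), n_folds):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in test_idx:
            preds[i] = predict(model, X[i])
    return preds


# Placeholder model (an assumption): always predict the training-set mean.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)


def predict_mean(model, x):
    return model


y_obs = [1.0, 2.0, 3.0, 4.0]
X_obs = [[v] for v in y_obs]
# Leave-one-out is the special case where n_folds equals the number of samples.
loo_preds = cross_validate(X_obs, y_obs, fit_mean, predict_mean, n_folds=len(y_obs))
print(loo_preds)
```

The out-of-fold predictions collected this way are the values from which cross-validated statistics are computed.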

Statistical parameters

Pearson's correlation coefficient (r) is a commonly used parameter to describe the degree of association between two variables of interest. The calculated r value can range from -1 to +1, where the former indicates a perfect inverse (negative) correlation while the latter indicates a perfect direct (positive) correlation. For describing the relative predictive performance of a QSAR model, r is used to measure the correlation between experimental (x) and predicted (y) values of interest in order to observe the variability that exists between the variables. This is calculated according to the following equation:






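$$r_{xy} = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left( n \sum x^2 - \left( \sum x \right)^2 \right) \left( n \sum y^2 - \left( \sum y \right)^2 \right)}}$$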

where $r_{xy}$ is the correlation coefficient between variables x and y, n is the sample size, x is the individual value of variable x, y is the individual value of variable y, xy is the product of variables x and y, $x^2$ is the squared value of variable x, and $y^2$ is the squared value of variable y.

Root mean squared error (RMS) is another commonly used parameter for assessing the relative error of the QSAR model. RMS is computed according to the following formula:

$$RMS = \sqrt{\frac{\sum_{i=1}^{n} \left( x_i - y_i \right)^2}{n}}$$

where RMS is the root mean squared error, x is the experimental value of the activity/property of interest, y is the predicted value of the activity/property of interest, and n is the sample size of the data set.
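Both statistics are straightforward to compute from paired lists of experimental and predicted values. The sketch below follows the computational form of r and the RMS definition given in the text; the function names and the example values are illustrative assumptions.

```python
import math


def pearson_r(x, y):
    """Pearson's r from raw sums, following the computational formula."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))


def rms_error(x, y):
    """Root mean squared error between experimental x and predicted y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Hypothetical values: every prediction is offset by a constant 0.5.
experimental = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]
print(pearson_r(experimental, predicted), rms_error(experimental, predicted))
```

The example illustrates why the two statistics are reported together: a constant offset leaves the correlation perfect while the RMS still registers the systematic error.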

F-test

The statistical significance of a QSAR model is typically assessed by performing ANOVA and observing the calculated F value, which is essentially the ratio between the explained and the unexplained variance. The performance of multiple QSAR models can be compared when all models have the same number of degrees of freedom, meaning that the same numbers of compounds and descriptors are used. Each model yields a calculated F value, and the best performing model is the one bearing the highest value.

Degrees of freedom take into consideration the number of compounds and the number of independent variables that are present in the data set. They can be calculated using the equation n – k – 1, where n represents the number of compounds and k represents the number of descriptors. The higher this value, the more reliable the QSAR model is.
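As a hedged illustration, the F value can be computed from observed and fitted values together with the degrees of freedom n – k – 1. The sketch assumes the usual regression ANOVA form, F = (SS_explained / k) / (SS_residual / (n – k – 1)), with the explained sum of squares taken as total minus residual (which holds for a least-squares fit with an intercept). Function and variable names and the data are hypothetical.

```python
def f_statistic(observed, fitted, n_descriptors):
    """Return (F, degrees of freedom) for a regression model with an intercept."""
    n = len(observed)
    k = n_descriptors
    dof = n - k - 1
    if dof <= 0:
        raise ValueError("need more compounds than descriptors plus one")
    mean_obs = sum(observed) / n
    ss_total = sum((o - mean_obs) ** 2 for o in observed)
    ss_resid = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    # Assumption: explained variation = total - residual (least-squares fit).
    ss_explained = ss_total - ss_resid
    return (ss_explained / k) / (ss_resid / dof), dof


# Hypothetical observed and fitted activities for a one-descriptor model.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
fitted = [1.1, 1.9, 3.0, 4.1, 4.9]
f_value, dof = f_statistic(observed, fitted, n_descriptors=1)
print(f_value, dof)
```

Adding descriptors without adding compounds lowers the degrees of freedom, which is one reason models with many descriptors and few compounds are regarded as less reliable.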

Outliers

Outlying compounds are molecules which have unexpected biological activity and do not fit in a QSAR model, because such compounds may be acting through a different mechanism or interacting with their respective target molecules in different modes (Verma and Hansch, 2005). Similarly, conformational flexibility of the target protein binding site (Kim, 2007a) and unusual binding modes (Kim, 2007b) have been suggested as possible sources of outliers. Mathematically speaking, an outlier is essentially a data point which has a high standardized residual in absolute value when compared to the other samples of the data set. Methods for identification and treatment of outlying compounds are therefore crucial in the development of reliable QSAR models (Furusjö et al., 2006). A commonly used approach for detecting outliers is to calculate the standardized residuals of all compounds in the data set of a QSAR model.
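A minimal sketch of residual-based outlier detection is given below. The text does not state a cutoff, so the threshold of 2.0 standard deviations used here is an assumption, as are the function names, the use of the sample standard deviation (division by n – 1), and the example data.

```python
import math


def standardized_residuals(observed, predicted):
    """Residuals centered on their mean and scaled by their sample standard deviation."""
    residuals = [o - p for o, p in zip(observed, predicted)]
    n = len(residuals)
    mean_r = sum(residuals) / n
    sd = math.sqrt(sum((r - mean_r) ** 2 for r in residuals) / (n - 1))
    return [(r - mean_r) / sd for r in residuals]


def flag_outliers(observed, predicted, cutoff=2.0):
    """Return indices of compounds whose |standardized residual| exceeds cutoff."""
    z = standardized_residuals(observed, predicted)
    return [i for i, v in enumerate(z) if abs(v) > cutoff]


# Hypothetical activities: the last compound is far from its predicted value.
observed_act = [5.0, 5.1, 4.9, 5.0, 5.2, 4.8, 5.0, 9.0]
predicted_act = [5.0] * 8
print(flag_outliers(observed_act, predicted_act))
```

Flagged compounds are candidates for closer inspection (for example, a different mechanism of action) rather than for automatic removal.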

Predictive QSAR Model

In evaluating the performance of the constructed QSAR model, a commonly used approach in the field of QSAR follows the recommendation of Tropsha (Tropsha et al., 2003) that a predictive QSAR model should possess the following statistical characteristics:

$$q^2 > 0.5$$

$$R^2 > 0.6$$

$$\frac{R^2 - R_0^2}{R^2} < 0.1 \quad \text{or} \quad \frac{R^2 - R_0'^2}{R^2} < 0.1$$

$$0.85 \le k \le 1.15 \quad \text{or} \quad 0.85 \le k' \le 1.15$$

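where $q^2$ represents the cross-validated explained variance, $R^2$ represents the coefficient of determination between predicted and observed activities, $R_0^2$ and $R_0'^2$ represent the coefficients of determination for regressions through the origin (predicted versus observed activities and observed versus predicted activities, respectively), and $k$ and $k'$ are the slopes of the corresponding regression lines passing through the origin.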

It should be noted that $q^2$ is calculated according to the following equation:

$$q^2 = 1 - \frac{\sum_{i=1}^{training} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{training} \left( y_i - \bar{y} \right)^2}$$

where $y_i$ is the measured value, $\hat{y}_i$ is the predicted value, $\bar{y}$ is the averaged value of the entire data set, and the summation applies to all compounds in the training set. Similarly, an external $q^2$ is calculated using compounds that were not previously used in QSAR model development. This is calculated according to the following equation:

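$$q_{ext}^2 = 1 - \frac{\sum_{i=1}^{test} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{test} \left( y_i - \bar{y} \right)^2}$$

Here the summation applies to the external compounds, i.e. those not used in model development.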
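The two $q^2$ expressions share the same form and differ only in which compounds are summed over, so a single helper can serve both. In the sketch below the optional `reference_mean` argument is an assumption: for the external statistic it is common to supply the mean of the training set, but the text does not specify this, so the default simply uses the mean of the observed values passed in. All names and data are hypothetical.

```python
def q_squared(observed, predicted, reference_mean=None):
    """Return 1 - sum((y - yhat)^2) / sum((y - ybar)^2)."""
    if reference_mean is None:
        reference_mean = sum(observed) / len(observed)
    press = sum((y - yhat) ** 2 for y, yhat in zip(observed, predicted))
    ss = sum((y - reference_mean) ** 2 for y in observed)
    return 1.0 - press / ss


# Hypothetical measured values and cross-validated predictions.
measured = [1.0, 2.0, 3.0, 4.0]
cv_predicted = [1.5, 2.5, 3.5, 4.5]
print(q_squared(measured, cv_predicted))
```

For the internal statistic, `predicted` would hold the cross-validated (out-of-fold) predictions for the training compounds; for the external statistic, it would hold predictions for compounds never seen during model development.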

CASE STUDY

In this review, we present examples from our previous QSAR/QSPR investigations on various data sets of biological and chemical systems: (i) recognition of DNA splice junction sites (Nantasenamat et al., 2005a), (ii) prediction of antioxidant activities of phenolic antioxidants (Nantasenamat et al., 2008), (iii) prediction of binding performance of molecularly imprinted polymers (Nantasenamat et al., 2007a; Nantasenamat et al., 2005b; Nantasenamat et al., 2006), (iv) prediction of spectral properties of green fluorescent protein variants (Nantasenamat et al., 2007b), (v) prediction of anti-anthrax activity of furin inhibitors (Worachartcheewan et al., 2009), and (vi) prediction of lactonolysis activity of N-acyl-homoserine lactones. Among these, we select some representative data sets as examples of QSAR/QSPR in action.

Recognition of DNA splice junction sites

The deoxyribonucleic acid (DNA) of humans is made up of over three billion nucleotides, which contain an estimated 30,000 genes that can express over 150,000 different proteins. The fact that a limited number of genes can produce a far larger number of different gene products is made possible by a phenomenon known as alternative splicing, where the stretches of DNA strands are cleaved at specific regions. Such regions in the DNA are known as exons (coding regions) and introns (non-coding regions), which are not readily discernible by simple observation of the DNA sequences. Our previous investigation has made it possible to recognize the cleavage regions of the DNA, called splice junction sites, which are the boundaries where splicing occurs.

Splice sites essentially comprise two types: (i) the AG dinucleotide that borders the transition from intron to exon (intron/exon border) and (ii) the GT dinucleotide that borders the transition from exon to intron (exon/intron border). Because a gene is capable of expressing several distinct mRNAs encoding different proteins, it is important to be able to predict the location of DNA splice sites, as this has great potential for the identification of probable gene products in unknown DNA sequences.

In our efforts to develop a computational approach for recognition of DNA splice junction sites, the DNA sequences were transformed to sequences of binary numbers by converting each nucleotide to a four-digit binary code, where the nucleotides adenine, cytosine, guanine, and thymine are represented as 0001, 0010, 0100, and 1000, respectively. Each entry of the data set describes information surrounding the splice junction site, particularly 15 nucleotides upstream and 15 nucleotides downstream of the dinucleotide, resulting in a total of 32 nucleotides. This information serves as the independent variables, while the dependent variable is the class of splice junction site, which was labeled as one of three possible types (going from 5' to 3', or left to right of the splice site): (i) Intron-AG-Exon, (ii) Exon-GT-Intron, and (iii) unknown-AG or GT-unknown. The data set is made up of a total of 1,424 human DNA sequences and is divided into two portions: (i) a training set of 1,000 sequences and (ii) a testing set of 424 sequences. Various predictive models were developed using three different types of learning algorithm: (i) self-organizing map, (ii) back-propagation neural network, and (iii) support vector machine.
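The binary encoding of nucleotides described above can be sketched directly from the code table given in the text. The function name and the example window are hypothetical; only the four codes and the window length of 32 nucleotides come from the description.

```python
# Four-digit binary codes as stated in the text.
NUCLEOTIDE_CODES = {"A": "0001", "C": "0010", "G": "0100", "T": "1000"}


def encode_sequence(seq):
    """Return a flat list of 0/1 integers, four per nucleotide."""
    bits = []
    for base in seq.upper():
        if base not in NUCLEOTIDE_CODES:
            raise ValueError(f"unexpected nucleotide: {base!r}")
        bits.extend(int(b) for b in NUCLEOTIDE_CODES[base])
    return bits


# Hypothetical window: 15 nucleotides upstream + AG + 15 nucleotides downstream.
window = "CTTCTCTCTTCCTTC" + "AG" + "GTGCTGGACATCGAC"
features = encode_sequence(window)
print(len(window), len(features))
```

With four binary digits per nucleotide, each 32-nucleotide window becomes a vector of 128 binary independent variables.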

Predicting the antioxidant activity of phenolic antioxidants

Reactive oxygen species (ROS) are produced during normal aerobic metabolism. Antioxidants are biomolecules which scavenge these free radicals and reduce their deleterious effects. Under normal physiological conditions, an equilibrium exists between the production and elimination of free radicals. This equilibrium may be perturbed by environmental factors to trigger a condition known as oxidative stress, which may result in oxidative damage to various biomolecules such as DNA, RNA, proteins, and membrane lipids. Antioxidant enzymes and compounds that are inherently present in living organisms, as well as those acquired from nutrition, play a crucial role in combating the deleterious effects of ROS. Therefore, the ability to predict antioxidant activity, in terms of the bond dissociation enthalpy, offers great potential for designing more robust antioxidant compounds.

This was addressed in our previous investigation on the structure-activity relationship of a library of phenolic antioxidants. Multivariate analysis of the QSAR model was performed by support vector machine using molecular descriptors derived from quantum chemical calculations as independent variables to predict the antioxidant activity, which is the dependent variable. The aim of the study was to develop a rapid approach to assess the antioxidant activity of the phenolic antioxidants using readily available quantum chemical descriptors. Such descriptors were calculated at various theoretical levels in order to select the level which gave good performance while consuming minimal computational resources. The theoretical levels, consisting of the semi-empirical Austin Model 1 (AM1), Hartree-Fock (HF) with the 3-21g(d) basis set, Becke's three-parameter Lee-Yang-Parr (B3LYP) with the 3-21g(d) basis set, and B3LYP with the 6-31g(d) basis set, were tested with multiple linear regression. Results indicated that AM1 and B3LYP/3-21g(d) were the best performing levels, with correlation coefficients of 0.897 and 0.917, respectively, and root mean squared errors of 1.974 and 1.777, respectively. These results outperformed those of HF/3-21g(d) and B3LYP/6-31g(d), which had lower correlation coefficients of 0.761 and 0.730, respectively, and higher root mean squared errors of 4.624 and 4.773, respectively.

Refinement of the predictive model was performed using support vector machine, which is a more robust learning algorithm, to yield significant improvements, with correlation coefficients of 0.968 and 0.966 for models using descriptors derived from B3LYP/3-21g(d) and AM1 calculations, respectively. Likewise, the root mean squared error showed a substantial decline to 1.122 and 1.247 for B3LYP/3-21g(d) and AM1 descriptors, respectively.

Predicting the imprinting factor of molecularly imprinted polymers

Molecular imprinting is a technology that enables the production of macromolecular matrices which can bind to template molecules of interest and function as artificial receptors, antibodies, and enzymes. These molecularly imprinted polymers (MIPs) are produced by polymerization of cross-linking monomers with self-assembled template-monomer adducts. The template molecules are then extracted from the polymers to reveal complementary binding cavities that are specific to the original template molecule.






We have developed an approach to calculate the interaction strength of template molecules with their complementary functional monomers. This methodology essentially correlates the molecular properties of template-monomer adducts with their respective interaction strengths in a quantitative manner via multivariate analysis. The molecular properties were derived from quantum chemical calculations to serve as quantitative descriptions of the template molecules and functional monomers. An artificial neural network implementing the back-propagation algorithm was used as the multivariate analysis method.
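The back-propagation training mentioned above can be sketched as follows. This is a minimal pure-Python illustration and not the network used in the original study: the architecture (one hidden layer of three tanh units with a linear output), the learning rate, the number of epochs, and the toy descriptor values are all assumptions chosen for illustration.

```python
import math
import random

def train_network(samples, targets, n_hidden=3, learning_rate=0.05,
                  epochs=2000, seed=0):
    """Train a one-hidden-layer network (tanh hidden units, linear output)
    by stochastic gradient descent with back-propagated errors."""
    rng = random.Random(seed)
    n_inputs = len(samples[0])
    # Each hidden unit has one weight per input plus a trailing bias term.
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                for _ in range(n_hidden)]
    # Output unit: one weight per hidden unit plus a trailing bias term.
    w_output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    for _ in range(epochs):
        for x, y in zip(samples, targets):
            # Forward pass.
            hidden = [math.tanh(sum(w * xi for w, xi in zip(ws, x)) + ws[-1])
                      for ws in w_hidden]
            output = sum(w * h for w, h in zip(w_output, hidden)) + w_output[-1]
            # Backward pass: error signal at the output, then at hidden units.
            delta_out = output - y
            delta_hidden = [delta_out * w_output[j] * (1.0 - hidden[j] ** 2)
                            for j in range(n_hidden)]
            # Gradient-descent weight updates.
            for j in range(n_hidden):
                w_output[j] -= learning_rate * delta_out * hidden[j]
            w_output[-1] -= learning_rate * delta_out
            for j in range(n_hidden):
                for i in range(n_inputs):
                    w_hidden[j][i] -= learning_rate * delta_hidden[j] * x[i]
                w_hidden[j][-1] -= learning_rate * delta_hidden[j]
    return w_hidden, w_output

def predict(network, x):
    """Forward pass through a trained network for one descriptor row."""
    w_hidden, w_output = network
    hidden = [math.tanh(sum(w * xi for w, xi in zip(ws, x)) + ws[-1])
              for ws in w_hidden]
    return sum(w * h for w, h in zip(w_output, hidden)) + w_output[-1]

def mean_squared_error(network, samples, targets):
    errors = [(predict(network, x) - y) ** 2 for x, y in zip(samples, targets)]
    return sum(errors) / len(errors)

# Hypothetical scaled descriptors and responses, for illustration only.
samples = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
targets = [0.5 * a - 0.3 * b + 0.2 for a, b in samples]
network = train_network(samples, targets)
print(f"training MSE = {mean_squared_error(network, samples, targets):.6f}")
```

In practice the descriptors would be scaled to a comparable range before training, and performance would be judged on compounds held out from training rather than on the training error printed here.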

The data set comprised two types of polymer: (i) irregularly-sized particles prepared by traditional bulk polymerization and (ii) uniformly-sized particles prepared by multi-step swelling or precipitation polymerization. The former yielded rather poor predictivity, with a correlation coefficient of 0.382, while the latter gave more robust results, with a correlation coefficient of 0.946. This disparity in predictive performance was attributed to the fact that the irregularly-sized MIPs had rather heterogeneous properties in terms of the (i) number of binding sites, (ii) distribution of the binding sites, (iii) size, and (iv) shape.

In the molecular imprinting literature, uniformly-sized MIPs have gained wide recognition for their larger surface area, monodispersity, and colloidal stability. This is in line with the predictive performance of the devised QSAR model, in which uniformly-sized MIPs gave higher predictive performance than the heterogeneous irregularly-sized MIPs.

Predicting GFP spectral properties

A practical example of QSAR/QSPR in action is modeling the spectral properties of green fluorescent protein (GFP) from the Pacific Northwest jellyfish Aequorea victoria. Owing to its autofluorescent nature, GFP finds extensive applications in the life sciences as a reporter for gene expression, protein localization, protein-protein interaction, protein-lipid interaction, and structural and behavioral determination of macromolecules, and as an analytical sensor. Much effort has been put forth to enhance the utility of such proteins by expanding the palette of colors that can be afforded by GFP and GFP-like proteins. The relationship between the structures of GFP chromophores and their respective spectral properties was established in our previous study (Nantasenamat et al., 2007b).

In that investigation, the excitation and emission maxima of 19 GFP color variants and 29 synthetic GFP chromophores were modeled using multiple linear regression, partial least squares regression, and back-propagation neural network. Molecular descriptors of the GFP chromophores were used as independent variables and the spectral properties (i.e., excitation and emission maxima) were used as dependent variables.
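Of the three methods named above, multiple linear regression is the simplest to illustrate: the dependent variable is fitted as an intercept plus a weighted sum of the descriptors by ordinary least squares. The sketch below solves the normal equations in pure Python; the function names, descriptor values, and wavelengths are hypothetical and were constructed for illustration, not taken from the study. Partial least squares and the neural network are not covered here.

```python
def fit_mlr(descriptors, responses):
    """Ordinary least squares fit; returns [intercept, b1, ..., bk]."""
    rows = [[1.0] + list(row) for row in descriptors]  # prepend intercept column
    k = len(rows[0])
    # Build the normal equations (X^T X) b = X^T y as an augmented matrix.
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           + [sum(r[i] * y for r, y in zip(rows, responses))]
           for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] for i in range(k)]

def predict_mlr(coefficients, descriptor_row):
    """Predict the response for one row of descriptor values."""
    return coefficients[0] + sum(
        c * x for c, x in zip(coefficients[1:], descriptor_row))

# Hypothetical descriptor rows and excitation maxima (nm); not data from the study.
descriptors = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]]
excitation_max = [400.0, 430.0, 420.0, 450.0, 450.0]
coefficients = fit_mlr(descriptors, excitation_max)
print([round(c, 3) for c in coefficients])
```

The toy responses were generated from an exact linear relation, so the fit recovers it; real descriptor sets are noisy and often collinear, which is one reason partial least squares is used alongside multiple linear regression.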

For development of the QSPR model, molecular descriptors were derived from three software packages: (i) Spartan’04, (ii) E-Dragon, and (iii) RECON. Spartan’04 is a quantum chemical package which calculates the electronic properties of the chromophores. E-Dragon is an online version of the Dragon software package, which can compute over 1,600 molecular descriptors spanning 20 categorical types. RECON is a software package used for deriving charge-based descriptors for the molecules of interest. A comparative assessment of the predictive performance of the QSPR models derived from the three software packages was carried out. Results indicated that the quantum chemical descriptors derived from Spartan’04 were most suitable for QSPR development, as the selected descriptors could properly account for the substituent effects of the GFP chromophores.

In preliminary trials, the predictive performance of the QSPR model was relatively low for the data set comprising 19 GFP color variants. A closer look at the QSPR model revealed that the molecular structures did not reflect the actual protonation state present in natural biological systems. The p-hydroxybenzylidene chromophore of GFP is present in two protonation forms, namely the protonated and deprotonated forms, which are responsible for the major absorbance peak at 395 nm and the minor absorbance peak at 475 nm, respectively. The preliminary QSPR models were derived from GFP chromophores that were all drawn in the protonated form. Because this does not reflect the actual protonation states, the chromophore protonation state was corrected by drawing chromophores with the 395 nm absorbance peak in the protonated form and those with the 475 nm absorbance peak in the deprotonated form. Consequently, the predictive performance of the QSPR model improved drastically. For structures that did not take the protonation state into consideration, the values were r = 0.3272 and RMS = 57.7310 for the excitation maximum and r = 0.7209 and RMS = 32.1526 for the emission maximum. For structures that took the protonation state into consideration, the values were r = 0.9795 and RMS = 8.8237 for the excitation maximum and r = 0.9067 and RMS = 15.7614 for the emission maximum.
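The protonation-state correction described above amounts to a small pre-processing step applied before descriptors are calculated. The sketch below encodes one way such a rule could be written. The nearest-peak criterion and the function name are assumptions made for illustration; the text states only that chromophores with the 395 nm peak were drawn protonated and those with the 475 nm peak deprotonated.

```python
# Reference absorbance peaks quoted in the text (nm).
PROTONATED_PEAK_NM = 395.0
DEPROTONATED_PEAK_NM = 475.0

def assign_protonation_state(excitation_max_nm):
    """Pick the form whose reference peak is closer to the observed maximum.

    The nearest-peak rule is an illustrative assumption; ties go to the
    protonated form.
    """
    to_protonated = abs(excitation_max_nm - PROTONATED_PEAK_NM)
    to_deprotonated = abs(excitation_max_nm - DEPROTONATED_PEAK_NM)
    return "protonated" if to_protonated <= to_deprotonated else "deprotonated"

# Hypothetical excitation maxima (nm), for illustration only.
for wavelength in (395.0, 475.0, 488.0):
    print(wavelength, assign_protonation_state(wavelength))
```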

With regard to the synthetic GFP chromophores, the absorbance spectra indicated that the compounds are present in the protonated form. The corresponding QSPR model gave satisfactory performance, as the drawn structures accurately reflected the protonation state actually present, with correlation coefficients and root mean squared errors of r = 0.9335 and RMS = 9.9095 for the excitation maximum and r = 0.9626 and RMS = 9.7508 for the emission maximum.

CONCLUSION

The past few decades have witnessed many advances in the development of computational models for predicting a wide span of biological and chemical activities, which are beneficial for screening promising compounds with robust properties. In this review article, we have provided a brief introduction to the concepts of QSAR along with examples from our previous investigations on diverse biological and chemical systems. It should be noted that QSAR models are applicable only within the domains in which they were trained and validated. As such, QSAR models spanning wider domains of molecular diversity have the benefit of being valid for wider spans of molecules. It is also interesting to note that there are many paths for researchers in the field of QSAR/QSPR in their quest to establish relationships between structures and activities/properties. This abstract nature holds the beauty of the field, as there are endless possibilities for reaching the same destination of designing novel molecules with desirable properties.
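The remark on model domains can be made concrete with a simple applicability-domain check. The sketch below uses a descriptor-range (bounding box) criterion, which is one common and deliberately crude choice and is an assumption here, since the text does not prescribe a particular method. The function names and descriptor values are hypothetical.

```python
def descriptor_ranges(training_rows):
    """Return the (minimum, maximum) of each descriptor over the training set."""
    columns = list(zip(*training_rows))
    return [(min(column), max(column)) for column in columns]

def in_domain(ranges, query_row):
    """True if every descriptor of the query lies within the training range."""
    return all(low <= value <= high
               for (low, high), value in zip(ranges, query_row))

# Hypothetical training descriptors (two per compound), for illustration only.
training_rows = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]]
ranges = descriptor_ranges(training_rows)
print(in_domain(ranges, [2.5, 15.0]))  # inside the training ranges
print(in_domain(ranges, [3.5, 15.0]))  # first descriptor exceeds its range
```

A prediction for a compound that fails such a check is an extrapolation and should be treated with correspondingly less confidence.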

ACKNOWLEDGEMENTS

The authors gratefully acknowledge financial support from the Young Scholars Research Fellowship to C. Nantasenamat (No. MRG5080450) from the Thailand Research Fund and the governmental budget of Mahidol University (B.E. 2551-2555).

REFERENCES

Angeli C, Bak KL, Bakken V, et al. DALTON, a molecular electronic structure program. Release 2.0; 2005.

Bosse E, Roy J, Wark S. Concepts, models, and tools for information fusion. Norwood, MA: Artech House, Inc., 2007.

Chen N, Lu W, Yang J, Li G. Support vector machine in chemistry. Singapore: World Scientific Publishing, 2004.

Cristianini N, Shawe-Taylor J. An introduction to support vector machines and other kernel-based learning methods. Cambridge: Cambridge University Press, 2000.

Cros AFA. Action de l’alcohol amylique sur l’organisme. Strasbourg, University of Strasbourg, Thesis, 1863.






Crum-Brown A, Fraser TR. On the connection between chemical constitution and physiological action. Pt 1. On the physiological action of the salts of the ammonium bases, derived from Strychnia, Brucia, Thebia, Codeia, Morphia, and Nicotia. T Roy Soc Edin 1868-1869;25:151-203.

Frisch MJ, Trucks GW, Schlegel HB, et al. Gaussian 03W, Revision C.02. Wallingford: Gaussian Inc., 2004.

Furusjö E, Svenson A, Rahmberg M, Andersson M. The importance of outlier detection and training set selection for reliable environmental QSAR predictions. Chemosphere 2006;63:99-108.

Geladi P, Kowalski BR. Partial least-squares regression: a tutorial. Anal Chim Acta 1986;185:1-17.

Goodman IR, Mahler RPS, Nguyen HT. Mathematics of data fusion. Dordrecht, Boston: Kluwer Academic Publishers, 1997.

Gordon MS, Schmidt MW. Advances in electronic structure theory: GAMESS a decade later. In: Dykstra CE, Frenking G, Kim KS, Scuseria GE (eds.): Theory and applications of computational chemistry: the first forty years (pp 1167-1189). Amsterdam: Elsevier, 2005.

Hall DL, McMullen SAH. Mathematical techniques in multisensor data fusion. Boston, MA: Artech House, Inc., 2004.

Hammett LP. Some relations between reaction rates and equilibrium constants. Chem Rev 1935;17:125-36.

Hammett LP. The effect of structure upon the reactions of organic compounds. Benzene derivatives. J Am Chem Soc 1937;59:96-103.

Hansch C, Fujita T. ρ-σ-π analysis. A method for the correlation of biological activity and chemical structure. J Am Chem Soc 1964;86:1616-26.

Hansch C, Leo A. Exploring QSAR. Washington, DC: American Chemical Society, 1995.

Helguera AM, Combes RD, Gonzalez MP, Cordeiro MN. Applications of 2D descriptors in drug design: a DRAGON tale. Curr Top Med Chem 2008;8:1628-55.

Höskuldsson A. PLS regression methods. J Chemometr 1988;2:211-28.

Karelson M, Lobanov VS, Katritzky AR. Quantum-chemical descriptors in QSAR/QSPR studies. Chem Rev 1996;96:1027-44.

Karlström G, Lindh R, Malmqvist P-Å, et al. MOLCAS: a program package for computational chemistry. Comput Mater Sci 2003;28:222-39.

Katritzky AR, Gordeeva EV. Traditional topological indices vs electronic, geometrical, and combined molecular descriptors in QSAR/QSPR research. J Chem Inf Comput Sci 1993;33:835-57.

Katritzky AR, Karelson M, Petrukhin R. CODESSA PRO, Florida, U.S.A., 2005.

Kendall RA, Aprà E, Bernholdt DE, et al. High performance computational chemistry: An overview of NWChem a distributed parallel application. Comput Phys Commun 2000;128:260-83.

Kim K. Outliers in SAR and QSAR: 2. Is a flexible binding site a possible source of outliers? J Comput Aid Mol Des 2007a;21:421-35.






Kim K. Outliers in SAR and QSAR: Is unusual binding mode a possible source of outliers? J Comput Aid Mol Des 2007b;21:63-86.

Labute P. A widely applicable set of descriptors. J Mol Graph Model 2000;18:464-77.

Meyer H. Zur Theorie der Alkoholnarkose. Arch Exp Path Pharm 1899;42:109-18.

Molecular Networks GmbH Computerchemie. ADRIANA.Code, Erlangen, Germany, 2008.

Nantasenamat C, Naenna T, Isarankura-Na-Ayudhya C, Prachayasittikul V. Recognition of DNA splice junction via machine learning approaches. Excli J 2005a;4:114-29.

Nantasenamat C, Naenna T, Isarankura-Na-Ayudhya C, Prachayasittikul V. Quantitative prediction of imprinting factor of molecularly imprinted polymers by artificial neural network. J Comput Aid Mol Des 2005b;19:509-24.

Nantasenamat C, Tantimongcolwat T, Naenna T, Isarankura-Na-Ayudhya C, Prachayasittikul V. Prediction of selectivity index of pentachlorophenol-imprinted polymers. Excli J 2006;5:150-63.

Nantasenamat C, Isarankura-Na-Ayudhya C, Naenna T, Prachayasittikul V. Quantitative structure-imprinting factor relationship of molecularly imprinted polymers. Biosens Bioelectron 2007a;22:3309-17.

Nantasenamat C, Isarankura-Na-Ayudhya C, Tansila N, Naenna T, Prachayasittikul V. Prediction of GFP spectral properties using artificial neural network. J Comput Chem 2007b;28:1275-89.

Nantasenamat C, Isarankura-Na-Ayudhya C, Naenna T, Prachayasittikul V. Prediction of bond dissociation enthalpy of antioxidant phenols by support vector machine. J Mol Graph Model 2008;27:188-96.

Overton CE. Studien über die Narkose. Jena: Fischer, 1901.

Randić M. The nature of chemical structure. J Math Chem 1990;4:157-84.

Randić M, Razinger M. On characterization of 3D molecular structure. In: Balaban AT (ed.): From chemical topology to three-dimensional geometry. New York: Plenum Press, 1997.

Richet MC. Note sur le rapport entre la toxicité et les propriétes physiques des corps. Compt Rend Soc Biol (Paris) 1893;45:775-6.

Schmidt MW, Baldridge KK, Boatz JA, et al. General atomic and molecular electronic structure system. J Comput Chem 1993;14:1347-63.

Schrödinger, Inc. Jaguar, Version 7.5207; Portland, OR, 2008.

Shao Y, Molnar LF, Jung Y, et al. Advances in methods and algorithms in a modern quantum chemistry program package. Phys Chem Chem Phys 2006;8:3172-91.

Stewart J. MOPAC2009, Colorado, USA, 2009.

Sukumar N, Breneman CM. RECON, Version 5.5; New York, USA, 2002.

Taft RW. Separation of polar, steric and resonance effects in reactivity. In: Newman MS (ed.): Steric effects in organic chemistry (pp 556-675). New York: Wiley, 1956.






Talete srl. DRAGON, Milano, Italy, 2007.

Tetko IV, Gasteiger J, Todeschini R, et al. Virtual computational chemistry laboratory - design and description. J Comput Aid Mol Des 2005;19:453-63.

Todeschini R, Consonni V. Handbook of molecular descriptors, Vol. 11. Weinheim: Wiley-VCH, 2000.

Torra V. Information fusion in data mining. Secaucus, NJ: Springer-Verlag, 2003.

Tropsha A, Gramatica P, Gombar VK. The importance of being earnest: Validation is the absolute essential for successful application and interpretation of QSPR models. QSAR Comb Sci 2003;22:69-77.

Verma RP, Hansch C. An approach toward the problem of outliers in QSAR. Bioorg Med Chem 2005;13:4597-621.

Wang L. Support vector machines: theory and applications. New York: Springer-Verlag, 2005.

Wavefunction, Inc. Spartan'04, Irvine, California, USA, 2004.

Wold S, Trygg J, Berglund A, Antti H. Some recent developments in PLS modeling. Chemometr Intell Lab 2001;58:131-50.

Worachartcheewan A, Nantasenamat C, Naenna T, Isarankura-Na-Ayudhya C, Prachayasittikul V. Modeling the activity of furin inhibitors using artificial neural network. Eur J Med Chem 2009;44:1664-73.

Xue L, Bajorath J. Molecular descriptors in chemoinformatics, computational combinatorial chemistry, and virtual screening. Comb Chem High Throughput Screening 2000;3:363-72.
