    An Introduction to QSAR Methodology

    Introduction

Drug design is an iterative process which begins with a compound that displays an interesting biological profile and ends with optimizing both the activity profile for the molecule and its chemical synthesis. The process is initiated when the chemist conceives a hypothesis which relates the chemical features of the molecule (or series of molecules) to the biological activity. Without a detailed understanding of the biochemical process(es) responsible for activity, the hypothesis generally is refined by examining structural similarities and differences for active and inactive molecules. Compounds are then selected for synthesis which maximize the presence of the functional groups or features believed to be responsible for activity.

The combinatorial possibilities of this strategy for even simple systems can be explosive. As an example, the number of compounds required for synthesis in order to place 10 substituents on the four open positions of an asymmetrically disubstituted benzene ring system is approximately 10,000. The alternative to this labor-intensive approach to compound optimization is to develop a theory that quantitatively relates variations in biological activity to changes in molecular descriptors which can easily be obtained for each compound. A Quantitative Structure Activity Relationship (QSAR) can then be utilized to help guide chemical synthesis. This chapter develops the concepts used to derive a QSAR and reviews the application of these techniques to medicinal research.

    Statistical Concepts

Computational chemistry represents molecular structures as numerical models and simulates their behavior with the equations of quantum and classical physics. Available programs enable scientists to easily generate and present molecular data including geometries, energies and associated properties (electronic, spectroscopic and bulk). The usual paradigm for displaying and manipulating these data is a table in which compounds are defined by individual rows and molecular properties (or descriptors) are defined by the associated columns. A QSAR attempts to find consistent relationships between the variations in the values of molecular properties and the biological activity for a series of compounds, so that these "rules" can be used to evaluate new chemical entities.

    A QSAR generally takes the form of a linear equation

Biological Activity = Const + (C1 × P1) + (C2 × P2) + (C3 × P3) + ...

where the parameters P1 through Pn are computed for each molecule in the series and the coefficients C1 through Cn are calculated by fitting variations in the parameters and the biological activity. Since these relationships are generally discovered through the application of statistical techniques, a brief introduction to the principles behind the derivation of a QSAR follows.

The work reported from The Sandoz Institute for Medical Research on the development of novel analgesic agents [1] can be used as an example of a simple QSAR. In this study, vanillylamides and vanillylthioureas related to capsaicin were prepared and their activity was tested in an in vitro assay which measured 45Ca2+ influx into dorsal root ganglia neurons. The data, which were reported as the EC50 (μM), are shown in Table 1 (note that compound 6f is the most active of the series).

TABLE 1
Capsaicin Analogs Activity Data

Cmpd Number   Cmpd Name   X          EC50 (μM)
1             6a          H          11.80 ± 1.90
2             6b          Cl          1.24 ± 0.11
3             6d          NO2         4.58 ± 0.29
4             6e          CN         26.50 ± 5.87
5             6f          C6H5        0.24 ± 0.30
6             6g          N(CH3)2     4.39 ± 0.67
7             6h          I           0.35 ± 0.05
8             6i          NHCHO        ???

In the absence of additional information, the only way to derive a best "guess" for the activity of 6i is to calculate the average of the values for the current compounds in the series. The average, 7.24, provides a guess for the value of compound 8, but how good is this guess? The graphical presentation of the data points is shown in Graph 1.

GRAPH 1
Capsaicin Analogs Activity Data

The standard deviation of the data, s, shows how far the activity values are spread about their average. This value provides an indication of the quality of the guess by showing the amount of variability inherent in the data. The standard deviation is calculated as the square root of the sum of the squared deviations from the average divided by n - 1.

Rather than relying on this limited analysis, one would like to develop an understanding of the factors that influence activity within this series and use this understanding to predict activity for new compounds. In order to accomplish this objective, one needs:

binding data measured with sufficient precision to distinguish between compounds;

a set of parameters which can be easily obtained and which are likely to be related to receptor affinity;

a method for detecting a relationship between the parameters and binding data (the QSAR); and

a method for validating the QSAR.

The QSAR equation is a linear model which relates variations in biological activity to variations in the values of computed (or measured) properties for a series of molecules. For the method to work efficiently, the compounds selected to describe the "chemical space" of the experiments (the training set) should be diverse. In many synthesis campaigns, compounds are prepared which are structurally similar to the lead structure. Not surprisingly, the activity values for this series of compounds will frequently span a limited range as well. In these cases, additional compounds must be made and tested to fill out the training set.

The quality of any QSAR will only be as good as the quality of the data which is used to derive the model. Dose-response curves need to be smooth, contain enough points to assure accuracy, and should span two or more orders of magnitude. Multiple readings for a given observation should be reproducible and have relatively small errors. The issue being addressed is the signal-to-noise ratio: the variation of the readings obtained by repeatedly testing the same compound should be much smaller than the variation over the series. In cases where the data collected from biological experiments do not follow these guidelines, other methods of data analysis should be utilized, since the QSAR models derived from the data will be questionable.

Once biological data have been collected, it is often found that the data are expressed in terms which cannot be used directly in a QSAR analysis. Since QSAR is based on the relationship of free energy to equilibrium constants, the data for a QSAR study must be expressed in terms of the free energy changes that occur during the biological response. When examining the potency of a drug (the dosage required to produce a biological effect), the change in free energy can be taken to be proportional to the logarithm of the reciprocal of the concentration of the compound.

ΔG° = -2.3RT log K ∝ log 1/[S]

Further, since biological data are generally found to be skewed, the log transformation moves the data to a nearly normal distribution. Thus, when measuring responses under equilibrium conditions, the most frequent transformation used is to express concentration values (such as IC50, EC50, etc.) as log[C] or log 1/[C]. The transformed data for the capsaicin agonists are shown in Table 2.
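The transformation itself is straightforward; the sketch below (Python, using the EC50 values from Table 1) reproduces the Log EC50 and Log 1/EC50 columns of Table 2.

```python
# Minimal sketch: log transformation of the capsaicin EC50 data (Table 1 -> Table 2).
import math

ec50 = {"H": 11.80, "Cl": 1.24, "NO2": 4.58, "CN": 26.50,
        "C6H5": 0.24, "N(CH3)2": 4.39, "I": 0.35}

for x, c in ec50.items():
    log_c = math.log10(c)          # Log EC50 column of Table 2
    log_inv_c = math.log10(1 / c)  # Log 1/EC50 column (= -Log EC50)
    print(f"{x:10s}  {c:6.2f}  {log_c:6.2f}  {log_inv_c:6.2f}")
```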

TABLE 2
Capsaicin Analogs Transformed Data

Cmpd Number   Cmpd Name   X          EC50 (μM)        Log EC50   Log 1/EC50
1             6a          H          11.80 ± 1.90       1.07       -1.07
2             6b          Cl          1.24 ± 0.11       0.09       -0.09
3             6d          NO2         4.58 ± 0.29       0.66       -0.66
4             6e          CN         26.50 ± 5.87       1.42       -1.42
5             6f          C6H5        0.24 ± 0.30      -0.62        0.62
6             6g          N(CH3)2     4.39 ± 0.67       0.64       -0.64
7             6h          I           0.35 ± 0.05      -0.46        0.46
8             6i          NHCHO        ??                ??          ??

The effect of this transformation on the spread of the data relative to the average is shown in Graph 2. Note that the data points, projected onto the Y-axis, have become more uniformly distributed.

GRAPH 2
Capsaicin Analogs Transformed Data

Given the transformed data, our best guess for the activity of 6i is still the average of the data set (or 0.40). As before, the error associated with this guess is calculated as the square root of the average of the squares of the deviations from the average.
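As a concrete check on these numbers, the short sketch below (Python, standard library only, values from Table 2) reproduces the mean of 0.40 and the spread of roughly 0.76 used later in this section; the sample standard deviation (n - 1 in the denominator) is assumed.

```python
# Minimal sketch: mean and spread of the transformed capsaicin data (Table 2, Log EC50).
import math

log_ec50 = [1.07, 0.09, 0.66, 1.42, -0.62, 0.64, -0.46]

n = len(log_ec50)
mean = sum(log_ec50) / n                      # best "guess" for an untested analog (~0.40)
ss = sum((y - mean) ** 2 for y in log_ec50)   # sum of squared deviations (~3.49)
s = math.sqrt(ss / (n - 1))                   # sample standard deviation (~0.76)

print(f"mean = {mean:.2f}, s = {s:.2f}")
```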

This is an example data set intended to show the general approach; real data sets would have many more compounds and descriptors. Since the purpose of a QSAR is to highlight relationships between activity and structural features, we would like to find one or more structural features which relate these molecules and their associated activity. Additionally, we would like to find a parameter that works consistently for all of the molecules in the series.

There are several potential classes of parameters used in QSAR studies. Substituent constants and other physico-chemical parameters (such as Hammett sigma constants) measure the electronic effects of a group on the molecule. Fragment counts are used to enumerate the presence of specific substructures. Other parameters can include topological descriptors and values derived from quantum chemical calculations.

The selection of parameters is an important first step in any QSAR study. If the association between the parameter(s) selected and activity is strong, then activity predictions will be possible. If there is only weak association, knowing the value of the parameter(s) will not help in predicting activity. Thus, for a given study, parameters should be selected which are relevant to the activity for the series of molecules under investigation, and these parameters should have values which are obtained in a consistent manner.

The Sandoz group divided their analysis of the capsaicin analogs into three regions: the A-region, which was occupied by an aromatic ring; the B-region, which was defined by an amide bond; and the C-region, which was occupied by a hydrophobic side-chain (see figure in Table 1). The hypothesis for the C-region assumed that a small, hydrophobic substituent would increase activity. Given this assumption, the parameters selected to best define this characteristic were molar refractivity, MR (size), and π, the hydrophobic substituent constant. These values are given in Table 3.

TABLE 3
Capsaicin Analogs Parameter Values

Cmpd Number   Cmpd Name   X          Log EC50     π       MR
1             6a          H            1.07      0.00     1.03
2             6b          Cl           0.09      0.71     6.03
3             6d          NO2          0.66     -0.28     7.36
4             6e          CN           1.42     -0.57     6.33
5             6f          C6H5        -0.62      1.96    25.36
6             6g          N(CH3)2      0.64      0.18    15.55
7             6h          I           -0.46      1.12    13.94
8             6i          NHCHO         ??        ??       ??

The data above can be analyzed for relationships by two means: graphically and statistically. The most visual approach to a problem with a limited number of variables is graphical. In this case, a plot of activity versus either molar refractivity or hydrophobicity gives some insight into the relationship between the parameters and activity. The plots derived by the Sandoz group are reproduced in Graph 3.

GRAPH 3
Capsaicin Analogs Parameter Values

Does the graph provide insight into the activity for compound 6i? Does knowing the value of either the hydrophobicity or molar refractivity parameter for this compound provide a good estimate for activity?

Since this is a simple example where only two values are examined, the answers to these questions are a qualified yes. In more complex situations, however, where multiple parameters are correlated to activity, statistics is used to derive an equation which relates activity to the parameter set. The linear equation which defines the best model for this set of data is

Log EC50 = 0.764 - 0.817(π)
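These coefficients come from an ordinary least-squares fit of Log EC50 against π. The sketch below (Python with NumPy, values from Table 3) is one way to reproduce them; it also prints the residuals that appear in Table 4 and the residual standard error discussed afterwards.

```python
# Minimal sketch: least-squares fit of Log EC50 against the hydrophobic constant pi
# (values from Table 3). NumPy is assumed to be available.
import numpy as np

pi = np.array([0.00, 0.71, -0.28, -0.57, 1.96, 0.18, 1.12])
log_ec50 = np.array([1.07, 0.09, 0.66, 1.42, -0.62, 0.64, -0.46])

slope, intercept = np.polyfit(pi, log_ec50, 1)   # ~ -0.817 and ~0.764
predicted = intercept + slope * pi
residuals = log_ec50 - predicted                 # compare with Table 4

# Residual standard error; the denominator is n - 2 because two parameters were fit.
s = np.sqrt(np.sum(residuals ** 2) / (len(pi) - 2))

print(f"Log EC50 = {intercept:.3f} + ({slope:.3f}) * pi   (s = {s:.2f})")
print("residuals:", np.round(residuals, 2))
```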

How much confidence should we place in this model? The first step to answering this question is to determine how well the equation predicts activities for known compounds in the series. The equation above estimates the average value of the EC50 based on the value of π; because assays vary, it is not surprising that individual values will differ from the regression estimate. The difference between the calculated value and the actual (or measured) value for each compound is termed the residual from the model. The calculated values for activity and their residuals (the errors of the estimate for individual values) are shown in Table 4.

TABLE 4
Capsaicin Analogs Calculated Values

Cmpd Number   Cmpd Name   X          Log EC50     π      Calculated Log EC50   Residual
1             6a          H            1.07      0.00           0.79             0.28
2             6b          Cl           0.09      0.71           0.21            -0.12
3             6d          NO2          0.66     -0.28           1.02            -0.36
4             6e          CN           1.42     -0.57           1.26             0.16
5             6f          C6H5        -0.62      1.96          -0.81             0.19
6             6g          N(CH3)2      0.64      0.18           0.65            -0.01
7             6h          I           -0.46      1.12          -0.12            -0.34
8             6i          NHCHO         ??      -0.98           1.60              ??

The residuals are one way to quantify the error in the estimates for individual values calculated by the regression equation for this data set. The standard error for the residuals is calculated by taking the root-mean-square of the residuals (in this calculation, the denominator is decremented by two to reflect the estimation of two parameters).

In order to be an improved model, the standard deviation of the residuals calculated from the model should be smaller than the standard deviation of the original data. The standard error about the mean was previously calculated to be 0.76, whereas the standard error from the QSAR model is 0.28. Clearly, the use of linear regression has improved the accuracy of our analysis. The plot of measured values versus calculated values is shown in Graph 4 with a 45° line.

GRAPH 4
Capsaicin Analogs Predicted Versus Actual EC50 Values

There are several assumptions inherent in deriving a QSAR model for a series of compounds. First, it is assumed that parameters can be calculated (or measured in some cases) more accurately and cheaply than activity can be measured. Second, it is assumed that deviations from the best fit line follow a normal (Gaussian) distribution. Finally, it is assumed that any variation about the line described by the QSAR equation is independent of the magnitude of both the activity and the parameters. Given these assumptions, the quality of the model can be gauged using a variety of techniques.

Variation in the data is quantified by the correlation coefficient, r, which measures how closely the observed data track the fitted regression line. Errors in either the model or in the data will lead to a bad fit. This indicator of fit to the regression line is calculated as

r² = Regression Variance / Original Variance

where the Regression Variance is defined as the Original Variance minus the Variance around the regression line. The Original Variance is the sum of the squared distances of the original data from the mean. This can be viewed graphically as shown in Graph 5.

    The calculation is carried out as follows:

Original Variance = (1.07 - 0.40)² + (0.09 - 0.40)² + ...

Original Variance = 3.49

Variance around the line = (0.28)² + (-0.12)² + (-0.36)² + ...

Variance around the line = 0.40

Regression Variance = Original Variance - Variance around the line

Regression Variance = 3.49 - 0.40 = 3.09

r² = Regression Variance / Original Variance

r² = 3.09 / 3.49

r² = 0.89
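The same bookkeeping can be written directly from these definitions; a short sketch (Python with NumPy, using the activities and residuals from Table 4) is shown here.

```python
# Minimal sketch: r^2 from the original and residual variances (values from Table 4).
import numpy as np

log_ec50 = np.array([1.07, 0.09, 0.66, 1.42, -0.62, 0.64, -0.46])
residuals = np.array([0.28, -0.12, -0.36, 0.16, 0.19, -0.01, -0.34])

original_variance = np.sum((log_ec50 - log_ec50.mean()) ** 2)    # ~3.49
variance_around_line = np.sum(residuals ** 2)                    # ~0.40
regression_variance = original_variance - variance_around_line   # ~3.09

r_squared = regression_variance / original_variance              # ~0.89
print(f"r^2 = {r_squared:.2f}")
```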

Possible values reported for r² fall between 0 and 1. An r² of 0 means that there is no relationship between activity and the parameter(s) selected for the study. An r² of 1 means there is perfect correlation. The interpretation of the r² value for the capsaicin analogs is that 89% of the variation in the value of the Log EC50 is explained by variation in the value of π, the hydrophobicity parameter.

GRAPH 5
Capsaicin Analogs Derivation of r² Values

While the fit of the data to the regression line is excellent, how can one decide whether this correlation is based purely on chance? The higher the value of r², the less likely it is that the relationship is due to chance. If many explanatory variables are used in a regression equation, it is possible to get a good fit to the data due to the flexibility of the fitting process: a line will fit two points perfectly, a quadratic curve will fit three, and multiple linear regression will fit the observed data if there are enough explanatory variables [2]. Given the assumption that the data have a Gaussian distribution, the F statistic assesses the statistical significance of the regression equation.

The F statistic is calculated from r² and the number of data points (or degrees of freedom) in the data set; the F ratio for the capsaicin analogs is calculated in the sketch below. This value often appears as standard output from statistical programs, or it can be checked against statistical tables to determine the significance of the regression equation. In this case, the probability that there is no relationship between activity and the value of π is less than 1% (p = 0.01).
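The original expression for the F ratio did not survive the page conversion, so the sketch below uses the standard textbook form for a regression with k fitted descriptors; treat the exact numbers as illustrative.

```python
# Sketch of the F ratio for a one-descriptor regression:
# F = (r^2 / k) / ((1 - r^2) / (n - k - 1)), the standard form; the exact expression
# used in the original chapter is not reproduced here.
r_squared = 0.89   # from the capsaicin fit above
n = 7              # number of compounds
k = 1              # number of fitted descriptors (pi)

f_ratio = (r_squared / k) / ((1 - r_squared) / (n - k - 1))
print(f"F(1, {n - k - 1}) = {f_ratio:.1f}")
```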

We have found that hydrophobicity values correlate well with biological activity. Does the addition of a size parameter (MR) improve our model? In order to analyze a relationship which is possibly influenced by several variables (or properties), it is useful to assess the contribution of each variable. π and MR appear to be somewhat correlated in this data set, so the order of fitting can influence how much the second variable helps the first. Multiple linear regression is used to determine the relative importance of multiple variables to the overall fit of the data.

Multiple linear regression attempts to maximize the fit of the data to a regression equation for the biological activity (minimize the squared deviations from the regression equation and maximize the r² value) by adjusting each of the available parameters up or down. Regression programs often approach this task in a stepwise fashion; that is, successive regression equations are derived in which parameters are either added or removed until the r² and s values are optimized. The magnitude of the coefficients derived in this manner indicates the relative contribution of the associated parameter to biological activity.

There are two important caveats in applying multiple regression analysis. The first is based on the fact that, given enough parameters, any data set can be fitted to a regression line. The consequence of this is that regression analysis generally requires significantly more compounds than parameters; a useful rule of thumb is three to six times the number of parameters under consideration. The second is that regression analysis is most effective for interpolation, while it is extrapolation that is most useful in a synthesis campaign (i.e., the region of experimental space described by the regression analysis has been explained, but projecting to a new, unanalyzed region can be problematic).

Using multiple regression for the capsaicin analogs, one can derive the following equation, which relates hydrophobicity and molar refractivity to biological activity.

Log EC50 = 0.762 - 0.819(π) + 0.011(MR)    (s = 0.313, r² = 0.888)
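The two-descriptor fit can be reproduced with an ordinary least-squares solve; the sketch below (Python with NumPy, values from Table 3) should give coefficients close to those quoted above.

```python
# Minimal sketch: multiple linear regression of Log EC50 on pi and MR (Table 3 values).
import numpy as np

pi = np.array([0.00, 0.71, -0.28, -0.57, 1.96, 0.18, 1.12])
mr = np.array([1.03, 6.03, 7.36, 6.33, 25.36, 15.55, 13.94])
log_ec50 = np.array([1.07, 0.09, 0.66, 1.42, -0.62, 0.64, -0.46])

# Design matrix with an intercept column.
X = np.column_stack([np.ones_like(pi), pi, mr])
coef, *_ = np.linalg.lstsq(X, log_ec50, rcond=None)
const, c_pi, c_mr = coef                                     # ~0.762, ~-0.819, ~0.011

residuals = log_ec50 - X @ coef
s = np.sqrt(np.sum(residuals ** 2) / (len(log_ec50) - 3))    # n - 3: three fitted parameters

print(f"Log EC50 = {const:.3f} + ({c_pi:.3f})*pi + ({c_mr:.3f})*MR   (s = {s:.3f})")
```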

    To judge the importance of a regression term, three items need to be considered.

    1. Statistical significance of the regression coefficient.

2. The magnitude of the typical effect, coefficient × parameter value (in this case, 0.011 × 25.36 ≈ 0.28).

    3. Any cross-correlation with other terms.

As more terms are added to a multiple linear regression, r² always gets larger. We recompute the previous calculations (r² = 0.89) carrying three significant figures so that rounding does not lead to confusion.

The results of this analysis indicate that, within this series, steric bulk is not an important factor in activity. The influence of the hydrophobicity constant confirms the presence of a hydrophobic binding site. Given the limited number of substituents in this analysis, it is unlikely that more can be learned from further analysis.

This section has developed the fundamental mathematics of QSAR studies. Several authors have published reviews of QSAR and have discussed various aspects of the methods [3-8]. Each of the examples to follow uses these techniques to derive information about the chemical factors which are important for activity.

    Approaches to Developing a QSAR

Drugs exert their biological effects by participating in a series of events which include transport, binding with the receptor and metabolism to an inactive species. Since the interaction mechanisms between the molecule and the putative receptor are unknown in most cases (i.e., no bound crystal structures), one is reduced to making inferences from properties which can easily be obtained (molecular properties and descriptors) to explain these interactions for known molecules. Once the relationship is defined, it can be used to aid in the prediction of new or unknown molecules.

The first approach to developing quantitative relationships which described activity as a function of chemical structure relied on the principles of thermodynamics. The free-energy terms E, H and S were represented by a series of parameters which could be derived for a given molecule.

Electronic effects such as electron donating and withdrawing tendencies, partial atomic charges and electrostatic field densities were defined by Hammett sigma (σ) values, resonance parameters (R values), inductive parameters (F values) and Taft substituent values (σ*, ρ*, Es). Steric effects such as molecular volume and surface area were represented by values calculated for molar refractivity and the Taft steric parameter.

Enthalpic effects were calculated using partition coefficients (LogP) or the hydrophobic parameter, π, which was derived from the partition coefficient. In addition, an assortment of structural indices was used to describe the presence of specific functional groups at positions within the molecule. The linear equation which described the relationship between activity and this parameter set was the Hansch equation

log 1/[C] = A(logP) - B(logP)² + C(Es) + D(σ) + E + ...

Multiple linear regression analysis was used to derive the values of the coefficients. In general, Hansch-type studies were performed on compounds which contained a common template (usually a rigid one such as an aromatic ring) with structural variation limited to functional group changes at specific sites.

Hansch utilized this approach in his analysis of 256 4,6-diamino-1,2-dihydro-2,2-dimethyl-1-(X-phenyl)-s-triazines which were active against tumor dihydrofolate reductase [9]. It was demonstrated that for 244 of the compounds, activity could be correlated to the presence of hydrophobic groups at the three and four positions of the N-phenyl ring. The parameters used to derive this correlation were the hydrophobic constant (π) and molar refractivity constant (MR) for meta and para substituents on the N-phenyl ring, together with six indicator variables I1-I6 which were used to indicate the presence (a value of 1) or absence (a value of 0) of specific structural features. The equation which was formulated from these data using the method of least squares is shown below.

FIGURE 5
Analysis of the Baker Triazines

log 1/[C] = 0.680(π3) - 0.118(π3)² + 0.230(MR4) - 0.024(MR4)² + 0.238(I1) - 2.530(I2) - 1.991(I3) + 0.877(I4) + 0.686(I5) + 0.704(I6) + 6.489

    n = 244, r = 0.923, s = 0.377

The optimal values for MR4 (4.7) and π3 (2.9) were obtained from the partial derivatives of the equation; for a term of the form aX - bX², setting the derivative to zero gives X = a/2b, so for example π3 = 0.680/(2 × 0.118) ≈ 2.9. Note that the number of compounds in the data set was reduced to 244: Hansch and Silipo reported improvements in the values of r and s by removing 12 compounds which were incorrectly predicted by a factor of 10 or more.

While there are limits to the Hansch approach, it permitted complex biological systems to be modeled successfully using simple parameters. The approach has been used successfully to predict substituent effects in a wide number of biological assays. The main problem with the approach was the large number of compounds which were required to adequately explore all structural combinations. Further, the analysis methods did not lend themselves to the consideration of conformational effects. Several authors have published articles which provide additional background on the Hansch approach [10-11].

Alternative approaches to compound design have been suggested which avoid the combinatorial problem found in Hansch-type analyses. Free and Wilson used a series of substituent constants which related biological activity to the presence of a specific functional group at a specific location on the parent molecule [12]. The relationship between biological activity and the presence or absence of a substituent was then expressed by the following equation:

Activity = A + Σi Σj Gij Xij

where A was defined as the average biological activity for the series, Gij the contribution to activity of functional group i in the jth position, and Xij the presence (1.0) or absence (0.0) of functional group i in the jth position.

The procedure used the equation above to build a matrix for the series and represented this matrix as a series of equations. Substituent constants were then derived for every functional group at every position. Statistical tests were used to test the importance of the constants. If the models were shown to be valid, the model was used to predict activity values for compounds which had not been prepared. In general, while a large number of compounds are required to explore the effects of multiple substitution patterns, the Free-Wilson approach substantially reduces the number of analogs required. However, the method demands that the effects of substituents are additive.
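A minimal sketch of this bookkeeping is given below; the substituents, positions and activity values are hypothetical, invented only to show how the indicator matrix is set up and solved.

```python
# Minimal Free-Wilson sketch with hypothetical data: each row of X flags the presence (1)
# or absence (0) of a functional group at a position; the fitted coefficients are the
# group contributions G_ij and the intercept approximates the series average A.
import numpy as np

# Columns (hypothetical): Cl at position 3, OCH3 at position 3, Cl at position 4, OCH3 at position 4.
X = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
], dtype=float)
activity = np.array([5.1, 4.3, 5.6, 4.8, 6.0, 4.9])   # hypothetical log 1/C values

design = np.column_stack([np.ones(len(X)), X])        # intercept + indicator variables
coef, *_ = np.linalg.lstsq(design, activity, rcond=None)

print("A (intercept):", round(coef[0], 2))
print("group contributions G_ij:", np.round(coef[1:], 2))
```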

In 1972, John Topliss published a paper which detailed a methodology to automate the Hansch approach [2]. The method assumed that the lead compound of interest contained at least one phenyl ring which could serve as the template for functional group modifications. The first modification to the template was preparation of the para-chloro derivative to examine lipophilicity. Additional substitution patterns were then made sequentially in an attempt to explore and optimize the relationship between activity and the hydrophobic and electronic character of the molecule. While the Topliss approach is easy to follow, it has several drawbacks. The primary problems are that the procedure is not applicable to all types of studies and that there is a high degree of risk associated with its use (it essentially ignores the possibility of interactions between substituents as it changes one substituent at a time).

The use of classical QSAR expanded during the 1960s as a means of correlating observed activity to chemical properties. However, there are many areas where these techniques could not be used or where they failed to provide useful correlations. These included situations in which activity was found to be determined by 3-dimensional geometry, where poor training sets of compounds were used or the sets of compounds were too small or insufficiently diverse, and cases where biological activity could not be well quantified. Many of these problems were addressed by extensions to the Hansch method and the development of alternative approaches to QSAR.

There are cases where biological activity values cannot be determined accurately for a variety of reasons, e.g. lack of sensitivity of a particular test system. Alternative statistical techniques can be used in these cases; the problem is simplified to a classification scheme in which compounds are labeled as active, partially active, inactive, etc. The resulting data set is then searched for patterns which predict these categories. The methods which have been used for this type of analysis include SIMCA (Soft Independent Modeling of Class Analogy) [13], ADAPT (Automated Data Analysis by Pattern recognition Techniques) [14], CASE (Computer Automated Structure Evaluation) [15] and CSA (Cluster Significance Analysis) [16].

Pattern recognition methods [17] attempt to define the set of parameter values which will result in clustering compounds of similar activity into regions of n-dimensional space. The methods used to accomplish this goal can be parametric or nonparametric. Parametric methods search the n-dimensional space for clusters of compounds based on their calculated properties. Nonparametric methods do not use derived values (e.g., mean vectors and covariance matrices), but instead use the original data to find clustering definitions and apply iterative procedures to find the linear set of parameters which best define the classification scheme.

Where the methods described above develop discriminant functions, SIMCA methods use Principal Component Analysis (PCA) to describe the data set. The objective of PCA is to reduce the number of variables which describe biological activity or chemical properties to a relatively few independent ones. This is accomplished through an analysis of the correlation matrix of biological or chemical properties.

Principal component analysis can be used to create derived variables for each class (e.g., active and inactive) separately by decomposing the correlation matrix; this method is useful for pointing out redundancies or interrelationships among the variables. PCA seeks to find simplified relationships in data by transforming the original parameters into a new set of uncorrelated variables which are termed principal components. The symmetric correlation matrix is decomposed by an eigenvalue decomposition. The largest eigenvalue and its eigenvector are used to form a linear combination of the original variables with maximum variance. Successively smaller eigenvalues and vectors produce linear combinations of the original variables with diminishing variance. Successive eigenvectors are independent of one another. The simplification is derived by disregarding eigenvectors associated with small eigenvalues. In summary, the procedure finds the set of orthogonal axes for the data which decomposes the variance in the data.
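A bare-bones version of this decomposition, assuming NumPy and an arbitrary (here random) descriptor table standing in for real molecular properties, might look like the sketch below.

```python
# Minimal PCA sketch: eigendecomposition of the correlation matrix of a descriptor table.
import numpy as np

rng = np.random.default_rng(0)
descriptors = rng.normal(size=(20, 5))               # 20 compounds x 5 properties (synthetic)

corr = np.corrcoef(descriptors, rowvar=False)        # symmetric correlation matrix
eigenvalues, eigenvectors = np.linalg.eigh(corr)     # eigh: eigendecomposition of a symmetric matrix

# Sort from largest to smallest eigenvalue; each eigenvector defines one principal component.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

explained = eigenvalues / eigenvalues.sum()          # fraction of variance per component

# Project the standardized data onto the first two components; the rest are discarded.
standardized = (descriptors - descriptors.mean(axis=0)) / descriptors.std(axis=0)
scores = standardized @ eigenvectors[:, :2]

print("fraction of variance explained:", np.round(explained, 2))
```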

Another approach to examining the effects of chemical structure on activity was developed by the Jurs group. Rather than rely on multivariate statistics to highlight these relationships, Jurs used the combination of cluster analysis and pattern recognition techniques as a tool to develop these correlations. The ADAPT program generated a data set of molecular descriptors (topological, geometrical and physicochemical) derived from three-dimensional model building, projected these data points onto an n-dimensional surface and analyzed them using pattern recognition methods. The goal of this analysis was to discriminate between active and inactive compounds in a series.

Jurs has reported several applications of the methodology contained in ADAPT. In one study of chemical carcinogens [18], a linear discriminant function was derived from a set of 28 calculated structure features including fragment descriptors, substructure descriptors, environment descriptors, molecular connectivity descriptors and geometric descriptors. Two hundred and nine compounds from twelve structural classes (130 carcinogens, 79 noncarcinogens) were selected for this study. The program was used to identify a training set of 192 compounds, which was used to find the best set of descriptors and analyze the entire data set. A predictive success of 90% for carcinogenic compounds and 78% for noncarcinogenic compounds was obtained in randomized testing.

The CASE program extended the techniques in ADAPT by using topological methods to define substructural fragments which were essential for activity. CASE was able to differentiate between positional isomers. Both CASE and ADAPT are limited to analyzing structurally similar data sets.

The analysis methods described to this point have not explicitly incorporated the contribution of three-dimensional shape in the analysis of the activity of a molecule. While the use of chemical graph indexes [18], intermolecular binding distances [19], molecular surface areas [20] and electrostatic potentials [21] contains some information about the 3-D shape of molecules, the Hopfinger [22] and Marshall [23] groups were the first to exhaustively analyze these effects.

In 1979, Marshall extended the 2-D approach to QSAR by explicitly considering the conformational flexibility of a series as reflected by their 3-D shape [23]. The first step of the Active Analog Approach was to exhaustively search the conformations of a compound which was highly active in a particular biological assay. The result of the search was a map of interatomic distances which was used to filter the conformational searches of subsequent molecules in the series. The implicit assumption of the method was that all compounds which display similar activity profiles were able to adopt similar conformations. Once the "active conformation" was determined, molecular volumes for each molecule were calculated and superimposed. Regression analysis of the volumes was used to establish a relationship to biological activity. Marshall and co-workers commercialized the Active Analog Approach and a suite of other drug design techniques in the SYBYL molecular modeling program.

Hopfinger and co-workers also used 3-D shape in QSAR. In molecular shape analysis [24] of the Baker triazines, the common space shared by all molecules of a series and the differences in their potential energy fields were computed. When these calculations were combined with a set of rules for overlapping the series, comparative indices of the shape of different molecules were obtained. Inclusion of these shape descriptors in standard Hansch analysis schemes led to improved descriptions relating computed parameters to biological activity, such that no compounds in the original data set had to be eliminated from the calculations. The techniques developed by Hopfinger and co-workers were made available in the CAMSEQ, CAMSEQ-II, CHEMLAB and CAMSEQ-M computer programs.

In 1988, Richard Cramer proposed that biological activity could be analyzed by relating the shape-dependent steric and electrostatic fields of molecules to their biological activity [25]. Additionally, rather than limiting the analysis to fitting data to a regression line, CoMFA (Comparative Molecular Field Analysis) utilized new methods of data analysis, PLS (Partial Least Squares) and cross-validation, to develop models for activity predictions.

The approach used in the CoMFA procedure requires that the scientist define alignment rules for the series which overlap the putative pharmacophore for each molecule; the active conformation and alignment rule must be specified. Once aligned, each molecule is fixed into a three-dimensional grid by the program, and the electrostatic and steric components of the molecular mechanics force field, arising from interaction with a probe atom (e.g., an sp3 carbon atom), are calculated at intersecting lattice points within the 3-D grid. The equations which result from this exercise have the form

Act1 = Const1 + a1(steric xyz) + b1(steric xyz) + ... + a'1(estatic xyz) + b'1(estatic xyz) + ...
Act2 = Const2 + a2(steric xyz) + b2(steric xyz) + ... + a'2(estatic xyz) + b'2(estatic xyz) + ...
Actn = Constn + an(steric xyz) + bn(steric xyz) + ... + a'n(estatic xyz) + b'n(estatic xyz) + ...

Traditional regression methods require that the number of parameters be considerably smaller than the number of compounds in the data set (or the number of degrees of freedom in the data). The data tables which result from a CoMFA analysis have far more parameters than compounds. PLS, which removes this limitation, is used to derive the coefficients for all of the steric and electrostatic terms. PLS essentially relies upon the fact that the correlations among nearby parts of a molecule are similar, so that the real dimensionality is smaller than the number of grid points. Since these coefficients are position dependent, substituent patterns for the series are elucidated which define regions of steric bulk and electrostatic charge associated with increased or decreased activity. The size of the model (the number of components [27] needed for the best model) and the validity of the model as a predictive tool are assessed using cross-validation.
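To illustrate the shape of the problem (far more field columns than compounds), the sketch below runs scikit-learn's PLS regression on synthetic "field" values and activities; it is not the CoMFA implementation, only a stand-in for the PLS step, and assumes scikit-learn is installed.

```python
# Sketch of the PLS step on a CoMFA-like table: many more field columns than compounds.
# The field values and activities below are synthetic.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(1)
n_compounds, n_lattice_points = 20, 500
fields = rng.normal(size=(n_compounds, 2 * n_lattice_points))   # steric + electrostatic columns
activity = fields[:, :5].sum(axis=1) + rng.normal(scale=0.1, size=n_compounds)

pls = PLSRegression(n_components=3)   # in practice the component count is set by cross-validation
pls.fit(fields, activity)
predicted = pls.predict(fields).ravel()

print("fitted r^2:", round(np.corrcoef(predicted, activity)[0, 1] ** 2, 3))
```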

As opposed to traditional regression methods, cross-validation evaluates the validity of a model by how well it predicts data rather than how well it fits data. The analysis uses a "leave-one-out" scheme: a model is built with N-1 compounds and the Nth compound is predicted. Each compound is left out of the model derivation and predicted in turn. An indication of the performance of the model is obtained from the cross-validated (or predictive) r², which is defined as

r² (cross-validated) = (SD - PRESS)/SD

SD is the sum of the squared deviations of each activity from the mean. PRESS (the Predictive Sum of Squares) is the sum of the squared differences between the actual activity and the activity predicted when the compound is omitted from the fitting process.

As we have discussed, values for the conventional r² range from 0 to 1. Values for the cross-validated r² are reported by the method to range from -1 to 1. Negative values indicate that biological activity values are estimated better by the mean of the activity values than by the model (i.e., the predictions derived from the model are worse than no model). Once a model is developed which has the highest cross-validated r², this model is used to derive the conventional QSAR equation and conventional r² and s values. The results of the final model are then visualized as contour maps of the coefficients.
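Applied to the simple one-descriptor capsaicin fit from earlier in the chapter, the cross-validated r² can be computed in a few lines; the sketch below (Python with NumPy) implements the leave-one-out definition given above.

```python
# Minimal leave-one-out sketch: cross-validated r^2 = (SD - PRESS) / SD
# for the one-descriptor capsaicin model (pi and Log EC50 from Table 3).
import numpy as np

pi = np.array([0.00, 0.71, -0.28, -0.57, 1.96, 0.18, 1.12])
log_ec50 = np.array([1.07, 0.09, 0.66, 1.42, -0.62, 0.64, -0.46])

press = 0.0
for i in range(len(pi)):
    mask = np.arange(len(pi)) != i                   # leave compound i out
    slope, intercept = np.polyfit(pi[mask], log_ec50[mask], 1)
    prediction = intercept + slope * pi[i]
    press += (log_ec50[i] - prediction) ** 2         # Predictive Sum of Squares

sd = np.sum((log_ec50 - log_ec50.mean()) ** 2)       # squared deviations from the mean
q2 = (sd - press) / sd
print(f"cross-validated r^2 = {q2:.2f}")
```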

The first CoMFA study reported analyzed the binding affinities of 21 steroid structures to human corticosteroid-binding and testosterone-binding globulins. This class of compounds is rigid and was selected to eliminate conformationally dependent effects from the study. The models for each steroid were built from coordinates from the Cambridge Crystallographic Database which were minimized using the Tripos force field. Side chain positioning was accomplished using systematic conformational searching. The Field Fit algorithm was used to align each structure within the fixed lattice (the 3-D grid used to calculate the CoMFA field effects). The fit of the regression line for the predicted versus actual binding values for the corticosteroids showed a cross-validated r² of 0.65 (conventional r² = 0.897, s = 0.397). For the testosterone-binding steroids, the cross-validated r² was 0.555 (conventional r² = 0.873, s = 0.453).

As noted, CoMFA starts with defined pharmacophore and overlap rules and derives a 3-D model which can be used to predict activity for new chemical entities. The Apex and Catalyst (Accelrys, formerly Molecular Simulations Incorporated) programs are used to identify pharmacophores from databases of chemical structures and biological activity. These models are then used to predict activities for novel compounds.

Apex-3D is an automated pharmacophore identification system which can identify possible pharmacophores from a set of biologically active molecules using statistical techniques and 3-D pattern matching algorithms. The program classifies molecular structures using three methods: the agreement inductive method identifies common structural patterns in compounds having similar activity; the difference inductive method identifies structural patterns which differentiate active and inactive compounds; and the concomitant variations inductive method highlights variations in structural features that explain changes in biological activity for sets of compounds.

The methods defined above follow logic similar to that used by a practicing medicinal chemist: what pharmacophoric patterns are present in the active molecules which are not present in the inactive ones? Pharmacophores are defined by different chemical centers (atom-centered functional groups) and the distances between these centers. These descriptor centers can include such things as aromatic ring centers, electron donor ability, hydrogen bonding sites, lipophilic regions, and partial atomic charge. The information for each molecule is stored in a knowledge base in the form of rules which can be used to predict the activity of novel structures.

Apex-3D contains an expert system which automatically selects the best conformation and alignment for structures based on identified pharmacophores. When quantitative biological data are available, a 3D-QSAR model can be developed for any of the possible identified pharmacophores. Depending on the type of biological activity available, it is possible to identify pharmacophores for different binding orientations, receptor subtypes, or agonist versus antagonist activity.

To use this approach, the scientist is required to assign the training set of compounds to one or more activity classes. Specific descriptor centers also can be defined if desired. The automated pharmacophore identification portion of the program then builds the knowledge base using the following steps:

identify all possible binding interaction centers for each compound in the data set;

generate topological (2D) or topographical (3D) distance matrices based on the set of descriptors;

identify possible pharmacophores from all pairs of molecules using clique selection algorithms;

classify these pharmacophores based upon their occurrence in compounds in each activity class using Bayesian statistics and their nonchance occurrence;

set thresholds for the probability and reliability statistics associated with a pharmacophore so that all training set molecules are properly classified by the pharmacophore rules;

align compounds containing high-probability pharmacophores on the pharmacophore.

Once the knowledge base has been constructed, the scientist can use it to predict biological activity for compounds not included in the training set.

The pharmacophores defined above can be used to build 3D QSAR models by correlating indexes calculated for biophore sites, secondary sites or whole-molecule properties. 3-D QSAR models in Apex are generated and screened using a modified scheme of multiple linear regression analysis with variable selection. Special randomization checks are made to estimate the chance of fortuitous correlation. The steps involved are:

    Interactively select the pharmacophore(s) to use in the analysis;

Interactively choose parameters to include in the pool of possible parameters to be selected by the program; and

Calculate the best 3D QSAR models for each pharmacophore using stepwise multiple regression and an analysis of statistics to assess the validity and predictive power of the model.

At this point the knowledge base with 3D QSAR models can be used to calculate activities for novel compounds. Given a set of conformations for a novel structure, an activity range will be calculated based on each conformation which contains one of the pharmacophores and fits the 3D QSAR model.

Few applications of Apex have appeared in the literature to date. One example, applied to nonpeptide Angiotensin II antagonists, will be discussed for its heuristic value. Several structurally diverse compounds in this activity class have been reported in the literature [31-44]. Pharmacophore models have been postulated, with some disagreement about whether all of the highly active molecules are binding at the same site [30]. Automated pharmacophore identification can be used to analyze these compounds and to assess the probability that they could be acting at the same site. A set of 55 compounds with specific binding activity (IC50) values ranging over 6 orders of magnitude was used. Multiple conformations were included using 3D structures whose geometries were optimized by AMPAC. Four activity classes were defined; the most active class (< 100 nM) included 27 compounds. Apex-3D was able to generate rules which properly classified all compounds in the most active class without false negatives or positives. The fact that several biophores were required is consistent with the existence of multiple binding sites.

Quantitative 3D QSAR models also were developed for some of these pharmacophores. One model which contained 48 compounds had the following statistical parameters: predicted R² = 0.83, predicted RMSE = 0.86. Predicted activities for compounds excluded from the training set were within statistical boundaries.

Like Apex, Catalyst generates structure-activity hypotheses from a set of molecules of various activities. Once molecular connectivity and activity values are specified for all molecules, Catalyst derives hypotheses which consist of a set of generalized chemical functions (regions of hydrophobic surface, hydrogen bond vectors, charge centers, or other user-defined features) at specified relative positions. Up to ten hypotheses are produced and ranked by estimated statistical significance. The hypotheses can be examined graphically, fit to new molecules, or fed directly to a flexible 3D database search.

In the first step of the process, a set of representative conformers is found that covers the low-energy conformational space of each molecule. Representative structures are chosen to maximize the sampling of conformational space. The second step locates a list of candidate hypotheses that are common among active and rare among inactive compounds (the presence of all identified features is not required for inclusion in the active class, since it is unlikely that all active members of a training set possess all the important binding features). The cost of a hypothesis is defined as the number of bits needed to describe the hypothesis as well as the errors in activities as estimated by the hypothesis. The theory of minimum complexity estimation indicates that a predictive hypothesis will minimize this cost. Minimization is carried out over the space of hypotheses covered by the candidates identified above. Statistical significance of the results (a low probability of having found a chance correlation) on a variety of medicinal training sets is verified by a non-parametric randomization test.

The Genetic Function Approximation (GFA) algorithm is a novel technique for constructing QSAR models [45]. It was specifically developed for use with data sets containing many more variables than samples, or data sets which contain nonlinear relationships between the variables and the activity.

GFA begins with a population of randomly-constructed QSAR models; these models are rated using an error measure which estimates each model's relative predictiveness. The population is evolved by repeatedly selecting two better-rated models to serve as parents, then creating a new child model by using terms from each of the parent models. The worst-rated model in the population is replaced by this new model. As evolution proceeds, the population becomes enriched with higher and higher quality models.

The different models are, in effect, multiple fits to the data. Scientists can then use their scientific knowledge and intuition to select among the final models. By studying the similarities and differences among the models, the scientist may be provoked to consider alternative mechanisms which explain the data, or may plan new experiments to decide between the different possible models. Since the models usually use only a subset of the total number of variables in the data set, GFA behaves as a member of the class of variable reduction algorithms and is most useful if the critical information in the data set is concentrated in a few variables.
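A toy version of this evolutionary loop is sketched below: models are subsets of descriptor columns, rated here with a simple penalized least-squares error. The published GFA uses Friedman's lack-of-fit measure and allows spline terms, and the data in the sketch are synthetic, so treat it only as an illustration of the select-crossover-replace cycle.

```python
# Toy GFA-style loop: evolve bit vectors that mark which descriptor columns a model uses.
import numpy as np

rng = np.random.default_rng(2)
n_compounds, n_descriptors = 30, 12
X = rng.normal(size=(n_compounds, n_descriptors))
y = 1.5 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.2, size=n_compounds)

def score(model):
    """Lower is better: residual error penalized by the number of terms used."""
    cols = np.flatnonzero(model)
    if cols.size == 0:
        return np.inf
    design = np.column_stack([np.ones(n_compounds), X[:, cols]])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return np.mean(resid ** 2) * (1 + cols.size / n_descriptors)

# Population of random models.
population = [rng.integers(0, 2, n_descriptors) for _ in range(20)]

for _ in range(200):
    population.sort(key=score)
    parent_a, parent_b = population[0], population[1]        # two better-rated models
    cut = rng.integers(1, n_descriptors)                     # crossover: splice the parents
    child = np.concatenate([parent_a[:cut], parent_b[cut:]])
    if rng.random() < 0.1:                                   # occasional mutation
        child[rng.integers(n_descriptors)] ^= 1
    population[-1] = child                                   # replace the worst-rated model

best = min(population, key=score)
print("descriptors selected by the best model:", np.flatnonzero(best))
```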

Experimental results against published data sets demonstrate that GFA discovers models which are comparable to, and in some cases superior to, models discovered using standard techniques such as stepwise regression, linear regression, or partial least-squares regression. As the number of computational and instrumental sources of experimental data increases, the ability of GFA to perform variable reduction, to discover nonlinear relationships, and to present the user with multiple models representing multiple interpretations of the data set may become increasingly useful in data analysis.

There are many additional properties beyond binding which contribute to the biological activity of a molecule, including transport, distribution, metabolism and elimination. Since the techniques described above address binding and typically do not account for any of these other processes, they often fail. However, one should note that their utility is one of efficiency and probability: they are generally more successful and efficient than ad hoc/intuitive methods. Because no single method works for all cases, many groups are examining alternative approaches to developing SAR equations, including electronic-based descriptors and topological indices.

Peter Goodford has reported an energy-based grid approach to compound design. The GRID program [46] is a computational procedure for detecting energetically favorable binding sites on molecules of known structure. It has been used to study arrays of molecules in membranes and crystals, as well as proteins. The energies are calculated as the electrostatic, hydrogen-bond and Lennard-Jones interactions of a specific probe group with the target structure. GRID has also been used to distinguish between selective binding sites for different probes.

In general, topological approaches start with a very different, graph theoretic representation of molecular structure in which atoms are represented by vertexes and bonds by edges. Numerical indexes for the structure are then defined which abstract information including atom descriptors (atom types, atomic weight, atomic number, ratio of valence electrons to core electrons, etc.) and sub-group descriptors. In addition, a series of indexes (the χ indexes) is developed which describes the molecule as a set of fragments of varying size and complexity. Regression analysis of equations which have chi indexes as parameters has been used to correlate chemical structure to physicochemical behavior in applications such as chromatographic retention times, molar refractivity, ionization potential and heats of atomization. While the application of topological indices has been widely reported in QSAR [47-49], the utility of these methods generally has been limited to predicting structure-property relationships for polymers and hydrocarbons.
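As one concrete example of such an index, the sketch below computes the first-order Randić branching index (the simplest member of the chi family) for a small hydrocarbon treated as a hydrogen-suppressed graph; the molecule and its bond list are chosen only for illustration.

```python
# Sketch: first-order Randic connectivity index for a hydrogen-suppressed graph.
# chi1 = sum over bonds of 1/sqrt(d_i * d_j), where d is the degree of each vertex.
# The example graph is the carbon skeleton of 2-methylbutane, listed by hand.
import math
from collections import Counter

bonds = [(0, 1), (1, 2), (2, 3), (1, 4)]          # C-C bonds of 2-methylbutane

degree = Counter()
for i, j in bonds:
    degree[i] += 1
    degree[j] += 1

chi1 = sum(1.0 / math.sqrt(degree[i] * degree[j]) for i, j in bonds)
print(f"first-order chi index = {chi1:.3f}")
```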

    SUMMARY

Developing a quantitative structure activity relationship is difficult. Molecules are typically flexible, and it is possible to compute many possibly useful properties that might relate to activity. Early in a research program there are typically few compounds to model; thus we have a few compounds in a very high dimensional descriptor space. Which are the important variables and how do we optimize them? It is clear that many training compounds are needed to span the space, and model fitting techniques need to address not only deriving a fit but also the predictive quality of the fit. While these methods have not by themselves discovered a new compound, they have aided scientists in examining the volumes of data generated in a research program. As the methods evolve, they will find broader application in areas such as combinatorial chemistry.

