Date post: | 21-Jan-2016 |
Upload: | deirdre-carson |
CZ3253: Computer Aided Drug Design
Drug Design Methods I: QSAR
Prof. Chen Yu Zong
Tel: 6874-6877 Email: [email protected] http://xin.cz3.nus.edu.sg
Room 07-24, level 7, SOC1, National University of Singapore

Terminology
• SAR (Structure-Activity Relationships)
  – Circa 19th century?
• QSAR (Quantitative Structure-Activity Relationships)
  – Specific to some biological/pharmaceutical function of a molecule (Absorption, Distribution/Digestion, Metabolism, Excretion)
  – Brown and Fraser (1868-9)
    • ‘constitution’ related to biological response
  – LogP
• QSPR (Quantitative Structure-Property Relationships)
  – Relate structure to any physical-chemical property of a molecule

Statistical Models
• Simple
  – Mean, median and variation
  – Regression
• Advanced
  – Validation methods
  – Principal components, co-variance
  – Multiple Regression
QSAR, QSPR

Modern QSAR
  – Hansch et al. (1963)
    • Activity related to ‘travel through body’: partitioning between varied solvents
  – C (minimum dosage required)
  – π (hydrophobicity)
  – σ (electronic)
  – Es (steric)
$$\log(1/C) = a\pi + b\pi^2 + c\sigma + dE_s + \text{const.}$$

Choosing Descriptors
• Buffon’s Problem
  – Needle Length?
  – Needle Color?
  – Needle Composition?
  – Needle Sheen?
  – Needle Orientation?

Choosing Descriptors
• Constitutional
  – MW, number of atoms of each element
• Topological
  – Connectivity, Wiener index (sums of bond distances)
  – 2D Fingerprints (bit-strings)
  – 3D topographical indices, pharmacophore keys
• Electrostatic
  – Polarity, polarizability, partial charges
• Geometrical Descriptors
  – Length, width, molecular volume

Choosing Descriptors
• Chemical
  – Hydrophobicity (LogP)
  – HOMO and LUMO energies
  – Vibrational frequencies
  – Bond orders
  – Energy total
  – GSH

Statistical Methods
• 1-D analysis
• Large dimension sets require decomposition techniques
  – Multiple Regression
  – PCA
  – PLS
• Connecting a descriptor with a structural element so as to interpolate and extrapolate data

Simple Error Analysis (1-D)
• Given N data points
  – Mean
  – Variance
  – Regression
[Figure: regression plot with axes labelled x_obs, x_calc and y_obs, y_calc]
$$R = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Std}(X)\,\mathrm{Std}(Y)}$$

Simple Error Analysis (1-D)
• Given N data points
  – Regression
$$\Delta y_i = y_i^{obs} - y_i^{calc} \quad \text{(residual)}$$
[Equations relating $x^{obs}$ to $x^{calc}$ and $y^{obs}$ to $y^{calc}$]
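
Simple Error Analysis (1-D)
• Given N data points
  – 0 (Poor) < R² < 1 (Good)
$$R^2 = \frac{SSR}{N\,\mathrm{Std}(Y)^2}, \qquad SSR = \sum_{i=1}^{N} (y_{calc,i} - \bar{y})^2$$
$$R = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Std}(X)\,\mathrm{Std}(Y)}$$
Correlation between fluctuations:
$$\mathrm{Cov}(X, Y) = \frac{1}{N} \sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})$$
$$\mathrm{Std}(Y) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (Y_i - \bar{Y})^2}$$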
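The covariance, standard deviation and correlation coefficient defined on this slide can be checked numerically. The following is a minimal sketch in plain Python, assuming the population (1/N) normalisation; the function names and the sample data are illustrative and are not taken from the lecture.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Population covariance: (1/N) * sum((x_i - x_bar) * (y_i - y_bar))."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / len(xs)

def std(values):
    """Population standard deviation: square root of Cov(X, X)."""
    return math.sqrt(covariance(values, values))

def correlation(xs, ys):
    """R = Cov(X, Y) / (Std(X) * Std(Y))."""
    return covariance(xs, ys) / (std(xs) * std(ys))

# Hypothetical observations (illustrative values only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
r = correlation(x, y)
print("R =", round(r, 4), "R^2 =", round(r * r, 4))
```

A value of R close to +1 or -1 indicates that the fluctuations of the two variables about their means are strongly linked; it says nothing about which one, if either, causes the other.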

Correlation vs. Dependence?
• Correlation
  – Two or more variables/descriptors may correlate to the same property of a system
• Dependence
  – When the correlation can be shown to be due to one changing because of the change of the other
• Example: Elephant’s head and legs
  – Correlation exists between size of head and legs
  – The size of one does not depend on the size of the other

Quantitative Structure Activity/Property Relationships (QSAR, QSPR)
• Discern relationships between multiple variables (descriptors)
• Identify connections between structural traits (type of subunits, bond angles, local components) and descriptor values (e.g. activity, LogP, % denatured)

Pre-Qualifications
• Size
  – Minimum of FIVE samples per descriptor
• Verification
  – Variance
  – Scaling
  – Correlations

QSAR/QSPR Pre-Qualifications
• Variance
  – Coefficient of Variation ("Spread")
$$\frac{\text{Standard Deviation}}{\text{Mean}} = \frac{\sigma_x}{\bar{x}}$$

QSAR/QSPR Pre-Qualifications
• Scaling
  – Standardizing or normalizing descriptors to ensure they have equal weight (in terms of magnitude) in subsequent analysis

QSAR/QSPR Pre-Qualifications
• Scaling
  – Unit Variance (Auto Scaling)
  – Ensures equal statistical weights (initially)
$$x_i' = \frac{x_i}{\sigma}, \qquad \sigma' = 1$$
  – Mean Centering
$$x_i' = x_i - \bar{x}, \qquad \bar{x}' = 0$$
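The scaling steps above (coefficient of variation, unit variance and mean centering) are simple to express in code. This is a minimal sketch in plain Python, assuming a population standard deviation; the helper names and the descriptor values are hypothetical and chosen only to show two descriptors on very different scales.

```python
import math

def mean(values):
    return sum(values) / len(values)

def std(values):
    """Population standard deviation (1/N normalisation)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def coefficient_of_variation(values):
    """Standard deviation divided by the mean ("spread")."""
    return std(values) / mean(values)

def mean_center(values):
    """x_i' = x_i - x_bar, so the new mean is 0."""
    m = mean(values)
    return [v - m for v in values]

def unit_variance(values):
    """x_i' = x_i / sigma, so the new standard deviation is 1."""
    s = std(values)
    return [v / s for v in values]

def autoscale(values):
    """Mean-centre first, then scale to unit variance."""
    return unit_variance(mean_center(values))

# Hypothetical descriptor columns (illustrative values only).
mol_weight = [120.0, 180.0, 250.0, 310.0]
log_p = [0.5, 1.2, 2.1, 3.0]

for name, column in (("MW", mol_weight), ("LogP", log_p)):
    scaled = autoscale(column)
    print(name, round(coefficient_of_variation(column), 3),
          [round(v, 3) for v in scaled])
```

After autoscaling, both columns have mean 0 and standard deviation 1, so neither dominates a later regression or PCA purely because of its units.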

QSAR/QSPR Pre-Qualifications
• Correlations
  – Remove correlated descriptors
  – Keep correlated descriptors so as to reduce data set size
  – Apply math operation to remove correlation (PCR)
$$-1 \le r_{ij} \le 1$$
OVERREPRESENTATION: $r_{ij} = 1$ (100% positive correlation), $r_{ij} = -1$ (100% negative correlation)
Correlation between the i-th and j-th descriptor:
$$R_{i,j} = \frac{\mathrm{Cov}(X_i, X_j)}{\mathrm{Std}(X_i)\,\mathrm{Std}(X_j)} = \frac{\sum_{k}^{M} (X_{i,k} - \bar{X}_i)(X_{j,k} - \bar{X}_j)}{\sqrt{\sum_{k}^{M} (X_{i,k} - \bar{X}_i)^2 \, \sum_{k}^{M} (X_{j,k} - \bar{X}_j)^2}}$$
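One way to act on the "remove correlated descriptors" option is to compute the pairwise correlation for every descriptor pair and flag those above a cut-off. The sketch below does this in plain Python; the cut-off of 0.9, the function names and the descriptor values are assumptions made for illustration, not values from the lecture.

```python
import math

def pearson(a, b):
    """Correlation between two descriptor columns measured on the same molecules."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a) *
                    sum((y - mean_b) ** 2 for y in b))
    return num / den

def correlated_pairs(descriptors, threshold=0.9):
    """Return (name_i, name_j, r_ij) for every pair with |r_ij| >= threshold."""
    names = list(descriptors)
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = pearson(descriptors[names[i]], descriptors[names[j]])
            if abs(r) >= threshold:
                flagged.append((names[i], names[j], r))
    return flagged

# Hypothetical descriptor table: one list per descriptor, one entry per molecule.
descriptors = {
    "MW": [120.0, 180.0, 250.0, 310.0, 400.0],
    "volume": [110.0, 170.0, 240.0, 300.0, 390.0],  # MW shifted by a constant
    "dipole": [1.5, 0.2, 2.8, 0.9, 1.1],
}
flagged = correlated_pairs(descriptors)
for name_i, name_j, r in flagged:
    print(name_i, name_j, round(r, 3))
```

Here only the MW/volume pair is flagged, so one of the two could be dropped without losing information.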

QSAR/QSPR Pre-Qualifications
• Correlations

QSAR/QSPR Scheme
• Goal
  – Predict what happens next (extrapolate)!
  – Predict what happens between data points (interpolate)!

QSAR/QSPR Scheme
• Types of Variable
  – Continuous
    • Concentration, occupied volume, partition coefficient, hydrophobicity
  – Discrete
    • Structural (1: methyl group substituted, 0: no methyl group substitution)

QSAR/QSPR Principal Components Analysis
• Reduces dimensionality of descriptors
• Principal components are a set of vectors representing the variance in the original data

Principal components – reducing the dimensionality of a dataset
[Figure: scatter plot of y against x]
Clearly there is a relationship between x and y: a high correlation. We can define a new variable z = x + y such that we can express most of the variation in the data as the new variable z. This new variable is a principal component.
$$p_i = \sum_{j=1}^{v} c_{i,j}\, x_j$$
$p_i$ is the i-th principal component and $c_{i,j}$ is the coefficient of the variable $x_j$. There are v such variables.

QSAR/QSPR-Principal Components Analysis
• Geometric Analogy (3-D to 2-D PCA)
[Figure: data points plotted on x, y and z axes]
$$\tilde{O} = \begin{pmatrix} x_1 & x_2 & \cdots & x_N \\ y_1 & y_2 & \cdots & y_N \\ z_1 & z_2 & \cdots & z_N \end{pmatrix}$$

Principal components
PCA is the transformation of a set of correlated variables to a set of orthogonal uncorrelated variables called principal components. These new variables are a linear combination of the original variables in decreasing order of importance.
$$r_{ik} = Y_{ik} - \bar{Y}_i = \sum_{p=1} b_{ip} \cdot t_{pk}$$
• r: data matrix
• b: loadings (measure of the variation between variables)
• t: scores (measure of the variation between samples)
• eigenvalue

QSAR/QSPR Principal Components Analysis
• Formulate matrix
• Diagonalize matrix
• Eigenvectors are the principal components
  – These principal components (new descriptors) are a linear combination of the original descriptors
• Eigenvalues represent variance
  – Largest accounts for greatest % of data variance
  – Next corresponds to second greatest and so on

QSAR/QSPR-Principal Components Analysis
• Formulate matrix (Several types)
  – Correlation or covariance (N x P)
    • N is number of molecules
    • P is number of descriptors
  – Variance-Covariance matrix (N x N)
$$\tilde{A} = \begin{pmatrix} r_{11} & r_{12} & \cdots & r_{1p} \\ r_{21} & r_{22} & \cdots & r_{2p} \\ \vdots & \vdots & & \vdots \\ r_{n1} & r_{n2} & \cdots & r_{np} \end{pmatrix}, \qquad \tilde{A}_{vc} = \tilde{A}\tilde{A}^{T}$$
• Diagonalize (Rotate) matrix

QSAR/QSPR-Principal Components Analysis
• Eigenvectors (Loadings)
  – Represent the contribution from each original descriptor to the PC (new descriptor)
    • # columns = # of descriptors
    • # rows = # of descriptors OR # of molecules
• Eigenvalues
  – Indicate which PC is most important (representative of original descriptors)
    • Benzene has 2 non-zero and 1 zero eigenvalue (planar)

QSAR/QSPR-Principal Components Analysis
• Scores
  – Graphing each object/molecule in space of 2 or more PCs
    • # rows = # of objects/molecules
    • # columns = # of descriptors OR # of molecules
For benzene, this corresponds to a graph in the PC1 (x’) and PC2 (y’) system.

Principal components
[Figure: data plotted on x and y axes with rotated axes PC1 and PC2]
The PCs each maximise the variance in the data in orthogonal directions and are ordered by size.
Usually only a few components are needed to explain (>90%) of the variance in the data; otherwise the properties are not relevant.
The first step is to calculate the variance-covariance matrix from the data.

Principal components
[Figure: data plotted on x and y axes with rotated axes PC1 and PC2]
If there are s observations each of which contains v values, the data can be represented by a matrix D with s rows and v columns.
The variance-covariance matrix is $Z = D^{T}D$ (for mean-centred data, up to a normalisation constant).
The eigenvectors of Z are the principal components. Z is a square symmetric matrix so the eigenvectors are orthogonal. Usually the matrix is diagonalised to obtain the eigenvectors (the weightings for the properties) and eigenvalues (the explained variance).
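The recipe above (mean-centre the data, form the variance-covariance matrix, extract eigenvectors) can be sketched without any numerical library. The example below is a minimal illustration in plain Python that finds only the first principal component by power iteration, which is a substitute for full diagonalisation chosen to keep the code short; the 1/(s-1) normalisation, the function names and the data are assumptions.

```python
import math

def mean_center_columns(rows):
    """Subtract each column's mean; rows are observations, columns are properties."""
    n_cols = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(n_cols)]
    return [[r[j] - means[j] for j in range(n_cols)] for r in rows]

def covariance_matrix(centered):
    """Z = D^T D / (s - 1) for mean-centred D with s rows and v columns."""
    s, v = len(centered), len(centered[0])
    return [[sum(row[a] * row[b] for row in centered) / (s - 1)
             for b in range(v)] for a in range(v)]

def first_principal_component(z, iterations=500):
    """Power iteration: repeatedly apply Z and renormalise.

    Assumes the start vector is not orthogonal to the leading eigenvector.
    """
    v = len(z)
    vec = [1.0 / math.sqrt(v)] * v
    for _ in range(iterations):
        new = [sum(z[a][b] * vec[b] for b in range(v)) for a in range(v)]
        norm = math.sqrt(sum(c * c for c in new))
        vec = [c / norm for c in new]
    # Rayleigh quotient gives the eigenvalue (variance along this component).
    eigenvalue = sum(vec[a] * sum(z[a][b] * vec[b] for b in range(v))
                     for a in range(v))
    return eigenvalue, vec

# Hypothetical data: two strongly correlated properties x and y.
data = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.1]]
z = covariance_matrix(mean_center_columns(data))
eigenvalue, pc1 = first_principal_component(z)
explained = eigenvalue / (z[0][0] + z[1][1])  # share of the total variance
print([round(c, 3) for c in pc1], round(explained, 4))
```

For these data the first component weights x and y almost equally, matching the earlier z = x + y picture, and it carries nearly all of the variance.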

Principal components
The output looks like this:
Eigenvalues (explain % variance), with the weightings for properties p1 to p5 below:
      80    10    5     3     2
p1    .2    .3    .4    .1    .1
p2    .01   .02   .3    .4    .5
p3    .02   .03   .1    .2    .4
p4    .03   .4    .4    .04   .3
p5    .3    .5    .5    .05   .3
Multiply the property value for a molecule by this for each eigenvalue.
Can do regression on the PCs, e.g. V = 0.3PC1(0.1) + 0.2PC2(0.1) + 0.4(0.2)
So, we’ve reduced a 5 property problem to a two property problem.

QSAR on SYBYL (Tripos Inc.)

QSAR on SYBYL (Tripos Inc.)
10D → 3D

QSAR on SYBYL (Tripos Inc.)
• Eigenvalues → Explanation of variance in data

QSAR on SYBYL (Tripos Inc.)
• Each point corresponds to a column (# points = # descriptors) in the original data
Proximity → correlation

QSAR on SYBYL (Tripos Inc.)
• Each point corresponds to a row of the original data (i.e. # points = # molecules), or a graph of molecules in PC space
[Figure: scores plot with labelled points He, Naphthalene and H2O; annotations "Molecular Size" and "Small acting Big"]
Proximity → similarity

QSAR on SYBYL (Tripos Inc.)
Outlier

QSAR/QSPR-Regression Types
• Principal Component Analysis

Non-Linear Mappings
• Calculate “distance” between points in N-dimensional descriptor/parameter space
  – Euclidean
  – City-block distances
• Randomly assign compounds in set to points on a 2-D or 3-D space
• Minimize Difference (Optimal N-D → 2-D plot)

Non-Linear Mappings
• Advantages
  – Non-linear
  – No assumptions!
  – Chance groupings unlikely (2-D group likely an N-D group)
• Disadvantages
  – Dependence on initial guess (use PCA scores to improve)

QSAR/QSPR-Regression Types
• Multiple Regression (MR)
• PCR
• PLS

QSAR/QSPR-Regression Types
• Linear Regression
  – Minimize difference between calculated and observed values (residuals)
$$y = mx + b$$
$$m = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}, \qquad b = \bar{y} - m\bar{x}$$
Multiple Regression
$$y = \sum_{i=1}^{N} m_i x_i + B$$
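The closed-form slope and intercept of the linear regression above translate directly into code. This is a minimal sketch in plain Python; the function name and the test data are illustrative.

```python
def fit_line(xs, ys):
    """Least-squares line y = m*x + b.

    m = sum((x_i - x_bar) * (y_i - y_bar)) / sum((x_i - x_bar)^2)
    b = y_bar - m * x_bar
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    m = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    b = y_bar - m * x_bar
    return m, b

# Points that lie exactly on y = 2x + 1, so the fit should recover m = 2, b = 1.
m, b = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(m, b)
```

Multiple regression generalises this to several descriptors, which requires solving a set of linear equations rather than a single ratio.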

QSAR/QSPR-Regression Types
• Principal Component Regression
  – Regression but with Principal Components substituted for original descriptors/variables

QSAR/QSPR-Regression Types
• Partial Least Squares
  – Cross-validation determines number of descriptors/components to use
  – Derive equation
  – Use bootstrapping and t-test to test coefficients in QSAR regression

QSAR/QSPR-Regression Types
• Partial Least Squares (a.k.a. Projection to Latent Structures)
  – Regression of a Regression
    • Provides insight into variation in x’s ($b_{i,j}$’s as in PCA) AND y’s ($a_i$’s)
  – The $t_i$’s are orthogonal
  – M = (# of variables/descriptors OR # of observations/molecules, whichever is smaller)
$$y = \sum_{i}^{N} a_i\, t_i, \qquad t_i = \sum_{j}^{M} b_{ij}\, x_j$$
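To make the "regression of a regression" idea concrete, the sketch below implements a single-response PLS using the NIPALS scheme in plain Python. It is an illustrative sketch, not the SYBYL implementation: the function names and data are hypothetical, and the weights for each component are applied to the deflated descriptor matrix, so they are not identical to the $b_{ij}$ on the original x's without a further transformation.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def center_columns(rows):
    n_cols = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(n_cols)]
    return [[r[j] - means[j] for j in range(n_cols)] for r in rows]

def pls1(x_rows, y, n_components):
    """NIPALS PLS for one response. Returns (scores t_i, y coefficients a_i)."""
    x = center_columns(x_rows)
    y_mean = sum(y) / len(y)
    resid = [v - y_mean for v in y]
    n_cols = len(x[0])
    scores, coeffs = [], []
    for _ in range(n_components):
        # Weights: covariance of each (deflated) descriptor with the y residual.
        w = [dot([row[j] for row in x], resid) for j in range(n_cols)]
        norm = math.sqrt(dot(w, w))
        w = [c / norm for c in w]
        t = [dot(row, w) for row in x]            # score vector t_i
        tt = dot(t, t)
        p = [dot([row[j] for row in x], t) / tt for j in range(n_cols)]
        a = dot(resid, t) / tt                    # regression coefficient a_i
        # Deflate X and y; this is what makes successive scores orthogonal.
        x = [[row[j] - t_k * p[j] for j in range(n_cols)]
             for row, t_k in zip(x, t)]
        resid = [r - a * t_k for r, t_k in zip(resid, t)]
        scores.append(t)
        coeffs.append(a)
    return scores, coeffs

# Hypothetical data: six molecules, three descriptors, one activity value each.
x_rows = [[1.0, 2.0, 0.5], [2.0, 1.0, 1.5], [3.0, 4.0, 1.0],
          [4.0, 3.0, 2.5], [5.0, 6.0, 2.0], [6.0, 5.0, 3.5]]
y = [3.2, 3.9, 7.1, 8.2, 11.0, 12.1]
scores, coeffs = pls1(x_rows, y, n_components=2)
print([round(a, 4) for a in coeffs], round(dot(scores[0], scores[1]), 10))
```

The printed dot product of the two score vectors is zero to within rounding, which illustrates the statement that the $t_i$'s are orthogonal.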

QSAR/QSPR-Regression Types
• PLS is NOT MR or PCR in practice
  – PLS is MR with cross-validation
  – PLS is faster
    • It couples the target representation (QSAR generation) and component generation, while PCA and PCR are separate
• PLS is well suited to multivariate problems

QSAR/QSPR Post-Qualifications
• Confidence in Regression
  – TSS: Total Sum of Squares
  – ESS: Explained Sum of Squares
  – RSS: Residual Sum of Squares
$$TSS = ESS + RSS$$
$$R^2 = \frac{ESS}{TSS}: \quad 1 \text{ (100\% explanation of data)}, \quad 0 \text{ (no explanation of data)}$$
$$TSS = \sum_{i}^{N} (y_i - \bar{y})^2, \qquad ESS = \sum_{i}^{N} (y_{calc,i} - \bar{y})^2, \qquad RSS = \sum_{i}^{N} (y_i - y_{calc,i})^2$$
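The decomposition TSS = ESS + RSS holds for a least-squares fit with an intercept, and it can be verified on a small example. The sketch below is plain Python with illustrative function names and hypothetical data; the calculated values come from a one-descriptor least-squares line.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = m*x + b."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    m = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return m, y_bar - m * x_bar

def sums_of_squares(ys, y_calc):
    """Return (TSS, ESS, RSS) for observed ys and calculated y_calc."""
    y_bar = sum(ys) / len(ys)
    tss = sum((y - y_bar) ** 2 for y in ys)
    ess = sum((yc - y_bar) ** 2 for yc in y_calc)
    rss = sum((y - yc) ** 2 for y, yc in zip(ys, y_calc))
    return tss, ess, rss

# Hypothetical observations (illustrative values only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
m, b = fit_line(xs, ys)
y_calc = [m * x + b for x in xs]
tss, ess, rss = sums_of_squares(ys, y_calc)
r_squared = ess / tss
print(round(tss, 4), round(ess, 4), round(rss, 4), round(r_squared, 4))
```

If the calculated values came from a model that was not fitted by least squares on the same data, ESS + RSS would in general differ from TSS.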

QSAR/QSPR Post-Qualifications
• Confidence in Prediction (Predictive Error Sum of Squares)
$$Q^2 = 1 - \frac{PRESS}{\sum_{i=1}^{N} (y_i - \bar{y})^2}, \qquad PRESS = \sum_{i=1}^{N} (y_i - y_{calc,i})^2$$
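Q² differs from R² in that each calculated value should come from a model that did not see that data point. The sketch below assumes a leave-one-out scheme and a one-descriptor linear model, neither of which is specified on the slide; the function names and data are illustrative.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = m*x + b."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    m = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return m, y_bar - m * x_bar

def press_leave_one_out(xs, ys):
    """PRESS: each point is predicted by a model fitted without it."""
    press = 0.0
    for k in range(len(xs)):
        train_x = xs[:k] + xs[k + 1:]
        train_y = ys[:k] + ys[k + 1:]
        m, b = fit_line(train_x, train_y)
        press += (ys[k] - (m * xs[k] + b)) ** 2
    return press

def q_squared(xs, ys):
    """Q^2 = 1 - PRESS / sum((y_i - y_bar)^2)."""
    y_bar = sum(ys) / len(ys)
    tss = sum((y - y_bar) ** 2 for y in ys)
    return 1.0 - press_leave_one_out(xs, ys) / tss

# Hypothetical observations (illustrative values only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
q2 = q_squared(xs, ys)
print(round(q2, 4))
```

Because every prediction is made without the point being predicted, Q² is a stricter measure than R² and is normally somewhat lower for the same model.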

QSAR/QSPR Post-Qualification
• Bias?
  – Bootstrapping
• Choosing best model?
  – Cross Validation

QSAR/QSPR Post-Qualification
• Bootstrapping
  – ASSUME calculated data is experimental/observed data
  – Randomly choose N data (allowing for multiple picks of the same data)
  – Re-generate parameters/regression
  – Repeat M times
  – Average over M bootstraps
  – Compare (calculate residual)
    • If close to zero then no bias
    • If large then bias exists
M is typically 50-100
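The bootstrap procedure above can be sketched as follows. This is a minimal illustration in plain Python that resamples (x, y) pairs with replacement and tracks a single regression parameter, the slope; the choice of parameter, the fixed random seed, the function names and the data are all assumptions made for the example.

```python
import random

def fit_slope(xs, ys):
    """Least-squares slope of y against x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
            / sum((x - x_bar) ** 2 for x in xs))

def bootstrap_bias(xs, ys, n_boot=100, seed=0):
    """Return (bias, slopes): mean bootstrap slope minus the full-data slope."""
    rng = random.Random(seed)
    n = len(xs)
    reference = fit_slope(xs, ys)
    slopes = []
    while len(slopes) < n_boot:
        picks = [rng.randrange(n) for _ in range(n)]  # N picks with replacement
        boot_x = [xs[i] for i in picks]
        if len(set(boot_x)) < 2:
            continue  # all picks share one x value: slope undefined, redraw
        boot_y = [ys[i] for i in picks]
        slopes.append(fit_slope(boot_x, boot_y))
    return sum(slopes) / n_boot - reference, slopes

# Hypothetical data scattered around y = 2x + 1 (illustrative values only).
xs = [float(i) for i in range(1, 11)]
ys = [3.1, 4.9, 7.1, 8.9, 11.1, 12.9, 15.1, 16.9, 19.1, 20.9]
bias, slopes = bootstrap_bias(xs, ys, n_boot=100)
print(round(bias, 4))
```

A bias close to zero relative to the size of the parameter suggests that the regression is not dominated by a few data points.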

QSAR/QSPR Post-Qualification
• Cross-Validation (used in PLS)
  – Remove one or more pieces of input data
  – Re-derive QSAR equation
  – Calculate omitted data
  – Compute root-mean-square error to evaluate efficacy of model
    • Typically 20% of data is removed for each iteration
    • The model with the lowest RMS error has the optimal number of components/descriptors
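The cross-validation loop above can be sketched with five folds, so that 20% of the data is left out in each iteration. The example below is plain Python and compares two hypothetical candidate models, one using no descriptor (the mean of y) and one using a single descriptor (a least-squares line), as a stand-in for comparing different numbers of components; the fold assignment by index, the function names and the data are assumptions.

```python
import math

def linear_model(xs, ys):
    """Fit y = m*x + b by least squares and return a prediction function."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    m = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    b = y_bar - m * x_bar
    return lambda x: m * x + b

def mean_model(xs, ys):
    """Baseline with no descriptor: always predict the training mean."""
    y_bar = sum(ys) / len(ys)
    return lambda x: y_bar

def cross_validated_rmse(xs, ys, build_model, n_folds=5):
    """Leave out one fold at a time, refit, and predict the omitted points."""
    squared_errors = []
    for fold in range(n_folds):
        test_idx = [i for i in range(len(xs)) if i % n_folds == fold]
        train_idx = [i for i in range(len(xs)) if i % n_folds != fold]
        model = build_model([xs[i] for i in train_idx],
                            [ys[i] for i in train_idx])
        squared_errors.extend((ys[i] - model(xs[i])) ** 2 for i in test_idx)
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical data scattered around y = 2x + 1 (illustrative values only).
xs = [float(i) for i in range(1, 11)]
ys = [3.1, 4.9, 7.1, 8.9, 11.1, 12.9, 15.1, 16.9, 19.1, 20.9]
rmse = {
    "mean only": cross_validated_rmse(xs, ys, mean_model),
    "linear": cross_validated_rmse(xs, ys, linear_model),
}
best = min(rmse, key=rmse.get)
print(best, {name: round(value, 3) for name, value in rmse.items()})
```

The same loop applies to PLS or PCR: refit with 1, 2, 3, ... components and keep the count with the lowest cross-validated RMS error.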

QSPR Example
• Relation between musk odorant properties and benzenoid structure
  – Training set of 148 compounds (81 non-musk and 67 musk)
  – 47 chemical descriptors initially
  – Pre-qualifications
    • Correlations (47-12=35)
  – Post-qualifications
    • Bootstrapping
    • Test-set
      – 6/6 musks, 8/9 non-musks
Narvaez, J. N., Lavine, B. K. and Jurs, P. C. Chemical Senses, 11, 145-156 (1986)

Practical Issues
• 10 times as many compounds as parameters fit
• 3-5 compounds per descriptor
• Traditional QSAR
  – Good for activity prediction
  – Not good for determining whether activity is due to binding or transport

Advanced Methods
• Neural Networks
• Support Vector Machines
• Genetic/Evolutionary Algorithms
• Monte Carlo
• Alternate descriptors
  – Reduced graphs
  – Molecular connectivity indices
  – Indicator variables (0 or 1)
• Combinatorics (e.g. multiple substituent sites)

Tools Available
• Sybyl (Tripos Inc.)
• Insight II (Accelrys Inc.)
• Pole Bio-Informatique Lyonnais
  – http://pbil.univ-lyon1.fr/
• Molecular Biology
  – http://www.infobiogen.fr/services/deambulum/english/logiciels.html

Summary
• QSAR/QSPR
  – Statistics connect structure/behavior with observables
  – Interpolate/Extrapolate
• Multi-Variate Analysis
  – Pre-Qualification
  – Regression
    • PCA
    • PLS
    • MLS
  – Post-Qualification