22/01/2013
EBI is an Outstation of the European Molecular Biology Laboratory.
Virtual Screening:
Methods and Applications
Dr Pedro J Ballester
MRC Methodology Research Fellow
EMBL-EBI, Cambridge, United Kingdom
Talk outline
Virtual Screening: Methods and Applications UCL MSc in Drug Design,
Jan 2013
1. Introduction
2. Ligand-based Virtual Screening: Methods
3. Ligand-based Virtual Screening: Applications
4. Structure-based Virtual Screening: Methods
5. Structure-based Virtual Screening: Applications
The Drug Discovery Process
• Developing a new drug costs on average US$4 billion and takes 15 years (http://www.forbes.com/sites/matthewherper/2012/02/10/the-truly-staggering-cost-of-inventing-new-drugs/)
• While (pre)clinical trials are the most expensive stages,
the research that determines approval happens at the early stages:
• Finding a target linked to the disease and a molecule modulating
the function of the target without triggering harmful side effects.
• New targets: harder, with less funding than traditional ones, etc.
Payne et al. (2007) Nat Rev. Drug Disc. 6:29
Virtual Screening: Why?
• HTS: main strategy for identifying active molecules (hits)
by wet-lab testing a library of molecules against a target.
• Computational methods (Virtual Screening) are needed:
• HTS is slow: HTS of corporate collections takes many months
• HTS is expensive: average cost US$1M per screen (Payne et al. 2007)
• Growing number of research targets → no HTS until target validation
• Limited diversity:
HTS ~10^6 cpds...
but ~10^60 small molecules!
(Dobson 2004 Nature)
• Target really undruggable?
Virtual Screening (VS): when and which?
VS can complement HTS by enriching libraries with likely ligands.
[Flowchart 1: HTS or Virtual Screening (VS)? Yes/no questions: HTS possible? Can afford expense and time? Diversity essential? Outcomes: HTS or VS.]
[Flowchart 2: Type of Virtual Screening (VS)? Yes/no questions: Ligand/s or structure for target? Target structure available? Outcomes: ligand- then target-based VS; ligand-based VS only; VS not possible.]
VS to predict target affinity (hit identification)
• Search for molecules that modulate
the function of a therapeutic target.
• Hypothesis: modulating the target's function
cures/alleviates the associated disease
• In silico predictions must be validated in vitro (IC50, Ki, Kd)
• Important requirements:
• Potency: threshold depends on target and plasma drug concentration ([drug]plasma)
• Different chemical structure → new IP, avoids previous problems
• Work well in practice vs Looking good on paper
VS to predict selectivity (lead identification)
• Drugs must selectively bind to their intended target, as
binding to other proteins may cause harmful side-effects
1. The more chemically diverse the hits found, the
more likely it is that some of them are directly selective.
2. Structure-based design: e.g. identify
hits that occupy a subpocket that is in
the target but not in related proteins.
3. Ligand-based design: superpose diverse hits
in 3D and use activity against related
proteins to formulate a hypothesis.
VS to predict whole-cell activity (lead identification)
• Drug molecules must also inhibit the
target in the cell environment.
• Whole-cell assay: e.g. cancer cell
growth inhibition (GI50) measured
in the presence of the molecule.
• Phenotypic screening is attractive ∵ many molecules
binding to the target do not have whole-cell activity.
• However, there are very few in silico methods for predicting
which molecules are likely to have whole-cell activity.
Talk outline
1. Introduction
2. Ligand-based Virtual Screening: Methods
3. Ligand-based Virtual Screening: Applications
4. Structure-based Virtual Screening: Methods
5. Structure-based Virtual Screening: Applications
Ligand-based VS: classes by available data
• Using a single active molecule:
• 2D: similarity search based on connectivity, fingerprints, …
(e.g. circular fingerprints, MACCS keys, etc.)
• 3D: similarity search based on shape, HBs, hydrophobicity, …
(e.g. USR, ROCS, Shape Signatures, etc.)
• Using a few active molecules:
• Pharmacophore models: 3D alignment, pharmacophore
elucidation and pharmacophore search (e.g. Pharmer).
• Using many active molecules:
• (2D)QSAR: chemical structures + regression/classification model
• 3D-QSAR: 3D conformers + regression/classification model
Ligand-based VS: Motivation and Challenges
• At least one ligand is known, but chemical structures different from
such template/s may have better properties:
• More selectivity to target
• Less toxicity
• More potency
• Fewer IP concerns, …
• Challenges (some specific to the type of method):
• Scaffold hopping: better with 3D techniques and larger dbs
• Efficiency: so that very large molecular dbs can be searched
• Prospective efficacy: high hit rates at a suitable potency threshold;
depends on previously known ligands for the target
Ligand-based VS: Shape Similarity
• Principle:
Similarly shaped molecules
will fit the same binding pockets
→ likely similar bioactivity.
• Shape-based VS: Search a molecular database for small
molecules that are similar in 3D shape to a known active.
• Prospective success in the literature (alone/combined).
• Additional advantage: chemical structure is not specified
and hence novel chemical scaffolds may be found.
[Figure: ligand GAJ from PDB complex 2C4W]
Shape-based VS: Challenges
Two challenges limiting the potential of shape similarity:
• Effectiveness
• fast pre-alignment of molecules is prone to error, shape
description might not be sufficiently accurate, etc.
• Efficiency
• actives & 3D conformations as templates, growing
size of compound libraries, limited computers, etc.
• In practice, many active molecules are not found
because these are not even screened in silico!
Ultrafast Shape Recognition (USR)
USR descriptors
Molecular shape → relative position of atoms → interatomic distances to 4 reference points → 1st, 2nd and 3rd sample moments of each distribution → 12 real-valued USR descriptors
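The descriptor pipeline can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the slide only states "4 reference points" and "1st, 2nd and 3rd sample moments", so the specific reference points used here (centroid, atom closest to the centroid, atom farthest from the centroid, atom farthest from that atom) follow the published description of USR and are an assumption with respect to this talk. The normalisation of the second and third moments (square root and cube root) is also a choice made here for illustration, and the function name `usr_descriptors` is hypothetical.

```python
import math

def usr_descriptors(coords):
    """Return 12 USR-style shape descriptors for a list of (x, y, z) atom positions."""
    n = len(coords)
    # Reference point 1: molecular centroid (assumed choice, see lead-in).
    ctd = tuple(sum(c[k] for c in coords) / n for k in range(3))
    d_ctd = [math.dist(ctd, c) for c in coords]
    # Reference points 2 and 3: atoms closest to and farthest from the centroid.
    cst = coords[d_ctd.index(min(d_ctd))]
    fct = coords[d_ctd.index(max(d_ctd))]
    # Reference point 4: atom farthest from reference point 3.
    d_fct = [math.dist(fct, c) for c in coords]
    ftf = coords[d_fct.index(max(d_fct))]

    descriptors = []
    for ref in (ctd, cst, fct, ftf):
        dists = [math.dist(ref, c) for c in coords]
        mean = sum(dists) / n
        var = sum((d - mean) ** 2 for d in dists) / n
        third = sum((d - mean) ** 3 for d in dists) / n
        # Sign-preserving cube root keeps the third moment in distance units.
        skew = math.copysign(abs(third) ** (1.0 / 3.0), third)
        descriptors.extend([mean, math.sqrt(var), skew])
    return descriptors
```

Because only distances to internally defined reference points are used, the descriptors do not change when the molecule is translated or rotated, which is why no pre-alignment of molecules is needed.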
USR similarity
Shape similarity as the inverse of a distance between
vectors of USR descriptors → score in (0,1]
Ranking molecules in terms of shape similarity to a
query/template molecule (e.g. a known inhibitor)
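A similarity score with the stated range (0,1] can be sketched as follows. The slide does not give the formula; the form below, one over one plus the mean absolute difference between descriptor vectors, follows the published USR score and should be read as an assumption here. The function names and the dictionary layout of the database are hypothetical.

```python
def usr_similarity(query, candidate):
    """Similarity in (0, 1]: 1 for identical vectors, approaching 0 as they diverge."""
    if len(query) != len(candidate):
        raise ValueError("descriptor vectors must have the same length")
    mean_abs_diff = sum(abs(q - c) for q, c in zip(query, candidate)) / len(query)
    return 1.0 / (1.0 + mean_abs_diff)

def rank_by_similarity(query, database):
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scored = [(name, usr_similarity(query, desc)) for name, desc in database.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Each comparison costs only a handful of arithmetic operations on 12 numbers, which is what makes screening very large conformer databases feasible.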
USR: most similar shapes in 2.5M DB
[Figure: query molecules and their 1st to 4th ranked most similar molecules]
Template is found to be the most similar molecule.
Top ranked molecules are very similar, despite the large size of the database.
USR vs ESshape3D: effectiveness
[Figure: 1st to 4th ranked molecules for the same query, USR vs ESshape3D]
USR vs ESshape3D: efficiency
$t_{USR}(q, n) = n \, (t_{read} + q \, t_{screen})$
$t_{read} \sim 10^{-5}$ secs/conf
$t_{screen} \sim 4 \times 10^{-8}$ secs/conf
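As a worked example of this timing model, the sketch below evaluates it for the screen described later in the talk (100 queries against 690M conformers). It assumes that q is the number of query molecules and n the number of database conformers, which the slide does not state explicitly, and it uses the order-of-magnitude constants above, so the result is only a rough estimate. The function name is hypothetical.

```python
def usr_screen_time(q, n, t_read=1e-5, t_screen=4e-8):
    """Estimated single-core time in seconds: n * (t_read + q * t_screen)."""
    return n * (t_read + q * t_screen)

# 100 queries against 690 million conformers (figures quoted later in the talk).
seconds = usr_screen_time(q=100, n=690_000_000)
print(f"{seconds / 60:.0f} min on one core")
```

This gives roughly 160 minutes on a single core, which is in line with the 83 minutes reported on two execution cores.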
Molecular data mining with USR
• Interactively and accurately shape searching a multi-million
molecular DB was not feasible until USR. Now:
• Tom Blundell’s group, University of Cambridge, UK
• Stefano Moro’s group, University of Padova, Italy
• David Wild’s group, Indiana University, USA
UFSRAT: USR + more chemical properties
Malcolm Walkinshaw’s group,
University of Edinburgh, UK
J. Adie 2010 PhD thesis
S. Shave 2010 PhD thesis
Talk outline
1. Motivation
2. Ligand-based Virtual Screening: Methods
3. Ligand-based Virtual Screening: Applications
4. Structure-based Virtual Screening: Methods
5. Structure-based Virtual Screening: Applications
First prospective application of USR
Problem statement
• Interested in inhibitors for hNAT1 (arylamine N-acetyltransferases;
EC 2.3.1.5; target for breast cancer).
• mNat2 is functionally analogous to hNAT1 → mNat2 assays
• Previously: mNat2 manual screen of 5000 cherry-picked
cpds → 0.1% confirmed hit rate (IC50 < 10 μM)
• No protein structure available → ligand-based approach:
most potent hit (IC50 = 1.1 μM) as template for the VS.
• Goal: find additional inhibitors with chemical structures
different from the template, to be considered as probes.
Virtual Screen protocol
• Screening library: ZINC7 purchasable, 5.3M cpds (690M 3D conformers)
• 690M confs → 34 partitions of ~20M confs each
• Extremely fast screening: 100 USR queries on this DB, i.e. 69 billion shape
comparisons in just 83 mins using a single dual-core processor
(13,700,000 confs/sec using two execution cores).
• Workflow: ZINC7 purchasable → USR-best_hit → top 23 cpds bought → testing
virtual screening hits in vitro by collaborators → # compounds active against target?
Results of in vitro confirmatory tests
• 9/23 tested cpds
had IC50 < 10 μM
→ hit rate = 39%
• Just £500 to buy
23 cpds to test
• Manual screen yielded
hit rate = 0.1% on
the same target.
• Importantly, all nine hits were scaffold hops with respect to
the active template (most hops with S_MACCS < 0.4).
[Figure: template and most potent USR hits]
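The comparison between the two hit rates can be made explicit with a short calculation. The sketch below uses only the figures quoted on this slide (9 hits out of 23 tested compounds, 0.1% for the manual screen); the helper name `hit_rate` is hypothetical.

```python
def hit_rate(n_hits, n_tested):
    """Fraction of tested compounds confirmed as hits."""
    if n_tested <= 0:
        raise ValueError("n_tested must be positive")
    return n_hits / n_tested

usr_rate = hit_rate(9, 23)   # compounds selected by USR
manual_rate = 0.001          # 0.1% confirmed hit rate of the manual screen
fold = usr_rate / manual_rate
print(f"USR hit rate: {usr_rate:.0%}, about {fold:.0f}-fold over the manual screen")
```

On these numbers the virtual screen improves the hit rate by roughly 390-fold, although the two screens tested very different numbers of compounds, so the comparison is indicative only.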
Prospective application of UFSRAT
• J. Adie 2010 PhD thesis (University of Edinburgh, UK):
• UFSRAT w/ 4 known inhibitors of human 11β-hydroxysteroid
dehydrogenase 1 (h11β-HSD1), a target for type 2 diabetes.
• a database with four million commercially available compounds
• 35 tested cpds:
• 5 showed potent
inhibition both in cells
and recombinant
protein assays
• 14% confirmed hit
rate (Ki values
ranged from 51.8 nM
to 11.3 μM).
Talk outline
1. Motivation
2. Ligand-based Virtual Screening: Methods
3. Ligand-based Virtual Screening: Applications
4. Structure-based Virtual Screening: Methods
5. Structure-based Virtual Screening: Applications
Docking
• If an X-ray structure of the target
is available → docking:
• predicting whether and how a
molecule binds to the target.
• Docking = Pose generation + Scoring
• Pose generation: estimating the conformation and orientation of
the ligand as bound to the target.
• Scoring: predicting how strongly the ligand binds to the target.
• Many relatively accurate algorithms for pose generation,
but imperfections of scoring functions continue to be the
major limiting factor for the reliability of docking.
Pose generation in Docking
• Goal: finding a pose as similar as possible to that of the ligand co-crystallised with the target.
• How: search algorithm generates several poses of each considered ligand (multimodal optimisation problem).
• Tradeoff between time and search space coverage
• Possible variables:
• Translation and rotation of the ligand relative to the binding site
involve six degrees of freedom (3D position + 3D orientation)
• Torsional/conformational degrees of freedom of both the
ligand (on the fly/stored) and the protein (flexible docking).
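How similar a generated pose is to the co-crystallised one is usually quantified with the root-mean-square deviation (RMSD) of atom positions. The sketch below is a minimal version that assumes both poses are given in the same protein coordinate frame and list their atoms in the same order; it does not handle symmetry-equivalent atoms, and the function name is hypothetical.

```python
import math

def pose_rmsd(predicted, reference):
    """RMSD between two poses given as equally ordered lists of (x, y, z) positions."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("poses must be non-empty and have the same number of atoms")
    squared = sum(math.dist(p, r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(squared / len(predicted))
```

No superposition is applied before the comparison, since a docked pose that is correct in shape but misplaced in the binding site should count as wrong. A cutoff of about 2 Å is a common convention for calling a pose successful; that threshold is not stated in the slides.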
Scoring Functions for Docking: functional forms
• Force Field-based SFs (e.g. DOCK score)
• Empirical SFs (e.g. X-Score)
• Knowledge-based SFs (e.g. PMF)
• SFs are trained on pK data usually through MLR:
• FF (Aij, Bij), Emp(w0,…,w4) and sometimes KB ( )
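For orientation, the generic shapes of the first two families can be written out as below. These are textbook forms given as an assumption, not the exact equations of DOCK or X-Score: the force-field form shows where pairwise parameters such as Aij and Bij enter, and the empirical form shows an additive model whose weights w0,…,w4 are fitted by MLR to pK data. The interaction terms x_k are placeholders.

```latex
% Force-field-based SF: pairwise Lennard-Jones-type terms over protein atoms i and ligand atoms j
E_{\mathrm{FF}} = \sum_{i \in \mathrm{protein}} \sum_{j \in \mathrm{ligand}}
  \left( \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}} \right) + \text{(electrostatic term)}

% Empirical SF: weighted sum of interaction terms x_1, ..., x_4 with fitted weights
pK_{\mathrm{Emp}} = w_0 + \sum_{k=1}^{4} w_k \, x_k
```

In both cases the functional form is fixed in advance and only a small number of parameters are calibrated.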
Scoring Functions for Docking: limitations
• Two major sources of error affecting all SFs:
1. Limited description of protein flexibility.
2. Implicit treatment of solvent.
• This is necessary to make SFs sufficiently fast.
• 3rd source of error has received little attention so far:
• Conventional scoring functions assume a theory-inspired
predetermined functional form for the relationship between:
• the structure-based description of the p-l complex
• and its measured/predicted binding affinity
• Problem: difficulty of explicitly modelling the various
contributions of intermolecular interactions to binding affinity.
• Also, SFs use an additive functional form, but this has been
specifically shown to be suboptimal (Kinnings et al. 2011 JCIM).
Non-parametric machine learning can be used to implicitly
capture the functional form (data-driven, not knowledge-based).
A machine learning approach
• Main idea: a priori assumptions about the functional
form introduce modelling error → no assumptions!
• Reconstruct the physics of the problem implicitly in an
entirely data-driven manner using non-parametric ML.
• Random Forest (Breiman, 2001) to learn how the
atomic-level description of the complex relates to pK:
• Random Forest (RF): a large ensemble of diverse DTs.
• Decision Tree (DT): recursive partition of descriptor space s.t.
training error is minimal within each terminal node.
• But how do we characterise a protein-ligand complex as
a set of numerical descriptors (features)?
Characterising the protein-ligand complex
pKd/i   C.C   …   C.Cl   …   C.I   N.C   …   I.I   PDB ID
5.70    95    …   30     …   0     73    …   0     2p33
Each complex: features (descriptors) + 1 binding affinity
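The features in this table read as occurrence counts of protein-ligand element pairs (C.C, C.Cl, …, I.I). A minimal sketch of such a descriptor generator is shown below. It assumes that a pair is counted when the two atoms lie within a distance cutoff, that the cutoff is 12 Å as in the published RF-Score, and that the same nine elements are considered on both sides; none of these details are given on the slide, and the function name and input layout are hypothetical.

```python
import math

# Elements considered for both protein and ligand atoms (assumed, see lead-in).
ELEMENTS = ("C", "N", "O", "F", "P", "S", "Cl", "Br", "I")

def contact_counts(protein_atoms, ligand_atoms, d_cutoff=12.0):
    """Count protein-ligand atom pairs per element pair within d_cutoff (in angstroms).

    Atoms are (element, (x, y, z)) tuples; keys are 'protein.ligand' element pairs.
    """
    counts = {f"{p}.{l}": 0 for p in ELEMENTS for l in ELEMENTS}
    for p_elem, p_xyz in protein_atoms:
        for l_elem, l_xyz in ligand_atoms:
            key = f"{p_elem}.{l_elem}"
            # Pairs involving elements outside ELEMENTS are ignored.
            if key in counts and math.dist(p_xyz, l_xyz) < d_cutoff:
                counts[key] += 1
    return counts
```

Each complex then becomes one row of integer counts plus its measured binding affinity, which is the input format a regression model such as Random Forest needs.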
PDBbind benchmark
• De facto standard for benchmarking SFs: Cheng, T., Li, X., Li, Y., Liu, Z. & Wang, R. (2009) JCIM 49, 1079-1093
• Refined set: 1300 manually curated protein-ligand
complexes with measured binding affinity (diverse):
• Benchmark: 16 state-of-the-art SFs compared by test set error
• RF-Score vs 16 SFs on test set error, but:
• Other SFs have an undisclosed number of complexes in common between training and test sets!
• RF-Score & X-Score (best): non-overlapping training and test sets.
Training and testing machine learning SFs
Generation of descriptors (dcutoff, binning, interatomic types)
Training set (1105 complexes), e.g. 1w8l (pKi=0.49), 1gu1 (pKi=4.52), 2ada (pKi=13):
pKd/i   C.C    –   C.I   N.C   –   I.I   PDB
0.49    1254   –   0     166   –   0     1w8l
–       –      –   –     –     –   –     –
13.00   2324   –   0     919   –   0     2ada
Test set (195 complexes), e.g. 2hdq (pKi=1.4), 1e66 (pKi=9.89), 7cpa (pKi=13.96):
pKd/i   C.C    –   C.I   N.C   –   I.I   PDB
1.40    858    –   0     0     –   0     2hdq
–       –      –   –     –     –   –     –
13.96   4476   –   0     283   –   0     7cpa
Random Forest training (descriptor selection, model selection) → RF-Score (description and training choices)
RF-Score's performance
Rp=0.776
SD=1.58
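The two performance measures can be computed as sketched below: Rp is Pearson's correlation coefficient between predicted and measured affinities, and SD is taken here to be the standard deviation of the residuals after a linear fit of measured on predicted values. That reading of SD follows the PDBbind benchmark convention as far as it is known to this editor and is an assumption; the denominator (N - 1 or N - 2) differs between papers, so it is exposed as the hypothetical parameter `dof`.

```python
import math

def pearson_and_sd(predicted, measured, dof=2):
    """Return (Rp, SD) for predicted vs measured binding affinities."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    s_pp = sum((p - mean_p) ** 2 for p in predicted)
    s_mm = sum((m - mean_m) ** 2 for m in measured)
    s_pm = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    rp = s_pm / math.sqrt(s_pp * s_mm)
    # Least-squares line measured = a + b * predicted, then spread of its residuals.
    b = s_pm / s_pp
    a = mean_m - b * mean_p
    sse = sum((m - (a + b * p)) ** 2 for p, m in zip(predicted, measured))
    sd = math.sqrt(sse / (n - dof))
    return rp, sd
```

A higher Rp and a lower SD both indicate better agreement between predicted and measured binding affinity.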
Fair assessment of scoring functions (SFs)
How to find best regression techniques for this problem?
Ballester PJ & Mitchell JBO (2011) Journal of Chemical Information and Modeling 51, 1739–1741
Benchmarking SFs: cultural barriers
1. Test all SFs on the same data set previously selected
by 3rd party: increasingly the case, but not always.
2. Have all SFs trained on the same data set:
• Almost never the case, often training set is not even disclosed.
• SF_A > SF_B simply ∵ A’s training set contains data more
relevant to predict the test set, not ∵ more accurate modelling!
3. No complexes in common between training and test sets
• Almost never the case, ∵ SFs are not recalibrated.
• Largest bias! Most SFs in PDBbind benchmark are likely to
perform worse if recalibrated to exclude overlaps.
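Checking the third requirement is straightforward when the training set is disclosed. The sketch below removes from a training set every complex that also appears in the test set, comparing PDB IDs case-insensitively. The function name is hypothetical, and the ID lists in the usage example are illustrative only (7cpa is placed in both lists on purpose to show an overlap).

```python
def remove_overlap(training_ids, test_ids):
    """Return training IDs not present in the test set, plus the overlapping IDs."""
    test = {pdb_id.lower() for pdb_id in test_ids}
    kept = [pdb_id for pdb_id in training_ids if pdb_id.lower() not in test]
    overlap = sorted({pdb_id.lower() for pdb_id in training_ids} & test)
    return kept, overlap

# Illustrative IDs only, not the actual benchmark partition.
training = ["1w8l", "1gu1", "2ada", "7CPA"]
test = ["2hdq", "1e66", "7cpa"]
kept, overlap = remove_overlap(training, test)
```

After this step the scoring function has to be recalibrated on the reduced training set before it is evaluated, otherwise the reported test performance remains optimistic.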
Performance vs overlap in RF-Score
If we allow an overlap of 65 complexes:
Rp=0.827
No overlap (unlike other SFs
except X-Score): Rp=0.776
Talk outline
1. Motivation
2. Ligand-based Virtual Screening: Methods
3. Ligand-based Virtual Screening: Applications
4. Structure-based Virtual Screening: Methods
5. Structure-based Virtual Screening: Applications
First prospective application of RF-Score
Type II dehydroquinase - DHQase (EC 4.2.1.10)
• 3rd enzyme of the shikimate pathway
• Ideal target for antimicrobials development, as pathway
is essential for bacteria, but not present in mammals.
• Bacteria include human pathogens such as M. tuberculosis
(tuberculosis) and H. pylori (duodenal/gastric ulcers)
Hierarchical Virtual Screening for the discovery of new molecular scaffolds in
antibacterial hit identification
LibMedia Drug Design -
Oxford, UK, September 2012
High-Throughput Screening (HTS)
• Robinson et al. (2006) J Med Chem 49:1282
• HTS of 150,000 cpds against Type II DHQase from H. pylori
• Primary screening: 100 cpds with 50% inhibition at cpd
concentration of 20 μg/mL (~106 μM)
• Secondary screening: only one confirmed active reported
• Ki of this HTS hit (AH9095):
• HP: 20 μM
• SC: 230 μM
• MT: 10% at 200 μM
• As a result, only three known scaffolds for DHQase, the
other two being rational derivatives of the substrate.
Hierarchical Virtual Screening
[Flow diagram: ZINC drug-like (9M) → USR with templates CA2, GAJ and RP4 → 3.9K → GOLD docking into 2BT4, 2C4W and 2CJF → RF-Score → 148 → testing virtual screening hits in vitro (£5000) → compounds active against target?]
Shape similarity screen
• Screening library: ZINC8 subset of 8,784,580 drug-like
commercially available molecules.
• i.e. 59 times larger than that of the HTS (150,000 mols.)
• USR: three distinct co-crystallised ligands as templates to
account for partial shape complementarity to some extent
→ promotes diversity of hits
• From ~9M to ~4K molecules similarly shaped to co-crystallised
ligands → ~complementary to binding site
→ more likely to bind than those in the initial library
GOLD pose generation + RF-Score rescoring
[Figure: co-crystallised complexes 2cjf-rp4 (Scl), 2bt4-ca2 (Scl) and 2c4w-gaj (Hp)]
• Docking pose generation by Martina Mangold & Jochen
Blumberger (Dept. Chemistry, University of Cambridge):
• GOLD: generate 3D conformation, position and orientation of a
small molecule relative to a putative binding site.
• calibration: RMSD of co-crystallised vs predicted poses.
• Purchase: top ranked poses according to RF-Score.
Experimental confirmation: wet-lab assays
• Experimental validation of hits by Nigel Howard and
Chris Abell (Dept. Chemistry, University of Cambridge).
• Outstanding hit rates of ~60% with Ki ≤ 250 μM → 100
new and structurally diverse actives (£5,000 cost).
Overall Performance    Ki ≤ 100 μM   Ki ≤ 250 μM   (L1, L2, L3) [μM]
Against Mtb DHQase     35 (23.6%)    89 (60.1%)    (23, 24, 40)
Against Scl DHQase     40 (27.0%)    91 (61.5%)    (4, 21, 29)
New active scaffolds for Type II DHQase
[Figure: new active scaffolds against M. tuberculosis DHQase, with Ki values]
Computational Drug Design School of Computing,
University of Kent, Nov 2012
New active scaffolds for Type II DHQase
[Figure: new active scaffolds against S. coelicolor DHQase, with Ki values]
Summary
• Computational methods needed to reduce expenses
and timescales, but also to tackle challenging targets.
• Goal: finding potent, selective and innovative drug-like
molecules with activity in target and whole-cell assays.
• If active molecules are known, use shape and other 3D properties
(e.g. USR/UFSRAT) to search very large molecular DBs.
• If only the structure of the target is known, dock molecules with the
latest generation of machine learning scoring functions.
• Remember: VS methods can be complementary to HTS
or combined hierarchically. Need experimental data.
Thanks – Q&A