The Scaffold Tree
An Analysis Method for Chemical Structure Data Sets
Ansgar Schuffenhauer
Contributors
2 | The Scaffold Tree | Ansgar Schuffenhauer | Sheffield June 2007
MPI Dortmund
• Stefan Wetzel
• Marcus Koch
• Steffen Renner
• Herbert Waldmann
Novartis colleagues
• Peter Ertl
• Silvio Roggo
• Nathan Brown
• Paul Selzer
• Jeremy Jenkins
• Kamal Azzoui
• Jacques Hamon
• Edgar Jacoby
Chemical Classification: Why is it needed?
Industrial and publicly funded research uses increasingly large screening collections
• An increasing share of these compounds results from parallel synthesis
HTS of these collections gives increasingly large hit sets
It is important to identify hits belonging to a common chemical class
• They can be explored in a joint synthesis effort
• This effort can be guided by SAR derived from the screening data
Get the chemical classes right, then map the biological response onto the “chemical map”
Classification of Chemical Structures
Clustering
Classification derived from unsupervised machine-learning
Information of complete dataset is required for classification
No linear scaling with dataset size
No incremental updates possible
Rule-based
Explicitly formulated rules encode “expert knowledge”
Class assignment is derived for each structure independently
Scales linearly with number of molecules in dataset
Incremental updates possible
The Molecular Framework and its Generalizations
Bemis and Murcko, J. Med. Chem. 1996, 39, 2887
Xu and Johnson, J. Chem. Inf. Comput. Sci. 2002, 42, 912
[Figure: an original structure is reduced stepwise. Pruning terminal sidechains gives the molecular framework; discarding atom and bond type gives the topological framework; discarding ring sizes and linkage lengths gives the reduced framework graph. Annotations: “Not well defined chemical entities”; “Addition of a cyclic sidechain prevents recognition of common core”.]
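The sidechain-pruning step can be sketched on a plain atom graph. This is a minimal illustration, assuming a molecule is given as a hypothetical adjacency mapping of atom indices; it ignores atom and bond types, so it does not retain exocyclic double bonds, and it is not the implementation used in the cited work.

```python
def build_adjacency(bonds):
    """Build an undirected adjacency mapping from a list of (atom, atom) bonds."""
    adjacency = {}
    for a, b in bonds:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def prune_terminal_sidechains(adjacency):
    """Repeatedly strip atoms with at most one neighbour.

    What remains are ring atoms and the linker atoms between rings;
    an acyclic molecule is pruned to an empty graph.
    """
    graph = {atom: set(nbrs) for atom, nbrs in adjacency.items()}
    leaves = [atom for atom, nbrs in graph.items() if len(nbrs) <= 1]
    while leaves:
        atom = leaves.pop()
        if atom not in graph:
            continue  # already removed via another path
        for nbr in graph.pop(atom):
            graph[nbr].discard(atom)
            if len(graph[nbr]) <= 1:
                leaves.append(nbr)
    return graph


# Hypothetical example: two six-membered rings joined by a one-atom linker,
# with a two-atom sidechain (atoms 13 and 14) on the second ring.
bonds = [
    (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0),        # ring A
    (0, 6), (6, 7),                                        # linker
    (7, 8), (8, 9), (9, 10), (10, 11), (11, 12), (12, 7),  # ring B
    (9, 13), (13, 14),                                     # sidechain
]
framework = prune_terminal_sidechains(build_adjacency(bonds))
print(sorted(framework))
```

Only the sidechain atoms are removed; both rings and the linker atom survive because every one of them keeps at least two neighbours.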
Are there Alternatives?
Retain the molecular framework as classification element
• Exocyclic and “exolinker” double bonds are part of the molecular framework
Instead of removing atom type, bond type and ring size information, prune less important rings and sidechains piecemeal
• Do not disconnect the scaffold
• Use prioritization rules to decide which ring to remove first
• Small, generic set of rules, no “dictionary”
Schuffenhauer et al. J. Chem. Inf. Model. 2007, 47, 47
An introductory example: Baccatin III
[Figure: structure of Baccatin III and its molecular framework.]
Rule 3: Choose the parent scaffold with the smallest number of acyclic linker bonds
Scaffolds having the least number of acyclic linker atoms are likely to be more rigid
• Rigid scaffolds are more likely to present their sidechains in a conserved orientation
Acyclic linkers are strategic bonds
• Likely to be formed late in a parallel synthesis effort
[Figure: Flucloxacillin (5250-39-5), rings labeled A to D, with its candidate parent scaffolds.]
Rule 4: Keep bridged and spiro rings and unusually fused ring systems
Such ring patterns are unusual and unlikely to be formed unintentionally
Use the difference between the number of bonds that are members of more than one ring (nrrb) and the number of rings (nR) minus 1:
|Δ| = |nrrb − (nR − 1)|
• In the most common linear ring fusion pattern, nrrb = nR − 1
Case 1: bridged rings
[Figure: Pentazocine (359-83-1), rings labeled A to C. Candidate parent scaffolds: |2 − (2 − 1)| = 1 and |1 − (2 − 1)| = 0.]
Rule 4 continued
Case 2: non-linearly fused rings
[Figure: Sophocarpin (6483-15-4), rings labeled A to D. Candidate parent scaffolds: |3 − (3 − 1)| = 1, |2 − (3 − 1)| = 0, |2 − (3 − 1)| = 0.]
Case 3: spiro rings
[Figure: Rhynchophylline (76-66-4), rings labeled A to C. Candidate parent scaffolds: |0 − (2 − 1)| = 1 and |1 − (2 − 1)| = 0.]
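The |Δ| criterion of Rule 4 can be checked on small ring systems. The sketch below assumes each ring is given as a list of atom indices in cyclic order (a hypothetical input format, for example from a smallest-set-of-smallest-rings perception step); the three example ring systems are generic bicycles, not the structures on the slides.

```python
from collections import Counter


def ring_bonds(ring):
    """Return the bonds of a ring given as atom indices in cyclic order."""
    return {frozenset((ring[i], ring[(i + 1) % len(ring)])) for i in range(len(ring))}


def delta(rings):
    """|nrrb - (nR - 1)|, where nrrb counts bonds that belong to more than one ring."""
    bond_counts = Counter(bond for ring in rings for bond in ring_bonds(ring))
    nrrb = sum(1 for count in bond_counts.values() if count > 1)
    return abs(nrrb - (len(rings) - 1))


# Linearly fused bicycle: two six-membered rings sharing the bond 4-5.
fused = [[0, 1, 2, 3, 4, 5], [4, 5, 6, 7, 8, 9]]
# Bridged bicycle: two five-membered rings sharing the bonds 3-6 and 6-0.
bridged = [[0, 1, 2, 3, 6], [0, 4, 5, 3, 6]]
# Spiro bicycle: two five-membered rings sharing only atom 0, no bond.
spiro = [[0, 1, 2, 3, 4], [0, 5, 6, 7, 8]]

print(delta(fused), delta(bridged), delta(spiro))
```

A linearly fused system gives 0, while the bridged and spiro systems both give 1, which is what lets the rule prefer parent scaffolds that keep such ring patterns.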
Rule 6: Remove rings of size 3, 5 and 6 first
[Figure: Epinastine (80012-43-7), rings labeled A and B, with candidate parent scaffolds a) and b).]
The majority of commercially available building blocks contain rings of size 3, 5 or 6.
If rings of other sizes occur, they are likely to have been built intentionally to fulfill a dedicated purpose
Rule 8: Remove rings with the least number of hetero-atoms first
Hetero-atoms can make characteristic H-bonding interactions
Hetero-atoms bound with an exocyclic double bond
Rule 9: If the number of hetero-atoms is equal, the priority of hetero-atoms is N > O > S
N-heterocycles play an important role in medicinal chemistry
N and O atoms are capable of forming H-bonds
Avoid mapping to benzene in cases where there are alternatives
[Figure: Ticlopidine (55142-85-3), rings labeled A and B, with its candidate parent scaffolds.]
Rule 10: Keep larger rings with priority
Rule 11: Of mixed aromatic/non-aromatic ring systems, retain non-aromatic rings with priority
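Applying the rules in order, with each rule only breaking ties left by the previous ones, behaves like a lexicographic sort. The sketch below illustrates that idea for the rules shown on these slides only (3, 4, 6, 8, 9, 10, 11); the `RemovableRing` record and its precomputed descriptor fields are hypothetical, and the published rule set contains further rules and special cases that are not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RemovableRing:
    """Hypothetical descriptors of one candidate ring and the parent left after removing it."""
    label: str
    parent_linker_bonds: int  # acyclic linker bonds in the resulting parent scaffold
    parent_delta: int         # |nrrb - (nR - 1)| of the resulting parent scaffold
    size: int
    n_nitrogen: int
    n_oxygen: int
    n_sulfur: int
    aromatic: bool


def removal_key(ring):
    """Smaller key means the ring is removed earlier."""
    n_hetero = ring.n_nitrogen + ring.n_oxygen + ring.n_sulfur
    return (
        ring.parent_linker_bonds,    # Rule 3: fewest linker bonds in the parent
        -ring.parent_delta,          # Rule 4: prefer parents keeping bridged/spiro patterns
        ring.size not in (3, 5, 6),  # Rule 6: rings of size 3, 5, 6 go first
        n_hetero,                    # Rule 8: fewest hetero-atoms go first
        ring.n_nitrogen,             # Rule 9: at equal count, rings with N are kept ...
        ring.n_oxygen,               # ... then rings with O, so S-only rings go first
        ring.size,                   # Rule 10: smaller rings go first
        not ring.aromatic,           # Rule 11: aromatic rings go first
    )


def ring_to_remove(rings):
    """Pick the ring whose removal the rule cascade favours."""
    return min(rings, key=removal_key)


# Hypothetical candidates with one hetero-atom each: Rule 9 decides.
n_ring = RemovableRing("N-ring", 0, 0, 6, 1, 0, 0, False)
s_ring = RemovableRing("S-ring", 0, 0, 5, 0, 0, 1, True)
print(ring_to_remove([n_ring, s_ring]).label)
```

Encoding the cascade as a tuple keeps each rule on its own line, so reordering or adding rules does not require restructuring the selection logic.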
Rule 13: Use canonical SMILES as tie-breaking rule
Keep the scaffold whose canonical SMILES has alphabetical sort precedence
The next pruning step will prune the ring that “won” the tie-break
[Figure: Ormeloxifene (31477-60-8), rings labeled A to D. Canonical SMILES of the two candidate parent scaffolds:]
C2Oc1ccccc1CC2c3ccccc3
C3CC(c1ccccc1)c2ccccc2O3
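As a small illustration of the tie-break, the two SMILES strings above can be compared directly. This sketch assumes that “alphabetical sort precedence” corresponds to Python's default string ordering (comparison by character code point); the toolkit used in the original work may order characters differently.

```python
# The two candidate parent scaffolds, as canonical SMILES from the slide.
candidates = [
    "C3CC(c1ccccc1)c2ccccc2O3",
    "C2Oc1ccccc1CC2c3ccccc3",
]

# min() returns the string that sorts first; here the strings first differ
# at the second character, where "2" sorts before "3".
kept = min(candidates)
print(kept)
```

Because the comparison depends only on the strings, the result is deterministic and independent of the order in which the candidates were generated.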
The introductory example revisited: Baccatin III
[Figure: Baccatin III, its molecular framework, and Taxol; the pruning steps are annotated with the deciding rules Rule 3, Rule 4, Rule 4 and Rule 6.]
Which rule is used how often? See poster T19
Classification of a public HTS data set: NCGC Pyruvate Kinase screen
HTS run at the NCGC, an NIH Roadmap screening institute
• Data can be downloaded from PubChem
• http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?aid=361
602 active and ~50 000 inactive molecules
Scaffolds shown in the tree
• Have at least 5% actives
• Represent at least 0.02% (10 compounds) of the whole data set
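The two display criteria can be expressed as a simple filter. This is a minimal sketch, assuming per-scaffold counts are available as a hypothetical mapping from scaffold name to (number of actives, number of compounds); the scaffold names and counts below are made up for illustration.

```python
def scaffolds_to_show(scaffold_counts, min_active_fraction=0.05, min_compounds=10):
    """Keep scaffolds that are both sufficiently enriched and sufficiently populated.

    scaffold_counts maps a scaffold name to (n_active, n_compounds).
    Returns a mapping from scaffold name to its fraction of actives.
    """
    shown = {}
    for scaffold, (n_active, n_compounds) in scaffold_counts.items():
        if n_compounds < min_compounds:
            continue  # too few compounds in the data set
        fraction = n_active / n_compounds
        if fraction >= min_active_fraction:
            shown[scaffold] = fraction
    return shown


# Made-up counts, not taken from the pyruvate kinase data set.
counts = {
    "scaffold_A": (3, 40),   # 7.5% actives, 40 compounds: shown
    "scaffold_B": (1, 200),  # 0.5% actives: not enriched enough
    "scaffold_C": (2, 6),    # enriched, but fewer than 10 compounds
}
shown = scaffolds_to_show(counts)
print(sorted(shown))
```

The returned fraction of actives is the quantity that the tree figures encode as color intensity.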
Scaffold Tree Example for HTS results: PubChem Pyruvate Kinase Data Set
[Figures: the scaffold tree of the pyruvate kinase data set, shown over four slides, with color intensity encoding the fraction of actives. The last slide shows part of the tree with the scaffold structures drawn; legend text: “actives”, “4.8”.]
Scaffold Tree doesn’t fit on a screen or a poster?
Scaffold Tree extracts meaningful chemical series
• The tree allows visualization of the scaffold hierarchy
• If it would just fit onto the screen…
Interactive visualization tool needed:
• Manipulate resolution
• Filter scaffolds
See poster T30
Chemical Series and Biological Response
Enrichment of actives in part of the series
• Even in series with enriched activity, the actives are only enriched up to 20%
• This suggests that the scaffold’s potential for biological activity can only be realized with the appropriate side chains
• After all, this is the rationale behind combinatorial chemistry
Chemical Series and Biological Response continued
Chemical variation of library parallel to smooth change of biological response
• Essential binding features (“privileged substructure”) conserved in library variation
• Attractive entry point for chemical exploration
• Derivation of SAR possible
Chemical variation of library orthogonal to smooth change of biological response
• Chemical variation disrupts essential binding features
• SAR appears to be “flat” or active compounds appear to be singletons
[Figure: two schematic plots of chemical space, each showing the library around a scaffold projected into chemical space against the desired biological activity, shaded from inactive to highly active; example scaffolds with R groups.]
Structural Classification Evaluated in Biology Space
Can we assess the relation between chemical class and biological activity in a more general manner?
• Can we compare clustering and rule-based classification?
Use in vitro data on a uniform assay panel to measure success
Two competing objectives
• As few partitions as possible
• Biological activity profile of compounds within partitions should be as similar as possible (low cluster spread SP)
Evaluate by Pareto analysis
Schuffenhauer et al. J. Chem. Inf. Model. 2007, 47, 325
Pareto Analysis
One solution of the problem is superior to another solution only if it is superior in all objectives
• A, B, C are all Pareto optimal solutions
• B is superior to D
• A-D are superior to E
[Figure: two plots of SP against nPartitions, with an arrow marking the direction of optimization. The first plot shows the example solutions A to E; the second shows random partitioning (lower end benchmark), profile based clustering (upper end benchmark) and partitions by structure.]
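The superiority relation stated on this slide can be written down directly. In the sketch below both objectives (number of partitions, SP) are minimized, and the coordinates for A to E are hypothetical values chosen only so that they reproduce the relations listed on the slide.

```python
def is_superior(p, q):
    """p is superior to q only if it is better (here: lower) in all objectives."""
    return all(a < b for a, b in zip(p, q))


def pareto_optimal(solutions):
    """Return the names of solutions to which no other solution is superior."""
    return sorted(
        name
        for name, point in solutions.items()
        if not any(
            is_superior(other_point, point)
            for other_name, other_point in solutions.items()
            if other_name != name
        )
    )


# Hypothetical (nPartitions, SP) coordinates for the five example solutions.
solutions = {
    "A": (2, 2.0),
    "B": (5, 1.2),
    "C": (20, 0.5),
    "D": (8, 1.5),
    "E": (30, 2.2),
}
front = pareto_optimal(solutions)
print(front)
```

Note that this follows the slide's wording, which requires a solution to be better in every objective; many textbook definitions of Pareto dominance only require it to be no worse in all objectives and better in at least one.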
Spread of Partitions in Biological Profile Space
Distance of two compounds i, j in profile space:
\[ d(i,j) = \sum_{a}^{\mathrm{assays}} \left[ pIC_{50}(i,a) - pIC_{50}(j,a) \right]^{2} \]
Average within partition k:
\[ sp_k = \frac{2}{n_k (n_k - 1)} \sum_{i=1}^{i \le n_k} \sum_{j=1}^{j < i} d(i,j) \]
Average over all partitions weighted by partition size (to be minimized):
\[ SP = \sum_{k=1}^{n_{cluster}} \frac{n_k}{n_{total}} \, sp_k \]
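The spread measure can be computed in a few lines. This sketch assumes each compound is a list of pIC50 values over the same assays and takes the distance as the sum of squared pIC50 differences; if a Euclidean distance is intended, a square root would be applied in `distance`. Setting the spread of a single-compound partition to 0 is also an assumption, since the average over pairs is undefined there.

```python
from itertools import combinations


def distance(profile_i, profile_j):
    """Sum of squared pIC50 differences over all assays."""
    return sum((x - y) ** 2 for x, y in zip(profile_i, profile_j))


def partition_spread(profiles):
    """Average pairwise distance within one partition (sp_k)."""
    n = len(profiles)
    if n < 2:
        return 0.0  # assumption: a singleton partition has no spread
    total = sum(distance(p, q) for p, q in combinations(profiles, 2))
    return 2.0 * total / (n * (n - 1))


def overall_spread(partitions):
    """Size-weighted average of the partition spreads (SP)."""
    n_total = sum(len(profiles) for profiles in partitions)
    return sum(len(profiles) / n_total * partition_spread(profiles) for profiles in partitions)


# Hypothetical two-assay profiles split into two partitions.
partitions = [
    [[5.0, 6.0], [5.0, 7.0]],  # d = 1.0, so sp = 1.0
    [[8.0, 6.0]],              # singleton, sp = 0.0
]
sp_total = overall_spread(partitions)
print(round(sp_total, 4))
```

Summing over unordered pairs with `combinations` and multiplying by 2 / (n (n − 1)) is the same as dividing the pair sum by the number of pairs.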
Pharmacology Safety Profile Data Set
Safety-Pharmacology Profile
• 1006 compounds
• IC50/EC50 values in 27 assays (mostly aminergic GPCR and ion channels)
• No missing values
- Spread values well defined
Classification of Safety Data Set: Scaffold Tree, Pharmacology Profile
[Figure: SP (0.0 to 2.5) against NPartitions (axis ticks at 1, 10, 100, 1000). Series: pIC50_Kmeans, pIC50_noise_Kmeans, Random, Scaffold Tree. The Scaffold Tree points are annotated “1 ring”, “2 rings”, “3 rings”.]
Classification of Safety Data Set: Clustering with FCFP_4 Fingerprints
[Figure: SP (0.0 to 2.5) against NPartitions (axis ticks at 1, 10, 100, 1000). Series: pIC50_Kmeans, pIC50_noise_Kmeans, Random, Scaffold Tree, FCFP_4_PPClust, FCFP_4_DivKM. The Scaffold Tree points are annotated “1 ring”, “2 rings”, “3 rings”.]
Classification of Safety Data Set: FEPOPS Pharmacophore Descriptors
[Figure: SP (0.0 to 2.5) against NPartitions (axis ticks at 1, 10, 100, 1000). Series: pIC50_Kmeans, pIC50_noise_Kmeans, Random, Scaffold Tree, FCFP_4_PPClust, FCFP_4_DivKM, FEPOPS_DivKM_maj.]
Jenkins et al. J. Med. Chem. 2004, 47, 6144
Classification of Safety Data Set: Simple 1D descriptors (MW, AlogP, PSA, ..)
[Figure: SP (0.0 to 2.5) against NPartitions (axis ticks at 1, 10, 100, 1000). Series: pIC50_Kmeans, pIC50_noise_Kmeans, Random, Scaffold Tree, FCFP_4_PPClust, FCFP_4_DivKM, FEPOPS_DivKM_maj, simple_1D_DivKM.]
simple 1D: MW, ALogP, Num_RotatableBonds, PSA, Num_H_Acceptors, Num_H_Donors
Are the classifications overlapping? Adjusted Rand index matrix, Pharmacology Profile
Data set: safety profile. The number of partitions of each method is given in parentheses; columns are numbered in the same order as the rows.
 #  Method (partitions)              1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18
 1  pIC50_Kmeans (196)            0.98 0.84 0.03 0.00 0.02 0.02 0.02 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.00
 2  pIC50_noise_Kmeans (196)      0.84 0.77 0.03 0.00 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.00
 3  Murcko (SCINS, 167)           0.03 0.03 1.00 0.19 0.26 0.24 0.16 0.16 0.16 0.10 0.11 0.05 0.07 0.07 0.09 0.09 0.04 0.00
 4  Scaffold_Tree (level 2, 311)  0.00 0.00 0.19 1.00 0.33 0.34 0.15 0.14 0.13 0.10 0.14 0.05 0.06 0.06 0.08 0.06 0.03 0.00
 5  FCFP_4_PPClust (196)          0.02 0.01 0.26 0.33 0.83 0.76 0.31 0.30 0.25 0.21 0.36 0.09 0.12 0.13 0.15 0.13 0.06 0.00
 6  UNITY_FP_PPClust (196)        0.02 0.01 0.24 0.34 0.76 0.81 0.32 0.28 0.24 0.21 0.36 0.09 0.11 0.12 0.14 0.13 0.06 0.00
 7  Unity_DivKM (196)             0.02 0.01 0.16 0.15 0.31 0.32 1.00 0.34 0.26 0.22 0.16 0.09 0.11 0.10 0.10 0.14 0.06 0.00
 8  FCFP_4_DivKM (196)            0.01 0.01 0.16 0.14 0.30 0.28 0.34 0.68 0.26 0.18 0.14 0.09 0.10 0.10 0.09 0.14 0.05 0.00
 9  Similog_DivKM (196)           0.01 0.01 0.16 0.13 0.25 0.24 0.26 0.26 0.57 0.17 0.12 0.08 0.10 0.09 0.09 0.15 0.06 0.00
10  RDF_DivKM (196)               0.01 0.01 0.10 0.10 0.21 0.21 0.22 0.18 0.17 1.00 0.26 0.06 0.07 0.07 0.07 0.11 0.04 0.00
11  RDF_SOM (157)                 0.01 0.01 0.11 0.14 0.36 0.36 0.16 0.14 0.12 0.26 0.56 0.05 0.06 0.06 0.07 0.07 0.03 0.00
12  FEPOPS_DivKM_dist (198)       0.01 0.01 0.05 0.05 0.09 0.09 0.09 0.09 0.08 0.06 0.05 0.48 0.31 0.10 0.09 0.05 0.03 0.00
13  FEPOPS_DivKM_maj (196)        0.01 0.01 0.07 0.06 0.12 0.11 0.11 0.10 0.10 0.07 0.06 0.31 0.80 0.11 0.14 0.06 0.03 0.00
14  FEPOPS_PPClust_dist (189)     0.01 0.01 0.07 0.06 0.13 0.12 0.10 0.10 0.09 0.07 0.06 0.10 0.11 0.16 0.16 0.06 0.03 0.00
15  FEPOPS_PPClust_maj (176)      0.01 0.01 0.09 0.08 0.15 0.14 0.10 0.09 0.09 0.07 0.07 0.09 0.14 0.16 0.20 0.06 0.03 0.00
16  PhysChem_DivKM (196)          0.01 0.01 0.09 0.06 0.13 0.13 0.14 0.14 0.15 0.11 0.07 0.05 0.06 0.06 0.06 1.00 0.16 0.00
17  PhysChem_PCA (182)            0.01 0.01 0.04 0.03 0.06 0.06 0.06 0.05 0.06 0.04 0.03 0.03 0.03 0.03 0.03 0.16 1.00 0.00
18  Random (196)                  0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
For the adjusted Rand index: Hubert and Arabie, J. Classif. 1985, 2, 193
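Each entry of the matrix compares two partitionings of the same compounds. The sketch below computes the adjusted Rand index from two label lists using the standard pair-counting form of Hubert and Arabie; the label lists are hypothetical, and returning 1.0 for degenerate inputs (fewer than two items, or a zero denominator) is a convention chosen here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two partitionings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_a)
    if n < 2:
        return 1.0  # convention for degenerate input
    # Pairs of items that share a partition in both labelings, in A, and in B.
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:
        return 1.0  # convention: both labelings are trivially identical
    return (pairs_both - expected) / (maximum - expected)


# Hypothetical labelings of four compounds.
same_up_to_renaming = adjusted_rand_index([0, 0, 1, 1], ["x", "x", "y", "y"])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same_up_to_renaming, crossed)
```

Identical partitionings score 1 regardless of how the classes are named, agreement at chance level scores around 0, and systematic disagreement can give negative values.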
Summary Pareto Analysis
No partitioning method is generally superior to all others: there is a trade-off between local precision and scaffold hopping potential
The results of all methods are still closer to random than to the ideal biological clustering
• Not all biological activity clusters are covered by a single chemical class
- Especially not all inactives
[Figure: methods arranged between local precision and scaffold hopping potential: scaffold tree, 2D descriptors (FCFP_4), 3D pharmacophore descriptors (FEPOPS).]
Summary
With the Scaffold Tree we can detect chemically meaningful series
• Scaffold Tree allows us to detect series with enriched activity
However, chemical structure classes are not equivalent to biological activity classes
• Within a chemical series biological activity can vary
• A biological activity class can be spread over several structural classes
- This is especially true for the “inactive” class.
• This applies to a wide range of structural classifications
However, continuous changes in biological activity with chemical variation within a chemical series are actually desirable
• Biological activity that varies smoothly within a chemical class indicates optimization potential
• Initial SAR derived from the series screening results may guide this optimization