DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING
Advanced Circuits, Architecture, and Computing Lab
Molecular Docking for Drug Discovery: Machine-Learning Approaches for Native
Pose Prediction for Protein-Ligand Complexes
Hossam M. [email protected]
Tenth International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics
(CIBB 2013)
June 20, 2013
Nihar R. [email protected]
Department of Electrical & Computer Engineering
Michigan State University, East Lansing, MI, U.S.A.
© 2013
Accurately predicting the binding affinity (BA) of large sets of diverse protein-ligand complexes remains one of the most challenging unsolved problems in computational biomolecular science.
Conventional scoring functions (SFs) have been shown to have limited predictive and docking power.
The size and diversity of protein-ligand complexes with known experimental BA are limited.
Large and diverse datasets of protein-ligand complexes help in building more accurate statistical-based SFs.
Motivation
• Motivation
• Background and Scope of Work
– Scoring Functions
– Our Approach and Scope of Work
• Materials and Methods
– Compound Database and Characterization
– Machine Learning Methods
• Experiments, Results, and Discussion
– Tuning, Training, and Testing Scoring Functions
– Evaluation and Comparison of Scoring Functions
• Concluding Remarks
Outline
Lack of accurate accounting of intermolecular physicochemical interactions
Imprecise solvent modeling
Uncertainties in collected experimental affinity data
Inability to capture inherent nonlinear relationships correlating intermolecular interactions to binding affinity or native binding pose
Scoring & Docking Challenges
Predict the binding pose explicitly.
Use sophisticated machine-learning methods to model closeness of a pose to the native conformation.
Use these nonparametric techniques in conjunction with physicochemical features describing intermolecular interactions between proteins and ligands.
Train predictive models on a large and diverse dataset of high-quality protein-ligand complexes.
Evaluate the docking accuracies of the resulting SFs on diverse protein families.
Our Approach & Scope of Work
Compound Database: PDBbind [1]
Protein-ligand complexes obtained from PDBbind 2007
PDBbind is a selective compilation of the Protein Data Bank (PDB) database
From the PDB to the refined set of PDBbind, in flowchart order:
• Ligand's MW ≤ 1000
• Number of non-hydrogen atoms of the ligand ≥ 6
• Only one ligand is bound to the protein
• Protein and ligand are non-covalently bound
• Resolution of the complex crystal structure ≤ 2.5 Å
• Elements in the complex must be C, N, O, P, S, F, Cl, Br, I, or H
• Known Kd or Ki
• Hydrogenation
• Protonation and deprotonation
PDBbind: Refined Set
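The selection criteria for the refined set can be read as a single yes/no test per complex. The following is a minimal sketch of that reading, not the PDBbind pipeline itself: the `Complex` record and all of its field names are hypothetical, and only the thresholds come from the slide.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

# Elements allowed anywhere in the complex, as listed on the slide.
ALLOWED_ELEMENTS = frozenset({"C", "N", "O", "P", "S", "F", "Cl", "Br", "I", "H"})


@dataclass
class Complex:
    """Hypothetical record for one PDB entry; field names are illustrative."""
    ligand_mw: float              # ligand molecular weight
    ligand_heavy_atoms: int       # non-hydrogen atoms in the ligand
    n_bound_ligands: int          # number of ligands bound to the protein
    covalently_bound: bool        # True if the ligand is covalently attached
    resolution: float             # crystal-structure resolution, in angstroms
    elements: FrozenSet[str]      # element symbols present in the complex
    kd_or_ki: Optional[float]     # measured Kd or Ki, None if unknown


def passes_refined_filters(c: Complex) -> bool:
    """Return True if the complex satisfies every criterion on the slide."""
    return (
        c.ligand_mw <= 1000
        and c.ligand_heavy_atoms >= 6
        and c.n_bound_ligands == 1
        and not c.covalently_bound
        and c.resolution <= 2.5
        and c.elements <= ALLOWED_ELEMENTS   # subset test
        and c.kd_or_ki is not None
    )
```

Hydrogenation and protonation/deprotonation are structure-preparation steps rather than filters, so they are left out of this sketch.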
PDBbind: Core Set
From the refined set to the core set of PDBbind [2], in flowchart order:
• Similarity search using BLAST
• Similarity cutoff of 90%
• Keep clusters with ≥ 4 complexes
• Binding affinity of the highest-affinity complex is at least 100-fold that of the lowest-affinity one
• Take the highest-, middle-, and lowest-affinity complexes from each cluster
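The core-set selection can be sketched as a filter over precomputed clusters. This is an illustration under stated assumptions, not the PDBbind procedure: clusters are assumed to come from a prior BLAST similarity search, affinities are assumed to be given as pK = -log10(Kd or Ki) so that a 100-fold ratio equals 2 pK units, and the "middle" complex is taken as the middle element of the sorted list. The function name and data layout are hypothetical.

```python
import math
from typing import Dict, List, Tuple

# A cluster member is (complex_id, pK), where pK = -log10(Kd or Ki).
Member = Tuple[str, float]


def select_core_set(clusters: Dict[str, List[Member]],
                    min_size: int = 4,
                    min_fold: float = 100.0) -> List[str]:
    """Pick the highest-, middle-, and lowest-affinity complex per cluster."""
    min_pk_span = math.log10(min_fold)  # 100-fold in Kd/Ki == 2 pK units
    core: List[str] = []
    for members in clusters.values():
        if len(members) < min_size:
            continue  # cluster too small
        ranked = sorted(members, key=lambda m: m[1], reverse=True)
        if ranked[0][1] - ranked[-1][1] < min_pk_span:
            continue  # affinity range too narrow
        middle = ranked[len(ranked) // 2]
        core.extend([ranked[0][0], middle[0], ranked[-1][0]])
    return core
```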
Decoy Generation
Decoy generation for one protein-ligand complex [2], in flowchart order:
• Generate a random low-energy conformation
• Generate ~2000 conformations using 4 different docking protocols
• Discard poses > 10 Å from the native pose
• Group poses into ten 1 Å bins based on their RMSD values
• Further cluster each bin into 10 clusters
• Choose the lowest-energy pose from each sub-cluster
• Result: 100 decoys
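The last four steps of the decoy flow (discard, bin, sub-cluster, select) can be sketched as follows. This assumes the poses have already been generated and that each pose already carries a sub-cluster label within its RMSD bin; how that clustering is done is not stated on the slide. The `Pose` record and `select_decoys` are hypothetical names.

```python
from typing import Dict, List, NamedTuple, Tuple


class Pose(NamedTuple):
    """Hypothetical pose record; field names are illustrative."""
    pose_id: str
    rmsd: float     # RMSD from the native pose, in angstroms
    energy: float   # docking energy; lower is better
    cluster: int    # sub-cluster label assigned within the pose's RMSD bin


def select_decoys(poses: List[Pose], max_rmsd: float = 10.0,
                  n_bins: int = 10) -> List[Pose]:
    """Keep the lowest-energy pose from each (RMSD bin, sub-cluster) pair."""
    bin_width = max_rmsd / n_bins
    best: Dict[Tuple[int, int], Pose] = {}
    for pose in poses:
        if pose.rmsd > max_rmsd:
            continue  # too far from the native pose
        # A pose at exactly max_rmsd falls into the last bin.
        rmsd_bin = min(int(pose.rmsd // bin_width), n_bins - 1)
        key = (rmsd_bin, pose.cluster)
        if key not in best or pose.energy < best[key].energy:
            best[key] = pose
    return [best[key] for key in sorted(best)]
```

With 10 bins and 10 sub-clusters per bin, this yields at most 100 decoys per complex, fewer if some bins or sub-clusters are empty.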
Features extracted as calculated for the following scoring functions:
• X-Score (6 features)
• AffiScore (30 features)
• RF-Score (36 features)
• GOLD (14 features)
Compound Characterization
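Model labels such as RF::RG and MARS::XARG appear later in the deck. One plausible reading, which is an assumption here and not stated on the slides, is that each letter names a feature source (X = X-Score, A = AffiScore, R = RF-Score, G = GOLD) and that a combined set is the concatenation of the per-source feature vectors. A minimal sketch of that reading, with hypothetical function names:

```python
from typing import Dict, List, Sequence

# Feature counts per source, as listed on the slide.
# The single-letter keys are an assumed naming convention.
FEATURE_COUNTS = {"X": 6, "A": 30, "R": 36, "G": 14}


def feature_set_size(name: str) -> int:
    """Total number of features in a combined set such as 'XARG' or 'RG'."""
    return sum(FEATURE_COUNTS[letter] for letter in name)


def combine_features(name: str,
                     features_by_source: Dict[str, Sequence[float]]) -> List[float]:
    """Concatenate per-source feature vectors in the order given by `name`."""
    combined: List[float] = []
    for letter in name:
        vector = features_by_source[letter]
        if len(vector) != FEATURE_COUNTS[letter]:
            raise ValueError(f"unexpected feature count for source {letter!r}")
        combined.extend(vector)
    return combined
```

Under this reading, an XARG descriptor would have 6 + 30 + 36 + 14 = 86 features and an RG descriptor 36 + 14 = 50.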
Primary training set (Pr): 1105 (Y = BA); 39,085 (Y = RMSD)
Core test set (Cr): 16,554
Training and Test Datasets
• Single models
– Multiple linear regression (MLR)
– Multivariate adaptive regression splines (MARS)
– k-Nearest neighbors (kNN)
– Support vector machine (SVM)
• Ensemble models
– Random forests (RF)
– Boosted regression trees (BRT)
Machine Learning Methods
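As an illustration of how one of the listed single models can serve as a pose-scoring function, here is a minimal pure-Python k-nearest-neighbors regressor that predicts a pose's RMSD from its feature vector. It is a sketch only: the Euclidean distance, the unweighted mean, the default k, and the function name are assumptions, and the slides do not describe the settings actually used.

```python
import math
from typing import List, Sequence, Tuple

# A training example is (feature_vector, target), e.g. target = RMSD in angstroms.
Example = Tuple[Sequence[float], float]


def knn_predict(train: List[Example], query: Sequence[float], k: int = 3) -> float:
    """Predict the target as the mean over the k nearest training examples."""
    if not 1 <= k <= len(train):
        raise ValueError("k must be between 1 and the number of training examples")
    # Sort training examples by Euclidean distance to the query vector.
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    return sum(target for _, target in nearest) / k
```

When the target is RMSD, candidate poses would be ranked by ascending predicted value, so the pose predicted to be closest to the native conformation comes first.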
Conventional SFs
SF Type / Software   Discovery Studio   SYBYL                GOLD             Schrodinger   Standalone   |SFs|
Empirical            PLP, JAIN, LUDI    ChemScore, F-Score   ChemScore, ASP   GlideScore    X-Score      9
Knowledge-based      LigScore, PMF      PMF-Score            -                -             DrugScore    4
Force-field          -                  D-Score, G-Score     GoldScore        -             -            3
|SFs|                5                  5                    3                1             2            16
Docking power: Measures the ability of an SF to distinguish a promising binding mode from a less promising one
S(C, N) (in %)
Success rate: the percentage of times an SF is able to find a pose whose RMSD is within a predefined cutoff of C Å when only the N topmost poses, ranked by their predicted scores, are considered.
C (e.g., 0, 1, 2, and 3 Å); N (e.g., 1, 2, 3, and 5)
Evaluation of Scoring Functions
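The success-rate statistic defined above can be computed directly from ranked poses. The sketch below assumes each complex is given as a list of (predicted score, true RMSD) pairs; the data layout, the function name, and the `lower_is_better` flag are illustrative choices, not part of the original work.

```python
from typing import List, Tuple

# One scored pose: (predicted_score, true_rmsd_in_angstroms).
ScoredPose = Tuple[float, float]


def success_rate(complexes: List[List[ScoredPose]], cutoff: float, top_n: int,
                 lower_is_better: bool = True) -> float:
    """Percentage of complexes with a pose within `cutoff` among the top N.

    Set lower_is_better=True when the score is a predicted RMSD, and
    False when a higher score means a better pose (e.g. predicted BA).
    """
    if not complexes:
        raise ValueError("at least one complex is required")
    hits = 0
    for poses in complexes:
        ranked = sorted(poses, key=lambda p: p[0], reverse=not lower_is_better)
        if any(rmsd <= cutoff for _, rmsd in ranked[:top_n]):
            hits += 1
    return 100.0 * hits / len(complexes)
```

For example, `success_rate(complexes, cutoff=1.0, top_n=1)` corresponds to the statistic with C = 1 Å and N = 1.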
Success rates of Conv. & ML SFs: Core test set
Figure annotations, in the order they appear on the slide:
• GOLD::ASP: from 82% to 92%
• RF::RG: from 87% to 96%
• S(C=0 Å, N=5) ≈ 60%
• S(C=0 Å, N=5) ≈ 77%
Success rates of Conv. & ML SFs: HIV & TRY
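Figure annotations, in the order they appear on the slide:
• MLR: S(C=1 Å, N=1) = 72%, S(C=3 Å, N=1) = 90%
• MLR: S(C=1 Å, N=1) = 80%, S(C=3 Å, N=1) = 95%
• MLR: S(C=0 Å, N=1) = 50%, S(C=0 Å, N=5) = 90%
• MARS: S(C=0 Å, N=1) = 48%, S(C=0 Å, N=5) = 83%
• MLR: S(C=1 Å, N=1) = 41%, S(C=3 Å, N=1) = 80%
• MLR: S(C=1 Å, N=1) = 66%, S(C=3 Å, N=1) = 90%
• MARS: S(C=0 Å, N=1) = 36%, S(C=0 Å, N=5) = 80%
• MARS: S(C=0 Å, N=1) = 23%, S(C=0 Å, N=5) = 68%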
Success rates of Conv. & ML SFs: CAR & THR
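Figure annotations, in the order they appear on the slide:
• MLR: S(C=1 Å, N=1) = 22%, S(C=3 Å, N=1) = 53%
• SVM: S(C=1 Å, N=1) = 32%, S(C=3 Å, N=1) = 62%
• MLR: S(C=0 Å, N=1) = 40%, S(C=0 Å, N=5) = 79%
• MARS: S(C=0 Å, N=1) = 24%, S(C=0 Å, N=5) = 74%
• MLR: S(C=0 Å, N=1) = 15%, S(C=0 Å, N=5) = 34%
• MARS: S(C=0 Å, N=1) = 9%, S(C=0 Å, N=5) = 33%
• MLR: S(C=1 Å, N=1) = 58%, S(C=3 Å, N=1) = 82%
• MLR: S(C=1 Å, N=1) = 92%, S(C=3 Å, N=1) = 95%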
ML models trained to explicitly predict RMSD values significantly outperform all conventional SFs
Estimated RMSD values of such models have a correlation of 0.7 on average with the true RMSD values, while predicted BAs have a correlation as low as 0.2 with the measured RMSD values.
The empirical SF GOLD::ASP achieved a success rate of 70% in identifying a pose that lies within 1 Å of the native pose for 195 different complexes.
Our top RMSD-based SF, MARS::XARG, has a success rate of ~80% on the same test set.
Concluding Remarks
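The correlations quoted in the concluding remarks compare predicted values with true RMSD values. The slides do not say which correlation coefficient was used; the sketch below assumes Pearson's linear correlation, and the function name is illustrative.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long sequences of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Here `x` would hold a model's predicted values for a set of poses and `y` the true RMSD values of the same poses.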