Cibb2013

DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING

Advanced Circuits, Architecture, and Computing Lab

Molecular Docking for Drug Discovery: Machine-Learning Approaches for Native Pose Prediction for Protein-Ligand Complexes

Hossam M. [email protected]

Tenth International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics (CIBB 2013)

June 20, 2013

Nihar R. [email protected]

Department of Electrical & Computer Engineering

Michigan State University, East Lansing, MI, U.S.A.

© 2013

Motivation

• Accurately predicting the binding affinity (BA) of large sets of diverse protein-ligand complexes remains one of the most challenging unsolved problems in computational biomolecular science.

• Conventional scoring functions (SFs) have been shown to have limited predictive and docking power.

• The size and diversity of protein-ligand complexes with known experimental BA are limited.

• Large and diverse datasets of protein-ligand complexes help in building more accurate statistical-based SFs.

Outline

• Motivation

• Background and Scope of Work
– Scoring Functions
– Our Approach and Scope of Work

• Materials and Methods
– Compound Database and Characterization
– Machine Learning Methods

• Experiments, Results, and Discussion
– Tuning, Training, and Testing Scoring Functions
– Evaluation and Comparison of Scoring Functions

• Concluding Remarks

Background and Scope of Work

Docking & Scoring

Scoring & Docking Challenges

• Lack of accurate accounting of intermolecular physicochemical interactions

• Imprecise solvent modeling

• Uncertainties in collected experimental affinity data

• Inability to capture the inherent nonlinear relationships between intermolecular interactions and binding affinity or native binding pose

Our Approach & Scope of Work

• Predict the binding pose explicitly.

• Use sophisticated machine-learning methods to model the closeness of a pose to the native conformation.

• Use these nonparametric techniques in conjunction with physicochemical features describing intermolecular interactions between proteins and ligands.

• Train predictive models on a large and diverse dataset of high-quality protein-ligand complexes.

• Evaluate the docking accuracy of the resulting SFs on diverse protein families.

Materials and Methods

Compound Database: PDBbind

[1]

Compound Database: PDBbind

• Protein-ligand complexes were obtained from PDBbind 2007.

• PDBbind is a selective compilation of the Protein Data Bank (PDB) database.

PDBbind: Refined Set

Flow from the PDB to the refined set of PDBbind:

• Ligand's MW ≤ 1000
• Number of non-hydrogen atoms of the ligand ≥ 6
• Only one ligand is bound to the protein
• Protein & ligand are non-covalently bound
• Resolution of the complex crystal structure ≤ 2.5 Å
• Elements in the complex must be C, N, O, P, S, F, Cl, Br, I, or H
• Known Kd or Ki
• Hydrogenation
• Protonation & deprotonation
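The refined-set criteria above can be read as a simple filter over candidate complexes. The following is a minimal sketch under stated assumptions: the `Complex` record and its field names are hypothetical and do not reflect PDBbind's actual data format, and the hydrogenation and protonation steps are left out because they modify structures rather than filter them.

```python
from dataclasses import dataclass
from typing import Optional

# Elements allowed in a refined-set complex, as listed on the slide.
ALLOWED_ELEMENTS = {"C", "N", "O", "P", "S", "F", "Cl", "Br", "I", "H"}


@dataclass
class Complex:
    """Hypothetical record for one protein-ligand complex (illustrative only)."""
    ligand_mw: float            # ligand molecular weight
    ligand_heavy_atoms: int     # number of non-hydrogen ligand atoms
    n_bound_ligands: int        # number of ligands bound to the protein
    covalently_bound: bool      # True if the ligand is covalently attached
    resolution: float           # crystal-structure resolution in angstroms
    elements: frozenset         # element symbols present in the complex
    kd_or_ki: Optional[float]   # measured Kd or Ki, None if unknown


def passes_refined_filters(c: Complex) -> bool:
    """Return True if the complex satisfies every listed criterion."""
    return (
        c.ligand_mw <= 1000
        and c.ligand_heavy_atoms >= 6
        and c.n_bound_ligands == 1
        and not c.covalently_bound
        and c.resolution <= 2.5
        and set(c.elements) <= ALLOWED_ELEMENTS
        and c.kd_or_ki is not None
    )
```

A complex that fails any single test, for example a structure resolved at 3.0 Å or one containing a metal ion, is rejected.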

PDBbind: Core Set [2]

Flow from the refined set to the core set in PDBbind:

• Similarity search using BLAST
• Similarity cutoff of 90%
• Clusters with ≥ 4 complexes
• Binding affinity of the highest-affinity complex is 100-fold that of the lowest-affinity one
• Highest-, middle-, and lowest-affinity complexes from each cluster
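The selection step of the core-set flow can be sketched as follows. This is an illustration only: it assumes the BLAST-based clustering has already been done and is supplied as a plain dictionary, it treats the 100-fold criterion as a minimum affinity range, and the function name and input layout are hypothetical.

```python
def select_core_set(clusters, min_size=4, min_fold=100.0):
    """Pick highest-, middle-, and lowest-affinity complexes per cluster.

    clusters: dict mapping a cluster label to a list of (complex_id, kd)
    pairs, where kd is a dissociation constant in molar units (a lower
    kd means a higher affinity).
    """
    core = []
    for members in clusters.values():
        if len(members) < min_size:
            continue  # cluster has fewer than 4 complexes
        ranked = sorted(members, key=lambda m: m[1])  # strongest binder first
        strongest, weakest = ranked[0], ranked[-1]
        if weakest[1] / strongest[1] < min_fold:
            continue  # affinity range spans less than 100-fold (assumed minimum)
        middle = ranked[len(ranked) // 2]
        core.extend([strongest[0], middle[0], weakest[0]])
    return core
```

Each qualifying cluster contributes exactly three complexes, so the size of the resulting set is three times the number of clusters that pass both checks.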

Decoy Generation [2]

Flow from a protein-ligand complex to 100 decoys:

• Generate a random low-energy conformation
• Generate ~2000 conformations using 4 different docking protocols
• Discard poses > 10 Å from the native pose
• Group poses into 10 bins of 1 Å based on their RMSD values
• Further cluster each bin into 10 clusters
• Choose the pose with the lowest energy from each sub-cluster
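The binning and selection steps of the decoy flow can be sketched in a few lines. This is a simplified illustration, not the procedure of [2]: the slide does not say how each bin is clustered, so equal-width RMSD slices are used here as a stand-in for real clustering, and the function name and `(rmsd, energy)` input layout are hypothetical.

```python
def select_decoys(poses, max_rmsd=10.0, n_bins=10, n_sub=10):
    """Reduce generated poses to at most n_bins * n_sub decoys.

    poses: list of (rmsd, energy) tuples, with rmsd in angstroms
    measured from the native pose.
    """
    bin_width = max_rmsd / n_bins
    sub_width = bin_width / n_sub
    best = {}  # (bin index, sub-cluster index) -> lowest-energy pose so far
    for rmsd, energy in poses:
        if rmsd > max_rmsd:
            continue  # discard poses farther than max_rmsd from the native pose
        b = min(int(rmsd // bin_width), n_bins - 1)
        # Stand-in for real clustering: equal-width RMSD slices inside a bin.
        s = min(int((rmsd - b * bin_width) // sub_width), n_sub - 1)
        key = (b, s)
        if key not in best or energy < best[key][1]:
            best[key] = (rmsd, energy)
    return [best[k] for k in sorted(best)]
```

With the default settings, 10 bins times 10 sub-clusters yields at most 100 decoys per complex, which matches the count on the slide.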

Compound Characterization

Extracted features are those calculated for the following scoring functions:

• X-Score (6 features)
• AffiScore (30 features)
• RF-Score (36 features)
• GOLD (14 features)

Training and Test Datasets

• Primary training set (Pr): 1105 (Y = BA); 39,085 (Y = RMSD)

• Core test set (Cr): 16,554

Machine Learning Methods

• Single models
– Multiple linear regression (MLR)
– Multivariate adaptive regression splines (MARS)
– k-Nearest neighbors (kNN)
– Support vector machine (SVM)

• Ensemble models
– Random forests (RF)
– Boosted regression trees (BRT)
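To make one of the listed single models concrete, here is a minimal kNN regression sketch that predicts a pose's RMSD from its feature vector. It is an illustration under assumptions: the feature vectors, the Euclidean distance metric, and the unweighted mean are choices made for this sketch, and the models used in the talk may have been configured differently.

```python
import math


def knn_predict(train_X, train_y, query, k=3):
    """Predict a target (e.g., RMSD) as the mean over the k nearest neighbors.

    train_X: list of feature vectors (sequences of floats)
    train_y: list of target values, one per training vector
    query:   feature vector of the pose to score
    """
    # Pair each training target with its Euclidean distance to the query.
    dists = sorted((math.dist(x, query), y) for x, y in zip(train_X, train_y))
    nearest = dists[:k]
    return sum(y for _, y in nearest) / len(nearest)
```

Poses of a complex can then be ranked by ascending predicted RMSD, so that the pose predicted to be closest to the native conformation comes first.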

Conventional SFs

SF type          Discovery Studio   SYBYL                GOLD             Schrodinger   Standalone   No. of SFs
Empirical        PLP, JAIN, LUDI    ChemScore, F-Score   ChemScore, ASP   GlideScore    X-Score      9
Knowledge-based  LigScore, PMF      PMF-Score            -                -             DrugScore    4
Force-field      -                  D-Score, G-Score     GoldScore        -             -            3
No. of SFs       5                  5                    3                1             2            16

Experiments, Results, and Discussion

SF Construction & Application Workflow

SF Parameter Tuning

Evaluation of Scoring Functions

• Docking power: measures the ability of an SF to distinguish a promising binding mode from a less promising one.

• $S_C^N$ (in %): the success rate, i.e., the percentage of times an SF finds a pose whose RMSD is within a predefined cutoff of C Å when only the N top-ranked poses (ranked by predicted score) are considered.

• C (e.g., 0, 1, 2, and 3 Å); N (e.g., 1, 2, 3, and 5)
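The success-rate definition above translates directly into code. The sketch below is illustrative: the input layout is hypothetical, and it assumes a lower predicted score means a better pose, as is the case when the score is a predicted RMSD. For an SF where higher scores are better, the sort order would need to be reversed.

```python
def success_rate(complexes, cutoff, top_n):
    """Compute the success rate S_C^N in percent.

    complexes: list with one entry per complex; each entry is a list of
    (predicted_score, true_rmsd) pairs, one per pose.
    cutoff:    C, the RMSD cutoff in angstroms
    top_n:     N, the number of top-ranked poses considered
    """
    hits = 0
    for poses in complexes:
        # Keep the N poses with the best (lowest) predicted scores.
        top = sorted(poses, key=lambda p: p[0])[:top_n]
        if any(rmsd <= cutoff for _, rmsd in top):
            hits += 1
    return 100.0 * hits / len(complexes)
```

With C = 0 Å, a hit requires the native pose itself (RMSD of exactly 0) to appear among the top N poses.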

Success rates of Conv. & ML SFs: Cr

[Figure annotations: > 60%; ~ 50%; > 70%; ~ 80%; $S_2^1$ < 5%]

Success rates of Conv. & ML SFs: Core test set

[Figure annotations: GOLD::ASP from 82% to 92%; RF::RG from 87% to 96%; $S_0^5$ ~ 60%; $S_0^5$ ~ 77%]

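Success rates of Conv. & ML SFs: HIV & TRY

[Figure annotations, in the order extracted:]

• MLR: $S_1^1$ = 72%, $S_3^1$ = 90%
• MLR: $S_1^1$ = 80%, $S_3^1$ = 95%
• MLR: $S_0^1$ = 50%, $S_0^5$ = 90%
• MARS: $S_0^1$ = 48%, $S_0^5$ = 83%
• MLR: $S_1^1$ = 41%, $S_3^1$ = 80%
• MLR: $S_1^1$ = 66%, $S_3^1$ = 90%
• MARS: $S_0^1$ = 36%, $S_0^5$ = 80%
• MARS: $S_0^1$ = 23%, $S_0^5$ = 68%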

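Success rates of Conv. & ML SFs: CAR & THR

[Figure annotations, in the order extracted:]

• MLR: $S_1^1$ = 22%, $S_3^1$ = 53%
• SVM: $S_1^1$ = 32%, $S_3^1$ = 62%
• MLR: $S_0^1$ = 40%, $S_0^5$ = 79%
• MARS: $S_0^1$ = 24%, $S_0^5$ = 74%
• MLR: $S_0^1$ = 15%, $S_0^5$ = 34%
• MARS: $S_0^1$ = 9%, $S_0^5$ = 33%
• MLR: $S_1^1$ = 58%, $S_3^1$ = 82%
• MLR: $S_1^1$ = 92%, $S_3^1$ = 95%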

Conclusion

Concluding Remarks

• ML models trained to explicitly predict RMSD values significantly outperform all conventional SFs.

• The estimated RMSD values of such models have a correlation of 0.7 on average with the true RMSD values, whereas predicted BAs have a correlation as low as 0.2 with the measured RMSD values.

• The empirical SF GOLD::ASP achieved a success rate of 70% in identifying a pose that lies within 1 Å of the native pose across 195 different complexes.

• Our top RMSD-based SF, MARS::XARG, has a success rate of ~80% on the same test set.
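For reference, a correlation between predicted and true RMSD values like the one quoted above could be computed as in the sketch below. The slides do not state which correlation coefficient was used; Pearson's is assumed here purely for illustration, and the function name is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of products of deviations from the two means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)
```

Here `xs` would hold the predicted values for a set of poses and `ys` the measured RMSD values of the same poses; the function is undefined when either sequence is constant.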

Thank You!

References

[1] Berman, H. et al.: The Protein Data Bank. Nucleic Acids Research 28 (1) (2000) 235–242.

[2] Cheng, T., Li, X., Li, Y., Liu, Z., Wang, R.: Comparative assessment of scoring functions on a diverse test set. Journal of Chemical Information and Modeling 49 (4) (2009) 1079–1093.
