Centre for Integrative Bioinformatics VU
RF-PPI interface prediction, 20 May 2017
Seeing the Trees through the Forest: Sequence-based Homo- and Heteromeric Protein-protein Interaction sites prediction using Random Forest
Qingzhen Hou, Paul de Geest, Wim Vranken, Jaap Heringa and K. Anton Feenstra
CBSB 2017 – Cincinnati
Protein Function ⇒ understanding interactions, e.g.
• SNP/SNV calling
• New viruses
  • every week helps!
Wilhelm et al. 2014 “... vesicle trafficking proteins.” Science 344:1023-1028 doi: 10.1126/science.1252884.
Levels of Protein Interaction
• Influence (like gene-gene interactions)
  • cascade
  • mutual dependence
• Direct/physical interaction
  • Contact
  • Heteromeric (different proteins)
  • Homomeric (same protein)
  • Interface sites
Types of methods to calculate protein-protein interactions (PPIs)
• Sequence-based, e.g. Mirror tree (e.g. Pazos & Valencia 2001 Prot Eng 14:609)
  • Fast
  • no information on interaction strength
• Protein-protein docking
  • Slow
  • no quantitative interaction strength
• Molecular dynamics simulations
  • Even slower
  • Quantitative calculation of interaction strength
[Side labels on the slide: Knowledge-based; First Principles]
Obtaining PPI interface from sequence data:
Juan, Pazos & Valencia Nat Rev Genet 2013
DCA (Direct Coupling Analysis):
• Huge multiple testing problem (all vs. all residues)
• Predicts for protein family (many sequences needed)
• Our approach is different: directly from one sequence
• May use predicted interface as filter for DCA method (future work)
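To give a feel for the "all vs. all residues" problem, the sketch below counts how many residue pairs an exhaustive scan has to score, and what a generic Bonferroni correction would do to the per-pair significance level. The protein lengths and the 0.05 level are illustrative assumptions, and Bonferroni is used here only as a familiar stand-in for multiple-testing control; it is not taken from the DCA literature or from this talk.

```python
def inter_protein_pairs(len_a: int, len_b: int) -> int:
    """Residue pairs (i in protein A, j in protein B) in an all-vs-all scan."""
    return len_a * len_b


def intra_protein_pairs(length: int) -> int:
    """Unordered residue pairs within a single protein."""
    return length * (length - 1) // 2


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the family-wise level at alpha."""
    return alpha / n_tests


# Hypothetical example: two proteins of 300 and 250 residues.
n_pairs = inter_protein_pairs(300, 250)
print(n_pairs, bonferroni_threshold(0.05, n_pairs))
```

If a predicted interface were used as a filter (the future work mentioned above), only pairs between predicted interface residues would need to be scored, which shrinks the number of tests accordingly.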
Ingredients for a good classifier:
• Distinguish ‘exposed’ from ‘binding’ surface
  • ‘Buried’ is easy
• Features:
  • Conservation
  • Solvent Accessibility – NetSurfP; Petersen et al. BMC Struct Biol 2009
  • Secondary Structure – NetSurfP; Petersen et al. BMC Struct Biol 2009
  • Backbone dynamics – DynaMine; Cilia et al. Nat Commun 2013 & N.A.R. 2014
  • Protein length
• Dataset(s)
  • Homodimers –
    • from dimers in the PDB (Hou et al. BMC Bioinf 2015)
    • large set (1593) of high confidence
  • Heteromers –
    • Murakami & Mizuguchi, Bioinf 2010
    • high confidence, but smaller set (258)
Encoding evolutionary information
• Homologs: PSI-BLAST (lenient: e < 0.001)
• Alignment: Muscle (because it is fast)
• Each feature is calculated for each sequence
  • Feature value for Query sequence
  • Also the average (‘typical’) and std.dev (variability) of features over the homologs in the alignment
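The per-position summary over homologs can be sketched as follows. This is a minimal illustration, assuming each sequence's feature values have already been mapped onto alignment columns, with `None` marking a gap; the function name, the gap handling, and the use of the population standard deviation are assumptions, not details from the talk.

```python
from statistics import mean, pstdev


def column_stats(query_row, homolog_rows):
    """For each non-gap query position, return (query value, mean, std.dev)
    of one per-residue feature over the aligned homologs."""
    stats = []
    for col, q in enumerate(query_row):
        if q is None:  # gap in the query: there is no residue to describe
            continue
        values = [row[col] for row in homolog_rows if row[col] is not None]
        if values:
            stats.append((q, mean(values), pstdev(values)))
        else:
            # No homolog covers this column; falling back to the query value
            # with zero variability is an assumed convention.
            stats.append((q, q, 0.0))
    return stats


# Hypothetical feature values for a query and two homologs over three columns.
query = [0.2, None, 0.8]
homologs = [[0.4, 0.1, None], [0.6, None, 1.0]]
print(column_stats(query, homologs))
```

Each query residue then carries three numbers per feature: its own value, the ‘typical’ value, and the variability.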
Which features work?
Feature              AUC ROC
Entropy (En)         0.480 ± 0.009
Dynamics (DM)        0.506 ± 0.006
En+len+win           0.536
DM+len+win           0.578 ± 0.008
En+DM+len            0.558*
En+DM+len+win        0.616 ± 0.015
Solvent Acc. (ASA)   0.587
En+DM+l+ASA          0.636*
En+DM+l+ASA+SS       0.666 ± 0.008
What next?
Hou, et al. Bioinformatics 2017 https://doi.org/10.1093/bioinformatics/btx005
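For reading the AUC ROC column: an AUC of 0.5 corresponds to random ranking, so single features near 0.5 carry little signal on their own. The sketch below computes AUC through its rank interpretation (the probability that a randomly chosen interface residue scores higher than a randomly chosen non-interface residue). It is a generic stdlib illustration, not the evaluation code behind the table, and the scores in the example are made up.

```python
def auc_roc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and
    negatives; ties count one half. Labels are 1 (interface) or 0 (other)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical per-residue scores and true labels.
print(auc_roc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))
```

The pairwise loop is quadratic in the number of residues; for full datasets a sort-based rank computation would be the usual choice.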
How to improve further
Features       Training                 Test         AUC ROC
All            HM_479 train             HM_479 val   0.666 ± 0.008
All + window   HM_479 train             HM_479 val   0.710 ± 0.011
All + window   HM_479 train (balanced)  HM_479 val   0.728 ± 0.008
All + window   HM_479 train (balanced)  HM_479 test  0.720 ± 0.007
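The slides do not say how the "balanced" training set was built. One common way is random undersampling of the majority (non-interface) class, sketched below; treat the method, the function name, and the fixed seed as assumptions made for illustration.

```python
import random


def undersample(examples, labels, seed=0):
    """Randomly drop majority-class examples until both classes have the
    same size. Returns the kept examples and labels in original order."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [examples[i] for i in kept], [labels[i] for i in kept]


# Hypothetical toy data: 2 interface residues among 10.
X, y = undersample(list(range(10)), [1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
print(X, y)
```

Only the training data would be balanced this way; validation and test sets keep their natural class ratio so that the reported metrics stay comparable.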
Feat/Train/Test   Accuracy  Sensitivity  Precision  Specificity  F1
All/train/val     0.790     0.025        0.487      0.992        0.047
All+W/train/val   0.795     0.016        0.896      0.999        0.032
All+W/bal/val     0.688     0.614        0.355      0.707        0.450
All+W/bal/test    0.695     0.581        0.373      0.722        0.454
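The columns of this table follow the standard confusion-matrix definitions, sketched below. The code is a generic illustration (the counts in the example are made up); it also shows why accuracy alone is misleading here: a predictor that calls almost nothing "interface" keeps high accuracy and specificity while sensitivity and F1 collapse.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Standard metrics from true/false positive/negative counts."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # also called recall
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "precision": precision,
        "specificity": specificity,
        "f1": f1_score(precision, sensitivity),
    }


# Hypothetical counts, not taken from the slides.
print(binary_metrics(tp=6, fp=4, tn=86, fn=4))
```

As a consistency check, feeding the tabulated precision and sensitivity of the All+W/bal/val row (0.355 and 0.614) into `f1_score` gives about 0.450, in line with the F1 column.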
Hou, et al. Bioinformatics 2017 https://doi.org/10.1093/bioinformatics/btx005
How good is this?
But: this is the homodimer test set; the other methods are trained on heteromeric interactions
We let them play ‘our’ game… (of course we are better at it)
[Figure, panel A: ROC curves (true positive rate vs. false positive rate); AUC: RF_homo (0.720), SPPIDER (0.601), PSIVER (0.546)]
[Figure, panel B: precision vs. recall curves; legend: RF_homo (0.436), SPPIDER (0.314), PSIVER (0.255)]
* default threshold
Annotation on the slide: the only two really sequence-only methods (!)
Hou, et al. Bioinformatics 2017 https://doi.org/10.1093/bioinformatics/btx005
What if we play ‘their’ game – heteromers!
[Figure, panel A: ROC curves (true positive rate vs. false positive rate); legend: RF_homo (0.619), PSIVER (0.613), RF_hetero (0.652)]
[Figure, panel B: precision vs. recall curves; legend: RF_homo (0.137), PSIVER (0.128), RF_hetero (0.162)]
* default threshold
(so, we can play their game as well)
But, since we’re now playing one game or the other, can we play both?
Hou, et al. Bioinformatics 2017 https://doi.org/10.1093/bioinformatics/btx005
Predicting both Homo and Hetero interactions
Predictor    Training               Test         Recall  Precision  Specificity  MCC    F1     AUC ROC
RF_homo      HM_479 train           HM_479 test  0.581   0.373      0.722        0.265  0.454  0.720
RF_hetero    Dset_119               HM_479 test  0.343   0.263      0.727        0.064  0.297  0.552
RF_combined  HM_479 train+Dset_119  HM_479 test  0.581   0.383      0.734        0.277  0.462  0.724
PSIVER       Dset_186               HM_479 test  0.315   0.262      0.743        0.054  0.286  0.546
SPPIDER      homo+hetero            HM_479 test  0.073   0.361      0.958        0.062  0.121  0.601

RF_homo      HM_479 train           Dset_48      0.446   0.140      0.716        0.103  0.213  0.619
RF_hetero    Dset_119               Dset_48      0.547   0.146      0.667        0.131  0.230  0.652
RF_combined  HM_479 train+Dset_119  Dset_48      0.500   0.146      0.695        0.122  0.226  0.636
PSIVER       Dset_186               Dset_48      0.668   0.119      0.493        0.094  0.203  0.614
Hou, et al. Bioinformatics 2017 https://doi.org/10.1093/bioinformatics/btx005
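The MCC column is the Matthews correlation coefficient, which uses all four cells of the confusion matrix and is therefore less flattering than accuracy when interface residues are the minority class. A minimal sketch of the standard formula follows; the counts in the example are hypothetical, and returning 0.0 for a degenerate matrix is a common convention rather than something stated in the talk.

```python
from math import sqrt


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient in [-1, 1]; 0 means no better than
    chance. Returns 0.0 when any marginal sum is zero (assumed convention)."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for an imbalanced test set.
print(mcc(tp=60, fp=90, tn=810, fn=40))
```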
Complementary value
[Venn diagram on hetero Dset_48: RF_hetero predictions, PSIVER predictions, and true IF residues (PDB); region counts as extracted: 3684, 1755, 1922, 420, 361, 224, 187]
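The overlap between two predictors and the true interface can be tallied with plain set operations, as sketched below. Residue identifiers, region names, and the example sets are hypothetical; the sketch only shows how the seven regions of such a Venn diagram are defined, not how the counts on the slide were produced.

```python
def venn_regions(pred_a, pred_b, true_if):
    """Count residues in each region spanned by two prediction sets and the
    set of true interface residues."""
    a, b, t = set(pred_a), set(pred_b), set(true_if)
    return {
        "a_only_correct": len((a - b) & t),
        "b_only_correct": len((b - a) & t),
        "both_correct": len(a & b & t),
        "missed_by_both": len(t - a - b),
        "a_only_wrong": len(a - b - t),
        "b_only_wrong": len(b - a - t),
        "both_wrong": len((a & b) - t),
    }


# Hypothetical residue indices.
print(venn_regions({1, 2, 3, 4}, {3, 4, 5, 6}, {2, 3, 7}))
```

Residues that only one predictor gets right ("a_only_correct", "b_only_correct") are what make two methods complementary.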
Why does it work? – Feature importance
[Figure: feature importance; labelled feature groups: length, dynamics (2x), solvent accessibility, dynamics, conservation, secondary structure (α-helix – β-sheet – coil)]
Main points
• Prediction of protein interface from sequence
  • Including ‘evolutionary neighborhood’ (PSI-Blast)
  • But to predict SS and ASA, we need to get profiles using Blast again – up to 500 times – this can be slow…
  • We have a webserver: www.ibi.vu.nl/programs/serendipwww/ (but please be gentle – and a little bit patient ;-)
• Prediction performance stable, for homodimer as well as heteromeric interactions
  • As far as we know, this hasn’t been done before
• Better than other (sequence only) predictors.
  • Of course, if you can get structure information, you’d be silly not to use that – there are many methods to use in that case
[Figure: precision vs. recall curves; legend: RF_homo (0.137), PSIVER (0.128), RF_hetero (0.162)]
Random Forest training scheme and (external) predicted features
Homomeric: homodimer dataset ¹ (1593) → CD-HIT <30% ID → homodimer dataset (610) → BLASTClust <25% ID → HM_479 (479)
  • split: 60% training, 20% validate (5-fold), 20% test
Heteromeric test: Dset_72 ² (72) → BLASTClust <25% ID → Dset_48 (48)
Heteromeric training: Dset_186 ² (186) → BLASTClust <25% ID → Dset_119 (119)
External predictors:
  • DynaMine training set ³ ⇒ predict backbone dynamics
  • NetSurfP training set ⁴ ⇒ predict secondary structure & solvent accessibility
Residue counts, in slide order: 27,763 IF / 101,917 non-IF; 3,641 IF / 20,687 non-IF; 1,313 IF / 12,743 non-IF
Hou, et al. Bioinformatics 2017 https://doi.org/10.1093/bioinformatics/btx005
¹ Hou et al. BMC Bioinf. 16:325 2015
² Murakami & Mizuguchi. Bioinformatics 26:1841 2010
³ Cilia et al. Nat. Commun. 4:2741 2013 & Cilia et al. Nucleic Acids Res. 42:W264 2014
⁴ Petersen et al. BMC Struct. Biol. 9:51 2009
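A 60/20/20 split like the one in this scheme can be sketched as below. The assumptions are loud ones: the split is done per protein (so residues of one protein never straddle training and test), the proportions are rounded, and the seed and function name are arbitrary. How the 5-fold validation was organised is not recoverable from the slide and is not reproduced here.

```python
import random


def split_proteins(protein_ids, seed=0):
    """Shuffle protein identifiers and split them roughly 60/20/20 into
    training, validation and test lists."""
    ids = sorted(protein_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(0.6 * len(ids))
    n_val = round(0.2 * len(ids))
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test


# Hypothetical identifiers standing in for the 479 homodimer chains.
train, val, test = split_proteins([f"prot_{i}" for i in range(479)])
print(len(train), len(val), len(test))
```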
What works well?
[Figure: results for homo HM_479 testing and hetero Dset_48; all sites vs. ‘true’ IF sites; predictors RF_combined, RF_hetero, RF_homo]
What works well?
• Dset42
  • 32 PDB structures
  • Only three cases where no correct IF position is predicted (once for ‘homo’, and twice for ‘hetero’ predictor)
  • 17 with predictions for both chains
• HM_479 testing (20%)
  • 95 PDB structures
  • Only one case where no correct IF position is predicted (for ‘combined’ predictor)
Are we biased to (a few) large proteins?
Are we biased to (a few) large protein families?
Which features to use ⇒ Conservation?
NO!
[Figure legend: Interface, Surface]
Qingzhen Hou, et al., PLoS ONE (2016)
Comparing homodimers and homologous monomers (interacting vs. non-interacting) ⇒ Specificity?
YES! And better for longer alignments
[Figure legend: Interface, Surface]
Qingzhen Hou, et al., PLoS ONE (2016)